A listener wrote in recently to tell us about an interesting data center troubleshooting problem. We thought it would be cool to share his story because people could learn something from it.
Ethan Banks talks with Joel Spencer, a network engineer who works for a large telco/ISP in Australia. Joel has more than 30 years of experience in IT.
The Mystery Of The Missing Bytes
Ethan and Joel kick off the conversation with a review of Joel’s background, and then walk through the data center infrastructure relevant to this conversation.
Then Joel discusses how he arrived at work one morning to complaints of intermittent disconnects between applications. Packet captures showed mysterious missing bytes in packets.
From there, the podcast drills into the troubleshooting steps that Joel and his team took to identify and remediate the problem.
I won’t spoil the ending, but the problem included both physical and virtual components in the data center network. And the response required some clever sleuthing.
Thanks to Joel for sharing his story! You can follow him on Twitter at @matingara.
1. Some facts about Joel:
- Started in networks in 1984
- First projects involved ISO OSI stacks
- First major network project was the National EFTPOS network in Australia
- This involved the design of network hardware and software for ATMs and stores across Australia
- This project transferred me back to Boston (my home town) for 7 years in the late 80s and most of the 90s
- I have received US Patents for my work on this system
- After the project finished I transferred into Open Systems engineering in the US
- And, of course, started in TCP/IP
- Moved back to Australia in 1995
- Worked on initial Broadband cable infrastructure etc
- In 2010, after dabbling in managed services, decided to move back to my Number One Love: Networking
- Fortunately for me, rolling forward – THIS IS THE PLACE TO BE
- For the last four years I have been building new data center infrastructure using Cisco (Physical and virtual), Juniper (Physical and virtual), F5 (Physical and virtual)
- We have written a fully automated engine to allow users to spin up VMs, load balancers etc via a portal
- I wrote the Python code to automatically allocate/release IP addresses. This is my code and uses the SolarWinds IPAM database
2. The physical infrastructure
- Juniper SRX 5800 (Zone-based firewalls)
- Cisco Nexus 7010
- Cisco UCS
- Cisco 6000
- Cisco FEX
- Juniper MAG
- Juniper STRM
- F5 Viprion
- The connections to the rest of the world
3. The virtual infrastructure
- Cisco Nexus 1000v
- Juniper vGW (now Firefly – previously Altor)
- VMware ESXi
- vCD etc
- F5 vCMP
- Zone based Firewalling
4. High-level packet walks
- How does traffic flow?
- It depends on where the source and target VMs are located
- If the two VMs are in the same security zone AND on the same UCS blade that is one flow
- If the two VMs are in the same security zone but on different UCS blades, that is a different flow
- If the two VMs are in different zones then, regardless of where the VMs are, that is a different flow
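The three cases above can be sketched as a small decision function. This is only an illustration, not Joel's tooling: the `VM` class, the `classify_flow` function, and the returned labels are hypothetical names, and it assumes the second case is about VMs in the same security zone on different UCS blades.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical record for a VM; field names are illustrative only."""
    name: str
    zone: str   # security zone the VM belongs to
    blade: str  # UCS blade hosting the VM


def classify_flow(src: VM, dst: VM) -> str:
    """Return which of the three packet walks applies to src -> dst."""
    # Different zones: blade placement does not matter.
    if src.zone != dst.zone:
        return "different-zone"
    # Same zone: the walk depends on whether the VMs share a blade.
    if src.blade == dst.blade:
        return "same-zone-same-blade"
    return "same-zone-different-blade"


web = VM("web01", zone="dmz", blade="blade-1")
app = VM("app01", zone="dmz", blade="blade-2")
db = VM("db01", zone="internal", blade="blade-1")

print(classify_flow(web, app))
print(classify_flow(web, db))
```

Checking the zone first mirrors the last bullet: once the zones differ, where the VMs sit no longer changes which flow applies.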
5. Extract from the email that alerted me to the problem (Names changed to protect the innocent)
FW: Application BOB: Disconnection issue : War room sessions (Case ID 147220 ; ID:111111)
An application hosted on “the Network” is having intermittent issues. I’m seeking assistance from your team to decipher packet captures (details in attached email).
Is anybody available from your team on Monday?
Joel (cc’d) will be your key contact.
That has to be the email that no network guy EVER wants to see.
6. What is the first step?
- Wireshark/pcap traces from the END systems
- Wait for the problem to occur
- What can we see?
- Is there anything interesting? Yes there is.
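As a rough illustration of comparing traces from both end systems, the sketch below takes the payload bytes as sent and as received and reports any runs that are present on the sending side but absent on the receiving side. It is not the method used in the episode: `find_missing_bytes` is a hypothetical helper, and it assumes the payloads have already been exported from the captures as raw bytes.

```python
from difflib import SequenceMatcher


def find_missing_bytes(sent: bytes, received: bytes):
    """Return (offset, missing_bytes) for runs in `sent` absent from `received`.

    Offsets are positions in the sender-side payload.
    """
    # autojunk=False so frequently repeated byte values are not ignored.
    matcher = SequenceMatcher(None, sent, received, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" marks a run that exists only on the sender side.
        if tag == "delete":
            missing.append((i1, sent[i1:i2]))
    return missing


# Made-up payloads for illustration: two bytes vanish in transit.
sent_payload = b"ABCDEFGHIJ"
received_payload = b"ABCDGHIJ"
print(find_missing_bytes(sent_payload, received_payload))
```

This only flags bytes that were dropped; bytes that were altered rather than removed show up as `replace` opcodes, which this sketch deliberately ignores.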
7. What was the interesting thing?
- Where did it occur?
- Where in the packet did this interesting thing manifest itself?
- Did it point to a problem in the physical or virtual network?
- Was it a known problem?
- Why were we using a problematic component?
- How to fix it (I hope) forever