Chris Wahl and Ethan Banks continue their silo-busting mission on Datanauts. They take up their roles as Server Dude and Network Guy to look at how performance problems get resolved in the virtualization and network realms.
They also share tips and ideas on how to make performance troubleshooting more about collaboration and helping the business and less about pointing fingers.
Section 1 – Virtualization/Server
Poor virtualization performance tends to fall into a few buckets:
- Sizing of the VM is wonky, too big for the host or too small for the application
- Ran out of memory on the host
- The host is suffering an issue – documented KB (purple screen of death, driver is flaking out), hardware is dying, resources are overloaded, NUMA, noisy neighbor
- The storage is suffering an issue – dropping packets, too busy, the network is congested, latency to the array
- The network is borked – using the wrong device driver (VLANCE or e1000 instead of VMXNET3), routing over the wrong VM interface
Complex problems can usually be distilled into smaller segments with simpler decision points. By working through a series of spheres and eliminating any spheres that do not overlap, you can quickly get down to the root of an issue and home in on the real problem.
- Spheres of Datastores: Do all of the NFS datastores disappear or just specific datastores? If it’s all the datastores, you’ve eliminated any spheres related to the datastore and export configuration.
- Spheres of Physical Ports: Does it matter if you move the physical network connection to a different port? If not, you’ve eliminated the possibility of a bad cable (although it’s rare anyway).
- Spheres of Configuration: Are there any differences in the physical and logical configuration on the switch port? If not, you may want to keep looking at the ESXi configuration.
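One way to picture the sphere elimination above is as set arithmetic over test results. The following is a minimal sketch, not a tool from the show: the `Observation` type, the `suspect_spheres` function, and the host, port, and datastore names are all hypothetical. It also assumes a single underlying fault, so anything touched by a passing test is treated as cleared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One test: the spheres (scopes) it touched, and whether it failed."""
    spheres: frozenset
    failed: bool


def suspect_spheres(observations):
    """Spheres common to every failing test and absent from every passing one.

    Assumes a single fault; hypothetical helper for illustration only.
    """
    failing = [set(o.spheres) for o in observations if o.failed]
    if not failing:
        return set()
    # Keep only the spheres where all failing tests overlap.
    suspects = set.intersection(*failing)
    # Anything a passing test touched is eliminated.
    for o in observations:
        if not o.failed:
            suspects -= o.spheres
    return suspects


# Hypothetical test results: both failures involve host-a on switch-port-1.
observations = [
    Observation(frozenset({"host-a", "switch-port-1", "datastore-1"}), True),
    Observation(frozenset({"host-a", "switch-port-1", "datastore-2"}), True),
    Observation(frozenset({"host-b", "switch-port-2", "datastore-1"}), False),
]
print(sorted(suspect_spheres(observations)))  # ['host-a', 'switch-port-1']
```

Here both datastores drop out because neither appears in every failure, which mirrors the "all datastores or just specific ones?" question.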
Section 2 – Networking
How Networking handles poor performance reports:
Network performance trouble tickets often start out as “everything is slow.” Networks tend to have many access points and junction points, so tickets like this lack the granularity required to have a starting point for investigation.
- “Everything” means “something,” and the initial job is to figure out the specifics of what, exactly, is actually slow, and under what circumstances.
- Specific applications or a single application?
- Or only exhibited when using a specific client?
- Slow sometimes or all the time? Every day, or certain days? What time of day?
- Slow from everywhere, or only specific locations?
- Find the point of commonality.
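Finding the point of commonality can be sketched as counting which answers to the questions above repeat across reports. This is only an illustration: the `common_attributes` function, the field names, and the sample reports are assumptions, not data or tooling from the show.

```python
from collections import Counter


def common_attributes(reports, threshold=1.0):
    """Return the (field, value) pairs shared by at least `threshold`
    (a fraction between 0 and 1) of the reports.

    Hypothetical helper; each report is a dict of hashable values.
    """
    if not reports:
        return set()
    counts = Counter()
    for report in reports:
        counts.update(report.items())  # count each (field, value) pair
    needed = threshold * len(reports)
    return {pair for pair, n in counts.items() if n >= needed}


# Hypothetical "everything is slow" tickets after asking the questions above.
reports = [
    {"app": "crm", "site": "branch-2", "client": "browser", "time": "morning"},
    {"app": "email", "site": "branch-2", "client": "desktop", "time": "morning"},
    {"app": "crm", "site": "branch-2", "client": "desktop", "time": "afternoon"},
]
print(common_attributes(reports))  # {('site', 'branch-2')}
```

In this made-up data, the only thing every report shares is the site, which gives the investigation a starting point. Lowering `threshold` surfaces weaker patterns that hold for most reports but not all of them.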
Once the network engineer has identified a specific symptom, which may or may not have anything to do with the network, all appropriate parties can be engaged.
- Application managers
- Other infrastructure silo experts (DBA, storage, virtualization, sysadmin)
- Business users / stakeholders
- Now let’s work on problem resolution together…
Bad, Naughty Network Engineer Responses
- Confrontational (Have you checked your crap?)
- Defensive (There’s no way it’s my crap.)
- Terse (It’s not the network, because I said it’s not.)
- Blame-shifting (The last time this happened, it was the blah-blah system. Or…the vendor told me this would work. It’s not my fault!)
Section 3 – Takeaways and Futures
Unified monitoring architecture – everyone gets to see everything
- Monitor application delivery, not siloed infrastructure components
- Service Level Agreements
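As a rough sketch of what measuring application delivery against a Service Level Agreement might look like, the snippet below checks what fraction of requests were answered within a target time. The function names, the 300 ms target, and the sample response times are all assumptions made up for illustration.

```python
def sla_compliance(response_times_ms, target_ms):
    """Fraction of requests answered within the target response time."""
    if not response_times_ms:
        raise ValueError("no samples to evaluate")
    within = sum(1 for t in response_times_ms if t <= target_ms)
    return within / len(response_times_ms)


def meets_sla(response_times_ms, target_ms, required_fraction):
    """True if enough requests met the target to satisfy the agreement."""
    return sla_compliance(response_times_ms, target_ms) >= required_fraction


# Hypothetical end-to-end response times for one application, in milliseconds.
samples = [120, 180, 250, 90, 900, 140, 160, 200, 130, 110]
print(sla_compliance(samples, target_ms=300))                    # 0.9
print(meets_sla(samples, target_ms=300, required_fraction=0.95)) # False
```

The point of measuring this way is that the number describes what the user experienced, regardless of which silo's component caused the slow request.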
Learn application flows & dependencies
- When a request comes into a multi-tiered application, what really happens?
- EVERYONE in IT infrastructure should know the flow of business-critical applications, at least at a high level
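A high-level application flow can be written down as a simple dependency map and walked to see everything a request touches. The tiers and the `downstream` function below are hypothetical, a sketch of the idea and not a description of any particular application.

```python
def downstream(dependencies, start):
    """Depth-first walk: every component reachable from `start`,
    in first-visit order. Hypothetical helper for illustration."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        # Reverse so that dependencies are visited in the order listed.
        stack.extend(reversed(dependencies.get(node, [])))
    return seen


# Hypothetical multi-tiered application: each tier lists what it calls.
deps = {
    "load-balancer": ["web"],
    "web": ["app"],
    "app": ["database", "auth"],
    "database": ["storage-array"],
}
print(downstream(deps, "load-balancer"))
# ['load-balancer', 'web', 'app', 'database', 'storage-array', 'auth']
```

Even a map this small shows that a "slow web page" ticket could involve the network, virtualization, database, storage, and authentication teams.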
Bust your silos
- This starts with culture.
- This doesn’t have to change your org structure, though that might be a good idea
- Defensive, territorial managers will struggle the most with this (and add the LEAST business value because they impede IT progress for the business, as opposed to leading)