I spent much of the first two weeks of my new job troubleshooting bad site-to-site VPN performance. Which, as it turned out, wasn’t the real problem. I didn’t know for sure what the problem was after the firewall cutover I did Saturday night failed to help. At the time, it seemed like the cutover should have resolved the issue. The firewall in question was very old, running code almost as old, and had a chronically hot CPU. No-brainer, right? The theory was that the company was pushing the box beyond its limits, so encryption performance, a CPU-intensive task, was taking a hit. But then, after putting in a brand-new firewall with a zillion times the horsepower and the latest and greatest OS: no change whatsoever. So Monday morning, I got into the office and started doing a step-by-step teardown of the packet loss. I needed to determine whether it was the ISP or somewhere else in the network, making no assumptions other than that everyone was guilty until proven innocent. Step 1. Set up a ping from my desktop to a box across a tunnel. Step 2. Set up tcpdump on the transit firewall.
And then step 3. Watch about 30% of my ICMP echo requests not even make it to the transit firewall interface facing me, with exactly one network switch between me and that firewall. In other words, pretty much stone-cold proof that the switch was dropping my traffic. Now, it was still a bit premature to definitively point the finger at that switch without some more testing. As if to oblige me, someone unexpectedly came by my desk and mentioned tests they were running: they could ping the IP address of the switch in question just fine, but going beyond the switch turned their ICMP responses to junk. Thank you for the corroboration. A couple more tests of my own, and I was ready to get my bat out and put that switch out of my misery.
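For the record, the test itself was nothing fancy. A rough sketch of the sort of thing I ran, with the interface name and addresses invented for illustration (the real ones don’t matter here):

    # From my desktop (a *nix box; on Windows it would be ping -n):
    # a steady stream of echo requests to a host across the tunnel
    ping -c 500 10.20.30.40

    # On the transit firewall: capture the echo requests that actually
    # arrive on the inside interface, then compare against what was sent
    tcpdump -ni eth1 'icmp[icmptype] = icmp-echo and host 10.20.30.40'

If the firewall sees fewer echo requests than the desktop sent, the loss is happening before the firewall. That’s exactly what I saw: roughly 30% of the requests never showed up in the capture at all.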
Pondering the issue a bit further, it occurred to me that this might be a layer 3-only problem that didn’t affect layer 2…if I was lucky. Using HSRP, I swung the default gateway of my local VLAN over to a different switch. Like magic, my packet loss was gone. Ooooooo. I got a little excited. I started moving all the VLANs over with HSRP and began getting reports from all over the place that life was good again. Score.
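If you haven’t played this game before, the swing is just HSRP priority manipulation. A minimal sketch from the healthy switch, assuming VLAN 10, HSRP group 10, and a virtual gateway of 10.1.10.1 (all numbers invented for illustration):

    interface Vlan10
     standby 10 ip 10.1.10.1
     standby 10 priority 150
     standby 10 preempt

With preempt enabled and a priority higher than the sick switch’s (the HSRP default is 100), the healthy switch takes over as the active gateway, and layer 3 traffic for that VLAN stops transiting the failing box entirely.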
I’m happy, my new boss is happy, his boss is happy, scads of users are happy. Great. So what exactly is the issue, the smoking gun? In the log of the switch in question was a series of messages similar to the following:
Mmm. Yeah. Cisco says that means the hardware is tanking. That message pretty well points to what we were experiencing: layer 3 packet loss. Sounds to me like the sup engine hardware is failing to the point that it can’t do reliable lookups in the layer 3 forwarding engine anymore. I’ll prove that it’s really the supervisor engine by failing over to a standby sup during the next maintenance window, moving some routing back, and seeing where we end up. It’s a Cat4500 series, where all the forwarding is done centrally on the supervisor, so I can’t imagine what it could be other than the sup.
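The failover test itself should be simple enough. A sketch of what I have in mind from the console, assuming the chassis is running dual sups in SSO or RPR redundancy (exact behavior varies by sup model and IOS version):

    show redundancy
    redundancy force-switchover

The first command confirms the standby sup is up and ready to take over; the second forces it to become active. If the packet loss stays gone with routing moved back onto the box and the standby sup in charge, the original supervisor gets the blame.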
For me, the lesson of the day was to never assume that what people are telling you is the whole picture. I am brand new to this network, so I was relying more on what other people told me about the problem than on my own troubleshooting. When I first got the report of poor site-to-site VPN performance, I should have run tests to prove whether or not the packet loss was limited to the VPN, and not taken anyone’s word for it. Fatal mistake in this case; I could have ferreted out the problem a bit sooner otherwise. That’s the way it goes sometimes.
At least now I can move on to a couple of other issues. (1) Getting RADIUS authentication working on the new firewalls. (2) Upgrading some MPLS circuits over in the UK…which sounds simple enough, except that I don’t have any documentation on what the routers are, where they are, or how to connect to them. I’ll be poking at routing tables tomorrow to sort that out.