Example: Why did a switch in VSS mode crash?
A few weeks ago there was a great podcast about logging (show 192). Recently I came across a great example of just how important logging is. If there is only one thing I could pick up from that podcast, it would be the following statement: "log as much as you can, because you never know what will turn out to be important!"
Let me start from the beginning. The case landed in the queue, waiting for someone to take ownership. I was the lucky one to pick it up and look into the customer's concern. The problem was simple: "The switch rebooted twice yesterday. We need to know why." I would like to know too, I thought. No logs were attached yet, so the initial communication started with a request for more data. A "show tech-support" would be enough for a start; if I was lucky, a crashinfo file would also have been generated and I would be able to provide a valid answer, full of technical details.

A few minutes later I was opening the "show tech" and looking for buzzwords that could help me find the answer. Oh! It is not a single switch. It is a VSS: two Catalyst 6500s connected with a VSL link. They act logically as one, but the reload happened on only one peer, so "half of the switch was not operational ;)". Good, that means the customer did not lose internet access. But what else is in the logs? Nothing. Let's search for another string: "crash". The answer was quick and challenging: there is no crashinfo. Hmm, "reload"? Here it is! "Last reload reason: dual active detection". Got you!

Dual-active is when both switches in a VSS setup think they are the primary active one. This scenario brings the network down to its knees just like a spanning-tree loop. We can end up with two active switches when the VSL link (the link responsible for carrying control-plane messages between the switches) goes down and no mechanism is configured to detect the dual-active scenario. As per best practice, the customer had dual-active detection configured. By design (and per the documentation), when dual-active is detected "an active chassis which sees a dual-active condition shuts down all of its non-VSL interfaces", and "when the 'shut down' chassis detects that VSL is operational again, the chassis reloads and returns to service as the standby chassis." Finally I had something for the customer!
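For reference, a dual-active detection setup on a Catalyst 6500 VSS can look roughly like the sketch below. This is not the customer's actual configuration; the virtual domain number and the interface are made up for illustration, and fast-hello is just one of the available detection methods (enhanced PAgP being another):

switch virtual domain 100
 dual-active detection fast-hello
!
interface GigabitEthernet1/4/48
 description dedicated dual-active fast-hello link to the peer chassis
 dual-active fast-hello

The idea is that the fast-hello link is a separate, directly connected link between the two chassis, so even if the entire VSL goes down, the peers can still tell each other "I am active" and one of them can back off.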
So what do we know?
- One peer experienced a reload
- The reload was expected, as it was triggered by dual-active detection (the customer's own configuration)
- Hell knows why there was a dual-active in the first place. Nothing in the logs. No historical info. No clue.
The reply blasted into my mailbox at the speed of light: "We understand that dual-active detection occurred and that when the VSL came back the 'shut down' chassis rebooted, but why did it occur? Can we do anything to prevent it from happening in the future?" A very valid question. How the hell would I find the answer? Wait! There was something... a feature on the Cat 6500... sea... SEA... I can see the horizon. SEA logs!

SEA stands for System Event Archive. The feature was introduced in software version 12.2(33)SXH and stores system events and messages in nonvolatile memory (bootdisk). The cool thing about it is that it keeps the messages in two files (sea_log.dat and sea_console.dat) of 32 MB each. What do both files contain and how does it work? SEA enables each of the CPUs on the switch to report events to the management processor over an out-of-band interface, and the messages are recorded with detailed time stamps. The two files are:
• sea_log.dat: stores the most recent system events.
• sea_console.dat: stores the most recent console messages.
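Since both files live on the supervisor's bootdisk, besides viewing them on-box you can copy them off for offline analysis with the usual IOS file-system commands. A rough sketch of such a session (the TFTP server address and target filenames are made up for illustration):

dir bootdisk:
copy bootdisk:sea_log.dat tftp://192.0.2.10/sw1_sea_log.dat
copy bootdisk:sea_console.dat tftp://192.0.2.10/sw1_sea_console.dat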
OK, now that we know we can gather more info, let's use the show command "show logging system". Having gathered sea_console.dat from both VSS chassis, I had 64 MB of logs to follow. I found the date and started looking for anything meaningful, only to realize half an hour later that the day and month were right but the year I was looking for was a hundred pages further down (in my defence: the year is not part of the timestamps, and when you are dealing with 32 MB of logs in a single file it is easy to pick the wrong one ;)). Once I found the correct timeframe, my luck started to turn. The VSS was configured in such a way that both VSL links ran from supervisor to supervisor, so interfaces 5/4 and 5/5 on both chassis were the VSLs. And all the time I kept in mind the ultimate question: WHY? Why did dual-active occur?
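A side note: rather than paging through the whole archive, the standard IOS output modifiers work on this show command too. Something like the lines below would have jumped straight to the relevant entries (the search strings are simply taken from the messages that follow):

show logging system | include VSL
show logging system | include DUAL_ACTIVE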
%VSLP-SW1_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te1/5/4: Link down
%VSLP-SW1_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te1/5/5: Link down
%VSLP-SW1_SPSTBY-2-VSL_DOWN: All VSL links went down while switch is in Standby role
%DUAL_ACTIVE-SW1_SPSTBY-1-VSL_DOWN: VSL is down – switchover, or possible dual-active situation has occurred
Now we have something interesting: both VSL links went down. But WHY is still a valid question. Why did they go down? Let's look at the peer side, because almost always when you see terrifying logs on one side, the issue turns out to be on the other. I opened sea_console.dat from the other chassis (this time I found the correct year immediately) and there it was, the root of all the evil in the world: a non-recoverable parity error on the supervisor.
%LTL-SW2_SP-2-LTL_PARITY_CHECK: LTL parity check request for 0x777
A parity error occurred -> which caused the VSL links to go down -> which caused dual-active detection -> which caused the chassis to reload. End of story :).
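One more thought on the customer's prevention question: in this setup both VSL members hung off the same supervisor, so a single hardware fault on that card could kill the whole VSL at once, exactly as it did here. Where the hardware allows it, spreading the VSL port-channel members across different modules shrinks that failure domain. A hypothetical sketch (interface and port-channel numbers are made up; verify which modules in your chassis support VSL):

interface Port-channel1
 switch virtual link 1
!
interface TenGigabitEthernet1/5/4
 channel-group 1 mode on
!
interface TenGigabitEthernet1/2/1
 channel-group 1 mode on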