“Hey Fish, how good are you at BFD? None of my BFD neighbors will come up.”
Two simple sentences and I am “hooked.” I love troubleshooting! Troubleshooting is just a blast for me! It’s like being a Network Detective trying to figure out “whodunit.”
As I sit down in front of the CLI and the diagram, I ask, “What is the routing protocol the BFD is for?” The answer, admittedly, makes my heart sink a little. This isn’t just “BFD”, this is BFD for FabricPath ISIS. I’m in no way, shape, or form a FabricPath subject matter expert. I don’t think I’m going to be able to help them. But I’ll try.
I’m very lucky to be troubleshooting in the lab instead of in production. It affords me the opportunity to isolate a failure domain to as few pieces as I can. Since we are looking at BFD neighbors the question becomes, “what can I remove and still have it fail?” Just need 2 boxes.
After confirming the basics (CDP, supported code, configurations, and trying commands I know) I start entering commands and putting “?” at the end. The analogy I use for this is snooping around a crime scene. I begin with the two below:
- show bfd ?
- show fabricpath ?
After a little while I trip across one that I like – show bfd neighbors fabricpath details and I stop and look closer at this. Why? Honestly I can’t always explain it. I guess it would be like someone asking a detective why they are spending so much time looking at a specific part of a crime scene. I don’t always know. Years of troubleshooting? Hunch?
What makes me stop at the output to this command is that the Rx and Tx counts are ALL zero. We openly acknowledge that we’ve got NOTHING. This is pulling me in.
Should these be non-zero if it is working? I have no experience with what a working show bfd neighbors fabricpath details is supposed to look like. If I had this working anywhere at all (even if it was not the same platform) I would go to that environment and compare, just to see what is different and gather clues.
For this situation I have no working BFD over FabricPath ISIS anywhere, so I’m going to stop here for a while and keep snooping. See what clues (if any) pop up.
“AdminDown” on the State bit causes me to assume that is short for “Administratively Down”. Couple that with the Tx count being zero, and it kind of feels like maybe something needs to be “no shut” or enabled.
Wait! What is that at the bottom? “Vlan used for FP-BFD session not in Fabricpath mode” So what this tells me is maybe there is something else in the box keeping the BFD session “administratively down” because it knows that something isn’t right.
I ask the FabricPath subject matter expert what this might mean and I am directed to check the configurations. I find that, indeed, vlan 2006 is configured to be in mode fabricpath. Scratch that theory.
Since vlan 2006 is properly configured for mode fabricpath, is the FP-BFD code using a different VLAN than vlan 2006? Time to go to sniffer traces or debugs. I decide to go to the debugs first. Why? The “admindown” for the state and also the Tx count of zero. I don’t think the FP-BFD is going out on the wire at all, so a sniffer would have nothing to show me.
- Pick one of the 2 boxes
- Shut down all interfaces except for management
- Make sure no interfaces are up except for management
- Double check no interfaces are up except for management
- Two windows to the box — console connection AND a telnet
- Debugs can get quite “verbose”. My experience tells me that since I will be using the “all” version of a debug, I’m not going to be able to keep up. So I start logging on both windows. One screen will get the debugs, the other will be what I use for “show” commands and also to “undebug all”.
- Enable debug bfd all
- No shut the interface I’m focusing on
- Watch the debugs fly on the screen
- Turn off debugs
Did I catch it? Maybe. Maybe not. It’s a first pass.
Next is going through all the debugs. I do this by taking the log and putting it into a document so I can easily search, highlight, make notes and try to “make sense” out of what I’m seeing.
- First? I search for the word “vlan” to see if anything “pops”. No.
- Next? Vlan 2006 is, in theory, what we think is being used for the FP-BFD session so search for this. Nope.
- 2006 in hex? 0x7D6. Search for that. Nothing.
- Can I find either of the MAC addresses in there? Nada.
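For anyone who would rather script this search pass than do it by hand in a document, here is a minimal Python sketch. It is only an illustration of the idea: the function names, the sample log lines, and the MAC address are all made up, and real debug output may format values differently.

```python
import re


def mac_variants(mac):
    """Return common lowercase spellings of one 48-bit MAC address."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC: {mac!r}")
    pairs = [digits[i:i + 2] for i in range(0, 12, 2)]
    quads = [digits[i:i + 4] for i in range(0, 12, 4)]
    return {
        digits,            # 001122334455
        ".".join(quads),   # 0011.2233.4455 (Cisco style)
        ":".join(pairs),   # 00:11:22:33:44:55
        "-".join(pairs),   # 00-11-22-33-44-55
        " ".join(pairs),   # 00 11 22 33 44 55 (hex dump style)
    }


def vlan_terms(vlan_id):
    """Return search strings for a VLAN: keyword, decimal, hex, raw bytes."""
    raw = f"{vlan_id:04x}"
    return {"vlan", str(vlan_id), f"0x{vlan_id:x}", f"{raw[:2]} {raw[2:]}"}


def search_log(lines, terms):
    """Return (line_number, term, line) for every case-insensitive hit."""
    lowered = sorted(term.lower() for term in terms)
    hits = []
    for number, line in enumerate(lines, start=1):
        haystack = line.lower()
        for term in lowered:
            if term in haystack:
                hits.append((number, term, line.rstrip("\n")))
    return hits


# Hypothetical log lines and MAC, purely for demonstration.
sample_log = [
    "bfd: session init, no neighbor yet",
    "bfd: pkt dump 00 11 22 33 44 55 ff ff",
]
terms = vlan_terms(2006) | mac_variants("0011.2233.4455")
for number, term, line in search_log(sample_log, terms):
    print(f"line {number}: matched {term!r}: {line}")
```

One caveat with this kind of search: inside an 802.1Q tag the VLAN ID shares its two bytes with the priority bits, so a tagged frame with priority set would not show the VLAN as a clean `07 d6`. A miss on the hex value proves very little.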
Stare at the screen. Stare some more at the screen. Back to the beginning. Start at the top of the debug document and just go page by page and see what jumps out.
That looks potentially promising. What are all those “F”s though?
- Look again to see if either Ethernet MAC address is in there so I can figure out if I can find a SMAC or DMAC and make sense of what I’m looking at. No.
- Assume that maybe this is higher layer.
- Go to Google and search for a BFD sniffer trace. Pull it down and open it in Wireshark. Can I get ANY matches in the detailed hex between the two to help me identify the debug hex? Uh… no.
Stare at the screen. Stare some more at the screen.
Does that say “81 00?” Could it be? Please tell me that’s an 802.1Q header!
Assuming that the 0x8100 in my debug is the TPID for an 802.1Q header, that would make:
- SMAC = 002A.6A1C.127C
- DMAC = 0180.C200.0042
- Ethertype = 0x8946
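The same breakdown can be sketched in a few lines of Python. This assumes the standard layout of an 802.1Q-tagged Ethernet header: 6 bytes of DMAC, 6 bytes of SMAC, the 2-byte TPID, 2 bytes of tag control information, then the 2-byte ethertype. The byte string below is rebuilt from the values quoted in this post, not copied from the original debug output, and the function names are my own.

```python
def dotted(raw):
    """Format bytes Cisco style, e.g. 0180.c200.0042."""
    text = raw.hex()
    return ".".join(text[i:i + 4] for i in range(0, len(text), 4))


def split_tagged_header(hex_dump):
    """Split the first 18 bytes of an 802.1Q-tagged Ethernet frame."""
    frame = bytes.fromhex(hex_dump)  # whitespace between bytes is ignored
    if len(frame) < 18:
        raise ValueError("need at least 18 bytes for a tagged header")
    tpid = int.from_bytes(frame[12:14], "big")
    if tpid != 0x8100:
        raise ValueError(f"no 802.1Q TPID at offset 12, found {tpid:#06x}")
    return {
        "dmac": dotted(frame[0:6]),
        "smac": dotted(frame[6:12]),
        "tpid": tpid,
        "tci": int.from_bytes(frame[14:16], "big"),
        "ethertype": int.from_bytes(frame[16:18], "big"),
    }


# Header bytes reconstructed from the values in this post (an assumption
# about byte order on the wire, not the literal debug dump).
header = split_tagged_header(
    "01 80 c2 00 00 42  00 2a 6a 1c 12 7c  81 00 e0 01  89 46"
)
print(header["dmac"], header["smac"], hex(header["ethertype"]))
```

Anchoring on the 0x8100 at offset 12 is the whole trick: once that one field is located, the fields on either side of it fall out by position.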
None of this makes ANY sense to me. None of these are Ethernet MAC addresses that I can find on the other box, and I have never heard of ethertype 0x8946 before.
As my friend Amy would say, “Le Sigh”
Wait!!! “002a” … I saw that somewhere in those show commands. Which one was that? Oh right, I was logging both of my sessions, including the one with the show commands. Open the document and search for “002a”.
But that isn’t a MAC address. That is the FabricPath ISIS system ID. Wait. What does that next line say? “Fabric Control MAC”? Well I guess this is BFD for FabricPath ISIS. But who is 0180.C200.0042? It isn’t showing up in either switch. Okay, I’ll let that go and take the clue that the Fabric Control MAC/System ID is in the SMAC portion before what looks to be an 802.1Q header. So if that is the case I can break the 802.1q header down and see what VLAN it is trying to use for this.
If I assume “81 00 E0 01” is an 802.1Q header, then “E” breaks down to ‘1110’, of which the first 3 bits are the priority. 7 = ‘111’. Okay. That would make sense. So far so good. The fourth bit is 0, and the remaining 12 bits, 0x001, are the VLAN ID. So based on all that, if this really is an 802.1Q header I’m looking at, then the vlan being used for FP-BFD is vlan 1. So, in theory, if we configure vlan 1 to be in mode fabricpath this will all work.
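To double check the bit math, here is a small Python sketch that decodes the four tag bytes using the standard 802.1Q layout: a 16-bit TPID followed by 3 bits of priority, 1 drop-eligible bit, and a 12-bit VLAN ID. The function name is my own.

```python
def decode_dot1q(tag_hex):
    """Decode a 4-byte 802.1Q tag given as hex text, e.g. '81 00 E0 01'."""
    tag = bytes.fromhex(tag_hex)
    if len(tag) != 4:
        raise ValueError("an 802.1Q tag is exactly 4 bytes")
    tpid = int.from_bytes(tag[:2], "big")
    tci = int.from_bytes(tag[2:], "big")
    if tpid != 0x8100:
        raise ValueError(f"unexpected TPID {tpid:#06x}")
    return {
        "pcp": tci >> 13,         # top 3 bits: priority
        "dei": (tci >> 12) & 1,   # next bit: drop eligible indicator
        "vid": tci & 0x0FFF,      # low 12 bits: VLAN ID
    }


print(decode_dot1q("81 00 E0 01"))  # {'pcp': 7, 'dei': 0, 'vid': 1}
```

Priority 7 and VLAN 1, the same answer as the by-hand breakdown.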
… drumroll please…
CASE CLOSED: Misconfiguration. Vlan 1 was not in fabricpath mode.
Additional Geekiness. After solving the “whodunit”, it bothered me later that I still didn’t know what that DMAC and ethertype were. Ethertype 0x8946 and destination MAC address 0180.c200.0042 are both associated with TRILL.