In a past post, we’ve discussed microloops in link state protocols. If we examine a small ring topology (if you come to my Interop talk, you’ll discover that ring topologies are the heart of network convergence), we can see where and how a microloop forms.
If the link between A and B fails, A and B will be notified of the link failure first, followed by E and C, and finally D will be notified. If every one of these routers runs SPF after the same “hold down,” then A and B will adjust to the new topology first (A pointing towards E, and B towards C), E and C next, and D last. The microloop (or transient loop) forms because A points towards E for any routes that once passed through B, while E continues to point towards A for those same routes – at least until A and E have both run SPF.
One solution to this problem is one of the tunneled varieties of fast reroute. If A tunnels traffic to D, for instance, when the (A,B) link fails, E’s table won’t matter in convergence terms. A is routing traffic through E, such that E doesn’t examine the actual destination address while forwarding the packet. Another solution would be to change the order in which these routers install the new routing information into their tables. Even though A is the first to know about the change about paths using B as their next hop, A needs to be the last to install these routes. Thus, ordered FIB.
Ordered FIB first assumes that every router in the network will receive a new link state advertisement outlining the topology changes which need to be converged on before any router in the network begins running SPF. This means the SPF hold time must be greater than the time it takes to flood a link state update through the flooding domain (area border break the flooding domain, of course, so we don’t need to count routers outside the local area in either OSPF or IS-IS here). This means ordered FIB assumes that convergence is “not urgent.” In this case, ordered FIB would assume that a fast reroute repair has been placed in the network, and hence that quickly converging on the new topology after the failure is less important than insuring no routing loops occur during reconvergence.
Once a router receives the updated link state packet, it runs SPF twice. The first time, it calculates the new shortest path tree to any affected destinations – E would calculate a new best path to B, for instance. The second time, it calculates a reverse tree from the point of failure all the way through the network. Router E, for instance, would run an SPF rooted at A back to B. Noting the best paths towards B through A stop at D, it would calculate the maximum path length impacted by this failure as three hops (the longest being (D,E,A)).
This number is then used by D to calculate a rank, which is essentially the number of routers that rely on E to reach the destinations impacted by the failure. For D, the rank is 0; for E, the rank is 1; for A, the rank is 2. Now we have the order in which these routers must install new routes in their local forwarding tables to prevent a microloop. What do we do with this bit of information?
Each router along the path needs to hold any new routing information until its certain routers that rely on it in the reverse shortest path tree have also installed their new routes (i.e., E must be certain D has installed its routes before E can install the new routes, and A must be certain E has installed its routes before A can install the new routes). The simplest way to do this is to calculate a hold time based on the rank – and this is precisely what ordered FIB does. At D, the rank is 0, so D doesn’t wait to install the new routing information; as soon as it is finished running SPF, D installs the new routes. Router E, however, must wait until its certain D has installed the new routes, so it waits 1 * MAX_FIB before installing. Router A, likewise, must wait 2 * MAX_FIB.
What is MAX_FIB? This is just a new wait time configured on each router. MAX_FIB needs to be set to at least as long as the normal SPF hold time, probably plus the amount of time it takes to flood a new link state packet through the network, for all of this to be effective. Adding MAX_FIB into the installation time will dramatically slow down network convergence – hence the assumption that the network change is “non urgent,” or rather that preventing loops is more important than converging the network.
Note the process is slightly different for link up, router up, or a decreasing metric someplace in the network; rather than using a shortest path rooted at the point of failure to calculate rank, a shortest path rooted at the local node across the new paths is used.