I’ve been spending some time in the last few months talking through various fast reroute systems – we’ve looked at one (unconventional!) view of P/Q space, an alternate way of explaining MRT, Not-Via, LFAs, and a few others. Now, let’s close this series by asking: How does all this relate to the “new wave” of control plane centralization?
The obvious answer: any/all of these techniques can provide the “fast convergence” piece of a centralized control plane just as easily as they do for a distributed control plane. It’s possible for an OpenFlow controller, for instance, to calculate a set of alternate paths using MRT, and install those paths as “backups,” just as an EIGRP Feasible Successor, or an IS-IS LFA, is installed today in the forwarding plane. If the interface through which the best path runs fails, a forwarding device (switch) can remove the primary path, falling back to a precalculated backup path – just like a device participating in a distributed control plane might do.
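To make the idea concrete, here is a minimal sketch of a forwarding device holding controller-installed primary and backup next hops, and failing over locally without waiting on the controller. The `Fib` class and its method names are illustrative, not any real OpenFlow or vendor API.

```python
class Fib:
    """Hypothetical forwarding table with precalculated backups."""

    def __init__(self):
        # prefix -> (primary_next_hop, backup_next_hop)
        self.routes = {}

    def install(self, prefix, primary, backup):
        # The controller pre-installs both paths, much as an EIGRP
        # feasible successor or an IS-IS LFA is installed today.
        self.routes[prefix] = (primary, backup)

    def link_failed(self, next_hop):
        # Local repair: promote the precalculated backup for every
        # prefix whose primary path ran through the failed interface.
        for prefix, (primary, backup) in self.routes.items():
            if primary == next_hop:
                self.routes[prefix] = (backup, None)

    def lookup(self, prefix):
        return self.routes[prefix][0]


fib = Fib()
fib.install("10.0.0.0/8", primary="C", backup="E")
print(fib.lookup("10.0.0.0/8"))  # "C" -- the primary path
fib.link_failed("C")
print(fib.lookup("10.0.0.0/8"))  # "E" -- local fallback, no controller round trip
```

The point of the sketch is that the failover decision is entirely local to the switch; the controller’s only role was computing the backup ahead of time.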
This won’t, however, always work. Why?
Underlying a number of these fast reroute mechanisms is the assumption that the control plane will “ride to the rescue” with a new best path before the use of a backup path has any bad effects. Let’s take LFAs as an example, using the figure below.
Assume, for a moment, that some centralized control plane configures D so traffic normally passes along (D,C,B), but in the case of a failure at (D,C), traffic will be routed through (E,A,B). This is, essentially, what IS-IS, using standard LFAs, might do. That this rerouting of traffic is a good idea depends, of course, on the path through E being usable at the moment of the (D,C) failure. But what if it isn’t? What if the metrics along (A,E) changed just moments before the (D,C) failure, causing A to use the path through E as its best path to B?
In this case, we could do much worse than simply dumping our tokens in the ether. We could, in fact, build a loop in the network that will persist until the control plane that set all this stuff up actually rides to the rescue.
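That loop is easy to see if you walk a packet hop by hop. The sketch below is a toy simulation of the scenario just described; the next-hop table reflects the moment after the (A,E) metric change and the (D,C) failure, and the node names match the figure.

```python
# Each node's independently held next hop toward B, at the moment of failure:
next_hop_to_B = {
    "D": "E",  # D has invoked its precalculated backup through E
    "E": "A",  # E still believes the path (E,A,B) is best
    "A": "E",  # A's new metrics now prefer the path through E
}


def forward(src, ttl=8):
    """Walk a packet hop by hop toward B; return the path it takes."""
    path, node = [src], src
    while node != "B" and ttl > 0:
        node = next_hop_to_B[node]
        path.append(node)
        ttl -= 1  # without a TTL, this packet would circulate forever
    return path


print(forward("D"))  # ['D', 'E', 'A', 'E', 'A', ...] -- a persistent loop, B never reached
```

The packet bounces between A and E until its TTL expires – and the loop itself persists until the central controller recomputes and pushes new state.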
The point is that when you invoke most fast convergence mechanisms, you’re actually placing the network into a “fallback state.” The risk of a double failure, and hence of a real blowout, increases each moment the network remains in this fallback state. While many folks would argue that centralized control planes will fare no worse in this situation than distributed ones, I’m not certain I agree. The ability to react to local events locally can never be matched by a centralized controller reacting to events in some distant corner of the network.
What can we do about this problem? One not-so-obvious answer is to ensure the fast reroute mechanism we use will drop traffic, rather than looping it, in the case of a double failure, or a rapid change in network conditions once the reroute mechanism has been triggered. In other words, any reroute mechanism that dumps traffic back into the normal routing table once it has been rerouted increases risk; doubling this risk by centralizing the control plane might not be the best idea in the world.
So for centralized control planes, it’s best to lean towards alternate overlay topology solutions, such as MPLS-based FRR solutions (which we’ve not discussed in this series – after all, this is supposed to be IP fast reroute!), or something like MRT. The key point is that once a packet is placed on an alternate topology, it must be either delivered or dropped.
IP/FRR does apply to centralized control planes, then, but not in the clean, obvious way we might expect at first glance. Just another case of piling complexity on complexity, and of the need to be very careful in evaluating how complex systems interact.