This last week I received an email from a friend asking about scaling. The situation is this: a particular company has well over 100 EIGRP routers on a single L2 service from a provider. Will this scale? What’s more interesting than simply asking about scale, though, is to ask the “why” question — no matter what answer you give, tell me why you think that answer is true. Let me give you my analysis, and then you can tell me if I’m right or wrong in the comments.
The first step in determining the answer is to think through how EIGRP works on this type of link. When any EIGRP router receives an update, it will examine its local table to determine if this impacts the existing best path to a given destination. There are three possibilities here; let’s look at each in turn.
The first is that the new path is the same as, or worse than, the current path (metric-wise), and it doesn’t change the current best path. In this case, nothing happens.
The second is that the new path is better than the old path. In this case, the EIGRP router must send an update across the broadcast link. What does this actually look like from a packets-on-the-wire perspective? The updating router sends a single packet and (assuming it’s not lost) each of the connected routers will reply with an ACK. The ACK itself counts as a hello packet, so it doesn’t (really) count as a “full” extra packet on the wire. So in this case the result is one and a half packets or so additional on the wire for EIGRP to operate.
The third is that this is either a query or a route-down event that impacts the current best path, in which case (assuming there is no feasible successor), the router will need to move the route into the active state and send a query to each of its neighbors. This query is actually sent as a multicast, but each reply must be sent as a unicast back to the querying neighbor. Combined with ACKs (but remembering an ACK replaces a hello), this could mean somewhere on the order of 100 to 150 additional packets.
There is actually a fourth, but it involves some odd stuff buried down in QO_MULTI, and it would take longer to explain than this blog post should be.
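If you want rough numbers on those first three cases, here’s a quick back-of-the-envelope sketch in Python. The half-packet weighting for ACKs (because they stand in for hellos the neighbors would send anyway) is just my shorthand for the reasoning above, not anything measured on a real wire:

```python
def update_packets(n_neighbors):
    # Case two: one multicast update goes out; every neighbor ACKs it,
    # but each ACK stands in for a hello the neighbor would have sent
    # anyway, so the whole exchange costs about a packet and a half
    # regardless of how many routers sit on the segment.
    return 1.5

def query_packets(n_neighbors):
    # Case three: one multicast query, then a unicast reply from every
    # neighbor, plus ACKs (again weighted at half a packet apiece
    # because they replace hellos).
    return 1 + n_neighbors + 0.5 * n_neighbors

print(update_packets(100))  # 1.5
print(query_packets(100))   # 151.0 -- the "100 to 150 or so" range
```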
From a packet flow perspective, then, I don’t see a scaling issue. OSPF would roughly double this packet count (because the updater must send the packet to the DR via multicast, and the DR must then reflood the packet across the same wire to the connected routers), but this still doesn’t seem like a lot of work from a packet perspective.
But — this packet analysis is per update. To find the real impact, you need to multiply the number of packets in one update or query chain by the number of updates per second at each EIGRP router connected to the network on average. Knowing the number of EIGRP routers connected to a single broadcast link, then, doesn’t really help us understand whether or not the network will scale. The speed at which the routes change makes a big difference, too.
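In other words, the load on the wire is a simple product of per-event cost and churn rate. A sketch, with purely illustrative numbers:

```python
def wire_load(packets_per_event, events_per_second):
    # Packets per second the segment must absorb for routing alone:
    # the per-event packet cost times how often routes actually change.
    return packets_per_event * events_per_second

# Same 100-router segment, very different answers depending on churn
# (both the event cost and the churn rates here are illustrative guesses):
quiet  = wire_load(150, 0.1)  # stable network: 15.0 packets/sec
stormy = wire_load(150, 50)   # heavy churn: 7500 packets/sec
```

The router count fixes the per-event cost; the churn rate is what actually decides whether the link is comfortable or drowning.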
The second area we need to look at to think through this problem is the size of the routing table being transferred over the link. The more routes there are, the longer it will take to converge across the link in the case of any sort of neighbor failure. Taking the worst case, a stuck-in-active event across this link can cause a major flood of information through the link. Since EIGRP turns pacing off at any link speed above T1 by default, it could be easy enough, with enough rapid change and a domino effect of stuck-in-active events, to swamp the link, causing EIGRP to fail to converge at all. The odds of this are actually really low, though — in my years in TAC and on the Escalation Team, working on thousands, if not tens of thousands, of networks, I saw this happen twice. Both were caused, ultimately, by another problem destroying the EIGRP neighbor relationships fast enough that links were overrun; neither was a chain reaction within EIGRP itself (with no outside actors).
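To get a feel for the table-transfer side, you can bound the time it takes to push a full table across the link. The bytes-per-route figure below is a rough assumption on my part, and this ignores ACKs, retransmissions, pacing, and every other kind of traffic on the link:

```python
def table_transfer_seconds(route_count, bytes_per_route, link_bps):
    # Lower bound on the time needed to push a full table across the
    # link; ignores ACKs, retransmissions, pacing, and other traffic.
    total_bits = route_count * bytes_per_route * 8
    return total_bits / link_bps

# Assume ~50 bytes of update payload per route (a rough guess):
print(table_transfer_seconds(10_000, 50, 1_544_000))      # ~2.6s on a T1
print(table_transfer_seconds(10_000, 50, 1_000_000_000))  # ~4ms on GigE
```

Seconds per full-table exchange on a slow link, with pacing off and several neighbors resynchronizing at once, is exactly the territory where stuck-in-active chains become plausible.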
So far, then, my answer is — this should be fine. But let’s say we wanted to be safe rather than sorry, or that the provider link had a low enough speed, and the [state x speed] of change numbers were high enough, to cause concern. What solutions are available?
My first response would be — aggregate and filter to the minimum number of routes possible across this link. In fact, in the “bad old days,” we’d actually ship customers enough routers to put an extra layer of EIGRP routers between the routers connected to the link and the rest of the network. This is a lot of hardware, but it allowed us to aggregate one hop back from the routers connected to the actual link. Why? I’ll let you puzzle over this for a bit — if you know the answer, put it in the comments. I wouldn’t suggest this today, as the SIA rewrite has pretty much solved the problems we were attacking with this strategy (and that’s the only hint I’m going to give you in solving this riddle).
My second response? If you really can’t get the [state x speed] product down on the link, consider converting it to BGP. This doesn’t reduce your neighbor count, and it will probably actually increase the packet count, but with BGP you have the ability to dampen routes. This can really reduce the [state x speed] product across the link.
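To see why dampening attacks the speed side of the product, here’s a toy model in the spirit of BGP route flap damping (RFC 2439). The penalty, half-life, and thresholds are made up for illustration — they’re not any vendor’s defaults:

```python
# Illustrative constants, not any implementation's defaults.
PENALTY_PER_FLAP = 1000
HALF_LIFE_SECS = 900
SUPPRESS_LIMIT = 2000
REUSE_LIMIT = 750

class DampenedRoute:
    def __init__(self):
        self.penalty = 0.0
        self.suppressed = False

    def flap(self):
        # Each flap adds a fixed penalty; past the suppress limit the
        # route is withheld, so its churn never reaches the wire.
        self.penalty += PENALTY_PER_FLAP
        if self.penalty > SUPPRESS_LIMIT:
            self.suppressed = True

    def decay(self, elapsed_secs):
        # The penalty halves every HALF_LIFE_SECS of stability; once it
        # falls below the reuse limit, the route is advertised again.
        self.penalty *= 0.5 ** (elapsed_secs / HALF_LIFE_SECS)
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False

r = DampenedRoute()
for _ in range(3):
    r.flap()
print(r.suppressed)  # True: three quick flaps cross the suppress limit
```

A flapping route stops generating updates across the segment entirely while suppressed, which is precisely the [state x speed] reduction you’re after.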
My third response? If most of the traffic is passing to and from one or two nodes, turn off EIGRP neighbor discovery and configure the link as if it were an NBMA hub-and-spoke network (remember Frame Relay? I know, it was a long time ago, but just reach way down in the well). This directly reduces the number of routers participating in convergence, effectively reducing the [state x speed] product across the link. This only works well, though, if you don’t have a lot of traffic passing between the other routers across the link. You can turn off split horizon on the link and turn off next hop self, making the hub act like a route server. This seems a little drastic, but it is possible.
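The neighbor-count arithmetic behind this option is worth seeing: with discovery on, adjacencies on the segment grow quadratically with the number of routers, while hub-and-spoke grows linearly (a quick sketch, using the router count from this post as the example):

```python
def broadcast_adjacencies(n_routers):
    # With neighbor discovery on, every router on the segment pairs
    # with every other one: n * (n - 1) / 2 adjacencies.
    return n_routers * (n_routers - 1) // 2

def hub_spoke_adjacencies(n_spokes):
    # With static neighbor statements pointing only at a hub, each
    # spoke holds exactly one adjacency.
    return n_spokes

print(broadcast_adjacencies(100))  # 4950 adjacencies to converge
print(hub_spoke_adjacencies(99))   # 99 with a single hub
```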
So — now it’s your turn. What do you think? Would you worry about scaling here? Why or why not? Dig into the protocol theory in your answer; it’s good practice. Either way, which solution would you deploy, and why?
BTW, the [state x speed] construction is straight out of my new book; it’s one of the pairs in the four-piece model of complexity I’ve been mulling over. What’s interesting to me about this problem is that it so nicely illustrates the intersection of practical networking knowledge with protocol theory and an internal model of what drives complexity in a network.