This article is imported from packetattack.org, a technical blog I maintained before planting my flag at packetpushers.net. I’ll be moving the most popular blog posts from packetattack.org to packetpushers.net in the coming weeks.
Slaying SPOFs
Some of you know I took on a new job earlier this year, where the challenge was (and is) to transform a globally distributed network for a growing company into an enterprise-class operation. A major focus area has been eliminating single points of failure (SPOFs): single links, single routers, single firewalls, etc. If it can break and consequently interrupt traffic flow, part of my job is to design around the SPOF within the constraints of a finite budget.
The network documentation I inherited ranged from “mostly right but vague and outdated” to “a complete and utter fantasy requiring mind-altering substances to make sense of”. Ergo, untrustworthy to the point of being useless beyond perhaps slideware to show a particularly dim collection of simians. I have therefore been doing complete network explorations, building new documentation as I go.
To my horror, I one day discovered an egregious SPOF, where a single, fragile piece of CAT5 provided the sole physical path between two major concentrations of network activity. If that link ran into any trouble, an entire room containing hundreds of physical and virtual servers (and their storage) would have been cut off from the rest of the company.
To eliminate the physical path SPOF, the easy choice was to transform the single link into an etherchannel. This I did; the single 1Gbps link became a 4x1Gbps etherchannel plumbed back to one core switch. For good measure, I added a second 4x1Gbps etherchannel, plumbed to a second core switch. Spanning-tree roots had already been established such that even-numbered VLANs would traverse one of the 4x1Gbps etherchannels, and odd-numbered VLANs the other…which you can read more about here if interested.
All should now be sweetness and light, right? A 1Gbps SPOF (and probable bottleneck) was transformed into a load-distributed pair of 4x1Gbps etherchannels, and hey, if they weren’t complaining about the 1Gbps link before, they ought to be blissfully happy now!
Mad Maths
Enter the scaling problem: when it comes to etherchannel, 1+1 does not equal 2.
The reason adding more physical links does not proportionally grow your available bandwidth is that your friendly neighborhood Cisco switch does not load-balance across etherchannel members frame by frame. You might assume that frame #1 gets sent down etherchannel member #1, frame #2 down etherchannel member #2, and so on in round-robin fashion. Reality is rather different. What the switch actually does is math. The sort of math will vary depending on the capabilities of the switch, and on what you have configured.
Commonly available etherchannel load-balancing methods include source and destination MACs, source and destination IPs, and (my personal favorite) source and destination layer 4 port. To determine which etherchannel member will be used to forward a frame, the switch performs mad maths based on the load-balancing method you’ve selected. The practical upshot is that the same conversation is always going to be forwarded across the same etherchannel member, because the math always works out the same.
This behavior can impact the network. Imagine backup server BEAST, with enough horsepower to fill a 1Gbps link, running a restore operation to server NEEDY. BEAST and NEEDY are uplinked to different switches interconnected by an etherchannel. As the restore runs, each frame is hashed by the switch to determine which etherchannel member to forward across; the math works out the same for every frame, meaning the entire conversation between BEAST and NEEDY is forwarded across the same etherchannel member. The result is kind of like the picture above: one member that’s crushed, while the other members lie comparatively idle.
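To make that pinning behavior concrete, here is a minimal Python sketch. The hash function, addresses, and port numbers are all invented for illustration, and this is not Cisco’s actual algorithm (which is platform-specific); the point is simply that identical header fields always produce the identical member choice.

```python
# Illustrative only: NOT Cisco's actual hash, just a stand-in that shows
# why one conversation always lands on the same etherchannel member.

def pick_member(src_ip, dst_ip, src_port, dst_port, num_members):
    """Deterministically hash a flow's header fields to a member index."""
    # Any deterministic hash behaves this way: same inputs, same member.
    return hash((src_ip, dst_ip, src_port, dst_port)) % num_members

# Every frame of the hypothetical BEAST -> NEEDY restore picks the same member...
restore_member = pick_member("10.1.1.10", "10.1.2.20", 49152, 3260, 4)
print("BEAST->NEEDY always forwarded on member", restore_member)

# ...while other conversations may or may not land on that same member.
for dport in (80, 443, 445, 1433):
    m = pick_member("10.1.1.50", "10.1.2.60", 50000, dport, 4)
    print(f"Flow to port {dport} hashes to member {m}")
```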
Congestion Indignities
The switch is not sensitive to an etherchannel member getting crushed; the switch just keeps on doing mad maths. Therefore, some other conversations heading across the link will just happen to get hashed to the same link that the BEAST-NEEDY restore operation is using. Those other unfortunate conversations will therefore suffer the indignities that happen during link congestion: dropped frames and increased latency. The real-world experience is that certain applications act slow or throw errors. Storage could dismount. Monitoring applications get upset as thresholds are exceeded.
Yuck.
Of course, it’s now up to the network engineer (you) to discover why the alarms are going off, track down the offending traffic flow (you are modeling your interswitch links, right?), and figure out what is to be done about it. In my experience, you won’t have a lot of luck explaining what’s happening to non-network people. I’ve had a hard time explaining that 1+1 doesn’t equal 2 (or that 1+1+1+1 doesn’t equal 4). You don’t really have a 2Gbps or 4Gbps link just because you’ve built a fancy etherchannel. You’ve really got multiple parallel 1Gbps links, any one of which can still get congested in BEAST-NEEDY scenarios.
So Fix It, Network Guy
There are a few ways to tackle the challenge of 1+1 not equaling 2.
- Learn your traffic patterns. See if you can group heavy hitters into the same switch. That’s a pretty old-school way to go after the problem, and it won’t scale to large data center deployments. But you can find wins in this approach from time to time.
- Build a dedicated link. By this, I mean that you could build a link dedicated to just the traffic that’s causing the interswitch etherchannel all the heartburn. If your etherchannel is a trunk carrying a whole bunch of VLANs, you could build a parallel link that carries traffic for just a problem VLAN, while pruning that VLAN off of the etherchannel trunk. Might help, might not, depending on your situation…and of course, it’s a “one-off” fix, not necessarily a scalable solution. Some shops build networks dedicated to storage or to backup, and plumb specific interfaces on hosts to these specific networks for exactly this reason. There are increased costs in hardware, cabling, and complexity to make it happen, though.
- Add even more 1Gbps links to the etherchannel. This is not terribly practical. At the end of the day, you still have a potential bottleneck, but at least you’ve decreased the number of conversations that are likely to get hashed to a congested link (see the sketch after this list).
- Replace the 1Gbps links with 10Gbps links. Increasing bandwidth is always an option. The jump to 10Gbps is a tough one, though: new switch hardware, higher power requirements, and likely new cabling will be required. And don’t forget to break out your checkbook.
- Apply QoS. If you have known offenders or predictable traffic patterns, you can write a QoS scheme to help manage the congestion. I tend to pump traffic like this through a traffic shaper, but there are other approaches, such as guaranteeing minimum bandwidth to important traffic, while dumping the link beast into the scavenger class. I have found that latency still tends to suffer under a guaranteed-minimum-bandwidth (CBWFQ) scheme. I have had the best luck with shaping.
- Tweak the beastly application. It’s not uncommon for certain applications to have a built-in throttle, so that you can cap network utilization right at the app. Talk to your system engineer and see…I’ve heard they’re people, too.
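On the “add even more links” option above, the benefit is probabilistic rather than absolute. Here is a rough sketch, assuming a perfectly uniform hash (real platforms only approximate this with their fixed hash buckets), of how the odds of a second heavy conversation sharing a member with the BEAST-NEEDY restore shrink as members are added; any single conversation still tops out at 1Gbps, though.

```python
import random

# With a uniform hash, the chance that a second heavy flow lands on the
# same member as BEAST-NEEDY is simply 1/n; this simulation confirms it.
def share_probability(num_members, trials=100_000):
    hits = 0
    for _ in range(trials):
        beast_needy = random.randrange(num_members)  # member chosen for the restore
        other_flow = random.randrange(num_members)   # member chosen for another heavy flow
        if other_flow == beast_needy:
            hits += 1
    return hits / trials

for n in (1, 2, 4, 8):
    print(f"{n} member(s): ~{share_probability(n):.0%} chance of sharing a member")
```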

Pretty timely post. And I agree wholeheartedly about the etherchannel. I think some people also tend to forget that although it’s better than nothing, you don’t exactly get double the throughput. You will gain port redundancy and allow your traffic more options, but you can never go beyond what the physical interface can handle.
Downside is that if you are tight on ports, it will take away even more available ports….
Nice photo choice 🙂
Hi,
do you have any further information about how the algorithm works? It would be very useful for future designs.
Regards
Tom
Hey Tom,
Best reference I have found on DocCD is:
http://www.cisco.com/en/US/tech/tk389/tk213/technologies_tech_note09186a0080094714.shtml .
The whole etherchannel section has quite a bit of useful information. Check it out at:
http://www.cisco.com/en/US/tech/tk389/tk213/tsd_technology_support_protocol_home.html
Hope that helps.
Kurt (@networkjanitor)
Whatever header fields you choose for load balancing (src/dst of mac/ip/L4 port), outgoing frames are always hashed into 8 buckets.
The 8 hash buckets are distributed among the link members:
8 links in the aggregation? Each link is assigned a bucket.
7 links? One link handles two buckets, the rest handle only one.
6 links? Two links handle two buckets, the rest handle only one.
…
1 link? It handles all 8 buckets.
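If it helps to see that arithmetic, here is a small Python sketch of spreading 8 buckets across a given number of member links. It is my own illustration; which specific buckets a real switch assigns to which link can differ, as the load values below show.

```python
# Spread 8 hash buckets across N member links. With anything other than
# 1, 2, 4, or 8 links, some links carry more buckets than others.
def distribute_buckets(num_links, num_buckets=8):
    assignment = {link: [] for link in range(1, num_links + 1)}
    for bucket in range(1, num_buckets + 1):
        assignment[(bucket - 1) % num_links + 1].append(bucket)
    return assignment

for n in (8, 7, 6, 1):
    counts = {link: len(b) for link, b in distribute_buckets(n).items()}
    print(n, "links ->", counts)   # e.g. 6 links -> {1: 2, 2: 2, 3: 1, 4: 1, 5: 1, 6: 1}
```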
Which buckets are assigned to each link? Check the “load” parameter of ‘sho int po x etherchannel’ or ‘sho etherchannel detail’
Take the value of each link. Turn it into binary. An aggregation here shows the following “load” values for its 6 links.
0x41 = 01000001 : buckets 1 and 7
0x02 = 00000010 : bucket 2
0x04 = 00000100 : bucket 3
0x88 = 10001000 : buckets 4 and 8
0x10 = 00010000 : bucket 5
0x20 = 00100000 : bucket 6
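To decode those load values yourself, here is a quick sketch following the numbering above (low-order bit = bucket 1):

```python
# Decode the "load" value from 'sho etherchannel detail' into hash buckets.
# Each set bit represents one of the 8 buckets; bit 0 = bucket 1.
def load_to_buckets(load):
    return [bit + 1 for bit in range(8) if load & (1 << bit)]

for load in (0x41, 0x02, 0x04, 0x88, 0x10, 0x20):
    print(f"0x{load:02X} = {load:08b} : buckets {load_to_buckets(load)}")
```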
Which traffic gets assigned to each bucket?
I’ve seen various unsatisfying explanations of the algorithm, but have not been able to nail it down to the point that I could predict which link will be chosen for a given packet. It probably doesn’t matter, because link failures will move things around on you. Look into ‘lacp max-bundle’ if that’s a concern. Note also that re-assigning links to buckets takes time, and that packets belonging to a given flow may become misordered during bucket redistribution.
Hi,
It’s interesting to note that Brocade’s latest VCS technology seems to have solved this problem by enabling frame-by-frame distribution, which eliminates the typical issues with hash-based EtherChannel. http://community.brocade.com/message/14481#14481
I’m curious how Brocade is doing frame-by-frame. The IEEE High Speed Ethernet Study Group paper http://grouper.ieee.org/groups/802/3/hssg/public/apr07/frazier_01_0407.pdf implies that it’s risky to do so:
“All packets associated with a given ‘conversation’ are transmitted on the same link to prevent mis-ordering”
All the other 802.3ad / Etherchannel documentation I’ve seen clearly indicates that once a load-balancing algorithm is chosen, any given TCP session always rides over the same link, that is, 1 + 1 + 1 + … always = 1. The benefit of link aggregation, as Ethan and commenters indicate, is fault tolerance and a larger pool of links for the LB algorithm to pick among.
If Brocade is choosing to do frame-by-frame, then I think the risk is inherent that some packets arrive out of order. Given that the speeds and latencies of parallel etherchannel links are presumed to be identical, I think that risk is largely mitigated. Even during congestion, all links would be equally congested in a frame-by-frame scenario, and so out-of-order packet risk should be small.
Good question for a Brocade engineer – whether they have put specific technical means in place to cope with this issue (i.e. monitoring of flows and buffering to ensure ordered packets), or if they are just hoping it all works out based on the assumption that it won’t happen all that often.
TCP and many UDP-based apps handle out-of-order delivery by design and definition. So what’s so damn bad about reordered packets that all the switch makers (who should know better) run from them in fear?
It places a burden on the receiving system to re-order them. The results can range from “who cares” to notable CPU utilization. Worst case, an out-of-order packet is deemed to be missing, and an entire TCP group must be retransmitted. For similar reasons, fragmentation is to be avoided when possible. It’s not that it can’t be coped with, just that it’s not desirable.
Who cares how they arrive? The processing engine has to assume that they are out of order 100% of the time anyway and put them back in place…even if that means putting them back in the same sequence.
The CPU should never see the load; it should be done in silicon, with perhaps an exceptionally minor delay if there really is an out-of-order packet stream.
Cisco allows round-robin packet delivery in their fibre equipment; they should do so in all others as well, so that 1+1 does equal 2.
It’s only code, and I daresay (with no experience in the coding of TCP packet reassembly) that it would be easier than trying to figure out which hash bucket to put a frame into: just send them in a 1,2,3,4,5,1,2,3,4,5,1,2,3,4,5 sequence.
My assumption is that there is a very good technical reason round-robin frame delivery isn’t being done across aggregated links carrying IP traffic, and not because it just never occurred to the engineering teams who code these things. Why do I think that? Round-robin is the painfully obvious first choice. Since round-robin is not even an option you can pick in IOS under duress (like you can with CEF), my guess is that somebody knows something we don’t. More to the story than out-of-order IP packets? Maybe.
Out of order packet delivery will technically work, but (depending on how out of order things get) it won’t necessarily work well.
An extreme case: Many years ago, I had both an ISDN (128k) and an IDSL (144k) connection from the ISP I ran to my house. One day, for fun and the glory of faster Internet, I set up equal cost round robin packet delivery across these links. Because the links had different latency and different link speeds, this resulted in significant out of order packet delivery. I learned a few things.
PPTP has packet sequence numbers and simply DROPS out of order packets. This made my VPN experience terrible.
Realplayer was unusable, presumably for the same reason PPTP was so terrible.
TCP downloads WERE faster than a single link, but their behavior was interesting. There were many duplicate packets. You might think I’d get 128k + 144k speed. Or you might think I’d get 128k*2 speed.
Instead, what I saw was that 128k*2 bandwidth was used on the links in order to provide me with 128k*1.5 (192k) of throughput. The other 0.5 of a link (64k) was wasted bandwidth by duplicated transmissions.
Without SACK, I believe out of order packets would cause duplicate acks and trigger a retransmit. It seems to me that SACK should significantly help this situation, but I’m not sure offhand how bad it would be.
My example was extreme, with significant reordering. When aggregating slower links into a higher-speed multi-link backbone, statistically speaking, out-of-order packets are not likely to happen very often. But when they do happen, they might cause retransmits, or some effective packet loss for some protocols, or other bad things. Most things will work most of the time. But is this good enough?
The moral of the story: Just because something can accept out of order packets doesn’t mean it will be done efficiently. Out of order packets can cause unexpected issues in unexpected applications.
Imagine if you were an end user and your ISP was causing some reordering. You can download fine. Your once a second pings show 0% packetloss. But inside your VPN you see 1% packetloss. Good luck trying to find the cause from your limited end-user point of view, and even if you figure it out then good luck getting your ISP to fix it.
Based upon the subject, I thought this post was related to the above issue, in which traffic can be unevenly balanced even if you get the distribution right with IP/TCP balancing. I.e., you have 8 links and all is happy; you drop to 7 and the load becomes uneven because of the bucket assignment decision (“BAD” :)).
I did etherchannel in a lab some years ago with a few 3550s and it worked fine, bundling up to 4 interfaces and getting 400Mbps between either switches or servers. I was floored, but recalled that Kalpana invented etherchannel and was purchased by Cisco for a reason. I do not recall if I used ISL or dot1q, but I know I had an early 12.x IOS and it was an enterprise image. I also recall using the Intel ProSet software, which supported PAgP. I think Intel dropped support for PAgP and now just supports “mode on” for static channels.
Great article Ethan. While I am not a network engineer, I have seen this design misconception bite a lot of people, and you have explained it in a very clear and concise manner. This helps a lot.
Hi,
It’s now Q4 2016. With VMware vDS + LACP, can we see 1+1 = 2, or 1.99?
Please & Thanks.