The Scaling Limitations of Etherchannel -Or- Why 1+1 Does Not Equal 2

Ethan Banks November 27, 2010

[Image: a dam, via Wikipedia]

This article is imported from packetattack.org, a technical blog I maintained before planting my flag at packetpushers.net. I’ll be moving the most popular blog posts from packetattack.org to packetpushers.net in the coming weeks.

Slaying SPOFs

Some of you know I took on a new job earlier this year, where the challenge was (and is) to transform a globally distributed network for a growing company into an enterprise class operation. A major focus area has been eliminating single points of failure (SPOFs): single links, single routers, single firewalls, etc. If it can break and consequently interrupt traffic flow, part of my job is to design around the SPOF within the constraints of a finite budget.

The network documentation I inherited ranged from “mostly right but vague and outdated” to “a complete and utter fantasy requiring mind-altering substances to make sense of”. Ergo, untrustworthy to the point of being useless beyond perhaps slideware to show a particularly dim collection of simians. I have therefore been doing complete network explorations, building new documentation as I go.

To my horror, I one day discovered an egregious SPOF, where a single, fragile piece of CAT5 provided the sole physical path between two major concentrations of network activity. If that link ran into any trouble, an entire room containing hundreds of physical and virtual servers (and their storage) would have been cut off from the rest of the company.

To eliminate the physical path SPOF, the easy choice was to transform the single link into an etherchannel. This I did; the single 1Gbps link became a 4x1Gbps etherchannel plumbed back to one core switch. For good measure, I added a second 4x1Gbps etherchannel, plumbed to a second core switch. Spanning-tree roots had already been established such that even-numbered VLANs would traverse one of the 4x1Gbps etherchannels, and odd-numbered VLANs the other…which you can read more about here if interested.
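
For the curious, here’s a minimal IOS sketch of that sort of bundle. The interface names, channel-group number, and choice of LACP are hypothetical placeholders, not my actual production config:

    ! Hypothetical 4x1Gbps etherchannel to one core switch, bundled with LACP.
    ! Some platforms also require "switchport trunk encapsulation dot1q"
    ! before "switchport mode trunk".
    interface range GigabitEthernet1/0/1 - 4
     description Member of 4x1Gbps bundle to core switch 1
     switchport mode trunk
     channel-group 1 mode active
    !
    interface Port-channel1
     description 4x1Gbps etherchannel to core switch 1
     switchport mode trunk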

All should now be sweetness and light, right? A 1Gbps SPOF (and probable bottleneck) was transformed into a load-distributed pair of 4x1Gbps etherchannels, and hey, if they weren’t complaining about the 1Gbps link before, they ought to be blissfully happy now!

Mad Maths

Enter the scaling problem: when it comes to etherchannel, 1+1 does not equal 2.

The reason adding more physical links does not proportionally grow your available bandwidth is that your friendly neighborhood Cisco switch does not load-balance across etherchannel members frame by frame. You might assume that frame #1 gets sent down etherchannel member #1, frame #2 down etherchannel member #2, and so on in round-robin fashion. Reality is rather different: what the switch actually does is math. The sort of math varies with the capabilities of the switch and with what you have configured.

Commonly available etherchannel load-balancing methods include source and destination MACs, source and destination IPs, and (my personal favorite) source and destination layer 4 ports. To determine which etherchannel member will be used to forward a frame, the switch performs mad maths on those fields (on many Catalyst platforms, an XOR of the low-order bits of those fields) based on the load-balancing method you’ve selected. The practical upshot is that the same conversation is always going to be forwarded across the same etherchannel member, because the math always works out the same.
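
On most Catalyst switches, the method is set globally, something like the sketch below. Keyword availability varies by platform and IOS version; src-dst-port, for instance, isn’t offered on every model:

    ! Hash on layer 4 source and destination ports (global command; it
    ! applies to every etherchannel on the switch).
    port-channel load-balance src-dst-port

    ! Verify which method is in effect:
    show etherchannel load-balance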

This behavior can impact the network. Imagine backup server BEAST, with enough horsepower to fill a 1Gbps link, running a restore operation to server NEEDY. BEAST and NEEDY are uplinked to different switches interconnected by an etherchannel. As the restore runs, each frame is hashed by the switch to determine which etherchannel member to forward across; the math works out the same for every frame, meaning the entire conversation between BEAST and NEEDY is forwarded across the same etherchannel member. The result is rather like the picture above: one member that’s crushed, while the other members lie comparatively idle.
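
Some Catalyst platforms will even tell you which member a given flow hashes to, which makes the behavior easy to demonstrate. A sketch, using made-up IP addresses to stand in for BEAST and NEEDY:

    ! Predict the hash result for a hypothetical BEAST-to-NEEDY flow.
    ! Command availability and exact output vary by platform; typical
    ! output is along the lines of "Would select Gi1/0/2 of Po1".
    test etherchannel load-balance interface port-channel 1 ip 10.1.1.10 10.1.2.20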

Congestion Indignities

The switch is not sensitive to an etherchannel member getting crushed; it just keeps on doing mad maths. Some other conversations heading across the bundle will happen to hash to the same member that the BEAST-NEEDY restore operation is using, and those unfortunate conversations suffer the indignities of link congestion: dropped frames and increased latency. The real-world experience is that certain applications act slow or throw errors. Storage could dismount. Monitoring applications get upset as thresholds are exceeded.

Yuck.

Of course, it’s now up to the network engineer (you) to discover why the alarms are going off, track down the offending traffic flow (you are modeling your interswitch links, right?), and figure out what is to be done about it. In my experience, you won’t have much luck explaining what’s happening to non-network people. I’ve had a hard time explaining that 1+1 doesn’t equal 2 (or that 1+1+1+1 doesn’t equal 4). You don’t really have a 2Gbps or 4Gbps link just because you’ve built a fancy etherchannel. You’ve really got multiple parallel 1Gbps links, any one of which can still get congested in a BEAST-NEEDY scenario.
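
When those alarms do go off, per-member counters are the quickest way to confirm the imbalance. A sketch, again with hypothetical interface numbering:

    ! Shorten the load-averaging window on the members so spikes show up fast.
    interface range GigabitEthernet1/0/1 - 4
     load-interval 30

    ! Then compare per-member rates; one hot member and three idle ones is
    ! the classic signature of a single elephant flow hashed to one link.
    show interfaces GigabitEthernet1/0/1 | include rate
    show etherchannel 1 port-channel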

So Fix It, Network Guy

There are a few ways to tackle the challenge of 1+1 not equaling 2.

  1. Learn your traffic patterns. See if you can group heavy hitters into the same switch. That’s a pretty old-school way to go after the problem, and it won’t scale to large data center deployments. But you can find wins in this approach from time to time.
  2. Build a dedicated link. By this, I mean a link dedicated to just the traffic that’s giving the interswitch etherchannel all the heartburn. If your etherchannel is a trunk carrying a whole bunch of VLANs, you could build a parallel link that carries traffic for just a problem VLAN, while pruning that VLAN off of the etherchannel trunk (see the first sketch after this list). Might help, might not, depending on your situation…and of course, it’s a “one-off” fix, not necessarily a scalable solution. Some shops build networks dedicated to storage or to backup, and plumb specific interfaces on hosts into these specific networks for exactly this reason. There are increased costs in hardware, cabling, and complexity to make it happen, though.
  3. Add even more 1Gbps links to the etherchannel. This is not terribly practical. At the end of the day, you still have a potential bottleneck, but at least you’ve decreased the number of conversations that are likely to get hashed to a congested link.
  4. Replace the 1Gbps links with 10Gbps links. Increasing bandwidth is always an option. The jump to 10Gbps is a tough one, though: new switch hardware, higher power requirements, and likely new cabling will be required. And don’t forget to break out your checkbook.
  5. Apply QoS. If you have known offenders or predictable traffic patterns, you can write a QoS scheme to help manage the congestion (see the second sketch after this list). I tend to pump traffic like this through a traffic shaper, but there are other approaches, such as guaranteeing minimum bandwidth to important traffic while dumping the link beast into the scavenger class. I have found that latency still tends to suffer under a guaranteed-minimum-bandwidth (CBWFQ) scheme; I have had the best luck with shaping.
  6. Tweak the beastly application. It’s not uncommon for certain applications to have a built-in throttle, so that you can cap network utilization right at the app. Talk to your system engineer and see…I’ve heard they’re people, too.
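
Here’s the sketch for option 2, assuming a hypothetical problem VLAN 200 carrying the backup traffic; the VLAN and interface numbers are placeholders:

    ! Prune the problem VLAN off the etherchannel trunk...
    interface Port-channel1
     switchport trunk allowed vlan remove 200
    !
    ! ...and carry it on a dedicated parallel link instead.
    interface GigabitEthernet1/0/5
     description Dedicated link for backup VLAN 200
     switchport mode trunk
     switchport trunk allowed vlan 200

And here’s the sketch for option 5, shaping a known offender with MQC. The ACL, names, and rate are hypothetical, and note that many fixed-configuration switches won’t accept an output service policy on a port-channel; on those platforms you’d apply QoS to the physical members or use their platform-specific mechanisms instead:

    ! Match the offending conversation (addresses are placeholders).
    ip access-list extended BACKUP-TRAFFIC
     permit ip host 10.1.1.10 host 10.1.2.20
    !
    class-map match-all BACKUP
     match access-group name BACKUP-TRAFFIC
    !
    ! Shape the offender to roughly 400Mbps, leaving headroom on the
    ! member link for everything else hashed onto it.
    policy-map INTERSWITCH-OUT
     class BACKUP
      shape average 400000000
    !
    interface Port-channel1
     service-policy output INTERSWITCH-OUT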

About Ethan Banks

Co-founder of Packet Pushers Interactive. Writer, podcaster, and speaker covering enterprise IT. Deep nerdening for hands-on professionals. Find out more at ethancbanks.com/about.

Comments

  1. Brandon Kim says

    November 27, 2010 at 4:06 pm

    Pretty timely post, and I agree wholeheartedly about the etherchannel. I think some people tend to forget that although it’s better than nothing, you aren’t exactly getting double the throughput. You gain port redundancy and give your traffic more options, but you can never go beyond what a single physical interface can handle.

    The downside is that if you are tight on ports, it will eat up even more of the available ones….

  2. chrismarget says

    November 27, 2010 at 5:07 pm

    Nice photo choice 🙂

  3. Tom says

    November 27, 2010 at 5:43 pm

    Hi,

    Do you have any further information about how the algorithm works? It would be very useful for future designs.

    Regards
    Tom

    • Kurt Bales says

      November 27, 2010 at 8:27 pm

      Hey Tom,

      Best reference I have found on DocCD is:

      http://www.cisco.com/en/US/tech/tk389/tk213/technologies_tech_note09186a0080094714.shtml .

      The whole etherchannel section has quite a bit of useful information. Check it out at:

      http://www.cisco.com/en/US/tech/tk389/tk213/tsd_technology_support_protocol_home.html

      Hope that helps.

      Kurt (@networkjanitor)

      • chrismarget says

        November 29, 2010 at 10:25 am

        Whatever header fields you choose for load balancing (src/dst of mac/ip/L4 port), outgoing frames are always hashed into 8 buckets.

        The 8 hash buckets are distributed among the link members:

        8 links in the aggregation? Each link is assigned a bucket.
        7 links? One link handles two buckets, the rest handle only one.
        6 links? Two links handle two buckets, the rest handle only one.
        …
        1 link? It handles all 8 buckets.

        Which buckets are assigned to each link? Check the “load” parameter of ‘sho int po x etherchannel’ or ‘sho etherchannel detail’

        Take the value of each link. Turn it into binary. An aggregation here shows the following “load” values for its 6 links.
        0x41 = 01000001 : bucket 1 and 7
        0x02 = 00000010 : bucket 2
        0x04 = 00000100 : bucket 3
        0x88 = 10001000 : bucket 4 and 8
        0x10 = 00010000 : bucket 5
        0x20 = 00100000 : bucket 6

        Which traffic gets assigned to each bucket?

        I’ve seen various unsatisfying explanations of the algorithm, but have not been able to nail it down to the point that I could predict which link will be chosen for a given packet. It probably doesn’t matter, because link failures will move things around on you. Look into ‘lacp max-bundle’ if that’s a concern. Note also that re-assigning links to buckets takes time, and that packets belonging to a given flow may become misordered during bucket redistribution.

        • Manish says

          November 30, 2010 at 11:22 pm

          Hi,

          It’s interesting to note that Brocade’s latest VCS technology seems to have solved this problem by enabling frame-by-frame distribution, which eliminates the typical issues with hash-based EtherChannel. http://community.brocade.com/message/14481#14481

          • Jeremy says

            December 12, 2010 at 9:26 am

            I’m curious how Brocade is doing frame-by-frame. The IEEE High Speed Ethernet Study Group paper http://grouper.ieee.org/groups/802/3/hssg/public/apr07/frazier_01_0407.pdf implies that it’s risky to do so:

            “All packets associated with a given ‘conversation’ are transmitted on the same link to prevent mis-ordering”

            All the other 802.3ad / Etherchannel documentation I’ve seen clearly indicates that once a load-balancing algorithm is chosen, any given TCP session always rides over the same link, that is, 1 + 1 + 1 + … always = 1. The benefit of link aggregation, as Ethan and commenters indicate, is fault tolerance and a larger pool of links for the LB algorithm to pick among.

          • Ethan Banks says

            December 12, 2010 at 11:42 am

            If Brocade is choosing to do frame-by-frame, then I think the risk is inherent that some packets arrive out of order. In that the speeds and latencies of parallel etherchannel links are presumed to be identical, I think that risk is largely mitigated. Even during congestion, all links would be equally congested in a frame-by-frame scenario, and so out-of-order packet risk should be small.

            Good question for a Brocade engineer – whether they have put specific technical means in place to cope with this issue (i.e. monitoring of flows and buffering to ensure ordered packets), or if they are just hoping it all works out based on the assumption that it won’t happen all that often.

          • matt says

            December 17, 2010 at 12:46 pm

            TCP and many UDP-based apps handle out-of-order delivery by design and definition. So what’s so damn bad about reordered packets that all the switch makers (who should know better) run in fear of them?

          • Ethan Banks says

            December 17, 2010 at 3:07 pm

            It places a burden on the receiving system to re-order them. The results can range from “who cares” to notable CPU utilization. Worst case, an out-of-order packet is deemed to be missing, and an entire TCP group must be retransmitted. The reasons are similar to why fragmentation is to be avoided when possible. It’s not that it can’t be coped with, just that it’s not desirable.

          • david hanson says

            December 28, 2010 at 5:05 pm

            Who cares how they arrive – the processing engine has to assume they are out of order 100% of the time anyway and put them back in place … even if that means putting them back in the same sequence.

            The CPU should never see the load; it should be done in silicon, with perhaps an exceptionally minor delay if there really is an out-of-order packet stream.

            Cisco allows round-robin packet delivery in their fibre equipment – they should do so in all others as well, so that 1+1 does equal 2.

            It’s only code, and I daresay – with no experience in the coding of TCP packet reassembly – that it would be easier than trying to figure out which hash bucket to put a frame into. Just send them in a 1,2,3,4,5,1,2,3,4,5,1,2,3,4,5 sequence.

          • Ethan Banks says

            December 29, 2010 at 8:39 am

            My assumption is that there is a very good technical reason round-robin frame delivery isn’t being done across aggregated links carrying IP traffic, and not because it just never occurred to the engineering teams who code these things. Why do I think that? Round-robin is the painfully obvious first choice. Since round-robin is not even an option you can pick in IOS under duress (like you can with CEF), my guess is that somebody knows something we don’t. More to the story than out-of-order IP packets? Maybe.

          • Matt Buford says

            February 4, 2011 at 2:51 am

            Out of order packet delivery will technically work, but (depending on how out of order things get) it won’t necessarily work well.

            An extreme case: Many years ago, I had both an ISDN (128k) and an IDSL (144k) connection from the ISP I ran to my house. One day, for fun and the glory of faster Internet, I set up equal cost round robin packet delivery across these links. Because the links had different latency and different link speeds, this resulted in significant out of order packet delivery. I learned a few things.

            PPTP has packet sequence numbers and simply DROPS out of order packets. This made my VPN experience terrible.

            Realplayer was unusable, presumably for the same reason PPTP was so terrible.

            TCP downloads WERE faster than a single link, but their behavior was interesting. There were many duplicate packets. You might think I’d get 128k + 144k speed. Or, you might think I get 128k*2 speed.

            Instead, what I saw was that 128k*2 bandwidth was used on the links in order to provide me with 128k*1.5 (192k) of throughput. The other 0.5 of a link (64k) was wasted bandwidth by duplicated transmissions.

            Without SACK, I believe out of order packets would cause duplicate acks and trigger a retransmit. It seems to me that SACK should significantly help this situation, but I’m not sure offhand how bad it would be.

            My example was extreme with significant reordering. When aggregating slower links into a higher speed multi-link backbone link, statistically speaking out of order packets are not likely to happen very often. But when they do happen, they might cause retransmits or they might cause some effective packetloss for some protocols or they might cause other bad things. Most things will work most of the time. But is this good enough?

            The moral of the story: Just because something can accept out of order packets doesn’t mean it will be done efficiently. Out of order packets can cause unexpected issues in unexpected applications.

            Imagine if you were an end user and your ISP was causing some reordering. You can download fine. Your once a second pings show 0% packetloss. But inside your VPN you see 1% packetloss. Good luck trying to find the cause from your limited end-user point of view, and even if you figure it out then good luck getting your ISP to fix it.

        • phoose says

          May 2, 2011 at 2:58 pm

          Based upon the subject, I thought this post was related to the above issue, in which traffic can be unevenly balanced even if you get the distributions right with IP/TCP balancing. I.e., you have 8 links and all is happy; you drop to 7 and the load becomes uneven because of the bucket assignment decision (“BAD” :)).

  4. SmokinJoe says

    January 15, 2011 at 4:48 pm

    I did etherchannel in a lab some years ago with a few 3550s and it worked fine, bundling up to 4 interfaces and getting 400Mbps between either switches or servers. I was floored, but recalled that Kalpana invented etherchannel and was purchased by Cisco for a reason. I don’t recall whether I used ISL or dot1q, but I know I had an early 12.x IOS and it was an enterprise image. I also recall using Intel’s ProSet software, which supported PAgP. I think Intel has since dropped PAgP support and just supports static “mode on”.

  5. Andy says

    October 2, 2011 at 6:33 pm

    Great article Ethan. While I am not a network engineer, I have seen this design misconception bite a lot of people, and you have explained it in a very clear and concise manner. This helps a lot.

  6. fbifido (@fbifido) says

    November 18, 2016 at 2:13 am

    Hi,

    It’s now Q4 2016. With VMware vDS + LACP, can we see 1+1 = 2, or 1.99?

    Please & Thanks.
