This guest blog post is by Kevin Deierling, VP of marketing at Mellanox Technologies. We thank Mellanox for being a sponsor.
Many Ethernet switch vendors stress the need for large buffers in switches to achieve optimal network performance. In fact, enormous buffers are needed not to achieve optimal performance, but to accommodate architectural flaws in the switch ASICs used by these systems.
Instead of improving performance, these big, bloated buffers actually lead to increased latency and reduced performance.
To justify enormous buffers, vendors invoke the idea of accommodating temporary fluctuations or “microburst” traffic. Interestingly, Arista published a whitepaper debunking the myth that large buffers are needed to accommodate microbursts.
This paper correctly identifies that large buffers are required not because of microbursts, but because “legacy designs” did not have sufficient switching capacity to support full wire-speed switching. The motivation for buffering was not to overcome microbursts, but rather to overcome the performance limitations of legacy designs.
Unfortunately these performance limitations have persisted and indeed have actually gotten worse, even in so-called “high-performance” switch ASICs. For example, the majority of switches have been demonstrated to be unable to switch small packets at full line rate without losing packets (www.zeropacketloss.com).
Switches based on the Broadcom Trident or Tomahawk ASICs, for instance, not only exhibit severe packet loss for minimum-sized packets (roughly 30 percent of frames dropped), but also drop around 4 percent of packets with a typical Internet mix (IMIX) of large and small packets.
This level of packet loss absolutely kills performance. So some system vendors have proposed big bloated buffers as the solution.
In fact, relying on buffer bloat to patch the problem of poorly performing switch ASICs not only increases costs dramatically but actually increases latency and degrades application performance. An optimized switch data plane is a far superior solution and delivers improved price/performance across the board.
Two Types of Packet Loss In Lossy Networks: Unavoidable And Inexcusable
In a traditional Ethernet network there are only two[¹] reasons that packets are dropped at an appreciable rate:
- Unavoidable – Incast microbursts: congestion due to network traffic flows that oversubscribe the bandwidth of a given link or endpoint. These traffic patterns are the microburst flows described above.
- Inexcusable – Packet loss due to internal switch packet-forwarding limitations, even under circumstances when the traffic pattern should have been switchable without causing any port contention or oversubscribed links.
The first type of packet loss is an undesirable but unavoidable consequence of transports like TCP/IP, which use dropped packets as a form of implicit congestion notification.
However, there is absolutely no excuse for the second type of packet loss. In this case the ASIC itself is the cause of the problem, as it is unable to forward small packets at full wire speed because of internal crossbar or forwarding table lookup limitations. Such sub-line-rate ASICs may resort to larger buffers in order to hide these forwarding limitations of the internal crossbar switch.
However, if the ASIC is unable to switch at full line rate, then no buffer size is large enough to prevent eventual packet loss under conditions of sustained small-packet traffic.
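A quick back-of-the-envelope calculation shows how demanding full wire speed is for small packets. This sketch (not tied to any particular ASIC) computes the packets-per-second a single port must sustain for minimum-size Ethernet frames, counting the standard on-wire overhead:

```python
# Packets-per-second a switch port must sustain to forward
# minimum-size Ethernet frames at full line rate.

def line_rate_pps(link_gbps: float, frame_bytes: int = 64) -> float:
    # On the wire, each frame also carries 8 bytes of preamble/SFD
    # and a 12-byte inter-frame gap.
    wire_bits = (frame_bytes + 8 + 12) * 8
    return link_gbps * 1e9 / wire_bits

for gbps in (10, 40, 100):
    print(f"{gbps} Gb/s: {line_rate_pps(gbps) / 1e6:.1f} Mpps")
# 10 Gb/s: 14.9 Mpps, 40 Gb/s: 59.5 Mpps, 100 Gb/s: 148.8 Mpps
```

An ASIC whose internal crossbar or lookup pipeline falls short of these rates will queue, and eventually drop, small packets regardless of buffer size.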
Furthermore, making the packet buffers larger to accommodate internal switch device limitations actually increases forwarding latency and degrades network performance. A new packet entering the switch goes to the back of the line for its output port and class of service level, and every packet ahead of it in the buffer must be transmitted before the new packet can be.
So big, full buffers increase latency dramatically. Instead of worrying about the store-and-forward latency of jumbo frames on a sub-line-rate ASIC, the network engineer needs to worry about the packet latency caused by megabyte packet buffers (1.8 µs per jumbo frame vs. 200 µs per MB at 40 Gb/s).
Some switches boast of gigabytes of buffering; at 40 Gb/s, however, that works out to 200 ms of queuing delay per GB of buffer. This level of buffering is entirely useless, as senders would interpret delays of this magnitude as lost packets and retransmit, thereby exacerbating the congestion problem. So Big, Full Buffers (BFB) mean LATENCY with a capital 'L' and are really a bad idea.
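These latency figures follow directly from the drain rate of a full buffer. A short sketch, using the same 40 Gb/s link speed assumed in the text:

```python
# Worst-case queuing delay a newly arrived packet sees when a buffer
# of a given size is already full and draining at line rate.

def drain_latency_s(buffer_bytes: float, link_gbps: float) -> float:
    return buffer_bytes * 8 / (link_gbps * 1e9)

LINK_GBPS = 40
for label, size_bytes in [("9 KB jumbo frame", 9e3),
                          ("1 MB of buffer", 1e6),
                          ("1 GB of buffer", 1e9)]:
    us = drain_latency_s(size_bytes, LINK_GBPS) * 1e6
    print(f"{label}: {us:.1f} us")
# 9 KB jumbo frame: 1.8 us; 1 MB: 200.0 us; 1 GB: 200000.0 us (200 ms)
```

The 1 GB case lands in typical TCP retransmission-timeout territory, which is exactly why senders treat such delays as loss.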
Non-Blocking Switches Critical
It is critical to use zero-packet-loss switches and avoid sub-line-rate ASICs. ASICs with internal packet-forwarding rate limitations have the unfortunate consequence that their buffers can fill to near capacity even when there are no incast microbursts to cause congestion!
With standard Internet mix traffic, a blocking ASIC ultimately fills up these big buffers, which means packets have to navigate all the way through this traffic jam before finally getting forwarded. Instead of worrying about the latency of a 9K jumbo frame, now you need to start thinking about latency caused by megabytes or even gigabytes of buffering!
With reasonably sized buffers (a shared buffer on the order of 10 Mbytes) the latency is manageable and it is possible to achieve good throughput. However, for very large buffers it may be desirable to artificially set the threshold for congestion marking or pause frame generation at a much lower buffer level to avoid the increased latency.
This effectively causes the switch ASIC to operate as if it had smaller buffers than it actually does. It is ironic that these large buffers, which you have to pay for and which were originally included to overcome internal switch ASIC limitations, are frequently not used at all because of the undesired increase in latency.
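To sketch why a low marking threshold caps latency, consider hypothetical numbers (the 256 MB buffer and 2 MB threshold below are illustrative, not taken from any specific product): once marking or pause generation kicks in at the threshold, the queue rarely grows past it, so worst-case latency is bounded by the threshold rather than by the physical buffer.

```python
# Worst-case queue latency at a given queue depth, draining at line rate.

def queue_latency_us(queue_bytes: float, link_gbps: float) -> float:
    return queue_bytes * 8 / (link_gbps * 1e9) * 1e6

LINK_GBPS = 100
BUFFER_BYTES = 256e6     # hypothetical deep-buffer switch
THRESHOLD_BYTES = 2e6    # hypothetical ECN-marking / pause threshold

print(queue_latency_us(BUFFER_BYTES, LINK_GBPS))     # ~20480 us if the buffer fills
print(queue_latency_us(THRESHOLD_BYTES, LINK_GBPS))  # ~160 us, capped by the threshold
```

In this sketch the operator pays for 256 MB of buffer but deliberately uses less than 1 percent of it, which is the irony described above.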
The importance of a zero packet loss internal crossbar, low-latency cut-through switching, and an efficient, fully shared buffer is clearly illustrated by the Mellanox Spectrum 100 Gb/s switch ASIC.
The Tolly Performance Report demonstrates the fundamental advantages of Mellanox Spectrum-based switches over switches based on Broadcom Tomahawk silicon. With Spectrum-based switches, the on-chip buffering accommodates large incast flows, provides fairness, and supports RFC2544 full wire speed switching for all packet sizes. So with Spectrum there is no reason to incur the penalties and increased cost and latency of big, bloated buffers.
Sub-line-rate ASICs can significantly degrade network performance due to unnecessary packet loss as a result of internal crossbar forwarding limitations. The performance impact will be most severe when using implicit congestion notification and sender-side timeouts, such as those that occur with TCP/IP based transports, including iWARP. Big, bloated buffers do not solve this problem, and in fact result in significant increases in latency.
The choice between packet overflow and moderately sized buffers versus the far worse impact of massive latency from big buffers is in fact a false one. This bad vs. worse choice is due to sub-line-rate ASICs that are not able to sustain full wire speed forwarding of small packets.
In fact, it’s possible to use moderately sized buffers and still avoid packet loss. This is the preferred option, but requires switch ASICs that are capable of forwarding packets at full wire speed across all packet sizes.
So, while big buffers are often a boasting point for switch ASIC vendors, they are also often an attempt to hide internal switch crossbar limitations. These large buffers can actually reduce performance, rather than increase it. The best option is to use an ASIC capable of operating at full line rate across all packet sizes.
¹ Packet corruption due to bit errors at the physical layer is another reason that packet drops can occur. However, packet drops due to these effects are much less frequent than network-induced packet loss, and so are not discussed further here.