Lots of folks like to claim they have the fastest switch for your application. Hell, some of the folks who don’t have the fastest switch will try to change the argument and say that they’ll never drop a packet with their big-buffer switches (read Arista’s Big Buffer B.S. for more details). In this post, I’m going to go over the tools y’all can use to take these vendors to task and prove whether they truly meet your needs.
Let’s start off with some basic buffering concepts in switching ASICs.
Shared vs. Segregated Buffers
Each ASIC supports a finite set of ports/interfaces (SerDes). In addition, the time it takes to serialize/deserialize, route, switch, or inspect a given packet is also finite (sticking with a fixed-pipeline architecture here).
Thus, when a device receives more packets than it can handle in a given time interval (we’re talking nanoseconds here), the ASIC has two options: drop the packet, or store it for a later time interval, a.k.a. buffer the packet.
In some architectures, the ASIC has one large contiguous area of memory that is used for all ports, and it is up to the NOS to allocate that memory space to the ports. The NOS can do this either by statically assigning memory to each port at initialization or by allocating it dynamically. The key here is that a given port can address anywhere in this memory space. An example of this type of architecture is Mellanox’s Spectrum ASIC. Mellanox Spectrum == Shared.
In other architectures, the ASIC has a large contiguous area of memory that is used for all ports but is segregated into what are called port groups (typically groups of 4 or 8 ports). A port in one port group cannot access memory that belongs to another port group, meaning that if one port group’s buffer is maxed out, no more data from any of its ports can be buffered until the existing buffer drains. An example of this type of architecture is Broadcom’s Trident line. Broadcom Trident == Segregated.
There’s even a hybrid approach that some ASICs have started taking, where the NOS is given more control to determine buffer size per port group, along with other tuning parameters, typically called profiles. Ultimately these ASICs behave more like the segregated model than the shared one. An example of this type of architecture is Broadcom’s Tomahawk ASIC. Broadcom Tomahawk == Segregated.
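To make the shared-vs-segregated distinction concrete, here’s a toy Python model (my sketch, not vendor code; the class names, the 16-port/4-per-group layout, and the admit() method are all illustrative) showing why a segregated ASIC can drop a packet while plenty of total buffer memory sits free:

```python
class SharedBuffer:
    """Shared model (e.g. Mellanox Spectrum): any port may claim any free memory."""
    def __init__(self, total_bytes):
        self.free = total_bytes

    def admit(self, port, pkt_bytes):
        if pkt_bytes <= self.free:
            self.free -= pkt_bytes
            return True       # buffered
        return False          # dropped


class SegregatedBuffer:
    """Segregated model (e.g. Broadcom Trident): memory is carved into port
    groups; a port can only buffer into its own group's slice."""
    def __init__(self, total_bytes, ports, group_size=4):
        groups = len(ports) // group_size
        self.free = {g: total_bytes // groups for g in range(groups)}
        self.group_of = {p: i // group_size for i, p in enumerate(ports)}

    def admit(self, port, pkt_bytes):
        g = self.group_of[port]
        if pkt_bytes <= self.free[g]:
            self.free[g] -= pkt_bytes
            return True       # buffered
        return False          # dropped, even if other groups have free memory


shared = SharedBuffer(total_bytes=16 * 1024 * 1024)
seg = SegregatedBuffer(total_bytes=16 * 1024 * 1024,
                       ports=[f"swp{i}" for i in range(16)])
# Fill swp0's group, then watch swp1 (same group) drop while swp5 still buffers:
while seg.admit("swp0", 9000):
    pass
print(seg.admit("swp1", 9000))     # False: group 0 is exhausted
print(seg.admit("swp5", 9000))     # True: group 1 still has memory
print(shared.admit("swp1", 9000))  # True: a shared pool doesn't care which port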
Choose Your Weapon
Now that we’ve discussed ASIC buffering in broad strokes, let’s talk about how a few NOSes expose how to monitor port buffers.
Cisco Active Buffer Monitoring
This feature in NX-OS is paired with a hardware component called the Algo Boost Engine “that collects histogram of counters for unicast buffer usage per individual port, total buffer usage per buffer block, and multicast buffer usage per buffer block.” (citation)
These histograms feature a fixed spread of 18 bins, a configurable sampling interval between 10ns and 20ms, and a reporting period of 1s. To rephrase: every second, a histogram is updated with anywhere from 50 (at 20ms sampling) to 100 million (at 10ns) measurements spread across 18 bins. The system maintains the last 70 minutes’ worth of histograms. There’s even the ability to trigger a log message if a set threshold is reached.
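As a mental model, here’s a rough Python sketch of what that collection loop amounts to (a software approximation of what the Algo Boost Engine does in hardware; read_buffer_occupancy() is a made-up stand-in for the hardware read, and I’m sampling at the slow 20ms end since no software loop can touch 10ns):

```python
import random
import time
from collections import deque

NUM_BINS = 18                    # fixed by the hardware
BIN_WIDTH = 384 * 1024           # 384KB per bin, also fixed
SAMPLE_INTERVAL = 0.02           # 20ms here; the ASIC can go down to 10ns
REPORT_PERIOD = 1.0              # one histogram published per second
HISTORY = deque(maxlen=70 * 60)  # the last 70 minutes of 1s histograms

def read_buffer_occupancy(port):
    """Made-up stand-in for the hardware read: occupancy in bytes (simulated)."""
    return random.randint(0, NUM_BINS * BIN_WIDTH)

def collect(port, threshold_bin=None):
    """Runs forever: bin every sample, publish one histogram per second."""
    while True:
        hist = [0] * NUM_BINS
        deadline = time.monotonic() + REPORT_PERIOD
        while time.monotonic() < deadline:
            occ = read_buffer_occupancy(port)
            hist[min(occ // BIN_WIDTH, NUM_BINS - 1)] += 1
            time.sleep(SAMPLE_INTERVAL)
        HISTORY.append(hist)
        if threshold_bin is not None and any(hist[threshold_bin:]):
            print(f"{port}: occupancy crossed bin {threshold_bin}")  # NX-OS logs instead
```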
On the surface, this feature looks promising for the general case in small-to-medium enterprise networks, but look closer and issues start to appear. The main one is the histogram parameters: a fixed set of 18 bins of 384KB each. With bin resolution that coarse and no way to adjust it, the high-resolution 10ns sampling interval is pointless, especially in High Frequency Trading (HFT) networks: a microburst of a handful of small packets never moves a sample out of the first bin.
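The arithmetic (mine, not Cisco’s) shows just how mismatched those fixed bins are with 10ns sampling:

```python
BIN = 384 * 1024               # bytes per bin

print(BIN // 64)               # 6144 minimum-size packets fit inside ONE bin
print(BIN // 1500)             # ~262 full-size packets per bin
print(BIN * 8 / 10e9 / 10e-9)  # ~31,000 ten-ns sampling intervals pass while a
                               # single bin's worth of data drains at 10Gbps, yet
                               # any change smaller than 384KB never registers
```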
Arista LANZ (Latency Analyzer)
This feature in EOS learned the lessons of Active Buffer Monitoring and addressed them (sort of). The first difference is that Arista uses the hardware already available in the ASIC, which means LANZ has different limitations depending on which switch you’re on.
For the purposes of this post, I’m going to stick with the more “capable” variant. LANZ doesn’t use histograms to report buffer state; instead it reports queue length in 480B increments, fine enough to detect the occupancy of just one or two small packets in a given buffer, with a sampling period of “a few 100ns.” (citation)
What does “a few 100ns” mean? It could mean 300ns, 900us, or 1ms; I have no idea, and clarification from Arista is elusive. (My guess is they’re being vague because of the variability of the underlying ASIC and its capabilities.) Given this information, LANZ is only able to report that an interface had a queue length of X bytes with an average latency of Y usecs for a duration of Z usecs. LANZ lets the user set up to two thresholds (high and low) that can be made actionable (syslog messages, data collection, additional reporting, etc.).
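Here’s a little Python model of what that reporting amounts to (my sketch; read_queue_len() and the threshold values are hypothetical, and real LANZ does this in hardware at sub-microsecond granularity, not in a Python loop):

```python
import random
import time

SEGMENT = 480  # LANZ measures queue length in 480-byte increments

def lanz_monitor(read_queue_len, high=10 * SEGMENT, low=2 * SEGMENT,
                 sample_period=0.001):
    """Watch a queue; report an event when it crosses the high threshold
    and hold the event open until it falls back below the low threshold."""
    congested_since, peak = None, 0
    while True:
        qlen = read_queue_len()
        if congested_since is None and qlen >= high:
            congested_since, peak = time.monotonic(), qlen   # event begins
        elif congested_since is not None:
            peak = max(peak, qlen)
            if qlen <= low:                                  # event ends
                duration = time.monotonic() - congested_since
                # i.e. "queue length of X bytes for a duration of Z usecs"
                print(f"peak={peak}B ({peak // SEGMENT} segments), "
                      f"duration={duration * 1e6:.0f}us")
                congested_since, peak = None, 0
        time.sleep(sample_period)

# Simulated queue, for demo purposes only:
lanz_monitor(lambda: random.randint(0, 20 * SEGMENT))
```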
This feature definitely makes sense for folks who want (sub)microsecond reporting of when a port’s queue length exceeds a given amount (which is exactly what micro-burst monitoring needs). However, it doesn’t answer any questions about how queue occupancy is distributed on a particular port over time, which is the general case that Cisco’s Active Buffer Monitoring handles well.
Cumulus Linux Buffer Monitoring on Mellanox Spectrum
This new feature in Cumulus Linux showed up in 3.3 for Mellanox Spectrum only. Hopefully, Broadcom support is not too far behind (crossing my fingers that Broadcom’s ASICs and SDK have similar capabilities to Mellanox Spectrum…but I’m definitely not going to hold my breath 😉 ).
Mellanox Spectrum provides a mechanism to monitor port queues using histograms with configurable bin sizes (a max of 10 bins) representing queue lengths, a configurable sampling interval (1024ns at the max), and a configurable reporting interval (1 second to 7 days), with the ability to trigger additional actions (syslog messages, data collection, etc.). (citation)
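Modeled in Python, the knobs look something like this (an illustrative sketch only; the class and field names are mine, not the actual Cumulus Linux configuration or the Mellanox SDK):

```python
from dataclasses import dataclass, field

@dataclass
class HistogramProfile:
    """Toy model of a configurable queue-depth histogram."""
    bin_edges: list                 # user-chosen byte boundaries between bins
    sample_interval_ns: int = 1024  # hardware maximum per the docs
    report_interval_s: int = 1      # anywhere from 1 second to 7 days
    counts: list = field(default_factory=list)

    def __post_init__(self):
        assert len(self.bin_edges) + 1 <= 10, "Spectrum allows at most 10 bins"
        self.counts = [0] * (len(self.bin_edges) + 1)

    def sample(self, queue_len_bytes):
        """Drop one queue-depth sample into the right bin."""
        for i, edge in enumerate(self.bin_edges):
            if queue_len_bytes < edge:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

# An HFT shop can pack the bins down low, where single packets matter...
hft = HistogramProfile(bin_edges=[1500, 3000, 6000, 12000])
# ...while an enterprise can spread them across the whole buffer:
ent = HistogramProfile(bin_edges=[64_000, 256_000, 1_024_000, 4_096_000])
```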
One way to think of the Cumulus/Mellanox solution is that it’s the Goldilocks of the three. It gives the user the ability to adjust bin size, sampling interval, and reporting period, which matters to folks with latency-sensitive applications such as HFT, while still being useful for the general enterprise case.
That’s it for now; y’all are now dangerous enough to take your sales reps and their sales engineers to task when they’re trying to push a given solution.