It’s been a while since I last posted a sizeable blog post about an issue in our industry, and lately I have become somewhat disenchanted with our presidential election process, to the point where my normal Texan wit has given way to something a little bit more raw. Hopefully, through lots of meditation and sending lots of lead down range for the rest of the year, I may get back to my normal self.
Also, in the interests of full disclosure, I’m the networking co-chair at the Open Compute Project and I work for Cumulus Networks in my day job. The opinions in this post are all mine, though. Enjoy.
Howdy.
This whole big buffer nonsense has gotten out of hand (kinda like Trump and his H-yuge Border Wall or Clinton and her lack of ethics) and it’s time for this issue to be put to rest.
For the last few years, the main messaging from Arista has been that you need big buffers for everything, especially for big data applications. I’m not saying it doesn’t make for a good marketing soundbite; hell, I can imagine Andy Bechtolsheim singing to the beat of Sir Mix A Lot’s Baby Got Back: “I like big buffers and I cannot lie.” (Ok Internet…let’s make this happen.) But it’s a hammer with no nails around. Let me explain.
Background: Arista’s Buffer Paper
This is the position paper (some call it a whitepaper), which appears to be assembled from at least three underlying studies:
- One simulating 20 servers in a rack with one switch
- One simulating 5 racks of 20 servers where each leaf connects to a spine switch using 2 x 40G links in a CLOS network
- One reporting stats from the 7048, Arista’s 1G big-buffer switch
In the first sub-paper, a dated network simulator (NS-2) was used to model 20 servers connected to a switch over 10G links, with a single 40G uplink out of the switch. Each server ran 10 ‘threads’, which yields 200 flows. The simulated switch had a tunable buffer size ranging from 4MB up to 256MB (FYI, current-generation switches have roughly 12 to 16MB of buffer). They ran the simulation many times at various offered loads using TCP Reno (a TCP variant that hasn’t been the norm since the UNIX systems of the eighties/early nineties, such as AT&T UNIX System V, 4.4BSD, and SunOS 4.x/Solaris 2), with each run lasting 10 seconds.
The result was an average throughput of about 200Mb/s with plenty of variance, including packet drops on the simulated 4MB switch and zero packet drops on the 256MB one. Their conclusion: bigger buffers mean no dropped packets.
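To put those buffer sizes in perspective, here’s a quick back-of-envelope calculation of my own (not from the paper): how long a completely full buffer takes to drain through that single 40G uplink, which is the worst-case queuing delay a big buffer can add while it’s busy not dropping packets.

```python
# Back-of-envelope: worst-case queuing delay added by a full buffer draining
# through the single 40G uplink described in the simulation above.

def drain_time_ms(buffer_mb: float, link_gbps: float = 40.0) -> float:
    """Time in milliseconds to drain a full buffer of `buffer_mb` MB at `link_gbps`."""
    buffer_bits = buffer_mb * 8e6                   # MB -> bits (1 MB = 10^6 bytes here)
    return buffer_bits / (link_gbps * 1e9) * 1e3    # seconds -> milliseconds

for size_mb in (4, 16, 256):
    print(f"{size_mb:>3} MB buffer -> up to {drain_time_ms(size_mb):6.2f} ms of added queuing delay")
```

Zero drops at 256MB comes at the price of up to ~51ms of queuing delay per hop; the 4MB case tops out around 0.8ms.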
The second sub-paper started from the conclusions of the first but expanded the topology to 5 racks, with each rack’s leaf switch connecting to a spine switch over 2 x 40G links in a CLOS network.
This time, they used NS-3, the successor to the NS-2 network simulator. They also changed some key parameters, such as setting a fixed 95% load and actually using a more modern TCP variant, New Reno, but they also picked some values (a 200ms retransmission timeout and disabled flow control) that may have been chosen to tilt the results in their favor. Their 10-second simulation produced something similar to the first sub-paper: you need big buffers to prevent dropped packets.
And finally, the third sub-paper used their 7048 1G big-buffer switch to demonstrate real-world effects (not too scientific this time around). The paper concluded that, based on the first two sub-papers, Arista has a switch that will ‘fix’ the dropped-packet problem, which it presents as a big problem for big data applications.
Issues With The Paper
As I mentioned before, there are lots of peculiarities that should be red flags for people. The first is their use of an ancient version of TCP. Why does this matter? Well, here’s a seminal paper from 20 years ago by Sally Floyd. Notice how the performance of TCP Reno compares to the then-state-of-the-art TCP SACK.
To give some color, Linux uses TCP New Reno with SACK (and offers further congestion-control options such as Vegas) to improve retransmission, throughput, and congestion signaling, keeping flows fair and proportional even in the face of resource contention.
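If you want to see that knob for yourself: Linux exposes the congestion-control algorithm both system-wide and per socket. A minimal sketch (Linux-only; which algorithms are available depends on the kernel modules loaded on your box):

```python
import socket

# Linux lets you pick the congestion-control algorithm per socket via
# TCP_CONGESTION; the system-wide default lives in
# /proc/sys/net/ipv4/tcp_congestion_control.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask for classic Reno; this only works if that module is available.
try:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"reno")
except OSError:
    pass  # algorithm not loaded; the kernel default (often cubic) stays in effect

print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))
sock.close()
```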
Another red flag is the failure to tune the TCP stack at the server end, which is exactly what admins of HPC and big data clusters do. Tuning examples include adjusting the socket buffer size on the host, in the application, or on a particular socket.
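As an illustration (a sketch, not a tuning guide): the per-socket knobs look like this on Linux, and the host-wide equivalents live in the net.core.rmem_max/wmem_max and net.ipv4.tcp_rmem/tcp_wmem sysctls.

```python
import socket

# Per-socket buffer tuning -- the kind of thing HPC/big data admins do on the
# host side instead of relying on the switch to absorb every burst.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Request 4 MB send/receive buffers for this socket (the kernel clamps these
# to net.core.wmem_max / net.core.rmem_max, so the host sysctls matter too).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)

print("SO_SNDBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```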
Now if only someone actually took Arista to task and countered their big buffer messaging…oh wait, they did.
Here’s a paper from Cisco (written by Miercom and sponsored by Cisco) showing that big buffers do not help switching performance for big data applications (they actually have the opposite effect); rather, a reasonable buffer size and lower switch latency make for better performance. Check it out for yourselves.
The Arista big-buffer switch actually performed worse in most of the runs. Of course, the Cisco paper has its own issues (surprise, surprise for a sponsored paper). Why couldn’t they just use the same parameters as the third sub-paper from Arista for a pure apples-to-apples comparison? Probably because the results would have been less dramatic.
Lastly, Arista’s assumptions about big data applications seem to be targeted at folks that don’t really run big data applications but rather are big data “curious”. A good example is to compare what TACC does with what a typical Hadoop cluster on Azure or AWS does.
For example, when you’re doing ‘big data’, you are:
- Not using VMs, instead opting for bare metal to reduce the shared resource contention of the CPU, memory/cache, disk IO, and PCI bus
- Not using Ethernet, instead using Infiniband or some other proprietary ultra low latency fabric because data needs to be transmitted with predictable latency and jitter
- Not using Hadoop but instead using MPI, because there’s too much overhead for job scheduling and message passing in Hadoop compared to MPI (a minimal MPI sketch follows this list)
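To make the MPI point concrete, here’s a minimal sketch of MPI-style message passing, assuming mpi4py is installed and the script is launched with something like `mpirun -np 4 python demo.py` (the work being handed out is purely hypothetical):

```python
from mpi4py import MPI  # assumes mpi4py is installed; run under mpirun

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 hands out work directly to the other ranks -- no JVM, no job
    # scheduler, no shuffle phase, just message passing.
    for dest in range(1, comm.Get_size()):
        comm.send({"chunk": dest}, dest=dest, tag=0)
    results = [comm.recv(source=src, tag=1) for src in range(1, comm.Get_size())]
    print("partial sums:", results)
else:
    work = comm.recv(source=0, tag=0)
    comm.send(sum(range(work["chunk"] * 1000)), dest=0, tag=1)
```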
Where Do We Go From Here?
What I’ve outlined isn’t as scientific as I would like (plus, if I did treat it as a scientific endeavor, most of y’all wouldn’t read it 😉 ), but it puts the key flaws in the big-buffer argument in the sunlight for all to see.
So why would a company base products on a flawed argument? Easy: buffers are used as a band-aid to cover up other issues in the network, such as latency. In HPC, the reason people increase buffers on the server/application side is that the CPU is so busy performing computations that there aren’t enough cycles to handle all the I/O interrupts (disk or network) and compute at the same time; the data has to be parked somewhere until the application can get to it.
Since then, techniques such as GSO (Generic Segmentation Offload) have been introduced that greatly reduce this problem: the stack passes large segments down and defers segmentation until as late as possible (or hands it off entirely to NICs that support TSO), cutting per-packet CPU overhead and improving throughput.
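If you’re curious whether your NICs are already doing this for you, `ethtool -k` reports the offload state. A small sketch (assumes ethtool is installed; swap “eth0” for your actual interface name):

```python
import subprocess

# Quick check of segmentation-offload state on a NIC. Assumes ethtool is
# installed; replace "eth0" with your interface name.
def offload_state(iface: str = "eth0") -> dict:
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    wanted = ("generic-segmentation-offload", "tcp-segmentation-offload")
    return {k.strip(): v.strip()
            for k, _, v in (line.partition(":") for line in out.splitlines())
            if k.strip() in wanted}

if __name__ == "__main__":
    print(offload_state())
```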
The key lesson here is that network devices with high latency need more buffering.
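That relationship is just the bandwidth-delay product: the buffering needed to keep a link busy is roughly link rate times round-trip time. A quick illustration with my own numbers, purely for scale:

```python
# Rule-of-thumb buffer sizing: buffer ~= bandwidth x RTT (bandwidth-delay product).
# A low-latency fabric needs kilobytes of buffer; a high-latency path needs megabytes.

def bdp_bytes(link_gbps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product in bytes for a link of `link_gbps` and a given RTT."""
    return link_gbps * 1e9 * rtt_seconds / 8

for label, rtt in (("low-latency fabric (10 us RTT)", 10e-6),
                   ("typical DC path (250 us RTT)", 250e-6),
                   ("WAN-ish path (10 ms RTT)", 10e-3)):
    print(f"40G x {label}: ~{bdp_bytes(40, rtt) / 1e6:.2f} MB of buffer")
```

At 40G, a 10-microsecond fabric needs about 50KB of buffer to stay full; it takes WAN-class latency before hundreds of megabytes start to make sense.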
Questions And Answers
Before I end this post (I can feel my inbox filling up with various questions), I’m going to do a little Q/A ahead of time.
Which switch has the best latency/buffer combo in the market today?
In my opinion, it’s the Mellanox Spectrum-based boxes. They have the lowest port-to-port latency, and that latency is consistent (read: predictable); a working cut-through routing implementation; and an efficient buffering implementation. This is an area with genuine competition (something the industry has lacked for the last 20-30 years), so expect different vendors to take the crown from time to time.
Can we expect a true apples-to-apples ‘whitepaper’ coming soon from you?
I can neither confirm nor deny that an entity is working on such an endeavor for mass consumption. 😉
Will this post cause people to not buy Arista’s big buffer switches?
Possibly but y’all have to remember: There’s a sucker born every minute.
If you loathe both candidates (Trump and Clinton), who are you going to vote for?
For this election cycle, I’m supporting the sanest candidate(s): the Libertarian ticket of Johnson/Weld.
Carlos