A recent ‘conversation’ about VXLAN encapsulation and MTU with Matt Oswalt got me thinking about this subject. My calculations were mostly wrong (Matt’s were not) and I also found a shocking amount of incorrect information on the subject out on the ‘net. So, let’s let the maths do the talking.
TL;DR – As ever, skip to the end for the summary if you really can’t face the math.
The following assumptions have been made when formulating these calculations;
- No retries, packet losses or other events occur
- One way, one to one host communication data and overhead
- UDP & IP v4 packet headers of 28Bytes
- TCP and IP v4 packet headers of 40Bytes
- Use of a full-duplex communications medium, i.e. the full bandwidth is available both upstream and downstream, at the same time
- Where division of the data into the maximum packet size results in a fraction, a packet for that remaining data must still be transmitted so I always round-up when calculating the number of packets
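The round-up rule in the last assumption can be sketched as a small Python helper. The function name `packets_needed` is my own, and the default of 1460 Bytes is the default TCP MSS used in the calculations later in this article; treat it as an illustration rather than anything definitive.

```python
def packets_needed(data_bytes: int, max_payload: int = 1460) -> int:
    """Return how many packets are needed to carry data_bytes.

    Any leftover fraction still needs a packet of its own, so the
    division is always rounded up (integer ceiling division).
    """
    if data_bytes <= 0:
        return 0
    return (data_bytes + max_payload - 1) // max_payload


print(packets_needed(20_000))  # 20,000 / 1460 = 13.70, so 14 packets
```

Integer arithmetic is used here so that very large sizes never run into floating point rounding.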
Units of Measurement
The abbreviations used with the data/traffic values in this article are metric prefixes (aka SI prefixes) indicating decimal multiplication (rather than binary prefixes (based on powers of two) where 1kbit = 1024bits), as follows;
- 1 kB = 1,000 Bytes (8,000 bits)
- 1 MB = 1,000,000 Bytes (8,000,000 bits)
A few things to remember;
- Serial line speeds are built up from 64kb channels rather than round metric numbers, so a 2Mb E1 is actually 2.048Mb (32 x 64kb) using metric prefixes
- Ethernet speeds are typically quoted using metric prefixes, so a 100Mb Ethernet link is exactly that (100,000,000bit)
- Linux commands display file size information using metric prefixes. However, using a --human-readable or -h switch normally results in output using binary prefixes, which is rather confusing
- Windows displays file sizes using binary prefixes. To make accurate calculations, view the file properties or use a command prompt to discover the file size in Bytes and apply the metric prefix
- Most other storage systems use binary prefixes
- Hard drive manufacturers typically use metric prefixes which means they appear smaller than specified when capacity is displayed using binary prefixes
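To make the metric versus binary difference concrete, here is a minimal Python sketch. The function name and the "500GB" drive are hypothetical examples of mine, not figures from any particular vendor.

```python
def in_metric_and_binary(size_bytes: int) -> tuple:
    """Return the size in metric gigabytes (10^9) and binary gibibytes (2^30)."""
    return size_bytes / 10**9, size_bytes / 2**30


# A hypothetical drive sold as "500GB" using metric prefixes
metric_gb, binary_gib = in_metric_and_binary(500 * 10**9)
print(f"{metric_gb:.1f} GB (metric) = {binary_gib:.1f} GiB (binary)")
```

The same number of Bytes simply looks smaller when it is divided by powers of two, which is why the drive appears to have 'lost' capacity.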
Bits & Bytes
- 1 Byte = 8 bits
- File sizes are normally quoted in Bytes
- Linux commands display file size information in Bytes
- Link speeds are quoted in Mb (Megabits) per second, not MegaBytes, thus (ignoring all overheads) it’ll take 8s to move 100MB (MegaByte) over a 100Mb (Megabit) per second Fast Ethernet link
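The bits versus Bytes conversion in that last point can be sketched like this, assuming metric prefixes throughout and ignoring all overheads. The names are mine, for illustration only.

```python
MEGABYTE = 1_000_000  # Bytes, metric prefix
MEGABIT = 1_000_000   # bits, metric prefix


def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Ideal transfer time: convert Bytes to bits, then divide by the link speed."""
    return size_bytes * 8 / link_bits_per_second


# 100MB over a 100Mb per second Fast Ethernet link
print(transfer_seconds(100 * MEGABYTE, 100 * MEGABIT))  # 8.0 seconds
```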
Just to summarise how VXLAN works before we get into the detail, it goes like this: the original layer two frame has a VXLAN header added and then that is transmitted across the network within UDP segments carried within IP datagrams. I could have just said UDP packets, right?
Based on information from section 5 of the VXLAN IETF draft (v8) found here, we find that the VXLAN header is 8Bytes long; 1B is for flags (interestingly 7bits are unused), 3B is for the VXLAN Network Identifier (VNI) and a further 4B are reserved (3B following the flags and 1B following the VNI). So, quite a bit of waste there as over half of the header is reserved, presumably for future use.
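As a rough sketch of that 8Byte layout, the header can be packed in Python as two 32-bit words. I'm assuming the field order from the published VXLAN specification (flags, 24 reserved bits, VNI, 8 reserved bits) and the single 'VNI valid' flag value of 0x08; the function name and the example VNI are my own.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the one flag bit in use; the other 7 are unused


def build_vxlan_header(vni: int) -> bytes:
    """Pack an 8Byte VXLAN header for the given 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: 8 flag bits followed by 24 reserved bits (all zero).
    # Word 2: the 24-bit VNI followed by 8 reserved bits (all zero).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


header = build_vxlan_header(5000)  # 5000 is an arbitrary example VNI
print(len(header), header.hex())   # the length is 8 Bytes
```

Only 25 of the 64 bits carry any information (one flag bit plus the VNI), which is the waste mentioned above.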
Of course, the VXLAN header is added to the start of the original layer two frame, which itself has its own headers, but note the 4B original Frame Check Sequence (FCS) is discarded, which leaves us with 14B of headers. You could add another 4B for a VLAN tag but I really don’t see this as being necessary; surely the VTEPs will remove and add this as necessary at each end on ingress and egress? What if a frame is tagged at one end but not the other?
This means that up until this point using VXLAN has added 22Bytes of additional data.
Then of course it’s all placed in a UDP segment, adding 8B, and transported over IP, adding another 20B. Better than the 40B or more for TCP/IP for sure and an overhead that would be incurred for any UDP/IP traffic, but let’s not forget that the original layer two frame contains within it the original TCP/IP or UDP/IP layer three and four headers as well. For that reason, I count this as a VXLAN overhead.
That makes for a total of 50Bytes of overhead for VXLAN [Note: the original version of this article incorrectly stated 40B – thanks to Anurag for spotting this]. That’s 10Bytes more than the 40Byte overhead for TCP/IP itself; if VXLAN is used to encapsulate TCP/IP traffic, that’s a 90Byte protocol overhead.
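The tally so far can be laid out as a short Python sketch. The constant names are mine; the values are the ones worked through above (no inner VLAN tag, IPv4 without options).

```python
VXLAN_HEADER = 8            # flags, VNI and reserved fields
INNER_ETHERNET_HEADER = 14  # original layer two headers, FCS discarded
OUTER_UDP_HEADER = 8
OUTER_IPV4_HEADER = 20
INNER_TCP_IPV4_HEADERS = 40  # the encapsulated traffic's own TCP/IP headers

vxlan_overhead = (VXLAN_HEADER + INNER_ETHERNET_HEADER
                  + OUTER_UDP_HEADER + OUTER_IPV4_HEADER)
total_overhead = vxlan_overhead + INNER_TCP_IPV4_HEADERS

print(vxlan_overhead)  # 50 Bytes added by VXLAN
print(total_overhead)  # 90 Bytes for TCP/IP over VXLAN
```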
The 50Byte overhead created by VXLAN clearly takes the frame size over the standard Ethernet MTU of 1500 (not counting the Ethernet headers of the VXLAN frame) which means, although I rarely see it mentioned, that in order to use VXLAN (or NVGRE and STT I think) you really need Jumbo Frames supported on the network connecting the VTEPs. Great to see Ivan Pepelnjak, as ever, being very clear about this in his post here: http://blog.ipspace.net/2013/08/a-day-in-life-of-overlaid-virtual-packet.html.
You could instead decrease the layer three MTU or MSS on each host by 50Bytes but that’s a significant change and hard to manage; I wouldn’t recommend it. It’s interesting to note that, in the testing described in the VXLAN Performance Evaluation on VMware vSphere 5.1 technical paper, the layer two MTU was increased to 1600 on the physical NICs used.
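The two options just described come down to simple arithmetic, sketched here in Python. The variable names are my own, and the figures assume the standard 1500Byte MTU and the 50Byte VXLAN overhead calculated above.

```python
STANDARD_MTU = 1500
VXLAN_OVERHEAD = 50
TCP_IP_HEADERS = 40

# Option 1: leave the hosts alone and raise the MTU between the VTEPs
required_transport_mtu = STANDARD_MTU + VXLAN_OVERHEAD

# Option 2: leave the network alone and shrink the MTU (and so the MSS) on each host
reduced_host_mtu = STANDARD_MTU - VXLAN_OVERHEAD
reduced_host_mss = reduced_host_mtu - TCP_IP_HEADERS

print(required_transport_mtu)              # 1550
print(reduced_host_mtu, reduced_host_mss)  # 1450 1410
```

An MTU of 1600, as used in the VMware testing, comfortably covers the 1550Bytes needed for option 1.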
For the sake of simplicity I’ve ignored Ethernet’s preamble, start frame delimiter and interpacket gap.
Other Performance Considerations
As NIC hardware features such as TCP Segmentation Offload (TSO), also known as Large Segment Offload (LSO), and Checksum Offload (CSO) can’t generally handle VXLAN encapsulation within the frames they process, a significant performance boost is lost and processing is done in software instead. This can have an impact on CPU resources as well as general networking performance and throughput. Again, thanks to Ivan Pepelnjak for pointing this out in his article here: http://blog.ipspace.net/2012/03/do-we-really-need-stateless-transport.html.
In the testing described in the VXLAN Performance Evaluation on VMware vSphere 5.1 technical paper, they specifically chose NICs that could at least do CSO with frames containing VXLAN encapsulated data.
Update January 2016: It’s great to note that Intel, QLogic, Mellanox and no doubt others have now implemented VXLAN offload (that also re-enables TSO in some cases) in their Linux network card drivers. Hopefully performance issues will be a thing of the past soon, assuming users enable the feature (if it’s not automatic). It turns out Ivan was well ahead of me as always: http://blog.ipspace.net/2015/02/performance-of-hypervisor-based-overlay.html.
So, we’re finally getting to the math. I’ll make the calculation based on an original TCP/IP packet, so our total overheads are 40B for TCP/IP and 50B for VXLAN (including UDP/IP). Let’s do it;
1 Byte of Data
This might seem unlikely but programs such as Telnet and SSH transmit a packet for every character sent or received during a session.
- 1B can be contained in 1 packet not exceeding 1460Bytes (the default TCP MSS)
- TCP/IP and VXLAN add a 90Byte, 9,000% TCP/IP over VXLAN overhead
- Thus, 91Bytes of data is actually transmitted over the network
1kB of Data
- 1kB (1,000Bytes) can be contained in 1 packet not exceeding 1460Bytes (1,000 / 1460 = 0.68)
- TCP/IP and VXLAN add a 90Byte, 9% TCP/IP over VXLAN overhead
- Thus, 1090Bytes of data is actually transmitted over the network
20kB of Data
- 20kB (20,000Bytes) must be split into 14 packets, each packet not exceeding 1460Bytes (20,000 / 1460 = 13.70)
- 14 x 90Bytes of TCP/IP and VXLAN overhead equals a 1,260Byte, 6.3% TCP/IP over VXLAN overhead
- Thus, 21,260Bytes of data is actually transmitted over the network
480kB of Data
- 480kB (480,000Bytes) must be split into 329 packets, each packet not exceeding 1460Bytes (480,000 / 1460 = 328.77)
- 329 x 90Bytes of TCP/IP and VXLAN overhead equals a 29,610Byte, 6.169% TCP/IP over VXLAN overhead
- Thus, 509,610Bytes of data is actually transmitted over the network
1MB of Data
- 1MB (1,000,000Bytes) must be split into 685 packets, each packet not exceeding 1460Bytes (1,000,000 / 1460 = 684.93)
- 685 x 90Bytes of TCP/IP and VXLAN overhead equals a 61,650Byte, 6.165% TCP/IP over VXLAN overhead
- Thus, 1,061,650Bytes of data is actually transmitted over the network
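If you'd rather check the figures above yourself, or try other payload sizes, the calculations can be reproduced with a few lines of Python. This is only a sketch under the assumptions listed at the start of the article; the function and constant names are mine.

```python
MSS = 1460                # default TCP MSS in Bytes
OVERHEAD_PER_PACKET = 90  # 40B TCP/IP plus 50B VXLAN (including UDP/IP)


def vxlan_tcp_overhead(data_bytes: int):
    """Return (packets, overhead Bytes, overhead %, total Bytes transmitted)."""
    packets = -(-data_bytes // MSS)  # ceiling division: always round up
    overhead = packets * OVERHEAD_PER_PACKET
    percent = 100 * overhead / data_bytes
    return packets, overhead, percent, data_bytes + overhead


for size in (1, 1_000, 20_000, 480_000, 1_000_000):
    packets, overhead, percent, total = vxlan_tcp_overhead(size)
    print(f"{size:>9,}B: {packets:>3} packets, {overhead:>6,}B overhead "
          f"({percent:,.3f}%), {total:>9,}B transmitted")
```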
So, as demonstrated, for data payloads in excess of the common TCP maximum segment size (the MSS) of 1460 Bytes, the TCP/IP over VXLAN bandwidth overhead is approximately 6.16% (90/1460). This equates to an ‘efficiency’ of 94.19% (1460/1550) – in other words, that’s how much bandwidth is left for actual data if you’re putting as much data in each packet as possible.
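The efficiency figure is simply the payload divided by everything transmitted, per packet. A minimal Python sketch, with a function name of my own choosing and the 90Byte overhead from above as the default:

```python
def efficiency(payload_bytes: int, overhead_bytes: int = 90) -> float:
    """Fraction of the transmitted Bytes that is actual data."""
    return payload_bytes / (payload_bytes + overhead_bytes)


print(f"{efficiency(1460):.2%}")  # a full packet: 1460 / 1550
print(f"{efficiency(1):.2%}")     # a single Byte of data: 1 / 91
```

Trying a few payload sizes in between shows how quickly efficiency falls away as packets get smaller.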
Keep in mind that for very small data payloads (common with applications such as Telnet, TN3270 mainframe emulation and SSH) the bandwidth overhead can be as high as 9,000%.
If you add Ethernet and VLAN tagging (of the final, ‘outside’ frame) into the mix (see the calculations from Wikipedia here) then the throughput of a 100Mb link is 100 x 0.9419 (TCP/IP over VXLAN efficiency (including UDP)) x 0.9728 (Ethernet (with tagging) efficiency) which equals 91.63Mbps, a combined efficiency of 91.63%, assuming ideal conditions.
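That combined figure is just the two efficiencies multiplied together, which can be sketched in Python as follows. The 0.9728 Ethernet (with tagging) efficiency is taken as quoted above rather than derived here, and the names are mine.

```python
LINK_MBPS = 100
VXLAN_TCP_EFFICIENCY = 1460 / 1550   # TCP/IP over VXLAN, full packets
ETHERNET_TAGGED_EFFICIENCY = 0.9728  # figure quoted above, not derived here

throughput_mbps = LINK_MBPS * VXLAN_TCP_EFFICIENCY * ETHERNET_TAGGED_EFFICIENCY
print(f"{throughput_mbps:.2f}Mbps")  # 91.63Mbps under ideal conditions
```

Swapping `LINK_MBPS` for 1,000 or 10,000 gives the equivalent best case for faster links.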
If you increased the MTU of your hosts to take full advantage of the use of Jumbo Frames I suspect you could greatly improve this figure. If you increased the MTU by just 50Bytes you’d be up to 94.14%.
Of course, don’t forget those other performance considerations.
Other articles in this series;