Could You Set the “Go Faster” Bit?

When people (frequently from the apps team) complain about the performance of the network, I usually offer the following smarmy response: “Ohhhhh, I forgot to set the go-faster bit.” However, after doing some research on wire-speed packet capture and optimal IDS/IPS architectures for virtual environments, I’ve discovered that there actually are ways to do this on Linux. Many of you may be familiar with some of these tools, and Doug Burks of Security Onion fame shared a few more with me after a Twitter discussion.

Virtual PF_RING looks really promising. Until I saw this inexpensive tool (brought to you courtesy of the developers of ntop, one of my favorite free NetFlow tools), I thought the Phantom Virtual Tap from NetOptics was one of the few options available for capturing at the hypervisor.

vPF_RING (Virtual PF_RING) extends the operating-system-bypass approach of PF_RING to virtual environments by implementing a hypervisor-bypass approach. This means that it is now possible to capture packets directly, in zero-copy fashion, without involving the hypervisor. vPF_RING does this by creating a mapping between host kernel space and guest user space, allowing packets to follow a straight path from the NIC to the monitoring applications running on VMs.

What?! Wire-speed packet capture on Linux?

Direct NIC Access

PF_RING DNA (Direct NIC Access) is a way to map NIC memory and registers to userland so that packets are copied from the NIC to the DMA ring by the NIC's NPU (Network Processing Unit) rather than by NAPI. This results in better performance, as CPU cycles are spent solely on consuming packets rather than on moving them off the adapter. The drawback is that only one application at a time can open the DMA ring (note that modern NICs can have multiple RX/TX queues, so you can run one application per queue simultaneously); in other words, applications in userland need to talk to each other in order to distribute packets.

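Because each DMA ring (RX queue) can be opened by only one application at a time, scaling out means spreading traffic across queues so that each consumer sees whole flows. NICs do this in hardware, typically with RSS (Receive Side Scaling) hashing. The Python sketch below is only a software stand-in to illustrate the idea: `flow_queue`, `split_by_queue`, and the CRC32-over-sorted-endpoints scheme are hypothetical, not PF_RING's or any NIC's actual hash. Sorting the endpoints before hashing makes it symmetric, so both directions of a conversation land on the same queue.

```python
import ipaddress
import zlib
from collections import defaultdict


def flow_queue(src_ip, src_port, dst_ip, dst_port, proto, num_queues):
    """Return an RX queue index for a flow (hypothetical software stand-in for RSS).

    Endpoints are sorted before hashing, so A->B and B->A map to the same queue.
    """
    a = (int(ipaddress.ip_address(src_ip)), src_port)
    b = (int(ipaddress.ip_address(dst_ip)), dst_port)
    lo, hi = sorted((a, b))
    key = f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{proto}".encode()
    return zlib.crc32(key) % num_queues


def split_by_queue(packets, num_queues):
    """Group (src_ip, src_port, dst_ip, dst_port, proto) tuples by queue index.

    In a DNA-style setup, each group would be consumed by its own application.
    """
    queues = defaultdict(list)
    for pkt in packets:
        queues[flow_queue(*pkt, num_queues)].append(pkt)
    return dict(queues)


if __name__ == "__main__":
    traffic = [
        ("10.0.0.1", 51000, "192.0.2.10", 443, 6),
        ("192.0.2.10", 443, "10.0.0.1", 51000, 6),  # reply direction, same flow
        ("10.0.0.2", 53000, "198.51.100.7", 53, 17),
    ]
    for q, pkts in sorted(split_by_queue(traffic, 4).items()):
        print(f"queue {q}: {len(pkts)} packet(s)")
```

Keeping a flow on one queue matters for IDS use: a sensor reassembling TCP streams needs to see both directions of a connection in the same process.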
Now for the suite that sent me down the original rabbit hole. I’ve been looking for an open source (i.e. FREE) traffic generator for a long time and was so excited when I saw trafgen in the Netsniff-NG toolkit:

netsniff-ng is a free, performant Linux networking toolkit.

The performance gain comes from zero-copy mechanisms: on packet reception and transmission, the kernel does not need to copy packets from kernel space to user space and vice versa.

To this end, the netsniff-ng toolkit is libpcap-independent, but it nevertheless supports the pcap file format for capturing, replaying and performing offline analysis of pcap dumps. Furthermore, we are focusing on building a robust and clean analyzer and utilities that complement netsniff-ng as support for network development, debugging or network reconnaissance.

The netsniff-ng toolkit consists of the following utilities:

    netsniff-ng, a zero-copy analyzer, pcap capturer and replayer
    trafgen, a high-performance zero-copy network traffic generator
    bpfc, a Berkeley Packet Filter compiler supporting Linux extensions
    ifpps, a top-like kernel networking and system statistics tool
    flowtop, a top-like netfilter connection tracking tool
    curvetun, a lightweight multiuser IP tunnel based on elliptic curve cryptography
    ashunt, an Autonomous System (AS) traceroute and ISP testing utility
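The zero-copy idea boils down to this: packets land in a ring of fixed-size frame slots in memory shared by the kernel and the capturing process, and ownership of each slot is passed back and forth with a status word instead of copying the packet. On Linux, the PACKET_MMAP (TPACKET) ring interface that netsniff-ng builds on works this way. The Python class below is a toy, single-process model of that handoff, not real socket code: `MmapRingSim`, its 4-byte frame header, and its method names are hypothetical. Only the two status values mirror Linux's `TP_STATUS_KERNEL` and `TP_STATUS_USER`.

```python
import struct

FRAME_SIZE = 2048
HEADER = struct.Struct("=BxH")        # hypothetical header: status byte, pad, 16-bit length
STATUS_KERNEL, STATUS_USER = 0, 1     # mirror TP_STATUS_KERNEL / TP_STATUS_USER


class MmapRingSim:
    """Toy model of a PACKET_MMAP-style RX ring: one shared buffer, no per-packet copies."""

    def __init__(self, frames):
        self.buf = bytearray(FRAME_SIZE * frames)  # stands in for the mmap'd region
        self.frames = frames
        self.head = 0  # next slot the "kernel" fills
        self.tail = 0  # next slot the reader consumes

    def kernel_deliver(self, pkt):
        """Simulated kernel side: fill the next free slot or report a drop."""
        assert len(pkt) <= FRAME_SIZE - HEADER.size
        off = self.head * FRAME_SIZE
        if self.buf[off] != STATUS_KERNEL:
            return False  # slot still owned by user space: ring full, packet dropped
        self.buf[off + HEADER.size: off + HEADER.size + len(pkt)] = pkt
        HEADER.pack_into(self.buf, off, STATUS_USER, len(pkt))  # hand slot to user space
        self.head = (self.head + 1) % self.frames
        return True

    def user_next(self):
        """Reader side: return a view of the next packet in place, or None if empty."""
        off = self.tail * FRAME_SIZE
        status, length = HEADER.unpack_from(self.buf, off)
        if status != STATUS_USER:
            return None
        start = off + HEADER.size
        return memoryview(self.buf)[start: start + length]  # a view, not a copy

    def user_release(self):
        """Hand the current slot back to the kernel side and advance."""
        self.buf[self.tail * FRAME_SIZE] = STATUS_KERNEL
        self.tail = (self.tail + 1) % self.frames


if __name__ == "__main__":
    ring = MmapRingSim(frames=4)
    ring.kernel_deliver(b"\x45\x00hello")
    pkt = ring.user_next()
    print(len(pkt), bytes(pkt[2:]))
    ring.user_release()
```

Because `user_next` returns a view into the shared buffer, a reader that still needs the bytes after calling `user_release` must copy them first; the slot can be refilled as soon as it is handed back.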

I was really impressed after reading about the work Corey Satten from the University of Washington is doing with his tool called Gulp. Its inception came about after he attempted to perform lossless gigabit packet captures on Linux.

The solution was simply to explicitly assign the reader and writer threads to different CPUs/cores and to increase the scheduling priority of the packet-reading thread. These two changes improved performance so dramatically that dropping any packets on a gigabit capture, written entirely to disk, is now a rare occurrence, and many of the system performance tuning hacks I resorted to earlier have been backed out. (I now suspect they mostly helped by indirectly influencing process scheduling and CPU affinity, something I now control directly; however, on systems with more than two CPU cores, the inter-core benchmark I developed may still be helpful to determine which cores work most efficiently together.)
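The two changes Satten describes, pinning the reader and writer threads to different cores and raising the reader's priority, can be sketched in Python on Linux. This is not Gulp's code: the `capture` function, the synthetic packet list standing in for a live interface, the in-memory sink standing in for the disk, the queue size, and the nice value of -10 are all assumptions for illustration. Raising priority normally requires root or CAP_SYS_NICE, so the sketch just records whether it worked, and CPython's global interpreter lock limits how much the two threads truly run in parallel; the point here is the structure.

```python
import io
import os
import queue
import threading


def pin_current_thread(cpu):
    """Pin the calling thread to one CPU (Linux only; silently skipped elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        # On Linux, a native thread ID can be passed where a PID is expected.
        os.sched_setaffinity(threading.get_native_id(), {cpu})


def boost_current_thread(niceness=-10):
    """Try to raise the calling thread's priority; returns False if not permitted."""
    try:
        os.setpriority(os.PRIO_PROCESS, threading.get_native_id(), niceness)
        return True
    except (AttributeError, OSError):
        return False


def capture(packets, ring_slots=1024):
    """Hypothetical Gulp-style pipeline: a pinned, boosted reader feeds a pinned writer."""
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
    else:
        cpus = [0]
    reader_cpu = cpus[0]
    writer_cpu = cpus[1] if len(cpus) > 1 else cpus[0]  # share a core if only one exists

    ring = queue.Queue(maxsize=ring_slots)  # buffer between reader and writer
    sink = io.BytesIO()                     # stand-in for the capture file on disk
    stats = {"dropped": 0, "written": 0}

    def reader():
        pin_current_thread(reader_cpu)
        stats["boosted"] = boost_current_thread()
        for pkt in packets:
            try:
                ring.put_nowait(pkt)  # a live reader cannot wait, so a full buffer means a drop
            except queue.Full:
                stats["dropped"] += 1
        ring.put(None)  # end-of-capture marker; blocking is fine here

    def writer():
        pin_current_thread(writer_cpu)
        while (pkt := ring.get()) is not None:
            sink.write(pkt)
            stats["written"] += 1

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    stats["bytes"] = sink.tell()
    return stats


if __name__ == "__main__":
    synthetic = [bytes(64) for _ in range(10000)]
    print(capture(synthetic))
```

The bounded queue is the buffer between the two threads: the larger it is, the longer the writer can stall (for example, on a slow disk flush) before the reader starts dropping packets.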

Here’s a good blog post from Randy Caldejon of nPulse comparing the 10 Gbps capture speeds of Daemonlogger and Gulp when writing to disk.

Now if only I could get the time from my new employer to play with all of this in our lab.

I’m always excited when hearing about the great work happening at the edge of the commercial realm. I think this type of development pushes mainstream companies to work harder in developing better products. Additionally, you might want to check out a couple of my favorite resources, CAIDA and the perfSONAR project, for network data analysis and performance monitoring tools. If you have suggestions for additions to my list, please send them along.


  1. Jim MacLeod says

    Re: traffic generation. Could you use something like scapy, or is that the wrong tool for the job? (Fine control, but a bulk tool is required.)
