1. Opinions are like…
In the past 5 years I’ve participated in three proper evaluations of WAN acceleration products: twice with Cisco’s products (an older and then a newer product) and once with Riverbed. On the surface, the viability of these products seems obvious. Locality of Reference (both in transmission and in storage) is a well-known phenomenon in computer science. Caching, in general, would be meaningless if LoR were not real. For the network nerds reading this article, think about ARP and DNS caching. These things exist because there is a (valid) assumption that such data will be used over and over again in the course of time. In fact, Kozierok mentions it in reference to both ARP and DNS in his “TCP/IP Guide” book. See here and here.
The fact is that most enterprises use Microsoft products: Office, network drives, print servers, etc. If you have a network drive with thousands of Office documents, there are loads and loads of duplicated data on that drive. Think of all the Word and Excel documents and all of the formatting and metadata embedded in them, not to mention graphics like company logos that might be contained in every one of those documents. So on the surface, anyway, it would seem WAN acceleration makes perfect sense. Factor in TCP optimizations, caching, and compression, and it seems like a real no-brainer.
Unfortunately, all three of the evaluations we conducted resulted in the equipment being removed from the network. You see, not all traffic can be optimized. Traffic that is already encrypted or compressed is not suitable for deduplication or compression, and caching certainly becomes more difficult in these cases. You could start putting keys/certificates into your acceleration devices, but that can become quite a burden. Certain traffic types simply are not suited for acceleration, either, such as any traffic that consists of dynamic, unique, simple transactions with small amounts of data.
There is also a “perception” issue with WAN acceleration. Statistically it can look great for certain applications (and, to be honest, the statistics looked quite good for certain types of traffic in our testing), but if the users’ response to this is “meh,” then what does it matter? An FTP transfer that takes 3.5 hours instead of 4 hours may not matter. In other words, if users don’t think the network feels much faster, then the perceived value of WAN acceleration takes a hit.
All of the above, though, falls into the category of “branch” acceleration. I can accept that it could make sense for the right customer with the right applications. In general, though, I’ve found that proper tuning of QoS and TCP can improve things so much that at the branch level, it’s good enough. Many people do not realize that a bad QoS scheme and default TCP settings are often the real culprits behind poor WAN performance.
2. Enter Infineta
Infineta’s focus, in contrast, is on inter-Data Center communications. This means that their viability isn’t bound to some end-user’s subjective “feeling” of the speed and responsiveness of the network. It means that the performance and usefulness of their product are bound exclusively to each other. It really is all about the statistics. The sorts of traffic that Infineta has focused on accelerating are things like replication and big data. Contrast this with the types of traffic that traditional WAN acceleration has focused on.
In the case of DCI traffic of this sort, there are two important things to remember. First, no amount of fiddling with software, TCP, or QoS settings will yield a significant payoff at very high speeds. It is well known that TCP doesn’t scale: a single flow’s throughput is bounded by its window size, the round-trip time, and the loss rate. The second thing is that increases of 20, 40, or 60 percent efficiency are measured in *gigabits* per second.
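To put a number on “TCP doesn’t scale,” here is a back-of-the-envelope calculation using the well-known, simplified Mathis model, which bounds a single TCP flow’s throughput by MSS, RTT, and loss rate. The specific MSS, RTT, and loss values below are illustrative assumptions on my part, not measurements from any of the evaluations discussed here:

```python
import math

def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Simplified Mathis model: upper bound on one TCP flow's throughput,
    in bits per second: (MSS / RTT) * (1 / sqrt(p))."""
    return (mss_bytes * 8 / rtt_s) * (1 / math.sqrt(loss_rate))

# Assumed values: 1460-byte MSS, 50 ms inter-DC RTT, 0.01% packet loss.
bps = mathis_throughput_bps(1460, 0.050, 1e-4)
print(f"{bps / 1e9:.3f} Gbps per flow")  # ~0.023 Gbps, nowhere near 10 Gbps
```

Even with a very modest loss rate, one flow tops out at tens of megabits per second over a cross-country RTT, which is why tuning alone can’t close the gap at DCI speeds.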
I’d never really thought about acceleration in this context, so imagine my skepticism walking into the Infineta conference room at Network Field Day #3. Well, that skepticism turned into great enthusiasm by the end of the discussion. Infineta blew my socks off during their presentation. Aside from hearing about their deduplication algorithm in great, great detail, we also heard about their hardware platform. What I found remarkable is not that their deduplication algorithm is “mindblowingly” revolutionary (it kind of isn’t, actually, as you’ll see) but that their solution is holistic: they solved known problems with deduplication using highly parallelized hardware. In a nutshell, they engineered out the known bottlenecks of WAN acceleration by carefully choosing the right algorithmic pieces and then parallelizing them.
This allows them to scale to taking in 10 Gbps of non-optimized traffic, optimizing it, and passing it along with an average of 50 microseconds of delay. Microseconds, not milliseconds. Think about that for a moment. We’re talking about dictionary searches, hash generation, and compression all in that time span. I’ve been sitting on this article for the last few weeks because I wasn’t sure what the most important part of this solution was to blog about. You see, the sorts of algorithms that Infineta has managed to optimize belong to a family of algorithms that are applicable to computer science in general. They didn’t optimize these algorithms strictly through math, but through hardware. This is not insignificant, and if you are a major networking vendor, like one that might start with the letter “C” or “J,” you should be paying close attention to this. Infineta sounds like a great acquisition target.
Infineta’s secret weapon is their hardware platform. They have created an end-to-end parallelized appliance platform that could be used for any number of purposes. They revealed during our visit that they intend to do just that, as they investigate putting merchant NPUs into the box. You could use this platform for encryption (SSL offload, in-line Ethernet encryption), IPS/IDS, and firewalling, to name a few. Even the memory controller and the memory banks in the box are parallelized: when you do a lookup in memory, you are really executing numerous parallel lookups. The elegant manner in which the platform was designed also allows it to scale linearly by the simple addition of components at the various stages in the processing pipeline. There is no reason at all this platform couldn’t scale to 40 or 100 Gbps. Most other appliances cannot scale in this way. Certainly not other WAN acceleration appliances.
This leads me to believe that, really, the deduplication algorithm is just a showcase application for the hardware.
Note: So, Infineta, throw in support for MPLS and support for encryption and you’ve made this toad extremely happy. I could accelerate and encrypt my MPLS DCI backbone transparently, regardless of what MPLS application I’m using (L2VPN, L3VPN).
On the one hand, what they have done is remarkable and it’s worth raving about. Clearly a lot of work went into this. On the other hand, the simplicity of the overall concept is breathtaking: parallelize the bottlenecks out.
3. The deduplication algorithm
You can watch the entire Infineta presentation on Vimeo here, here, and here. Ram from Infineta does a great job explaining the history of deduplication in the second video. So, let’s get straight into Infineta’s approach. See the diagram below for the visual.
(a) Packet arrives in the receive ring of the Data Center side ethernet interface and is chosen for deduplication.
(b) A random position in the payload of the packet is chosen and from this position thirty-two (32) strings, each thirty-two (32) bytes long are pulled from the packet. These are called search handles. So if the random position is 223, then the first search handle is bytes 223-254. The second is bytes 224-255. The third is 225-256, and so on, until the thirty-second search handle which is bytes 254-285.
(c) For each of the search handles, a 4-byte hash is created. In order to minimize the possibility of a collision, the hash is generated using an irreducible polynomial. This is a well-known mathematical method for randomizing the hashes as much as possible. See here.
(d) A lookup is done for each of the 32 newly created hashes against a hash table. The value associated with each found hash is a list of one or more offsets in the dictionary where the search handle appears. These offsets correspond to the 32-byte dictionary handles previously described. Most hashes will not find a match in the table in this step. However, it’s entirely possible for multiple hashes to find matches, in which case multiple lists would be returned.
(e) For each offset in each list, the packet is byte-aligned with the 32 byte dictionary handle pointed to by the offset. In our example, if the eleventh search handle has a match, then the 233rd byte of the packet is lined up with the 1st byte of the matching dictionary handle. This implies that bytes 1-232 of the packet are now aligned with the previous 232 bytes of the dictionary “row.” If there are fewer than 232 bytes preceding the matched dictionary handle in the row, then the bytes in the packet “wrap around” to the previous dictionary row. Similarly on the other end if there are more trailing bytes in the packet than there are in the dictionary row, then the trailing bytes wrap around to the next row.
(f) At this point, a compare of the input packet is done against the aligned portion of the dictionary. Some parts will match (at least the byte aligned dictionary handle), and others will not match. The whole packet does not have to match, and in fact, the matching parts may be discontiguous. The smallest match unit in this comparison is 8 bytes. A “match-vector” is generated from this comparison and it describes which parts matched and which parts did not match. This match-vector contains information such as the starting position in the dictionary. Following that, there is 1 bit in the match vector for every 8 bytes of the original packet. A “1” indicates a match against the 8 bytes aligned in the dictionary. A “0” indicates no match.
(g) In the event that multiple match-vectors are generated (e.g. multiple hash matches are found in step (d) and/or multiple offsets are associated with a given hash), then a tie-breaking algorithm chooses the optimal match. Effectively, you want the match-vector with the highest possible integer value derived from the sequence of match-bits. In this diagram, we see that the first and eleventh search handles have matches. The first search handle is found twice in the dictionary. This would result in three different match-vectors being produced. In our case, the eleventh search handle has the optimal match.
(h) The winning match-vector is put into a shim along with the starting offset in the dictionary and placed just after the TCP header. The unmatched portions are packed together in the payload of the packet. Contents of the original packet may be added to the dictionary if not matched. If a search handle is added to the dictionary (thus becoming a dictionary handle), then the hash computed for that search handle is also added to the hash table searched in step (d). Infineta currently supports up to 2 billion dictionary handles, which is why the irreducible polynomial in step (c) is necessary. That’s a whole lot of hashes.
(i) Further TCP optimizations and compression may occur after the deduplication stage, and the packet is sent out the wire. On the receiving end the match-vector is parsed to derive the portions of the dictionary needed to restore the original packet. What makes the match-vector method efficient in this case is the embedded tree structure of the dictionary.
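The steps above can be sketched in a few lines of Python. This is strictly a toy model of steps (b) through (g), not the FPGA pipeline the appliance actually uses: the 32-byte handle size, 32 handles per packet, and 8-byte match granularity come from the description above, but I’m using Python’s `zlib.crc32` (itself a polynomial hash) as a stand-in for Infineta’s irreducible-polynomial hash, and a flat byte string as a stand-in for the real dictionary structure:

```python
import zlib

HANDLE = 32          # search/dictionary handle length in bytes
NUM_HANDLES = 32     # search handles pulled per packet
GRAIN = 8            # smallest match unit for the match-vector

def search_handles(payload: bytes, start: int):
    """Step (b): 32 overlapping 32-byte strings from a chosen position."""
    return [(start + i, payload[start + i:start + i + HANDLE])
            for i in range(NUM_HANDLES)
            if start + i + HANDLE <= len(payload)]

def handle_hash(handle: bytes) -> int:
    """Step (c): a 4-byte hash per handle (CRC-32 as a stand-in)."""
    return zlib.crc32(handle) & 0xFFFFFFFF

def match_vector(payload: bytes, dictionary: bytes, dict_off: int, pkt_off: int):
    """Steps (e)-(f): align payload byte pkt_off with dictionary byte dict_off,
    then emit one bit per 8-byte chunk of the payload (1 = chunk matches)."""
    base = dict_off - pkt_off              # dictionary index of payload byte 0
    bits = []
    for chunk in range(0, len(payload), GRAIN):
        d = base + chunk
        ok = (0 <= d and d + GRAIN <= len(dictionary)
              and payload[chunk:chunk + GRAIN] == dictionary[d:d + GRAIN])
        bits.append(1 if ok else 0)
    return bits

def best_vector(vectors):
    """Step (g): tie-break by treating the bit sequence as an integer."""
    return max(vectors, key=lambda bits: int("".join(map(str, bits)), 2))

# Usage: a packet whose first 64 bytes already appear in the dictionary.
dictionary = bytes(range(64)) * 4
packet = bytes(range(64)) + b"X" * 64      # second half won't match
handles = search_handles(packet, 0)
hashes = [handle_hash(h) for _, h in handles]
vec = match_vector(packet, dictionary, dict_off=0, pkt_off=0)
print(vec)  # eight 1s (the matching 64 bytes), then eight 0s
```

The hash-table lookup of step (d) would map each of those hashes to dictionary offsets; here the alignment offsets are supplied directly to keep the sketch short.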
Two important design choices were made here. The first was the use of fixed-length handles (32 bytes). This makes the programming and execution of the algorithm efficient and particularly suited for implementation with FPGAs, as the operands are now of a fixed (and therefore predictable) size. Previous dedupe algorithms have variable-length search handles with some arbitrarily derived boundary based on the content of the packets themselves. Not only does this make them difficult to optimize from an implementation perspective, but it also makes partial matching virtually impossible.
The second significant design choice is Infineta’s particular use of the match-vector. Typically, in other vendors’ approaches, the entire input packet must match a portion of the dictionary for deduplication to occur. Infineta’s approach does not have this restriction, which vastly increases the efficiency gained by the appliance. Looking at our diagram in step (f), we can see that the 1st byte of the packet does not fall on a 32-byte boundary in the dictionary after it is aligned. In most cases this will be true on both the front end and tail end of the packet. We also see that there is a non-matching portion in the middle of the packet. Infineta simply quantifies all of this in the match-vector as shown, then dedupes (i.e., removes) the matching portions of the packet. The remaining non-matching portions are packed together in the resulting optimized packet. Traditional acceleration solutions would have simply passed the original packet through with no deduplication occurring at all.
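The partial-match packing of steps (h) and (i) can also be sketched as a simple round trip: the sender drops the matched 8-byte chunks and ships only the shim plus the leftovers, and the receiver rebuilds the packet from its own synchronized copy of the dictionary. The shim layout here (a dictionary offset plus the raw bit list) is invented purely for illustration; the real wire format is Infineta’s own:

```python
GRAIN = 8  # match granularity from step (f)

def dedupe(payload: bytes, bits, dict_off: int):
    """Keep only the 8-byte chunks whose match-vector bit is 0.
    Toy model: payload length is assumed to be a multiple of GRAIN."""
    kept = b"".join(payload[i * GRAIN:(i + 1) * GRAIN]
                    for i, b in enumerate(bits) if b == 0)
    return (dict_off, bits), kept          # (shim, packed payload)

def restore(shim, packed: bytes, dictionary: bytes) -> bytes:
    """Rebuild the original payload: matched chunks come from the dictionary,
    unmatched chunks are consumed from the packed payload in order."""
    dict_off, bits = shim
    out, p = b"", 0
    for i, b in enumerate(bits):
        if b:
            d = dict_off + i * GRAIN
            out += dictionary[d:d + GRAIN]
        else:
            out += packed[p:p + GRAIN]
            p += GRAIN
    return out

# Round trip: first 64 bytes of the packet are already in the dictionary.
dictionary = bytes(range(64)) * 4
payload = bytes(range(64)) + b"X" * 64
bits = [1] * 8 + [0] * 8                   # match-vector from the compare stage
shim, packed = dedupe(payload, bits, dict_off=0)
print(len(packed), restore(shim, packed, dictionary) == payload)  # 64 True
```

Half the packet never crosses the WAN, which is exactly the win an all-or-nothing dedupe scheme would have forfeited here.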
In the next part, we’ll go over the hardware architecture in a little bit more detail and how some of these steps map to the components of the platform.
This is a WAN acceleration solution that is completely viable. Its application is clearly defined, and it’s built to execute on that application at very high speeds with extremely low latency. If you have high-speed DCI requirements now, then you really ought to be looking at this solution.
Disclosure: I was a participant at Network Field Day #3. Infineta, along with other vendors including one of their competitors, did pay for the NFD3 participants’ travel expenses. Additionally Infineta is a sponsor of PacketPushers. Regardless, my opinion is my own here. If I thought they sucked, this article would have been very different. I’m not shy about my opinions.