Riverbed Steelhead and Juniper WX Part II – When a tunnel is not a tunnel

In my previous post “Getting Nerdy about WAN optimisation”, I discussed some of the basics of WAN optimisation and how the Juniper WX and Riverbed Steelhead capture traffic. Now I’d like to get into the detail of how they shift traffic between networks.

Once traffic enters either appliance, there are differences in how it is encapsulated. Both products terminate TCP sessions locally (though they achieve this in slightly different ways) and retransmit the payload point-to-point by default.

By default, the WX uses a tunnel mode based upon UDP port 3577, with one outbound tunnel to each peer WX. All traffic is passed within this tunnel. UDP is stateless here, which avoids the “ping-pong” of TCP session setup and acknowledgements. The most efficient tunnel mode is IPComp (IP protocol number 108), which saves a few bytes per packet because unused headers are removed.

Below are some basic diagrams of the UDP and IPComp headers. As you can see, IPComp only saves 4 bytes per packet; if you are that desperate for bandwidth, you would want to combine this with some fairly serious ingress filters to ensure that only the really important stuff actually hits the WAN optimiser, let alone the WAN.

UDP Headers

IPComp Headers
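To put that 4-byte saving in perspective, here is a back-of-the-envelope calculation (my own arithmetic, not vendor figures) of what switching from the UDP tunnel header to IPComp buys you at various payload sizes:

```python
UDP_HEADER = 8     # bytes: source port, dest port, length, checksum
IPCOMP_HEADER = 4  # bytes: next header, flags, Compression Parameter Index

def tunnel_saving_percent(payload_bytes):
    """Percentage of WAN bandwidth saved by switching the tunnel from UDP
    to IPComp encapsulation, ignoring compression gains (identical in both)."""
    udp_total = payload_bytes + UDP_HEADER
    ipcomp_total = payload_bytes + IPCOMP_HEADER
    return 100.0 * (udp_total - ipcomp_total) / udp_total

for size in (64, 512, 1400):
    print(f"{size:5d}-byte payload: {tunnel_saving_percent(size):.2f}% saved")
```

For tiny 64-byte packets the saving is worthwhile (around 5%), but at full-size frames it is a fraction of a percent, which is why you would combine it with aggressive ingress filtering before relying on it.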

The Steelhead in “Correct Addressing” mode does something similar. Once the adjacency is formed, the traffic is shipped point-to-point between the “in-path” IP addresses attached to a transparent bridge on each appliance (even if the device is in an “off-path” mode). Riverbed doesn’t actually call this tunneling, but the net effect is pretty darn similar.

Flows optimised by the Steelhead are encapsulated in a TCP transport; for environments with both high bandwidth (GigE+) and high latency (100ms+), high-speed TCP (HS-TCP, RFC 3649) can be enabled to increase the achievable throughput. HS-TCP does not actually alter the headers in use (as seen below) but rather adjusts the TCP window size on the Steelhead so that far more data can be transmitted and buffered before an acknowledgement is required. This feature can (apparently) produce around a four-fold increase in throughput for connections with round-trip times from 5ms to 600ms, and it makes traffic performance much more predictable across long distances.

TCP Headers
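The reason the window size matters is the bandwidth-delay product: the amount of data that must be “in flight” (sent but unacknowledged) to keep a long fat pipe full. A quick sketch of the standard arithmetic:

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bytes in flight needed to fill the link: bandwidth x round-trip time."""
    return bandwidth_bps / 8 * (rtt_ms / 1000.0)

# A GigE link at 100 ms RTT needs ~12.5 MB in flight; the classic 64 KB
# TCP window (pre window-scaling) caps throughput at a tiny fraction of that.
window = 64 * 1024        # bytes
link = 1_000_000_000      # 1 Gbit/s
rtt = 100                 # ms
needed = bdp_bytes(link, rtt)
max_throughput_mbps = window / (rtt / 1000.0) * 8 / 1e6
print(f"BDP: {needed/1e6:.1f} MB; a 64 KB window tops out at "
      f"{max_throughput_mbps:.1f} Mbit/s")
```

On those numbers, a 64 KB window delivers roughly 5 Mbit/s over a gigabit link with 100 ms of latency, which is exactly the gap a larger window (HS-TCP or otherwise) is there to close.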

Pouring multiple pints into one pint-pot

To optimise packet flow on serial networks, both the WX and the Steelhead employ techniques to bundle several packets into one compressed frame. In the world of WX this is called “meta-packetisation”, although I can find scant documentation on the subject. Essentially, if two packets are received into the tunnel (regardless of flow) which compress down to less than a single packet, they are bundled together before being put onto the WAN. If your network flows are very redundant, this could *theoretically* result in 1,000 packets entering optimisation, one very dense packet crossing the WAN, and 1,000 packets leaving optimisation at the far end. The use case I have seen is a WAN link carrying an enormous amount of SNMP from a remote location to a management hub. Obviously, this has a side-effect: the loss of a single meta-packet means that all the packets it contains are lost, forcing many flows to retransmit. There are built-in mechanisms to deal with that (Forward Error Correction), but it’s worth understanding what is going on at this level if you suspect you’ve got a lot of packet loss.
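The general idea can be sketched as follows. This is my own illustration of the technique, not Juniper’s implementation: zlib stands in for the real compressor, the 1,400-byte MTU is assumed, and the length-prefix framing is invented for the sketch.

```python
import zlib

MTU = 1400  # assumed WAN MTU for this sketch

def meta_packetise(packets):
    """Bundle small packets (from any flow) into compressed meta-packets.
    Each input is length-prefixed so the far end can split them apart,
    then the whole bundle is compressed before hitting the WAN."""
    bundles, current = [], b""
    for pkt in packets:
        framed = len(pkt).to_bytes(2, "big") + pkt
        # Flush if adding this packet would no longer compress down
        # to a single WAN packet.
        if current and len(zlib.compress(current + framed)) > MTU:
            bundles.append(zlib.compress(current))
            current = b""
        current += framed
    if current:
        bundles.append(zlib.compress(current))
    return bundles

def un_meta_packetise(bundles):
    """Reverse the process on the far-end appliance."""
    packets = []
    for bundle in bundles:
        data, i = zlib.decompress(bundle), 0
        while i < len(data):
            length = int.from_bytes(data[i:i + 2], "big")
            packets.append(data[i + 2:i + 2 + length])
            i += 2 + length
    return packets

# Highly redundant traffic (think repetitive SNMP polls) bundles well:
snmp = [b"GET sysUpTime.0 community=public " * 4] * 100
wan = meta_packetise(snmp)
assert un_meta_packetise(wan) == snmp
print(f"{len(snmp)} packets in, {len(wan)} meta-packet(s) across the WAN")
```

It also makes the failure mode obvious: drop one element of `wan` and every packet inside it is gone, which is why Forward Error Correction sits alongside this feature.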

The Steelhead transport uses the alarmingly straightforward and well-documented “Nagle’s Algorithm” to stuff as many packets together on the wire as possible.

From Wikipedia:

if there is new data to send
    if the window size >= MSS and available data is >= MSS
        send complete MSS segment now
    else
        if there is unconfirmed data still in the pipe
            enqueue data in the buffer until an acknowledgement is received
        else
            send data immediately
        end if
    end if
end if
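The pseudocode above translates almost mechanically into code. This is a simplified decision model (a real stack tracks buffers and timers rather than returning strings):

```python
def nagle_decide(new_data_len, window_size, mss, unacked_in_flight):
    """Simplified model of Nagle's decision: 'send' or 'queue'."""
    if new_data_len == 0:
        return "nothing to do"
    if window_size >= mss and new_data_len >= mss:
        return "send"      # a full MSS segment is ready: ship it
    if unacked_in_flight:
        return "queue"     # sub-MSS data waits for the outstanding ACK
    return "send"          # pipe is idle: send the small segment now

# A stream of small writes coalesces while an ACK is outstanding, which
# is exactly how small packets get packed together on the wire:
assert nagle_decide(1460, 65535, 1460, True) == "send"
assert nagle_decide(40, 65535, 1460, True) == "queue"
assert nagle_decide(40, 65535, 1460, False) == "send"
```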

It is entirely possible that the WX uses the same algorithm, or some version of it, but I can’t find any detail to confirm or deny this. Riverbed calls its implementation “Neural Framing”, and a little tuning is possible: it can be disabled, forced on, triggered by the TCP PUSH flag (giving the application some level of control to flush the queue immediately), or left in what amounts to “full auto” with Dynamic mode. With “Fixed-target” rules (think of a firewall rule base for WAN optimisation), one can choose different neural framing modes for different traffic flows. Citrix ICA and SQL client/server are examples of latency-sensitive, acceleration-resistant protocols where you may want to deliberately disable neural framing rather than leave it on the default “Dynamic”.

Eschewing Obfuscation

Both of the default transport methods have a side effect: traffic is obfuscated from the network. Your router flow counters make no sense, firewall rules are meaningless, and QoS policies based on source/destination/port cease to work. With both products you can filter traffic out and export NetFlow data (I’ve seen weird results – which set of stats do you believe, client side or server side?), and both support re-tagging of packets. To “solve” the visibility problem, both products offer oddly similar options. The WX has “Multi-flow emulation”, which randomly assigns a different UDP source port to each egress flow so that Weighted Fair Queuing (WFQ) gets a chance to work, but this doesn’t help with visibility much. “Application Visibility mode” goes one step further and preserves the source and destination ports but, weirdly, translates each port into its UDP equivalent – TCP port 80 becomes UDP port 80, and so on. This way, your firewall rules and any policy on the router stand a chance, albeit with judicious modifications.
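The multi-flow idea is easy to picture in code. This is my own illustration, not Juniper’s actual scheme; the ephemeral port range and the flow-to-port mapping are assumptions. The point is simply that a WFQ scheduler hashing on the 5-tuple now sees distinct flows instead of one monolithic tunnel:

```python
import random

flow_to_port = {}  # remembers the port chosen for each inner flow

def emulated_source_port(inner_flow_5tuple):
    """Assign a stable, random UDP source port per inner flow so a
    fair-queuing scheduler can tell the tunnelled flows apart."""
    if inner_flow_5tuple not in flow_to_port:
        flow_to_port[inner_flow_5tuple] = random.randint(49152, 65535)
    return flow_to_port[inner_flow_5tuple]

web = ("10.1.1.5", 40001, "192.168.9.9", 80, "tcp")
voip = ("10.1.1.7", 16384, "192.168.9.10", 16384, "udp")

# The same inner flow always maps to the same outer port, so packets of
# one flow stay in one queue; different flows usually land on different
# ports, giving WFQ something to fair-share between.
assert emulated_source_port(web) == emulated_source_port(web)
```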

The Steelhead handles this better: “Port Transparency” tunnels the traffic between the “real” IPs of the appliances. Essentially, your logs will show a shed-load of traffic between two IPs, but it should at least be recognisable and firewall rules should make some sort of sense. The “Full Transparency” option makes the optimisation relatively invisible: the source and destination IPs and ports are preserved, which allows ACLs and QoS to work the way they are supposed to. The downside is the firewall I keep banging on about. Suddenly, all that fancy deep packet inspection goes bananas and screams “but that’s not HTTP traffic! The payload is all wrong!”. To deal with this, you need a tunnelling mechanism such as IPComp tunnel mode on the WX or “Correct Addressing” on the Steelhead. You can see how circular these things get: essentially, to optimise WAN traffic you have to choose between depth of inspection on the firewall and visibility on the WAN.

UDP and IPv6

One of the other interesting aspects of both products is how they deal with non-TCP protocols. The WX can perform MSR (Molecular Sequence Reduction, its multi-stage RAM-based compression) on any IP protocol. Realistically this means UDP and voice, but in theory any exotic IP application could be compressed. Until recently, UDP handling was an Achilles’ heel of the Steelhead: apart from a few specific protocols, it had no support. Version 7.0 fixes this limitation, weirdly, by performing Scalable Data Referencing (SDR, the rough equivalent of MSR) and encapsulating the UDP in TCP. Given that the IP stack is hyper-optimised for TCP, this was probably the path of least resistance.
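Carrying UDP over TCP brings one wrinkle worth understanding: TCP is a byte stream, so datagram boundaries must be restored by framing. The sketch below shows the general technique with an invented length-prefix format; I have no visibility into RiOS’s actual wire format:

```python
import struct

def encapsulate(datagrams):
    """Serialise UDP datagrams into one TCP byte stream, each prefixed
    with a 2-byte length so boundaries survive the stream transport."""
    return b"".join(struct.pack("!H", len(d)) + d for d in datagrams)

def decapsulate(stream):
    """Recover the original datagrams on the far-end appliance."""
    out, i = [], 0
    while i < len(stream):
        (length,) = struct.unpack_from("!H", stream, i)
        out.append(stream[i + 2:i + 2 + length])
        i += 2 + length
    return out

voice = [b"\x80" * 172] * 3  # three RTP-sized payloads
assert decapsulate(encapsulate(voice)) == voice
```

The trade-off is that the datagrams inherit TCP’s in-order, reliable delivery – fine for SDR’s purposes, but a behaviour change for latency-sensitive UDP applications.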

The other clue to this is IPv6 support, something the WX will sadly never have any concept of. Steelhead RiOS v7 can also encapsulate IPv6 traffic in IPv4 TCP; I realise this would be enough to get some Packet Pushers regulars foaming at the mouth, but I’m guessing it is an interim solution to allow dual-stack environments to use WAN optimisation whilst the rest of the world catches up.

In Summary

To my mind, the WX was an elegant product, and in some areas Riverbed is only now catching up (QoS support, routing integration, UDP tunnelling, and a handful of others). However, the protocol support in the Steelhead, especially for “knotty” protocols such as Exchange 2010, Lotus Notes and Citrix, is really unmatched. Riverbed is slowly whittling away the tiny niggles I’ve had with its product. I’ve been a WX bigot for a number of years and have put a lot of effort into learning it backwards. I’ve now got a “new” product to evangelise over; it’s just kind of a drag that I’ve got to learn everything all over again.

Sources

My main sources for these articles have been the official Riverbed Steelhead and Juniper WX courseware, which for fairly obvious reasons I don’t have permission to reproduce. However, Juniper and Riverbed make a point of putting the vast majority of their documentation on-line; it can be found here (Juniper) and here (Riverbed). The diagrams are my own creation in Excel. I welcome any criticism, constructive or otherwise!

 

Glen Kemp
Enterprise Security Architect. Designing & deploying “keep the bad guys out” technologies. Delivering elephants and not hunting unicorns. Please feel free to add me on , follow me on Twitter or check out my other blogs on Juniper J-Net, sslboy.net and SearchNetworking.
  • http://twitter.com/metaip Allen Baylis

Ok, why are we discussing EOL products? WAN acceleration has been a dead business for years and, to be honest, it’s integrated within software, OS …etc. Bandwidth is so cheap, who really cares. Just my opinion!

    • http://twitter.com/Dnoles Dave Noles

      When you have clients on the other end of a vsat connection where another 128k costs an arm and a leg, these things are a lifesaver.

With only Steelhead experience, even the pure latency optimisation for branch sites with ample bandwidth sees a noticeable improvement.

    • Steve B

Interested to know what software or OS level products/features match HW WAN optimisation performance? Regardless, the management aspect of having one device per site rather than every endpoint involved in the optimisation process is a big plus IMO.

Also, bandwidth being cheap is a relative term; connecting 300+ sites to a WAN with carrier and local issues means anything better than xDSL is not “cheap”.

    • http://twitter.com/ssl_boy Glen Kemp

I think everyone else has answered the questions for me; as to the “why”, well, that’s mostly covered in the first paragraph of the first post, linked at the top.

    • http://twitter.com/maeltor Josh

You can throw GIGs of bandwidth at something, but if the issue is latency and a chatty, poorly optimised WAN protocol, more bandwidth isn’t going to do a damn thing…
In what products are these technologies built right into the software, and if so, how do they compare? I’d venture to say that a hardware appliance is still the best option.

      Besides, I’d MUCH rather have this type of optimization processing and heavy lifting OFF of my primary routers as much as possible. Let them do their job of routing traffic effectively and as quickly as possible.

  • fellowjerry .

    A very good article though… keep WAN-Opt learning as normal as how switching is..