Buy One Steelhead, Get One Free: Creative Use Of VRFs With Inline Deployment

Below is a very generic WAN diagram.  It consists of L3 MPLS links in blue and point-to-point links in red.  Not all these links actually exist (unless you’re made of money), but bear with me.

Generic Multi-Site WAN Setup

My cohorts and I recently installed a pair of Riverbed Steelheads between Site A and our remote data center, as you can see in the above drawing.  The performance improvement was so large my boss actually had users telling him “things seem faster.”  I’ve never had that happen before, have you?  Anyway, with a response like this you can imagine how eager my boss was to deploy WAN acceleration to all our sites.

Standard practice dictates that each site have another Steelhead installed locally.  In the above drawing, that would require buying and installing two more devices – one each for sites B and C.  There is but one problem.

Riverbed’s Steelhead 7050 is the single most expensive item I’ve ever successfully held in my hands.  (FYI, do not try to lift a Nexus 7010 by yourself.)

Here’s where Distance X and Distance Y in the above drawing come into play.  Site C is a long way from Site A and the data center.  Hairpinning all traffic from Site C to the data center through Site A wouldn’t make much sense, and bosses understand that (hopefully).  However, for all bosses, there exist distances X and Y such that you will be told

“Hey, network guy/gal, I’m not buying another Steelhead for Site B.  You’ll have to use the one in Site A.”

This is the opportunity you’ve been looking for, right?  To prove you can do more with less?  Right?

You can fill in your own details, but in our case Site B was just a few miles from Site A and the two were connected via fiber in the street.  This didn’t sound like much of a challenge until I started mocking up some pictures on a napkin.  The napkin looked like the drawing below.

Riverbed's Standard Deployment

Since a Steelhead (or a WAAS/SilverPeak/Certeon) only has a limited number of interfaces, if you want to deploy the appliance inline you must do so at some point of aggregation.  For us it was between the core and edge switches.  Given that all our edge connectivity was kept separate from the core switches, traffic between Site B (labeled Remote Site above) and the data center would simply bounce through Site A (labeled HQ above) and never touch the Steelhead.  It was around this time I started whispering under my breath “Curse you, Distance X!”

WCCP redirection was off the table for other reasons, and we didn’t want to disrupt optimization for existing clients.  We needed a way to get traffic to bounce through the Steelhead to the core switches, then back through in the “correct” (LAN-to-WAN) direction.  And vice versa for the return traffic.

What do you do in your test lab when you need to simulate the presence of several switches, but you only have one?  Multiple VRFs!

We ended up turning the L3 links between the core and edge switches into 802.1Q trunks that carried two VLANs.  On the edge switches, one VLAN replaced the existing link, and the other was placed in a new VRF (we called it PASSTHRU).  The core switches had no knowledge of a new VRF and happily forwarded packets from subinterface A to subinterface B.  Packets were effectively bouncing through the Steelhead to the core, then back through in the correct direction!  Logically, the new configuration looks like the drawing below.  Notice the different VRFs on the edge switches.

Riverbed's Standard Design with multi-VRF Edge Switch
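
In configuration terms, the edge-switch side of that drawing distills down to the fragment below.  This is just a condensed view of the full configs at the end of this post; the VLAN numbers and addresses are from our deployment, so substitute your own:

ip vrf PASSTHRU
!
interface Port-channel1
 description to Core Switch (via Steelhead)
 switchport mode trunk
 switchport trunk allowed vlan 3002,3003
!
interface Vlan3002
 description global-VRF leg facing the core
 ip address 1.2.3.4 255.255.255.248
!
interface Vlan3003
 description PASSTHRU-VRF leg facing the core
 ip vrf forwarding PASSTHRU
 ip address 10.10.10.11 255.255.255.254

The edge switch forwards WAN-facing traffic only within PASSTHRU, so its sole path to the rest of the network is up VLAN 3003, through the Steelhead, to the core, and back down VLAN 3002.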

Clients in Remote Site B can now make use of the existing Steelhead in Site A.  Penny-pinching worked – this time.

In Riverbed’s case, you will see a lot of connections being passed through with reason code FROM_WAN.  These are the SYN packets going from the WAN side to the LAN side, before being bounced back to the other VRF.  I assume other vendors would give you similar notices.

This probably isn’t screaming “elegant solution” to you, so it’s worth listing the pros and cons of the approach.

Pros

  • Saved a lot of money
  • Caused no outage (as long as you’re careful about renumbering the two core switches’ interfaces; one possible sequence is sketched after these lists)
  • Quick to implement (1 night vs. waiting for a new Steelhead to arrive)
  • Does not require WCCP redirection (remember, WCCP is just like HSRP and EIGRP – proprietary)
  • Supports enabling WAN optimization for other edge services with a quick outage (e.g. remote access)

Cons

  • Additional load on the Steelhead (can your box handle it?)
  • Additional load on the edge switch – multiple OSPF databases in our case
  • “Unnecessary” load on the core switches – how much traffic is Site B sending to the data center?
  • You won’t find this in any Riverbed Deployment Guide
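
On that no-outage point: with two core switches you can migrate one uplink at a time while the other keeps carrying traffic.  The sequence below is a hypothetical sketch of the edge-switch side of such a change, not a verbatim record of our maintenance window, and it assumes a redundant path exists while you work:

! 1. Edge switch: take down the link being migrated (the redundant path carries traffic)
interface Port-channel1
 shutdown
 no ip address
 switchport
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan 3002,3003
! 2. Edge switch: create VLANs 3002/3003 and the matching SVIs (see the full config below)
! 3. Core switch: move the parent interface's IP onto subinterface .3002 and add .3003
! 4. Bring the link back up and confirm OSPF adjacency in both VRFs before
!    repeating the process on the second core switch
interface Port-channel1
 no shutdown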

I’m very interested in hearing other people’s thoughts on this implementation.  If you have any more pros/cons to add to the list, please share.

Note: These drawings were all done using Dia; you’ll have to forgive the odd Steelhead stencil.  If you know of a way to import .vss files into Dia, please let me know.

Old Core Switch Config (Cisco Nexus 7010)


interface port-channel2
  description to Edge Switch (via Steelhead)
  no shutdown
  ip address 1.2.3.3/29
  ip ospf message-digest-key 1 md5 3 ABC123
  ip ospf network point-to-point
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode

Old Edge Switch Config (Cisco 3750G)


interface Port-channel1
 description to Core Switch (via Steelhead)
 no switchport
 ip address 1.2.3.4 255.255.255.248
 ip ospf message-digest-key 1 md5 ABC123
 ip ospf network point-to-point
 ip pim sparse-mode
!
interface Gi1/0/1
 description to MetroE Router
 no switchport
 ip address 1.1.1.1 255.255.255.254
!
router ospf 1
 no passive-interface Port-channel1

New Core Switch Config (Cisco Nexus 7010)


interface port-channel2
  description to Edge Switch (via Steelhead)

interface port-channel2.3002
  description Edge Switch's default VRF
  encapsulation dot1q 3002
  no shutdown
  ip address 1.2.3.3/29
  ! be sure to change the VLAN of the Steelhead's in-path address
  ip ospf message-digest-key 1 md5 3 ABC123
  ip ospf network point-to-point
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode

interface port-channel2.3003
  description Edge Switch's PASSTHRU VRF
  encapsulation dot1q 3003
  no shutdown
  ip address 10.10.10.10/31
  ip ospf message-digest-key 1 md5 3 ABC123
  ip ospf network point-to-point
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode

New Edge Switch Config (Cisco 3750G)


vlan 3002
 name default_to_core
vlan 3003
 name PASSTHRU_to_core
!
ip vrf PASSTHRU
!
interface Port-channel1
 description to Core Switch (via Steelhead)
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan 3002,3003
 switchport nonegotiate
 spanning-tree portfast trunk
!
interface Vlan3002
 description Optimized connection
 ip address 1.2.3.4 255.255.255.248
 ip ospf message-digest-key 1 md5 ABC123
 ip ospf network point-to-point
 ip pim sparse-mode
!
interface Vlan3003
 description PASSTHRU connection
 ip vrf forwarding PASSTHRU
 ip address 10.10.10.11 255.255.255.254
 ip ospf message-digest-key 1 md5 ABC123
 ip ospf network point-to-point
 ip pim sparse-mode
!
interface Gi1/0/1
 description to MetroE Router
 no switchport
 ! be sure to move this interface to the PASSTHRU vrf
 ip vrf forwarding PASSTHRU
 ip address 1.1.1.1 255.255.255.254
!
router ospf 1
 no passive-interface Vlan3002
!
router ospf 2 vrf PASSTHRU
 router-id A.B.C.D
 ! you must manually specify a new OSPF router ID for this process to start
 no passive-interface Vlan3003
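
Once the trunk is up, a few standard show commands make it easy to confirm the bounce is wired correctly.  I’ll spare you our output, but what you want to see is the WAN-facing interface sitting inside PASSTHRU and a full OSPF adjacency in each VRF:

! On the 3750G edge switch
show ip vrf interfaces PASSTHRU
show ip route vrf PASSTHRU
show ip ospf neighbor
! On the Nexus 7010 core
show ip ospf neighbors
show interface port-channel2.3002
show interface port-channel2.3003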

Mark Ciecior
Mark is a network engineer, racquetball player, and CCIE #28274. Follow @mciecior on Twitter to hear more about packets, racquets, and why he can't give up the original StarCraft.
  • Brant Stevens (http://twitter.com/ntwrkarchitect)

    slick.

  • Eddie

    We have a similar scenario we are jumping into, but why not change the L3 links between the routers and your MLS to L2 and bring the L3 connectivity back to the core? Then just have separate /30s between your WAN routers and the core, all of which would get trunked through and would flow through the RB

    • Mark Ciecior

      We thought about this, and the question ends up being “do you want all of your edge traffic touching your core?”

      In our case, with Nexus 7010s, bandwidth wasn’t an issue.  But bringing all the /30s back to the core means

      a) The core switches are now doing OSPF summarization/filtering (and maintaining databases for multiple areas)
      b) Third-party equipment is now connected directly to our core switches
      c) All third-party-to-third-party traffic touches the core unnecessarily

      If these aren’t concerns for you, /30s back to the core would definitely be easier.

  • Derick Winkworth (http://twitter.com/cloudtoad)

    Put both “sides” of the Steelhead into the edge switch, plug the edge switch directly into the core.

    The VRFs will now have four interfaces attached to them: an interface on the “wan” side, an interface on the “core” side, and two interfaces going to the Steelhead.  Use PBR or WCCP on the “wan” and “core” interfaces to selectively redirect traffic to the Steelhead.

    You can add an additional interface in each VRF passing through the Steelhead for WAN-to-WAN traffic.  Again, you can use PBR to selectively pass traffic through the Steelhead; for WAN-to-WAN traffic not passing through it, leak routes as appropriate between the VRFs…

    [edit: Or… replace the edge switch with an OpenFlow switch and use "waypoint routing" aka "appliance routing" to selectively redirect flows to the appliance. You could do this today with NEC products. This is effectively what I'm aiming for above. Either way, you reduce suboptimal routing and unnecessary load on the appliance.]