Sakura Internet operates several data centers across Japan, including this one, and my team is in charge of building and maintaining our IP backbone. In this article, I will introduce our ongoing upgrade of our DDoS mitigation solution, which happens to be a down-to-earth, if not widely applicable, use case for OpenFlow. I will share some of the benefits and limitations of current OpenFlow 1.0 implementations that we have discovered so far.
At Sakura, destination-based remotely triggered black hole filtering (d/RTBH) has been a regular countermeasure against large-scale DDoS attacks, used to avoid collateral damage to customers who share an uplink with the target hosts. However, as long as traffic is blackholed based solely on destination, we are unable to forward packets from legitimate sources to customers under DDoS attack. Besides, more often than not we failed to take action in a timely manner. It’s common for DDoS attacks to grow to several Gbps or more within seconds, and by the time we had identified the targets it was already too late.
Towards the end of 2011, we revamped our in-house DoS detection app with a “high-velocity” in-memory database called VoltDB as its backend. Our new app has been a real success thanks to this new-generation database, which does the heavy lifting of sFlow data processing. It not only tells us who’s under attack in real time, but also gives us detailed profiles, including incoming bps per source IP, also in real time. With that information, we can finally take a step forward from d/RTBH to source-and-destination-based filtering, and keep customers under DDoS attack connected to the Internet.
We thought about how to accomplish this. Again, with an RTBH route active, the BGP routers would blackhole all packets destined to the target, no matter the sender. So the question is how to get the cleaned-up incoming packets past our iBGP mesh to the non-BGP routers nearest the target host. Available solutions such as tunneling and VRF separation have drawbacks, at least in Sakura’s backbone layout, serious enough to give me cold feet.
Around that time, I started teaching myself the basics of SDN. (That’s how I came across this website!) What clicked with me quickly was the idea of using OpenFlow, more specifically Floodlight’s static flow pusher API, instead of tunneling or VRF. The point here is that our app knows all the necessary information to dispatch appropriate requests: to our trigger router for RTBH, and to the Floodlight controllers for proactive flow insertions via the REST API.
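To give a feel for how simple these proactive insertions are, here is a rough sketch of pushing one entry to Floodlight’s static flow pusher. The controller address and entry values are hypothetical, and the endpoint path and field names follow the OpenFlow 1.0-era static flow pusher API, which may differ in other Floodlight versions:

```python
import json
import urllib.request

# Hypothetical controller address; the endpoint path follows the
# OpenFlow 1.0-era Floodlight static flow pusher and may vary by version.
PUSH_URL = "http://127.0.0.1:8080/wm/staticflowentrypusher/json"

def make_drop_entry(dpid, name, src_ip, dst_ip):
    """A static flow entry that drops IPv4 traffic from one attack
    source toward the target (an empty action list means drop)."""
    return {
        "switch": dpid,             # datapath ID of the OF switch
        "name": name,               # entry name, used later to delete it
        "ether-type": "0x0800",     # match IPv4 only
        "src-ip": src_ip,
        "dst-ip": dst_ip,
        "priority": "32000",
        "active": "true",
        "actions": "",              # no actions: matching packets are dropped
    }

def push_entry(entry):
    """POST one static flow entry to the controller."""
    req = urllib.request.Request(PUSH_URL, data=json.dumps(entry).encode(),
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

entry = make_drop_entry("00:00:00:00:00:00:00:01", "drop-198.51.100.7",
                        "198.51.100.7/32", "203.0.113.10/32")
# push_entry(entry)  # would reach the controller if one were running
```

One HTTP POST per entry, and the controller takes care of the flow_mod messages to the switch.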
I first came up with this idea:
- At the OF switches, rewrite the destination IP addresses (i.e., the DDoS targets’ addresses) of packets from legitimate sources.
- These OF switches are connected via a single 10G link to the AS border routers and send the packets back to them through IN_PORT.
- Restore the original destination IP addresses at OF switches near the customer edge.
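Expressed as static flow entries (field names follow the old Floodlight static flow pusher; the addresses and output port are hypothetical), the rewrite-and-restore idea would look roughly like this:

```python
def make_rewrite_entry(dpid, name, src_ip, target_ip, mapped_ip):
    """Border OF switch: rewrite the destination IP of packets from a
    legitimate source so the RTBH blackhole route no longer matches,
    then hairpin the packet back out through IN_PORT."""
    return {
        "switch": dpid,
        "name": name,
        "ether-type": "0x0800",
        "src-ip": src_ip,                     # known-legitimate source
        "dst-ip": target_ip + "/32",
        "active": "true",
        "actions": "set-dst-ip=%s,output=ingress-port" % mapped_ip,
    }

def make_restore_entry(dpid, name, mapped_ip, target_ip, out_port):
    """OF switch near the customer edge: put the original destination
    address back before handing the packet to the last-hop router."""
    return {
        "switch": dpid,
        "name": name,
        "ether-type": "0x0800",
        "dst-ip": mapped_ip + "/32",
        "active": "true",
        "actions": "set-dst-ip=%s,output=%d" % (target_ip, out_port),
    }

# Hypothetical addresses: 198.51.100.10 is the DDoS target, and
# 10.255.0.10 is an internal mapped address that is not blackholed.
rewrite = make_rewrite_entry("00:00:00:00:00:00:00:01", "rw-in",
                             "192.0.2.25/32", "198.51.100.10", "10.255.0.10")
restore = make_restore_entry("00:00:00:00:00:00:00:02", "rw-out",
                             "10.255.0.10", "198.51.100.10", 3)
```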
No packet_in messages are involved here; once it establishes a connection with the master controller, every switch receives a “default drop” static flow entry with the lowest priority. This trick worked perfectly in lab tests using Open vSwitch, but it turned out that no hardware OF switches could do it right. Specifically, there seem to be no OF switches available that support modification of L3 headers (ofp_action) in hardware. I’d be happy to stand corrected if I’m wrong, although someone else may have already heard my rant.
There’s nothing like having an undaunted work buddy, and my partner on this project came up with an alternative. In this second approach:
- OF switches won’t mess with L3 headers.
- Instead, OF switches will modify the vlan-id and send the packets through IN_PORT back to the connected BGP routers.
- The BGP routers have PBR configured on each vif.
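In flow-entry terms (again using the old static flow pusher field names; the vlan-id and addresses are hypothetical), the second approach replaces the L3 rewrite with a VLAN tag:

```python
def make_vlan_steer_entry(dpid, name, target_ip, vlan_id):
    """Tag clean packets for the target with the vlan-id whose vif on
    the BGP router carries the matching PBR next-hop, and hairpin them
    back out through IN_PORT. The L3 headers are left untouched."""
    return {
        "switch": dpid,
        "name": name,
        "ether-type": "0x0800",
        "dst-ip": target_ip + "/32",
        "active": "true",
        "actions": "set-vlan-id=%d,output=ingress-port" % vlan_id,
    }

steer = make_vlan_steer_entry("00:00:00:00:00:00:00:01", "steer-101",
                              "198.51.100.10", 101)
```

The PBR policy on the router’s vif 101 then decides the next-hop, so the OF switch only ever touches the VLAN header.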
I have attached a simplified picture of what this looks like. Multipath issues, controller redundancy, and so on are not included in the picture, but they have been carefully taken into account.
Each BGP router will have two to four possible sets of next-hops, so each router needs two to four vifs on the port connected to its peer OF switch. On each vif, a corresponding PBR policy is preconfigured. The app stores a [router:vif(vlan-id):PBR next-hop] mapping. Upon detecting an attack, the app initiates RTBH and dispatches a series of static flow insertion requests that (a) drop evil packets with a source IP address matching our conditions and (b) steer clean packets all the way to the DDoS target host. The app can do all of this because, as I mentioned, it knows everything: the mappings, the attack sources, the attack target IP, and which set of non-BGP speakers to forward clean packets to.
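Putting the pieces together, the dispatch side of the app might be sketched like this. The mapping, addresses, and priorities are hypothetical, and the endpoint path follows the OpenFlow 1.0-era Floodlight static flow pusher:

```python
import json
import urllib.request

PUSH_URL = "http://127.0.0.1:8080/wm/staticflowentrypusher/json"  # hypothetical

# Hypothetical [router : vif(vlan-id) : PBR next-hop] mapping kept by the app.
VIF_MAP = {
    "bgp-router-1": {
        "dpid": "00:00:00:00:00:00:00:01",
        "vifs": {101: "192.0.2.1", 102: "192.0.2.5"},
    },
}

def build_mitigation_flows(router, target_ip, attack_sources, vlan_id):
    """(a) high-priority drop entries for the identified attack sources,
    plus (b) one lower-priority entry that steers all remaining (clean)
    packets for the target onto the vif whose PBR next-hop reaches it."""
    dpid = VIF_MAP[router]["dpid"]
    flows = [{
        "switch": dpid, "name": "drop-%s-%d" % (target_ip, i),
        "ether-type": "0x0800", "src-ip": src, "dst-ip": target_ip + "/32",
        "priority": "32000", "active": "true",
        "actions": "",                      # empty action list = drop
    } for i, src in enumerate(attack_sources)]
    flows.append({
        "switch": dpid, "name": "steer-%s" % target_ip,
        "ether-type": "0x0800", "dst-ip": target_ip + "/32",
        "priority": "31000", "active": "true",
        "actions": "set-vlan-id=%d,output=ingress-port" % vlan_id,
    })
    return flows

def push(flow):
    """POST one static flow entry to the controller."""
    req = urllib.request.Request(PUSH_URL, data=json.dumps(flow).encode(),
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

flows = build_mitigation_flows("bgp-router-1", "198.51.100.10",
                               ["203.0.113.7/32", "203.0.113.9/32"], 101)
# for f in flows: push(f)   # would reach the controller if one were running
```

The drop entries sit above the steer entry in priority, so anything matching an attack source never reaches the hairpin path.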
Doing DDoS mitigation this way requires a lot of preparation, especially compared to the aforementioned L3 header tweaking trick. These vif/PBR combos aren’t pretty at all. You might wonder if we’d be better off forgetting about OpenFlow to solve this problem. Well, we have stayed with this method for two reasons. For one thing, we were impressed with the simple REST API while coding to test the initial plan. Our app only has to talk to the controllers, which then push flows to the deployed switches. This results in a much cleaner and leaner program than we had expected, and it also means maintenance won’t cause much of a headache. For another, it will be easy to save installed flow entries to the data store and read them back later, because they come from the controller in JSON format. It also appears that flow entries can be stored as-is in a varchar column that can be queried directly with the current version of VoltDB, so neither parsing nor normalizing is needed. I haven’t tried it yet, but it looks promising.
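As a sketch of that last idea, an entry reported by the controller could be kept verbatim and reconstructed later with no parsing pipeline. The JSON shape and table DDL below are illustrative only, and I haven’t verified this against VoltDB:

```python
import json

# Shape of an installed entry as the controller reports it (illustrative).
flow_json = ('{"switch": "00:00:00:00:00:00:00:01", '
             '"name": "steer-198.51.100.10", '
             '"actions": "set-vlan-id=101,output=ingress-port"}')

# Hypothetical DDL: the JSON string goes into a VARCHAR column as-is.
DDL = """
CREATE TABLE flow_entries (
    entry_name VARCHAR(64)   NOT NULL,
    entry_json VARCHAR(2048) NOT NULL
);
"""

# Reading an entry back needs nothing more than json.loads().
restored = json.loads(flow_json)
```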
In any case, this alternative setup worked as expected with Open vSwitch, but was still no good with the hardware switches in our lab. We later found out that the Broadcom chipset in our switches didn’t support forwarding packets to IN_PORT. We were not going to allocate more than one 10G port on each BGP router for a connection to the OF switch just for this, at least not until we needed to filter more than 10G of traffic. But the game wasn’t over yet. NEC kindly offered two PF524x switches for our testing, and lab tests with these hardware switches have been very successful. When injecting 9.9 Gbps of 64-byte packets, 80% of them hypothetical DDoS attacks, we see only about 10 microseconds of added delay on each hairpin route. Soon we will get rolling on the coding work, and I’m very excited about it.
As network engineers simply trying to get packets to the right destination, we often face challenging situations like Sakura’s. We compare all the available solutions and try the better ones. Next time you face a difficult challenge like this, you might consider OpenFlow as an option, because its merits might outweigh its limitations, depending on your situation. A further good thing about OpenFlow is that it requires far less physical equipment than usual to test your design, because you can use Open vSwitch. That said, even if your design is validated in testing with OVS, hardware OF switches might not yield the same results, so a final test with hardware switches is still essential to qualify a design for production.
I hope this use-case story will inspire some of the patient readers who are still with me to give OpenFlow and SDN a look.