The repercussions of the CenturyLink outage are still being felt weeks after the provider announced the problem solved. Even the FCC has gotten involved, announcing an investigation into what happened (see our blog on SD-WAN-Experts for more details). For enterprise IT, the outage should certainly give pause to anyone considering CenturyLink’s backbone as part of their SD-WAN.
But here’s the thing: CenturyLink isn’t the only Internet provider we’ve seen face network problems in the core. Just last summer, for example, the Interoute (now GTT) network went down for more than a day. That came just after a cable cut in June took down Comcast’s network. Going back further, in November 2017, Comcast and several other major ISPs were taken offline for more than an hour when a Level 3 Border Gateway Protocol (BGP) router leaked routes as a result of a misconfigured Autonomous System (AS). There are numerous other examples as well.
The belief that the Internet has enough density in the core to route around problems and keep operating is, like so many other urban myths, untrue. Organizations need to apply caution and healthy skepticism when evaluating the availability of Internet backbones. If Internet-based SD-WANs are to form the backbone of enterprise networks, it’s critical to assess and build sufficient diversity in the network core.
MPLS Thinking Interferes With SD-WAN Redundancy Planning
Focusing on core resiliency is a change for many IT organizations. Traditionally, the prevailing assumption has been that the middle mile, the network core, had sufficient pathing to route around single-event outages. The problem, if there was one, existed in the last mile. Whether it was an errant backhoe or a misconfigured router, many problems could disconnect locations.
That assumption reflects an MPLS way of thinking, where the middle mile was carefully monitored and managed by a single provider. Those who were concerned about outages typically focused on last-mile redundancy, connecting their data centers with multiple MPLS connections from the same provider. Where that’s too expensive, companies will purchase a provider’s MPLS backup services. AT&T’s Anira service, for example, is the company’s failover solution for its MPLS service. Should MPLS drop, the site can use its DIA link to connect to the MPLS core network. Masergy, Verizon, and others offer something similar.
Some enterprises do create core redundancy even with MPLS. One of my customers, for example, a large financial services firm, had dual-homed MPLS connections from two different MPLS providers for all of their primary data centers, creating two MPLS backbones. In my experience, though, that’s the exception, not the rule.
The Performance Problems of Core Redundancy
But as we move towards a world reliant on the Internet, at least in the last mile, it’s time to re-evaluate how we design for redundancy in our networks. With SD-WANs, we’ve seen a lot of focus on last-mile redundancy. The ability to use multiple last-mile access links in parallel, failing over (and failing back) should there be a line outage, is a hallmark of SD-WANs.
Such an approach protects against the failures we’ve seen in the MPLS world, but it won’t protect us against failures in the core of our wide area networks. With SD-WANs, we need to think through those issues as well.
And here’s the rub: ensuring core redundancy flies in the face of conventional Internet engineering. Normally, when connecting sites across the Internet, you try to locate them on the same backbone to minimize the number of hops and so reduce latency. But it’s precisely our tendency to put them on the same backbone that exposes companies to outages from single events like the one that took down CenturyLink.
Ideally, you’ll dual-home every location with connections to two ISPs on two different backbones. This way your SD-WAN solution will be able to pick the optimum path: most of the time, the ISP connected to the backbone shared with your other locations and, in the event of an outage or brownout on that primary backbone, the ISP connected to the alternate backbone.
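To make the dual-homing idea concrete, here is a minimal Python sketch of how a path selector might choose between two ISPs on different backbones. It is an illustration only, not any vendor’s implementation: the `Link` fields, the `select_path` function, and the loss and latency thresholds used to define a "brownout" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    """One last-mile connection and the backbone its ISP rides on (hypothetical model)."""
    isp: str
    backbone: str
    up: bool
    loss_pct: float
    latency_ms: float


def is_healthy(link: Link, max_loss_pct: float = 2.0, max_latency_ms: float = 150.0) -> bool:
    # Assumed thresholds: a link that is up but exceeds them is treated as a brownout.
    return link.up and link.loss_pct <= max_loss_pct and link.latency_ms <= max_latency_ms


def select_path(links: List[Link], preferred_backbone: str) -> Optional[Link]:
    """Prefer a healthy link on the backbone shared with other sites; otherwise fall back."""
    healthy = [link for link in links if is_healthy(link)]
    if not healthy:
        return None  # no usable path at all
    # Sort key: links on the preferred backbone first, then lowest latency.
    return min(healthy, key=lambda l: (l.backbone != preferred_backbone, l.latency_ms))


# Normal operation: both links are healthy, so the shared backbone wins.
links = [
    Link("isp-a", "backbone-1", up=True, loss_pct=0.1, latency_ms=40.0),
    Link("isp-b", "backbone-2", up=True, loss_pct=0.2, latency_ms=55.0),
]
print(select_path(links, "backbone-1").isp)  # isp-a

# Brownout on the primary backbone: traffic moves to the alternate backbone.
links[0].loss_pct = 12.0
print(select_path(links, "backbone-1").isp)  # isp-b
```

The point of the sketch is the ordering: the shared backbone is preferred for latency, but only while it is healthy, which is what keeps a single-backbone event from taking the site offline.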
SD-Core: Multiple Internet Cores Without the Performance Problems
Of course, not all locations have easy access to ISPs sitting on multiple Tier 1 backbones. On this score, SD-Core solutions may offer some help. Because providers such as Mode and Aryaka expressly market their own global networks, at least you know they have a vested interest in maintaining them.
Cato Networks goes a step further and connects its PoPs to multiple (I believe three or more) Tier 1 backbones. Should one network fail or degrade, the PoPs automatically select an alternative backbone.
Of course, you’re still exposed to a PoP failure with SD-Core solutions, but diverse PoPs are available with some providers. Cato, at least, took steps to address this point. The company announced self-healing capabilities last fall where, amongst other things, should one of their redundant PoPs actually fail, the connected locations will automatically rehome to the next closest PoP.
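The rehoming behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not Cato’s actual mechanism: the `rehome` function and the PoP names are hypothetical, and "closest" is assumed here to mean the lowest measured round-trip time from the site.

```python
from typing import Dict, Optional, Set


def rehome(site_rtt_ms: Dict[str, float], failed_pops: Set[str]) -> Optional[str]:
    """Return the closest PoP that has not failed, or None if none remain.

    site_rtt_ms maps a PoP name to the round-trip time measured from the site.
    Using RTT as the measure of "closest" is an assumption of this sketch.
    """
    candidates = {pop: rtt for pop, rtt in site_rtt_ms.items() if pop not in failed_pops}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical measurements from one site to three PoPs.
rtts = {"pop-fra": 12.0, "pop-ams": 18.0, "pop-lon": 25.0}

print(rehome(rtts, failed_pops=set()))        # pop-fra
print(rehome(rtts, failed_pops={"pop-fra"}))  # pop-ams
```

When evaluating a provider, the questions this sketch raises are the ones worth asking: how failure is detected, how "closest" is measured, and how long the rehoming takes.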
But regardless of the approach you take for your SD-WAN, one thing is clear: be sure to think through the level of redundancy you’ll need in your core and your last mile. Otherwise, the repercussions of the next Internet outage might have, shall we say, more personal consequences.