“The right to dissent is the essence of democracy-the will to dissent is an effective safeguard against judicial lethargy-the effect of a dissent is the essence of progress.” – Justice Jesse W. Carter 1953.
I failed. It might have been my lukewarm, now tasteless tea or his lengthy, salesy slides, but I finally failed and a small chuckle escaped. It escaped and made the vendor’s SE pause his presentation on the enormous benefits of SD-WAN. He shot a bewildered look my way. For lack of a better word, Jim awkwardly probed me:
“Any mistakes you noticed?”
“Nah…it’s just the boardgame syndrome…”
I will come back to the boardgame syndrome and its dangers, but first, I dissent.
I dissent because I believe that in late 2019, while any junior SE or self-proclaimed “expert” with the experience of implementing SD-WAN for their own company can lecture on why we need SD-WAN to breathe normally, almost no credible work has been done to bring an opposing perspective to the discussion. Furthermore, as Justice Carter pointed out years ago, I dissent because I deeply fear the follow-the-crowd fog that is hanging over the WAN market. Finally, I dissent because I am not happy with our inaction for almost two decades.
The choice of SD-WAN has become so obvious that it makes younger engineers raise a very valid question: “How did you guys really survive for decades without it?” That is a great and perhaps funny question. Pure luck and small scale back in the day, I guess. Right? But some of us ran networks of thousands of routers connected to some wicked circuits, and those networks are still alive and have been serving millions of customers with minimal impact. How did they survive without SD-WAN for so long?
All the marketing smoke screen aside, SD-WAN grew in the gaping space created by failures on two sides: the vendors and the operators. For almost two decades the WAN equipment manufacturers and even protocol designers basked in the belief that a WAN router should do basic exterior routing (say BGP), some encryption (say IPsec) and perhaps some rudimentary path performance check before routing. With such mindsets floating around, improvements remained mostly limited to vertical scaling; in other words, “horsepower”.
Well, it worked for quite some time. But now the word is out and everyone has realized the emperors are naked. Straight out of such cozy insobriety, the emperors raced to yank businesses grown in that very gap. Quite a reasonable and predictable RE-action. To recap the dossier, the emperors in fact never bothered to seriously explore much-needed WAN features such as hitless upgrades, quiet but immediate patching, AI-based path health checks, circuit switching, central management and even something as basic as better analytics. Gauged by any modern standard, that is a lot of negligence in one scoop. There are plenty of examples out there of how the world treats such negligence; the taxi industry can probably tell you how it felt when someone else started rolling into its territory. Different lines of business, yet the same trajectory.
Nonetheless, I hate to knock the hardware manufacturers alone without placing the blame on all the parties involved. So the dossier also names the carriers and even the consumers: the folks who are adept at publishing RFOs and coming clean about how they dropped the ball for the millionth time, and the people who never insisted on raising the bar for their vendors and happily purchased whatever became available. All three have to be called out.
That was “the case” against us, yet I believe it would be beneficial to remember how we got here. We all knew our WAN deployments could consist of a large number of routers. We soon realized they had to be patched frequently and that this would cause disruptions. Moons ago, we also noticed some new applications would need fresh QoS settings. But we had 1500 WAN routers to visit. Guess what? Again, for almost two decades the lazy went out and hired more engineers and the smart opted to long for friendlier APIs. Both worked out. We acknowledged the need for smarter TCP stacks on our WAN equipment and for smarter file transfer protocols, but decided to let it pass and even paid for an extra box to achieve that. Also, it goes without saying, we experienced packet loss the same way we had in the 90s. Still we, the consumers, put our traffic blindly on a random circuit while most of us quietly wished there were a bit of brain somewhere to make such decisions, especially at massive scale. No one spoke up and no one really offered any sustainable solution for many years. That is how we wound up here.
Let’s face it: none of these were truly “emerging” needs when SD-WAN started to grow. They had been known for years. For the record, that was the world in which SD-WAN became a thing.
Up to this point this has been a very lengthy background. But SD-WAN is here. Over the years I have come up with a list of points you might want to consider before ripping everything out of your current WAN infrastructure, and I think they could help many decision makers.
This document is structured to be consumed by both technical folks and senior leaders. I hope it conveys the right message to the right audience:
- The tale of SD-What?
Most people have already forgotten this, or in other words “expanded their vision”, but I think it’s worth noting that less than a decade ago, when the early ideas of network “switch” programmability were coming to life, we were hoping to make inexpensive bricks with super simple commodity silicon. They were supposed to be programmed at the forwarding level by a central brain, perhaps per flow: an impressive and massive breakthrough. OpenFlow was in fact the embodiment of such ideas. I genuinely don’t think that, when the idea was being developed, anyone close to the academic world would’ve labeled “central management” as SDN (Software Defined Networking). Now you have every right to loosely call your garden-variety central LDAP server SD-Credentials and your Domain Controller SD-domain. With a bit of a stretch, even your Microsoft WSUS (central patching server) can very well be your SD-patching; you define policies, download and distribute patches and monitor your progress. Then there are those of us who have had industrial networking experience, and we would tell you that by the same standards, the SCADA and PLC systems from 20-30 years ago were doing Software Defined (SD) something this whole time. It gets a bit uncomfortable at the flow level, but let’s not get hung up on the terminology; after all, SD-WAN devices can manipulate paths, even though they might do so using the same hefty piece of hardware (or silicon). There is more to it, but as Bill Maher would tell you: “I guess anything is possible”.
- The tale of “those expensive MPLS circuits”
Up until recently, especially in places outside the United States, layer 2 or 3 MPLS circuits were generally and significantly pricier than their equivalent DIA (Direct Internet Access) circuits. They sold well, eventually replaced Frame Relay and ATM networks, and held that place for many years. They were marketed as QoS-friendly and away from the Wild Wild West Internet, and they arguably emerged as more secure and stable solutions. All in all, many customers were able to make a case and get their CFOs to sign those fat recurring checks. But over time, the landscape changed, and in fact changed dramatically. Nowadays, DIA circuits are much more stable and, although they do not support end-to-end QoS, can still be relied upon for most types of traffic. The industry has come a long way to get to this point. Our carriers, who for the most part offer both services, had to RE-act, and this frenzy is rapidly shrinking the pricing gap between your L2/3 MPLS and DIA circuits. So here goes my banal adage: if you are setting aside a hefty budget to roll out SD-WAN primarily to reduce your circuit costs, you might want to go back to your dusty MPLS contracts and push your carriers much harder by putting a few competing offers on the table. Also give DIA/IPsec a fair POC chance to make sure it really does not work for your traffic. Keep in mind things have changed drastically and investigations on paper are not enough; test per application.
- The tale of Central Management
Just imagine how alluring it would be if you had a network of 3000 WAN routers and someone walked in and offered a “single pane of glass” to manage them centrally. And by the word manage, I mean let’s be creative: you could make configuration changes including QoS, you could patch, you could monitor or do anything for which you would normally have to SSH into those individual boxes. This sounds like utopia to me, a legitimate wish neglected for decades by vendors. Up to this point, your SE or your “expert” friend is talking; but you also have to remember that even years ago there was a tiny community of operators who found out that hiring 30 or 300 engineers to maintain those 3000 routers wasn’t really practical. The automation world jumped in and most if not all of those tasks became doable through some form of rather “simple scripting”. Then in turn the scripting world and market evolved, and now we are at a point where hiring and training such resources is not a daunting task anymore. Additionally, as you might have heard, the modules and APIs available to your engineers for creating such tools are much more mature compared to what we had 10 years ago. So again, here goes the same sentence: if you are setting aside a hefty budget to roll out SD-WAN because lack of WAN central management is your primary pain point, weigh your in-house automation muscles first. If you don’t see any, then you might have a bigger problem; a problem that can cost you more and more in the future, way past your SD-WAN decision point. I would fix that first.
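For the technical folks, here is a rough idea of what that “simple scripting” can look like. It is only a minimal Python sketch under loud assumptions: the `Router` record, `render_qos`, `roll_out` and `dry_run_push` are hypothetical names made up for illustration, the config lines are not any vendor’s exact syntax, and the transport is left as a pluggable function because how you reach the box (SSH, NETCONF, a REST API) depends entirely on your gear.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Router:
    """Hypothetical inventory record for one WAN router."""
    name: str
    wan_interface: str
    voice_percent: int  # share of bandwidth reserved for voice


def render_qos(router: Router) -> List[str]:
    """Build the per-router change from one shared template.

    The lines are illustrative only, not any vendor's exact syntax.
    """
    return [
        f"policy-map WAN-EDGE-{router.name}",
        f" class VOICE priority percent {router.voice_percent}",
        f"interface {router.wan_interface}",
        f" service-policy output WAN-EDGE-{router.name}",
    ]


# The transport is an assumption: plug in whatever reaches your devices.
PushFn = Callable[[Router, List[str]], None]


def roll_out(routers: List[Router], push: PushFn, workers: int = 20) -> Dict[str, str]:
    """Push the rendered change to every router in parallel and report per box."""

    def attempt(router: Router) -> Tuple[str, str]:
        try:
            push(router, render_qos(router))
            return router.name, "ok"
        except Exception as exc:  # one bad box must not stop the rollout
            return router.name, f"failed: {exc}"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(attempt, routers))


def dry_run_push(router: Router, lines: List[str]) -> None:
    """Stand-in transport that changes nothing on any device."""
    return None


# One change, 3000 routers, one report.
fleet = [Router(f"site{n:04d}", "wan0", 20) for n in range(1, 3001)]
report = roll_out(fleet, dry_run_push)
failed = sorted(name for name, status in report.items() if status != "ok")
print(f"{len(report) - len(failed)} ok, {len(failed)} failed")
```

The point is not the forty lines themselves but the shape: one template, one inventory, one report, and a failure on a single box that does not halt the other 2999.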
- The tale of midnight traffic shifts
Most readers have woken up at least a few times in the middle of the night to respond to a page from some frustrated remote site and shift traffic away from an unhealthy circuit to the backup path. I’ve done my own fair share too. The rules were simple: you would challenge the user, ping from the edge, give up, curse the carrier, open a case with them and make sure your route maps steered the traffic in the right and healthy direction, then go back to bed and keep your fingers crossed that the secondary circuit wouldn’t start acting up as well. Then along came the follow-the-sun model, but the pain didn’t really go away. There would still be some “reactive” chain of actions waiting for a user to complain about a slow application or some choppy voice caused by packet loss, jitter or unusual latency. Well, now your SD-WAN SE is congratulating you because we’ve “invented” something that can continuously probe the health of your circuits and route traffic onto the healthier path. This is great, but perhaps not great “news”. Some WAN hardware vendors have had mechanisms for over a decade to detect such adverse conditions and to steer traffic in the healthy direction, but what if they are now pulling those “features” in order to market their new SD-WAN solutions? Again, here goes the same sentence: if you are setting aside a hefty budget to roll out SD-WAN to perform circuit health checks and circuit “switching”, weigh your in-house automation muscles first. Without turning this piece into a network or software architecture discussion, there are still plenty of ways, on-the-box, in-path or off-path, to measure adverse circuit conditions, run them against pre-defined health criteria and “tell” your route engine what to do next. It doesn’t have to be off-the-shelf.
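To make “measure, compare against health criteria, tell the route engine” a bit more concrete, here is a minimal Python sketch of just the decision step. Everything in it is an assumption for illustration: the thresholds in `Criteria` are made-up numbers, `summarize`, `healthy` and `choose_path` are hypothetical names, the probe round-trip times are presumed to be collected elsewhere, jitter is simplified to the mean difference between consecutive answered probes, and actually applying the decision (a route-map change, an API call) is left out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Criteria:
    """Pre-defined health criteria; the numbers are illustrative, not advice."""
    max_loss_pct: float = 1.0
    max_latency_ms: float = 150.0
    max_jitter_ms: float = 30.0


def summarize(rtts_ms: List[Optional[float]]) -> Dict[str, float]:
    """Reduce one probe window to loss, latency and jitter.

    A None entry is a probe that never came back.
    """
    if not rtts_ms:
        raise ValueError("empty probe window")
    answered = [r for r in rtts_ms if r is not None]
    loss = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    if not answered:
        return {"loss_pct": loss, "latency_ms": float("inf"), "jitter_ms": float("inf")}
    # Simplified jitter: mean absolute change between consecutive answers.
    deltas = [abs(b - a) for a, b in zip(answered, answered[1:])]
    return {
        "loss_pct": loss,
        "latency_ms": mean(answered),
        "jitter_ms": mean(deltas) if deltas else 0.0,
    }


def healthy(stats: Dict[str, float], criteria: Criteria) -> bool:
    """A circuit is healthy only if it meets every criterion."""
    return (
        stats["loss_pct"] <= criteria.max_loss_pct
        and stats["latency_ms"] <= criteria.max_latency_ms
        and stats["jitter_ms"] <= criteria.max_jitter_ms
    )


def choose_path(
    active: str,
    windows: Dict[str, List[Optional[float]]],
    criteria: Criteria,
    preference: List[str],
) -> str:
    """Stay on the active circuit while it is healthy; otherwise move to the
    first healthy circuit in preference order; if none is healthy, stay put."""
    verdict = {name: healthy(summarize(w), criteria) for name, w in windows.items()}
    if verdict.get(active, False):
        return active
    for name in preference:
        if verdict.get(name, False):
            return name
    return active


# The primary is dropping half its probes; the backup is clean.
windows = {
    "mpls": [20.0, None, 22.0, None],
    "dia": [30.0, 31.0, 29.0, 30.0],
}
print(choose_path("mpls", windows, Criteria(), ["mpls", "dia"]))
```

Staying on the active circuit while it is healthy is a deliberate choice in this sketch: it avoids flapping back and forth every time the preferred circuit briefly recovers. A production version would want more than that, such as requiring several consecutive bad windows before moving.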
- The tale of my rusty “kitchen sink”
Leave my kitchen sink alone. I remember the conversation I had with my friend, boss and peer Scott Sanders many years ago in his dark office, when we were kicking around the idea of consolidating several WAN functions in one cheap commodity box for small to medium offices. This was years before vendors started wondering why firewalls, DNS, DHCP, malware protection, IPS, AES encryption and the Suite-B stuff, NAT, VRF mapping, WAN optimization, network monitoring and my kitchen sink could not be packed into their recent “invention”, which does routing, health check monitoring and circuit switching as well. Back then, we decided not to pursue the idea. The decision was made primarily because merchant silicon wasn’t really geared toward WAN-specific needs such as heavy encryption with AES-256. Now the silicon is available, but remember that back then I also had concerns around security, scalability, redundancy, service dependency, design complexity and a few other less obvious things such as teams and responsibilities. Most of those arguments are still valid and need to be addressed carefully before collapsing more services. Out of all the items above, probably the most curious and least analyzed case is WAN optimization, and how the two markets can and probably should eventually converge. Yet again, the homework remains on your desk to make sure you understand all the aspects and consequences of such service consolidation.
- But what happened to that sales meeting?
By this point you might be wondering how the meeting with the SD-WAN salesperson progressed. I will tell you: not so well. Not because I had already replaced my expensive circuits with much cheaper ones, and not just because my number of WAN incidents had dropped significantly in the last 5 years. It was not even because I had hired a few smart automation engineers to do central management as well as circuit health checks, and not even because I still had reservations around service consolidation. Nope. A good salesperson can assure you none of that is really true in your “particular” case. They can also sing in chorus with your traditional engineers who are unwilling to learn to code, and write off your home-grown automation ideas as complicated, sheer wheel-reinvention.
None of that. In fact, I blame that colossal shiny silver elephant in the room. I would’ve genuinely entertained the idea if they, or just any SD-WAN vendor, checked the golden box that I’ve spent my entire career defending and promoting: to always remain vendor agnostic. The only place where I have allowed monopoly or Monopoly to exist is a toy room. For reasons the readers can all imagine, I would like to make entirely sure that at any point in time my teams can replace one vendor or carrier with another, fully or partially, and the entire system will remain functional with no degradation whatsoever. I understand there have been efforts even at the IETF level to standardize some parts of the SD-WAN world, but as it stands right now, I dissent and categorize it as the most invasive attempt since the mid-80s by anyone in this industry to monopolize such a significant portion of it. One might argue here that we did welcome such attempts previously when we opened our doors to DMVPN, FabricPath or vendor-specific MLAGs, but to compensate for those technological conveniences, many of us went with the option of hiring smarter people and developing talent to architect around such limitations, and still achieved the same goals. It might be just me, but at least at this stage I am not willing to give up 100% of my leverage in a critical space such as the WAN to achieve things that can be achieved through other means.
- Special thanks to Ali So. for his insightful feedback and edits.