Why is making a change to the network one of those hard things that makes an IT department sigh? For indeed, plentiful sighing is exactly what happens. An engineer conjures up a change. At many companies, that change is submitted to a change review board. The first-level reviewers see the change. And they sigh.
“Here we go. A network change. What are we breaking this time?”
The careful, skeptical scrutiny begins. What, exactly, is being changed? Is the spanning-tree topology being affected? Will there be a routing table convergence event? Are access-lists being impacted? What about VPN tunnel crypto configurations? How hard would it be to back the change out? Which customers might be affected if things go sideways? Who’s on call during the change window? When is the change window, anyway? Did Susan look at this change yet? Because that part of the network is her design, so she’s the one that should really give us the go-ahead…
And on it goes, from one reviewer to another, with much additional sighing along the way. Some sign off, but then someone rejects, kicking it back to the submitter. After yet another sigh, the submitter fixes the issue, and the process starts all over. This time the change makes it to middle management, who gets fretful and calls a meeting to discuss, but signs off in the end with a resigned sigh.
Finally, all the approvals are done. The change is scheduled for a time that is deemed to be the least risky to the business and the most inconvenient for the engineer. Let’s say 2am on a Saturday, when the engineer is neither awake enough to properly do the job, nor in a positive frame of mind.
Again, why? Why is making a network change so hard? Why all the sighing? Because the network is an infrastructure resource that’s common to all of the rest of the IT infrastructure. If a network change causes a network outage, the blast radius is potentially huge, affecting large numbers of IT systems, business units, and ultimately customers. And so it is that network changes, especially significant ones, are looked at askance by those who have been burned before.
What improves the process of making changes to the network? What de-risks the process? Knowing exactly what a change will accomplish de-risks the process. In traditional change management, senior engineers are expected to understand the nuance of a change, weigh the business impact, and approve or deny with appropriate notations.
However, the larger a network gets, the more difficult it becomes to truly understand the implications of every potential change. Modern networks tend to be discrete smaller networks stitched together. They are not homogeneous. Depending on their history and the humans who built them, these disparate networks are unlikely even to be predictable. As an organization’s network grows, the risk of harm from changes actually increases. At some point, we don’t know what we don’t know.
Even when a well-understood, low-risk change is made, there is room for human error. Typing the wrong command can cause a problem, as can typing the right command at the wrong time.
One solution to de-risk network changes is lab testing. The problem with this approach is always the lab itself. Usually, there isn’t one. Or if there is one, the lab is made up of leftover components from upgrade projects. The lab equipment lacks parity with production. That doesn’t mean running a change against the lab is a waste of time. It just means that the lab exercise might not catch every problem that could crop up when running the change against the production network.
Many engineers model changes in virtual network simulators these days, using tools like Junosphere from Juniper or VIRL from Cisco to model an approximation of their production environment and test changes. For the real world of multi-vendor networks, network.toCode()’s multi-vendor labs might be the best option around, short of building a multi-vendor virtual lab of your very own.
A significant step beyond a lab that merely approximates the production environment is a simulator that models the production environment. Simulation software ingests configurations and state from the network and runs them against software that is supposed to emulate the exact behavior of the production network, right down to the incremental version numbers.
Such models are more likely to reveal the problems a change might introduce. Running a change through a model that replicates the production topology in exacting detail should bring more confidence to the change control process, offering sigh relief.
Introducing Forward Networks.
Forward Networks is the most recent entrant to the production network simulator market. Their approach is that of network assurance, consisting of three main elements.
1. Search. With the Forward platform, operators can submit a simple query and get back an answer. Forward describes this function as Google-like. Thus, Forward claims to be bringing search to networking. For what it’s worth, there are other network-related tools that use a natural language query interface. Arkin, purchased by VMware in 2016, leaps to mind.
In any case, what are operators searching against? Forward builds a full-fledged model of the network using all network device configurations, as well as dynamic network state such as the routing entries in the RIB. They query the network constantly to maintain the model in near real time using SNMP, screen scrapes, and API calls (when they can get them).
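Forward doesn’t say how its collectors turn screen-scraped output into model data, so here is a generic sketch of the idea in Python. It parses a few lines of simplified, IOS-style `show ip route` text into route entries a model could consume. The sample text, the regex, and the `parse_routes` helper are all hypothetical, not Forward’s code. Anything the parser doesn’t recognize is set aside rather than silently dropped.

```python
import ipaddress
import re

# Hypothetical pattern for simplified IOS-style route lines, e.g.
# "O    10.1.0.0/16 [110/20] via 192.168.1.2, 00:01:02, GigabitEthernet0/1"
ROUTE_LINE = re.compile(
    r"^(?P<code>[A-Z*]+)\s+"
    r"(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+"
    r"\[(?P<ad>\d+)/(?P<metric>\d+)\]\s+"
    r"via\s+(?P<next_hop>\d+\.\d+\.\d+\.\d+)"
)


def parse_routes(text):
    """Return (routes, unparsed): routes keyed by prefix, plus lines we couldn't read."""
    routes, unparsed = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ROUTE_LINE.match(line)
        if match is None:
            unparsed.append(line)  # keep it visible instead of guessing
            continue
        prefix = str(ipaddress.ip_network(match["prefix"], strict=False))
        routes[prefix] = {
            "protocol": match["code"],
            "admin_distance": int(match["ad"]),
            "metric": int(match["metric"]),
            "next_hop": match["next_hop"],
        }
    return routes, unparsed


# Hypothetical scraped output
SAMPLE = """
O    10.1.0.0/16 [110/20] via 192.168.1.2, 00:01:02, GigabitEthernet0/1
S*   0.0.0.0/0 [1/0] via 203.0.113.1
O IA 10.2.0.0/16 [110/30] via 192.168.1.2, 00:01:02, GigabitEthernet0/1
"""

routes, unparsed = parse_routes(SAMPLE)
print(routes)
print("needs attention:", unparsed)
```

Here the `O IA` line lands in `unparsed` because the regex assumes a one-token route code. Every format quirk like that, multiplied across vendors and software releases, is parser upkeep somebody has to do.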
Forward can handle configurations and state data from a variety of switches, routers, firewalls, and load balancers, describing the end result as “a copy of your network in software.” To learn how the network will forward traffic, they run the model against a revolutionary mathematical algorithm that allows Forward to trace the so-called “all packet.”
I dislike words like “revolutionary” that tend towards marketing hyperbole, but revolutionary is the word Forward used to describe their math. Since every possible packet (the “all packet”) can be tested against the model because of this algorithm, perhaps revolutionary isn’t too grandiose.
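The article doesn’t describe Forward’s math, so what follows is only a toy illustration of the general idea of tracing sets of packets instead of one packet at a time. It considers just the destination IPv4 address (a real header space has many more fields), represents each set of addresses as a list of integer ranges, and pushes the entire address space through hypothetical longest-prefix-match forwarding tables in one pass. The device names, table format, and helper functions are made up for the example.

```python
import ipaddress

# A set of destination addresses, stored as disjoint, inclusive (low, high) ranges.
FULL_SPACE = [(0, 2**32 - 1)]


def prefix_range(prefix):
    net = ipaddress.ip_network(prefix)
    return int(net.network_address), int(net.broadcast_address)


def intersect(space, lo, hi):
    """Addresses in `space` that also fall within [lo, hi]."""
    return [(max(a, lo), min(b, hi)) for a, b in space if max(a, lo) <= min(b, hi)]


def subtract(space, lo, hi):
    """Addresses in `space` that fall outside [lo, hi]."""
    out = []
    for a, b in space:
        if b < lo or a > hi:
            out.append((a, b))
            continue
        if a < lo:
            out.append((a, lo - 1))
        if b > hi:
            out.append((hi + 1, b))
    return out


def partition(space, table):
    """Split a set of destinations by longest-prefix match: {next_hop: ranges}."""
    result, remaining = {}, list(space)
    for prefix in sorted(table, key=lambda p: ipaddress.ip_network(p).prefixlen, reverse=True):
        lo, hi = prefix_range(prefix)
        matched = intersect(remaining, lo, hi)
        if matched:
            result.setdefault(table[prefix], []).extend(matched)
        remaining = subtract(remaining, lo, hi)
    if remaining:
        result["DROP"] = remaining  # no route for these addresses
    return result


def trace_all(network, device, space=FULL_SPACE, path=()):
    """Push a whole set of destinations through the model; yield (path, outcome, ranges)."""
    path = path + (device,)
    for hop, subset in partition(space, network[device]).items():
        if hop in path:
            yield path + (hop,), "LOOP", subset
        elif hop in network:
            yield from trace_all(network, hop, subset, path)
        else:
            yield path, hop, subset  # delivered, dropped, or left the modeled network


# Hypothetical two-device network: core mistakenly points 10.2.0.0/16 back at edge.
NETWORK = {
    "edge": {"10.0.0.0/8": "core", "0.0.0.0/0": "internet"},
    "core": {"10.1.0.0/16": "deliver", "10.2.0.0/16": "edge"},
}

for path, outcome, ranges in trace_all(NETWORK, "edge"):
    shown = ", ".join(f"{ipaddress.ip_address(a)}-{ipaddress.ip_address(b)}" for a, b in ranges)
    print(" -> ".join(path), "|", outcome, "|", shown)
```

Every one of the 4,294,967,296 IPv4 destinations lands in exactly one bucket (delivered, exited, dropped, or looping), which is the appeal of exhaustive analysis over spot-checking a handful of test packets.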
2. Verify. With the Forward Networks model, it becomes possible to perform testing without impacting production. For example, operators can have Forward constantly ask the question, “Is my network doing what it’s supposed to be doing all the time?”
This verification capability is a sort of unit testing, a concept borrowed from the world of software development. With such unit testing, Forward can demonstrate that key policies and intent are, in fact, being enforced on the network. When policies are no longer working as intended, Forward can raise an alert regarding the out-of-compliance situation.
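Here is a rough sketch of what verification-as-unit-testing could look like, assuming a toy model made of forwarding tables and port-based deny rules. Each policy is written like a test case with an expected outcome, and `verify` returns whichever policies the model no longer satisfies. The model format, policy names, and helper functions are all hypothetical; Forward’s own policy interface isn’t described here.

```python
import copy
import ipaddress

# Hypothetical toy model: per-device routes (prefix -> next hop) and deny rules
# (prefix, destination port). "local" means the packet is delivered here.
MODEL = {
    "branch": {"routes": {"10.0.0.0/8": "core", "0.0.0.0/0": "internet"}, "deny": []},
    "core": {
        "routes": {"10.1.0.0/16": "local", "10.9.0.0/16": "local"},
        "deny": [("10.9.0.0/16", 22)],
    },
}


def lookup(routes, addr):
    """Longest-prefix match; returns the next hop, or None when nothing matches."""
    best = None
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


def forward(model, device, dst, port, max_hops=16):
    """Trace a single packet through the model and report what happens to it."""
    addr = ipaddress.ip_address(dst)
    for _ in range(max_hops):
        node = model[device]
        if any(addr in ipaddress.ip_network(p) and port == denied for p, denied in node["deny"]):
            return "denied"
        hop = lookup(node["routes"], addr)
        if hop is None:
            return "dropped"
        if hop == "local":
            return "delivered"
        if hop not in model:
            return hop  # left the modeled network, e.g. "internet"
        device = hop
    return "loop"


# Policies as test cases: name -> (source device, destination, port, expected outcome)
POLICIES = {
    "branch reaches app servers on 443": ("branch", "10.1.2.3", 443, "delivered"),
    "no SSH from branch to management": ("branch", "10.9.0.5", 22, "denied"),
    "branch internet traffic exits locally": ("branch", "8.8.8.8", 443, "internet"),
}


def verify(model, policies):
    """Return the names of policies the model no longer satisfies."""
    return [
        name
        for name, (src, dst, port, expected) in policies.items()
        if forward(model, src, dst, port) != expected
    ]


print(verify(MODEL, POLICIES))  # all policies hold, so nothing is reported

drifted = copy.deepcopy(MODEL)
drifted["core"]["deny"].clear()  # simulate someone removing the SSH block
print(verify(drifted, POLICIES))
```

Re-running `verify` on every model update is the “constantly ask the question” part. In the drifted copy, the SSH deny rule is gone, and the check names the broken policy so it can be raised as an alert.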
The model also allows for troubleshooting of unexpected problems that arise on the network, keeping operators off of production equipment as long as possible while still making headway towards problem resolution.
3. Predict. To address sighs introduced by the change control process, the Forward platform can be used to test proposed changes. Will the change accomplish what is intended? Will the change accomplish anything that is NOT intended, like, oh I don’t know…break the network?
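A sketch of the predict step, under the same kind of toy assumptions: apply a proposed change to a copy of the model, trace a set of probe flows before and after, and report every flow whose path changed. The model, the change format, the probe list, and the helpers are hypothetical. In this example the engineer adds a route for a new 10.3.0.0/16 subnet on `core` but forgets the matching route on `dc`, and the prediction shows traffic that used to be dropped now looping between the two.

```python
import copy
import ipaddress

# Hypothetical toy model: device -> {prefix: next hop}. "local" means delivered.
MODEL = {
    "edge": {"0.0.0.0/0": "isp", "10.0.0.0/8": "core"},
    "core": {"10.1.0.0/16": "local", "10.2.0.0/16": "dc"},
    "dc": {"10.2.0.0/16": "local", "0.0.0.0/0": "core"},
}
PROBES = [("edge", "10.1.0.5"), ("edge", "10.2.0.5"), ("edge", "10.3.0.5"), ("edge", "8.8.8.8")]


def lookup(routes, addr):
    """Longest-prefix match; returns the next hop, or None when nothing matches."""
    best = None
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


def trace_path(model, src, dst, max_hops=16):
    """Return the hop-by-hop path, ending in an exit name, 'DROP', or 'LOOP'."""
    addr = ipaddress.ip_address(dst)
    hops, device = [src], src
    for _ in range(max_hops):
        hop = lookup(model[device], addr)
        if hop is None:
            return tuple(hops + ["DROP"])
        hops.append(hop)
        if hop not in model:
            return tuple(hops)  # delivered ("local") or left the model ("isp")
        device = hop
    return tuple(hops + ["LOOP"])


def apply_change(model, change):
    """Return a changed copy; a None next hop removes the route. Production is untouched."""
    candidate = copy.deepcopy(model)
    for device, routes in change.items():
        for prefix, hop in routes.items():
            if hop is None:
                candidate[device].pop(prefix, None)
            else:
                candidate[device][prefix] = hop
    return candidate


def predict(model, change, probes):
    """Map each probe whose path differs to its (before, after) paths."""
    after_model = apply_change(model, change)
    diffs = {}
    for src, dst in probes:
        before, after = trace_path(model, src, dst), trace_path(after_model, src, dst)
        if before != after:
            diffs[(src, dst)] = (before, after)
    return diffs


# Proposed change: route the new 10.3.0.0/16 subnet to dc, but dc has no route for it.
for flow, (before, after) in predict(MODEL, {"core": {"10.3.0.0/16": "dc"}}, PROBES).items():
    print(flow, before, "->", after[:5], "...", after[-1])
```

Adding the missing `dc` route to the change (`{"dc": {"10.3.0.0/16": "local"}}`) leaves only the intended difference: 10.3.0.5 goes from dropped to delivered. Any difference outside the flows you meant to change is a question to answer before the 2am change window, not during it.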
The view from the hot aisle.
This time, I want to believe. I want desperately to have a network model that I can run tests against and trust to reflect my ultimate production reality. I want a model that is a source of truth, and not merely a source of rough approximation. However, I have not had good luck with these sorts of products in the past. To be blunt, they’ve never worked right. They’ve required a dedicated human or two to keep them running, and were terribly buggy. And if you think about it, bugs in your testing software introduce an ironic twist to the entire idea.
To make me a believer, Forward Networks will need to consistently absorb all vendor software changes across a huge variety of platforms. That is no small undertaking, especially considering the current ugliness of the SNMP and screen scraping that make up the preponderance of their data-gathering interfaces. Fortunately, most networks don’t adopt every software iteration that comes from a vendor, so as long as Forward can stay a step or two ahead of their customers, this should be okay. And Forward says that they are already doing this for the vendor gear they support.
Forward Networks also needs to convince me that hardware doesn’t matter as much as my gut tells me it does. By that, I mean that Forward Networks’ model is all about software configuration and tracking real-time state. Certainly, that’s important, but the fact is that network software runs on complex hardware, usually quite specialized, and with a variety of capabilities. For example, take a look at the wide variety of Cisco Nexus chassis line cards to get a sense of the tight coupling of hardware, software, features, and performance most of the vertically integrated networking industry is cursed with.
That said, hardware limitations and quirks would be incredibly hard to model. I’m not suggesting that Forward Networks needs to somehow emulate every bit of silicon on the market, even if they could get their hands on the information required to do such a thing. Rather, this is a caution flag raised for those who might be considering network simulation products, including Forward Networks. What problem are you trying to solve? If you’re looking for a useful tool to help catch errors before they happen, there’s a strong use case here. If you’re looking for a tool to 100% emulate your entire network right down to the silicon and the PHYs in software, this is not that tool. As far as I know, there is no such tool.
Managing your expectations properly will help you determine whether or not Forward Networks should occupy some of the coveted space in your toolkit. I advise lots of testing before buying. Put the product through its paces. Verify the accuracy of the model that it builds. Make sure the network operating systems in use on your network are fully supported. Make sure that operators can make use of it without having to be dedicated to it. If all the tests pass muster, then the Forward Networks platform should be a boon that sees ongoing use, and not a burden that ends up abandoned.