This article is part 4 of a series on the Aruba 8400 chassis switch, launched in August 2017. See the links section at the bottom of this article for the other articles in the series.
Root cause analysis (RCA) is the art of finding out precisely why a broken thing is broken. In IT, determining the root cause of an issue is complicated because application delivery depends on a large number of components, both code and physical infrastructure.
Determining the root of the problem (the thing that, once fixed, resolves the problem) is a holy grail of network management systems. When goat sacrifices fail, deep human knowledge of a snowflake system’s specifics is required to sort through data, decide what is symptomatic as opposed to causal, and filter the information down to the root cause.
Expert systems software has, for decades, promised to deliver RCA to network engineers. For example, 15 or more years ago, Aprisma Spectrum was a monitoring platform that promised to determine the root cause of a network outage. As I recall, Spectrum’s RCA functionality was mostly a dependency tree that the platform operator was required to build. Spectrum could infer where the root cause was based on the breaks in the dependency tree. That was hardly a revolution in software, even at that time.
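To make the dependency-tree idea concrete, here is a minimal Python sketch of that style of inference. It is not Spectrum's actual logic. The topology, the device names, and the `find_root_causes` helper are all invented for illustration. The assumption is simply that a device that is down while its upstream parent is still up is a likely root cause, and everything below it is a symptom.

```python
def find_root_causes(parent_of, down):
    """Return the down nodes whose upstream parent is still up.

    parent_of maps each node to its upstream parent (None for the top).
    down is the set of nodes currently unreachable.
    """
    roots = []
    for node in sorted(down):
        parent = parent_of.get(node)
        # A down node under a down parent is treated as a symptom.
        if parent is None or parent not in down:
            roots.append(node)
    return roots


# Hypothetical operator-built dependency tree.
parent_of = {
    "core": None,
    "dist-1": "core",
    "dist-2": "core",
    "access-1": "dist-1",
    "access-2": "dist-1",
    "access-3": "dist-2",
}
down = {"dist-1", "access-1", "access-2"}

print(find_root_causes(parent_of, down))
```

With this sample data, the two access switches are filtered out as symptoms and only `dist-1` is reported, which also shows the limitation: the answer is only as good as the tree the operator built.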
The Aruba 8400’s Take On Root Cause Analysis
Aruba takes a deeper look at the challenge of RCA. They gather an enormous amount of information and feed it to their analytics engine to determine root cause.
The 8400’s integrated network analytics engine is the first item to call out, before we even consider RCA. Ponder for a moment that the 8400 is, on board, performing network analytics. Typically, network analytics is performed by an external device running software that was purchased, possibly at an eye-watering price, in addition to the hardware being monitored. The 8400 offers analytics built-in as part of the base license. While Aruba isn’t giving the 8400 away, you’re getting some bang for your buck here.
How does the network analytics engine work?
Data is gathered via what Aruba calls “agents.” That’s not an unusual term, but I have a sense of what that word usually conjures up. When engineers hear “agents” they generally respond with, “Yuck.” In this case, “yuck” is unwarranted. Agents in this context are Python scripts running on the 8400, and not special software packages that must be loaded onto remote devices.
The Python scripts feature a module (i.e. methods and classes) that makes it easier to create agents that perform checks. The scripts run in an LXC container with CPU and memory constraints that keep the system from being overrun.
From Aruba’s point of view, running agents on the 8400 is key because it grants the agents the ability to gather 8400 data with no gaps. In remote polling situations, a network interruption causes a gap in the time-series data. In addition, Aruba tightly integrates the agents with the Prometheus database running on the box, as well as the web UI.
These Python script agents monitor for and trigger on anomalies. Agents have full access to configuration, protocol state, and network statistics. In other words, agents have access to data which gives them context around the data being gathered. That context is what’s required to know when there’s an anomaly.
The Prometheus time series database is the data repository for the agents. The data is correlated with configuration checkpoints and diffs, so that changes to the switch configuration can be tied to detected traffic anomalies.
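As a rough illustration of tying an anomaly to a configuration change, the sketch below looks up the most recent checkpoint that precedes an anomaly timestamp. It does not use Aruba's module or the on-box database. The checkpoint names, the timestamps, and the `last_change_before` helper are assumptions made up for this example.

```python
from bisect import bisect_right


def last_change_before(checkpoints, anomaly_ts, window):
    """Return the name of the latest checkpoint within `window` seconds
    before the anomaly, or None if there is no such checkpoint.

    checkpoints is a list of (timestamp, name) tuples sorted by timestamp.
    """
    times = [ts for ts, _ in checkpoints]
    i = bisect_right(times, anomaly_ts)
    if i == 0:
        return None  # the anomaly predates every checkpoint
    ts, name = checkpoints[i - 1]
    return name if anomaly_ts - ts <= window else None


# Hypothetical checkpoints, in seconds since some epoch.
checkpoints = [(1000, "checkpoint-a"), (5000, "checkpoint-b")]

print(last_change_before(checkpoints, 5120, window=600))
print(last_change_before(checkpoints, 3000, window=600))
```

The first lookup lands 120 seconds after `checkpoint-b` and returns it as a suspect. The second falls well outside the window of any change and returns `None`, which suggests the anomaly was probably not caused by a configuration edit.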
The agents can also grab data from neighbor infrastructure and servers. This is an important detail, as it means agents have more information to work with than simply what the local 8400 knows about.
Let’s Consider An RCA Example
On launch day, Aruba demonstrated a VoIP quality monitor running on an 8400. In this demo, the 8400 agent running the monitor was also monitoring two external Aruba 2930 switches.
The agent monitored the traffic in the voice queue, identified as traffic marked with a specific DSCP value. (I can’t remember what value they were using, but seem to recall it wasn’t the typical EF.) Specifically, the agent monitored the packets-per-second metric and then detected an anomaly, where the polling result of the script was abnormal.
Interestingly, the scripts made a monitoring decision for this specific DSCP value based on the configuration of the 8400. The script author didn’t have to explicitly instruct the script to monitor specific DSCP values. Instead, the script looked at the 8400’s configuration, observed how particular DSCP values were being treated, and made an inference about the values worth tracking for anomalies.
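The demo script's actual inference logic wasn't shown in detail, so the following is only a guess at the general shape of the idea. It assumes a simplified, made-up representation of QoS configuration (`dscp_to_queue` and `queue_scheduling` are hypothetical keys, not ArubaOS-CX data structures) and picks out the DSCP values that land in a strict-priority queue.

```python
def dscp_values_to_watch(qos_config):
    """Return DSCP values mapped to strict-priority queues, sorted."""
    priority_queues = {
        queue
        for queue, scheduling in qos_config["queue_scheduling"].items()
        if scheduling == "strict"
    }
    return sorted(
        dscp
        for dscp, queue in qos_config["dscp_to_queue"].items()
        if queue in priority_queues
    )


# Hypothetical configuration snapshot: DSCP -> queue, queue -> scheduler.
qos_config = {
    "dscp_to_queue": {0: 1, 10: 2, 40: 7, 46: 7},
    "queue_scheduling": {1: "dwrr", 2: "dwrr", 7: "strict"},
}

print(dscp_values_to_watch(qos_config))
```

The point of the design is that the watch list follows the configuration. If someone remaps a DSCP value to the priority queue, the script picks it up without being edited.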
For the curious, the abnormality detected was a spike in VoIP packets, as opposed to a gradual ramp up. VoIP spikes are unusual, as voice calls require low bandwidth, and therefore should never result in a hard spike. When the spike was detected, the script performed a REST call out to the 2930s, polling the results of an IP SLA monitor running on them.
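The difference between a spike and a gradual ramp can be sketched with a simple rolling baseline. This is not the algorithm Aruba demonstrated. The `find_spikes` helper, the window size, and the threshold factor are assumptions chosen for illustration, and the sample values are invented.

```python
from statistics import mean


def find_spikes(samples, window=5, factor=3.0):
    """Return indexes where a sample exceeds `factor` times the mean
    of the `window` samples immediately before it."""
    spikes = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Invented packets-per-second readings for the voice queue.
ramp = [100, 120, 140, 160, 180, 200, 220, 240]
spike = [100, 105, 98, 102, 101, 900, 110, 104]

print(find_spikes(ramp))
print(find_spikes(spike))
```

The steadily climbing series produces no hits, because each sample stays close to its recent baseline. The second series flags index 5, where the rate jumps to roughly nine times the preceding average.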
The remote 2930 SLA data was correlated with the anomaly in the timeline, and presented in the web UI so that an operator could click to drill down. The web UI automatically generated a clickable link over the anomaly, which brought up a list of SLA data points. Each SLA data point could be clicked to find more information. In all of this clicking, Aruba was showing off their correlation of data from multiple sources.
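The correlation step can be approximated as follows. This sketch assumes a hypothetical `fetch_sla` callable that stands in for the REST call to each neighbor. The real endpoint, payload format, and field names weren't part of the demo notes, so the `ts` and `jitter_ms` fields and the switch names are placeholders.

```python
def correlate_sla(anomaly_ts, neighbors, fetch_sla, window=60):
    """Collect each neighbor's SLA data points that fall within
    `window` seconds of the anomaly timestamp."""
    report = {"anomaly_ts": anomaly_ts, "sla": {}}
    for name in neighbors:
        points = fetch_sla(name)
        report["sla"][name] = [
            p for p in points if abs(p["ts"] - anomaly_ts) <= window
        ]
    return report


def fake_fetch_sla(name):
    """Stand-in for a REST poll; returns canned, invented data."""
    data = {
        "2930-a": [{"ts": 940, "jitter_ms": 2}, {"ts": 1010, "jitter_ms": 35}],
        "2930-b": [{"ts": 700, "jitter_ms": 1}, {"ts": 1050, "jitter_ms": 28}],
    }
    return data[name]


report = correlate_sla(1000, ["2930-a", "2930-b"], fake_fetch_sla)
print(report)
```

Passing the fetch function in as a parameter keeps the correlation logic testable without a live switch. Only the data points near the anomaly survive, which is the list an operator would drill into.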
As it happened, the spike appeared to be due to bulk traffic that was incorrectly marked with the VoIP DSCP value. Bulk traffic that shows up in the voice queue during times of congestion is likely to lead to jitter and perhaps dropped packets for the legitimate voice traffic. Now, did the 8400 tell the operator about this conclusion? Not exactly, but the information was present in one interface for a knowledgeable human to make the inferential leap required.
Yes, But Is That Root Cause Analysis?
We could argue whether what Aruba demonstrated is or is not RCA. My takeaway is that it’s possible, with well-written agents, to monitor multiple data sources and flag real issues using the 8400 as a platform. This feels like a significant step ahead of staring at scrolling graphs in an NMS and trying to make sense of what happened a half-hour ago that caused the network to burp.
No, you’re not getting an expert system in a box that is pre-programmed to recognize thousands of common network issues and flag them for you. Nor are you getting a set of instructions that tell you to take some specific action to ameliorate some bad network juju the system detected. Instead, you’re getting a platform you can use to build some specific, customized intelligence into your campus network core. If you know what you’re looking for, you can program the 8400 to take care of that for you.
For those intimidated by coding this level of monitoring intelligence, Aruba is not abandoning you to write agents on your own. The Python scripts are consumable via an official Aruba Exchange as well as GitHub. This means that the community will be able to contribute scripts, some of which will eventually become “Aruba certified.” Of course, you must exercise wisdom before executing someone else’s code on your switch. Recall the rope Aruba is giving you with the 8400: useful, but also long enough to hang yourself with.
What About Remediation?
For those souls brave enough to seek automated remediation of a detected network anomaly (hey, fix the problem for me), Aruba is glad to extend the long rope metaphor even further. It is possible to script specific actions to take when a specific condition is detected.
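One way to picture condition-to-action scripting is a small registry that maps a named condition to a remediation function. This is a generic pattern, not Aruba's API. The condition name, the event fields, and the action itself are hypothetical, and the sketch defaults to a dry run out of respect for the length of the rope.

```python
ACTIONS = {}  # condition name -> remediation function (names are made up)


def remediation(condition):
    """Decorator that registers a function as the action for a condition."""
    def register(func):
        ACTIONS[condition] = func
        return func
    return register


@remediation("voice_queue_spike")
def police_mismarked_traffic(event):
    # A real action would push a change to the switch; this only describes it.
    return f"rate-limit DSCP {event['dscp']} on {event['interface']}"


def handle(event, dry_run=True):
    """Look up the action for an event and report what it would do."""
    action = ACTIONS.get(event["condition"])
    if action is None:
        return "no remediation registered"
    plan = action(event)
    return f"DRY RUN: {plan}" if dry_run else f"EXECUTE: {plan}"


event = {"condition": "voice_queue_spike", "dscp": 40, "interface": "1/1/1"}
print(handle(event))
```

Unknown conditions fall through to a harmless message rather than guessing at a fix, and nothing changes on the network unless the caller explicitly turns off the dry run.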
Aruba was actually a bit excited about remediation, claiming that their ability to launch a remediation action against a detected anomaly is unique. I found that claim specious. I’ve been able to launch remediation scripts against a particular detected condition for a long time with the commonplace SolarWinds Orion platform many networkers are familiar with. I suspect other NMS platforms can do the same thing.
That said, Aruba specifically poked at the emerging streaming telemetry (as opposed to NMS) solutions on the market, stating that those other real-time detection engines might find issues, but don’t offer remediation. I suppose that might be true, at least for now. Remediation isn’t a terribly difficult feature to add to a monitoring platform if customers want it.
The Aruba 8400 chassis switch is a modern, informed, insightful take on the needs of the campus core. The switch is a look ahead to what’s coming in IT operations instead of more of the same. For organizations attempting to automate their network operations and integrate their network into the larger IT stack, the Aruba 8400 is worthy of consideration.
- Aruba Picks A Fight In The Campus Core With Its New 8400 Switch
- The Aruba 8400 Chassis Switch. Yes, But Why?
- The Aruba 8400 Hardware Highlights
- The Aruba 8400 ArubaOS-CX Network Operating System
- The Aruba 8400 Integrated Network Analytics & Automated Root Cause Analysis
This article underwent a technical review by Aruba Networks to ensure accuracy, which I appreciate. I sat for an entire day during the launch event hosted by Tech Field Day listening to several hours of presentation on this complex platform. I like to be sure I got it right.