One of the fundamentally difficult things in modern networks is the fact that organizations don’t own significant portions of them. Certainly, the traditional portions of the network are still under the care and feeding of the local IT staff: the data center, the campus, and enterprise wireless to name a few. Increasingly though, more and more applications that are critical for business operations live beyond the confines of an organization’s network. The apps are in the cloud, hanging off of a B2B VPN connection, or accessed via a leased private path.
The reason this situation is difficult is that it’s still the network engineer’s (i.e., your) problem to pinpoint performance issues when applications aren’t working well. Telling business stakeholders that the cloud app is accessed via the Internet and is therefore “not my problem” isn’t an acceptable response. You’re IT. You’re the technology expert. You’re still expected to help resolve the issue.
Monitoring vs. Analysis
One of the ways we network operators discover problems is through monitoring. We monitor up/down status. We take latency measurements. We take jitter measurements. We compile utilization baselines. We collect logs by the unholy gigabyte. Really, we gather ridiculous amounts of data through our monitoring applications.
Of course, most of us who identify as enthusiastic data collectors don’t use the data very well. Assuming that to be true, let’s make a couple of obvious observations.
- Data is not enough. Analysis of that data is key. What good is data that doesn’t tell you anything meaningful?
- We are gathering so many data points that we pathetic humans cannot effectively analyze them. Therefore, without software analytics, the data is almost useless to us. (A minimal sketch of what that analysis can look like follows below.)
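To put a face on “software analytics,” here’s a minimal sketch of the kind of analysis a monitoring tool automates at scale: keep a rolling baseline of a metric (latency, in this hypothetical) and flag samples that deviate sharply from it. The window size, threshold, and sample values are invented for illustration.

```python
from collections import deque
from statistics import mean, stdev

WINDOW = 30        # number of recent samples kept in the rolling baseline
THRESHOLD = 3.0    # flag anything more than 3 standard deviations above the mean

baseline = deque(maxlen=WINDOW)

def check_latency(sample_ms: float) -> bool:
    """Return True if this sample looks anomalous against the rolling baseline."""
    anomalous = False
    if len(baseline) >= 10:  # wait for a minimal baseline before judging anything
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (sample_ms - mu) / sigma > THRESHOLD:
            anomalous = True
    baseline.append(sample_ms)
    return anomalous

# Example: a stream of mostly-normal RTT samples with one spike.
for rtt in [22, 23, 21, 24, 22, 23, 22, 21, 23, 22, 24, 95, 23]:
    if check_latency(rtt):
        print(f"Anomaly: {rtt} ms is well above the recent baseline")
```

A real platform does this across thousands of metrics at once, continuously, which is exactly the point: it is not work a human does by eyeball.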
Okay. Now let’s bring this back around to the topic of monitoring network segments you don’t own. The way I see it, there are two key considerations.
- Monitor the data that is available to you, even though you don’t own the network.
- Analyze that huge pile of data so that it’s actually useful.
Pulling this off successfully leads to specific benefits. For example, some problems can be anticipated and flagged before they are too impactful. In other cases, the root cause of a problem can be quickly discovered. How? Again, by monitoring the right things, and constantly analyzing the data flowing in. “But I can’t monitor very much on networks I don’t own. How does this help?” That’s a fair point, which we’ll address in a bit.
Improving Mean Time To Innocence (or Guilt)
Competent troubleshooters know that proving what a problem is not is nearly as important as proving what a problem is. In complex application deployments, nailing down the precise problem can be a bit like nailing gelatin to the wall. Just when we think we’ve got it, the suspect is cleared of wrongdoing, and we go back to the pile of data in the hopes of sifting out something else useful. That’s a time-consuming way to go about finding the root cause of a problem. Humans are bad at sifting through mounds of data and spotting trends; well-written software tends to do a better job.
In the case of troubleshooting poor performance for an off-site application, improving mean time to innocence is really important. Businesses need to understand whether the problem is in the local infrastructure, in the remote cloud, or somewhere in the middle. This is not especially easy to track down by hand. Manual traceroutes, simple ping tests, and DNS resolution checks are most of what can be done with the average workstation, but in fact there is a great deal more information that is publicly accessible. For example, it is possible to know the entirety of the path between the local network and the remote cloud. It is also possible to know whether that path has changed, and when. It is even possible to tell if a node in that path is lossy, and estimate how lossy the node is and how long it has been dropping traffic.
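For reference, those manual workstation checks amount to something like the throwaway sketch below (the target hostname is a placeholder, and the ping/traceroute flags assume a Linux or macOS box). It tells you whether the basics work from your seat, but very little about where along the path things go wrong.

```python
import socket
import subprocess
import time

TARGET = "example.com"  # placeholder destination

# Time a DNS lookup from this workstation.
start = time.perf_counter()
addr = socket.gethostbyname(TARGET)
print(f"DNS: {TARGET} -> {addr} in {(time.perf_counter() - start) * 1000:.1f} ms")

# Fire a few pings and let the OS summarize loss and RTT.
subprocess.run(["ping", "-c", "4", TARGET])

# A traceroute shows the forward path as this one host sees it, nothing more.
subprocess.run(["traceroute", "-n", TARGET])
```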
Bringing It All Together: Real-World Examples
To determine what’s going on between my network and cloud-hosted services or Internet destinations, I use ThousandEyes. As implied by the name, ThousandEyes watches over the network path between one (or several) probes and remote destinations using a battery of available tests. The test results are aggregated into a web interface you access as a cloud service, and presented in such a way that you can quickly diagnose where problems really lie and move toward a resolution. I’ve written about ThousandEyes before, covering a variety of use cases, including monitoring cloud providers or critical customer networks as well as estimating available bandwidth between two points. For this article, I’ll get into some other examples.
I was recently trying to use the sales website of a ski resort near me, Gunstock Mountain. On a particular day, I found that the site was up, but nearly unusable — very slow response times. Where did the problem lie? Was it my network? A DNS server issue? Something in the path between me and the server hosting the site? I checked a few things sort of as a reflex, but it struck me that my ThousandEyes probe running on the VMware ESXi server in my network rack would do a far better job.
I built a page load test that would retrieve a web page from the Gunstock sales site, tracking all of the interesting data surrounding the transactions. What I found was fascinating. When drilling into different time slices, I found different performance problems — not just one issue with this inconsistently performing site.
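To give a sense of the per-phase timing data a test like this produces, here’s a rough, hand-rolled sketch that times the DNS, TCP, TLS, and first-byte phases of a single HTTPS fetch. This is not how ThousandEyes implements its page load test, and the hostname is a placeholder; it’s simply a way to see where the seconds go.

```python
import socket
import ssl
import time

HOST = "www.example.com"   # placeholder, not the actual resort site
PATH = "/"

timings = {}

t0 = time.perf_counter()
addr = socket.gethostbyname(HOST)                        # DNS resolution
timings["dns_ms"] = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
raw = socket.create_connection((addr, 443), timeout=10)  # TCP handshake
timings["tcp_ms"] = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
ctx = ssl.create_default_context()
tls = ctx.wrap_socket(raw, server_hostname=HOST)         # TLS handshake
timings["tls_ms"] = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
tls.sendall(f"GET {PATH} HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n".encode())
tls.recv(1)                                              # wait for the first response byte
timings["ttfb_ms"] = (time.perf_counter() - t0) * 1000
tls.close()

for phase, ms in timings.items():
    print(f"{phase:8s} {ms:8.1f}")
```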
Problem 1: Slow DNS response. Here, DNS resolution alone consumed 2.5 seconds of the transaction. The web site itself was not the root of the problem.
Problem 2: Slow SSL setup. In this shot, note that it took 2.2 seconds for the SSL session to be established.
Problem 3: Data loss in the path. Here, I’d noticed that the web page test failed completely. As I drilled into the path visualization view, I saw that an upstream hop was showing 100% packet loss. In other words, my local network was up, and my Internet circuit was working. But a node between my ThousandEyes agent and the web server was badly broken. Root cause identified.
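If you wanted to chase that kind of mid-path loss by hand, mtr is roughly the right tool: it keeps sending probes toward every hop and reports per-hop loss. A rough sketch, assuming mtr is installed and using a placeholder target:

```python
import subprocess

TARGET = "example.com"   # placeholder destination
CYCLES = "20"

# Run mtr in report mode: repeated probes per hop, summarized loss and latency.
# Requires mtr to be installed; may need elevated privileges on some platforms.
result = subprocess.run(
    ["mtr", "--report", "--report-cycles", CYCLES, "-n", TARGET],
    capture_output=True, text=True, check=True,
)
print(result.stdout)

# Crude parse of the per-hop lines (the report format can vary between mtr versions).
for line in result.stdout.splitlines():
    if "|--" in line:
        parts = line.split()
        hop, loss = parts[1], parts[2].rstrip("%")
        if float(loss) > 0:
            print(f"Hop {hop} dropped {loss}% of probes")
```

One caveat: a transit router that rate-limits its own ICMP replies can show loss at that hop even while forwarding traffic just fine, which is part of why correlating loss over time and against end-to-end behavior, as the path visualization does, matters so much.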
In a different exercise, I set up a voice test to San Jose. Rather often, I use VoIP to chat with folks who are in or around the San Jose area. Mostly, those conversations are fine. Sometimes, they are NOT fine. Determining the root cause of a lousy quality Internet call is not possible without technology that includes a historical view to handle the monitoring and data analysis. In this shot, I’ve noticed that my average MOS between my ThousandEyes agent and the remote test agent has dropped significantly.
The next step is determining why the MOS dropped. In the screenshot above, we can see that loss is at 0%, which is good. Discards seem a bit higher than we’d like to see at 5.2%. Latency is acceptable at 57ms. But PDV — packet delay variation, jitter for our purposes here — is unacceptable. Path characteristics like these are enough to degrade VoIP quality. The question then becomes, where are the trouble spots in the path?
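Before chasing the path, it’s worth a quick look at why PDV in particular wrecks a call score. The sketch below uses a widely circulated simplification of the ITU-T E-model, not the full standard and not whatever ThousandEyes computes internally. The jitter figure is invented; the 57ms latency and 5.2% discards echo the measurements above, with discards treated as loss, since a packet thrown away by the jitter buffer is gone as far as the listener is concerned.

```python
def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Rough MOS estimate from a common simplification of the ITU-T E-model.
    Good enough to show a trend, not a calibrated score."""
    # Jitter forces the receive buffer to hold packets longer, so fold it in
    # as extra effective delay (a rule of thumb, not gospel).
    effective_latency = latency_ms + 2 * jitter_ms + 10.0
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120) / 10.0
    r -= 2.5 * loss_pct          # every percent of lost (or discarded) audio hurts a lot
    r = max(0.0, min(100.0, r))
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

# A calm path versus one with heavy PDV and jitter-buffer discards.
print(f"Calm path:    MOS ~ {estimate_mos(latency_ms=57, jitter_ms=2, loss_pct=0.0):.2f}")
print(f"Jittery path: MOS ~ {estimate_mos(latency_ms=57, jitter_ms=40, loss_pct=5.2):.2f}")
```

The drop from the mid-4s to the high-3s matches the intuition: the jitter and the discards it forces at the receive buffer drag the score down, not the raw 57ms of latency.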
As we dig into the path, we see a couple of interesting links in the screenshot below, both highlighted in red. One of the links has a delay of 38ms; based on the hostnames, we’ll assume the link connects roughly the MCI and SFO areas. Since that’s Kansas City to San Francisco, 38ms doesn’t seem terrible. The other link is a puzzlement, though. It appears to connect between routers in the SJC (San Jose) area, but with a 35ms delay. That seems awfully high. If I had to bet, I’d say that link was the most likely cause of our lousy MOS.
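A quick sanity check on those numbers, assuming rough great-circle distances (my own ballpark figures) and a propagation speed in fiber of about two-thirds the speed of light:

```python
# Back-of-envelope propagation delay. Real fiber paths are longer than the
# great-circle distance, so treat these as optimistic lower bounds.
KM_PER_MS = 200  # light in fiber covers roughly 200,000 km/s, i.e. ~200 km per millisecond

kc_to_sf_km = 2400    # Kansas City to San Francisco, very roughly
within_sjc_km = 50    # two routers in the same metro, being generous

print(f"KC -> SF physics floor, one way: ~{kc_to_sf_km / KM_PER_MS:.0f} ms")
print(f"Within the San Jose metro:       ~{within_sjc_km / KM_PER_MS:.2f} ms")
```

Roughly 12ms one way (double it for a round trip) means 38ms across the country is well within normal routing and queuing overhead. A 35ms delay between two routers that appear to sit in the same metro, where the physics floor is a fraction of a millisecond, is the number that deserves suspicion.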
Let’s probe a little deeper and see if my suspicion is right. Maybe this link is a problem at other times, too. Sure enough, poking at another point in time featuring an anomalous MOS, we see that same “vlan249” link in the screenshot below with a very high delay — 193ms. Hmm. Perhaps the link is overloaded or the interface is having trouble. Hard to say, but we definitely have a convincing data point.
Summing It Up
The best network analytics tools gather the data that’s impractical to gather by hand and present it in a useful way. That allows you to move issues toward resolution, because you’re able to visualize what a problem is, and not simply what it is not. Even better, you’re able to do this quickly, improving mean time to innocence.
Drilling into these problems is as straightforward as observing historical anomalies and moving swiftly through the ThousandEyes interface to identify a root cause. We network engineers love our CLIs, but ThousandEyes has built a genuinely useful and intuitive GUI. The data visualizations are key to finding out what’s really going on without having to sift through piles of numeric data, trying to perform anomaly detection and event correlation manually.
The Packet Pushers thank ThousandEyes for their sponsorship of this post.