Most well-designed networks behave in a relatively predictable way. Even in the educational space, where we have users in dorms and the unusual requirements of researchers and academics, there are patterns to network activity. Knowing your network baselines is one of the most important weapons in the troubleshooting arsenal. After all, if you don’t know what normal behaviour looks like, how do you begin to identify what is unusual? It seems that many people don’t take the time or implement the tools to help them in this regard, which is a shame, because it is not so difficult to do. It can even help you troubleshoot problems outside the network space.
The first and easiest part of baselining is finding out where your traffic is flowing. In many enterprises, behaviour will be fairly straightforward to predict. For example, there might be a spike in traffic at 9 a.m. when users arrive for work and begin turning on PCs and fetching their email. Traffic may dip around lunchtime. Backups might tax the network between midnight and 6 a.m. Where I worked, we used a product called Statseeker. Many of the functions it provided could be done with open-source tools such as RRDtool, MRTG, Cacti, or Smokeping, but Statseeker version 2.8, which we used, gave us easily configurable statistics and graphs on packets, bytes, errors, drops and delay without having to roll our own. It also allowed alerting on link and device reachability, threshold alerting (useful once you have your baselines) and even a rudimentary SLA reporting function.
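If you do want to roll your own, the core collection loop is small. Here is a minimal sketch in Python, assuming net-snmp’s snmpget binary and the python-rrdtool bindings are installed; the host name, community string and ifIndex are placeholders for your own gear, not anything from my old network:

```python
#!/usr/bin/env python3
"""Poll interface counters via SNMP into an RRDtool database.

A minimal sketch: assumes net-snmp's snmpget and the python-rrdtool
bindings are installed; host, community and ifIndex are placeholders.
"""
import os
import subprocess

import rrdtool

RRD = "uplink.rrd"
HOST = "sw1.example.edu"   # hypothetical switch
COMMUNITY = "public"       # read-only SNMP community
IFINDEX = 10               # hypothetical uplink port

# 64-bit octet counters from IF-MIB
OIDS = {
    "inoctets":  f"1.3.6.1.2.1.31.1.1.1.6.{IFINDEX}",   # ifHCInOctets
    "outoctets": f"1.3.6.1.2.1.31.1.1.1.10.{IFINDEX}",  # ifHCOutOctets
}

def snmp_get(oid: str) -> int:
    """Fetch one counter value with the net-snmp CLI (-Oqv = value only)."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid],
        text=True,
    )
    return int(out.strip())

if not os.path.exists(RRD):
    # 5-minute step; the COUNTER type turns raw counters into rates.
    # RRAs keep roughly 2 days of 5-min samples and 2 years of daily averages.
    rrdtool.create(
        RRD, "--step", "300",
        "DS:inoctets:COUNTER:600:0:U",
        "DS:outoctets:COUNTER:600:0:U",
        "RRA:AVERAGE:0.5:1:576",
        "RRA:AVERAGE:0.5:288:730",
    )

rrdtool.update(RRD, f"N:{snmp_get(OIDS['inoctets'])}:{snmp_get(OIDS['outoctets'])}")
```

Run it from cron every five minutes and rrdtool graph (or Cacti sitting on the same data) gives you the kind of daily and weekly traffic curves shown in the images here.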
The new Statseeker version 3 adds network discovery, NetFlow support and many other things, but we had not upgraded by the time I left. Version 2.8 worked for years on a scavenged Dell GX280, monitoring a couple of thousand ports for a reasonable (albeit .edu-discounted) price. It does not have a huge range of bells and whistles, but what it does, it does very well.
An example of the output is shown in the first image below: a 2009 graph of a connection between two buildings, one of which housed some servers. You can see definite regular patterns day by day. Later in the month, some of the servers were moved to a new site, and the traffic pattern changed accordingly.
As you can see, this gives you an at-a-glance baseline of what your traffic looks like. If something odd is happening in the network, you can see it easily.
Another example happened in 2010. One of our WAN providers did some work on a link one evening. After it was finished, I got threshold alerts on the round-trip delay from the Statseeker server. As you can see in the next image, something was definitely wrong, and it took them a few hours to sort it out. Even without the automated threshold alerts, having this tool available would have let me see the problem immediately, as soon as the users complained that their network sucked. A baseline to the upstream device would have shown no issue there, so the problem had to be on the link where the work had occurred.
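Statseeker generated those alerts for me, but the underlying check is simple enough to sketch yourself. A minimal, hypothetical version in Python, assuming you have already read a normal round-trip time for the link off your graphs; the host name, baseline figure and mail addresses are placeholders:

```python
#!/usr/bin/env python3
"""Alert when round-trip delay to a WAN peer exceeds its baseline.

A minimal sketch: host, baseline and mail details are placeholders;
run it from cron alongside your regular polling.
"""
import re
import smtplib
import subprocess
from email.message import EmailMessage

HOST = "wan-rtr.example.edu"   # hypothetical remote-end router
BASELINE_MS = 12.0             # normal RTT learned from your graphs
THRESHOLD = BASELINE_MS * 3    # alert when delay triples

def avg_rtt_ms(host: str, count: int = 5) -> float:
    """Ping the host and parse the average RTT from the summary line."""
    out = subprocess.check_output(
        ["ping", "-c", str(count), "-q", host], text=True
    )
    # iputils summary: rtt min/avg/max/mdev = 11.1/12.3/14.0/0.9 ms
    m = re.search(r"= [\d.]+/([\d.]+)/", out)
    if not m:
        raise RuntimeError("could not parse ping output")
    return float(m.group(1))

rtt = avg_rtt_ms(HOST)
if rtt > THRESHOLD:
    msg = EmailMessage()
    msg["Subject"] = f"RTT alert: {HOST} at {rtt:.1f} ms (baseline {BASELINE_MS} ms)"
    msg["From"] = "nms@example.edu"
    msg["To"] = "netops@example.edu"
    msg.set_content(f"Round-trip delay to {HOST} is {rtt:.1f} ms, "
                    f"well above the {BASELINE_MS} ms baseline.")
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)
```

The point is not the ping itself but the comparison: without the baseline figure, there is no sensible threshold to alert on.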
Here’s one final example where baselining can make you either a hero or a villain, depending upon how much work you create for someone else. One month, the traffic statistics for a particular remote-site student computer lab seemed quite high compared to previous months – once again, know your baseline. A quick check of the link baselines showed a regular spike in downloads throughout the day, always on the hour, though sometimes at intervals of two or three hours. This was odd, so off to the logs. It turned out that the desktop support guys had misconfigured the antivirus, so it was going out to the net for updates instead of to the local repository. They were also using a product called Deep Freeze, which resets the OS image at reboot. Because Deep Freeze wiped the update as an unauthorized OS change every time the PCs rebooted, the antivirus performed a full software update whenever a class of users logged out and the machines restarted. Needless to say, had we not had a decent baseline of what was going on in the network, this behaviour could have gone on indefinitely, wasting expensive WAN resources and putting load on the old lab PCs, adversely affecting the user experience.
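As an aside, patterns like that on-the-hour spike can be surfaced programmatically as well as by eyeballing graphs. A rough sketch of the idea, assuming you can export timestamped utilization samples from your poller (the data below is synthetic, for illustration only): bucket the samples by minute of the hour and flag any bucket sitting well above the overall average.

```python
#!/usr/bin/env python3
"""Spot 'always on the hour' traffic spikes in baseline samples.

A minimal sketch: 'samples' would come from your own export; the
values generated here are synthetic, purely for illustration.
"""
import time
from collections import defaultdict
from statistics import mean

# (unix_time, bits_per_second) pairs; one synthetic day of 1-min data
# with a 2 Mb/s burst at the top of every hour over a 0.5 Mb/s floor.
samples = [(1262304000 + i * 60, 2e6 if i % 60 == 0 else 5e5)
           for i in range(60 * 24)]

by_minute = defaultdict(list)
for ts, bps in samples:
    by_minute[time.gmtime(ts).tm_min].append(bps)

overall = mean(bps for _, bps in samples)
for minute, vals in sorted(by_minute.items()):
    if mean(vals) > 2 * overall:   # arbitrary "well above normal" cutoff
        print(f"minute {minute:02d} of each hour averages "
              f"{mean(vals)/1e6:.2f} Mb/s vs {overall/1e6:.2f} Mb/s overall")
```

In our case the graphs made the pattern obvious, but on a busier link a check like this can pick out a scheduled job hiding in the noise.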
One further important point: baselining the network is not a one-off; it is a continuing process. User behaviour evolves and changes over time. New applications, server moves and a myriad of other factors make the baseline a moving target, and you need to keep on top of how your network is evolving. This data, already to hand, also assists in capacity planning. Is that WAN link getting too small? Or are we paying for bandwidth we aren’t using? Like all statistics and logging, you need to look at it regularly. Threshold alerting is useful, but better still is getting to know your network and understanding its daily and weekly cycles.
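The same stored samples answer the capacity question directly. As a sketch, assuming a month of exported 5-minute utilization figures (again synthetic below), the 95th percentile, the figure many carriers bill on, shows at a glance whether a link is running hot or mostly idle:

```python
#!/usr/bin/env python3
"""Judge WAN link sizing from baseline samples via the 95th percentile.

A minimal sketch: replace the synthetic samples with an export of
your own link's utilization history (e.g. a month of 5-min averages).
"""
import random
from statistics import quantiles

LINK_CAPACITY_BPS = 10e6   # hypothetical 10 Mb/s WAN link

# Synthetic month of 5-minute samples, in bits per second
random.seed(1)
samples = [random.uniform(1e6, 9e6) for _ in range(8640)]

p95 = quantiles(samples, n=20)[18]   # 19th of 19 cut points = 95th pct
ratio = p95 / LINK_CAPACITY_BPS
print(f"95th percentile: {p95/1e6:.1f} Mb/s ({ratio:.0%} of capacity)")
if ratio > 0.8:
    print("Link is running hot: start the upgrade conversation.")
elif ratio < 0.2:
    print("Paying for bandwidth you aren't using: consider downsizing.")
```

The 0.8 and 0.2 cutoffs are arbitrary placeholders; the useful part is having months of real samples to feed in.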
I have focused on the Statseeker product here because I used it in a targeted way to baseline and monitor our network, and I was extremely happy with it. You may use other NMS products that do a similar thing. Leave a comment on which NMS and baselining tools you use.
In conclusion, the main point is that baselining is a key foundation of troubleshooting network performance. Knowing what is usual and what is abnormal gives you a starting point. It may turn out that the spike in traffic on the WAN is totally legit. Or it could be that a remote-site PC on a thin pipe has become a Skype supernode. Without a proper baseline, you might be jumping at shadows.