If a router reboots, but the monitoring system didn’t poll it during the reboot, did it really reboot? You might receive an SNMP coldStart trap, but traps are normally sent over UDP with no acknowledgement, so you can’t rely on one arriving. How do you know if the device rebooted unexpectedly? Some people might say that it doesn’t matter: the device is back up and running, so who cares? But you should be tracking unexplained system reboots, because they can point to a hardware or software fault.
Network monitoring systems poll devices periodically, and 5 minutes is a typical default interval. If a device completely reboots within that 5 minute window, then when the NMS comes along for the next poll and asks “Are you alive? Are your links up? Everything OK there?” the router replies with “Yep, all good, all working well. No more memory leak either, now you happen to mention it.”
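To make the timing problem concrete, here is a minimal sketch that checks which polls, if any, land inside an outage. The function name and the assumption that polls fire at exact multiples of the interval are mine, purely for illustration; a real NMS schedules polls less tidily.

```python
import math

def polls_during_outage(down_at, up_at, interval=300, first_poll=0):
    """Return the poll times (in seconds) that land inside [down_at, up_at)."""
    # Index of the first poll at or after the moment the device went down.
    n = max(0, math.ceil((down_at - first_poll) / interval))
    t = first_poll + n * interval
    hits = []
    while t < up_at:
        hits.append(t)
        t += interval
    return hits

# A 90 second reboot that falls between two polls: nothing notices.
print(polls_during_outage(310, 400))   # []
# An outage of roughly ten minutes: two polls fail.
print(polls_during_outage(290, 900))   # [300, 600]
```

An empty list means every poll saw a healthy device, which is exactly the case the rest of this post is about.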
Anyone who has ever sat there praying for a remote Cisco 2500 to come back up will know that there’s no way your NMS could miss that going down, as those things have a reboot time measured in hours and days. But new equipment, and virtualised equipment in particular, has a very quick reboot cycle, and a reboot could easily be missed. Using ever shorter polling intervals is not the answer.
We need another way: enter sysUpTimeInstance. This is a standard SNMP OID that is supported by pretty much all SNMP implementations. It tells us how long the system has been running, in hundredths of a second. Here’s an example poll:
$ snmpget -v 2c -c ChuckNorris 192.168.1.224 1.3.6.1.2.1.1.3.0
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (3670760) 10:11:47.60
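If you are scripting this yourself rather than letting the NMS do it, the raw tick count in brackets is the number to work with. Below is a small sketch that pulls it out of a line of snmpget output like the one above and converts it. The helper names are hypothetical, and the regular expression assumes the default Net-SNMP output format shown in the example.

```python
import re

# Matches the raw tick count inside "Timeticks: (3670760) 10:11:47.60".
TIMETICKS_RE = re.compile(r"Timeticks:\s*\((\d+)\)")

def parse_timeticks(line):
    """Extract the raw TimeTicks value (hundredths of a second) from snmpget output."""
    match = TIMETICKS_RE.search(line)
    if match is None:
        raise ValueError(f"no Timeticks value found in: {line!r}")
    return int(match.group(1))

def ticks_to_seconds(ticks):
    """Convert hundredths of a second to seconds."""
    return ticks / 100

def format_uptime(ticks):
    """Render ticks as days plus hh:mm:ss.hh."""
    seconds, hundredths = divmod(ticks, 100)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return f"{days}d {hours:02d}:{minutes:02d}:{seconds:02d}.{hundredths:02d}"

sample = "DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (3670760) 10:11:47.60"
ticks = parse_timeticks(sample)
print(ticks_to_seconds(ticks))  # 36707.6
print(format_uptime(ticks))     # 0d 10:11:47.60
```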
So all we need to do is make sure our NMS is polling that OID, and raises an alert if the uptime is less than, say, 15 minutes. As long as that threshold is longer than the polling interval, you know you’ll pick up the reboot, regardless of how long the reload took.
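The rule itself is tiny. Here is a rough sketch of it, assuming you already have the raw tick count from a poll; the function and constant names are made up for illustration and are not taken from any particular NMS.

```python
# Alert threshold: 15 minutes, expressed in seconds.
REBOOT_THRESHOLD_SECONDS = 15 * 60

def recently_rebooted(uptime_ticks, threshold_seconds=REBOOT_THRESHOLD_SECONDS):
    """Return True if the reported uptime is below the alert threshold.

    uptime_ticks is the raw sysUpTime value in hundredths of a second.
    """
    return uptime_ticks / 100 < threshold_seconds

print(recently_rebooted(3670760))  # False: up for over ten hours
print(recently_rebooted(12000))    # True: up for two minutes
```

With 5 minute polling and a 15 minute threshold, the same reboot can trip the check on up to three consecutive polls, so you will probably want your NMS to deduplicate the resulting alerts.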
Of course, some monitoring systems will automatically poll the sysUpTime for you, and they’ll give you a very useful message to tell you the system just rebooted. Other network monitoring tools are enormously powerful, but feel that such simple things are beneath their dignity. They prefer to focus on the hard problems, not the easy ones. They may well notice sysUpTime changes, but they put it in a logfile and don’t bother creating an incident. HP’s NNMi is a case in point.
If you’re using NNMi, you will need to create a MIB expression matching sysUpTime. I usually use sysUpTime*0.01, which normalises it to seconds. Then create custom polling rules to monitor that MIB expression, and set it to raise a critical incident if it is less than 900 (i.e. 15 minutes). You may want to go and tweak the custom poller incident too, setting enrichment rules for that event to give some slightly more meaningful text than the default.
So, take a look at your NMS – check to see if it is monitoring system uptime, and test its behaviour when a device is rebooted. You want to know that it’s going to react if a system does do a quick reboot, regardless of what your polling intervals are set to.
Side note 1: Technically sysUpTimeInstance measures the time the SNMP agent has been running, and so is prone to false alerts if SNMP is reconfigured. hrSystemUptime (1.3.6.1.2.1.25.1.1) is the time the system has been running. hrSystemUptime doesn’t seem to be quite so widely supported, though. YMMV.
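Where a device does support both values, comparing them lets you tell an agent restart from a real reboot. This is only a sketch under that assumption: the function name and the returned labels are hypothetical, and both inputs are taken to be raw TimeTicks values in hundredths of a second.

```python
def classify_restart(sys_uptime_ticks, hr_uptime_ticks, threshold_seconds=900):
    """Distinguish a full reboot from an SNMP agent restart.

    sys_uptime_ticks: sysUpTime, time since the SNMP agent (re)started.
    hr_uptime_ticks:  hrSystemUptime, time since the host itself booted.
    """
    agent_recent = sys_uptime_ticks / 100 < threshold_seconds
    system_recent = hr_uptime_ticks / 100 < threshold_seconds
    if system_recent:
        return "system reboot"
    if agent_recent:
        return "snmp agent restart"
    return "no recent restart"

# Both counters are low: the box really did reboot.
print(classify_restart(6000, 6000))      # system reboot
# Agent is one minute old, host has been up a day: SNMP was restarted.
print(classify_restart(6000, 8640000))   # snmp agent restart
```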
Side note 2: NNMi 9.x seems to be incapable of graphing the sysUpTime value. It gets the first value, and continues polling after that, but it only displays the delta. Since the initial value and the delta are usually a long way apart, it makes the graph effectively useless. This is a known bug logged in December 2010, but it is still unresolved as of 9.11 patch 3. Harrumph. Surely it’s not that hard to poll a value and just graph what it tells you?