Catch Unexpected Reboots Through Monitoring sysUpTimeInstance

If a router reboots, but the monitoring system didn’t poll it during that reboot time, did it really reboot? You might receive an SNMP cold start trap, but that’s not reliable. How do you know if the device rebooted unexpectedly? Some people might say that it doesn’t matter, the device is back up and running, so who cares? But you should be tracking unexplained system reboots, in case a device has a hardware or software fault.

Network monitoring systems poll devices periodically, and 5 minutes is a typical default interval. If a device completely reboots during that 5 minute window, then when the NMS comes along for the next poll and asks “Are you alive? Are your links up? Everything OK there?” the router replies with “Yep, all good, all working well. No more memory leak either, now you happen to mention it.”

Anyone who has ever sat there praying for a remote Cisco 2500 to come back up will know that there’s no way your NMS could miss that going down, as those things have a reboot time measured in hours and days. But new equipment (virtualised equipment in particular) has a very quick reboot cycle, and a reboot could easily be missed. Using ever shorter polling intervals is not the answer. We need another way: enter sysUpTimeInstance. This is a standard SNMP OID (1.3.6.1.2.1.1.3.0) that is supported by pretty much all SNMP implementations. It tells us how long the system has been running, in hundredths of a second. Here’s an example poll:

$ snmpget -v 2c -c ChuckNorris 192.168.1.224 1.3.6.1.2.1.1.3.0
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (3670760) 10:11:47.60

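So all we need to do is make sure our NMS polls that OID and raises an alert if the uptime is less than, say, 15 minutes. The threshold just needs to be longer than the polling interval, so that at least one poll lands while the uptime is still below it. That way you know you’ll pick up the reboot, regardless of how long the reload took.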
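As a rough illustration of that check, here is a minimal Python sketch that pulls the raw Timeticks value out of a line of snmpget output like the one above and compares it against a 15-minute threshold. The function names (parse_timeticks, recently_rebooted) and the regular expression are assumptions made for this example, not part of any NMS; a real poller would normally query SNMP directly rather than parse text.

```python
import re

# Assumed threshold: the 15 minutes suggested in the text, in seconds.
REBOOT_THRESHOLD_SECONDS = 15 * 60


def parse_timeticks(snmpget_line: str) -> int:
    """Return the raw Timeticks value (hundredths of a second) from an snmpget output line."""
    match = re.search(r"Timeticks:\s*\((\d+)\)", snmpget_line)
    if match is None:
        raise ValueError(f"no Timeticks value found in: {snmpget_line!r}")
    return int(match.group(1))


def recently_rebooted(timeticks: int,
                      threshold_seconds: int = REBOOT_THRESHOLD_SECONDS) -> bool:
    """True if the reported uptime is shorter than the alert threshold."""
    uptime_seconds = timeticks / 100  # Timeticks are hundredths of a second
    return uptime_seconds < threshold_seconds


# The sample poll from the article: about 10 hours of uptime, so no alert.
line = "DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (3670760) 10:11:47.60"
ticks = parse_timeticks(line)
print(ticks / 100, recently_rebooted(ticks))
```

With a 5 minute polling interval and a 15 minute threshold, a rebooted device would trip this check on up to three consecutive polls, so some de-duplication of the resulting alerts is probably worthwhile.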

Of course, some monitoring systems will automatically poll the sysUpTime for you, and they’ll give you a very useful message to tell you the system just rebooted. Other network monitoring tools are enormously powerful, but feel that such simple things are beneath their dignity. They prefer to focus on the hard problems, not the easy ones. They may well notice sysUpTime changes, but they put it in a logfile and don’t bother creating an incident. HP’s NNMi is a case in point.

If you’re using NNMi, you will need to create a MIB expression matching sysUpTime. I usually use sysUpTime*0.01, which normalises it to seconds. Then create custom polling rules to monitor that MIB expression, and set it to raise a critical incident if the value is less than 900 (i.e. 15 minutes). You may also want to tweak the custom poller incident and set enrichment rules for that event, to give slightly more meaningful text than the default.

So, take a look at your NMS – check to see if it is monitoring system uptime, and test its behaviour when a device is rebooted. You want to know that it’s going to react if a system does do a quick reboot, regardless of what your polling intervals are set to.

Side note 1: Technically sysUpTimeInstance measures the time the SNMP agent has been running, and so is prone to false alerts if SNMP is reconfigured. hrSystemUptime (1.3.6.1.2.1.25.1.1) is the time the system has been running. hrSystemUptime doesn’t seem to be quite so widely supported, though. YMMV.
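One way to act on that distinction, sketched here in Python under the assumption that you can poll both values, is to compare the two uptimes whenever sysUpTimeInstance looks fresh. The function name classify_restart and its return labels are invented for this example; hrSystemUptime is passed as None for devices that don’t support it.

```python
from typing import Optional


def classify_restart(sys_uptime_ticks: int,
                     hr_uptime_ticks: Optional[int],
                     threshold_seconds: int = 900) -> str:
    """Guess what restarted, given both uptimes in Timeticks (hundredths of a second)."""
    agent_is_fresh = sys_uptime_ticks / 100 < threshold_seconds
    if not agent_is_fresh:
        return "ok"  # the SNMP agent has been up longer than the threshold
    if hr_uptime_ticks is None:
        return "unknown-restart"  # hrSystemUptime unavailable, cannot tell which
    system_is_fresh = hr_uptime_ticks / 100 < threshold_seconds
    # A fresh agent on a long-running system suggests SNMP alone was restarted.
    return "system-reboot" if system_is_fresh else "agent-restart"


print(classify_restart(6000, 3670760))  # agent up 60 s, system up about 10 hours
```

Treating the unknown case as a possible reboot is the cautious choice, since that is the behaviour you get from sysUpTimeInstance alone.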


Side note 2: NNMi 9.x seems to be incapable of graphing the sysUpTime value. It gets the first value, and continues polling after that, but it only displays the delta. Since the initial value and the delta are usually a long way apart, it makes the graph effectively useless. This is a known bug logged in December 2010, but it is still unresolved as of 9.11 patch 3. Harrumph. Surely it’s not that hard to poll a value and just graph what it tells you?

Lindsay Hill

Network Management Consultant
Lindsay (@northlandboy) is a network management consultant, with experience across networks, servers, applications and security. He is CCIE #36708, RHCE, CISSP and HP MASE. More of his own content is at lkhill.com.
  • http://showbrain.blogspot.com Ben Story

    In some equipment, if the equipment is up for more than 496 days the counter will fill and roll back to 0 without a reboot.

    • Guest

       Love those 32bit systimers

    • http://www.brianraaen.com/ Brian Christopher Raaen

      In those cases it does not hurt for an engineer to take a peek at the equipment to make sure there are not problems lurking around the corner.  I hated telneting to a router a little while back and not having commands I was used to using only to do a “show version” and see it was running IOS 11.3

      • Lindsay Hill

        That’s a good point – people used to go on about long uptime systems (most I’ve seen is over 1100 days), but now when I see that I think “old code/unpatched.” I like your way of thinking about it – it’s almost a reminder that you probably need to give that system some love!

  • http://nitinkhanna.com Nitin Khanna

    If you really are into SNMP based implementations, simply set up a computer with a small program to keep plotting, instead of expecting the whole thing from your NMS. That way, you have more control over the info. You can even solve the delta problem by telling the program on your computer (or VM, if you’re into it) to simply add to the base value. It will even give you control over the 496 day roll over…

    • Lindsay Hill

      Agreed that that method is simpler, but y’know, when I paid tens of thousands for a program that says it can do that sort of thing, I’d like it to be able to do it itself, you know?

      MRTG vs NNMi + iSPI is a sort of similar thing – you can create all sorts of immensely powerful reports with iSPI Performance, but when all you want is a simple graph of say link utilisation, MRTG kicks ass, and NNMi/iSPI just doesn’t really offer anything like it.

      • http://nitinkhanna.com Nitin Khanna

        Lindsay, I understand that in a corporate environment a more professional and permanent solution is expected, but as a networking person not afraid of programming, I sincerely believe that such hacks should be part of a network engineer’s life. How else did MRTG come into being?

        • Lindsay Hill

          Yep, a lot depends on scale. For simpler stuff, a quick and dirty way of stringing together a few tools to do something useful is perfect. What you don’t want to do is end up being the only guy who knows how a lot of the hacks hang together!

          There should be more network engineers like you “not afraid of programming” – too many times I see engineers brute force things when a simple little shell script would have saved a good day of repetitive typing…

          • http://nitinkhanna.com Nitin Khanna

            Yup, I agree. I’m still a student and I’ve already seen enough students with me who are good at networking but really afraid of coding. They’ve never even touched command line, let alone shell scripting. That really needs to change. Until then, a few hacks save the day. 

  • http://www.facebook.com/mrbiggles Michael Biggs

    Many devices support the snmpEngineTime OID, which won’t wrap for 136 years. It can be better than the sysUpTime, unless you really want that 100th of a second accuracy that it provides.
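
The rollover mentioned in the comments follows from sysUpTime being a 32-bit count of hundredths of a second. The Python sketch below works out the wrap period and shows one possible way for a small polling script to tell a wrap from a genuine restart: if the new value is roughly where the old value plus the elapsed time would land after wrapping, nothing rebooted. The function names and the 60 second tolerance are assumptions made for this example.

```python
TIMETICKS_MODULUS = 2 ** 32   # sysUpTime is a 32-bit counter
TICKS_PER_SECOND = 100        # Timeticks are hundredths of a second


def wrap_period_days() -> float:
    """How long a 32-bit Timeticks counter runs before rolling over to 0."""
    return TIMETICKS_MODULUS / TICKS_PER_SECOND / 86400


def looks_like_wrap(previous_ticks: int, current_ticks: int,
                    elapsed_seconds: float, tolerance_seconds: float = 60) -> bool:
    """True if a drop in sysUpTime is explained by rollover rather than a restart."""
    if current_ticks >= previous_ticks:
        return False  # the counter did not go backwards, so there is nothing to explain
    expected = (previous_ticks + elapsed_seconds * TICKS_PER_SECOND) % TIMETICKS_MODULUS
    # Compare on the circle so values either side of the wrap point count as close.
    diff = (current_ticks - expected) % TIMETICKS_MODULUS
    circular_diff = min(diff, TIMETICKS_MODULUS - diff)
    return circular_diff <= tolerance_seconds * TICKS_PER_SECOND


print(round(wrap_period_days(), 1))  # a little over 497 days
```

A drop that fails this test, for example from 10 hours of uptime to 60 seconds between two polls 5 minutes apart, would then be treated as a restart.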