5 Tips for Escaping Troubleshooting Hell

We’ve all been there: it’s 5am and you’re up to your humerus in a network failure that no amount of coffee and swearing seems to fix. People are hovering at your shoulder, presumably to observe the ferocious look of concentration on your face. How do you escape this 3rd Circle of Troubleshooting Hell and ride the Unicorn of Success to victory? Here are five tips to help you escape with your sanity (mostly) intact.

  1. Be methodical. Start at the point closest to the most likely source of the problem and step through the possible trouble points one by one.
  2. Draw diagrams of the problem, and write down the steps you’ve already tried. Whiteboards are useful for this, especially when you’re working in a team, but a notebook is also a good idea. Writing down what you’ve already tried stops you going into a despair cycle, trying the same thing over and over again. Keep your notes and notebooks for the next time you have to troubleshoot the same system, too. I’ve found mine invaluable for solving similar faults years later.
  3. Eat something decent and keep drinking (preferably not whisky – no matter how tempting it might be). Your brain needs to be refueled regularly, especially if you haven’t had much sleep, so make time for a proper meal and keep yourself hydrated.
  4. Get a human shield. People hanging around your desk asking for updates every 20 seconds don’t help matters. Find someone to deflect them and deploy them in a tactical position to intercept wingnuts. This leaves you free to concentrate on actually fixing the problem at hand. If you must, give them a title such as “Meatshield Puppet-Master” to make them feel important.
  5. Recognise when you’re not making progress. Sometimes you’ll just run out of ideas. At this point, you can continue to bang your head off the wall, or you can take 5 minutes to get some fresh air and give your subconscious brain a chance to work on the problem in the background. If the problem isn’t critical, but you’re still having problems solving it, try sleeping on it. Just don’t do what I’ve done in the past and lurch upright at an unreasonably anti-social time of night shouting the network engineer’s version of “Eureka!”. Trying to explain to an irate spouse why you’re making so much noise can be tricky, especially if they “don’t do geek-talk.”
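
For anyone who prefers a text file to a notebook, tip 2 can be sketched in a few lines of Python. This is only an illustration, not a recommended tool: the `TroubleshootingLog` class, its method names, and the sample entries are all made up for this example. It records each step with its result and tells you when you are about to repeat yourself.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Attempt:
    action: str
    result: str
    when: datetime = field(default_factory=datetime.now)


class TroubleshootingLog:
    """Hypothetical helper: records what was tried and flags repeats."""

    def __init__(self):
        self.attempts = []

    @staticmethod
    def _normalise(action):
        # Collapse case and whitespace so near-identical entries match.
        return " ".join(action.lower().split())

    def already_tried(self, action):
        """Return every earlier attempt that matches this action."""
        key = self._normalise(action)
        return [a for a in self.attempts if self._normalise(a.action) == key]

    def record(self, action, result):
        """Store the attempt; return True if this action was tried before."""
        repeat = bool(self.already_tried(action))
        self.attempts.append(Attempt(action, result))
        return repeat


# Sample entries are invented for illustration.
log = TroubleshootingLog()
log.record("Bounce uplink on core switch", "no change")
log.record("Check firewall rule order", "looks correct")
if log.record("bounce  uplink on core switch", "no change"):
    print("Already tried that one: time for a new idea or a break.")
```

The matching is deliberately crude (case and spacing only), so it catches an exact repeat but not a reworded one. A paper notebook has the same limitation.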

I think these are fairly common sense ways to try and make yourself as effective and efficient as possible, but they’re easy to forget when you’re under pressure to fix a fault quickly.

Over to you guys – what have I missed? What was your worst troubleshooting experience? Mine involved a certain firewall vendor, a failed “upgrade”, and a 26-hour working day.

About Neil Anderson

Neil is a freelance network security architect and contractor working with a number of clients in Scotland and Europe. He is CCIE #18705 and also holds a CISSP. He can often be found sampling beer in remote locations and ranting about tech to anyone too stupid to run away. If you're very unlucky, he may talk to you in Gaelic.

Neil can occasionally be found on Twitter.

  • http://twitter.com/networkstatic Brent Salisbury

    Great read Neil. Awesome pointers for when we are traveling through the trough of dissolution. I really like #5; it is really bizarre that you can stare at something for two hours straight and get the same result from a 5 minute break. One thing I tend to preach to junior engineers is to be open minded about the problem. People who walk in and tell the customer or a systems counterpart that there is no way this is a network problem tend to end up with pie on their face. Let’s face it, we are guilty until proven innocent.

    Great post. These are golden advice topics for the community.

    Respect,
    -Brent

  • http://twitter.com/FlorianHeigl1 Florian Heigl

    For 3) I’ve made the rule 1 liter of water per coffee; that’s easier to get right than X per day or X per hour since you might not be able to keep track of time beyond some point :)

    Also, go and have a break before starting long-running measures (e.g. prepare your full restore, take a break, then start the restore). Numerous times I’ve come back from that short smoking break having either come up with a restore-less fix or remembered some tiny super-important detail to note before restoration.

    Otherwise you’ll also keep pushing off till “you’re done” or “the backup is running” etc. often meaning you never take a time out. Then you usually end up ruining things :)

  • http://twitter.com/digital_dave_ David Rutledge

    Nice post! When t-shooting I like to grab a coworker and brainstorm out loud what could logically be causing the results we are seeing. Even if they can’t provide any specific help, the process of bouncing ideas off them may get you your ‘eureka’ moment.

  • http://www.packetu.com/ Paul Stewart

    Number 4 can be incredibly important. I remember being in a few high stress situations over the years. My old boss would sometimes come along to “run blocker” while I worked through the issue. That is incredibly helpful for keeping a clear thought process. That ultimately makes you a much more productive troubleshooter.

  • Will

    Set up a private dial-in and webex for personally selected people. Large scale outages cause program, project, and people managers to try and ‘take over’ the session, often pushing out the duration of troubleshooting.

    For massive outages I also suggest opening a service case with the vendor immediately. No one is too good for help and management probably expects you to use that avenue for assistance anyway. I used to think, screw that, I can fix this on my own, but TAC is just as good a human shield as anyone (even if your help ends up being the B-team guy).

  • Tony Mattke

    This technique works quite well. I often grab a Jr. Engineer and use him or her as a rubber duck. They can at least show interest and possibly provide questions. :)

  • Andis Kakeli

    Great points Neil. Couldn’t agree more with you on number 4.
    I wasn’t fond of the idea of calling TAC, but my recent experiences with them have proved me wrong and I would not waste the chance to work with them whenever possible.
    As for what you may have missed, a brilliant colleague of mine once said, “you cannot fix what you don’t understand.” Isn’t that an absolute?

  • http://profiles.google.com/osbjmg John Gill

    I thought I should add that the notion, popular with project-manager types, that troubleshooting work can be done “in parallel” is usually a myth.

    If there are changes to be made, make 1 change and test, and repeat as necessary.
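
The "one change, test, repeat" loop from the last comment can be written down as a small Python sketch. Everything here is hypothetical: the function name, the `(name, apply, rollback)` tuple shape, and the toy `state` dictionary are made up for illustration. Rolling back a change that didn't help is an added assumption, not something the comment asks for, but it keeps each test isolated.

```python
def apply_one_at_a_time(changes, test):
    """Apply each change alone, test after it, stop at the first that works.

    `changes` is a list of (name, apply, rollback) tuples and `test` is a
    callable that returns True once the fault is cleared. All hypothetical.
    """
    history = []
    for name, apply, rollback in changes:
        apply()
        fixed = test()
        history.append((name, fixed))
        if fixed:
            return name, history
        rollback()  # assumption: undo so the next change is tested alone
    return None, history


# Toy stand-in for a device: the "fault" clears once duplex is full.
state = {"mtu": 1400, "duplex": "half"}


def fault_cleared():
    return state["duplex"] == "full"


changes = [
    ("raise MTU",
     lambda: state.update(mtu=1500),
     lambda: state.update(mtu=1400)),
    ("set full duplex",
     lambda: state.update(duplex="full"),
     lambda: state.update(duplex="half")),
]

fix, history = apply_one_at_a_time(changes, fault_cleared)
print(fix, history)
```

Because only one change is live at any time, the history shows exactly which change cleared the fault, which is the information you lose when several people change things in parallel.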