Want to be a better engineer? Learn to troubleshoot
After many years working in IT, I have noticed a few traits that generally separate a good engineer from a great one. One of those traits is troubleshooting. Those of you who have been doing this a long time will hopefully agree. The purpose of this post is not to teach you to troubleshoot. That would take a lifetime…or three to fully master. The goal is to stress the importance of this process for those of you who want to advance your career.
First off, let’s define a troubleshooter:
“a skilled worker employed to locate trouble and make repairs in machinery and technical equipment”
Troubleshooting is something I put a lot of value on when interviewing people for engineering positions. I hold it in even higher regard if that person is looking for a senior-level position on the team. I am sure most of you reading this post are from the IT engineering field. However, you can pretty much map this skill set to any job, in any vertical, that involves working on “things”.
Troubleshooting is a process, not a skill. In modern networks and systems it is rarely the same exact, repeatable set of steps. Sure, there are some things you can reuse from case to case. However, there is no single way to troubleshoot. What matters is the process you follow.
What makes a good troubleshooter?
Passing a test does not immediately make you a good troubleshooter. This skill set is best learned in the field. I can’t tell you how many times I have RTFM, only to end up out of the lab and in the real world, where things take a rapid turn and a quick resolution is needed.
Now, I am not saying that all troubleshooting experience has to come from trial and error, or from throwing ideas against the wall and seeing what sticks. Most of us in this industry have probably built our own computers, for example. Remember when your monitor would not display an image? What did you do to resolve the issue? Most likely you started by checking that the monitor was on, the cable was plugged in, the cable was good, the drivers were good, and so on.
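If it helps to see that habit written down, here is a minimal sketch of the monitor example as an ordered list of checks, simplest first, that stops at the first one that fails. Everything in it is made up for illustration: the `state` dictionary stands in for what you would actually observe at the desk, and the check names and `first_failing_check` are hypothetical, not part of any real tool.

```python
from typing import Callable, List, Optional, Tuple

# A check is a human-readable name plus a function that returns True when it passes.
Check = Tuple[str, Callable[[], bool]]


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the name of the first one that fails.

    Returns None when every check passes.
    """
    for name, check in checks:
        if not check():
            return name
    return None


# Hypothetical observations; in real life you would look, not read a dict.
state = {
    "monitor_on": True,
    "cable_plugged_in": True,
    "cable_good": False,
    "drivers_good": True,
}

# Ordered from the simplest thing to verify to the most involved.
checks: List[Check] = [
    ("monitor is on", lambda: state["monitor_on"]),
    ("cable is plugged in", lambda: state["cable_plugged_in"]),
    ("cable is good", lambda: state["cable_good"]),
    ("drivers are good", lambda: state["drivers_good"]),
]

failed = first_failing_check(checks)
if failed is None:
    print("All basic checks passed, time to dig deeper")
else:
    print(f"Start looking here: {failed}")
```

The ordering is the point, not the code: the cheap checks go first so you never spend an hour on drivers while the cable is bad.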
I personally feel that people who are naturally inquisitive pick up troubleshooting skills more easily than those who are not. I have seen this in the real world with engineers many times. Those who want to know how and why things work often have a better natural ability to take a problem, break it down into smaller, manageable chunks, and suggest troubleshooting steps.
What are some steps to get better at troubleshooting?
Ask a lot of questions. You have heard people say there are no dumb questions? That’s 110% true. If you are in the early stages of your IT career, ask your senior-level engineers and architects if you can shadow them or help out with projects.
Never be satisfied with “It just started working.” Sure, some problems can be nearly impossible to track down. Some problems might also be related to software bugs or, even more rarely, a hardware failure. However, you should have a bit of curiosity as to what really happened. Think of it like closure in the IT world :). Let’s say you are more of a junior-level engineer and the issue gets escalated to a senior resource. Go back and ask the person who solved the problem to explain what he or she did to solve it. This will serve you tremendously in the future.
Take a position at some point in your career where you have to help people remotely. Working in a NOC, or trying to help a co-worker without being there, will really force you not only to troubleshoot but also to ask questions (very specific questions) that help the process of elimination. This type of work will also teach you patience, something that comes in handy during a 3am network-down problem as you are working through the troubleshooting process.
Troubleshooting in the real world
All problems are always due to the most complex failure, right? FALSE!! Oftentimes it’s the simplest of things that are causing the issues. One of the best questions to ask when troubleshooting something is “What was the last thing that changed?” I have seen engineers become convinced that it must be something else when in fact the issues were directly related to the last thing that changed.
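For those who like to see the idea as code, here is a rough sketch of the “what changed last?” question, assuming you keep some kind of change log. The `Change` record, the `recent_changes` helper, and every entry in `change_log` are invented for this example; they are not taken from any real change-management system.

```python
from datetime import datetime
from typing import List, NamedTuple


class Change(NamedTuple):
    when: datetime
    description: str


def recent_changes(changes: List[Change], incident_start: datetime, limit: int = 3) -> List[Change]:
    """Return the changes made at or before the incident, most recent first."""
    prior = [c for c in changes if c.when <= incident_start]
    return sorted(prior, key=lambda c: c.when, reverse=True)[:limit]


# Made-up change log and incident time, purely for illustration.
change_log = [
    Change(datetime(2024, 1, 10, 9, 0), "Upgraded switch firmware"),
    Change(datetime(2024, 1, 10, 13, 30), "Re-patched fiber uplink to core"),
    Change(datetime(2024, 1, 10, 16, 0), "Added a new VLAN"),
]
incident_start = datetime(2024, 1, 10, 14, 0)

suspects = recent_changes(change_log, incident_start)
for change in suspects:
    print(f"{change.when:%Y-%m-%d %H:%M}  {change.description}")
```

The VLAN change is left out because it happened after the trouble started, and the re-patch comes out on top as the first thing to look at. That is all the question really does: it gives you an ordered list of suspects before you start reading configs line by line.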
Let’s walk through a quick example of overcomplication
Engineers are building out some new network switches in a data center, cranking away on the configs, bringing up routing, VLANs, etc. As the hours pass, some of the switches are not passing traffic to the core. The engineers spend hours checking over the configs line by line, trying to verify ARP entries and looking at route and MAC address tables. They try rebooting the switches, with no luck. Later in the night they try swapping the SFP-based optics around with other switches. They even get desperate enough to try a different fiber patch cable and to check whether light is being sent over the fiber. Finally, after several hours of hair pulling, they call an engineer who has never seen the design, the config, or anything else, and he asks one question: “Are the switches directly connected?” The guys on site know exactly what is meant by that. Shortly afterwards the team on site switches the polarity of the fiber cable and, like magic, it’s up! This is just a reminder not to ignore the simple things. I am sure we all have dozens of stories like this.
Having a troubleshooting methodology that works through the network layers in order is a good starting point to work from. As your career progresses you will pick up all kinds of little tips and tricks that you can use and, hopefully, share and pass on to new members of your team. I hope this post inspires you to take a look at how you currently troubleshoot your networks, and not to overthink things too much.