A not-especially-new challenge facing network engineers is remote management: how do you make sure you’re always able to manage gear that’s farther away than a quick ride in the car? Even smaller networks can have a global spread, making this problem common. Here are a few scenarios I’ve faced recently, where I had to think through what I was doing so as to avoid cutting off my access to remote devices.
- Updating the public-facing IP address of a remote firewall, when my normal access to that firewall was via the external address I was changing.
- Changing the tunnel peer IP of a site-to-site VPN device, when my normal access to that VPN router was via one of the tunnels whose parameters I was changing.
- Updating a public BGP scheme for a site, when knocking down the BGP session meant knocking down my normal means of accessing the routers I needed to work on.
- Managing a remote firewall whose anti-spoofing and routing tables needed tweaking after a new path to a remote network was introduced.
There are many answers to these and similar scenarios. Some of them cost money. Some of them come with experience…almost every engineer has a war story that includes a phrase like, “I hit enter, and then the console stopped responding. I felt like I was gonna throw up.” Yeah. My personal favorite is from my early days as a packet pusher, when I typed “debug ip packet” on a campus border router. The console quit responding, and then that section of the campus went red on the management station. “Uh, I’ll be right back. Gotta go reboot the router across town.” Ah, the good ol’ days. Pardon my digression.
So what ARE the strategies for making sure you don’t lose access to that device an ocean away while you’re working on it?
- Good documentation. Your best defense against doing something stupid is making sure you know what’s going on at that remote site. Make a detailed diagram, modeling every link, every IP address, every VLAN number, WAN circuit IDs, and anything else that could possibly be related to your project. I’m not suggesting you have to manually diagram every access port, but router and switch interconnects are a must, as are all in-path devices: firewalls, IDS/IPS, VPN concentrators, ISP border routers, load-balancers, WAN optimizers, and the like. Your documentation should also cover the physical labels on remote devices, which will aid folks at the site if you need them to do something for you.
- Do your planning while you’re awake. Don’t put yourself in the position of a 2am maintenance window where you have to plan out the details of your work. At 2am, you’re probably tired, and Chargers Chocolate Espresso Beans are not going to bring clarity to your addled brain. Do your planning ahead of time. By “planning”, I don’t mean that you should merely throw a quick task outline on your whiteboard or in a text document, although that’s a start. I mean write out every step and every bit of code that goes with it, and then have a trusted co-worker sanity check you. No trusted co-workers (or even untrusted ones)? Take a potential acolyte, throw ’em in a conference room, and explain your plan to them. Use the whiteboard, but do not sniff the markers excessively. Gesticulate wildly. You might find that talking through the plan, even with someone who doesn’t know enough to second-guess you, could reveal a fatal flaw.
- Make sure you’re on the right device. I happen to use a tabbed console program to manage my gear. If I’m not careful, I can paste code into the wrong terminal window or make changes to the wrong firewall policy because at a glance, THEY ALL LOOK ALIKE. Sometimes, devices (especially redundant pairs) have very similar names, which can get confusing as you go back and forth between them. Some tricks I have used to keep myself straight in the midst of a change include using different backgrounds, different font and color combinations, opening certain devices with a read-only account to prevent inadvertent changes, and temporarily renaming devices. Being able to refer back to your awesomely detailed network diagram can also help clear away confusion.
- Understand exactly what your next command will do. Ask the right questions and answer them in your mind before committing. By that, I mean that you know that you know that you know what’s going to happen when you hit enter. This is especially crucial when making a configuration change that will impact the routing table of the far-away router you’re working on (and potentially other devices). For example, I recently had an uh-oh moment when I shut down the BGP session of a non-important router I was prepping for future service. When I shut down the BGP neighbor, I lost the advertised default route. When I lost the default route, the remote router didn’t know how to get back to me. That was a silly mistake, made because I was in a hurry, and I know better. I just wasn’t thinking about it before I shut down the BGP session. It happens. Don’t let it happen to you.
- Think through your security scheme. Border routers often have their VTY lines and/or interfaces protected by access lists. Such an access list is generally made up of known hosts or networks from which management traffic should originate. Often not included in these border ACLs are the device’s connected networks, as including them makes standardization of the management ACL difficult. In practice, that means that even though a troubled device you’re working on might be physically accessible via an adjacent device, your access list could stop you from taking advantage of that adjacency. That’s a bummer if jumping from a device you can reach to the one you otherwise can’t would have saved you.
- Someone on site. Most errors can be fixed with a power cycle, assuming you didn’t commit broken code to the NVRAM startup configuration first. A human can cover a power cycle for you. Flesh and bone might also be able to connect a console cable to a system you can RDP or SSH to at the remote site, allowing you to undo whatever you did to kill the device in question via serial connection. Hey, it’s a little embarrassing, but better than a snail-mail repair.
- Scheduled reload. A lot of folks like to schedule a Cisco device reload before a significant change via “reload in X”, where X is the number of minutes until the device will reboot itself. That way, if you make a mistake, you just have to wait for the device to reboot itself and load the startup config that had been working, giving you another shot. This only helps if you haven’t saved the broken change to the startup config, and if the change goes well, remember to cancel the pending reload (“reload cancel”) before the timer runs out.
- Remotely managed power strips. Some fancy and usually expensive power strips allow you to cycle power to a specific socket, which could potentially allow you to remotely reboot a device you’ve bricked.
- The road less traveled. Multiple paths to the same network give you options you don’t otherwise have. Those additional paths could be in the form of a cheap Internet line you only use for backdoor access, an out-of-band network, a terminal/console server, dial-up, or allowing temporary access to a device interface you would normally not use for management.
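To make the “understand exactly what your next command will do” point a bit more concrete, here is a rough sketch in Python of the question I should have asked before shutting down that BGP neighbor: once the routes learned from that neighbor are gone, does the remote router still have a route back to my management station? Everything in it is hypothetical: the routing table, the addresses (documentation ranges), the `source` labels, and the function names are made up for illustration, and it only models a simple longest-prefix match, not what any real router does internally.

```python
import ipaddress


def best_route(routes, destination):
    """Return the longest-prefix match for destination, or None if no route covers it."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)


def return_path_survives(routes, mgmt_station, source_going_away):
    """Check whether a route back to mgmt_station remains after one route source is removed."""
    remaining = [r for r in routes if r["source"] != source_going_away]
    return best_route(remaining, mgmt_station) is not None


# Hypothetical routing table of the remote router (an assumption, not real output).
routes = [
    {"prefix": "0.0.0.0/0", "source": "bgp 203.0.113.1"},   # default learned via BGP
    {"prefix": "198.51.100.0/30", "source": "connected"},   # link to the ISP
    {"prefix": "10.20.0.0/16", "source": "static"},         # internal network
]

# A management station reachable only through the BGP-learned default route.
print(return_path_survives(routes, "192.0.2.50", "bgp 203.0.113.1"))  # False

# A management station covered by a static route is unaffected.
print(return_path_survives(routes, "10.20.5.5", "bgp 203.0.113.1"))   # True
```

The value is less in the script than in the habit: write down which routes disappear with the change, and check your own return path against what’s left before you hit enter.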
Do you have a favorite trick or technique you’d like to share to keep your remote devices accessible during risky changes?