This article is Part 2 in the 6-part series “The Bulletproof Maintenance Window”. For the rest of the story, see the links at the bottom of the page.
Many network changes have gone south simply because the person doing the configuration didn’t sit down to visualize the results of their changes first. Here are some examples of questions to ask yourself about the work you are planning:
- What happens to this node when you change its loopback address?
- What happens to the IGP in this and other parts of the network when you change that cost value?
- Which monitoring systems will think the sky is falling when you remove that old SNMP string?
- How many customers will experience a service interruption during the normal course of the maintenance? If it goes sideways, how many customers could be affected, and for how long (i.e., what is the blast radius)?
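One way to put a rough number on the blast radius question is to walk a dependency map outward from the nodes you plan to touch. The sketch below is only an illustration under stated assumptions: the `downstream` and `customers` dictionaries, the node names, and both function names are hypothetical stand-ins for whatever inventory data you actually have.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def direct_impact(touched: Iterable[str], customers: Dict[str, Set[str]]) -> Set[str]:
    """Customers attached to the nodes being configured (the planned impact)."""
    affected: Set[str] = set()
    for node in touched:
        affected |= customers.get(node, set())
    return affected


def blast_radius(
    touched: Iterable[str],
    downstream: Dict[str, List[str]],
    customers: Dict[str, Set[str]],
) -> Set[str]:
    """Customers on the touched nodes plus everything that depends on them.

    `downstream` maps a node to the nodes that rely on it (an assumed
    inventory format). A breadth-first walk collects every dependent node.
    """
    seen: Set[str] = set()
    queue = deque(touched)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # guards against loops in the dependency map
        seen.add(node)
        queue.extend(downstream.get(node, []))
    return direct_impact(seen, customers)


if __name__ == "__main__":
    # Hypothetical topology: agg1 feeds acc1; both carry customers.
    downstream = {"core1": ["agg1", "agg2"], "agg1": ["acc1"]}
    customers = {"agg1": {"A"}, "acc1": {"B", "C"}, "agg2": {"D"}}
    print(sorted(direct_impact(["agg1"], customers)))
    print(sorted(blast_radius(["agg1"], downstream, customers)))
```

The gap between the two results is the difference between "who we expect to affect" and "who we could affect if this goes sideways", which is worth stating explicitly in the plan.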
It’s critical to think through all of the details of your procedures before doing the actual work. Phil Gervasi has a great post on the subject on his blog. He gets into some good specific examples that illustrate the kind of detail-oriented thinking you need to be doing if you want to be successful in this area.
Write it down
Put your plan in a runbook or MOP (Method of Procedure) document. Take the time to do some writing at the beginning of the document to communicate context. Then, build your plan as a set of sequential steps, with configuration, validation, and rollback procedures. Remember to plan to clean up after yourself – add new devices to monitoring systems, update text description fields in the configurations, things like that.
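If it helps to think of the runbook as structured data, the shape described above (context first, then sequential steps that each carry configuration, validation, and rollback, plus cleanup) could be sketched like this. This is a minimal illustration, not a prescribed format: the `Step` and `Runbook` classes and their field names are assumptions made for the example, and the command strings are IOS-style placeholders to be replaced with your platform's syntax.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One step in a MOP: what to configure, how to verify it, how to undo it."""
    description: str
    config: List[str]     # exact commands to apply
    validate: List[str]   # show commands that confirm the step worked
    rollback: List[str]   # exact commands that reverse the step


@dataclass
class Runbook:
    title: str
    context: str                                       # why the change is being made
    steps: List[Step] = field(default_factory=list)
    cleanup: List[str] = field(default_factory=list)   # monitoring, descriptions, etc.

    def rollback_plan(self, completed: int) -> List[str]:
        """Rollback commands for the first `completed` steps, most recent step first."""
        commands: List[str] = []
        for step in reversed(self.steps[:completed]):
            commands.extend(step.rollback)
        return commands

    def missing_pieces(self) -> List[str]:
        """Flag steps that lack a validation or rollback procedure."""
        problems: List[str] = []
        for number, step in enumerate(self.steps, start=1):
            if not step.validate:
                problems.append(f"step {number}: no validation")
            if not step.rollback:
                problems.append(f"step {number}: no rollback")
        return problems


if __name__ == "__main__":
    runbook = Runbook(
        title="Raise OSPF cost on one uplink",
        context="Shift traffic away from the uplink ahead of a later change.",
        steps=[
            Step(
                description="Raise the OSPF cost on the uplink",
                config=["interface GigabitEthernet0/1", "ip ospf cost 100"],
                validate=["show ip ospf interface GigabitEthernet0/1"],
                rollback=["interface GigabitEthernet0/1", "ip ospf cost 10"],
            ),
        ],
        cleanup=["Update the interface description"],
    )
    print(runbook.missing_pieces())   # an empty list means every step is complete
    print(runbook.rollback_plan(completed=1))
```

Undoing steps in reverse order is a design choice for this sketch; some changes need a different rollback sequence, and the written plan should say so.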
Your plan should be in a document that is accessible to anyone who is involved with the change. In Part 3 – Peer Review, we’ll spend more time on how these documents can build the effectiveness of your peer review process. In general, the written plan should include the following categories of information:
- Why you are doing the change. This comes from the context that you gathered as discussed in Part 1 – Get and Communicate Context.
- Which customers are affected by the change, how they are affected, and how long they can expect to be without service or with partial service.
- Which personnel resources will be needed (e.g., will you need to have a datacenter facilities person on site to help move or run cables?).
- Which nodes/devices will be affected.
- Which nodes will be directly configured.
- Exact configuration commands for each step in the procedure.
- How the control, data, and management planes in each configured device will be modified by the change. Some examples of this are:
  - How IGP neighbor states will be modified.
  - How signaling protocols like RSVP or LDP will be modified.
  - How middleboxes in the network, such as firewalls and load balancers, will respond to the changes.
  - How the node itself will respond to all of these changes (e.g., will you lose SSH access and have to plan to do some configuration work via the console?).
- How applications will be affected. Some examples of this are:
  - Will you fail over to a secondary device before changing the primary device? If so, is the application resilient to these changes in state?
  - How will application response times change as a result of the new configuration?
  - What kind of impact do you expect as a result of draining traffic from one side/site/datacenter before you do the actual work?
- Specific plans for how to validate that all of the above are healthy before the end of the maintenance window.
  - A specific set of show commands that are to be run on every node is one example.
- Decision criteria to help determine if the change was successful.
- Specific plans for rolling back the configuration, in case of problems that cannot be solved within the allotted time of the maintenance window.
- Specific plans for how to validate that the backout/rollback plan has successfully reversed all of the changes and returned the system to a healthy state.
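The validation and rollback items above lend themselves to a before/after comparison: capture the same checks on every node before the change, capture them again afterwards, and treat any difference you did not predict as a failure. The sketch below assumes you have already reduced your show-command output to simple values; the `Snapshot` shape, the check names, and the function names are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Tuple

# Assumed shape: {node: {check_name: observed_value}}
Snapshot = Dict[str, Dict[str, str]]


def unexpected_differences(
    before: Snapshot,
    after: Snapshot,
    expected_changes: Dict[Tuple[str, str], str],
) -> List[str]:
    """Compare two snapshots and report anything that is not as planned.

    `expected_changes` maps (node, check) to the value the plan says the
    check should have after the change. Every other check must be unchanged.
    """
    failures: List[str] = []
    for node in sorted(before.keys() | after.keys()):
        pre = before.get(node, {})
        post = after.get(node, {})
        for check in sorted(pre.keys() | post.keys()):
            want = expected_changes.get((node, check), pre.get(check))
            got = post.get(check)
            if got != want:
                failures.append(f"{node} {check}: expected {want!r}, got {got!r}")
    return failures


def change_succeeded(before: Snapshot, after: Snapshot,
                     expected_changes: Dict[Tuple[str, str], str]) -> bool:
    """Decision criterion for this sketch: no unexpected differences at all."""
    return not unexpected_differences(before, after, expected_changes)


def rollback_succeeded(before: Snapshot, after_rollback: Snapshot) -> bool:
    """A rollback is healthy only if every check matches the pre-change state."""
    return not unexpected_differences(before, after_rollback, {})


if __name__ == "__main__":
    before = {"r1": {"ospf_neighbors": "3", "ldp_sessions": "2"}}
    after = {"r1": {"ospf_neighbors": "3", "ldp_sessions": "1"}}
    planned = {("r1", "ldp_sessions"): "1"}   # the plan predicted this change
    print(change_succeeded(before, after, planned))
    print(unexpected_differences(before, after, {}))
```

Writing the expected post-change values into the plan ahead of time is what turns "looks fine" into a decision that another engineer could make without you.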
The “gold standard” for these documents is this: Could you build this document and then hand it to another engineer at roughly your level of skill, and have them be successful in implementing the change without your involvement?
In Part 3, we’ll tackle Peer Review.