In Part 1 and Part 2 we covered everything from initial device provisioning to the continuous deployment of changes to our network. We ended with a very simplistic example that barely scratched the surface of the corner cases and scaling challenges that can be encountered along the way. From a process perspective, however, now that we have this shiny new toy that’s going to fully overhaul how our team operates the network, what did we accomplish here?
- Single source of truth/management – Having all device changes pushed through a centralized pipeline introduces a level of configuration uniformity and confidence that just isn’t present in today’s enterprise networks.
- Reproducible templates and consistency – Once a new device is deployed (or an existing one is converted to a template), gone are the days of finding that one device or one port that is not configured up to your standard. This design also heavily encourages building your configurations in a cookie-cutter fashion. When deploying a new device, the amount of non-standard config you might be thinking of pushing is directly proportional to the number of new templates you’ll have to write, which should discourage you from going in the unique-snowflake direction as much as possible (see the template sketch after this list).
- Human error – Sure, there is still plenty of room for human error in this process, but I’d argue that the added visibility and consistency decrease the chance of fat-fingering a change.
- Vendor independence – Let’s say we want to replace an access switch with one from a different vendor. Since we’re just reading in variables from a YAML file, the only work left would be to write the new vendor’s Jinja2 templates (a one-time, per-vendor scope of work), as sketched after this list.
- Repeatable deployment process – As I’ve already mentioned a few times in this exercise, we’re not writing a one-off script every time we need to push a change. We’re running the exact same Ansible playbook, which pushes only the necessary changes to the affected devices. I can’t stress enough how important this level of consistency is to the confidence of the change process (hello, ITIL).
- Speed of deployment – We might not see a huge bump when deploying a single device, but if you’re introducing a new environment or even a new datacenter, the difference will be very visible.
- Uniform underlay/overlay management – We’re already managing our cloud overlay network configurations in this pipeline manner, so getting the underlay under the same process is a win.
- Compliance – Since we’re tackling configuration uniformity at the deployment layer, our confidence in having all of our devices compliant at any given time is high.
- Change control – If a certain weekend change goes south and you need to quickly identify precisely what changed, wouldn’t having a specific git pull request to reference in that change ticket be a lifesaver? In terms of peer review, whether or not that is part of your change control process today, it’s comforting to know that someone literally can’t bypass your approval process.
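To make the template and vendor-independence points concrete, here is a minimal, hypothetical sketch: the device data lives in a YAML vars file, and a single play renders it through whichever vendor’s Jinja2 template applies. The group name, the `network_os` variable, the template paths, and the variables themselves are illustrative placeholders, not something from the earlier parts of this series.

```yaml
# group_vars/access_switches.yml -- illustrative device data only
vlans:
  - { id: 10, name: users }
  - { id: 20, name: voice }
uplinks:
  - Ethernet49
  - Ethernet50
```

```yaml
# render_configs.yml -- the same play works for any vendor; only the
# Jinja2 template, selected via the hypothetical "network_os" variable, changes
- name: Render access switch configurations
  hosts: access_switches
  gather_facts: no
  connection: local
  tasks:
    - name: Build the full config from the vendor-specific template
      template:
        src: "templates/{{ network_os }}/access_switch.j2"
        dest: "compiled/{{ inventory_hostname }}.cfg"
```

Swapping vendors then means adding a new `templates/` subdirectory and flipping `network_os` for the affected hosts; the YAML data stays exactly the same.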
Challenges & Concerns:
- Testing and staging – To fully embrace CI/CD principles, the ability to deploy fast and often is predicated on a particular change having been fully tested and vetted by the time it hits your production pipeline. Unfortunately, the network stack is still well behind in this area. One way to move in the right direction is to stage your changes in a virtual equivalent of your environment (or parts of that environment). Another approach, which isn’t exclusive of the former, is to encourage a scripted checkout and unit testing for every change (see the checkout sketch after this list).
- Failure at scale – This is a predictable concern with automation as a whole. Instead of copying and pasting what I wrote above under the Human Error section, I’ll say that you constantly need to be cognizant of this particular issue when reviewing changes. Trying to get ahead of “can this particular change do some serious damage at scale?” will go a long way (easier said than done).
- Declarative state functionality – Many Ansible network modules do not currently give you the ability to declare a device’s end-state config and automatically add and remove the statements needed to reach that state. They work in the forward direction, adding the necessary commands, but lack the intelligence to go in the reverse direction and remove the statements that shouldn’t be there. This poses a problem for some of the logic we introduced in our process and warrants more research into other tools that take a “config replace” approach (e.g., NAPALM); see the config-replace sketch after this list.
- Inventory file size – Since we’re relying on Ansible’s idempotency to poll every device config and decide whether that config needs to change on every successful git pull, these playbook runs can take a very long time. I found that separating out the main playbooks on a per-environment basis helps decrease these run times (see the inventory sketch after this list). Additionally, if from a risk perspective you would rather not touch certain devices at all (at least while you gain confidence in the process), this type of segregation helps.
- Bugs at scale / Open source support model – In my experience, when you deploy a tool like Ansible at scale in a medium or large network, bugs are unavoidable. Having a vendor to call for support and having to solve the problem via open source are two very different support models. There is a certain helplessness factor with the open source approach; however, I can easily argue that some of that uncertainty is present with traditional vendor support as well. The question is how will (or should) your team deal with that? From a business perspective, are you interested in having your team contribute fixes to the open source project themselves?
- Staffing implications – It’s no secret that this type of process change will fully overhaul the dynamic of your team. Can you rely on training alone to beef up your team’s knowledge? Going forward, hiring qualified network engineers that have a sprinkle of system/development knowledge is going to become harder and more expensive, so how does that fit into your plans?
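As a rough idea of what a scripted checkout could look like, here is a minimal sketch that runs a show command after a change and asserts on the output. It assumes a Cisco IOS device reachable over the `network_cli` connection and uses the `ios_command` module; the group name, command, and assertion are purely illustrative.

```yaml
# post_change_checkout.yml -- illustrative post-change validation
- name: Verify OSPF adjacencies after the change
  hosts: core_routers
  gather_facts: no
  connection: network_cli
  tasks:
    - name: Collect OSPF neighbor state
      ios_command:
        commands:
          - show ip ospf neighbor
      register: ospf

    - name: Fail the run if the adjacency is not FULL
      assert:
        that:
          - "'FULL' in ospf.stdout[0]"
        msg: "OSPF adjacency check failed on {{ inventory_hostname }}"
```

Wiring a playbook like this into the same pipeline that pushed the change gives you an automated pass/fail signal instead of a manual eyeball check.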
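On the declarative-state gap, a “config replace” workflow pushes a complete rendered configuration and lets the tool compute which lines to add and which to remove. Below is a minimal sketch using the `napalm_install_config` module from the napalm-ansible project; the credential variables are placeholders and the exact parameter names may differ between module versions.

```yaml
# config_replace.yml -- illustrative full config replacement via NAPALM
- name: Replace the device configuration with the rendered config
  hosts: access_switches
  gather_facts: no
  connection: local
  tasks:
    - name: Load the candidate config and commit it as a full replace
      napalm_install_config:
        hostname: "{{ inventory_hostname }}"
        username: "{{ napalm_user }}"
        password: "{{ napalm_password }}"
        dev_os: "{{ network_os }}"
        config_file: "compiled/{{ inventory_hostname }}.cfg"
        replace_config: true
        commit_changes: true
        diff_file: "diffs/{{ inventory_hostname }}.diff"
```

The diff file it leaves behind also doubles nicely as the “what exactly changed” artifact for the change ticket.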
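One way to implement the per-environment split is to keep one inventory file per environment and scope each run to a group within it. The layout, group names, and the `site.yml` playbook name below are just placeholders for illustration.

```yaml
# inventories/production.yml -- illustrative YAML inventory, one file per environment
all:
  children:
    dc1_access:
      hosts:
        dc1-access-01:
        dc1-access-02:
    dc1_core:
      hosts:
        dc1-core-01:
```

A run then looks like `ansible-playbook -i inventories/production.yml site.yml --limit dc1_access`, which keeps both the runtime and the blast radius of any single run down.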