This is part 1 of a planned multi-part series of blog posts on network automation efforts at athenahealth.
Automate your network! Isn’t that all you hear these days if you are a network engineer? “Automate your network provisioning, like changing a VLAN on a port or adding a new rack/row/pod in your data center, so you can concentrate on higher-level tasks.” It’s a beautiful place to be. Wouldn’t we all like to be there?
This series of writeups is meant to be something of “A History of Network Automation at athenahealth”.
Firstly, this is not a how-to guide, a step-by-step instructional text, or even a claim to the right way to do automation. This “History” is only meant to share our experiences and, hopefully, some useful code.
Also, this “History” is still being written, so I anticipate changes and corrections to previously held beliefs and positions.
And lastly, I’m hoping for your feedback, commentary, and suggestions should you feel so inclined.
Part 1: Hardware, OS, Speed Rails
In this first post I will set the stage by describing the choices we’ve made before even starting to automate.
What Is Automation (To Me)?
Traditionally, we’d configure the box and use RANCID to grab a config backup. If you did it right, you’d also watch for “configured by” syslog messages on your syslog collector and trigger a RANCID run on demand, then store every run in a versioning system like CVS. Maybe throw a web interface on top to browse it. That was my approach for as long as I can remember.
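The syslog-triggered backup idea can be sketched in a few lines. This is a hedged illustration, not production tooling: the syslog line format, the `rancid-run` invocation, and the group name `routers` are assumptions you would adapt to your own collector and RANCID install.

```python
import re
import subprocess

# Hypothetical collector line for a Cisco-style config-change event, e.g.:
#   "core-sw1 %SYS-5-CONFIG_I: Configured from console by admin on vty0"
CONFIG_RE = re.compile(r"^(?P<host>\S+)\s.*Configured from .* by (?P<user>\S+)")

def changed_device(syslog_line):
    """Return the device hostname if the line indicates a config change, else None."""
    match = CONFIG_RE.match(syslog_line)
    return match.group("host") if match else None

def trigger_backup(host):
    # 'rancid-run -r <device> <group>' is illustrative; adjust to your install.
    subprocess.run(["rancid-run", "-r", host, "routers"], check=True)
```

Wired into a pipe from your collector, each matched line kicks off an on-demand backup of just the device that changed.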
All the hype today is around “infrastructure as code”. To me, that means a couple of things:
- The process described earlier needs to be reversed. You create a new configuration version (yes, if you have any sense you already do this in a text editor instead of directly on the router), commit it to the version repository, test it (I hope you test it!), and then kick off the provisioning task.
- You don’t actually create a configuration. You define values for variables, and then some code (a compiler of sorts) runs to generate the configuration. You are probably doing something similar today using templates, which could be a simple text file with search and replace, or an Excel sheet with macros. In the past I even tried to make a basic webpage that would let a user enter some values and return a copy/paste-ready configuration.
- Automation in general means to me that, given identical input parameters, an automated process will produce identical results every time, regardless of the number of executions.
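The “compiler of sorts” idea can be illustrated with Python’s standard-library `string.Template`; the template text and variable names here are made up for illustration, not taken from our actual templates. The point is determinism: identical inputs always yield an identical configuration.

```python
from string import Template

# A toy interface snippet; real templates would cover the full configuration.
PORT_TEMPLATE = Template(
    "interface $port\n"
    " description $desc\n"
    " switchport access vlan $vlan\n"
)

def render_port(port, desc, vlan):
    """Compile variable values into a configuration snippet."""
    return PORT_TEMPLATE.substitute(port=port, desc=desc, vlan=vlan)
```

Running `render_port` twice with the same inputs produces byte-identical output, which is exactly the property described in the last bullet above.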
This is a crude explanation of what we on the athenahealth network team saw as a milestone, not even the destination. In fact, we still have only a vague idea of where the destination is to this day.
What to automate?
Overall, we have a production network and an Out-of-Band (OOB) network. The OOB network is used to manage production without connecting to the infrastructure through the customer facing interfaces.
After some deliberation we decided to get started on automating the OOB network. The main reason was that if we made a mistake on the OOB network, it was unlikely to affect the customer. The worst case scenario is we lose management consoles, but the customer remains unaffected.
We haven’t yet run into an instance where we did mess it up, but one of the great advantages of automation is also one of its great disadvantages: just as quickly and reliably as you can propagate changes through your infrastructure, you can also propagate mistakes, amplifying the outage scope.
How Do You Start?
Configuration Management System (CMS)
Our colleagues in Systems have been using the Puppet configuration management system for a long time to automate the system deployment lifecycle. However, Puppet needs an agent installed on the managed node, which doesn’t fly well with most incumbent network vendors. Access to the OS is usually restricted and, at best, you get a trimmed-down bash shell. Some vendors are relaxing this model more and more, but overall this is still the state of network operating systems today.
We had long discussions about Ansible vs. Puppet (we didn’t consider Chef or Salt). We wanted to go with Ansible because it’s agentless, Python-based, and in wide use by the network automation community.
However, we ended up going with Puppet, mainly because we hope one day to integrate with our Systems teams and leverage their in-house expertise.
An additional benefit of Puppet, at least to me, is that Puppet is designed to be a declarative system, whereas Ansible appears to be more of an imperative one. Imperative focuses on how; declarative focuses on what.
I thought it was important to be able to enforce a consistent state rather than script how to get to that state. So, despite the fact that Puppet is Ruby-based, we went with it.
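To make the declarative/imperative distinction concrete, here is a minimal sketch of state reconciliation; it is not Puppet’s actual algorithm, and the VLAN-map state model is my own simplification. The caller declares *what* the end state is, and the code works out *how* to get there.

```python
def reconcile(desired, current):
    """Return the actions needed to move `current` state to `desired` state.

    Both arguments map port name -> access VLAN, e.g. {"swp1": 100}.
    The caller never scripts the steps; it only declares the end state.
    """
    actions = []
    for port, vlan in desired.items():
        if current.get(port) != vlan:
            actions.append(("set", port, vlan))  # create or correct the port
    for port in current:
        if port not in desired:
            actions.append(("clear", port))      # remove ports not declared
    return actions
```

A useful side effect of this style is idempotency: once the actions have been applied, running `reconcile` again returns an empty list, so repeated runs are harmless.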
But an attentive reader may ask a very valid question: how will you install Puppet agents on those legacy systems that don’t let you install anything? And this reader will be 100% correct. We will take a look at this a bit later.
Speed Rails (Or How We Chose The Hardware)
We realized that the feature set we use on our data center gear is very basic: switching, routing, and sometimes ACLs. This prompted us to look at “whitebox” switches and thus whitebox vendors. Definitions of a “whitebox” switch vary depending on who you ask.
What does it cost to rack a switch? If you can calculate the cost in dollars, that’s the best basis for a business case. To do so I would have needed to know how much we pay our Data Center Operations team members, which is not information that was available to me, so I resorted to the next best currency: time.
Let’s look at a brief description of what it takes to rack a switch:
1. Get the switch out of the box.
2. Find a screwdriver.
3. Attach side brackets to the switch with up to 8 tiny screws per side.
4. Find a bigger screwdriver.
5. Attach rack rails to the rack with 4 cage nuts and 4 screws per rail.
6. Slide in the switch and attach power.
The process goes on to provisioning, but we’ll explore this subject in the next blog post.
Assuming no screws or cage nuts got dropped or stripped, and you had the right screwdrivers on hand, it takes about 30 minutes per switch. That is the best-case scenario. Tack on the effort required to assist the network team in provisioning the switch and it can quickly balloon to 45 minutes to an hour. Wouldn’t it be great if we could eliminate steps 2 through 5?
Enter Dell. Dell’s S3048-ON is a 48 x 1G + 4 x 10G whitebox switch. Its advantage is that it comes with speed rails: the rail kit snaps onto the switch and into the rack. No tools needed, no screws lost, and it takes about 5 minutes to install. That’s an over 80% reduction in rack-and-stack time.
We were not able to locate any other whitebox switch with speed rails. The small premium Dell charges for its switch is justified by the time saved in rack-and-stack efforts.
Other efficiencies came from color-coded power supplies and fan trays. If the power supply is red, it faces the hot aisle. Need to reverse the airflow? Grab the blue power supply and fan trays.
The MAC address of the management interface, which is what we’ll use for provisioning, is also encoded in a QR code printed on the box. This allows it to be scanned rather than typed in, further reducing human error.
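As a small illustration of how a scanned MAC might be sanity-checked before it lands in a provisioning database, here is a hedged sketch; the function name and accepted formats are my own assumptions, not part of any vendor tooling.

```python
import re

def normalize_mac(scanned):
    """Normalize a scanned MAC address to lowercase colon-separated form.

    Accepts common separator styles (colons, dashes, dots, or none) and
    raises ValueError if the scan doesn't contain exactly 12 hex digits,
    so a bad read fails loudly instead of creating a wrong entry.
    """
    digits = re.sub(r"[^0-9a-fA-F]", "", scanned).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {scanned!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))
```

Normalizing at scan time means the provisioning system only ever stores one canonical form, no matter how the label was formatted.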
Cross-referencing the NOS (Network Operating System) options available on the Internet with the HCLs (Hardware Compatibility Lists) of the most accessible whitebox switches, Dell switches also support the largest number of NOSes.
And lastly, we already have a great working relationship with Dell. Dell switches listed Cumulus Linux, Big Switch Networks Switch Light OS, Pica8, Dell OS10, and Pluribus as supported NOS vendors. And Dell has speed rails.
And so Dell it was.
The next piece of the puzzle was to make sure that the NOS we chose would work with Puppet. In reality, the CMS and NOS choices were deeply intertwined; one didn’t simply follow the other. Rather, the choice of each deeply affected the options for the other.
- Cumulus Linux, as the name suggests, is just a Linux distro with a few tweaks to “make networking great again.” athenahealth is very familiar with operating Linux, and Cumulus Linux is from the Debian family, which is, I’d say, very mainstream.
- Cumulus Linux gives you unrestricted shell access to your system, and you can install any .deb package (the usual architecture constraints still apply). You could even compile software from source if needed.
- Cumulus Linux has an extensive HCL so if we decide to part ways with Dell, we don’t have to replace the NOS and thus can keep most of the automation.
- Cumulus Linux is pretty much a Debian distro, and Debian has a very large user base, so if you need to Google how to do this or that on Linux, you can just search for how to do it on Debian or Ubuntu. Try that with some of the other NOS vendors.
- Again, because Cumulus Linux is just Linux, I can install modern analytics applications. Maybe I want to install Ganglia on every node.
- Cumulus Networks open sources most of what they write. Whether by choice or by necessity is irrelevant; short of the Broadcom chipset code, which they can’t open source for licensing reasons, they open source everything. This means that even if Cumulus somehow goes under, the contributions will not vanish and will simply become part of another distro like Dell OS10, or maybe even Debian itself.
- Cumulus Linux is a Debian flavor, so I can install a Puppet agent natively without any limitations.
- Another important reason, and I cannot overstate its importance, is Cumulus VX. Cumulus VX is basically a VM you can run in VirtualBox or on ESX; the point is that you can test nearly everything in a virtual environment. Just spin up a bunch of Cumulus VX VMs, link them together to mimic your real network, and test your changes against a virtual model of your environment! Magic! Not many vendors offer this today, which makes it super hard to test changes before they roll to prod.
- And, of course, the rocket turtle is adorable! Who can resist that!?
Code Repository System
Choosing the code management system was probably the simplest choice of all: Git.
Git is the most modern version control system out there today for various reasons (which are beyond the scope of this blog post), but between the architecture of Git itself (distributed version control), products like GitHub and GitLab, and the abundance of training materials on Git, the choice was easy. Git!
Putting It All Together
Now we have to glue together Dell switches, Puppet, Cumulus, and Git. This is probably why you are reading this post in the first place.
Getting started was somewhat hard for us. So many questions: What is the first step to take? How far do we need to plan before we get hit by “analysis paralysis”? What do we start with? And so on…
Luckily for us, the answer presented itself somewhat accidentally. We had to expand our data center into a suite next door to an existing one and needed an out-of-band network there. This ended up being a perfect place to start while introducing low risk if a major failure happened.
As of this writing, our OOB network is (mostly) automated. In the next blog post I’ll explore our first automation efforts on the Out-of-Band network. Spoiler alert: it is ZTP (Zero Touch Provisioning).