Part 2: Out-of-Band Network
This is part 2 of a planned multi-part blog post on network automation efforts at athenahealth inc.
In Part 1 of this series I explained our hardware and software selections. In this post I will look at our very first network automation efforts. Before we automate anything else, we have to bootstrap the device by installing the OS and the OS license, and we need a way to prime it for automation. Part 3 of the series will talk about installing the Puppet agent in an automated fashion and having it check in with the Puppet master.
As we discussed in Part 1 of the series, Cumulus Linux runs on “whitebox” switches. A “whitebox” switch is really just a switch that ships without a network operating system (NOS). Whitebox switches also usually sport “merchant silicon”, typically from ASIC makers such as Broadcom or Mellanox.
The branding of the switch is not important. In our case it’s Dell. By the way, a whitebox switch from a major vendor is sometimes called a “britebox” switch, but it is not much more than a whitebox switch with a well-known brand logo glued to the front.
The main concept is that any whitebox switch will usually have an off-the-shelf ASIC (a.k.a. merchant silicon) and will run the ONIE bootloader.
ONIE stands for Open Network Install Environment. It is an OCP (the Open Compute Project) initiative that was contributed by Cumulus Networks and is really like a PXE boot with some extra bells and whistles.
ONIE is what will bootstrap your NOS, be that Cumulus Linux, Big Switch’s Switch Light OS, Pica8’s PicOS, Dell’s OS10, etc. If ONIE does not find an installed OS on the onboard storage, it will go through a discovery phase, looking for an installer in several locations (a USB flash drive first, then the network).
At each of these locations it will cycle through a list of predefined file names, all starting with onie-installer and followed by hardware-specific suffixes. The last file name searched will be onie-installer without any suffixes or file extensions.
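The “most specific name first, generic name last” search can be sketched in a few lines of shell. This only illustrates the idea and is not ONIE’s actual code: the function name find_installer and the example suffixes are made up, and the real list of names depends on your hardware and ONIE version.

```sh
#!/bin/sh
# Illustrative sketch only: mimic a "most specific name first" search.
# The suffixes below are made-up examples, not ONIE's real list.
find_installer() {
    dir=$1
    for name in \
        "onie-installer-x86_64-examplevendor_examplemodel" \
        "onie-installer-x86_64" \
        "onie-installer"
    do
        if [ -f "$dir/$name" ]; then
            # First match wins; print its path and stop searching
            echo "$dir/$name"
            return 0
        fi
    done
    return 1   # nothing found at this location
}

# Usage: find_installer /path/to/mounted/usb
```

Because the generic onie-installer name comes last, a single file with that name is enough when every switch takes the same image.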
The onie-installer is just that: an installer capable of installing an OS onto the switch’s local partition(s). In this respect it’s no different from installing Windows or macOS from a flash drive onto your computer. Assuming one of the sources does hold an onie-installer image, ONIE will execute that installer, and at this point it becomes the job of the OS-specific installer to finish the installation of the NOS.
In case you are wondering who provides the installer: it comes from your NOS vendor of choice, not from the ONIE project. ONIE is a BIOS-level standard that gives any NOS a way to talk to the BIOS of the switch regardless of the vendor or the hardware, similar to the x86 conventions for reading a CD and providing a boot sequence.
Now that ONIE has triggered the NOS install and the installer has completed the job, it’s time to configure the OS. This is where your NOS vendor’s specific implementation of ZTP kicks in.
What is ZTP? ZTP stands for Zero Touch Provisioning.
For those of you who have not come across this concept yet, this is nothing scary or complicated. It is really just a glorified PXE boot-like process. Many vendors offer this feature now. For example, Cisco offers POAP (Power-On Auto Provisioning). In fact, if you have worked on controller-based WLAN networks, you have dealt with ZTP frequently to bootstrap your access points.
The implementations vary and, as far as I know, are not standardized. Since we are working with Cumulus Linux and Dell switches, I’ll talk about how it works in the case of Cumulus Linux.
Configuring The NOS
The Cumulus Linux ZTP implementation mostly resembles the ONIE process:
- Get a DHCP address on the management interface and poke the HTTP server listed in the DHCP option for a script named cumulus-ztp. Just like ONIE, the search cascades from names with hardware-specific suffixes down to a generic cumulus-ztp file name without suffixes or extensions.
- If the management interface is not available (no DHCP address acquired, no link, etc.), the Cumulus ZTP process will look, just like ONIE does, on the USB flash drive, cycling through the same file names. By contrast, ONIE will try USB first, then the network.
- If nothing is found, the login prompt will be given on the console and SSH will listen on the management interface.
An especially eagle-eyed reader may have noticed that at no point did I mention the front panel ports. If any of you recall the old Cisco TFTP provisioning feature that was available even back on the old Catalyst product lines, the switches came preconfigured with interface vlan1. The DHCP client on interface vlan1 would request a DHCP address along with a TFTP option describing the location of the configuration file. All ports, including the uplinks, were in VLAN 1 by default, and therefore any front panel port could be your provisioning port.
This is not the case with Broadcom-based switches, which is what our Dell S3048-ON switch is built on. Here, the only live port you get is the management port. This fact presents a challenge: there is no management network for the OOB network, and therefore the management port is not used when we deploy an OOB switch. We must be able to establish connectivity on the uplinks without connecting to the switch on the console or the management port.
The front panel ports are off until a Cumulus Networks license is installed and switchd is restarted for the license to take effect. It is a license key you get from Cumulus Networks, but really, it’s just a license to use the Broadcom chipset. You just buy the Broadcom license from your NOS vendor instead of from Broadcom directly.
Why am I telling you all of this? Because with this scenario we really have only one option to provision the switch: the USB port!
Yes, we can temporarily connect the management port, but that would not be very “cookie cutter” of us, would it? We’d still have to snake a cable to some other already provisioned switch that can provide access to the right DHCP server and so on. It would be a mess if you needed to spin up batches of switches at a time. With those arguments in mind, we decided to proceed with the USB flash drive option.
The technical requirements for a supported USB drive are incredibly simple: a FAT32 file system big enough to hold onie-installer and a cumulus-ztp script. The drive doesn’t even need to be clean. My test USB drive actually had presentation slides from some conference I went to, and they didn’t interfere in any way.
USB drives are cheap. You can buy them in bulk from your distributor, Amazon, or elsewhere and keep a bin of them big enough to spin up even a medium-size data center. (I doubt Facebook and Google use flash drives in this case, so I’d love to know how they do it.)
Your DCOps (Data Center Operations) team can easily update the drives, if needed, by simply dragging and dropping a new onie-installer and a new cumulus-ztp script onto them.
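For anyone who prefers a script to drag and drop, refreshing a drive could look roughly like the sketch below. It assumes the drive is already mounted and that a source directory holds the new files. The CumulusConfigs layout mirrors the paths the ZTP script in this post reads, while the function name stage_usb and both directory arguments are placeholders.

```sh
#!/bin/sh
# Sketch: copy a fresh installer, ZTP script and config tree onto a mounted drive.
# stage_usb SRC_DIR USB_MOUNT  (both paths are placeholders you supply)
stage_usb() {
    src=$1
    usb=$2
    # Refuse to continue if either directory is missing
    [ -d "$src" ] && [ -d "$usb" ] || return 1
    cp "$src/onie-installer" "$usb/onie-installer" || return 1
    cp "$src/cumulus-ztp" "$usb/cumulus-ztp" || return 1
    # The license file and per-switch configs live under CumulusConfigs/
    mkdir -p "$usb/CumulusConfigs" || return 1
    cp -R "$src/CumulusConfigs/." "$usb/CumulusConfigs/" || return 1
    sync   # flush writes before the drive is pulled
}

# Usage: stage_usb /path/to/new/files /path/to/mounted/usb
```

Files already on the drive that are not part of the copy, like those conference slides, are left alone.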
In reality, you shouldn’t need to update them often. Instead, you could keep provisioning your switches using an “old” installer and scripts. After the switch is checked into your automation system you would use inline automation to bring it up to snuff.
An annual review of your USB drive bin might be prudent, though, so the drives don’t fall too far behind your current needs.
Automate Something Easy
Automation is hard work, but basically it’s writing down how to do certain things using a funky language. If you are like me, you have rarely documented your work. In my case the result was that I ended up being “the guy who knows how to fix it”, which was great for my ego, but also I was “the guy who knows how to fix it”, which was terrible for my work/life balance. : )
This was not out of laziness, but because of understaffed teams, tight timelines and because there were too many tangents to document! Automation is just that: documenting every tangent possible even if the result of “the document” (your code) says “escalate to a human.” With that in mind, pick something easy to automate. Something that doesn’t have a complicated and long decision tree.
The first thing I had to automate was the license installation and starting a DHCP agent on a front panel port.
Remember, we are spinning up a bunch of ToR (top of rack) OOB switches, so at the time of provisioning all I could expect is a couple of uplinks, which are configured at the aggregation switch as trunks. My ZTP script had to do four basic things:
- Install the Cumulus Linux license and restart the switchd service (which is what needs the license)
- Configure uplink ports as trunks
- Create an SVI (Switch Virtual Interface; you may know it as a VLAN interface)
- Start the DHCP agent to get an IP on the new SVI
At the end of these 4 steps I should be able to SSH into the switch using front panel ports without ever needing to bust out a console cable or prestage the switch in the lab.
Some folks out there will argue that you should automate your most frequent activities as they will provide you with the most value. I agree, but this only holds true if you know how to write code, how to automate, you have some backend infrastructure to support your automation efforts, etc.
If you are just starting to code and automate (which was pretty much the situation in my case) you should automate something simple rather than something frequently repeated. Small daily/weekly wins are far more satisfactory than a single large one once a year.
You will have to Google a lot. You will have to try a lot and fail a lot. This will not be glorious. You won’t be able to brag about it, because it will likely be of low value, but you will learn a ton!
In my case, automating these first 4 activities was also valuable. Not for me, not for my team, but for our DC Ops team. It was they who had to bust out the console cable, stage the switch, console in and put a basic configuration on the switch. Between racking the switch using traditional switch racking hardware (see Part 1 of this series) and staging a basic config on it over a console cable, each switch consumed about 1 man-hour. At this rate our OOB provisioning capacity was effectively 8 switches per person per day, with nothing else done during that day. Pretty inefficient, wouldn’t you say?
Enter Dell switches on speed rails with USB flash drives, and our DC Ops tech’s time to install a single OOB switch is now 15 minutes from “cut open the box” to “walk away and forget”. That is a 75% reduction in time per switch (from about 60 minutes down to 15). I’m pretty sure our DC Ops are still drunk after celebrating that change of process!
Our First Code
First, a word of advice: Comment, comment, comment your code! Learn the comment syntax for your programming language of choice, which in the case of a Bash script is anything starting with a hash sign (#). Document your thoughts as you code. Describe what each line or block of code should do in as much detail as possible. This will come in very handy when you have to come back to it in a few months to fix a bug or improve a feature.
Second, document the process in human language before you code. At athenahealth we refer to this rule as “process before tools”. If you are automating a manual process, document each step of the said process in as much detail as you can. If you know how, make a flowchart describing decision logic and actions to be taken. It will be of great value as you continue working on even small scripts.
First code in my automation journey:
#!/bin/sh
#Install a License from usb and restart switchd
/usr/cumulus/bin/cl-license -i $ZTP_USB_MOUNTPOINT/CumulusConfigs/cl-license.txt
#Restart switchd to load the license
if service switchd status &>/dev/null; then service switchd restart; fi
if ! service switchd status &>/dev/null; then service switchd start; fi
#Load interface config from usb
cp $ZTP_USB_MOUNTPOINT/CumulusConfigs/dell_s3000_c2338/interfaces /etc/network/interfaces
cp $ZTP_USB_MOUNTPOINT/CumulusConfigs/dell_s3000_c2338/interfaces.d/*.intf /etc/network/interfaces.d/
#Reload interfaces to apply loaded config
ifreload -a
Let’s take a look at what’s going on here.
This is a very trivial bash script.
- The first uncommented line installs the Cumulus Linux license.
- The next two lines (re)start switchd, depending on whether it’s running or not.
- The next two lines just copy static configuration files into place.
- And finally, the last line applies the interface configs.
If you are just starting your scripting/coding life, like I was some time ago, and maybe don’t have any Linux experience either, then even this simple code will be challenging. While it is not directly related to networking, I want to call out a few points in this code.
First, this is terrible code, but I’m a beginner, so I don’t care. Does it work? It does! Can it be improved? Of course!
Second, this terrible code is written as a Bash script. If you don’t know Bash scripting, you might want to consider learning and coding in Python instead. I had some prior knowledge of Bash and Bash scripting, and I wanted to get something working before I got demotivated by the syntax of a new language. If you want to learn Bash scripting, I highly recommend the lynda.com training course Up and Running with Bash Scripting. It’s not free, but it’s great!
Alternatively, take a look at Kirk Byers’ Python for Network Engineers course. I think he has both paid and free versions.
Now, let’s discuss each line of the script in detail.
#!/bin/sh
This is a special line which tells the system which program to use to run the script. /bin/sh is usually a symlink to the system’s default shell, which does not have to be the same as the login shell in your user profile. In the case of stock Cumulus Linux, /bin/sh is a symlink to /bin/dash. /bin/dash is not Bash, but it’s close enough.
#Install a License from usb and restart switchd
Any line starting with a # symbol (except the #!, lovingly called a shebang) is a comment.
/usr/cumulus/bin/cl-license -i $ZTP_USB_MOUNTPOINT/CumulusConfigs/cl-license.txt
cl-license is a Cumulus Linux specific utility made for the sole purpose of installing and removing the license. $ZTP_USB_MOUNTPOINT is an environment variable that holds the mount point for the USB flash drive. Using this variable instead of explicitly specifying a mount point like /mnt/usb protects you from later releases changing the mount point location and makes your script more portable. /CumulusConfigs/cl-license.txt is the location on the USB drive of the text file with the license code.
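One small hardening step, offered as a sketch rather than something the script above does: quote the variable and stop early if the mount point is empty or the license file is missing. The helper name require_license_file is hypothetical.

```sh
#!/bin/sh
# Sketch: validate the license path before handing it to cl-license.
require_license_file() {
    mount_point=$1
    if [ -z "$mount_point" ]; then
        echo "mount point is empty" >&2
        return 1
    fi
    license="$mount_point/CumulusConfigs/cl-license.txt"
    if [ ! -r "$license" ]; then
        echo "license file not found: $license" >&2
        return 1
    fi
    # Print the verified path so the caller can pass it on
    echo "$license"
}

# Usage inside a ZTP script (cl-license exists only on Cumulus Linux):
# license=$(require_license_file "$ZTP_USB_MOUNTPOINT") && \
#     /usr/cumulus/bin/cl-license -i "$license"
```

Quoting "$ZTP_USB_MOUNTPOINT" also keeps the path intact should the mount point ever contain a space.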
if service switchd status &>/dev/null; then service switchd restart; fi
if ! service switchd status &>/dev/null; then service switchd start; fi
The switchd service may or may not be running, so I wrote two separate lines here. The first line will only execute if the service is running and thus will try to restart it. The second line will execute if the service is stopped and will try to start it (which is not the same as restart).
If you are new to Bash scripting, simply reading these two lines doesn’t quite explain how that logic works. I’m making use of the command’s exit code here, which can be viewed with echo $? and will be 0 (zero) if the service is running. A 0 (zero) exit code evaluates as a true condition (I know, odd). Any other exit code evaluates as false. Adding a ! (exclamation point) in front of service switchd status inverts the boolean result of the evaluated expression. Therefore, if the switchd service is not running, service switchd status returns a non-zero exit code, which evaluates to false; we get ! false, which evaluates to true, and the command after then gets executed, starting the switchd service.
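If you want to see the exit code logic without touching a switch, the standard true and false commands make a safe playground. The helper name exit_code_of is made up for this example.

```sh
#!/bin/sh
# Helper (hypothetical name): run a command quietly and print its exit code.
exit_code_of() {
    "$@" >/dev/null 2>&1
    echo "$?"
}

echo "true  exits with $(exit_code_of true)"    # true always exits with 0
echo "false exits with $(exit_code_of false)"   # false always exits with 1

# An exit code of 0 counts as "true" for if, so this branch runs.
if true; then echo "0 -> then-branch runs"; fi

# ! inverts the result, so a failing command now satisfies the test.
if ! false; then echo "non-zero, inverted by ! -> then-branch runs"; fi
```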
This can be improved in many ways, but it does the job and so we move on.
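One possible improvement, offered as a sketch rather than a tested replacement: &> is a Bash shortcut. As far as I can tell, a strictly POSIX shell such as dash reads command &>/dev/null as “run command in the background”, followed by a separate, empty redirection, so the test no longer reflects the state of the service. Writing the redirection as >/dev/null 2>&1 works in both shells, and an else branch folds the two lines into one. The function name restart_or_start is made up for this example.

```sh
#!/bin/sh
# Sketch: restart the service if it is running, otherwise start it.
# Uses the POSIX redirection form so it behaves the same under dash and Bash.
restart_or_start() {
    svc=$1
    if service "$svc" status >/dev/null 2>&1; then
        service "$svc" restart
    else
        service "$svc" start
    fi
}

# Usage on the switch:
# restart_or_start switchd
```

With the else branch the status check runs once, so the service cannot change state between two separate tests.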
cp $ZTP_USB_MOUNTPOINT/CumulusConfigs/dell_s3000_c2338/interfaces /etc/network/interfaces
cp $ZTP_USB_MOUNTPOINT/CumulusConfigs/dell_s3000_c2338/interfaces.d/*.intf /etc/network/interfaces.d/
These two copy commands do a very simple thing. They copy a set of flat text configuration files to the appropriate locations on the switch storage device. Debian (which is what Cumulus Linux is based on) keeps network interface configurations in the /etc/network/interfaces text file, so we copy one from our USB flash drive into that location.
The syntax of the interfaces file allows us to include other files located on the local file system, and convention dictates that we place them in the /etc/network/interfaces.d/ directory, so we instruct the system to copy those as well. You can review the content of these configuration files in my GitHub repo.
ifreload -a will instruct the switch to re-read the configuration files we just copied over and apply the configuration.
That’s it! A simple Bash script installs the license to give us functioning uplinks, and the configuration files contain the instructions to put the uplinks into trunk mode, create the SVI and look for a DHCP-assigned address on that SVI.
Assuming there is a functioning DHCP server in VLAN 555 (my configuration creates an SVI in VLAN 555), the switch will get an IP and you can SSH into it to finish the configuration. All this without ever needing a console cable or prestaging the switch outside its permanent resting place at the top of the rack.
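To give a feel for what such an interfaces file might contain, here is a rough sketch that writes an example to a scratch path. It is not a copy of the files in the repo: the port names swp49 and swp50, the VLAN-aware bridge layout and the function name write_example_interfaces are all assumptions for illustration, and the right syntax depends on your Cumulus Linux release.

```sh
#!/bin/sh
# Sketch only: write an example interfaces file to a scratch path for review.
# swp49/swp50 and the overall layout are assumptions, not the real config.
write_example_interfaces() {
    cat > "$1" <<'EOF'
auto lo
iface lo inet loopback

# Uplinks (hypothetical port names)
auto swp49
iface swp49

auto swp50
iface swp50

# VLAN-aware bridge carrying VLAN 555 on both uplinks
auto bridge
iface bridge
    bridge-vlan-aware yes
    bridge-ports swp49 swp50
    bridge-vids 555

# SVI in VLAN 555 that requests an address over DHCP
auto bridge.555
iface bridge.555 inet dhcp
EOF
}

# Usage: write_example_interfaces /tmp/interfaces.example
```

Writing to a scratch path first lets you read the result before anything lands in /etc/network/interfaces.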
Now, go out there and write some terrible code while I’m preparing the next part of this blog. Spoiler alert: the next part explores Puppet.