Please read There and Back Again – A Journey Into Network Automation – Introduction for the context of this post.
Deploying an SDN solution, in this case Nuage VSP, has a very definite impact on the physical network. It provides an opportunity to simplify.
Nuage VSP, like most ‘commercial’ SDN solutions, provides an Overlay network: a dynamic tunneling fabric, usually delivered through VXLAN, from hypervisor to hypervisor. This means that business-as-usual (BAU) activity, the change and churn of network services, happens for the most part in the Overlay. The physical network, or Underlay, becomes a transport for those VXLAN packets.
I had chosen a CLOS architecture for the network and, given the uniformity of my Leaf node configurations, using a Zero Touch Provisioning Server made more sense than it had in previous engagements.
We had selected Arista Networks to provide our CLOS architecture for two reasons: Arista’s partnership with Nuage Networks, which offered us advantages in bare metal integration, and their open API (more about that in Parts 2 and 3).
Arista offer a Zero Touch Provisioning Server, actually called exactly that, which allows for dynamic provisioning of configurations and automatic upgrade of the EOS code to a selected version. All Arista switches run exactly the same code (one image covers everything), which made using ZTPS even easier.
So, task one, teach myself ZTPS.
When an Arista switch is installed out of the box, it is already set to ZTP mode. This means that it is sending DHCP requests on all ports. If it receives a DHCP response that gives it the location of a ZTP Server, the first thing it does is send ZTPS some information that will allow it to be identified, including system MAC address, serial number and LLDP neighbor details.
In my case I did not want to tie the identity of a switch to its hardware so I chose to use only LLDP as the identifier for each switch in my network. This has two benefits:
- If a switch becomes faulty and needs to be replaced, the device can be swapped and the last saved configuration will be pulled from ZTPS without the need for a network engineer to do anything.
- By using LLDP neighbor information we have an automatic topology checker. In other words, if I say Switch 01 can be identified because it connects on Ethernet1, Ethernet2 and Ethernet3 to Ethernet1/1 on each of three Spine switches, then Switch 01 will only get its configuration when all uplinks are connected from and to the right place.
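The topology check in the second point can be sketched in a few lines of Python. This is only an illustration of the idea, not the matching code ZTPS itself runs; the function name, the dictionary layout and the Spine hostnames are all assumptions made for the example.

```python
# Hypothetical sketch of LLDP-based identification; not ZTPS's real matcher.

def matches_pattern(pattern, lldp_neighbors):
    """Return True only if every uplink in the pattern sees the expected neighbor."""
    for local_port, expected in pattern.items():
        # A missing or wrongly patched link means the switch is not identified.
        if lldp_neighbors.get(local_port) != expected:
            return False
    return True


# Pattern for Switch 01: local interface -> (remote device, remote port).
# The Spine names are invented for illustration.
SWITCH01_PATTERN = {
    "Ethernet1": ("Spine01", "Ethernet1/1"),
    "Ethernet2": ("Spine02", "Ethernet1/1"),
    "Ethernet3": ("Spine03", "Ethernet1/1"),
}

# What the switch might report when it is patched correctly...
correctly_cabled = dict(SWITCH01_PATTERN)
# ...and when the third uplink has been patched to the wrong Spine.
miscabled = dict(correctly_cabled, Ethernet3=("Spine02", "Ethernet1/1"))

print(matches_pattern(SWITCH01_PATTERN, correctly_cabled))
print(matches_pattern(SWITCH01_PATTERN, miscabled))
```

Because every uplink has to match, a switch with a missing or misplaced cable never receives a configuration, which is what turns the identifier into a cabling check.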
I found getting up and running with ZTPS to be relatively straightforward. You can download almost fully functioning virtual Arista devices using vEOS images to test against. I use VMware Workstation but VirtualBox will do just as well.
Arista offer a pre-packaged installation of a small CLOS network as well as a pre-packaged ZTPS image.
As I’ve already mentioned, the introduction of the Nuage VSP overlay allows for simplification of leaf configurations. This is true so long as you separate out services and control plane elements from application compute at a rack or pod level.
In this case, we have Compute Racks that contain Compute Leaf switches and Infrastructure Racks that contain Border Leaf switches.
The Compute Racks contain hypervisors and are the functional blocks that will scale out. Each Compute Leaf switch requires only three VLANs: a VXLAN transport VLAN, a storage VLAN for iSCSI and an Internal API VLAN for OpenStack. We then use MLAG to provide resilience to the NIC bonded hosts across a pair of Leaf switches, and BGP as our routing protocol. That is pretty much it.
As we run eBGP between Leaf and Spine, with each Leaf pair having its own AS, each rack has a different set of IP networks, but despite this I could still template a lot of the configuration. This meant that each switch required a template file, plus a small file containing its set of unique attributes.
I could test out this approach entirely in VMware Workstation with vEOS and found that all I needed to do was first define my LLDP patterns in a ZTPS neighbordb file:
variables:
patterns:
  - name: LeafSW01
    definition: LeafSW01
    interfaces:
      - Ethernet1: SpineSW01:Ethernet1
      - Ethernet2: SpineSW02:Ethernet1
      - Ethernet3: SpineSW03:Ethernet1
  - name: LeafSW02
    definition: LeafSW02
    interfaces:
      - Ethernet1: SpineSW01:Ethernet2
      - Ethernet2: SpineSW02:Ethernet2
      - Ethernet3: SpineSW03:Ethernet2
Then define what actions my switches would take to upgrade their firmware and load their configuration in definition files, and then make sure the right EOS image and configuration files were in the right locations:
name: LeafSW01
actions:
  - action: install_image
    always_execute: true
    attributes:
      url: files/images/EOS-4.15.4F.swi
      version: 4.15.4F
    name: "validate image"
    onstart: "Starting to install image"
    onsuccess: "SUCCESS: 4.15.4F installed"
    onfailure: "FAIL: Please contact Network Support"
  - action: replace_config
    always_execute: true
    attributes:
      url: files/templates/clsw/startup-config
    name: "configure base attributes"
  - action: add_config
    attributes:
      url: files/templates/LeafSW01/uniqueattributes
    name: "configure unique attributes"
In testing you can run ZTPS in standalone mode and use a debug switch to see exactly what is going on. That is easy and very useful.
In production, it is recommended that you set ZTPS up as a Web server by using Apache and WSGI. This is not so straightforward and would require its own separate post.
ZTPS is very flexible and can be customized. For example, you can use resource pools to make things more dynamic and create custom plugins. The Arista EOS+ team are encouraging people to combine Ansible with ZTPS. Ansible is very cool and I will cover what I have done with it in Part 2.
Once I had completed my testing using vEOS, I set about the task of creating my production configuration.
We were deploying into two data centers, where each DC would initially contain eight racks. I was designing to scale up to twenty Compute racks per DC. We were also delivering an out-of-band management network in each DC, so this meant two management access switches to be built in each rack.
This equates to a fair number of unique attributes files (38 switches to be provisioned per DC) and of course a risk of mistakes in those files. To help mitigate this risk, I decided to see if I could automate the build of the configurations. I settled on a Word document where I placed a template and linked the variables in the document to an Excel spreadsheet. This meant that all of the variables were displayed neatly in columns in my spreadsheet and it was fairly trivial to do a visual check of each switch’s configuration.
It was also then a relatively quick task to create the files I needed. This is still a little crude in terms of configuration automation but was enough for what I was trying to do.
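For anyone who would rather script this step than use Word and Excel, the same idea can be sketched in Python: one row of variables per switch, merged into a single template. This is not what I used at the time. The column names, the template text and the sample addresses below are all invented for illustration.

```python
import csv
import io
from string import Template

# Hypothetical unique-attributes template; the real files held more than this.
TEMPLATE = Template(
    "hostname $hostname\n"
    "interface Ethernet1\n"
    "   ip address $uplink1_ip\n"
    "router bgp $asn\n"
)

# Stand-in for the spreadsheet, exported as CSV (one row per switch).
SHEET = (
    "hostname,asn,uplink1_ip\n"
    "LeafSW01,65001,10.1.1.1/31\n"
    "LeafSW02,65001,10.1.1.3/31\n"
)


def render_all(csv_text, template=TEMPLATE):
    """Return {hostname: rendered unique-attributes text} for every row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # substitute() raises KeyError if the sheet lacks a column the template needs.
    return {row["hostname"]: template.substitute(row) for row in rows}


configs = render_all(SHEET)
print(configs["LeafSW02"])
```

Keeping the variables in columns preserves the easy visual check, and writing each rendered string out to the matching file on the ZTP server would be the only step left.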
There must be a minimum amount of network infrastructure created to allow ZTPS to communicate with the switches it will configure. In my case, this meant building my Spine and the Management Core Switches and a pair of Management Access Switches. ZTPS was connected onto the Management network to secure it and all switches would be provisioned via their out-of-band management interfaces.
The time had arrived to install the networks into the first DC. The switches were all powered on and two data center engineers were patching them into the structured cabling. I must admit to feeling a thrill as I watched the management network build without further intervention and while I got on with something else.
Once the management network was built, the CLOS network was next.
ZTPS Lessons Learned
So I guess the first question to answer is: how did this work out in terms of the effort to stand up ZTPS, versus the length of time it would have taken just to write my configurations and put them on manually, plus upgrade EOS?
Given that I had to learn how to install and set up ZTPS, then actually install and set it up, I spent about five days in total getting ready for the first DC deployment. This included writing a patching schedule for the DC guys. It would have taken me at least three days to do things manually.
It is important to note at this point the other benefits of ZTPS:
- Once the networks had been built, I could be 100% sure the cabling was all in order and therefore I was expecting testing to throw up fewer issues.
- I could also be 100% sure I was on the same code version across both networks.
- I had configured ZTPS to provision the next sixteen racks ready for scale out.
- As part of the provisioning, each switch was configured to push its configurations back to ZTPS whenever a save was issued.
- ZTPS is configured to restore that latest configuration should a switch be returned to ZTP mode or if it is replaced.
For me, those benefits outweighed the extra time it took me to deploy. But then we came to do the second DC.
DC2 is a mirror image of DC1 so I put the configurations from DC1 onto the ZTP Server for DC2 and then did a relatively simple search and replace to change the hostnames and the second octet of all IP addresses. It took me about two hours to do this and I was ready to go with DC2. Now we are cooking with gas!
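A search and replace like this can also be scripted. The sketch below is an assumption about how it might look, not what I actually ran: the `DC1-`/`DC2-` hostname prefixes and the octet values are hypothetical, and the regular expression only looks for dotted quads, so three-part strings such as the EOS version 4.15.4F are left untouched.

```python
import re

# Matches dotted-quad IPv4 addresses such as 10.1.0.1 (also inside 10.1.0.1/31).
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def retarget_config(text, old_octet, new_octet, old_prefix, new_prefix):
    """Swap the second octet of matching IPs, then rename hostnames by prefix."""

    def swap(match):
        octets = list(match.groups())
        # Only rewrite addresses whose second octet is exactly the old value.
        if octets[1] == old_octet:
            octets[1] = new_octet
        return ".".join(octets)

    text = IPV4.sub(swap, text)
    return text.replace(old_prefix, new_prefix)


dc1 = (
    "hostname DC1-LeafSW01\n"
    "interface Ethernet1\n"
    "   ip address 10.1.0.1/31\n"
)
dc2 = retarget_config(dc1, "1", "2", "DC1-", "DC2-")
print(dc2)
```

Comparing octets as whole strings avoids the classic search-and-replace accident of turning 10.11.0.1 into 10.21.0.1.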
My client is now looking at a third DC. If that goes ahead, we are looking at the time it will take to install ZTPS again, half a day, and then two hours to create the configurations and prep ZTPS ready to go.
This type of payback is common to automation in general: you spend a certain amount of effort initially to create workflows, and then over time the reward far outweighs the initial cost.
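To put rough numbers on that payback, here is a small worked comparison using the figures above. It rests on two assumptions that are mine for the sake of the sum: an eight-hour working day, and the manual approach costing the same three days for every DC (my estimate was "at least" three days for the first).

```python
from itertools import accumulate

HOURS_PER_DAY = 8  # assumption: one working day is eight hours

# Effort in days per data center, taken from the figures in this post.
ztps_days = [
    5.0,                      # DC1: learn, install and set up ZTPS
    2 / HOURS_PER_DAY,        # DC2: two hours of search and replace
    0.5 + 2 / HOURS_PER_DAY,  # DC3: half a day to install ZTPS, two hours of prep
]
manual_days = [3.0, 3.0, 3.0]  # assumption: three days per DC, every time

cumulative_ztps = list(accumulate(ztps_days))
cumulative_manual = list(accumulate(manual_days))

# First DC (counting from 1) at which ZTPS has cost no more than manual work.
break_even_dc = next(
    index + 1
    for index, (ztps, manual) in enumerate(zip(cumulative_ztps, cumulative_manual))
    if ztps <= manual
)

print(cumulative_ztps)    # [5.0, 5.25, 6.0]
print(cumulative_manual)  # [3.0, 6.0, 9.0]
print(break_even_dc)      # 2
```

On those assumptions the extra two days spent on DC1 are recovered by the time DC2 is done, before counting any of the other benefits listed above.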
A Maturing Deployment
Since those initial deployments I have matured my ZTPS deployment in one key way.
In order to pre-provision for the next sixteen racks, I initially configured BGP on the Spine switches ready to go for the next thirty-two neighbors. I was not happy with thirty-two peer establishment attempts sitting in an active but failed state. I decided to look at using a custom plugin on the ZTP server that would be triggered when a new switch is being provisioned and take its corresponding Spine BGP neighbor statement out of ‘shutdown’.
The initial challenge is how to match the neighbor statement to the new switch being provisioned. The answer was that the neighbor IP address is the Leaf uplink IP address. This can be found in the unique attributes file for the switch.
Once I had this solution, I engaged with the Arista EOS+ team to help me create a custom plugin. Arista wrote a Python script that would interrogate the unique attributes file for the IP address. Once the IP address was known it would trigger an Ansible Playbook that would then configure the Spine switch to take the neighbor out of ‘shutdown’. The custom plugin is triggered from the definition file written for each switch.
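The core of that lookup can be sketched as follows. This is not the script Arista wrote: the layout of the unique attributes file, the uplink interface names, the Spine AS number and the sample addresses are assumptions for illustration, and the Ansible step is left out entirely. The sketch only shows how the neighbor IP can be recovered and turned into the command that takes the neighbor out of ‘shutdown’.

```python
def uplink_ips(attributes_text, uplinks=("Ethernet1", "Ethernet2", "Ethernet3")):
    """Return {uplink interface: IP address} from EOS-style config text."""
    ips = {}
    current = None
    for line in attributes_text.splitlines():
        stripped = line.strip()
        if not line.startswith(" "):
            # A top-level line starts a new section; remember it if it is an interface.
            current = stripped.split(None, 1)[1] if stripped.startswith("interface ") else None
        elif current in uplinks and stripped.startswith("ip address "):
            # "ip address 10.1.0.1/31" -> "10.1.0.1"
            ips[current] = stripped.split()[2].split("/")[0]
    return ips


def unshut_commands(spine_asn, neighbor_ip):
    """Commands a Spine would need to bring one pre-provisioned neighbor up."""
    return [f"router bgp {spine_asn}", f"   no neighbor {neighbor_ip} shutdown"]


# Hypothetical unique attributes file for one Leaf switch.
SAMPLE = (
    "hostname LeafSW01\n"
    "interface Ethernet1\n"
    "   ip address 10.1.0.1/31\n"
    "interface Ethernet2\n"
    "   ip address 10.1.0.3/31\n"
    "interface Vlan100\n"
    "   ip address 10.1.100.2/24\n"
)

found = uplink_ips(SAMPLE)
commands = unshut_commands(65000, found["Ethernet1"])
print(found)
print(commands)
```

In the real plugin the equivalent of `commands` was applied to the Spine by an Ansible Playbook rather than printed.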
I created the Ansible Playbook and modified the Python code to configure all three of my Spine switches. I also created a version I could run manually to allow ad hoc testing. I did this by adding an argument for the switch hostname, which then matches the location of its unique attributes file.
Once this was complete, I was pleased (that thrill again) but not satisfied. My BGP configuration on each Spine switch was huge! I thought about this and realized that the attributes required for the BGP configuration of each Leaf node on the Spine switches could also be found in the Leaf node's unique attributes file. I needed the hostname for a description and the BGP AS. I set out to enhance the custom plugin further and with a bit of effort got things working.
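In outline, the enhanced plugin builds the whole neighbor statement at provisioning time instead of relying on one that was configured in advance. A minimal sketch of that step, assuming the hostname, AS and uplink IP have already been read from the unique attributes file; the function name and all of the values are hypothetical.

```python
def spine_neighbor_config(spine_asn, leaf_hostname, leaf_asn, leaf_ip):
    """Build the Spine-side BGP neighbor statements for one Leaf uplink."""
    return [
        f"router bgp {spine_asn}",
        f"   neighbor {leaf_ip} remote-as {leaf_asn}",
        f"   neighbor {leaf_ip} description {leaf_hostname}",
    ]


# Invented values standing in for what the plugin reads from the Leaf's file.
lines = spine_neighbor_config(65000, "LeafSW01", 65001, "10.1.0.1")
print("\n".join(lines))
```

With the statements generated on demand, the Spine configuration only ever contains neighbors for switches that actually exist.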
This is the thing about automation. Start off just attempting to do some small things and very quickly you will start to see how you can take things further and gain more and more benefits.
Figure 1 illustrates the end-to-end process flow of my ZTPS deployment and Figure 2 illustrates the custom plugin.
With the first DC deployed it was time to test. The client was using Ansible as a key component of much of the automation workflow they were working on for application deployment, and I had read about the merits of Ansible. I wrote out my detailed test plan as always but this time I determined to use Ansible to automate each test wherever possible.
Thanks for reading Part 1. I hope you found it useful. Part 2 will cover the continuation of my automation journey using Ansible.
There and Back Again – A Journey Into Network Automation Part 2 – Ansible
‘There and Back Again – A Journey Into Network Automation Part 3 – Python’ will be available 15/12/2016