Using IP SLA Delay Feature to Safely Monitor Lossy Links

IP SLA is a great feature if you want to add some automation and intelligence into the network. SLA is no SDN/OpenFlow, but it can be very useful. It can also take down a network. Let’s say you are using DMVPN for a number of spoke locations in your network. You have a primary Internet connection on the spoke using DSL or cable modem, and you have an air card for backup. You want to use IP SLA to monitor your spoke’s primary Internet connection: when it goes down, bring the air card online. Of course, when the primary Internet connection comes back online, you want to switch back over.

Here are some of the configuration guides that I found on setting up IP SLA. This is not an extensive list, and none of these cover what this article is going to try to explain. But they are good references.

 For a simple ICMP probe, your IP SLA configuration looks something like this:

ip sla 1
 icmp-echo 155.1.23.2 source-interface Serial0/3/0
 timeout 1000
 threshold 1000
 frequency 30
ip sla schedule 1 life forever start-time now
!
track 1 ip sla 1 reachability

**I don’t have a static route using tracking or an EEM script watching for the tracked object going down in this sample configuration, but that is not the focus here.**

This is a normal IP SLA configuration, correct? We ping 155.1.23.2 (assume it is my primary DMVPN hub) and if we get a response, the track 1 object is up, and we are using our primary Internet connection. If the ping fails, then the track 1 object goes down, and we use whatever mechanism to bring up the air card.
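One such mechanism is a tracked static route. As a minimal sketch only, assuming a hypothetical primary next hop of 192.0.2.1 and a hypothetical backup interface named Cellular0 (neither is part of the sample configuration above), it could look something like this. How the probe itself stays pinned to the primary path after a failover is a separate design question that this sketch does not address.

```
! Primary default route: installed only while track 1 is up
ip route 0.0.0.0 0.0.0.0 192.0.2.1 track 1
! Floating backup route via the air card (administrative distance 250),
! used only when the tracked route above is withdrawn
ip route 0.0.0.0 0.0.0.0 Cellular0 250
```

When track 1 goes down, the first route is removed from the routing table and the higher-distance route takes over; when track 1 comes back up, the primary route is reinstalled.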

What about something in the transit path that is dropping packets randomly? In a DMVPN design, I would recommend monitoring at least one, if not all, of your DMVPN hubs’ public IPs. That should improve stability. But what if something in the path is randomly dropping packets to both hubs? It could be a link with errors, congestion, etc. It can happen; I have seen it.

So, at the first polling interval the ICMP response is good and track 1 is up. At the next polling interval the ICMP packet is lost, so our primary Internet connection is ‘down’ and we fail over to the air card, re-establish our DMVPN tunnels, the routing protocol forms adjacencies, routes are exchanged, and data starts flowing again. Great! During the next polling interval, we get a response along our primary Internet connection path, so we fail back to primary, re-establish tunnels, the routing protocol forms adjacencies, routes are exchanged, and data starts flowing. Next interval: fail. Next interval: response, then fail, response…

Now something in the transit path has effectively brought down our spoke, or spokes. Should we really fail over just because we lost a single ping? TCP was designed to recover from packet loss, so there is a good chance that a path with some packet loss still gives a better user experience than moving to the air card.

It took me a while to find it, but there is a threshold feature of sorts. Under the track object, you can specify delays of up to 180 seconds before the track object goes into an up or down state. This gives your ICMP probe a few more chances to check the link. In this example, here is what I have added:

track 1 ip sla 1 reachability
 delay down 90 up 90

With the probe running every 30 seconds, the 90-second down delay allows the probe three chances to get a response on the primary Internet path before failing over. Likewise, when the primary connection comes back up, it must be stable for three polling intervals, or 90 seconds. This is similar to the hold timers used by routing protocols. Let’s take a look at it in action.

[Screenshot: cgaller_ip_sla1]

[Screenshot: cgaller_ip_sla2]

Here we see the endpoint is responding to the ICMP probe and the tracked state is up. Now let’s take that endpoint down and see how the tracked object responds.

[Screenshot: cgaller_ip_sla3]

[Screenshot: cgaller_ip_sla4]

So now our probe is showing a timeout. Our tracked object is still showing up, but it also has a delayed down with 82 seconds remaining. Now, I will restore connectivity before the down timer expires on the tracked object.

[Screenshot: cgaller_ip_sla5]

I restored the connectivity before the timer on the tracked object expired. The last state change was 9 minutes and 31 seconds ago. Since it never transitioned to the down state, we never failed over. Our probe now has a little wiggle room to allow for some packet loss. This will help us prevent outages caused by failing back and forth constantly due to a single packet loss somewhere in the transit path.
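If you want to repeat this check on your own router, commands along the following lines should show the same information. This is a sketch rather than a transcript: apart from ‘show track 1’, the exact commands behind the screenshots are not stated, so treat the others as an assumption about what is useful here.

```
! Latest probe result for IP SLA 1 (for example OK or Timeout)
show ip sla statistics 1
! State of tracked object 1, its configured delays, and any delay countdown in progress
show track 1
! One-line summary of every tracked object
show track brief
```

Run them while the probe target is reachable, again right after taking it down, and once more after the delay has expired to watch the tracked object hold its state and then change.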

BONUS

You may notice that at the bottom of the ‘show track 1’ output there is a line that states ‘Track-list 3’. I set up two ICMP probes, both tracked. Then I created another tracked object, a list that tracks the first two. Why? Because I want to monitor both DMVPN hubs. Only if I can’t reach either of them after three polling intervals do I want to fail over to the air card. So let’s bring down probe 1.

[Screenshot: cgaller_ip_sla6]

There we are. Tracked object 1 is down, tracked object 2 is up, but the object that monitors both of them shows that we still have connectivity to at least one of them. This also allows me to perform maintenance on one of the hubs without causing all of the spokes to jump over to their air cards.

For completeness, we will bring down the second IP SLA and verify that tracked object 3 goes down.

[Screenshot: cgaller_ip_sla7]

[Screenshot: cgaller_ip_sla8]

As expected, when both probes go down, both tracked objects (1 & 2) go down, which causes our tracked object list 3 to go down. And now we fail over. When at least one of the ICMP probes has been back up for 90 seconds, tracked object 3 will come back up and we will fail back to the primary Internet connection.

Since you have made it this far, you must be interested in this, so I will give you my entire IP SLA config.

track 1 ip sla 1 reachability
 delay down 90 up 90
track 2 ip sla 2 reachability
 delay down 90 up 90
track 3 list boolean or
 object 1
 object 2
!
ip sla 1
 icmp-echo 155.1.23.2 source-interface Serial0/3/0
 timeout 1000
 threshold 1000
 frequency 30
ip sla schedule 1 life forever start-time now
!
ip sla 2
 icmp-echo 155.1.13.1 source-interface Serial0/2/0:0
 timeout 1000
 threshold 1000
 frequency 30
ip sla schedule 2 life forever start-time now

You can use tracked object 3 with a static route or EEM script to trigger the actual failover event. It is also possible to put delays on the combined tracked object, or to require that a certain tracked object NOT be up.
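To make those options a little more concrete, here is a rough sketch. The track numbers 30 and 31, the applet name, and the interface name Cellular0 are all hypothetical and chosen only for illustration; this is not the configuration from the deployment described in this article.

```
! Variation: a list with the delay on the combined object.
! If objects 1 and 2 keep their own delays, the timers add up,
! so you would normally pick one place for the delay.
track 30 list boolean or
 object 1
 object 2
 delay down 90 up 90
!
! Variation: up only when object 1 is up AND object 2 is NOT up
track 31 list boolean and
 object 1
 object 2 not
!
! EEM applet that reacts when tracked object 3 goes down
event manager applet PRIMARY-DOWN
 event track 3 state down
 action 1.0 syslog msg "Track 3 down: no hub reachable, enabling air card"
 action 2.0 cli command "enable"
 action 3.0 cli command "configure terminal"
 action 4.0 cli command "interface Cellular0"
 action 5.0 cli command "no shutdown"
 action 6.0 cli command "end"
```

A matching applet with ‘event track 3 state up’ could reverse these steps when the primary path has been stable again for the configured delay.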

Share and enjoy!

Charles Galler

Charles is a network and UC engineer for a mainly Cisco reseller. He has worked in the networking industry for about 13 years. He started as a network administrator for a small CLEC (carrier) where he did it all in IT and worked on the carrier network. After the CLEC, Charles went to work for a large healthcare organization in the Houston area and stayed with them for about three and a half years. Now he works for a reseller in the professional services part of the organization. He is currently studying for his CCIE in Routing and Switching and plans on passing it before the end of 2014. You can find him on Twitter at @twidfeki.

  • Luis M

    What I’m missing: I have the OID for the IP SLA last trigger result, but I don’t have the OID for the track object. Do you know which one it is?

    Kind regards

    • Charles Galler

      I have not looked into SNMP monitoring of the tracked objects. I did see a trap notification for tracked objects, but didn’t see anything that could be polled. With that being said, I was looking for something else and not specifically looking for tracked object OIDs.

      What do you want to do with the tracked object? Just get an email alert? Have your monitoring system detect the tracked object going down?

  • http://twitter.com/SomeClown Teren

    Nice write-up. I used something similar at another company I was at and it worked well and accomplished what it needed to accomplish. In my current environment we’re only failing over between permanent WAN links (mix of SONET and Ethernet) and since we’re using OSPF we just configure sub-second failover of the routes to accomplish the same thing. We’d be vulnerable to flapping (as you describe above) except for the type of links and the low probability of a lot of flapping (if one of our larger fiber links goes down, it’s probably not coming up real quick… at least not until the backhoe gets out of the way and someone duct-tapes it all back together). :)

    • Charles Galler

      Sometimes just allowing the routing protocol to converge the network is simple and easy. In the scenario that brought up this issue, the client was failing over to an air card, which has much less bandwidth. There is an EEM script that also monitors the tracked object, and only when the tracked object goes down does the air card come up. Then the EEM script shuts down non-critical services such as guest WiFi. Thanks for the feedback.

  • Duna Yiv

    Sweet! Just what I needed to read.
