It’s widely accepted that Bidirectional Forwarding Detection (BFD) is a Good Thing(tm). Most of us use BFD on point-to-point links to quickly detect forwarding problems. The general rule of thumb is that you install the IGP as a client of BFD, set the minimum interval, make sure the multiplier is three, and Bob’s your uncle.
But is that all BFD can do?
Multi-hop
Most of us just assume that the use case for BFD is limited to adjacent routers; however, RFC 5883 explains how BFD can run over multiple hops to create a session between any two routers that have IP connectivity. This creates new possibilities, such as running a BFD session between CE routers across a service provider network. Typically, detecting WAN forwarding errors is left to something such as IP SLA or RPM.
We’re all guilty of having to buy cheap managed MPLS services because of small operational budgets. Typically, such managed MPLS services do not come with high availability features such as BFD and FRR; each time the provider has an internal issue, it causes problems because traffic is simply sent into a black hole. So the ultimate burden of responsibility for detecting WAN forwarding errors is pushed into the customer’s hands.
An alternative solution is to use multi-hop BFD to detect WAN forwarding errors. BFD uses UDP and can be routed like any other IP packet through a series of transit routers. This makes it possible to detect forwarding errors across the WAN regardless of the number of routers in between, as long as there’s IP connectivity. Features such as IP SLA and RPM are simple enough to use, but require that you use a proprietary protocol across your entire WAN.
The Topology
A few of you may recognize this subset topology from the JNCIE-M Study Guide by Harry Reynolds. I still have my JNCIE-SP laboratory with pre-built MPLS configurations, so it was the perfect environment to demonstrate multi-hop BFD.
In this demonstration, the two CE routers are C1 and C2. The “stuff” in the middle is a simple MPLS network that represents a managed MPLS service that many of us use as a WAN. To get a bit more technical, R4 and R6 are PE routers while R3 and R5 are P routers.
All of the important information, such as interface names, IP addresses, network addresses, loopback addresses, and hints towards the IGP and BGP policies, has been included in the topology. The IGP being used is IS-IS with areas 49.0001 and 49.0002; this isn’t important for our test, it’s simply a leftover from the JNCIE laboratory. For those of you who are curious about the multiple IS-IS areas, they were used to create inter-area LSPs with no CSPF attributes.
The most important information is that C1 has an ASN of 65010 and the aggregate address of 200.200/16 while the router C2 has an ASN of 65020 and the aggregate address of 220.220/16. The rest of the topology is simply mechanics to transport packets between the two CE routers.
The Objective
The objective of this demonstration is to detect forwarding errors between C1 and C2 without having forwarding or topology knowledge of the provider network. BFD needs to be installed between routers C1 and C2 in order to detect forwarding errors; however, this is only half of the problem. Once an error is detected, a corrective action needs to be taken. The appropriate action would be to remove the offending prefixes, and with them the failed next-hop, from the FIB.
The Solution
An easy method to map a next-hop address to a set of offending prefixes would be to use a routing protocol such as BGP. The two CE routers C1 and C2 will use a multi-hop EBGP session to exchange prefixes. BFD can now use this single BGP session between C1 and C2 to detect any type of forwarding errors between the routers. If BFD detects an error, it will simply signal the multi-hop EBGP session to be torn down, and the offending prefixes will be removed from the RIB and FIB immediately on both C1 and C2.
But Wait …
If the two CE routers C1 and C2 only have a multi-hop EBGP session between them, how will the service provider know how to route the packets?
There will actually be two BGP sessions per CE router. The first BGP session will be to the PE and the second BGP session to the other CE router.
CE-PE
The CE-PE BGP session has a very special role in this design. It will be responsible for two things:
- Advertise all prefixes with the no-export community. This will allow the managed MPLS service to have forwarding information, but will restrict the advertisements so that other CE routers will not receive the prefixes.
- Advertise the CE loopback address into the managed MPLS service with no special communities. Obviously, all CE routers connected to the same managed MPLS service will receive these prefixes.
The end result is that each CE router connected to the managed MPLS service will have loopback address reachability to every other CE router, while the managed MPLS provider has complete forwarding information for each CE router. For example, C1 will announce 200.200.0.1/32 with no special communities, but will announce 200.200/16 with the no-export BGP community. The managed MPLS provider will now know that 200.200/16 belongs to C1. From the perspective of C2, the only C1 prefix received from its BGP session to the PE is the loopback address of C1 (200.200.0.1/32). C2 has no idea about 200.200/16. In the other direction, C2 will announce its loopback address 220.220.0.1/32 with no BGP communities, so C1 will learn about this prefix via the BGP session to its PE router. C2 will also announce the prefix 220.220/16 with the no-export BGP community so that the managed MPLS service will have forwarding information, but C1 will know nothing about the prefix 220.220/16.
To summarize: each CE router has loopback address connectivity to every other CE router, but nothing else. On the other hand, each CE router advertises its network range to its respective PE router with the no-export BGP community, thus hiding these prefixes from other CE routers, but giving the managed MPLS service a complete forwarding table.
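The CE-PE advertisement rules above could be expressed on C1 with an export policy along the following lines. Treat this as a minimal sketch rather than the configuration used in the lab: the policy name to-pe, the community name no-export-comm and the term names are made up for illustration, while the group name pe, the peer address 172.16.0.5 and the provider AS 65412 come from the command output shown later in this article. It also assumes that 200.200.0.0/16 already exists in inet.0 (for example as an aggregate or static route). Lines starting with # are annotations for the reader, not commands.

```
# CE-PE export policy on C1 (illustrative names, not the lab configuration)
set policy-options community no-export-comm members no-export
# Loopback: advertised with no communities so that other CEs can reach it
set policy-options policy-statement to-pe term loopback from route-filter 200.200.0.1/32 exact
set policy-options policy-statement to-pe term loopback then accept
# Aggregate: tagged no-export so the provider keeps it but does not pass it to other CEs
set policy-options policy-statement to-pe term aggregate from route-filter 200.200.0.0/16 exact
set policy-options policy-statement to-pe term aggregate then community add no-export-comm
set policy-options policy-statement to-pe term aggregate then accept
# Nothing else is advertised to the PE
set policy-options policy-statement to-pe term everything-else then reject
set protocols bgp group pe type external
set protocols bgp group pe peer-as 65412
set protocols bgp group pe neighbor 172.16.0.5
set protocols bgp group pe export to-pe
```

C2 would mirror this with 220.220.0.1/32 and 220.220.0.0/16. In the lab each router is a logical system, so the equivalent statements would sit under the corresponding logical-systems hierarchy.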
CE-CE
A CE router that can only see the loopback addresses of the other CE routers isn’t very useful. The trick is to establish a second BGP session directly between the CE routers and have each one advertise its own prefixes over it. This will give each CE a full view into the WAN. For example, C1 will advertise 200.200/16 to C2, and C2 will advertise 220.220/16 to C1.
By using this specific CE to CE BGP session as a client to BFD, we will be able to detect any type of forwarding errors in the managed MPLS network and remove these specific, offending prefixes from the RIB and FIB.
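A CE-CE session of this kind, with BFD attached as a client, might look roughly like the following on C1. This is a sketch under assumptions, not the author’s configuration: the policy name to-ce, the community name ce-tag and the term names are hypothetical, while the group name ce, the addresses, the AS numbers, the 65010:666 community and the 150 ms / multiplier 3 timers match the lab output shown later in this article. As before, it assumes 200.200.0.0/16 is already present in inet.0, and lines starting with # are annotations.

```
# CE-CE multi-hop EBGP session on C1 with BFD as a client (illustrative names)
set routing-options autonomous-system 65010
set policy-options community ce-tag members 65010:666
# Only the aggregate is advertised to the remote CE
set policy-options policy-statement to-ce term aggregate from route-filter 200.200.0.0/16 exact
set policy-options policy-statement to-ce term aggregate then community add ce-tag
set policy-options policy-statement to-ce term aggregate then accept
set policy-options policy-statement to-ce term everything-else then reject
# Loopback-to-loopback EBGP needs multihop and an explicit local address
set protocols bgp group ce type external
set protocols bgp group ce multihop
set protocols bgp group ce local-address 200.200.0.1
set protocols bgp group ce peer-as 65020
set protocols bgp group ce neighbor 220.220.0.1
set protocols bgp group ce export to-ce
# BFD client: 150 ms interval x multiplier 3 = 450 ms detection time
set protocols bgp group ce bfd-liveness-detection minimum-interval 150
set protocols bgp group ce bfd-liveness-detection multiplier 3
```

The session only comes up once each CE has learned the other’s loopback from its PE, which is why the CE-PE session has to advertise the /32 without restrictions.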
Scaling Concerns
In most networks, this alternative solution would work well. However, this solution does require a multi-hop BGP session per CE router. In a hub and spoke topology, this means that if there are 100 branch routers and two data centers, each data center will require 100 multi-hop BGP sessions and each branch router will require a single multi-hop BGP session to each data center or hub.
In some sense, this scaling concern is moot, because IP SLA and RPM would have similar requirements in a hub and spoke topology. Each branch router will require an IP SLA or RPM probe to the hub, and the hub will require another probe to each branch router.
However, a benefit of using BGP and BFD versus IP SLA and RPM is that modern routers support higher scale using BGP and BFD. For example, the Juniper MX can support over 8,000 BGP and BFD sessions, but only 2,000 RPM connections.
Laboratory
Time to get our hands dirty. Let’s log in to C1 and check out the BGP sessions.
root@P-network:C1> show bgp summary
Groups: 2 Peers: 2 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0                 3          3          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
172.16.0.5            65412         16         17       0       0        5:44 2/2/2/0              0/0/0/0
220.220.0.1           65020         11         12       0       0        3:50 1/1/1/0              0/0/0/0
There are two BGP sessions: the first is to the PE router R4 (172.16.0.5) and the second is to C2 (220.220.0.1). The CE to CE multi-hop EBGP session uses the CE loopback addresses as the source and destination addresses.
Recall that the C1 CE-PE BGP session will advertise the loopback address (200.200.0.1/32) with no communities and its network range (200.200/16) with the no-export BGP community. Let’s verify this:
root@P-network:C1> show route advertising-protocol bgp 172.16.0.5 \
extensive
inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
* 200.200.0.1/32 (1 entry, 1 announced)
BGP group pe type External
Nexthop: Self
AS path: [65010] I
* 200.200.0.0/16 (1 entry, 1 announced)
BGP group pe type External
Nexthop: Self
AS path: [65010] I
Communities: no-export
The C1 CE-CE BGP session will advertise only its network range (200.200/16):
root@P-network:C1> show route advertising-protocol bgp 220.220.0.1 \
extensive
inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
* 200.200.0.0/16 (1 entry, 1 announced)
BGP group ce type External
Nexthop: Self
AS path: [65010] I
Communities: 65010:666
Just for fun, we tagged C1’s 200.200/16 prefix with the BGP community 65010:666.
Another good method of verification is to take a look at what the PE router R4 is advertising to C1:
root@P-network:C1> show route receive-protocol bgp 172.16.0.5 extensive
inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
* 172.16.0.8/30 (1 entry, 1 announced)
Accepted
Nexthop: 172.16.0.5
AS path: 65412 I
Communities: target:65412:420
* 220.220.0.1/32 (1 entry, 1 announced)
Accepted
Nexthop: 172.16.0.5
AS path: 65412 65020 I
Communities: target:65412:420
As expected, the PE router is only sending a remote CE-PE /30 network (172.16.0.8/30) and the loopback address of C2. Now let’s take a look at what C2 is advertising to C1:
root@P-network:C1> show route receive-protocol bgp 220.220.0.1 extensive
inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
* 220.220.0.0/16 (1 entry, 1 announced)
Accepted
Nexthop: 220.220.0.1
AS path: 65020 I
Just as suspected, C2 is only advertising its aggregate 220.220/16 to C1. To get a better view of C1, let’s take a look at its routing table:
root@P-network:C1> show route protocol bgp
inet.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
172.16.0.8/30 *[BGP/170] 00:05:05, localpref 100
AS path: 65412 I
> to 172.16.0.5 via ge-1/2/11.0
220.220.0.1/32 *[BGP/170] 00:05:06, localpref 100
AS path: 65412 65020 I
> to 172.16.0.5 via ge-1/2/11.0
220.220.0.0/16 *[BGP/170] 00:04:45, localpref 100, from 220.220.0.1
AS path: 65020 I
> to 172.16.0.5 via ge-1/2/11.0
C1 has received two routes from its PE: the CE-PE /30 network and the loopback address of C2. C1 has received one prefix from C2: 220.220.0.0/16.
Now let’s take a look at the BFD session that’s using the CE-CE BGP session as a client:
root@P-network:C1> show bfd session
                                                  Detect   Transmit
Address                  State     Interface      Time     Interval  Multiplier
220.220.0.1              Up                       0.450     0.150        3
1 sessions, 1 clients
Cumulative transmit rate 6.7 pps, cumulative receive rate 6.7 pps
So far, everything looks good. Let’s check the loopback connectivity between C1 and C2:
root@P-network:C1> ping source 200.200.0.1 220.220.0.1 rapid count 10
PING 220.220.0.1 (220.220.0.1): 56 data bytes
!!!!!!!!!!
--- 220.220.0.1 ping statistics ---
10 packets transmitted, 10 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.354/0.369/0.468/0.033 ms
Now, check an address within the BGP aggregate:
root@P-network:C1> ping source 200.200.0.254 220.220.0.254 rapid count \
10
PING 220.220.0.254 (220.220.0.254): 56 data bytes
!!!!!!!!!!
--- 220.220.0.254 ping statistics ---
10 packets transmitted, 10 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.358/0.382/0.498/0.043 ms
At this point, the topology has been configured correctly, and C1 and C2 have complete connectivity. Now it’s time to make some trouble.
Detecting Forwarding Errors
A really evil way to create an MPLS forwarding error is to block only family mpls packets. This will discard all data traveling through LSPs, but protocols such as BFD (attached to the provider’s IGP) and RSVP will still think the LSP is up and operational because the provider IGP and regular IP forwarding aren’t affected. However, this will cause major problems for customers trying to send data through the provider network. As the customer sends data into the provider network, it will use an LSP as a next-hop, and because the LSP uses family mpls to forward traffic, the traffic will be discarded by the evil family mpls discard filter and never reach its destination.
Let’s log in to the P router R3 and create a nasty firewall filter that will only discard ingress MPLS packets from the PE router R6.
[edit]
root@P-network:R3# set firewall family mpls filter deny-mpls term 1 \
then discard
root@P-network:R3# set interfaces ge-1/3/2.0 family mpls filter \
input deny-mpls
[edit]
root@P-network:R3# commit and-quit
*** C2/ce.log ***
bgp_recv: peer 200.200.0.1 (External AS 65010): received unexpected EOF
Terminated BFD session to peer 200.200.0.1 (External AS 65010)
commit complete
Exiting configuration mode
Very interesting! As soon as the configuration is committed, the forwarding error is detected and the CE-CE BGP session is torn down along with its BFD session, removing the offending BGP prefixes along that forwarding path. Let’s check C1 and see if BFD is still up:
root@P-network:C1> show bfd session
0 sessions, 0 clients
Cumulative transmit rate 0.0 pps, cumulative receive rate 0.0 pps
No surprise there. Let’s remove the offending MPLS filter from R3:
[edit]
root@P-network:R3# delete interfaces ge-1/3/2.0 family mpls filter
[edit]
root@P-network:R3# commit and-quit
commit complete
Exiting configuration mode
200.200.0.1 (External AS 65010): reseting pending active connection
advertising receiving-speaker only capabilty to neighbor 200.200.0.1 (External AS 65010)
Initiated BFD session to peer 200.200.0.1 (External AS 65010): \
address=200.200.0.1 ifindex=0 ifname=(none) txivl=150 rxivl=150 mult=3 ver=255
BFD session to peer 200.200.0.1 (External AS 65010) up
As soon as the forwarding error has been removed, the CE routers detect the restored path and bring BGP back up:
root@P-network:C1> show bfd session
                                                  Detect   Transmit
Address                  State     Interface      Time     Interval  Multiplier
220.220.0.1              Up                       0.450     0.150        3
1 sessions, 1 clients
Cumulative transmit rate 6.7 pps, cumulative receive rate 6.7 pps
BFD is back up now that we have removed the evil MPLS discard filter on R3. Obviously, BFD needs to be tuned to meet the requirements of your WAN. In this demonstration, I’ve used a 150ms interval because the Juniper MX80-48T is simply looped back onto itself. In a real production environment, it may make sense to use an interval such as 1000ms, which would put the detection time at 3000ms assuming a multiplier of three.
Conclusion
BGP and BFD can create a powerful combination in detecting forwarding errors in a WAN topology. This demonstration only showed the functional aspects of the protocols. Scaling such a design would require additional thought. For example, if two or more managed MPLS providers were involved, a loopback address per provider would need to be used: C1 could use 200.200.0.1 for MPLS provider #1 and 200.200.0.2 for MPLS provider #2. This would prevent a scenario where C1 and C2 could still create a CE-CE multi-hop EBGP connection using 200.200.0.1 over provider #2. In summary, each CE would need a dedicated loopback address for each provider that would be used for the CE-CE BGP session.
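As a rough illustration of the per-provider loopback idea, C1 might carry one loopback address and one CE-CE group per provider. Everything beyond the addresses 200.200.0.1, 200.200.0.2 and 220.220.0.1 and the AS numbers is an assumption made for the example: the group names ce-provider1 and ce-provider2 are invented, and 220.220.0.2 is a hypothetical second loopback on C2. Lines starting with # are annotations, not commands.

```
# Hypothetical sketch: one loopback and one CE-CE group per provider on C1
set interfaces lo0 unit 0 family inet address 200.200.0.1/32
set interfaces lo0 unit 0 family inet address 200.200.0.2/32
# Session pinned to provider #1
set protocols bgp group ce-provider1 type external
set protocols bgp group ce-provider1 multihop
set protocols bgp group ce-provider1 local-address 200.200.0.1
set protocols bgp group ce-provider1 peer-as 65020
set protocols bgp group ce-provider1 neighbor 220.220.0.1
# Session pinned to provider #2 (220.220.0.2 is an assumed address on C2)
set protocols bgp group ce-provider2 type external
set protocols bgp group ce-provider2 multihop
set protocols bgp group ce-provider2 local-address 200.200.0.2
set protocols bgp group ce-provider2 peer-as 65020
set protocols bgp group ce-provider2 neighbor 220.220.0.2
```

For the pinning to hold, each loopback would have to be advertised only to the PE of its own provider, so that a session can never re-establish itself across the other provider’s network. Each group would also carry its own bfd-liveness-detection settings.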
I’ve seen many customers use this method of detecting forwarding errors within a WAN very successfully. It’s a simple and elegant design that’s based on standards-based protocols. Modern routers have no issues scaling such a solution into the thousands of branch routers in a hub and spoke topology.
If your WAN doesn’t use a hub and spoke topology and requires a full mesh of connectivity, don’t forget to use the following formula to calculate the total number of sessions:
sessions = n × (n − 1) / 2, where n is the number of CE routers. A full mesh of 100 CE routers, for example, needs 4,950 sessions.
Another interesting benefit of using BFD is that it offers interval and multiplier adaptation by default. If the router detects too many forwarding errors within a certain amount of time, it will back off the interval and multiplier until it finds a happy medium. However, this is a double-edged sword. If BFD is subject to adaptation, that means you’ve set the timers too low, but on the other hand the adapted timers should give you a good idea of where you need to be to avoid false positives. Don’t be alarmed by network oscillation though; BFD will typically only use adaptation with small, sub-second intervals. The general sample window is 15 seconds. For example, if you set the BFD interval to 10ms and the session became unstable within a 15 second window, BFD would adapt to 20ms and measure again. If it was still unstable, it would adapt to 40ms, and so on.
Goodies
For those of you wanting to experiment with this, feel free to download the configuration used for this laboratory. A single Juniper MX80 was used to create a topology of 12 routers running MPLS, IS-IS, BGP, BFD and other protocols. The feature that allows the virtualization of the router is logical-systems. It’s sort of like VMware for a router. Simply spawn a new logical-system for each virtual router that you need.
The illustration above shows the physical configuration of the MX80-48T and how the interfaces were looped together.
Configuration: http://pastebin.com/G03CUd9N
Happy hunting.



