This is a continuation from Part 1
Case for LDPoRSVP
As we mentioned at the very beginning, ACME provides L3VPN and L2VPN services, which require end-to-end LSPs between the PEs. But due to scaling concerns, ACME decided not to extend RSVP to the edge routers. This creates a problem, as there is no longer an end-to-end LSP between the PEs.
In order to solve this, ACME decided to run LDP on the edge PEs and tunnel it over RSVP. This gave ACME the ease of setting up LDP at the edges while RSVP provides traffic engineering in the core, bringing the best of both worlds.
Cool, it seems like we have solved the problem. But remember, network design is all about trade-offs. So let's ponder for a while on what we lose by not extending RSVP to the edge PEs, and whether it can be fixed somehow.
Not extending RSVP (TE mesh) to the edge PEs means we lose the following RSVP capabilities:
- End-to-end traffic engineering capabilities
- Fast reroute at the edges
For fast reroute at the edges, an NSP can look at alternatives like IGP-FRR, but there isn't really an alternative for end-to-end traffic optimization other than extending RSVP to the edges (or SDN or Segment Routing, which can solve this, but this post isn't about new shiny things). For a moment, let's say that end-to-end traffic optimization is necessary. We also know that in a large network like the one we have here, extending the full mesh to the edges will run into scaling problems due to the amount of state created at the midpoint routers. To solve this kind of problem, the concept of LSP hierarchy [RFC 4206] was introduced.
In this scheme, we create a layer of LSP mesh, say between the core routers, and the core routers then advertise those TE links and their metrics into the IGP (IS-IS/OSPF) with the help of Forwarding Adjacency (FA), so that they appear to the outer PE mesh as if they were real physical links. This allows the outer PEs to use these TE links in their path computation. The benefit of this approach is that the P routers have fewer LSPs to keep track of, as they are only aware of the core LSP mesh and not of the PE-PE LSPs. As in Fig. 8 below, the inner P router only has to keep track of the inner Level 1 mesh.
But as you can see, this isn't that great a solution: it doesn't help with the PE-PE n-squared tunnel problem, not to mention the management overhead of maintaining two meshes (and with DiffServ-TE, as in our case, you will have at least three meshes to manage). If you are interested in digging deeper, take a look at RFC 5439 (An Analysis of Scaling Issues in MPLS-TE Core Networks).
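To get a feel for the n-squared scale, here is a back-of-the-envelope sketch in Python; the router counts are hypothetical, not ACME's actual numbers:

```python
def full_mesh_lsps(n_routers: int) -> int:
    """Unidirectional point-to-point LSPs needed for a full mesh."""
    return n_routers * (n_routers - 1)

# Flat design: one full mesh across all PEs, with most of that state
# transiting the core midpoint routers.
pes = 500
print(full_mesh_lsps(pes))        # 249500

# Hierarchical design (RFC 4206): the inner P routers only track the
# much smaller core mesh.
cores = 20
print(full_mesh_lsps(cores))      # 380

# With DiffServ-TE and, say, 3 classes, each mesh multiplies by the
# number of classes:
print(3 * full_mesh_lsps(cores))  # 1140
```

The quadratic growth of the flat mesh versus the small core mesh is the whole argument for hierarchy; the multiplier per DiffServ-TE class is what makes the management overhead sting.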
*AFAIK, across the major vendors only Juniper has support for hierarchical LSPs.
Another possible option could be to break the single RSVP domain into multiple smaller domains. In Fig. 9 below, three meshes exist: between the PE-P, P-P, and P-PE routers.
One of the biggest problems with this approach is that since you don't have an end-to-end LSP, you can't provide L3VPN/L2VPN services unless you bring in the concepts of seamless MPLS design, with the P routers running BGP labeled unicast. Seamless MPLS begs for its own post and an analysis of its applicability.
[Tangent Alert ended]
IGP and TE Metrics
Generally, in a network, IGP metrics are a reflection of either link bandwidth or delay. In the realm of TE, the IGPs were extended to advertise a single additional TE metric for CSPF. This allows us to optimize on at least two different metrics, reflecting either delay or bandwidth. For instance, I can now optimize on metrics reflecting propagation delay for TE tunnels transporting voice, while metrics reflecting link bandwidth can be used for TE tunnels transporting data.
What is ACME doing?
In ACME's TE deployment, the data TE LSPs carrying data traffic are based on the same metric as the one used for the IGP shortest path first (reflecting link bandwidth). TE metrics (reflecting delay) are used for the voice TE LSPs. The head-end LSR uses IGP metrics when calculating CSPF for data TE, and TE metrics when calculating CSPF for voice TE.
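The dual-metric idea can be sketched with a toy path computation; the topology and metric values below are hypothetical, and a real CSPF would also prune links that fail bandwidth or affinity constraints before running SPF:

```python
import heapq

# Hypothetical topology: each link carries two metrics -- the normal IGP
# metric (reflecting bandwidth) and the TE metric (reflecting delay).
# links[node] = [(neighbor, igp_metric, te_metric), ...]
links = {
    "CR1": [("CR2", 10, 50), ("CR3", 10, 5)],
    "CR2": [("CR1", 10, 50), ("CR4", 10, 5)],
    "CR3": [("CR1", 10, 5), ("CR4", 30, 5)],
    "CR4": [("CR2", 10, 5), ("CR3", 30, 5)],
}

def spf(src, dst, metric):
    """Plain Dijkstra on the chosen metric ("igp" or "te")."""
    pq, seen = [(0, src)], set()
    while pq:
        d, node = heapq.heappop(pq)
        if node == dst:
            return d
        if node in seen:
            continue
        seen.add(node)
        for nbr, igp, te in links[node]:
            heapq.heappush(pq, (d + (igp if metric == "igp" else te), nbr))

print(spf("CR1", "CR4", "igp"))  # data LSP: cost 20, via CR2
print(spf("CR1", "CR4", "te"))   # voice LSP: cost 10, via the low-delay CR3 path
```

Same source and destination, two different best paths: the head end simply runs SPF twice over the same database, once per metric.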
Tunnel Reservation Types
RSVP defines three reservation styles, but only two are applicable for TE purposes. When an LSP tunnel is set up, it is established with either the fixed-filter (FF) or the shared-explicit (SE) reservation style.
- Fixed Filter (FF) Style
- If FF is used, a unique label and unique resource reservations are assigned to each sender. In RSVP-TE, the FF reservation style allows the establishment of multiple parallel unicast point-to-point LSP tunnels. If the LSP tunnels traverse a common link, the total amount of reserved bandwidth on the shared link is the sum of the reservations of the individual senders. The FF reservation style doesn't allow resource sharing or merging of LSPs. This is the default reservation style for Juniper.
- Shared Explicit (SE) Style
- The SE style allows sharing and merging, which is particularly useful in rerouting techniques such as make-before-break (FRR, tunnel optimization). This style is important for rerouting LSPs with no disruption to the traffic. It is the default reservation style for Cisco and Alcatel-Lucent. In the case of Juniper, tunnels are signaled with the SE style if FRR is enabled, or if the adaptive behavior is enabled on the LSP.
- Wildcard Filter (WF) style
- The wildcard filter (WF) reservation style is not used for explicit routing because of its merging rules and lack of applicability for TE purposes. The WF style's real application is multipoint-to-point flows in which only one sender sends at any time, for instance a telephone conference where multiple people are on the call but only one speaks at a time.
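The practical difference between FF and SE shows up in the bandwidth accounting on a shared link. A tiny sketch, with hypothetical figures, for the make-before-break case where the old and new paths of the same tunnel share a link:

```python
# Hypothetical: one tunnel being rerouted make-before-break, with its old
# and new paths crossing the same link.
old_lsp_bw, new_lsp_bw = 400, 400  # Mb/s

# FF: each sender gets its own reservation, so during make-before-break
# the shared link briefly holds the sum -- double booking.
ff_reserved = old_lsp_bw + new_lsp_bw
print(ff_reserved)  # 800

# SE: reservations within the same session share, so the link only
# reserves the larger of the two.
se_reserved = max(old_lsp_bw, new_lsp_bw)
print(se_reserved)  # 400
```

This is exactly why SE is the style of choice wherever tunnels get re-signaled routinely.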
The choice of which style to select is made exclusively by the egress LSR, but it can be influenced by the ingress LSR. The ingress LSR can indicate the desired reservation style to the egress LSR by setting or clearing the "SE style desired" flag in the SESSION_ATTRIBUTE object of the PATH message.
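For reference, the flag lives alongside the other SESSION_ATTRIBUTE flags from RFC 3209; a minimal sketch of how an egress might test it:

```python
# Flag values from the SESSION_ATTRIBUTE object (RFC 3209):
LOCAL_PROTECTION_DESIRED = 0x01
LABEL_RECORDING_DESIRED  = 0x02
SE_STYLE_DESIRED         = 0x04

def wants_se_style(flags: int) -> bool:
    """Egress checks the ingress's hint when choosing the reservation style."""
    return bool(flags & SE_STYLE_DESIRED)

# An ingress signaling FRR plus make-before-break sets both bits:
flags = LOCAL_PROTECTION_DESIRED | SE_STYLE_DESIRED
print(wants_se_style(flags))  # True
```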
What is ACME doing?
All the tunnels in ACME use the SE style reservation, which allows make-before-break and avoids double booking of resources during tunnel rerouting.
Tunnel Sizing
How much bandwidth to request for TE tunnels is the next most important decision after choosing a particular MPLS TE deployment model. One important point, which I have seen many folks miss, is that whatever the bandwidth size of the tunnel is, it's a control-plane constraint rather than a physical one. This means the actual traffic load can exceed what the tunnel reserved, which can result in congestion and dropped packets. On the flip side, if the reserved bandwidth is more than the actual traffic load, the tunnel is reserving more bandwidth than it needs, which results in underutilization of the links and possibly rejection of other tunnels.
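A tiny sketch of why the reservation is only a control-plane number (all figures hypothetical):

```python
# Control plane: what RSVP admission control believes about the link.
link_capacity   = 10_000  # Mb/s physical
tunnel_reserved = 2_000   # what this tunnel reserved

# Data plane: what the tunnel actually carries right now.
actual_load = 3_500  # Mb/s

# Admission control only sees reservations: it will happily admit more
# tunnels against the remaining 8000 Mb/s, even though the real load
# already exceeds this tunnel's reservation.
unreserved = link_capacity - tunnel_reserved
oversubscribed = actual_load > tunnel_reserved
print(unreserved, oversubscribed)  # 8000 True
```

Nothing in the forwarding path polices traffic down to the reserved value; the mismatch between the two numbers is exactly what tunnel sizing tries to minimize.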
Tunnel sizing can be done manually by calculating it offline and periodically re-adjusting it, which will obviously not be in real time. To achieve this on a large-scale network, I could either hire 10 people just to adjust tunnel bandwidth manually based on changing traffic demands, or write some kind of custom scripts to do it.
Another approach could be online resizing, where the head-end router calculates the bandwidth needed and tries to resize the tunnel accordingly. This feature of online resizing by head-end routers is known as auto-bandwidth, in which algorithms run on the head-end routers to dynamically resize the tunnels based on some measure of the traffic load on the tunnel over previous measurement periods.
So essentially, in auto-bandwidth the router takes a traffic sample every X seconds (frequency) and then adjusts the tunnel bandwidth at the end of every Y seconds (adjust-interval), based on the peak value among the samples collected during that adjust-interval. There is also a provision for a threshold: if the peak sample value crosses a certain threshold, the new tunnel bandwidth is signaled immediately rather than waiting for the adjust-interval to expire. This is known as overflow, and it is implemented by most vendors.
While using auto-bandwidth, the key is to find an optimal value for the adjust-interval: not so long that it is slow in reacting to traffic demands, resulting in inefficiencies, but also not so short that TE tunnels are constantly re-signaled, causing too much network churn. As in Fig. 10, it is quite possible for auto-bandwidth to suffer bandwidth lag, depending on one's traffic patterns and adjust-interval.
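The sample/adjust/overflow loop described above can be modeled in a few lines. This is a toy model, not any vendor's actual algorithm, and the intervals and thresholds are made up:

```python
def auto_bandwidth(samples, sample_interval, adjust_interval, overflow_limit):
    """Toy auto-bandwidth: resize to the peak of each adjust-interval
    window, but re-signal immediately when a sample overshoots the
    overflow limit (vendor implementations differ in the details)."""
    per_window = adjust_interval // sample_interval
    window, resignals = [], []
    for t, sample in enumerate(samples):
        window.append(sample)
        if sample > overflow_limit:        # overflow: signal right away
            resignals.append((t, sample))
            window = []
        elif len(window) == per_window:    # normal adjust at window end
            resignals.append((t, max(window)))
            window = []
    return resignals

# 5-minute samples, 30-minute adjust-interval, overflow threshold 900 Mb/s
events = auto_bandwidth([100, 200, 150, 300, 250, 220,  # window 1: peak 300
                         400, 950],                     # 950 trips overflow
                        sample_interval=5, adjust_interval=30,
                        overflow_limit=900)
print(events)  # [(5, 300), (7, 950)]
```

Playing with the window length in this toy makes the trade-off above concrete: a long adjust-interval lags the traffic, a short one re-signals constantly.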
Ref: RIPE 61 BCP planning and TE
I won't go much deeper into auto-bandwidth, but if you want a better understanding of the various issues around it, please take a look at the following NANOG presentation: https://www.nanog.org/sites/default/files/tues.general.steenbergen.autobandwidth.30.pdf
One thing that has recently been interesting to me is SDN in the WAN space. For instance, an offline tool can take over the auto-bandwidth function from the router control plane and resize the tunnels based on traffic demand. The TE tunnels would be programmed through PCEP, and topology information can be collected through BGP-LS. There are already some commercial and open-source implementations available.
[Tangent Alert ended]
What is ACME doing?
ACME has turned on auto-bandwidth on their head-end routers with a thorough understanding of the caveats auto-bandwidth brings. From a long-term perspective, ACME is looking at solutions where offline tools can configure the TE head ends appropriately based on the current traffic load.
Routing Traffic into the Tunnel
So once the TE tunnels are up and ready for use, we have to map traffic onto them. There are a few ways to do that:
1. Static routing: You can simply define a static route for the destination with the TE tunnel as the next hop. Obviously this is not going to scale very well.
2. Policy-based routing: This solution has scaling properties similar to static routing.
3. AutoRoute: AutoRoute essentially tells the head-end router to use the TE tunnel to reach routes announced by the tail-end router and its downstream routers. For every MPLS TE tunnel configured with AutoRoute announce, the link-state IGP installs the routes announced by the tail-end router and its downstream routers into the RIB, pointing at the tunnel. Therefore, all traffic directed to prefixes topologically behind the tunnel tail-end is pushed onto the tunnel. These LSP paths are not announced in the link-state advertisements to other routers. AFAIK, Juniper refers to this as IGP shortcuts.
In our example in Fig. 11, a core router CR-1 has two tunnels, and the AutoRoute feature is turned on. The first tunnel (TU1) is between CR-1 and CR-2, and the other tunnel (TU2) is between CR-1 and a PE router C. Below is how the routing table will look at CR-1.
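Whatever the exact table looks like, the AutoRoute logic can be sketched roughly as follows; the prefixes and interface names here are made up for illustration:

```python
# Hypothetical sketch of AutoRoute at CR-1: prefixes whose SPF path runs
# through a tunnel tail-end (or routers downstream of it) get the tunnel
# as their next hop; everything else keeps its physical next hop.
tunnels = {"CR-2": "TU1", "PE-C": "TU2"}  # tail-end router -> tunnel

# For each prefix, which tunnel tail-end (if any) sits on its shortest
# path from CR-1? (A real router derives this from its SPF run.)
spf_tail = {
    "10.0.2.0/24": "CR-2",   # behind CR-2
    "10.0.3.0/24": "PE-C",   # behind PE router C
    "10.0.1.0/24": None,     # not behind any tunnel tail-end
}

rib = {p: (tunnels[t] if t else "physical-nexthop")
       for p, t in spf_tail.items()}
print(rib["10.0.2.0/24"], rib["10.0.3.0/24"], rib["10.0.1.0/24"])
# TU1 TU2 physical-nexthop
```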
What if I have draft Rosen multicast in the core?
A traditional problem with MPLS TE deployments is that they are solely for unicast. If an NSP has a mix of multicast and unicast traffic, the unicast traffic will travel over MPLS TE while the multicast traffic travels over the native IP path. This creates a problem for multicast RPF checks: in RPF, a received multicast packet's source address is checked to ensure the packet arrived on the same interface the unicast routing table would use to send traffic toward that source. In simple words, the unicast RIB points to the TE tunnel interface as the outgoing interface for the source address, but the packets are received on the physical interface, and hence an RPF drop.
In the case of AutoRoute, vendors have provided a "multicast-intact" nerd knob to solve this problem. It basically creates a separate table that filters out all the tunnel LSPs and contains only actual physical interfaces as next hops, built by the native IGP (essentially an IGP table without MPLS TE enabled). Now, when a multicast packet comes in on the physical interface, this new table is consulted and the RPF check succeeds.
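The two-table idea can be sketched in a few lines; the prefix and interface names are hypothetical:

```python
# Unicast RIB at the head end: AutoRoute replaced the physical next hop
# with the tunnel interface.
unicast_rib = {"10.0.9.0/24": "Tunnel-TU1"}

# multicast-intact keeps a parallel, TE-free IGP table with the real
# physical next hops.
mcast_rpf_table = {"10.0.9.0/24": "ge-0/0/1"}

def rpf_check(source_prefix, in_interface, table):
    """RPF passes only if the packet arrived on the interface the given
    table would use to reach the source."""
    return table.get(source_prefix) == in_interface

# A multicast packet from 10.0.9.0/24 arrives on physical ge-0/0/1:
print(rpf_check("10.0.9.0/24", "ge-0/0/1", unicast_rib))      # False -> drop
print(rpf_check("10.0.9.0/24", "ge-0/0/1", mcast_rpf_table))  # True  -> pass
```

Same packet, same interface: the only thing multicast-intact changes is which table the RPF check consults.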
4. Forwarding Adjacency (FA):
This is another popular method to force traffic through tunnels. In FA, we take an additional step beyond what we do in AutoRoute: we advertise the tunnel LSPs as P2P links to all the routers. This gives the downstream routers behind the tunnel head-end, which don't speak MPLS TE, visibility of the TE tunnel when they run their SPF. From a downstream router's perspective, tunnel LSP links appear as normal P2P links, with no indication that they are TE links. One of the prerequisites for FA to work is that the TE tunnel must be bidirectional. Let's say in Fig. 12 below all links have an IGP cost of 10. From Router A's perspective, the cost to reach Router B is 50 via CR-2 and 60 via CR-1. Our intention here is to load-balance the traffic between the two POPs through CR-1 and CR-2, but Router A will always send traffic for Router B to CR-2, as in its IGP view that is the shortest path.
To achieve our goal, we can use Forwarding Adjacency and advertise both of our tunnels, TU1 and TU2, with a cost of 10 in the IGP. Router A sees these two tunnels as P2P links with a cost of 10, so in its IGP view the cost via CR-1 and via CR-2 is now the same, i.e., 30. This allows Router A to use both paths to reach Router B.
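The equal-cost arithmetic, spelled out (hop costs as assumed from Fig. 12):

```python
# A -> CR-1 link (10) + FA tunnel TU1 (10) + last hop to B (10)
path_via_cr1 = 10 + 10 + 10
# A -> CR-2 link (10) + FA tunnel TU2 (10) + last hop to B (10)
path_via_cr2 = 10 + 10 + 10

print(path_via_cr1, path_via_cr2)  # 30 30 -> Router A can ECMP across both
```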
Forwarding Adjacency seems like a nice idea, so why not just always use it, right? Not so fast, Charlie. It comes with its own issues.
- Increased IGP size: As I mentioned earlier, FA advertises TE links as P2P links in the IGP, which will increase the size of the IGP database. Let's say we have 100 core routers running a full mesh of TE tunnels between them; enabling FA will inject an extra 9,900 (100 x 99) links into the IGP. Depending on your existing IGP size this may or may not be a problem, but it is a factor that needs to be considered.
- IGP churn: Since enabling FA makes the IGP see TE tunnels as normal links, a TE tunnel failure for whatever reason will cause IGP churn, as the TE link is removed and re-advertised.
- Could cause some interesting routing behavior: Let's assume a scenario where we have three POPs: A, B, and C. All link costs are 10.
Shortest path between Router A–>B = 50 via
Shortest path between Router B–>C =50 via
Now let's say that for some reason we want to force traffic between POP A and POP B via the red links only. So we created a TE tunnel (TU1, with PHP disabled just to keep the explanation simple) with a policy to follow only red links, and advertised the tunnel via FA with a cost of 2.
So when Router B calculates its SPF for Router A, it sees the shortest path via CR-3 with a cost of 22 = 10 + 2 + 10, so it starts sending all traffic destined for Router A via CR-3.
During this process of controlling the traffic between POP A and POP B, I unknowingly also affected the traffic between POP B and POP C. From Router B's point of view, the shortest path to reach Router C is now 42 = 10 + 2 + 10 + 10 + 10, through CR-1. This was an unintended consequence: we didn't want traffic for POP C to go through POP A. Obviously this is a solvable problem, but the main point here is to bring awareness.
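The side effect is easy to reproduce with plain SPF over a toy graph. The topology below is a hypothetical one I constructed to be consistent with the costs in the text (B reaches A and C natively in 50; the red-link tunnel TU1 runs CR-3 to CR-1 and is advertised via FA with metric 2):

```python
import heapq

def dijkstra(edges, src, dst):
    """Shortest-path cost over an undirected (u, v, cost) edge list."""
    adj = {}
    for u, v, c in edges:
        adj.setdefault(u, []).append((v, c))
        adj.setdefault(v, []).append((u, c))
    pq, seen = [(0, src)], set()
    while pq:
        d, n = heapq.heappop(pq)
        if n == dst:
            return d
        if n in seen:
            continue
        seen.add(n)
        for m, c in adj[n]:
            heapq.heappush(pq, (d + c, m))

base = [("A", "CR-1", 10), ("CR-1", "X", 10), ("X", "Y", 10),
        ("Y", "CR-3", 10), ("CR-3", "B", 10),               # A <-> B = 50
        ("B", "D", 10), ("D", "E", 10), ("E", "F", 10),
        ("F", "G", 10), ("G", "C", 10),                     # B <-> C = 50
        ("CR-1", "P", 10), ("P", "Q", 10), ("Q", "C", 10)]  # POP-A side to C

print(dijkstra(base, "B", "A"), dijkstra(base, "B", "C"))  # 50 50

fa = base + [("CR-3", "CR-1", 2)]  # advertise TU1 via FA with metric 2
print(dijkstra(fa, "B", "A"), dijkstra(fa, "B", "C"))      # 22 42
```

The tunnel was meant only for A-B traffic, yet the B-C cost also drops from 50 to 42 the moment the FA link enters SPF, pulling POP-C traffic through POP A.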
What if I have draft Rosen multicast in the core?
If the network has multicast, it will have RPF issues similar to what we saw for AutoRoute. The problem is that the fix for AutoRoute was easy: enable multicast-intact on the head-end router. That knob is only available with AutoRoute; to solve this with FA, one needs to enable the MP-BGP multicast address family for RPF checks to succeed. I won't go deeper, but with FA the complexity of the network increases, as one now has to enable BGP multicast, which is not as simple as just putting multicast-intact under your IGP configuration.
5. EXP based traffic classification:
This one is applicable in scenarios where we have DiffServ-TE deployed. In our ACME scenario, we decided to have one mesh of tunnels for voice LSPs and another for data. Considering that voice traffic is marked with EXP bits 101 (EF) at the edges when it enters the network, we can configure a policy at the core routers (tunnel head ends) to map traffic with an EXP value of 101 to the voice LSP, with everything else going to the data LSP.
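The classification policy itself is a one-liner; a minimal sketch, with the LSP names invented for illustration:

```python
VOICE_EXP = 0b101  # EF-marked voice traffic at the tunnel head end

def select_lsp(exp_bits: int) -> str:
    """Map EXP 101 to the voice LSP; everything else to the data LSP."""
    return "voice-lsp" if exp_bits == VOICE_EXP else "data-lsp"

print(select_lsp(0b101))  # voice-lsp
print(select_lsp(0b000))  # data-lsp
print(select_lsp(0b011))  # data-lsp
```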
At this point, let's look at a potential behavior known as "traffic sloshing," which can be observed in a core-mesh deployment with auto-bandwidth and AutoRoute enabled on the tunnels.
In Fig. 14 below, at time T = 0 there are two TE tunnels from each core router in POP-A (CR-1, CR-2) to the POP-B core routers (CR-3, CR-4). All the green links are running at normal utilization and have a cost of 10, except for the pink link, which has a cost of 15.
The tunnels are running auto-bandwidth and AutoRoute. From Router A's perspective, it sees Router B's distance as 50 via CR-1 and 55 via CR-2, so it prefers CR-1, and all traffic from Router A to B flows via CR-1. Since we are using auto-bandwidth, the tunnel from CR-1 is sized up according to the traffic demand pushed from Router A to B. The tunnels from CR-2 are set up with bandwidth = 0, as no traffic is flowing through them.
At time T = 1, assume that the pink link in Fig. 15 starts running hot due to some other inter-POP traffic. That's okay for the tunnels from CR-2, as they requested zero bandwidth since no traffic is flowing through them.
At time T = 2, a link failure happens on the link connected to CR-1. Now, from Router A's perspective, the shortest IGP cost to reach B is through CR-2.
This causes all the traffic to shift suddenly to the tunnels on CR-2. The problem is that the TE tunnels from CR-2 go through the congested link, which will cause traffic loss. Moreover, the tunnels from CR-1 will re-establish themselves on a new path, which can make the problem worse.
This can potentially be solved by using FAs and making the metrics of the CR-1 tunnels better than CR-2's, which will cause the traffic from Router A to stick to the CR-1 tunnels even in the event of a failure. But as you have already seen, using FAs is not an easy choice either, as it comes with its own problems.
What is ACME doing?
ACME has decided to use EXP-based traffic classification as the means to push traffic down the tunnels. Any traffic marked as EXP 101 will be pushed into the voice TE, and everything else into the data TE. They also said it was good that we made them aware of the traffic-sloshing problem, but they will still keep the tunnels in the core and use auto-bandwidth and AutoRoute until they adopt a central controller (using PCEP to program tunnels and BGP-LS to collect topology) to manage their TE tunnels.
Continued in Part 3