This is a continuation from Part 1
9. ACME VPN RR’s Design:
The total number of PE’s dedicated to VPN functionality is currently around 400 (2 PE’s in each POP x 200 POPs). A full iBGP mesh between 400 PE’s comes to 79,800 sessions ((400×399)/2). By introducing two VPN RR’s, each PE needs only two iBGP sessions (one to each VPN RR), and each RR has 400 iBGP sessions (one to each PE). This solves the full iBGP mesh problem.
The next problem is how to solve sub-optimality and path diversity (for multipathing and fast recovery) once RR’s are introduced. Luckily, this is pretty simple to solve for VPN routes, thanks to Route Distinguishers (RD’s). ACME decided to use a different RD on each PE for the same customer, which makes the VPNv4 routes unique. Hence the VPNv4 RR’s don’t hide any paths when reflecting, as the routes don’t look the same to the RR.
In the example below, the customer advertises the same prefix “P” to both PE’s. Since a separate Route Distinguisher is added to “P” on each PE, the result is two different VPNv4 routes. From the RR’s perspective they are two different routes, so it reflects both of them to the other VPNv4 clients. This is a well-known and simple way to achieve path diversity and mitigate sub-optimality.
One other feature ACME is looking into is BGP Route Target Constraint (RFC 4684) to gain efficiencies on the PE side, but this is slightly off topic, so I will leave it as an exercise for the reader.
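As a quick sanity check on the numbers above, here is a minimal Python sketch of the session arithmetic. The function names are made up for illustration, and sessions between the two RR’s themselves are deliberately left out.

```python
def full_mesh_sessions(routers: int) -> int:
    """Number of iBGP sessions in a full mesh of `routers` speakers."""
    return routers * (routers - 1) // 2


def rr_sessions(clients: int, reflectors: int) -> int:
    """Client-to-RR sessions when every client peers with every RR.

    Sessions between the RRs themselves are not counted here.
    """
    return clients * reflectors


pes = 2 * 200  # 2 PEs per POP x 200 POPs, as in the text

print(full_mesh_sessions(pes))  # 79800
print(rr_sessions(pes, 2))      # 800 in total: 2 per PE, 400 per RR
```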
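The effect of unique RD’s can be sketched in a few lines of Python. This is only a toy model: the RD values, the prefix and the "first path seen wins" rule are assumptions standing in for real VPNv4 best path selection. The point is that the RR keeps one best path per (RD, prefix) key, so two RD’s mean two reflected routes.

```python
def reflect(vpnv4_routes):
    """Return what a toy RR would reflect: one path per (RD, prefix) key.

    vpnv4_routes is a list of (rd, prefix, next_hop) tuples. Keeping the
    first path seen per key is a stand-in for real best path selection.
    """
    best = {}
    for rd, prefix, next_hop in vpnv4_routes:
        best.setdefault((rd, prefix), next_hop)
    return best


# Same RD on both PEs: the RR sees one VPNv4 route and hides PE2's path.
same_rd = [
    ("65000:1", "10.1.0.0/16", "PE1"),
    ("65000:1", "10.1.0.0/16", "PE2"),
]

# A different RD per PE: two distinct VPNv4 routes, both get reflected.
unique_rd = [
    ("65000:101", "10.1.0.0/16", "PE1"),
    ("65000:102", "10.1.0.0/16", "PE2"),
]

print(len(reflect(same_rd)))    # 1
print(len(reflect(unique_rd)))  # 2
```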
10. ACME Internet RR design
We could potentially apply a similar concept to Internet traffic as we did for VPN traffic to overcome the challenges posed by RR’s, if the Internet traffic were placed within a VRF (let’s call it the Internet VRF). Unfortunately, ACME doesn’t put its Internet traffic within a VRF, so let’s explore some other ideas.
As you saw earlier, the location of the RR’s becomes very important for avoiding sub-optimal routing, so ACME decided to put an RR pair in each POP. Since the RR’s are co-located with their clients within the POP, the RR’s best path selection should be the same as that made by the POP clients from an IGP cost perspective. This way ACME avoided sub-optimality at the intra-POP level, as the IGP distance to the BGP next hop is no longer a big factor. But this created another problem: as you may already know, all top-level RR’s need to be fully meshed, which creates its own scalability issue. With 400 RR’s (200 POPs x 2 RR’s per POP), the iBGP mesh comes to 79,800 sessions. This brings us back to the problem we started with, i.e. a large number of iBGP sessions.
So ACME decided to introduce a 2nd level of RR hierarchy. ACME grouped the POPs into three main geographical regions: West, Central and East Coast. The RR’s in the POPs are RR clients of the regional RR’s (2nd level hierarchy), and the top-level (regional) RR’s are fully meshed with each other.
In fig. 8 (A) below, a West coast POP has two RR’s (within the POP) that peer with their 2nd level regional RR’s (West). There is also an iBGP session between the 1st level RR’s within the POP. Fig. 8 (B) shows a high-level picture of the 1st level RR’s in each POP peering with their 2nd level RR’s, which are fully meshed together.
So putting RR’s in each POP solved the sub-optimality problem at the intra-POP level, and putting regional RR’s at the 2nd level solved the full iBGP mesh problem between the 1st level RR’s.
Another variant of the above solution is to keep the iBGP full mesh within the POPs. In that case the job of the RR’s in the POP is to:
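To get a feel for how much the hierarchy saves, here is a rough Python sketch of the RR-to-RR session count. The text doesn’t say how many regional RR’s ACME runs, so 2 per region is purely an assumption for illustration, as are the function names. Client sessions inside the POPs are not counted.

```python
def full_mesh_sessions(n: int) -> int:
    """Number of iBGP sessions in a full mesh of n speakers."""
    return n * (n - 1) // 2


def hierarchical_sessions(pops, rrs_per_pop, regions, rrs_per_region):
    """RR-to-RR sessions in a two-level hierarchy.

    Assumes each POP RR peers with every RR of its own region, the RRs
    inside a POP peer with each other, and all regional RRs are fully
    meshed. PE-to-RR client sessions are not included.
    """
    uplinks = pops * rrs_per_pop * rrs_per_region
    intra_pop = pops * full_mesh_sessions(rrs_per_pop)
    top_level = full_mesh_sessions(regions * rrs_per_region)
    return uplinks + intra_pop + top_level


flat = full_mesh_sessions(200 * 2)               # all 400 POP RRs meshed
tiered = hierarchical_sessions(200, 2, 3, 2)     # 2 RRs per region: assumed

print(flat)    # 79800
print(tiered)  # 1015 = 800 uplinks + 200 intra-POP + 15 top-level
```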
- Reflect routes learned from the RR clients within the POP and advertise to 2nd level RR’s
- Reflect the routes learned from 2nd level RR’s and advertise towards the RR clients.
Since the clients within the POP are fully meshed, there is no value in the RR reflecting routes within the POP (intra-POP). Hence we disable client-to-client reflection (“no bgp client-to-client reflection”) within the POP. Also, an iBGP session between the 1st level RR’s within the POP adds no value in this variant, so they are not connected to each other.
One thing to observe here is that if we increase the number of PE’s in each POP (to 10, say) or increase the number of POPs, this design will handle the growth.
Okay, so far so good, but what about solving sub-optimality and path diversity for inter-POP traffic? On top of that, adding a 2nd level of RR hierarchy may further reduce path diversity.
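The reflection behaviour of a POP RR in this variant can be sketched as follows. This is a simplified Python model of the basic RFC 4456 rules, not any vendor’s implementation. The peer names are hypothetical, and the regional RR’s are modelled as ordinary non-client iBGP peers of the POP RR.

```python
def reflect_targets(source, clients, non_clients, client_to_client=True):
    """Return the peers a route from `source` is reflected to.

    Simplified RFC 4456 rules:
      - from a client: to all non-clients, and to the other clients
        unless client-to-client reflection is disabled
      - from a non-client iBGP peer: to all clients
    """
    if source in clients:
        targets = set(non_clients)
        if client_to_client:
            targets |= set(clients) - {source}
    elif source in non_clients:
        targets = set(clients)
    else:
        raise ValueError(f"unknown peer: {source}")
    return sorted(targets)


clients = ["PE1", "PE2"]                  # PEs inside the POP
non_clients = ["RR-West-1", "RR-West-2"]  # 2nd level RRs (hypothetical names)

# Default behaviour: PE1's route also goes back down to PE2.
print(reflect_targets("PE1", clients, non_clients))
# With client-to-client reflection disabled: only upwards.
print(reflect_targets("PE1", clients, non_clients, client_to_client=False))
# Routes from the 2nd level still reach every client.
print(reflect_targets("RR-West-1", clients, non_clients, client_to_client=False))
```

With the full mesh inside the POP, PE2 already hears PE1’s route directly, so dropping the client-to-client copy loses nothing.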
11. BGP Add-Path
To solve the inter-POP sub-optimality and path-diversity issues, ACME looked into BGP Add-Path (https://tools.ietf.org/html/draft-ietf-idr-add-paths-10), which enjoys support from various vendors. BGP Add-Path is very similar in concept to adding a unique RD to VPNv4 routes. It adds a unique Path ID to the prefix so that advertisements for the same prefix look different. For instance, a prefix 10/8 learned via two different next hops can be identified as (10/8, ID=1) and (10/8, ID=2). BGP Add-Path requires software support on both the RR’s and the RR clients. Luckily, all the routers running BGP in the ACME network had the latest software, which supports this capability. There is also another solution called BGP diverse-path (Shadow RR’s), which solves a similar problem without any need to upgrade the RR clients, but it is only supported by one vendor (at least for now), so it was not considered.
There are different modes of BGP Add-Path (Add-All-Paths, Add-N-Path, etc.) and they all have different properties. The major benefits of using BGP Add-Path are:
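The role of the Path ID can be illustrated with a toy Adj-RIB-In in Python. The class and method names below are invented for this sketch. Without Add-Path, a second advertisement for the same prefix on a session replaces the first one. With Add-Path the table is keyed by (prefix, path ID), so both survive.

```python
class AdjRibIn:
    """Toy per-session Adj-RIB-In, with or without Add-Path."""

    def __init__(self, add_path: bool):
        self.add_path = add_path
        self.paths = {}

    def receive(self, prefix, path_id, next_hop):
        # Without Add-Path the path ID is ignored, so a new advertisement
        # for the same prefix overwrites the previous one.
        key = (prefix, path_id) if self.add_path else prefix
        self.paths[key] = next_hop

    def next_hops(self, prefix):
        found = []
        for key, next_hop in self.paths.items():
            key_prefix = key[0] if self.add_path else key
            if key_prefix == prefix:
                found.append(next_hop)
        return sorted(found)


plain = AdjRibIn(add_path=False)
with_ids = AdjRibIn(add_path=True)
for rib in (plain, with_ids):
    rib.receive("10.0.0.0/8", 1, "PE1")
    rib.receive("10.0.0.0/8", 2, "PE2")

print(plain.next_hops("10.0.0.0/8"))     # ['PE2']
print(with_ids.next_hops("10.0.0.0/8"))  # ['PE1', 'PE2']
```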
1) Avoid MED oscillations (This was the original motive of BGP add path)
2) Helps in achieving optimal routing (Hot Potato Routing)
3) Fast-Recovery (Convergence)
As we know, there is no free lunch: BGP Add-Path has a cost, which varies with the mode used. The cost is mainly memory (larger Adj-RIB-Ins because of the additional paths) and control-plane stress on the iBGP sessions, as reflecting additional paths generates more BGP updates, which may re-trigger the BGP decision process more often.
i) Add-All-Paths
In this mode the RR sends all paths with unique next hops to its clients. This is equivalent to receiving multiple paths from all the iBGP neighbors in a normal iBGP full mesh. This solution brings full path visibility to all the routers and provides:
1) Avoid MED oscillations: Yes, as no paths are hidden from any BGP router.
2) Helps in achieving Hot Potato Routing: Yes. As the RR propagates all the paths to its clients, each client has visibility of all the paths and can pick the optimal (hot potato) path w.r.t. its own IGP distance.
3) Convergence: Yes. Since all the paths are known to every BGP router, the post-convergence path following an IGP event or a BGP next-hop router failure is already available locally, so the router can reroute without waiting for new BGP updates.
The biggest drawback of this mode is that all paths are stored by all routers, which is very expensive from a memory perspective. For instance, if a path to prefix P is advertised by N BGP border routers, then with a full mesh of iBGP sessions a BGP router stores N paths in its Adj-RIB-Ins. If Add-All-Paths is used along with route reflection and each client is connected to 2 RR’s, the client learns up to 2xN paths, as both RR’s send the full set of available paths. Hence the BGP router stores up to 2xN paths in its Adj-RIB-Ins, which is worse by a factor of 2 than a full iBGP mesh.
ii) Group Best Path
The main objective of this mode is to avoid MED oscillations. The idea is to let BGP routers advertise over iBGP the best path that they know for each neighboring AS. As a result, the lowest-MED paths from each neighboring AS are known to all BGP routers, so non-lowest-MED paths cannot be picked as best, which guarantees convergence. Regarding fast recovery (convergence) and load balancing, Add-Group-Best-Paths provides one path for each neighboring AS, but not necessarily the post-convergence or the optimal ones.
The increase in control-plane stress highly depends on the connectivity of the AS. Large transit ISPs receiving paths towards the same IP prefix from many different ASes will need to store and update one best path per such neighboring AS. ISPs with few different neighboring ASes will not see a large amount of additional BGP updates flowing through their iBGP architecture.
The group best is the set of best paths, one per neighboring AS, each selected from the paths learned from the same AS. For instance, let’s say there are three ASes: 100, 200 and 300. Paths P11, P12 and P13 are learned from AS100; P21, P22 and P23 are learned from AS200; and P31, P32 and P33 are learned from AS300. When we run the BGP best path algorithm on the paths from each AS, it selects one best path from that AS’s set. Assuming P11 is the best from AS100, P21 is the best from AS200, and P31 is the best from AS300, the group best is the set of P11, P21 and P31.
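The group-best computation from the example above can be sketched in Python. The best path comparison here is heavily simplified (Local-Pref, then AS path length, then MED, then name as a tie-breaker) and the MED values are made up so that P11, P21 and P31 win. A real BGP decision process has more steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    neighbor_as: int
    local_pref: int = 100
    as_path_len: int = 1
    med: int = 0


def preference_key(path):
    """Smaller key = more preferred (simplified best path ordering)."""
    return (-path.local_pref, path.as_path_len, path.med, path.name)


def group_best(paths):
    """Pick one best path per neighboring AS."""
    best = {}
    for path in paths:
        current = best.get(path.neighbor_as)
        if current is None or preference_key(path) < preference_key(current):
            best[path.neighbor_as] = path
    return [best[asn] for asn in sorted(best)]


paths = [
    Path("P12", 100, med=20), Path("P11", 100, med=10), Path("P13", 100, med=30),
    Path("P23", 200, med=30), Path("P21", 200, med=10), Path("P22", 200, med=20),
    Path("P31", 300, med=10), Path("P33", 300, med=30), Path("P32", 300, med=20),
]

print([p.name for p in group_best(paths)])  # ['P11', 'P21', 'P31']
```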
iii) Add-N-Path
In this mode the RR propagates only N paths to its clients. The RR calculates the N paths by first computing the best path, then removing all paths with the same next hop as the best path (including the best path itself), then computing the second best path by running best path selection on the remaining paths, and repeating this process until the remaining set is empty or N paths have been selected.
Theoretically, this mode doesn’t guarantee the following aspects:
1) Avoid MED oscillations: This mode doesn’t guarantee avoiding MED oscillations, as routers don’t have visibility of all the paths (only N paths).
2) Helps in achieving Hot Potato Routing: Not guaranteed, but the higher N is, the higher the chance of achieving optimal hot potato routing.
3) Convergence: Maybe. Since a router doesn’t learn all the alternate paths, there is no guarantee that the post-convergence path is known to it.
Even though this mode doesn’t guarantee the above aspects in theory, in practice the chances of achieving Hot Potato Routing and fast convergence are pretty high with a high value of N (Juniper supports a maximum of 6 paths). Also, this mode is a lot more deterministic, making it easier for an NSP to predict the increase in memory and control-plane stress with N paths.
ACME decided to implement Add-N-Path, enabled on the 1st and 2nd level RR’s with N set to 4. Having 4 paths (if available) for a prefix was enough to achieve fast recovery and Hot Potato Routing. Please recall that this is only for inter-POP traffic; for intra-POP traffic we have a full iBGP mesh between the PE’s.
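The selection loop described above translates almost directly into Python. As before, this is a sketch under assumptions: the best path ordering is a simplified stand-in (Local-Pref, AS path length, MED, then name), and the paths and next hops are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    next_hop: str
    local_pref: int = 100
    as_path_len: int = 1
    med: int = 0


def preference_key(path):
    """Smaller key = more preferred (simplified best path ordering)."""
    return (-path.local_pref, path.as_path_len, path.med, path.name)


def add_n_paths(paths, n):
    """Select up to n paths, each with a distinct next hop."""
    remaining = list(paths)
    selected = []
    while remaining and len(selected) < n:
        best = min(remaining, key=preference_key)
        selected.append(best)
        # Drop the best path and every other path sharing its next hop.
        remaining = [p for p in remaining if p.next_hop != best.next_hop]
    return selected


paths = [
    Path("A", "PE1", local_pref=200),
    Path("B", "PE1", local_pref=150),
    Path("C", "PE2", local_pref=150),
    Path("D", "PE3"),
    Path("E", "PE4", as_path_len=3),
]

print([p.name for p in add_n_paths(paths, 2)])  # ['A', 'C']
print([p.name for p in add_n_paths(paths, 4)])  # ['A', 'C', 'D', 'E']
```

Note that B never gets picked even though it beats D and E: it shares PE1 as next hop with A, so it adds no diversity.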
12. BGP Best External
BGP Best External is complementary to Add-Path in situations where the routing policy causes a border router to prefer a path for prefix “P” learned over an iBGP session to one learned over an eBGP session (e.g. an Active-Backup topology). For instance in Fig.
- PE1 and PE2 both learn prefix “P” over eBGP and advertise it to the RR. PE1 is preferred over PE2, which is accomplished by raising the Local-Pref on PE1’s route to 300.
- The RR propagates the route learned from PE1 to PE2. PE2 prefers the route via PE1 over the one from its eBGP neighbor.
- PE2 withdraws its advertisement for prefix “P” from the RR, as it’s now using the route via PE1 as its best route for “P”.
- The RR learns only one path, i.e. the one from PE1.
So even if Add-Path is enabled on the RR, it won’t be able to reflect two paths to PE3. Since PE3 learns only a single path, we can’t achieve fast convergence (BGP PIC Edge) or multipathing.
With Best External enabled, PE2 keeps advertising its eBGP-learned route to the RR even though it’s not PE2’s best path.
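PE2’s behaviour with and without Best External can be modelled in a few lines of Python. This is an illustrative sketch only: the path fields, the two-step comparison (Local-Pref, then eBGP over iBGP) and the Local-Pref of 100 on PE2’s own route are assumptions, not a vendor implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Path:
    next_hop: str
    learned_via: str  # "ebgp" or "ibgp"
    local_pref: int


def best(paths: List[Path]) -> Optional[Path]:
    """Highest Local-Pref wins; eBGP beats iBGP on a tie (simplified)."""
    return max(
        paths,
        key=lambda p: (p.local_pref, p.learned_via == "ebgp"),
        default=None,
    )


def advertise_to_rr(paths: List[Path], best_external: bool) -> Optional[Path]:
    """Return the path a border router sends to its RR, if any."""
    overall = best(paths)
    if overall is not None and overall.learned_via == "ebgp":
        return overall
    if best_external:
        # Best path is iBGP-learned: still advertise the best eBGP one.
        return best([p for p in paths if p.learned_via == "ebgp"])
    return None  # an iBGP-learned best path is not sent back over iBGP


# PE2's view of prefix P: its own eBGP path and the reflected one via PE1.
pe2_paths = [
    Path("CE2", "ebgp", local_pref=100),  # Local-Pref 100 is assumed
    Path("PE1", "ibgp", local_pref=300),
]

print(advertise_to_rr(pe2_paths, best_external=False))  # None
print(advertise_to_rr(pe2_paths, best_external=True))
```

With the second call the RR ends up holding two paths for “P”, which is what Add-Path needs in order to hand PE3 a backup.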
So we looked at how to scale the RR design in a large-scale network and still achieve Hot Potato (optimal) routing and BGP fast recovery. This is something we can do today. Recently, many vendors have released their OS in a virtual form factor. This makes it possible to run pure control-plane functions (out-of-band RR’s) on VM’s with more compute power available. It has also opened up opportunities for ideas like RR’s running SPF from a client’s perspective, calculating the best path to the BGP next hop as that client would see it, and sending that best path to the client. In the case of multiple areas, BGP-LS can be leveraged to export the IGP view of another area to the RR.
14. References and Further Study
- Border Gateway Protocol (BGP) Persistent Route Oscillation Condition https://tools.ietf.org/html/rfc3345
- Constrained Route Distribution for Border Gateway Protocol/MultiProtocol Label Switching (BGP/MPLS) Internet Protocol (IP) Virtual Private Networks (VPNs) https://tools.ietf.org/html/rfc4684
- BGP Route Reflection: An Alternative to Full Mesh Internal BGP (IBGP) https://tools.ietf.org/html/rfc4456
- Advertisement of Multiple Paths in BGP https://tools.ietf.org/html/draft-ietf-idr-add-paths-10
- Distribution of Diverse BGP Paths https://tools.ietf.org/html/rfc6774
- BGP Add-Paths: The Scaling/Performance tradeoffs http://inl.info.ucl.ac.be/publications/bgp-add-paths-scalingperformance-tradeoffs
- BGP Route Reflection Revisited http://irl.cs.ucla.edu/~j13park/rr-commag.pdf
- Analysis of paths selection modes for Add-Paths https://tools.ietf.org/html/draft-vvds-add-paths-analysis-00
- Best Practices for Advertisement of Multiple Paths in IBGP https://tools.ietf.org/html/draft-ietf-idr-add-paths-guidelines-07