In the first article of this series, reliability and resiliency were explained. Every component and every device can, and eventually will, fail, so the system should be resilient enough to reconverge and recover to its previous state. Resiliency can be achieved with redundancy. But how much redundancy is best for resiliency is another consideration.
Many tests have been performed on routing convergence as a function of link count for different routing systems, and two or three links appear to be the optimum for routing reconvergence.
For routing systems, there are two approaches to converging faster than the default convergence time: fast convergence and fast reroute. Fast convergence is achieved with protocol parameter tuning: reducing failure detection time, speeding up propagation of the failure through the routing system, and accelerating the routing and forwarding table updates of the systems.
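The components above can be summed up in a small sketch. The millisecond values below are purely illustrative assumptions, not measurements from any particular platform; the point is that detection time usually dominates, which is why tuning it yields the biggest win.

```python
def convergence_time_ms(detection, propagation, spf_calculation, fib_update):
    """Total convergence time is the sum of its four phases."""
    return detection + propagation + spf_calculation + fib_update

# Default-like timers: slow hello-based failure detection dominates.
default = convergence_time_ms(detection=3000, propagation=200,
                              spf_calculation=100, fib_update=200)

# Tuned timers: fast detection (e.g. BFD-style) plus faster SPF and FIB updates.
tuned = convergence_time_ms(detection=50, propagation=50,
                            spf_calculation=20, fib_update=100)

print(default, tuned)  # tuning detection alone removes most of the delay
```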
For fast reroute, a backup forwarding entry must already be installed in the device's forwarding table. There are many fast reroute techniques. Since Russ White and I have written many articles on the subject, I will not explain the concept again, but here is the link for fast reroute.
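The core idea can be sketched in a few lines: the backup next hop is computed and installed before any failure, so the switchover does not wait for the routing protocol to reconverge. The prefixes and interface names below are hypothetical.

```python
# Forwarding table with a pre-installed backup entry per prefix.
fib = {
    "10.0.0.0/24": {"primary": "eth0", "backup": "eth1"},
    "10.0.1.0/24": {"primary": "eth0", "backup": "eth2"},
}

failed_interfaces = set()

def next_hop(prefix):
    """Return the primary next hop, or the pre-installed backup on failure."""
    entry = fib[prefix]
    if entry["primary"] in failed_interfaces:
        return entry["backup"]
    return entry["primary"]

print(next_hop("10.0.0.0/24"))  # eth0
failed_interfaces.add("eth0")   # link failure detected
print(next_hop("10.0.0.0/24"))  # eth1 - no recomputation needed
```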
In this article I will mostly discuss scalability as a common network design tool, and we will see how scalability often helps with high availability.
Scalability is the ability to change, modify, or remove part of the entire system without a huge impact on the overall design. There are two scalability approaches for IT systems: scale up and scale out, and they apply to network, compute, storage, application, database, and many other systems.
Scaling up a system can be defined as increasing the resources of the existing system without adding a new system; scaling out adds more systems instead. Consider a scale-out application architecture: if the application can run on two different servers, we can do maintenance on one of the servers without affecting the user experience.
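The two-server maintenance scenario can be sketched as draining one member of a load-balanced pool. Server names are hypothetical; a real load balancer would also run health checks and connection draining.

```python
import itertools

pool = ["server-a", "server-b"]

def serve(request_count):
    """Round-robin requests across whatever servers are currently in the pool."""
    rotation = itertools.cycle(pool)
    return [next(rotation) for _ in range(request_count)]

print(serve(4))            # traffic shared by both servers
pool.remove("server-a")    # drain server-a for maintenance
print(serve(4))            # service continues on server-b alone
pool.append("server-a")    # maintenance done, return to the pool
```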
Consider that we have only one router and we need to plan a software upgrade. If that router has two supervisor engines for control plane activities, we can upgrade the software without downtime and the maintenance will not be an issue; we don't have to declare a flag day for the upgrade. Although the benefit of the scale-up approach for high availability is limited, it obviously helps in this case.
The scale-out approach gives high availability. The secondary system might be processing part of the load, or in the worst case it waits idle to take over in case of a primary system failure. If the response time increases after a failure, the scale-out approach loses its benefit, since the expectation is that users should not have to worry about a failure.
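The active-passive case above can be sketched as follows. Everything here is hypothetical and deliberately minimal; real deployments would use health checks and protocols such as VRRP or clustering to make this decision.

```python
class System:
    """A system that is either healthy or failed."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

def active(primary, secondary):
    """Return whichever system should be handling the load right now."""
    return primary if primary.healthy else secondary

primary = System("primary")
secondary = System("secondary")   # idle, waiting to take responsibility

print(active(primary, secondary).name)  # primary
primary.healthy = False                 # primary system failure
print(active(primary, secondary).name)  # secondary takes over
```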
In general, database systems, specifically relational database systems, are seen as hard to scale with the scale-out approach due to the CAP theorem. I am planning to explain this concept in a separate post, but I would suggest you read about it.
Scalability is often considered together with system size. If the network, database, storage architecture, or set of application instances is big, the system is considered scalable. Although that idea may not be wrong, we should know that smaller systems can be scalable as well.
Once we modify, remove, or add a component to a system, we don't expect an impact on the running system. As an example, let's examine the scalability of routing protocols. If we have many routers and many links in a single-area OSPF deployment, even a small link flap can trigger all routers to calculate a new topology. Up to some limit this might be acceptable, but beyond that limit it affects the overall routing domain significantly. For the OSPF case, the limit is generally defined by the Router LSA size. Let's not try to explain how and why here; instead, we should think about how we can put a limit in place without affecting the systems. In general, for routing and many other systems such as data center bridging, we keep the domain boundary at the optimum level.
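A back-of-the-envelope sketch shows why the domain boundary matters: every router in the flooding domain reruns SPF for a topology change, so splitting the domain confines most events to one area. The router counts below are illustrative assumptions, and this deliberately ignores inter-area summary advertisements.

```python
def spf_runs_per_flap(routers_in_flooding_domain):
    """Every router in the flooding domain recalculates for one link flap."""
    return routers_in_flooding_domain

# One big single-area design: all 400 routers recalculate on any flap.
single_area = spf_runs_per_flap(400)

# Same 400 routers split across 4 areas: an intra-area flap
# triggers full SPF only on the ~100 routers inside that area.
split_into_areas = spf_runs_per_flap(400 // 4)

print(single_area, split_into_areas)
```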
Choosing the correct technology is important to support scalability. Consider that you need additional ports in your data center aggregation layer switches to support more compute resources. Of course, if you have a three-tier architecture and free ports on the core, you can create an additional access-aggregation POD and connect it to the core. But a two-tier leaf-and-spine physical architecture could be considered instead, since it not only supports scalability but could also give better east-west application performance, and it would be a simpler architecture than the POD-based design.
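The port arithmetic behind the leaf-and-spine argument can be sketched quickly: adding a leaf adds server-facing ports without redesigning the core. The 48-port leaves and 6 spine uplinks per leaf are illustrative assumptions, not a recommendation.

```python
def server_ports(leaves, ports_per_leaf=48, uplinks_per_leaf=6):
    """Ports left for servers after reserving uplinks to the spines."""
    return leaves * (ports_per_leaf - uplinks_per_leaf)

before = server_ports(leaves=8)    # current fabric
after = server_ports(leaves=10)    # scale out by simply adding two leaves

print(before, after)
```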
Last but not least, scalable systems should also be manageable. If, while growing in size, the system starts to become unmanageable, this will inversely affect scalability. We will still need those flag days, long and frequent maintenance windows, and an operationally complex environment, and in the end we will end up with a non-scalable system.