In the first article of the series, I covered reliability and resiliency. Whatever device, link type, or software you choose, it will eventually fail; designing resilient systems is therefore one of the most critical aspects of IT. I mentioned that one way of providing resiliency is redundancy. If we have a redundant system, the reaction to failures can be optimized through either fast convergence or fast reroute mechanisms. These are not mutually exclusive and can be used together in the same network. For example, BGP fast reroute is provided via BGP PIC (Prefix Independent Convergence). In order to react to a failure quickly and switch the FIB to a pre-computed backup path, the IGP should converge faster than its defaults allow.
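To make the PIC idea concrete, here is a minimal sketch in Python, a toy model rather than any vendor's implementation. Many prefixes share one forwarding object that holds both the primary and a pre-computed backup next-hop; when the primary fails, a single pointer flip repairs all of them at once, which is why the repair time is independent of the number of prefixes. The addresses and counts are made up for illustration.

```python
# Toy model of Prefix Independent Convergence (PIC): prefixes point to a
# shared path-list object instead of directly to a next-hop, so a failure
# is repaired with one pointer flip regardless of prefix count.

class PathList:
    """Shared forwarding object holding a primary and a pre-computed backup."""
    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def fail_primary(self):
        # One operation repairs every prefix that references this object.
        self.active = self.backup

# Tens of thousands of prefixes all resolve through the same path-list.
core_paths = PathList(primary="10.0.0.1", backup="10.0.0.2")
fib = {f"192.0.{i // 256}.{i % 256}/32": core_paths for i in range(65_536)}

core_paths.fail_primary()  # constant-time repair, no per-prefix rewrite
print(fib["192.0.0.10/32"].active)  # -> 10.0.0.2
```

The point is that the backup path is installed in the FIB before anything fails; detecting the failure only triggers the pointer swap.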
In the second part of the series, I focused on scalability. The Internet of Things, mobility, big data, and many other applications have changed both traffic patterns and bandwidth requirements. We looked at two different scalability approaches, scale-up and scale-out, and noted that scalability is related to, but not exactly the same as, a system's size.
What about complexity and the management of the system? Probably I should have covered complexity in the first article, as the first concept. And you may wonder why I say "system" instead of just "network". That is because everything I have discussed as a design concept so far applies to the network, compute, storage, application, and services (firewall, IPS/IDS, load balancer, and so on) parts of IT.
The complexity of an IT environment is the sum of the processes, technologies, and protocols across the entire system. Whenever you bring a new protocol into the environment, you add complexity. You may reduce the impact of that new protocol on overall complexity, but the system will still be more complex than its previous state.
Consider that you use static routing and want to enable dynamic routing for whatever reason. Dynamic routing adds more state to the routers' control plane than static routing: it keeps its information in several databases, it requires configuration and ongoing management, and, most importantly, failures are propagated to and processed by the other routers. Of course, with summarization and aggregation, topology and reachability information can be hidden and a simpler design can be achieved. I am not saying dynamic protocols are complex and therefore should not be used, but I want you to consider whether deploying them is worth the additional complexity.
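As a small worked example of how aggregation hides state, the four contiguous /24s below collapse into a single /22, so neighbors carry one prefix instead of four. The address range is made up; the mechanics are standard Python.

```python
import ipaddress

# Four contiguous /24s that a router might learn from a branch site.
specifics = [ipaddress.ip_network(f"10.1.{i}.0/24") for i in range(4)]

# Advertise one summary instead of four specifics: less state propagates,
# and a flap inside the summary stays invisible to the rest of the network.
summary = list(ipaddress.collapse_addresses(specifics))
print(summary)  # -> [IPv4Network('10.1.0.0/22')]
```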
Another example: you want to enable a network overlay tunnel to extend a layer 2 domain between data centers. This might be a technical requirement, and many technologies can provide this capability. But whatever technology you choose, it will add some level of complexity to the system. I am not saying overlay technologies shouldn't be used or that you shouldn't extend your layer 2 domain (although that is a separate discussion entirely); I want you to check whether there is another way to do it.
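One concrete cost that comes with any overlay is encapsulation overhead. Taking VXLAN as one example (the original text does not name a specific overlay; the header sizes below are the standard VXLAN-over-IPv4 encapsulation without a VLAN tag, and the MTU figures are illustrative):

```python
# Back-of-the-envelope: what a VXLAN overlay adds to every frame.
OUTER_ETHERNET = 14   # outer MAC header (no 802.1Q tag)
OUTER_IPV4     = 20   # outer IP header
OUTER_UDP      = 8    # outer UDP header
VXLAN_HEADER   = 8    # VXLAN header carrying the 24-bit VNI

overhead = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER
print(f"Per-frame overhead: {overhead} bytes")              # -> 50 bytes

# The underlay MTU must grow accordingly, or the tenant MTU must shrink.
underlay_mtu = 1500
print(f"Max inner frame: {underlay_mtu - overhead} bytes")  # -> 1450 bytes
```

And that is only the data-plane cost; the control-plane state and the operational knowledge the overlay demands are harder to count.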
A last example: you may suggest that the business build a second data center to increase overall availability and to provide disaster avoidance or recovery capability, but among many other costs, it comes with the cost of complexity. I am not even mentioning that it is more expensive than having one data center: you double many devices and technologies, if not all of them, and hold much more state in routers, switches, stateful devices, and many other places, simply because you run two data centers instead of one.
I see many discussions nowadays around having a single number to measure overall system complexity. If we put our network devices, links, protocols, prefixes, overlay tunnels, storage systems, security, QoS and traffic engineering policies, and all the other technologies and equipment into an equation, can we derive a single complexity number from some formula? Unlike many people, I believe we might have it. Let me know what you think.
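Purely as a thought experiment, such a number could start as a weighted sum of countable state. This is my own toy formula, not an established metric; every item name, count, and weight below is an assumption for illustration.

```python
# A deliberately naive, hypothetical complexity score: every name and
# weight here is an assumption for illustration, not an industry metric.
def complexity_score(inventory: dict[str, int],
                     weights: dict[str, float]) -> float:
    """Weighted sum of countable state across the system."""
    return sum(weights.get(item, 1.0) * count
               for item, count in inventory.items())

inventory = {"devices": 120, "links": 300, "routing_protocols": 3,
             "prefixes": 5000, "overlay_tunnels": 40, "qos_policies": 25}
weights   = {"routing_protocols": 50.0, "overlay_tunnels": 10.0,
             "qos_policies": 5.0, "prefixes": 0.01}

print(complexity_score(inventory, weights))  # -> 1145.0
```

The hard part, of course, is not the arithmetic but agreeing on what to count and how to weight it; the interactions between technologies arguably matter more than the raw counts.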