A partial router failure caused a system-wide outage at Southwest Airlines on July 20, 2016. A fire triggered by switching between primary and backup generators took out Delta’s Atlanta data center on August 8, 2016. Both are examples of Black Swan events, at least from the perspectives of Southwest and Delta. Black Swans, as described by Nassim Nicholas Taleb, have three attributes:
- The event lies outside regular expectations
- The event has extreme impact
- The event is rationalized in hindsight as if it could have been planned for
One principle of cloud native design is that everything fails eventually and systems must be constructed to handle it. Taleb takes that one step further in his book Antifragile, which describes systems that thrive and grow when exposed to volatility and disorder (this sounds a lot like machine learning, perhaps the future of networking). There is even an Antifragile Software Manifesto, but what does this mean for networkers?
When NASA creates software for interplanetary probes, three different teams write separate programs for each calculation and compare answers, looking for agreement. The theory is that a software flaw in one program is not duplicated in the others. Extending the concept of fault domain (or blast radius) to leverage multiple vendors and/or technology stacks can achieve a similar result.
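The compare-and-agree step described above is commonly called N-version programming. Here is a minimal sketch in Python of what the comparison might look like, assuming three independently written routines for the same calculation. The names `vote`, `team_a`, `team_b`, and `team_c` and the sum-of-squares task are hypothetical illustrations, not taken from any real flight software.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(implementations: Sequence[Callable[[int], int]], x: int, quorum: int = 2) -> int:
    """Run every implementation on x and return the answer that at least `quorum` agree on."""
    results = []
    for impl in implementations:
        try:
            results.append(impl(x))
        except Exception:
            # A crash in one version counts as a missing vote, not a system failure.
            continue
    if not results:
        raise RuntimeError("every implementation failed")
    value, count = Counter(results).most_common(1)[0]
    if count < quorum:
        raise RuntimeError("no agreement among implementations")
    return value


# Hypothetical example: three independently written sum-of-squares routines.
def team_a(n: int) -> int:
    return sum(i * i for i in range(1, n + 1))


def team_b(n: int) -> int:
    return n * (n + 1) * (2 * n + 1) // 6


def team_c(n: int) -> int:
    # Deliberate off-by-one bug: the loop stops at n - 1.
    return sum(i * i for i in range(1, n))


print(vote([team_a, team_b, team_c], 10))  # prints 385
```

The flaw in `team_c` is outvoted because the other two versions do not share it. The same logic only helps in a network if the redundant paths really are built from independent parts.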
Below is a campus network design where each endpoint connects with two technology stacks: wired and wireless, or wireless and cellular. In both cases, the first technology is primary and the second technology is standby (there are efforts like Multipath TCP that would utilize both links). Each endpoint is simultaneously connected to both networks but actively uses only one.
- Each floor is divided into alternating zones
- Each zone has separate subnets with its own routing
- Uplinks from access points in one zone connect to switches in the other zone
- When wired networking in a zone fails, wireless remains up (supported by the other zone)
- When wireless or even the building routers fail, in-building cellular remains up
- People and devices have multiple options for adapting to changing circumstances
- The switches and routers in each zone might even be from different vendors
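The failover behavior in this list can be modeled in a few lines of Python. This is a rough sketch under stated assumptions: two alternating zones named `A` and `B`, and hypothetical component labels (`switch-A`, `switch-B`, `routers`, `wlan`, `cellular`) that stand in for whatever equipment a real campus uses.

```python
from typing import Iterable, Optional, Sequence

# Assumption: two alternating zones per floor, labeled A and B.
OTHER_ZONE = {"A": "B", "B": "A"}


def stack_is_up(stack: str, zone: str, failed: Iterable[str]) -> bool:
    """Return True if the given stack still works for an endpoint sitting in `zone`."""
    down = set(failed)
    if stack == "wired":
        # Wired ports depend on the local zone's switches and the building routers.
        return not ({f"switch-{zone}", "routers"} & down)
    if stack == "wireless":
        # Access points uplink to the OTHER zone's switches.
        return not ({f"switch-{OTHER_ZONE[zone]}", "routers", "wlan"} & down)
    if stack == "cellular":
        # In-building cellular shares nothing with the campus network in this model.
        return "cellular" not in down
    raise ValueError(f"unknown stack: {stack}")


def active_path(zone: str, preference: Sequence[str], failed: Iterable[str]) -> Optional[str]:
    """Pick the first working stack in the endpoint's primary/standby order."""
    for stack in preference:
        if stack_is_up(stack, zone, failed):
            return stack
    return None


print(active_path("A", ("wired", "wireless"), set()))            # prints wired
print(active_path("A", ("wired", "wireless"), {"switch-A"}))     # prints wireless
print(active_path("B", ("wireless", "cellular"), {"routers"}))   # prints cellular
print(active_path("A", ("wired", "wireless"), {"routers"}))      # prints None
```

The last case shows the limit of the wired-plus-wireless pairing in this model: both stacks still share the building routers, which is why cellular is the standby that survives a router failure.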
Everything above works to remove dependencies between the technology stacks and avoid “Shared Fate.” And there are other factors that make this approach even more resilient:
- The network architecture and protocols for wired, wireless, and cellular are very different and not typically susceptible to the same types of faults
- Wired, wireless, and cellular have separate engineering teams and code bases
- Radio protocols are analog in nature and tend to degrade more gracefully
We implemented a variation of this design during a period of stability issues with the switch fabric, deploying separate routers and switches dedicated to the wireless access points. This could have been cost-prohibitive, but we used leftover, older-model switches to create the separate switch fabric.
Applying similar principles to cloud computing means running the same application in multiple public or private clouds, which is made possible by widespread support for container platforms. Does running in multiple regions or availability zones within the same cloud platform provide separate fault domains? Does “Antifragile” computing imply separate cloud providers and technology stacks?
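One way to approach these questions is to write down what each deployment depends on and look for overlap. The Python sketch below does that with set intersection. The deployment names and dependency labels are invented for illustration and make no claim about how any real cloud provider is built.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_fate(deployments: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """For each pair of deployments, return the dependencies they have in common."""
    overlaps = {}
    for (a, deps_a), (b, deps_b) in combinations(deployments.items(), 2):
        common = deps_a & deps_b
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Hypothetical dependency lists; a real inventory would be far longer.
deployments = {
    "cloud-x/region-1": {"cloud-x control plane", "cloud-x identity", "region-1 power"},
    "cloud-x/region-2": {"cloud-x control plane", "cloud-x identity", "region-2 power"},
    "cloud-y/region-1": {"cloud-y control plane", "cloud-y identity", "cloud-y region-1 power"},
}

for pair, common in shared_fate(deployments).items():
    print(pair, sorted(common))
```

In this made-up inventory the two regions of the same provider share a control plane and an identity service, while the second provider shares nothing with either. Whether that holds in practice depends on what goes into the lists, including dependencies the application itself brings along.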