I finally got around to reading The Mythical Man-Month (MMM), a famous book on large-scale software development projects (think operating systems) written in 1975, revised in 1995, and still strikingly relevant today in the neighboring field of building and managing massive networks. While multiple points land directly on those of us working on massive networks, I’d like to focus on the idea of a single, master architect.
I grew up in America. I live and work there now. While there are incredible hierarchies in every direction, as a culture we’re pretty uncomfortable with overtly placing ourselves in them. When there’s a team of four people of similar competence building a network, generally they respect each other’s autonomy and avoid dictating to each other. Some folks have no problem telling their peers that they’ve made a mistake, even on a public email thread or in a meeting, but they’re usually the ones propping up the stereotype of the computer geek with terrible people skills, and often poor hygiene. So I’m discounting them as outliers.
The benefit of cordiality is you get a team that functions well together. The price is you end up with a lot of small disparities that I call “cruft.” Some examples of cruft:
- BGP policies applied using route-maps in some places, distribute-lists in others.
- Names of objects slightly different everywhere, some in all uppercase, some all lower, some mixed.
- Settings for non-essential things like syslog applied haphazardly.
- Wild disparities in port/neighbor descriptions, with some giving details like circuit IDs, others saying, “BGP neighbor”.
These discrepancies result from multiple people, knowledgeable and otherwise, deciding on the spot how to address a micro problem. Taken individually each choice may be perfectly reasonable; taken together the effect is chaotic, and since the one thing cruft does best is accrue over time, it never gets better without significant work. You log into a switch and have no reasonable ability to predict what you’ll see beyond the basics that allowed you to log in. In soft terms it’s frustrating and jarring to work in networks run like this; in hard terms it costs you time, and therefore money, to fix anything because you’re constantly looking things up and examining the context. Any complex change to a network, like migrating customers to MPLS, is now even riskier. Often you can’t tell if a feature is set up deliberately and for good cause, or if it accidentally fell into place years ago for no apparent reason.
Fortunately a bit of fascism can fix this. As MMM describes, complex software products need a continuity that can only come from one mind, or a tiny group of tightly coordinating minds. This is the architect, and in projects of great complexity we’re used to seeing them – not so much in that four-person team I described above.
But imagine someone brought up a new ISP link that used a distribute-list directly on the BGP peer, while every other link used route-maps. If the buck stops at one of those four geeks, he’ll rightly tell whoever brought it up to schedule a maintenance to move it over to the “right” method.
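To make that concrete, here is a rough sketch of how such a check could be automated once someone has declared the house method. It assumes Cisco IOS-style "neighbor ... route-map" and "neighbor ... distribute-list" lines; the function name, the sample config, and the addresses are all made up for illustration, and a real audit would need to handle peer groups, address families, and other vendors.

```python
import re

# Assumption: IOS-style syntax, e.g. "neighbor 192.0.2.1 route-map ISP-A-IN in".
NEIGHBOR_POLICY = re.compile(
    r"^\s*neighbor\s+(?P<peer>\S+)\s+"
    r"(?P<method>route-map|distribute-list)\s+"
    r"(?P<name>\S+)\s+(?P<direction>in|out)\s*$"
)


def find_policy_outliers(config_text, approved="route-map"):
    """Return (peer, method, name, direction) for every BGP neighbor policy
    that does not use the approved method."""
    outliers = []
    for line in config_text.splitlines():
        match = NEIGHBOR_POLICY.match(line)
        if match and match["method"] != approved:
            outliers.append(
                (match["peer"], match["method"], match["name"], match["direction"])
            )
    return outliers


# Hypothetical config: three policies done the house way, one done differently.
SAMPLE = """\
router bgp 64500
 neighbor 192.0.2.1 route-map ISP-A-IN in
 neighbor 192.0.2.1 route-map ISP-A-OUT out
 neighbor 198.51.100.1 route-map ISP-B-IN in
 neighbor 203.0.113.1 distribute-list 10 in
"""

if __name__ == "__main__":
    for peer, method, name, direction in find_policy_outliers(SAMPLE):
        print(f"{peer}: {method} {name} {direction} (expected route-map)")
```

Note that the approved method is passed in rather than inferred from whatever is most common. The tool doesn't decide what's "right"; it only enforces what the architect already decided.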
This brings up another point – unless the “right” way is clearly inferior, the continuity is more important than what’s nagging at you. Most complicated problems have multiple, debatable solutions. At some point the debate needs to stop and you might lose. It happens, and practicing civil disobedience unravels scalability.
Finally, the details matter. I mentioned upper- versus lower-case above for a reason. As networks grow increasingly complex, we build automation tools to manage that complexity. So while it might just be annoying today that you have to search a configuration multiple times for a prefix-list named “default-route”, “DEFAULT-ROUTE”, “DEFAULT_ROUTE”, et cetera, when it comes time to automate some tasks, this lack of continuity is something you’ll either have to go back and correct, or account for in the software by making it more complex. Wouldn’t it be easier to have someone flatly state which way to do it before it’s done?
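As a small illustration of the "account for it in the software" option, here is a minimal sketch that groups object names differing only by case or by hyphen versus underscore. The helper names and the sample list are hypothetical, and the folding rule (lowercase, hyphens and underscores treated alike) is just one assumption about what counts as "the same name".

```python
import re
from collections import defaultdict


def canonical(name):
    """Fold case and treat '-' and '_' as the same separator."""
    return re.sub(r"[-_]+", "-", name.strip()).lower()


def find_name_variants(names):
    """Group names that collapse to one canonical form and return only
    the groups that have more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[canonical(name)].add(name)
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


# Hypothetical prefix-list names collected from several devices.
names = ["default-route", "DEFAULT-ROUTE", "DEFAULT_ROUTE", "customer-routes"]
variants = find_name_variants(names)

if __name__ == "__main__":
    for key, spellings in variants.items():
        print(f"{key}: {', '.join(spellings)}")
```

Every tool that touches these names needs something like canonical() bolted on, and every tool has to agree on the same folding rule. That is the extra complexity a naming convention stated up front would have avoided.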