The normal place to start when studying for a certification like the CCDE would be with technology. After all, we’re engineers, and we like technology. If it’s geeky, it’s good, right? But… Technology isn’t the right place to start studying for the CCDE practical.
Then where should you start? I would challenge you to start with the blueprint.
Of course, there’s a bit of a problem here, because the CCDE doesn’t have one blueprint. In fact, the CCDE has four blueprints, or four axes on which candidates and content are measured. Over the next several weeks, I’d like to discuss each of these blueprints in some detail so you get a better feel for how this certification is structured, because the CCDE isn’t so much a technology certification as it is a mindset and skill set certification.
Let’s start with availability, the first of these areas.
Availability is most easily defined in terms of 9’s. And resiliency means redundancy, right? Not so fast… Availability is actually a rather delicate balance between redundancy, fast convergence, security, and serviceability. Redundancy doesn’t do any good unless you can switch to the redundant links quickly. Redundancy is a positive harm if it causes control plane convergence to become the slowest component in the switchover. There’s no point in having a really high mean time between failures if you don’t have a really low mean time to repair to match. One 525,600-minute outage a year (that is, one failure that lasts the whole year) still means 0% availability.
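To put some numbers behind the 9’s, here is a minimal sketch of the arithmetic in Python. It assumes the usual steady-state definition of availability, MTBF / (MTBF + MTTR), and a 365-day year; the function names are just illustrative, not anything from the CCDE material.

```python
# Illustrative sketch only: function names and the 365-day year are assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(mtbf_minutes: float, mttr_minutes: float) -> float:
    """Steady-state availability: time up divided by total time."""
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


def downtime_minutes_per_year(avail: float) -> float:
    """Minutes of outage per year implied by an availability figure."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # How much outage each count of 9's actually allows in a year.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target:.5f} allows about {minutes:.1f} minutes of outage per year")

    # Same failure rate, different repair times: MTTR moves the result a lot.
    one_month = 30 * 24 * 60
    print(availability(one_month, 5))    # five-minute repair
    print(availability(one_month, 240))  # four-hour repair
```

The point of the last two lines is that the failure rate is identical in both cases; only the time to repair changes, and the availability figure changes with it.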
Availability doesn’t always show up as a simple redundancy issue on the CCDE; it might show up as too much redundancy as well as too little, not enough manageability as well as too much, or even as too much or too little security.
For example, suppose you are faced with the problem of placing multiple data centers (okay, okay, distributed private cloud locations!) someplace on the East Coast of the United States. It’s obvious that you should avoid putting two data centers within the path of a single potential hurricane, but should you worry about a hurricane and an earthquake happening within a week of one another? Should you choose locations for the worst case scenario, or for the most likely one?
Or assume, for a moment, that you are dealing with an application failure caused by a slow switchover between two paths in the network. Is the right solution to simply add a third path in parallel?
Or what if you’re asked to build a campus network with wireless links? Should you plan a second wireless network to back up the first, or is it likely that multiple wireless links will fail in the same physical location for the same reason? To put it another way, would multiple wireless links tend to fate share for the same physical conditions?
Clearly there are no easy answers to these sorts of questions, but there are some solid questions you can ask yourself to find a path to the right answer.
Will adding more redundancy here make convergence slower, or faster?
Will the added redundancy fate share with the existing path or link?
What’s the tradeoff between complexity (including mean time to repair) and this fast convergence mechanism in this specific situation?
What do the applications really need? Is there really a requirement for sub-second convergence here? What is “good enough”?
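The fate sharing question in particular lends itself to a quick back-of-the-envelope check. The sketch below uses the textbook model for parallel paths and assumes path failures are independent except for one shared element (a conduit, a power feed, the weather over a campus). The names and figures are hypothetical, and the model ignores convergence time entirely, so treat it as a way to frame the question rather than as an answer.

```python
# Illustrative sketch only: names and numbers are assumptions, and the model
# treats path failures as independent apart from a single shared element.


def parallel_availability(path_availability: float, paths: int) -> float:
    """Independent parallel paths: service is down only if every path is down."""
    return 1.0 - (1.0 - path_availability) ** paths


def with_shared_risk(path_availability: float, paths: int,
                     shared_availability: float) -> float:
    """Parallel paths that all depend on one shared element in series."""
    return shared_availability * parallel_availability(path_availability, paths)


if __name__ == "__main__":
    path = 0.99     # hypothetical availability of a single path
    shared = 0.999  # hypothetical availability of the element the paths share

    for count in (1, 2, 3):
        independent = parallel_availability(path, count)
        fate_shared = with_shared_risk(path, count, shared)
        print(f"{count} path(s): independent={independent:.6f} "
              f"fate-sharing={fate_shared:.6f}")
```

However many paths you add in the second case, the result can never rise above the availability of the shared element, which is why a third parallel path is not automatically the right fix.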
By the time we’re done, you’re most likely going to have a long list of these sorts of questions to ask yourself when looking at each question on the practical. Asking these same questions in real life will make you a better designer, and isn’t that the goal of a certification? To show you where you need to grow and learn to reach that next rung?