Dinner with His Toadishness, Derick Winkworth, the other night rolled into a 3+ hour discussion of avant-garde ways to do networking. One of the adjunct topics that came up was that of ownership within IT. Ownership is a complex problem in the data center, because there are many complex technologies at work. No single person is responsible for all that’s going on. Sometimes it’s a problem of scale – the bigger the system gets, the harder it gets to manage. Sometimes it’s a problem of diversity – several diverse technologies at work require a broad range of skillsets. Sometimes it’s a problem of business silos – we work on this, but we don’t work on that.
The end result of this is that ownership tends to align with technical areas. Each technical team, with its respective specialization, works on its stuff and knows as little as possible about what’s going on with the other teams, in order to maintain a teflon exterior. “My stuff is working. My stuff is fine. Everything is okay here. I don’t know jack about that other stuff. Talk to them.”
From the standpoint of IT, this might be a personal win of sorts (the “no one can blame me” mentality), but of course, it’s a total failure for the business. When something is broken…wait. Back up for a moment, and let me qualify that. What do I mean that “something is broken” in the context of IT? I mean that some application is not working up to the business’ expectations of that app. What’s a data center if not an application delivery engine? If the application isn’t being delivered, something’s wrong with the engine.
When my car is broken because there’s something wrong with the engine, I don’t want a bunch of mechanics lining up and telling me, “I’m the spark plug guy. No problem found. I’m the fuel pump guy. No problem found. I’m the electronics guy. No problem found. I’m the piston guy. No problem found.” I want someone to fix the car that is obviously broken. I want someone to own the problem and get it resolved. Not walk away going, “Not my problem, man. Hope things work out for you.”
Is there anyone in your organization who’s willing to take ownership of IT problems? What about you? You’re really putting yourself out there if you do this. Diagnosing complicated application problems is tough. Let’s say the issue is a poorly performing web application. Now let’s think through all of the possible teams and related issues that could be tied to such a thing.
- L1-L3 network. Routing, switching, LAN, WAN.
- L4-L7 network. Load-balancing. WAN optimization. DNS/DHCP.
- Network security. Firewalls, in-line IPS, access lists. RADIUS/TACACS.
- Server. Bare metal hardware, blade centers, hypervisors, memory, CPU, NICs.
- Sysadmin. OS, standard OS builds (that often include host security scanning). Authentication, directory services.
- Storage. Fibre Channel, storage-over-IP, disk arrays, SANs.
- Low level application. Web engine, web engine plugins (that often include security scanning again).
- High level application. Web software (SharePoint or other CMS) with their own plugins.
- Database. Most web sites use a SQL engine.
- System architect. This is the person who designed the app, and should know how a transaction flows through the system, what baseline response times should be, and so on. Also known as “the one who used to sit in that cube over there before they quit after being on suicide watch for several months.”
There you go – one through ten, the nuts-and-bolts human beings responsible for an application delivery engine that’s supposed to be serving up a web app to consumers. (And I didn’t even mention the third parties that might be involved.) If the web app is slow, it’s not enough for humans one through ten to say, “I’m not sure what’s wrong. Individually, all the parts seem fine. Not my deal. Someone else will fix it.” For the sake of the business that’s stuck on the side of the road because the engine is broken, someone’s got to take responsibility and get this resolved.
Fair enough. In most organizations, these issues *do* get resolved. Eventually. Finger pointing goes on (usually by the more ignorant, insecure, or clueless people in the organization), everyone has enough of the whining, the right people get on a call, and the root cause is identified and addressed.
This is…horrible. What a horrible, horrible, awful and horrible way to have to deal with complex troubleshooting. We all point fingers at each other until an adult sits us in a room and tells us to get it sorted out. Then we do our best to veil our disgust for the other teams who clearly don’t know what they’re doing (unlike us, God’s gift to our respective disciplines), and grudgingly start to work it out.
The solution to this problem is two-fold.
- One is people. Get over yourself, and suck a little less every day. Think out loud about all the possible ways that your area of responsibility could be to blame, put them out on the table, and explain to the troubleshooting team what you’ve done to investigate those things. Listen to suggestions openly, not defensively. In other words, act like a grown-up, consider the real possibility that your stuff is all or part of the root cause, and stop taking it too personally. Stuff breaks ALL THE TIME. If it didn’t, you wouldn’t have a job. It’s not “your” network or server or web app or whatever.
- The other (and here’s the more interesting part of the discussion) is to improve, as an industry, how we monitor application delivery. THE INDUSTRY SUCKS AT THIS. There are powerful, expensive, and obscene framework monitoring applications that are supposed to help a business know how well an application is being delivered. But they are such a burden to deploy and maintain that they end up as red light/green light far too often. So forget that, and the thousand other application performance monitoring packages out there up and down the scale. What if instead we could, as an industry, standardize how application delivery is measured?
I’m not sure yet how to articulate this idea into specifics. But I know what application delivery monitoring isn’t. Application delivery monitoring is not graphing a bunch of obtuse SNMP OIDs with inscrutable manufacturer definitions. It’s not each of the ten groups above myopically monitoring their own stuff and pronouncing everything okay. It’s not even APM. APM systems have to be distributed throughout an organization to be meaningful, and need to be able to explain the source of a slowdown. Often, APMs are deployed only centrally, don’t have enough specific insight, don’t present the data well, or offer no analysis beyond pointing out that some parameter is outside of baseline norms. Yes, APM is closer to the idea I have in mind, but I think throwing steady transactions at an application isn’t quite the sort of instrumentation I’m looking for. I don’t want an APM to simply tell me that the app is broken and then alert me. I have users for that. I want it to tell me exactly what is broken, in what way the app is experiencing brokenness, fix it for me (if possible), and then let me know the end result. (Capacity has been reduced, etc.) Don’t tell me symptoms. Tell me the root of the problem, then do something about the issue.
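To make that a little more concrete, here’s a minimal sketch (in Python) of the difference between the answer most tools hand us today and the answer I actually want. Everything in it – the DeliveryAnswer class, the field names, the example values – is hypothetical, made up to illustrate the symptom-versus-root-cause distinction; it’s not any existing standard or product.

```python
# Hypothetical sketch: a standardized "answer" to a standard question asked of
# one piece of the application delivery engine. All field names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeliveryAnswer:
    component: str                   # e.g. "web-tier", "database", "load-balancer"
    question: str                    # the standard question asked, e.g. "transaction_latency"
    value: float                     # the measured value
    unit: str                        # a well-defined unit, e.g. "ms"
    baseline: float                  # the agreed-upon normal for this component
    healthy: bool                    # is the value inside the acceptable range?
    impact: Optional[str] = None     # effect on the app, e.g. "capacity reduced"
    suspected_cause: Optional[str] = None  # the root cause, if the component can name one

# The symptom-only answer most monitoring gives us today:
symptom = DeliveryAnswer("web-tier", "transaction_latency", 2400.0, "ms",
                         baseline=300.0, healthy=False)

# The answer I actually want: same measurement, plus impact and suspected root cause.
useful = DeliveryAnswer("web-tier", "transaction_latency", 2400.0, "ms",
                        baseline=300.0, healthy=False,
                        impact="page loads exceed 2 seconds for all users",
                        suspected_cause="database connection pool exhausted")
```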
The big idea here is that if a data center is an application delivery engine, then the most noble goal of monitoring is to verify that the application is being delivered – popular off-the-shelf applications, custom in-house applications, SaaS applications, whatever. There needs to be a standard set of questions that can be asked of an application, and ALL parts of the infrastructure that the application rides on. The answers need to be well-defined and meaningful. And a centralized application should be able to aggregate all of the answers, analyze them, know what they mean, and then via a pre-defined policy work around the problem or else alert someone who can resolve the issue. And that alert should be a detailed report of the problem along with recommended solutions. Only with standard questions & answers (i.e. what we’re asking our infrastructure and the range of acceptable responses) can we get to this point. This is why SNMP OIDs fail – they are manufacturer-specific and often difficult to wrap context around. Contextless numbers are useless as bellwethers. Ever spent any time reading MIB definitions? Half of them make me want to stand on a random street corner in Silicon Valley and yell gibberish at the Teslas driving by.
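Extending the hypothetical sketch above, the centralized piece might look something like this: ask every layer the same standard questions, collect the well-defined answers, apply a pre-defined policy, and either hand back a recommended fix or rank the deviations for a human. The layer names, the runbook, and the recommended actions are all made up for illustration; none of this is a real product or API.

```python
# Hypothetical sketch of the centralized aggregator, reusing the DeliveryAnswer
# class from the previous example. Policy, runbook, and layer names are made up.
from typing import Callable, Dict, List

Responder = Callable[[], List[DeliveryAnswer]]  # each layer answers the standard questions

# Hypothetical runbook: map known root causes to recommended (or automated) fixes.
RUNBOOK: Dict[str, str] = {
    "database connection pool exhausted": "increase the pool size and recycle idle connections",
}

def assess(responders: Dict[str, Responder]) -> None:
    unhealthy = [(layer, answer)
                 for layer, ask in responders.items()
                 for answer in ask()
                 if not answer.healthy]

    if not unhealthy:
        print("Application is being delivered within baseline.")
        return

    # Policy step 1: if any layer names a root cause, report it with a recommended fix.
    for layer, answer in unhealthy:
        if answer.suspected_cause:
            fix = RUNBOOK.get(answer.suspected_cause,
                              "no runbook entry; escalate to the owning team")
            print(f"Root cause ({layer}): {answer.suspected_cause}")
            print(f"Impact: {answer.impact}")
            print(f"Recommended action: {fix}")
            return

    # Policy step 2: otherwise, rank the deviations from baseline instead of
    # making ten teams each declare "no problem found."
    print("No layer claims the root cause; largest deviations from baseline:")
    for layer, answer in sorted(unhealthy,
                                key=lambda pair: pair[1].value / pair[1].baseline,
                                reverse=True):
        print(f"  {layer}: {answer.question} = {answer.value} {answer.unit} "
              f"(baseline {answer.baseline} {answer.unit})")
```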
We need to enable IT as an organization to own the application delivery engine as a whole. When an application is broken, the root cause shouldn’t be mysterious. A monitoring application fed the right information should be able to assess the issue and make a recommendation of how to fix it. Whoever pushes the industry in that direction – that of taking ownership of the engine and not the engine’s parts – wins.