Firstly, the events in Japan were horrible beyond belief. The earthquake and subsequent tsunami were horrifying to witness, let alone be a part of. The subsequent nuclear incident (currently designated INES Level 7, the same rating as Chernobyl in 1986) can only have torn at the hearts of those involved. My sympathies and condolences go to all those affected by the events of March 11th, who will be living with the aftermath for many years to come. Secondly, I’d like to say that I’m neither pro- nor anti-nuclear, nor am I interested in discussing the relative merits and demerits of nuclear power versus anything else.

Out of this tragedy, however, came a report: “The Fukushima Nuclear Accident Independent Investigation Commission Executive Summary”, which was brought to my attention by our very own @ecbanks. Unusually, this document is written in very plain English rather than Japanese and assumes no prior knowledge of nuclear safety systems. The report, commissioned by the National Diet of Japan, is highly critical of the culture pervasive within the regulatory bodies (NISA and NSC), the government agency responsible for the promotion of nuclear power (METI), and the operator itself, TEPCO. I highly recommend you read the report yourself and draw your own conclusions, but my feeling is that there are lessons to be learned here which apply as much to Information Systems incident management as they do to nuclear safety. I don’t wish to cheapen any aspect of the tragedy, but my thought is this: if a lesson learned here prevents someone from having their credit card details lifted, then maybe we can stop their day getting any worse.
Obviously, there is no comparison in terms of the stakes involved when managing a coolant failure versus a virus outbreak or network intrusion. However, there are parallels to be drawn in how people at all levels react when something bad happens, and the accident report is full of actions and inactions which I recognised from my own experience. I shall try to avoid laying down hard and fast rules that are too inflexible for the real world, but I believe that simply thinking through the issues the document highlights can leave one better prepared when an emergency does arrive.
Trust your technical leads
“Although TEPCO and the regulators had agreed on how to deal with the vent and the injection of seawater, the Kantei (prime minister’s office) was unaware of this, and intervened, resulting in further disorder and confusion.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 34, Paragraph 8)
In my experience, the guys on the ground with their fingers in the dike are the ones with the best understanding of the core issue and will, in most cases, be best placed to respond. Assuming you’ve built your technical teams well, these guys should be capable of independent action and may well breach protocol or SLAs and take unilateral action in order to resolve or mitigate an issue. Second-guessing them or countermanding their actions without superior knowledge or situational awareness will frustrate everyone’s efforts, and forcing them to explain themselves at length, or repeatedly, to the various layers of management just wastes time and frustrates unnecessarily. Do not second-guess technical decisions. Ask for options if necessary; be prepared to offer them, but don’t expect any in reply.
The definition of an “acceptable loss”
“Decontamination should not be treated as a unilateral decision, but must be categorized according to its effect. It must be remembered that at the root of residents’ questions is not decontamination, but whether they can reconstruct their former lives.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 41, Paragraph 5)
Losing a day’s transactions may be a price you’d happily pay to bring a service back online, but the business may think differently. This is one of the times when you’ll need to exercise judgement and say to yourself, “this decision is above my pay-grade”. Unfortunately, this is also the point at which you need two-way communication with the business. An example might be “burning” an infected server in order to prevent it infecting others. If you believe that yours is truly the best course of action, be prepared to justify it; have the argument with yourself beforehand if necessary, so you can prepare better counter-arguments. Inevitably, in this situation you are going to be better informed and better prepared than anyone hearing the detail for the first time, and being aggressive or derisive will not move the situation forward. Be prepared to say, “We’ve covered that,” and move on to the next thing without drawing breath.
Do not assume your audience is either incompetent or expert
“One of the biggest concerns among residents is the impact of radiation on their health. Nevertheless, the government and Fukushima Prefecture have yet to make a proper response to the pressing concerns of residents regarding radiation doses in their neighbourhood, its impact on their health, and other radiation issues. What the government needs to do is convey detailed information to the residents and provide options for informed decision-making.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 39, Paragraph 7)
During a crisis, the flow of information up and down the chain of command is key. Clearly, not everyone is an expert in everything and the further you go up the chain, the less technical the audience is likely to be. Being patronising or barking “404! 404!” when asked a question is likely to frustrate rather than placate. I suggest that communications be as succinct as possible; try using the following as an example:
- Explain the problem – “There is a problem with the FOO which is affecting customer BAR.”
- Explain what you are doing to investigate/correct the issue – “We are currently on the phone to the vendor.”
- Offer what the next step might be – “They may need to issue a patch.”
- Suggest a timeframe for a resolution, or at the very least for the next update, and stick to it – “This may take 72 hours to address; we will know more in an hour. I will issue another update then.”
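To make the shape of such an update concrete, here is a minimal sketch (my own illustration, not anything from the report) of the same four-part structure as a simple template; the field names and timings are entirely hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class StatusUpdate:
    """One incident update: the problem, the current action, the next step, and when to expect more."""
    problem: str               # what is broken and who it affects
    current_action: str        # what is being done right now
    next_step: str             # what is likely to happen next
    next_update_in: timedelta  # when the next update will be issued

    def render(self) -> str:
        next_update_at = datetime.now() + self.next_update_in
        return (
            f"Problem: {self.problem}\n"
            f"Current action: {self.current_action}\n"
            f"Next step: {self.next_step}\n"
            f"Next update: {next_update_at:%H:%M}"
        )

# Example usage mirroring the bullet points above.
print(StatusUpdate(
    problem="The FOO service is degraded, affecting customer BAR.",
    current_action="We are on the phone to the vendor.",
    next_step="The vendor may need to issue a patch.",
    next_update_in=timedelta(hours=1),
).render())
```

However you capture it, the point is that every update answers the same questions, in the same order, every time.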
The key is not to get bogged down in technical detail. Equally, pulling the front-line guy onto an executive-level conference call to explain network security 101, or constantly demanding status updates, will not resolve the issue any faster.
Have a Plan B
“If emission data cannot be retrieved from ERSS, the SPEEDI output is not accurate or reliable enough to use in delineating evacuation zones. Some of the people involved were aware of the limitations of the system, but no revisions were made before the accident. There was no other monitoring network in place that could supplement or replace the forecast systems. The System failed.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 39, Paragraph 4)
In this case, a key radiation monitoring system failed, effectively “blinding” the government as to which areas which needed immediate evacuation. The parallel I can draw is this: if your key monitoring system is dependent on your production system in any way, then you need to have a plan B in the event of failure. One must consider the effect on your decision making process that a loss of monitoring would have; how else can you retrieve the data you need to do your job? It is good practice to have a “battle box” of basic equipment available in the event of an emergency. This should contain bare-essential items such as patch leads, serial cables, an old (but reliable) laptop with basic system management tools, screwdrivers, fuses, and possibly items such as basic switches or routers. In some financial organisations I’ve seen complete “battle bridge” deployments with completely redundant management infrastructure with independent lines and MPLS connections. This is overkill for most, but you should have some version of this available to you if you suddenly lose power at your desk.
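As a minimal sketch of what a plan B might look like (my own example, with entirely hypothetical hosts and ports), even a standalone script run from an out-of-band laptop can tell you whether your critical services are still reachable when the main monitoring stack has gone dark:

```python
import socket
from datetime import datetime

# Hypothetical list of critical services: (name, host, port)
CRITICAL_SERVICES = [
    ("core-router",  "10.0.0.1",  22),
    ("primary-db",   "10.0.1.20", 5432),
    ("web-frontend", "10.0.2.5",  443),
]

def check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Run from a laptop or other out-of-band host, independent of the production NMS.
    for name, host, port in CRITICAL_SERVICES:
        status = "UP" if check(host, port) else "DOWN"
        print(f"{datetime.now():%H:%M:%S} {name:<14} {host}:{port} {status}")
```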
HVAC and critical infrastructure issues should be dealt with immediately
“Following the implementation of new regulations in other countries, discussions were held about revising the guidelines to include a scenario where the AC power source was lost. The discussion also included reviewing the reliability of existing DC power sources. Unfortunately, these talks did not result in any revision to the guideline or the regulations, and at the time of the accident no serious consideration had been given to a scenario involving loss of AC power to the plant.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 43, Paragraph 2)
If you discover the UPS batteries are leaking, assume that the power is going to fail tomorrow. Too often, the “urgent” supersedes the “important” for the most trivial of reasons. In the Fukushima case, a whole raft of barriers delayed the retrofit of the alternate AC supply; had it been in place before the tsunami, the chain of events which led to a coolant leak and the discharge of nuclear material into the atmosphere may well have been avoided.
Sometimes there is a reason why everyone else does it that way
“Despite the fact that constant vigilance is needed to keep up with evolving international standards on earthquake safeguards, Japan’s electric power operators have repeatedly and stubbornly refused to evaluate and update existing regulations, including backchecks and backfitting. The Japanese nuclear industry has fallen behind the global standard of earthquake and tsunami preparedness, and failed to reduce the risk of severe accidents by adhering to the five layers of the defense-in-depth strategy.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 43, Paragraph 7)
Whilst it is very easy to become jaded by processes such as ITIL, ISO27001, and PCI, they all strive to improve the organisation in one way or another. Even if they don’t directly apply, or you don’t have a regulatory monkey on your back, valuable lessons can be learned from them in terms of best practice. Being “unique” in the marketplace can be a double-edged sword; make sure that your reasons for doing things “differently” are genuinely for the best and not to the ultimate detriment of the business.
Access to up-to-date documentation is mandatory
“Response manuals with detailed anti-severe accident measures were not up to date, and the diagrams and documents outlining the venting procedures were incomplete or missing.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 30, Paragraph 4)
“…On top of this, sections in the diagrams of the severe accident instruction manual were missing. Workers not only had to work using this flawed manual, but they were pressed for time, and working in the dark with flash-lights as their only light source.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 17, Paragraph 8)
Documentation is one of those jobs that no one ever wants to do; it is slow, boring, and carries the lingering suspicion that no one will ever look at it again. That may be true for 95% of the documentation you create, but being able to find out immediately which switch port the internet router is connected to is the kind of thing which may save your job at some point. Discipline in maintaining documentation is critical; updating the paper trail BEFORE you make a change will prevent the detail from being forgotten. Also bear in mind the consequences of storing all your documentation on a fancy CMS in a data centre which has lost all power. Keep a hard copy or an electronic copy *somewhere* which can be independently powered and accessed.
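One way to do that, sketched below with a made-up wiki URL and page list, is a small script run on a schedule from an independently powered laptop that pulls down the handful of pages you would actually need in a crisis:

```python
import pathlib
import urllib.request

# Hypothetical CMS/wiki pages you would need with the data centre dark.
BASE_URL = "https://wiki.example.internal"
CRITICAL_PAGES = [
    "network/core-switch-port-map",
    "dr/failover-runbook",
    "contacts/vendor-escalation",
]
EXPORT_DIR = pathlib.Path.home() / "battle-box-docs"

def export_pages() -> None:
    """Save a local copy of each critical page so it survives a CMS outage."""
    EXPORT_DIR.mkdir(parents=True, exist_ok=True)
    for page in CRITICAL_PAGES:
        url = f"{BASE_URL}/{page}"
        target = EXPORT_DIR / (page.replace("/", "_") + ".html")
        with urllib.request.urlopen(url, timeout=10) as response:
            target.write_bytes(response.read())
        print(f"saved {url} -> {target}")

if __name__ == "__main__":
    # Run from cron or Task Scheduler on an independently powered laptop.
    export_pages()
```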
There is always a worse, worst-case scenario
“At the time of the accident, the laws, regulations and infrastructure were based on the assumption that the scope and magnitude of possible natural disasters would not exceed precedent. There was a failure to take into account the prospect of unprecedented events such as the earthquake and tsunami on March 11, 2011, despite the fact that the possibility of such events was known.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 46, Paragraph 2)
It’s clear that information security risk management cannot foresee every eventuality, nor offer advice on how to deal with every incident it can identify, but even exploring the worst-case scenario can be helpful. I learned a useful technique from former England rugby coach Sir Clive Woodward at a corporate event: in an open workshop, sit down with the stakeholders and talk through as many “what if” scenarios as you can think of, and how you would deal with them. For example:
- “What would we do if the power was to fail at the primary data centre during a scheduled outage at the DR site?”
- “What would we do if we found the Customer Oracle Database was corrupt whilst all the DBAs had food poisoning?”
- “What would we do if the building next door was to catch fire and the emergency services prevented access to our building?”
I suggest you stop short when you get into the realms of “when aliens attack” or “dinosaur rampage,” but improbable scenarios should still be talked through and captured in the “possible scenarios” documentation. That way, when the worst case does occur, the guys on the ground will be far better prepared than if they were coming to the situation completely “cold”.
Don’t leave your at-risk people/assets/chips in danger
“Through more than four evacuations, over 70 percent of residents from the areas near the Fukushima Daiichi and Fukushima Dai-ni plants (Futaba, Okuma, Tomioka, Naraha, Namie) evacuated. There were numerous complaints about evacuation orders that required the residents living nearest the nuclear plants to evacuate so many times.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 50, Paragraph 7)
In the Fukushima case, partly because of poor communication and partly because of poor access to information (thanks to the failure of the ERSS system), residents were moved on more than one occasion, and to locations where they received higher radiation doses than they would have done had they ignored the official advice. If you need to take a critical system offline because you fear a breach, leave it offline until such time as you are completely sure the danger has passed. Virtualisation technologies make it very easy to spin up potentially infected or compromised servers and assess them while they remain isolated from the network. If you fear a server is compromised, take it completely off the network rather than just blocking inbound connections at the firewall.
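To illustrate the difference (a rough sketch assuming a Linux host, root privileges, and an illustrative interface name, not a prescription), blocking inbound traffic still leaves the box free to talk outwards, whereas taking the interface down isolates it completely:

```python
import subprocess

IFACE = "eth0"  # illustrative interface name on the suspect host

def block_inbound_only() -> None:
    """The half-measure: drop new inbound connections, but the host can still talk out."""
    subprocess.run(["iptables", "-A", "INPUT", "-m", "conntrack",
                    "--ctstate", "NEW", "-j", "DROP"], check=True)

def isolate_completely() -> None:
    """Take the interface down entirely: no inbound, no outbound, no exfiltration."""
    subprocess.run(["ip", "link", "set", "dev", IFACE, "down"], check=True)

if __name__ == "__main__":
    # If you suspect compromise, prefer full isolation (or pull the cable / detach the vNIC).
    isolate_completely()
```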
Ensure that enough basic tools are available
“Some workers had to share one dosimeter with several others because the devices were limited. Very few were without a dosimeter at all.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 62, Paragraph 1)
“We were supposed to manage our cumulative radiation exposure level on our own, perhaps because the database became unavailable due to the earthquake. But we didn’t even have pen or paper. We had no way to accurately keep track.” (Fukushima Nuclear Accident Independent Investigation Commission, 2012 – Page 66, Paragraph 2)
In the Fukushima incident, this is perhaps the most horrendous case of “spoil(ing) the ship for a ha’p’orth of tar”. The lack of such basic equipment meant that workers were constantly in fear of exactly how much of a radiation dose they had received; many will never know the full extent of the damage to their health. In IT, one is fairly unlikely to run into circumstances that are personally hazardous, but taking the availability of something basic for granted can still stymie a recovery. The lesson is to take nothing for granted and make no assumptions such as these:
- “The DR site will have sufficient AC power to run the servers.”
- “We can easily get hold of replacement servers.”
- “The patch cords will be long enough.”
- “The line failover to the DR site will be automatic and immediate.”
The Fukushima incident was the result of entrenched practices coupled with bad luck, and was ultimately triggered by two environmental events occurring back to back. Whilst the earthquake and subsequent tsunami could not have been prevented by man, they were predicted, and had all the outstanding upgrades been implemented, the impact could have been significantly less severe. The report points out failings at every operational and regulatory level, but fundamentally, humans can fix human problems. Information Security is all about human-made problems; computers don’t leave themselves on trains or set weak passwords. Their owners do. By applying some of the lessons learned from a disaster as enormous as a nuclear meltdown to our IT systems, perhaps a few more minor disasters in our own working lives can be prevented.