These days, network and system outages are mainstream news headlines. When IT infrastructures go down, so do all the sites and applications that depend on them, and a ripple of annoyance turns into frustration, disgruntlement and lost productivity. Meanwhile, engineers, PR officials, and service providers all scramble to contain the damage and fix the problem and its effects, often heading into full-on panic mode when repairs take too long or introduce new problems.
Once the dust settles, mitigation and remediation start, with the goal of preventing the same root cause from ever recurring and of introducing new processes that keep things from degenerating into ad-hoc responses and eventual panic.
The problem, though, is that it’s impossible to predict the future. While preventing a known failure from recurring is always smart, unknown emergencies are the norm, and no plan or process can fully address every new issue that pops up during an unpredicted incident.
So what should be done after an emergency to increase the odds of a more successful response in the future, for whatever may happen? Here are some tips…
Get some experienced Ops people. Plenty of people have been through a decade of system errors and failures and have the experience and temperament to handle stressful outages. The caveat is that not everyone is talented at everything; here we are talking about people who can make good decisions with limited information – NOT software developers. Knowing code doesn’t make you valuable during an outage. Patton was a good general during wars, but he wasn’t a peacetime leader. He excelled in the moment. During an outage, you need Pattons.
Responses should be in minutes, not hours. Make sure that you have the right infrastructure, availability and communications tools to be able to mobilize your teams immediately.
Don’t just reboot. System restarts are a knee-jerk reaction – useful much of the time, but they encourage a “ctrl-alt-del” mentality that can crowd out other considerations about how to initially address the problem.
Continuously flesh out the FMEA (Failure Mode and Effects Analysis) and use it to record history. While the specifics of any given outage differ with each incident, there are general categories of causes and likely scenarios. For instance, master-slave connection breaks have occurred for years in a whole variety of situations. It’s hard to predict the future, but this kind of issue will happen again at some point. Even if you’ve never been through it before, do your homework and know what kinds of emergencies you’re likely to encounter.
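To make the idea concrete, here is a minimal sketch of what one entry in an FMEA-style incident log might look like. The field names and the example scenario are illustrative, not a prescribed schema; the risk-priority scoring shown is the classic FMEA formula (severity × occurrence × detection, each rated 1–10).

```python
from dataclasses import dataclass, field

@dataclass
class FailureMode:
    """One FMEA-style entry: a category of failure and its recorded history."""
    component: str       # e.g. "database replication"
    failure_mode: str    # e.g. "master-slave connection break"
    likely_causes: list  # hypotheses to check first during an incident
    mitigations: list    # responses that have worked before
    incidents: list = field(default_factory=list)  # notes on past occurrences

    def risk_priority(self, severity: int, occurrence: int, detection: int) -> int:
        """Classic FMEA risk priority number (RPN): severity x occurrence x
        detection, each scored 1-10 (10 = worst). Higher RPN = address first."""
        return severity * occurrence * detection

# Illustrative entry for the replication-break scenario mentioned above.
replication_break = FailureMode(
    component="database replication",
    failure_mode="master-slave connection break",
    likely_causes=["network partition", "replica overload", "config drift"],
    mitigations=["fail over reads", "re-seed replica", "alert on lag threshold"],
)
print(replication_break.risk_priority(severity=8, occurrence=4, detection=6))  # 192
```

The point of keeping such a log is not the tooling – a spreadsheet works too – but that each outage adds to a body of categorized history the next responder can draw on.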
Be very careful about “improving” reliability. Much of the time, added complexity has the opposite effect and makes the system less reliable.
Gather knowledge from beyond IT outages, especially from other disaster-response fields. For example, structure responses in a way that allows for flexibility in command and final decision-making; during an outage you need an incident commander, and what he or she says is the final word. It won’t be the CEO or CTO or some other big shot – it’ll be a person on the ground.
Maintain excellence in your customer outreach. You win more by fixing something that went wrong than by downplaying it.
The next outage will be different and it will be a surprise, but it can be handled well. You need a plan, and you need to be able to launch it quickly when the next outage happens. That plan is about people being able to make informed on-the-spot decisions via well-thought-out, flexible processes – not about technology.
-Dr. Sukh Grewal, CEO, Grey Wall Software LLC, developers of Veoci