Oct 30, 2013
Comedians, commentators, and politicians are having a field day with the problems at healthcare.gov. As software developers and computer operations experts, part of us is just not laughing. We have spent years in high-volume web software and the jokes and blogs about outages and errors do not make us laugh as hard as everyone else. Scar tissue perhaps?
Yes, absolutely. There has not been a single outage or performance slowdown that we later found out we could not have handled better. Rewind and watch in slow motion, and you will always see what could have been. Fear of the potential aftermath makes working on outages a double-edged sword: even if you work wonders getting the system back online, it will be followed by a penalty for not having prevented the problem in the first place. Having been in numerous high-stress outages and having worked to resolve them in the trenches, we share some sense of camaraderie with the IT team behind the Obamacare website. Popular belief assumes a straight line from disaster to the nirvana of perfectly running systems. The reality is that the path out of an outage or performance problem requires a difficult decision between several options, all of which cause some damage and require time for recovery, and there is always uncertainty about the extent of the damage and recovery time.
One could (a) go into full disaster recovery mode, (b) roll back all changes to some previous version when things were supposedly running well, or (c) try to find the problem and fix it on the fly. Disaster recoveries can fail and usually include loss of some data. A previous version may hit the same problem under the new conditions; a rollback could introduce other errors; or a previous version could lead to worse issues that you just don't know about. Most organizations first attempt to fix the problem on the fly, and this has its own set of issues: experts may not be available, experts may not be able to find the issue, vendors might not be responsive, and, most likely of all, one usually cannot estimate when a fix will be complete, as one can with disaster recovery or rollback.
Selecting the path to follow requires a "General Patton" spine; the "fog of outage" makes knees weak. The solution is to develop a specialized team that is brought together for disasters and not only has superb technical capability, but also the instinct to make difficult decisions quickly - people who almost subconsciously reach the best answer. As we have developed Veoci over the past three years, our initial inspiration came from the people on the front line who work to resolve outages. Better communication, task management, and rapid team assembly are the basics. A complete record of what was known at what time makes the lives of our responders easier: "Well, at that time, all we knew was X." As we talk to IT managers, it is often instructive to find out if they have personally lived through high-stress outages. If they have, the Veoci demo brings out a cathartic outpouring of how useful Veoci would've been.
With what must be an enormous load on the Obamacare site, their problems are understandable. It is a new system, and the emergency management team is still getting to know each other and starting to blow away the chaff. Combine the bureaucracy ensured by procurement and development rules with the complexities of coordinating hundreds of developers, numerous systems, and millions of users, and the current situation seems inevitable. Now that it is a crisis, healthcare.gov is probably in agile development mode.
The bureaucracy has run for the hills, and a much smaller, more focused team is fixing the issues. After all, healthcare.gov is not a massive system, nor is its load any more daunting than what many web sites now handle routinely. No new technology is needed, just better execution. I would expect them to stumble through November and for everyone to forget about how bad the situation was by March 2014.
-Dr. Sukh Grewal, CEO Grey Wall Software, developers of Veoci