How much damage can a business system outage cause? Outages happen often, and they can have a serious impact. Take, for example, Visa’s payment network outage. On June 1st, 2018, Visa’s payment system in Europe went down for nearly ten hours, halting many personal and bank transactions. The massive, complex nature of the system made it difficult to pinpoint the root cause, adding hours of downtime and considerable frustration for the company’s customers. After completing its root cause analysis, the company identified a “very rare partial failure” of a switch in one of its data centers as the cause of the outage.
The pervasiveness of digital technologies across all industries means that outages aren’t exclusive to a particular type of business, either; corporations, SMBs, non-profits, manufacturers, governments – any sector, really – all experience unscheduled outages.
What are the causes of unpredicted downtime, what can be expected from it, and how can an organization lessen the impact of it?
The Impact of IT Outages
The biggest, most glaring consequence of outages is how they hit the bottom line. Downtime is expensive.
A large portion of that bill stems from the services a company cannot provide to its customers while it is down. When Visa’s payment network went down in 2018, 5.2 million transactions across Europe were affected. Imagine the bill generated by such a vital service being unavailable for ten hours.
In 2014, Gartner reported that the average cost of an outage was $5,600 per minute, meaning an hour of downtime costs $336,000. This was just an average, however. As some other studies have indicated, these costs can run much higher depending on the operations of a business.
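The arithmetic behind that figure is easy to reproduce for your own planning. The following is a minimal sketch, assuming a flat per-minute cost; the function name `downtime_cost` and the ten-hour scenario are illustrative only, and the result is not an estimate of what any real outage (including Visa’s) actually cost.

```python
# Average cost per minute of downtime, in USD, per the 2014 Gartner figure cited above.
GARTNER_AVG_COST_PER_MINUTE = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = GARTNER_AVG_COST_PER_MINUTE) -> float:
    """Estimate the cost of an outage assuming a flat per-minute rate (a simplification)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# One hour at the average rate: 60 * 5,600 = 336,000
print(f"1 hour:   ${downtime_cost(60):,.0f}")

# A hypothetical ten-hour outage at the same average rate
print(f"10 hours: ${downtime_cost(10 * 60):,.0f}")
```

Swapping in your own per-minute figure for `cost_per_minute` is likely to be more useful than the industry average, since, as noted above, costs vary widely with the operations of a business.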
Beyond the Outage
Significant outages have a cascading impact, and their cost extends far beyond the actual downtime. For a more public-facing organization, serious reputational fallout could follow an outage.
Consider the outage Sutter Health in California experienced in May 2018. An activated fire suppression system in one of the healthcare organization’s data centers caused an internal communications blackout that lasted over 24 hours. Some of the organization’s locations couldn’t provide treatment or begin scheduled surgeries because of the outage, while other locations had to resort to pen and paper for operations like creating patient records.
While the organization had recovery strategies in place, it appears they weren’t as effective as they could have been. The blackout caused a lot of confusion among patients, many of whom couldn’t be notified of delays and cancellations because their contact information was locked in the downed system. The affected Sutter Health locations did reschedule appointments for patients who decided to return after the blackout, but there’s no doubt that the outage left some soured on the organization and its services.
As the Visa and Sutter Health outages show, there are a couple of different reasons why these events happen to organizations. It’s impossible to protect against every cause at once, but there are a number of steps organizations can take to reduce the risk of an incident similar to those at Visa and Sutter Health.
What Causes IT Outages?
IT outages are normally the result of a few different events:
- Hardware failures
Most cars will break down after a number of years. A few small issues slowly build up over time, finally compounding into a massive issue that’s not cost effective, or maybe even impossible, to fix.
Digital hardware is similar. Older hardware isn’t made to keep up with the extreme pace at which technology advances now. Not only that, but a lot of the machinery is energy-intensive and requires intense regulation (just think of the cooling system employed by Google at its data centers). Occasionally, the hardware or its supporting infrastructure just fails.
- Software failures
Software isn’t perfect. Organizations throughout the world use both newer and older software, and both are responsible for causing outages. While new software can introduce undetected, process-breaking bugs, old software is prone to strain and unpatchable flaws.
Older software is particularly susceptible. Some people (and organizations) go years without updating or upgrading their software, allowing their vital business functions to rely on a shaky system. This practice isn’t all that uncommon.
Windows 7, which launched almost 10 years ago, accounts for 40% of the OS market share, and that share increased in 2018. Despite the popularity of the OS, Microsoft will discontinue support for it in January 2020, leaving those users vulnerable to new cybersecurity threats.
Even so, and often for fairly logical reasons (ironically, many of them to do with security and permissioning issues), many entities will continue to use sunsetted software as part of their core load, since it’s a known quantity that they’ve performed diligence on.
- Human errors
Human error is the one cause of outages that will never truly be eliminated. Accidents and negligence happen all the time. The only measure companies can really take against human error is to increase the quality of training. Even then, a company will never be able to prevent the freak occurrences that lead to outages.
Reducing the Impact
A lot can be done to lessen the impact of these IT outages. Here are some of the commonly prescribed steps that you can take, if you haven’t already, to increase your organization’s resilience:
- Embrace New Tech
Many organizations are starting to embrace cloud computing for their operations, after years of hesitation. The thinking was that putting your processes and data into the hands of someone else, rather than “safe and sound” in your own data centers, was the antithesis of being secure.
As it turns out, because cloud service providers’ entire reputation depends on security and availability (which in today’s world of constant cyber threats are tightly intertwined), they often do a better job than an organization whose primary objective isn’t managing its data center(s). Simply put, with the right provider, using the cloud is not only about innovation, but about improving the safety of your systems.
- Update Software Regularly
As Agile methodologies in software development have become prevalent over the last two decades, many software companies now release patches on a regular and frequent basis. Some organizations wait for follow-up patches before adopting a new release, given that there might be issues that went unnoticed during regression testing.
That doesn’t mean you shouldn’t update your software, though. The fact is, these new releases not only add new features, increase performance, and fix bugs, they also address emerging security issues. Implement a system for regularly reviewing updates in a timely manner to ensure you have the latest and most secure version of the software you use.
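One simple form such a review could take is a periodic check of your software inventory against vendor end-of-support dates. The following is a minimal sketch, not a complete patch-management system: the `InstalledSoftware` class, the `review_support_status` function, the 180-day warning window, and the billing application entry are all hypothetical. The Windows 7 date uses January 14, 2020, the specific day within the January 2020 cutoff mentioned earlier.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class InstalledSoftware:
    """One entry in a (hypothetical) software inventory."""
    name: str
    end_of_support: date


def review_support_status(inventory, today, warn_within_days=180):
    """Sort inventory items into unsupported, expiring soon, and supported."""
    report = {"unsupported": [], "expiring_soon": [], "supported": []}
    warn_cutoff = today + timedelta(days=warn_within_days)
    for item in inventory:
        if item.end_of_support < today:
            report["unsupported"].append(item.name)
        elif item.end_of_support <= warn_cutoff:
            report["expiring_soon"].append(item.name)
        else:
            report["supported"].append(item.name)
    return report


# Example inventory; the second entry is invented for illustration.
inventory = [
    InstalledSoftware("Windows 7", date(2020, 1, 14)),
    InstalledSoftware("Hypothetical billing app 3.2", date(2030, 6, 30)),
]

report = review_support_status(inventory, today=date(2019, 10, 1))
for status, names in report.items():
    print(f"{status}: {names}")
```

Running a check like this on a schedule, and assigning an owner to anything in the "expiring soon" bucket, gives the review process a concrete starting point.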
- Prepare and Train Staff More Effectively
Earlier we noted that if there is one root cause of outages you can count on, it’s human error. But your organization can be proactive by training employees more thoroughly. Education is one of the best outage prevention strategies.
Drills go a long way here too. Giving your teams the resources they need to run IT/DR drills consistently will pay off. A team needs practice to perform well, and through repetition its members learn to avoid mistakes instinctively. The next time an outage strikes, a well-drilled team can help ensure your organization sees as little downtime as possible.
Prepare for IT Outages
Here’s the truth: Outages are not only inevitable, they’re expensive. Money, reputation, and assets are all on the line when downtime strikes. But with diligent planning, regular maintenance and review, and thorough training, a team can minimize the impact an outage has on their organization. Take the steps above to keep your organization resilient.