With the Twitter IPO in the news, it is timely to look at how a great site like Twitter manages to stay up so well. For many years Twitter had quite a poor reputation for reliability. After a doozy of an outage in June last year, Christina Warren wrote in Mashable (June 21, 2012), “The uptime was too good to last. At approximately 1:50 p.m. ET, Twitter went down again. Back to Facebook, everyone. Twitter is down – let’s party like it’s 2007, 2008 and 2009!”
Twitter’s downtime in 2007 was legendary. The average uptime for 2007 was 98.06%. Twitter was down for 5 days and 23 hours for the year. In December 2007 alone it was down 11 hours. Then Twitter started to improve. By the time of the June 21, 2012 outage, the Twitter VP of Engineering, Mazen Rawashdeh, was able to claim in his blog that “for the past six months, we’ve enjoyed our highest marks for site reliability and stability ever: at least 99.96% and often 99.99%. In simpler terms, this means that in an average 24-hour period, twitter.com has been stable and available to everyone for roughly 23 hours, 59 minutes and 40-ish seconds.”
Quite an achievement, going from 6 days of downtime in 2007 to about 4 hours in 2012 – almost “four nines” (i.e. 99.99% – see Figure 1). Unfortunately, however great a technical achievement that was, Twitter’s fame and widespread usage meant that any downtime had a greater impact on customer perceptions and generated more negative PR. Not surprisingly, an 11-hour outage 6 years ago, when Twitter carried 1.6 million tweets, is far more palatable than even a hiccup now that 200 million users send over 400 million tweets daily. And indeed, Twitter managed to make it onto lists like Yotta’s “Top 15 Worst Web Outages of 2012” despite all its uptime gains over the years.
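The arithmetic behind these “nines” is straightforward. As a minimal illustration (the function name and the sample percentages below are mine, chosen to mirror Figure 1, not anything published by Twitter), converting an uptime percentage into the downtime it permits looks like this:

```python
def allowed_downtime(uptime_pct):
    """Downtime permitted by a given uptime percentage, per day and per year."""
    down_frac = 1 - uptime_pct / 100.0
    seconds_per_day = down_frac * 24 * 3600   # e.g. 99.99% -> ~8.6 s/day
    hours_per_year = down_frac * 365 * 24     # e.g. 99.99% -> ~0.88 h/year
    return seconds_per_day, hours_per_year

for pct in (97.0, 99.0, 99.9, 99.96, 99.99):
    s, h = allowed_downtime(pct)
    print(f"{pct:6.2f}% uptime -> {s:8.1f} s/day, {h:8.2f} h/year")
```

At 99.96% that is roughly 35 seconds of permissible downtime per day, and at 99.99% under 9 seconds – the band into which Rawashdeh’s “23 hours, 59 minutes and 40-ish seconds” figure falls.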
Despite the negative publicity, the bottom line is that Twitter has done an amazing job. The reality is that most organizations don’t manage to reach even 97% uptime for their applications. It takes tremendous discipline and effort to reach >99% uptime. When Mazen Rawashdeh talks of reaching 99.96% uptime over a six-month stretch, it represents a tremendous effort on the part of many teams working together.
The effort is three-pronged: (i) reduce the problems that cause outages, (ii) add redundancy to reduce their impact, and (iii) improve the process of responding to an outage so as to reduce the time to recover. Twitter has reduced the number of outages and, perhaps even more impressively, noticeably reduced the time it takes to recover from one.
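Prong (iii) is easy to undervalue. A standard way to see its weight (this is the generic steady-state availability approximation, MTBF / (MTBF + MTTR), not anything published by Twitter) is that shrinking the mean time to recover buys roughly as much uptime as shrinking the failure rate. A quick sketch with made-up numbers:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures: one outage a month (~730 h between failures).
print(f"{availability(730, 4.0):.4%}")   # 4 h to recover    -> ~99.45%
print(f"{availability(730, 0.5):.4%}")   # 30 min to recover -> ~99.93%
```

With the same outage frequency, recovering in 30 minutes instead of 4 hours moves availability from roughly two nines to nearly three.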
Have a look at the following log showing Twitter’s 2013 outages, as well as two “big ones” from 2012 (Figures 2 & 3).
Most of the causes of the 2013 outages were unique and did not repeat – there was a new failure mode every time. The complexity of large systems and their possible modes of failure can be overwhelming, but given what we ask of large applications, that complexity is inevitable and necessary. Keeping outages rare and short in the face of that complexity is a challenge on another level entirely – Twitter’s 2013 outage record is an example of meeting it with great success.
In light of this, the Twitter IPO is exciting. It is the biggest Silicon Valley blockbuster IPO since Facebook. The reduction in outages was perhaps one piece that raised the confidence level to go public; if Twitter had suffered a four-hour outage last month, there would be no IPO this month. If we estimate the IPO at $10 billion, a four-hour outage could have a billion-dollar impact. Clearly, Twitter has put serious effort into preventing that from happening, while also figuring out ways to continually improve its response to the unforeseeable problems that do arise. Many organizations suffer less visible but no less devastating effects from outages. The Scout motto applies: “Be Prepared”.
Twitter is a small application with a giant user load. A typical Fortune 500 corporation’s systems will never experience the traffic that Twitter does, but they may well have thousands of applications running simultaneously, across a broad range of complexity and usage. Outages will happen. Period. While the user load on Twitter adds considerable complexity to its infrastructure, corporations more than make up for that operational difficulty with the sheer number and diversity of their applications, funding constraints, decades-old code and architecture, and the basic problem of retaining people and the application knowledge base.
Just as it does for Twitter, all this represents a massive challenge for IT management. In the end, it looks like Twitter’s doing a good job of learning from the past and implementing effective strategies to maintain uptime. For IT management charged with improving the quality of IT service delivery, the Twitter lesson is clear – add redundancy and reduce recovery time from outages while answering these critical questions:
- How long before you found out you had an outage?
- How long did it take before you assembled the right team to look at the situation?
- What was the impact of the outage in dollars?
Despite the unpredictability of outages, these points can be addressed systematically, and continuously getting the numbers down is a clear marker of an effective, evolved response plan.
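Those three questions translate directly into numbers that can be tracked. A back-of-the-envelope sketch (every input below is a hypothetical placeholder, not a figure from Twitter or anyone else):

```python
def outage_impact(detect_min, assemble_min, repair_min, cost_per_hour):
    """Total elapsed outage time and a rough dollar impact.

    detect_min   -- minutes until you knew you were down
    assemble_min -- minutes to get the right team looking at it
    repair_min   -- minutes from diagnosis to recovery
    """
    hours = (detect_min + assemble_min + repair_min) / 60.0
    return hours, hours * cost_per_hour

hours, dollars = outage_impact(detect_min=10, assemble_min=20, repair_min=45, cost_per_hour=50_000)
print(f"{hours:.2f} h of downtime, ~${dollars:,.0f}")
```

Each term in that sum is something a response plan can attack separately, which is why detection time and team-assembly time are worth tracking on their own rather than only measuring the total.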
Figure 1: The elusive “nines”
| Nines | Uptime percentage | Mins/year | Hrs/year | Days/year |
|-------|-------------------|-----------|----------|-----------|
| 1     | 97.0000%          | 15,768.0  | 262.80   | 10.950    |
| 2     | 99.0000%          | 5,256.0   | 87.60    | 3.650     |
|       | 99.2000%          | 4,204.8   | 70.08    | 2.920     |
|       | 99.5000%          | 2,628.0   | 43.80    | 1.825     |
| 3     | 99.9000%          | 525.6     | 8.76     | 0.365     |
|       | 99.9200%          | 420.5     | 7.01     | 0.292     |
|       | 99.9500%          | 262.8     | 4.38     | 0.182     |
| 4     | 99.9900%          | 52.6      | 0.88     | 0.036     |
|       | 99.9920%          | 42.0      | 0.70     | 0.029     |
|       | 99.9950%          | 26.3      | 0.44     | 0.018     |
| 5     | 99.9990%          | 5.3       | 0.09     | 0.004     |
| 6     | 99.9999%          | 0.5       | 0.01     | 0.000     |
Figure 2: Year to Date Twitter Outages for 2013
4-Sep-13
“Some users may experience issues trying to access twitter.com. Access to Twitter on mobile apps is not affected.”
“Due to a code-related error, a series of web servers went down from 13:48-14:19 PDT, making the twitter.com website inaccessible for some users.”
27-Aug-13
20:49 UTC to 22:29 UTC
“Viewing of images and photos was sporadically impacted.”
“DNS registrar: DNS records for various organizations were modified, including one of Twitter’s domains used for image serving”
6-Aug-13
13:55 UTC to 14:19 UTC
“Some users were not able to get to twitter.com or load mobile timelines”
“Due to unexpected issues during a maintenance”
3-Jun-13
“Twitter was not available from 1:08pm PDT to 1:33pm PDT. Some users may have experienced Tweet delivery delay from 1:33pm PDT and 1:53pm PDT.”
“Due to an error in a routine change”
23-May-13
18:37 PST
“Some web users may be experiencing empty timelines and some users may be experiencing an issue with Twitter on mobile devices.”
“Our engineers are currently working on this issue”
1-May-13
“Some users may be experiencing an issue when uploading a photo.”
“Our engineers are currently working on this issue. Update: This issue has been resolved.”
23-Apr-13
“Some users may be experiencing an issue with our service.”
“Our engineers are currently working on this issue. Update: As of 8:50am PT, this issue has been resolved.”
11-Apr-13
“Some users may have experienced an issue with links contained within Tweets.”
“This issue has been resolved as of 7:10 am PST.”
27-Feb-13
“Some users may be experiencing an issue with our service”
“Our engineers are currently working on this issue. Update: This issue has been resolved as of 5am PST.”
7-Feb-13
“Earlier today, some users experienced a bug on twitter.com. While scrolling through another user’s profile, it falsely appeared to the viewer that the user had retweeted a Tweet that the viewer hadn’t actually sent.”
“This issue has been resolved.”
5-Feb-13
“Some twitter.com users may be experiencing issues uploading photos from the website.”
“Our engineers are currently working to resolve the issue. Update: This issue has been resolved.”
31-Jan-13
07:00 PST to 9:50 PST
“We experienced intermittent issues affecting web and mobile users globally: ‘Twitter is currently experiencing a widespread service outage that appears to be intermittent.’”
“Some users may be experiencing issues accessing Twitter. Our engineers are currently working to resolve the issue. This incident has now been resolved.”
21-Jan-13
“Some users may be experiencing issues accessing Twitter.”
“Our engineers are currently working to resolve the issue. Update: This incident has been resolved.”
16-Jan-13
“Some users are currently experiencing issues with twitter.com and some mobile clients.”
“Our engineers are working on this issue. Update: This incident has been resolved.”
Figure 3: Two “big” outages from 2012
26-July-12
08:20 PST down
09:04 recovered
09:15 down again
09:58 Server-side error codes
10:25 recovered
“Users around the world got zilch from us.”
“Coincidental failure of two parallel systems at nearly the same time.”
21-Jun-12
~09:00 PT Down
10:10 Rollback to previous version
10:40 Down again
11:08 Full recovery begins
“Twitter was inaccessible for all web users, and mobile clients were not showing new Tweets.”
“Cascading bug in one of our infrastructure components. Included rolling back to a previous stable version of Twitter.”
-Dr. Sukh Grewal, CEO, Grey Wall Software LLC, developers of Veoci