Amazon's Cloud May Seem Magical, But It Isn't


Mike Fratto

April 22, 2011


By now you have heard that Amazon Web Services had a massive disruption yesterday, affecting Elastic Compute Cloud (EC2) instances in the company's northern Virginia data center. The disruption has been long-lived (Amazon's dashboard is still showing problems), and it certainly blew any claim of 99.9 percent annual uptime, which allows 8.76 hours of downtime per year. In fact, it likely blew 99.8 percent uptime, which allows 17.52 hours of downtime. While 99.8 percent sounds good, the fact that some sites have been down the better part of a day has a real impact on revenue. The downtime is bad for those who manage Amazon's web services, and it's bad for those who use them. No one likes downtime. But it's not necessarily a reason to avoid the cloud, and don't make the mistake of thinking that owning your own infrastructure would have avoided a similar problem.
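For readers who want to check the arithmetic behind those uptime figures, here is a minimal sketch. It assumes a 365-day year (8,760 hours), and the function name `allowed_downtime_hours` is just an illustrative choice, not anything Amazon publishes.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # The two uptime levels mentioned in the article.
    for pct in (99.9, 99.8):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime allows {hours:.2f} hours of downtime per year")
```

An outage that keeps a site down for most of a day uses up the whole yearly budget at either level.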

Technologies and strategies like cloud, virtualization, clustering and RAID are not going to magically dissolve failure. Failure happens. It happens to everyone. When failure happens to a giant like Amazon, it is a Big Deal--partly because, in Justin Santa Barbara's opinion, Amazon is a victim of its own success and of the promises it made to customers.

Amazon is also a victim of the hype surrounding cloud computing, with the notion that cloud computing today provides resilient, fault-tolerant computing services with the capability to auto-magically recover from failure. Automation is key to the success of any cloud service, particularly one on the scale of EC2. It would appear, and this is pure speculation on my part, that Amazon's automatic recovery processes exacerbated the outage by consuming storage at a massive rate. Gartner's Lydia Leong has a good explanation of what happened in the Amazon outage and of the auto-immune vulnerabilities of resiliency. Ooooops. I am not going to kick Amazon. Mistakes happen to the best of us, and I have to think the folks who designed and manage Amazon's service are pretty talented. But if I were an Amazon EC2 customer, or any cloud customer, I'd be taking a good, hard look at my cloud provider's availability claims to see if they are sufficient to meet my business needs.

What do you do as a customer or potential customer of Amazon's services? Amazon is notoriously tight-lipped about its operations and management, and there are some folks--like Roman Stanek, whose company is an Amazon customer--who would like more transparency. Transparency is important not only during an event, but during the purchase process. If you are going to base your business, in whole or in part, on an external service, you need assurances that the service is run reliably and that the operation's processes are going to result in effective assessments of an outage's severity and a realistic assessment of recovery.

You can't really demand a demonstration of a catastrophic fail-over and recovery. You can review the available materials and the processes a provider uses, assess their effectiveness based on your own experience--or, if you don't have the expertise, on the assessment of a trusted adviser--and determine whether the provider can satisfy its own promises. If a potential service provider isn't forthcoming with a potential customer, then perhaps you decide not to do business with the organization.

Of course, even the best-laid plans can be waylaid by unforeseen consequences, which is likely what happened to Amazon. I don't believe the company would have designed in such a catastrophic failure point on purpose.

What are the options? Stay out of cloud computing? Maybe. Cloud computing, with its automated management plane, is still young. A lot of smart thinking has gone into the operations and management of big, automated computing services, but there is more to come. Cloud computing is still cutting edge and needs to mature. Of course, I get a chuckle from the idea that cloud availability issues can be solved by using multiple cloud providers via an automated method.

While I haven't given that idea a great deal of thought, I'd need to see some serious proof points to start to believe it. Adding more clouds doesn't necessarily mean additive availability. Building a computing system from products that are reliant on each other, each offering five-nines reliability, actually reduces the statistical uptime of the whole, because the reliabilities multiply rather than add. Adding more clouds won't magically make your services more reliable. What you need to do, if you are planning on using cloud services, is to examine the applications you want to put in the cloud and consider how they can be redesigned for resilience. Your application--as a system that includes hardware, software, services, etc.--has to be designed to recover from failure. As George Reese, CTO of Enstratus, said on Twitter, "When you put the responsibility for availability on software, your hardware options increase and your costs go down. And, ultimately, you get greater availability."
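The multiplicative point is easy to see with a little arithmetic. The sketch below is only an illustration: it assumes every component must be up for the system to work and that component failures are independent, which real systems rarely guarantee. The function name `serial_availability` is made up for this example.

```python
from math import prod


def serial_availability(availabilities):
    """Combined availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


FIVE_NINES = 0.99999  # 99.999 percent, expressed as a fraction

if __name__ == "__main__":
    # Each added dependency lowers the combined figure a little further.
    for count in (1, 3, 10):
        combined = serial_availability([FIVE_NINES] * count)
        print(f"{count} dependent five-nines components: {combined:.5%}")
```

Every dependency added to the chain, including whatever automation is supposed to switch between providers, is one more factor below 1.0 in that product.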


About the Author(s)

Mike Fratto

Former Network Computing Editor
