
The AWS Outage: The Cloud's Shining Moment

The Amazon Web Services outage has a silver lining.

By George Reese
April 23, 2011 | Comments: 69

So many cloud pundits are piling on to the misfortunes of Amazon Web Services this week in response to the massive failures in the AWS Virginia region. If you think this week exposed weakness in the cloud, you don't get it: it was the cloud's shining moment, exposing the strength of cloud computing.

In short, if your systems failed in the Amazon cloud this week, it wasn't Amazon's fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon's cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider.

The AWS outage highlighted the fact that, in the cloud, you control your SLA, not AWS.

The Dueling Models of Cloud Computing

Until this past week, there has been a mostly silent war raging between two dueling architectural models for cloud computing applications: "design for failure" and traditional. This battle is about how we ultimately handle availability in the context of cloud computing.

The Amazon model is the "design for failure" model. Under the "design for failure" model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.

Most cloud providers follow some variant of the "design for failure" model. A handful of providers, however, follow the traditional model in which the underlying infrastructure takes ultimate responsibility for availability. It doesn't matter how dumb your application is, the infrastructure will provide the redundancy necessary to keep it running in the face of failure. The clouds that tend to follow this model are vCloud-based clouds that leverage the capabilities of VMware to provide this level of infrastructural support.

The advantage of the traditional model is that any application can be deployed into it and assigned the level of redundancy appropriate to its function. The downside is that the traditional model is heavily constrained by geography. It would not have helped you survive this level of cloud provider (public or private) outage.

The advantage of the "design for failure" model is that the application developer has total control of their availability with only their data model and volume imposing geographical limitations. The downside of the "design for failure" model is that you must "design for failure" up front.

The Five Levels of Redundancy

In a cloud computing environment, there are five possible levels of redundancy:

  • Physical
  • Virtual resource
  • Availability zone
  • Region
  • Cloud

When I talk about redundancy here, I mean a level of redundancy that enables you to survive failures with zero downtime: the redundancy that simply lets the system keep moving when faced with failures.

Physical redundancy encompasses all traditional "n+1" concepts: redundant hardware, data center redundancy, the ability to do vMotion or equivalents, and the ability to replicate an entire network topology in the face of massive infrastructural failure.

Traditional models end at physical redundancy. "Design for failure" doesn't care about physical redundancy. Instead, it allocates redundant virtual resources like virtual machines so that the failure of the underlying infrastructure supporting one virtual machine doesn't impact the operations of the other unless they are sharing the failed infrastructural component.

The fault tolerance of virtual redundancy generally ends at the cluster/cabinet/data center level (depending on your virtualization topology). To achieve better redundancy, you spread your virtualization resources across multiple availability zones. At this time, I believe only Amazon gives you full control over your availability zone deployments. When you have redundant resources across multiple availability zones, you can survive the complete loss of (n-1) availability zones (where n is the number of availability zones in which you are redundant).
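
As a rough sketch of what spreading virtual resources across zones can look like in practice, the snippet below launches one identical web server into each of several availability zones using the boto3 SDK; the AMI ID, instance type, and zone names are placeholders, not a recommendation.

```python
import boto3

# Sketch: spread identical web servers across several availability zones so
# that losing any one zone leaves the others serving traffic.
# The AMI ID, instance type, and zone list below are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

for zone in ["us-east-1a", "us-east-1b", "us-east-1c"]:
    ec2.run_instances(
        ImageId="ami-12345678",                  # hypothetical web-server image
        InstanceType="m3.medium",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},    # pin this instance to the zone
    )
```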

Until this week, no one had needed anything more than availability zone redundancy. If you had redundancy across availability zones, you would have survived every outage suffered to date in the Amazon cloud. As we noted this week, however, an outage can take out an entire cloud region.

Regional redundancy enables you to survive the loss of an entire cloud region. If you had regional redundancy in place, you would have come through the recent outage without any problems except maybe an increased workload for your surviving virtual resources. Of course, regional redundancy won't let you survive business failures of your cloud provider.

Cloud redundancy enables you to survive the complete loss of a cloud provider.

Applied "Design for Failure"

In presentations, I refer to the "design for failure" model as the AWS model. AWS doesn't have any particular monopoly on this model, but their lack of persistent virtual machines pushes this model to its extreme. Actually, best practices for building greenfield applications in most clouds fit under this model.

The fundamental principle of "design for failure" is that the application is responsible for its own availability, regardless of the reliability of the underlying cloud infrastructure. In other words, you should be able to deploy a "design for failure" application and achieve 99.9999% uptime (really, 100%) leveraging any cloud infrastructure. It doesn't matter if the underlying infrastructural components have only a 90% uptime rating. It doesn't matter if the cloud has a complete data center meltdown that takes it entirely off the Internet.
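
To see why a 90% component uptime rating doesn't doom the application, here is a back-of-the-envelope sketch, assuming component failures are independent and failover is instantaneous:

```python
# If each redundant copy of a component is up 90% of the time, the
# application is only down when every copy is down at once (assuming
# failures are independent and failover is instant).
component_uptime = 0.90

for n in range(1, 6):
    app_uptime = 1 - (1 - component_uptime) ** n
    print(f"{n} redundant copies -> {app_uptime:.5%} application uptime")

# 1 copy   -> 90%
# 3 copies -> 99.9%
# 5 copies -> 99.999%
```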

There are several requirements for "design for failure":

  • Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure
  • Each application component must make no assumptions about the underlying infrastructure—it must be able to adapt to changes in the infrastructure without downtime
  • Each application component should be partition tolerant—in other words, it should be able to survive network latency (or loss of communication) among the nodes that support that component
  • Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure (full disclosure, I am CTO of a company that sells such automation tools, enStratus)
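
A toy sketch of that last requirement, an automation loop that watches redundant components and reacts when one stops answering, might look like the following; the health-check URLs and the failover_from stub are illustrative stand-ins for whatever your management tooling actually provides:

```python
import time
import urllib.request

# Illustrative health-check loop; the URLs below are placeholders.
REGION_HEALTH_URLS = {
    "us-east-1": "https://us-east-1.example.com/health",
    "us-west-1": "https://us-west-1.example.com/health",
}

def is_healthy(url, timeout=5):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def failover_from(region):
    """Placeholder: remove the region from DNS and grow surviving capacity."""
    print(f"reconfiguring traffic away from {region}")

while True:
    for region, url in REGION_HEALTH_URLS.items():
        if not is_healthy(url):
            failover_from(region)
    time.sleep(30)  # poll every 30 seconds
```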

Applications built with "design for failure" in mind don't need SLAs. They don't care about the lack of control associated with deploying in someone else's infrastructure. By their very nature, they will achieve uptimes you can't dream of with other architectures and survive extreme failures in the cloud infrastructure.

Let's look at a design for failure model that would have come through the AWS outage in flying colors:

  • Dynamic DNS pointing to elastic load balancers in Virginia and California
  • Load balancers routing to web applications in at least two zones in each region
  • A NoSQL data store with the ring spread across all web application availability zones in both Virginia and California
  • A cloud management tool (running outside the cloud!) monitoring this infrastructure for failures and handling reconfiguration

Upon failure, your California systems and the management tool take over. The management tool reconfigures DNS to remove the Virginia load balancer from the mix. All traffic is now going to California. The web applications in California are stupid and don't care about Virginia under any circumstance, and your NoSQL system is able to deal with the lost Virginia systems. Your cloud management tool attempts to kill off all Virginia resources and bring up resources in California to replace the load.
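
For illustration, the DNS step in that sequence might boil down to a single record change; the sketch below uses the boto3 Route 53 API with a placeholder hosted zone, record name, and load balancer hostname:

```python
import boto3

# Sketch: repoint the application's DNS record at the surviving California
# load balancer once Virginia is declared dead. The zone ID, record name,
# and load balancer hostname are placeholders.
route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",
    ChangeBatch={
        "Comment": "Fail out of us-east-1 (Virginia)",
        "Changes": [{
            "Action": "UPSERT",  # overwrite the existing record
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "CNAME",
                "TTL": 60,       # short TTL so clients pick up the change quickly
                "ResourceRecords": [
                    {"Value": "california-elb.us-west-1.elb.amazonaws.com"}
                ],
            },
        }],
    },
)
```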

Voila, no humans, no 2am calls, and no outage! Extra bonus points for "bursting" into Singapore, Japan, Ireland, or another cloud! When Virginia comes back up, the system may or may not attempt to rebalance back into Virginia.

Relational Databases

OK, so I neatly sidestepped the issue of relational databases. Things are obviously not so clean with relational database systems, and the NoSQL system almost certainly would have lost some minimal amount of data in the cut-over. If that data loss is acceptable, you'd better not be running a relational database system. If it is not acceptable, then you need to be running a relational database system.

A NoSQL database (and I hate the term NoSQL with the passion of a billion white hot suns) trades off data consistency for something called partition tolerance. The layman's description of partition tolerance is basically the ability to split your data across multiple, geographically distinct partitions. A relational system can't give you that. A NoSQL system can't give you data consistency. Pick your poison.

Sometimes that poison must be a relational database. And that means we can't easily partition our data across California and Virginia. You now need to look at several different options:

  • Master/slave across regions with automated slave promotion using your cloud management tool
  • Master/slave across regions with manual slave promotion
  • Regional data segmentation with a master/master configuration and automated failover

There are likely a number of other options depending on your data model and DBA skillset. All of them involve potential data loss when you recover systems to the California region, as well as some basic level of downtime. All, however, protect your data consistency during normal operations—something the NoSQL option doesn't provide you. The choice of automated vs. manual depends on whether you want a human making data loss acceptance decisions. You may particularly want a human involved in that decision in a scenario like what happened this week because only a human really can judge, "How confident am I that AWS will have the system up in the next (INSERT AN INTERVAL HERE)?"
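
To make the first option a little more concrete, automated promotion of a surviving MySQL slave might reduce to something like the sketch below; the hostname and credentials are placeholders, and the surrounding decision logic and DNS repointing are omitted:

```python
import pymysql  # assumes a MySQL master/slave pair and the PyMySQL client

def promote_slave(host, user, password):
    """Turn the surviving replica into a writable master.

    Any transactions the old master had not yet replicated are lost, which
    is exactly the data-loss trade-off described above.
    """
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("STOP SLAVE")                  # halt replication threads
            cur.execute("RESET SLAVE ALL")             # forget the old master
            cur.execute("SET GLOBAL read_only = OFF")  # start accepting writes
        conn.commit()
    finally:
        conn.close()

# A management tool would call this only after deciding the Virginia master
# is truly gone, then repoint the application's database DNS at this host.
promote_slave("db-replica.us-west-1.example.com", "admin", "secret")
```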

The Traditional Model

As its name implies, the "design for failure" model requires you to design for failure. It therefore significantly constrains your application architecture. While most of these constraints are things you should be doing anyway, most legacy applications just aren't built that way. Of course, "design for failure" is also heavily biased towards NoSQL databases, which often are not appropriate in an enterprise application context.

The traditional model will support any kind of application, even a "design for failure" application. The problem is that it's often harder to build "design for failure" systems on top of the traditional model because most current implementations of the traditional model simply lack the flexibility and tools that make "design for failure" work in other clouds.

Control, SLAs, Cloud Models, and You

When you make the move into the cloud, you are doing so exactly because you want to give up control over the infrastructure level. The knee-jerk reaction is to look for an SLA from your cloud provider to cover this lack of control. The better reaction is to deploy applications in the cloud designed to make your lack of control irrelevant. It's not simply an availability issue; it also extends to other aspects of cloud computing like security and governance. You don't need no stinking SLA.

As I stated earlier, this outage highlights the power of cloud computing. What about Netflix, an AWS customer that kept on going because they had proper "design for failure"? Try doing that in your private IT infrastructure with the complete loss of a data center. What about another AWS/enStratus startup customer who did not design for failure, but took advantage of the cloud DR capabilities to rapidly move their systems to California? What startup would ever have been able to relocate their entire application across country within a few hours of the loss of their entire data center without already paying through the nose for it?

These kinds of failures don't expose the weaknesses of the cloud—they expose why the cloud is so important.


Tags:

  • aws,
  • cloudcomputing,
  • designforfailure,
  • NoSQL


69 Comments

By Razi Sharir on April 23, 2011 11:11 AM

Couldn't agree more. When we designed Xeround SQL Cloud Database-as-a-service, we looked at all the layers you listed for alternate DRP design: Physical > Virtual resource > Availability zone > Region > Cloud.
With that in mind, we took a cloud-agnostic approach enabling our users to run their databases on any public cloud - Amazon (East, Europe, same/multi zone), Rackspace and soon many other IaaS worldwide - or private cloud (VMware vCloud)… In fact we also support a similar approach PaaS-wise – we support Heroku, CloudControl and quite a few others coming soon.
We know running a DB in the cloud is tricky, so we took the DBaaS direction with a Hakuna Matata, worry-free philosophy: auto everything – healing, elastic scalability, distribution – with more front-end SQL/NoSQL APIs coming soon.
Try us at xeround.com (Razi Sharir)

By Will on April 23, 2011 1:34 PM

I thought the failure was that multiple availability zones died simultaneously, something that by design and per Amazon's docs should never happen short of a hurricane in Virginia. Note that it is exponentially harder to distribute your app across not only AZs but geographical areas as well: high speed links connect AZs within a geo, but going from one geo to another is extremely slow and not realtime.

Of course you design for failure, it happens every day on AWS. But can you design around multiple datacenters (availability zones) dying simultaneously? When AWS told you not to worry about that eventuality? Probably not without downtime and some serious compromises.

By Alexander Muse on April 23, 2011 1:43 PM

It was a surprise that so many popular web services went down when Amazon went down. We always assumed one of two things would happen: that our infrastructure in the Amazon cloud would fail or our infrastructure in our data center would fail (hopefully not at the same time).

By mike on April 23, 2011 2:07 PM

The problem is that once EVERYONE falls back to a service in another availability zone, that zone suddenly has to handle twice the load (probably a lot more when Virginia goes down, because it's generally believed to have the most instances). We saw pretty heavy slowdown across zones even with only a handful of people following this approach. You need to either bring another provider into the mix, or just have faith that AWS keeps piles and piles of spare capacity.

By bobi in reply to comment from mike on April 25, 2011 7:54 AM

The ‘overload’ situation in such an event remains for either a few minutes or a couple of hours... you are right, it affects ‘performance’... but that is what happens when you fail over business-critical apps to a DR site when the primary site goes off... you do it just for 20% of total apps... and assume that they will fail back to the primary site in a few hours... DRs are not designed to meet performance targets, but to keep business-critical apps available in a scaled-down mode.

By BiggieBig on April 23, 2011 2:35 PM

More bullshit.

By jerhewet in reply to comment from BiggieBig on April 23, 2011 5:14 PM

Ayep. What BiggieBig said. The cloud fanatics really do need to wake the f__k up and smell the f__ckin' coffee.

By ronroddam in reply to comment from BiggieBig on April 23, 2011 6:22 PM

Succinct, not much substance or thoughtful consideration but succinct.

By heynow in reply to comment from BiggieBig on May 2, 2011 8:19 AM

Biggie, tell us more...Messieur Biggie. Inquiring minds want to know your point of view.

By Justin Santa Barbara on April 23, 2011 3:20 PM

AWS previously assured us that multiple Availability Zones wouldn't realistically fail at the same time. Now that that has proved to be untrue, you choose to say "Ah - you shouldn't have believed AWS, you should have been using multiple regions." Presumably when the next outage hits both US regions you'll say "Ah - of course you should have used the EU and Asia regions as well."

We should recognize AWS as a single point of failure and look at hosting across multiple providers. Fool me once, shame on you; fool me twice, shame on me.

This does require sophisticated management tools like enStratus, but you should use those tools to avoid putting all your eggs into the AWS basket.

I'm not sure that the rest of the technology stack has necessarily caught up to this model though - in particular NoSQL databases aren't the panacea you appear to believe them to be. Hopefully all the pieces of the technology stack will evolve.

By George Reese in reply to comment from Justin Santa Barbara on April 23, 2011 3:25 PM

AWS has never in any conversation I have ever had said that multiple availability zones would not realistically fail at the same time. If they felt that way, don't you think they'd have an SLA better than 99.9%?

Of course, if you want to survive the failure of multiple availability zones, you should spread yourself across regions. I don't understand why this is so hard for people to understand.

Similarly, yes, you should have some ability to migrate your systems into another cloud. I don't think actual technical loss of all AWS regions (or even multiple regions) can happen absent of nuclear war or asteroid strike, but companies do go out of business/get sued/etc.

By Justin Santa Barbara in reply to comment from George Reese on April 23, 2011 3:45 PM

From the EC2 homepage (aws.amazon.com/ec2/):
"Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location."

From the EC2 FAQ (aws.amazon.com/ec2/faqs/):
"Q: How isolated are Availability Zones from one another?
Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone."

Seems pretty clear to me that multiple AZ failure is supposed to be unrealistic except in the case of disasters, and AWS even explicitly state that it would have to be a large scale disaster, not just a "measly" fire, tornado or flood :-)

In addition, AWS themselves engineered their own solutions reflecting this assumption (e.g. RDS Multi-AZ is multi-AZ, not multi-region)

Of course, you're right - AWS was over-promising here; we should have ignored what they stated and used multiple regions. But it's the same people and the same software that run those multiple regions, so I don't understand how you continue to have faith that multiple regions won't go down except in extraordinary circumstances.

I think we're in agreement that you can't trust a single AZ; we've learned in this outage that you can't trust a single Region. We only disagree in that you continue to have faith in multiple AWS regions, whereas I have no reason to believe that e.g. an AWS software bug won't get deployed to all regions, or that a rogue AWS employee won't somehow shut down all the regions.

As for your conversations with AWS, if they were in fact privately saying to you that multiple AZ failure were likely, while publicly saying the opposite, I think you should publish that story.

By Hagrin in reply to comment from Justin Santa Barbara on April 24, 2011 10:22 AM

Pretty telling that the author never replied to this comment, yet replied to comments below.

Cloud computing has its uses, but anyone trying to "polish the turd" that was this epic outage is nothing more than a glorified used car salesman. Amazon promised something that they didn't deliver - it's really that simple. They had more downtime this week than my company has had in the last 5 years, and that includes an entire physical relocation of our corporate office.

Trying to justify the outage as a failure of the customer is so ridiculous I'll never read anything by this author again.

By bobi in re