We have been talking about cloud failures. How likely are they?
Some outages occur all the time. Even with the excellent reliability of hardware, a cloud data center has an enormous number of components. Something is always failing. Cloud data centers are built to detect failure, and move applications to working hardware, and restart them. They have failure zones so that instances of the same service are kept on different sets of hardware.
What about other kinds of outages?
Amazon and Windows Azure have had interruptions in service, some longer than others. There could be major power outages such as in the Northeast United States in 1965 that left people without power for up to 12 hours. In 2003 there was a major power outage in the Northeastern and North central US as well as Ontario, Canada. The Japanese tsunami had a similar effect. Many smaller outages occur after storms. Even if a data center remains, external Internet connections might be interrupted. In 2009 Google had a major outage in Asia caused by a configuration error that caused problems even in the United States and Europe. The problem was analogous to the cause of the 1965 Northeast US power failure.
As we have discussed in previous posts, any software or service that you depend on is a possible source of an outrage, including the Internet itself. You don’t even need an outage; all you need is for a service to become less responsive. Remember when Michael Jackson died? It was difficult to get to any web site, because there was not enough bandwidth to accommodate everybody’s surfing to find out the news. It was the largest self-inflicted denial of service attack ever.
Are these outages really rare?
As we found out with the financial sector, black swan events can happen. Random events do occur. Small probability events happen. Kahneman and Tversky demonstrated that people reason about probability poorly. While this can be acceptable behavior with regard to personal decisions, it is very questionable when it comes to estimating the probabilities of rare engineering events
You cannot assume that any connection to a distributed service will always be available.
Netflix’s continual availability during the April 2011 Amazon outage is now legendary. The reason for their success was because they assumed failure was possible. They had stateless services. They restricted the use of relational data to where it was really necessary so they could switch to a hot standby. They degraded gracefully, only keeping alive services that were really necessary. You might not have been able to get your personalized movie list, but you could still find and play movies. They had enough excess capacity to deal with transient failures, and shifting loads.
Assume the rare can occur, because it will.