Thoughts on Software Development

Sunday, March 18, 2012

In the simple example we have been discussing, the consequences of a failure appear immediately to the user. In a more complicated architecture there are many more tiers and many more dependencies. With more dependencies, more problems can possibly result from poor decisions on how to handle failure. Those dependencies include other applications in your own shop, third party libraries that you don't control, the internet, etc. For example, if your order queues fail, you cannot do orders. If your customer service app fails, you cannot retrieve member information. Unhandled failures propagate (like cracks) throughout your application.

Failures Cascade - a unhandled failure in one part of your system becomes a failure of your application.

In deciding on how to respond to failure we have to distinguish between two types of failure: transient failures and resource failures. Transient failures are not due to component failure, but a resource temporarily under load that cannot respond as fast as you had assumed. With resource failures you have to have an alternative strategy because a component is not available.

Transient failures occur for short periods of time. The typical response is to retry the operation after a short period of time. But questions still remain. How often do you retry? What is a short period of time? What do you with the data during the retry? On the other hand, remember that just as failures cascade, so do delays. While you are waiting or retrying, scarce resources are being used (threads, memory, TCPIP ports, database connections) that cannot be used for other requests.

Go back to our WCF example and look at the try/catch block. If we had to do a retry how would we change the logic? We will have to adopt a whole different strategy to handle failure, a retry loop inside the catch handler would not be enough because other failures could reoccur, and not every error would allow a retry. You have to design the entire routine around expecting failure, because as we discussed in our last post, failures happen.

Since slow responses usually come from resource bottlenecks, you have to treat them as failures if you are going to have reasonable availability. Which means that transient failure can soon look like resource failures. So what do you do? Retry for a limited amount time and then give up. In addition, never block on an I/O, timeout and assume failure.

From the point of view of architecture and design, there is really no such thing as a transient failure. If you have a transient failure, fail fast and treat it as a resource failure.

Friday, March 09, 2012

Failures Are a Fact of Software Life

Why is failure endemic to distributed systems?

In the past two blog posts we talked about a hypothetical ASP.NET application. Let's add a second tier to this app where we make a call to a web service. We will then have some version of the following code fragment which resembles something everybody has written:

ClientProxy client = new ClientProxy();
int result = client.Do(a, b, c);

What's wrong with this?

We have assumed that the call would succeed. Why would it not succeed? At the very minimum you could have a network timeout. You are assuming you have control over a resource that your really do not. The fundamental concept in designing for failure is to understand that any interface between two components can fail. So we rewrite the code as follows:

try
{
   ClientProxy client = new ClientProxy();
   int result = client.Do (a, b, c);
}
   catch (Exception ex)
{
   ????
}

But now what do you do in the exception handler?

In this simple example, how many times do you retry? When you give up do you cache the input, or do you make the user enter it over again? Suppose the service on the other side stopped working? What happens when the underlying hardware crashes, and your application has to be restarted. Where is the user data then?

What about total failure conditions? Do you "go to" out of the exception handler? Where do you go to?

You cannot program your way out of a failure condition in code that is based on the assumption that everything works properly. You have to architect and design for failure conditions from the start. The critical issue is how you respond to that failure.

Here is the fundamental principle of designing for failure:

Assume failure will occur. The question is how will the application respond to that failure. You cannot depend on the underlying infrastructure to achieve availability because it cannot make that guarantee.

Subscribe:

newtelligence dasBlog 1.8.5223.1