Why is failure endemic to distributed systems?
In the past two blog posts we talked about a hypothetical ASP.NET application. Let's add a second tier to this app where we make a call to a web service.
We will then have some version of the following code fragment, which resembles something everybody has written:
ClientProxy client = new ClientProxy();
int result = client.Do(a, b, c);
What's wrong with this?
We have assumed that the call would succeed. Why would it not succeed? At the very minimum, you could have a network timeout.
You are assuming you have control over a resource that you really do not.
The fundamental concept in designing for failure is to understand that any interface between two components can fail.
So we rewrite the code as follows:
try
{
    ClientProxy client = new ClientProxy();
    int result = client.Do(a, b, c);
}
catch (Exception ex)
{
    ????
}
But now what do you do in the exception handler?
In this simple example, how many times do you retry?
When you give up, do you cache the input, or do you make the user enter it all over again?
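There is no single right answer, but here is a minimal sketch of one possible policy, assuming a fixed retry count, a simple backoff delay, and a hypothetical pendingWork store for saving the input when we give up (none of which come from the original fragment):

int result = 0;
bool succeeded = false;
const int maxAttempts = 3;   // assumption: three tries before giving up

for (int attempt = 1; attempt <= maxAttempts && !succeeded; attempt++)
{
    try
    {
        ClientProxy client = new ClientProxy();
        result = client.Do(a, b, c);
        succeeded = true;
    }
    catch (TimeoutException)
    {
        if (attempt == maxAttempts)
        {
            // Give up: save the input somewhere durable (pendingWork is a
            // hypothetical store) so the user does not have to re-enter it.
            pendingWork.Save(a, b, c);
        }
        else
        {
            // Simple backoff before the next attempt.
            Thread.Sleep(TimeSpan.FromSeconds(attempt));
        }
    }
}

Even this sketch only pushes the problem around; the point is that the retry policy and the fallback are decisions you have to make, not something the exception handler gives you for free.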
Suppose the service on the other side has stopped working? What happens when the underlying hardware crashes and your application has to be restarted?
Where is the user data then?
What about total failure conditions? Do you "go to" out of the exception handler?
Where do you go to?
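One way to architect for those total failures, sketched below purely for illustration, is to write the user's input to durable storage before making the call; workItemStore and WorkItem are hypothetical names, not part of the original example:

// Persist the input first, so a crash or restart does not lose it.
WorkItem item = new WorkItem { A = a, B = b, C = c };
workItemStore.Save(item);

try
{
    ClientProxy client = new ClientProxy();
    int result = client.Do(item.A, item.B, item.C);
    workItemStore.MarkCompleted(item);   // remove only after success
}
catch (Exception)
{
    // The work item is still in the store. A background process, or the
    // application itself after a restart, can pick it up and try again.
}

Notice that this is an architectural decision about where the data of record lives, not something you can bolt on inside a catch block.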
You cannot program your way out of a failure condition in code that is
based on the assumption that everything works properly. You have to architect
and design for failure conditions from the start.
The critical issue is how you respond to that failure.
Here is the fundamental principle of designing for failure:
Assume failure will occur. The question is how the application will respond to that failure. You cannot depend on the underlying infrastructure to achieve availability because it cannot make that guarantee.