OK, enough of the politics and philosophy; back to the techie stuff. Today’s topic is timeouts, and why they should not be used to detect failures except when absolutely necessary. The first reason is that you never know how long a timeout should be. If you set it too long, your code will sit around longer than it should after an operation has already failed (and you’ll get very little information about why it failed). If you set it too short, you risk timing out on an operation that did in fact succeed. That can be even worse if the operation is not truly idempotent – i.e. if the effect of retrying an operation that already succeeded is not exactly zero. It can be very frustrating to debug the problems that occur when a request retried after a timeout leaves you in a state that’s almost, but not quite, the same as if the operation had succeeded (and been known to succeed) the first time. Explicit failure notification is vastly superior to reliance on timeouts, and should be preferred wherever possible.
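The retry-after-timeout hazard is easy to demonstrate. Here’s a minimal sketch (the `FlakyServer` class and its behavior are hypothetical, invented purely for illustration): the server applies a deposit but the acknowledgement is lost, so the client’s timer fires and a naive retry applies the non-idempotent operation a second time.

```python
class FlakyServer:
    """Applies deposits, but 'loses' the reply to the first attempt."""
    def __init__(self):
        self.balance = 0
        self.calls = 0

    def deposit(self, amount):
        self.calls += 1
        self.balance += amount        # the effect happens either way
        if self.calls == 1:
            # Reply lost in transit; the client's timeout expires even
            # though the operation actually succeeded.
            raise TimeoutError("no reply within deadline")

def deposit_with_retry(server, amount, attempts=2):
    for _ in range(attempts):
        try:
            server.deposit(amount)
            return
        except TimeoutError:
            continue                  # did it apply or not? We can't tell.

server = FlakyServer()
deposit_with_retry(server, 100)
print(server.balance)                 # 200 -- the deposit was applied twice
```

The client did everything “right” by its own lights, yet the system ends up in that almost-but-not-quite state described above. An explicit failure notification would have told it the first attempt succeeded.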

There’s a second problem with timeouts that’s even more pernicious, and that has to do with their proliferation. As I’ve written before, it’s important to minimize the number of states a system can be in, which is a product of the states its components can be in. If you have multiple components all relying on separate timeouts to detect failures – most often communications failures or failures of remote nodes – then it’s highly likely that you’ll have many states where some components have detected and are responding to a failure while others remain oblivious. Specifically, you’ll have O(n!) possible recovery orderings, where n is the number of components (or instances of components, or sometimes even requests), because the timeouts can fire in any order. That’s unacceptable. It’s far, far better to have one piece of the system responsible for detecting such failures and providing some sort of notification to all of the others – all of which will just keep on trying until they get either a normal response or a notification that whoever they’re talking to has died. Such notification should be considered an essential feature of the communications paradigm in any serious distributed system, freeing all components but one from the need to guess how long to wait before concluding that something’s wrong. Because notifications can be serialized, such an approach also ensures that components will always detect the failure and do their recovery in a deterministic order. Limiting timeouts to that one “heartbeat” component and relying on explicit failure notification everywhere else is one of the best ways to make a system more debuggable and maintainable.
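The “one heartbeat component” idea might look something like the following sketch (all names here – `HeartbeatMonitor`, the node IDs, the subscriber components – are hypothetical, not from any real system): a single monitor owns the only timeout, and every other component simply subscribes for an explicit death notification, delivered in one deterministic order.

```python
class HeartbeatMonitor:
    """The ONE component allowed to own a timeout."""
    def __init__(self, deadline):
        self.deadline = deadline      # the system's only timeout value
        self.last_seen = {}           # node -> time of last heartbeat
        self.subscribers = []         # components wanting death notices

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def beat(self, node, now):
        self.last_seen[node] = now

    def check(self, now):
        # Notifications are serialized: every subscriber learns of each
        # failure in the same order, so recovery is deterministic.
        for node, seen in list(self.last_seen.items()):
            if now - seen > self.deadline:
                del self.last_seen[node]
                for notify in self.subscribers:
                    notify(node)

log = []
monitor = HeartbeatMonitor(deadline=5)
monitor.subscribe(lambda node: log.append(("replicator", node)))
monitor.subscribe(lambda node: log.append(("lock-manager", node)))

monitor.beat("node-a", now=0)
monitor.beat("node-b", now=0)
monitor.beat("node-a", now=6)         # node-b has gone silent
monitor.check(now=8)                  # 8 - 0 > 5, so node-b is declared dead
print(log)  # [('replicator', 'node-b'), ('lock-manager', 'node-b')]
```

The point isn’t the mechanism (real systems would use leases, gossip, or a membership service) but the shape: n components subscribing to one detector yields one failure-detection state machine instead of n independent ones racing each other.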

In conclusion, then, timeouts are a necessary evil, but they are still evil. Using them to implement periodic maintenance or tuning functions is fine, but using them in more places than absolutely necessary to detect failures is a sure way to make an unmaintainable mess of your system.