One of the common problems in designing any network-based software (which is pretty much all software that matters nowadays) is how to deal with the possibility of sending a request and never getting a response. The most common approach is to set a timer when the request is sent, and abort if the timer fires. This approach is so common that it’s actually a bit amazing that support for it is not usually built into the networking interfaces themselves. Sending the request and setting the timer are typically two completely separate actions, served by completely separate OS subsystems, exacting overhead (e.g. extra syscalls, otherwise-unnecessary timeout threads), and requiring programmers to examine error codes to tell the difference between a timeout and a more local kind of error. Ick. Any number of “frameworks” have been created – I’ve been involved in a couple myself – to provide a better programming model, but they’re still layered on top of the same inherently flawed model instead of replacing it.
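To see the shape of the problem, here’s a minimal sketch of that pattern in Go (the address, port, and payload are made up): the send and the timer are two entirely separate operations, and the caller has to inspect the resulting error afterward to tell a timeout apart from a more local failure.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// sendRequest shows the usual pattern: one action to send the request,
// a second, separate action to arm a timer, and error inspection at the
// end to tell a timeout apart from a more local kind of failure.
func sendRequest(addr string, req []byte) ([]byte, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err // local error: couldn't even connect
	}
	defer conn.Close()

	if _, err := conn.Write(req); err != nil {
		return nil, err // local error: the send itself failed
	}

	// Arming the timer is a wholly separate action from the send.
	if err := conn.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return nil, err
	}

	buf := make([]byte, 4096)
	n, err := conn.Read(buf)
	if err != nil {
		// The caller must examine the error to distinguish a timeout
		// from everything else that can go wrong.
		var nerr net.Error
		if errors.As(err, &nerr) && nerr.Timeout() {
			return nil, fmt.Errorf("request timed out: %w", err)
		}
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	if _, err := sendRequest("service.example.com:9000", []byte("ping")); err != nil {
		fmt.Println("request failed:", err)
	}
}
```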

I would contend, though, that per-request timeouts are inherently suboptimal. Relying on them turns the overhead mentioned above into a potentially serious issue, and you always seem to end up with a plethora of different timeout values with subtle dependencies that provide breeding grounds for bugs. Too many times I’ve seen software that depends on timeout X always occurring before timeout Y; then Y needs to be shortened (or X lengthened) to deal with some completely unrelated problem, the hidden assumption is violated, and chaos ensues.

Most often, what I find is that there should be exactly one timeout, implemented in dedicated “heartbeat” code, to determine whether the remote node you’re talking to is really still there. Everyone else should wait indefinitely (retransmitting etc. if necessary) until they find out from the heartbeat module that their cause is hopeless, instead of trying to figure that out for themselves. Besides being far easier to verify as correct, code that uses this model is more efficient. “Everyone for themselves” is a lot like what used to happen on the internet with congestion: everyone whose packet was dropped due to congestion would retransmit stupidly, causing the original congestion to persist longer than necessary and actually hampering efforts to improve the situation. It was stupid then, and it’s stupid now. There’s a difference between letting systems or modules act independently to avoid single points of failure and setting them up to exhibit “mob” behavior that’s injurious to all.
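To make that concrete, here’s a minimal sketch, in Go, of what a heartbeat module might look like; the Heartbeat type, the ping probe, and the wiring in main are illustrative assumptions, not any particular library’s API. Liveness is decided in exactly one place, and every request simply blocks until it either gets its reply or hears the single shared verdict.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Heartbeat owns the one and only timeout. If the peer hasn't been heard
// from within grace, it is declared dead, once, for everyone.
type Heartbeat struct {
	dead chan struct{}
}

// NewHeartbeat probes the peer every interval; ping is whatever
// application-level liveness check the peer is known to answer.
func NewHeartbeat(ping func() error, interval, grace time.Duration) *Heartbeat {
	h := &Heartbeat{dead: make(chan struct{})}
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		lastSeen := time.Now()
		for range t.C {
			if ping() == nil {
				lastSeen = time.Now()
			}
			if time.Since(lastSeen) > grace {
				close(h.dead) // the single verdict everyone waits on
				return
			}
		}
	}()
	return h
}

// Dead is closed exactly once, when the peer is declared gone.
func (h *Heartbeat) Dead() <-chan struct{} { return h.dead }

// await waits indefinitely for a reply, with no per-request timer; it gives
// up only when the shared heartbeat module says the cause is hopeless.
func await(reply <-chan []byte, hb *Heartbeat) ([]byte, error) {
	select {
	case r := <-reply:
		return r, nil
	case <-hb.Dead():
		return nil, errors.New("peer declared dead by heartbeat")
	}
}

func main() {
	// Hypothetical wiring: a probe that never succeeds, so the peer is
	// declared dead after grace and the waiting request gives up cleanly.
	hb := NewHeartbeat(func() error { return errors.New("no pong") },
		100*time.Millisecond, 300*time.Millisecond)
	reply := make(chan []byte) // no response ever arrives in this demo
	if _, err := await(reply, hb); err != nil {
		fmt.Println(err)
	}
}
```

Closing a channel is just one convenient way to broadcast the verdict to every waiter at once; the important part is that only the heartbeat code ever owns a timer.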