A few days ago, Pierre Phaneuf revived a discussion about timeouts on a very old article. I’ve spent the last few days at the 2008 Lustre User Group meeting in beautiful Sonoma, CA and, since I’m stuck at SFO waiting for my red-eye home (blech blech blech), it seems like a good time to write some more about why I think timeouts are evil.

The Lustre folks are making a bunch of changes to deal with timeout issues on very large systems. For example, they’re moving from fixed to adaptive timeouts, and introducing a new “Network Request Scheduler” (NRS) that basically does fair queuing at the Lustre level. Why? Because administrators of large and even medium-sized Lustre systems have gotten sick and tired of logs filled with “RPC timeout” messages and “haven’t heard from X for a while so I’m evicting him” and worse. In essence, Lustre is doing a good job of proving that per-request timeouts don’t scale. NRS will help to prevent outright starvation, but that’s only the most immediate manifestation of the fundamental problem. Let’s say node X sends a message and expects a reply. What happens when 100K other nodes have messages in the queue ahead of X’s, and each one takes more than timeout/100K to execute? X’s request still times out, same as now, because no amount of fairness helps when the queue ahead of you can’t drain before the clock runs out. There are many solutions, but pretty much all of them involve making NRS more complex and increasing protocol overhead without addressing the fundamental problem.
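
To make that arithmetic concrete, here’s a back-of-the-envelope sketch in Python. The timeout, queue depth, and service time are numbers I picked purely for illustration, not Lustre defaults; the point is that even a perfectly fair scheduler can’t save X when the requests ahead of X take longer than the timeout to drain.

```python
# Illustrative arithmetic only -- the constants below are assumptions, not
# values from Lustre or any other real system.
TIMEOUT_S = 300.0          # assumed per-request timeout
QUEUED_AHEAD = 100_000     # requests queued ahead of node X's
SERVICE_TIME_S = 0.005     # assumed average service time per request (5 ms)

# To avoid a timeout, every queued request would have to finish in under
# TIMEOUT_S / QUEUED_AHEAD seconds -- 3 ms in this example.
budget_ms = TIMEOUT_S / QUEUED_AHEAD * 1000

# With the assumed 5 ms average, X is actually served after 500 seconds,
# long after its 300-second timeout has fired.
wait_s = QUEUED_AHEAD * SERVICE_TIME_S

print(f"per-request budget: {budget_ms:.1f} ms, actual wait: {wait_s:.0f} s, "
      f"timeout: {TIMEOUT_S:.0f} s")
```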

In various systems that I have designed and that have actually worked, I’ve followed a fairly simple rule: there is only one component in the system (call it the heartbeat subsystem) that can determine anything about the health of another node. Everybody else who sent something to that node will wait indefinitely until they get either a response or an indication from that one subsystem that the other end is dead. That way only the heartbeat subsystem has to be designed – and it does have to be designed carefully – to deal with all of the reliability and scalability issues of determining who’s alive or dead. You don’t have N components in the system each responding in different ways and at different time horizons to a real or perceived node failure. That way inevitably leads to disaster.
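
Here’s a minimal sketch of that rule in Python, under assumptions of my own: the HeartbeatService, transport, and send_and_wait names are hypothetical, not taken from any system mentioned here. The key property is that a sender blocks on a single completion queue with no timeout, and the only two things that can ever arrive on it are a real reply or an eviction notice from the one subsystem allowed to issue one.

```python
# Hypothetical sketch: one heartbeat subsystem owns all "node is dead"
# decisions; everyone else just waits for a reply or for its verdict.
import queue
import threading

class HeartbeatService:
    """The single component allowed to decide that a node is dead."""

    def __init__(self):
        self._watchers = {}         # node -> queues of waiters on that node
        self._lock = threading.Lock()

    def watch(self, node, completion_q):
        with self._lock:
            self._watchers.setdefault(node, []).append(completion_q)

    def declare_dead(self, node):
        # Invoked only by this subsystem's own (carefully designed) detector.
        with self._lock:
            waiters = self._watchers.pop(node, [])
        for q in waiters:
            q.put(("node-dead", node))

def send_and_wait(transport, heartbeat, node, msg):
    """Send msg and wait indefinitely: note the absence of any per-request
    timeout. The wait ends only with a reply or a node-failure notice."""
    completion_q = queue.Queue()
    heartbeat.watch(node, completion_q)
    transport.send(node, msg, on_reply=lambda r: completion_q.put(("reply", r)))
    return completion_q.get()       # blocks until one of the two events
```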

Note, however, that decisions and actions other than detecting and responding to a node failure can still be timeout-based. The best example is retransmissions over an unreliable network. It’s entirely possible, and in a large system likely, that a message will get lost and the sender will wait indefinitely while the heartbeat subsystem never sees a problem and thus rightly never sends any node-failure indication to end the wait. It’s perfectly legitimate, and necessary, to retransmit in such a case, though of course the retransmission policy should be considered with an eye toward scalability and the implementation should ideally be invisible to higher software layers.
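
A retransmission loop that respects this separation might look like the sketch below, reusing the hypothetical names from the previous example. The timer only ever triggers a resend of a possibly lost message, with backoff to keep the load sane at scale; the verdict that the peer is dead still belongs exclusively to the heartbeat subsystem.

```python
# Hypothetical sketch: the timeout drives retransmission, never a failure
# decision. Only a reply or the heartbeat subsystem's notice ends the loop.
import queue

def send_with_retransmit(transport, heartbeat, node, msg,
                         initial_interval=1.0, max_interval=60.0):
    completion_q = queue.Queue()
    heartbeat.watch(node, completion_q)

    def resend():
        transport.send(node, msg,
                       on_reply=lambda r: completion_q.put(("reply", r)))

    resend()
    interval = initial_interval
    while True:
        try:
            # Waking up here means "maybe the message was lost", nothing more.
            return completion_q.get(timeout=interval)
        except queue.Empty:
            resend()                                     # resend, don't give up
            interval = min(interval * 2, max_interval)   # back off for scalability
```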

To pick the obvious example, what might the world be like if Lustre had followed this rule? NRS or something like it might still be necessary for other reasons, but not to prevent the ill effects of prematurely deciding a node is dead merely because it’s too overloaded to answer one request quickly enough. Adaptive timeouts wouldn’t even make sense. Perhaps most significantly, users wouldn’t have to implement failover using a completely separate package that triggers action independently of when Lustre itself detects and responds to RPC failures. That’s a tuning nightmare at best, and it’s highly questionable whether you could ever really trust such a setup not to tie itself up in knots at the most inopportune time (i.e. during heavy load).