The second thought has to do with communications protocols. Zooko was having a problem in Mojo Nation that involved events on a queue being processed out of order, and it turned out to be related to the fact that sometimes the system clock goes backwards. Here’s the sequence:

  1. Event A is scheduled, at current time plus X.
  2. Clock runs backwards.
  3. Event B is scheduled, at current time (now earlier than before) plus X.
  4. Event B runs before event A.
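To make the sequence concrete, here is a minimal Python sketch of it. Everything in it is hypothetical and made up for illustration (FakeClock, MonotonicClock, schedule, demo); it is not Mojo Nation's code, and the clamping wrapper is only one plausible way to keep a clock from running backwards.

```python
import heapq
import itertools

class FakeClock:
    """Hypothetical wall clock that we can step backwards by hand."""
    def __init__(self, now):
        self.now = now

    def time(self):
        return self.now

class MonotonicClock:
    """Wraps another clock so the reported time never decreases."""
    def __init__(self, clock):
        self._clock = clock
        self._last = clock.time()

    def time(self):
        self._last = max(self._last, self._clock.time())
        return self._last

_seq = itertools.count()  # tie-breaker: equal due times run in scheduling order

def schedule(queue, clock, name, delay):
    """Queue an event to run at current time plus delay."""
    heapq.heappush(queue, (clock.time() + delay, next(_seq), name))

def run_order(queue):
    """Drain the queue and return event names in dispatch order."""
    order = []
    while queue:
        order.append(heapq.heappop(queue)[2])
    return order

def demo(clamp):
    wall = FakeClock(100.0)
    clock = MonotonicClock(wall) if clamp else wall
    queue = []
    schedule(queue, clock, "A", 10.0)  # step 1: A is due at 110
    wall.now = 95.0                    # step 2: clock runs backwards
    schedule(queue, clock, "B", 10.0)  # step 3: B is due at 105, or 110 if clamped
    return run_order(queue)            # step 4

print(demo(clamp=False))  # ['B', 'A']
print(demo(clamp=True))   # ['A', 'B']
```

In real Python code, time.monotonic() already provides a clock that cannot go backwards, so a wrapper like this would mostly matter when you are stuck with wall-clock timestamps.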

Well, the obvious solution to this problem would be to ensure that the clock never runs backwards, and in effect that’s what Zooko did. However, I got to thinking about why this was ever a problem. First, consider what our expectations should be when we schedule an event:

  • The handler for an event should run at or after its scheduled time.
  • If two events are scheduled for different times, the one with the later time should not be dispatched before the one with the earlier time.

In a true real-time system, we may add one more expectation:

  • The interval between an event’s scheduled time and completion (not just dispatch or even execution) of its handler has a finite predetermined bound X. Therefore, the handler for an event scheduled for time T+X should not run before the handler for an event scheduled for time T has completed.

We’re not dealing with real-time systems, so in a sense the discussion should simply end there. However, let’s say just for the sake of argument that we have defined upper bounds for the dispatch-to-execution delay and the event-handler execution time. If we want to ensure ordering of events, we have to separate their scheduled times by at least the dispatch-to-execution delay. In addition, if the events are being scheduled from different contexts and either of those contexts is itself an event handler, we need to increase that difference by the event-handler execution time. If we don’t account explicitly for both of these factors, we should not be surprised when events are handled out of order – even in a real-time system.
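The arithmetic is simple enough to write down. This is only a sketch under the assumptions in the paragraph above: both bounds are known, and the function names, parameter names and millisecond units are my own inventions for illustration.

```python
def min_separation(dispatch_delay_bound_ms, handler_time_bound_ms,
                   scheduled_from_handler):
    """Smallest gap between two scheduled times that still guarantees ordering.

    dispatch_delay_bound_ms: assumed upper bound on dispatch-to-execution delay.
    handler_time_bound_ms:   assumed upper bound on event-handler execution time.
    scheduled_from_handler:  True if the events are scheduled from different
                             contexts and either context is an event handler.
    """
    gap = dispatch_delay_bound_ms
    if scheduled_from_handler:
        gap += handler_time_bound_ms
    return gap

def ordering_guaranteed(first_time_ms, second_time_ms, dispatch_delay_bound_ms,
                        handler_time_bound_ms, scheduled_from_handler):
    """True if the two scheduled times are far enough apart to stay in order."""
    needed = min_separation(dispatch_delay_bound_ms, handler_time_bound_ms,
                            scheduled_from_handler)
    return second_time_ms - first_time_ms >= needed

# With a 5 ms dispatch bound and a 20 ms handler bound:
print(min_separation(5, 20, scheduled_from_handler=False))  # 5
print(min_separation(5, 20, scheduled_from_handler=True))   # 25
print(ordering_guaranteed(1000, 1010, 5, 20, True))         # False
```

The last line is the case that bites people: a 10 ms gap looks generous next to a 5 ms dispatch bound, but it is not enough once a handler is doing the scheduling.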

As you can see, dealing with time properly is a pain. It’s even worse in a distributed system. In the hope that it will save someone else from weeks or even months spent pulling their hair out dealing with time-related bugs – as I used to do before I “saw the light” – I offer the following suggestions for protocol designers:

  • Never rely on absolute time. There’s effectively no such thing in a distributed system anyway.
  • Keep your dependencies on relative time (e.g. timeouts) to an absolute minimum.
  • Never rely on the relationships between separate intervals. That’s tantamount to relying on absolute time.
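One well-known way to follow the first rule is to order events with a logical counter instead of a timestamp. The sketch below is a Lamport-style logical clock, a technique I'm swapping in purely as an illustration; the class and method names are my own, and nothing here is specific to any particular protocol.

```python
class LamportClock:
    """Logical clock: orders events by causality, never reads the system clock."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """Advance for a local event or an outgoing message; returns the stamp."""
        self.counter += 1
        return self.counter

    def receive(self, remote_stamp):
        """Merge the stamp carried by an incoming message, then advance."""
        self.counter = max(self.counter, remote_stamp) + 1
        return self.counter

# Two nodes whose system clocks may disagree arbitrarily.
alice = LamportClock()
bob = LamportClock()

sent = alice.tick()             # alice stamps an outgoing message
received = bob.receive(sent)    # bob's stamp is forced past the sender's
reply = bob.tick()              # bob stamps his reply
got_reply = alice.receive(reply)

print(sent, received, reply, got_reply)  # 1 2 3 4
```

A receive always gets a larger stamp than the send that caused it, no matter what either machine's wall clock says, which is exactly the kind of ordering the queue example needed and could not get from absolute time.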

If you follow these simple rules, you’ll save yourself a lot of grief. You’ll also find that your protocol is easier to describe in terms that a protocol validator will understand, which will allow you to use such validators to avoid other kinds of bugs. It’s worth it; trust me.