If I were ever to teach an actual class, it would probably be about how to write networking code that actually works. A lot of people teach about protocols and algorithms, and that stuff’s definitely important, but I see a lot less out there about the nitty-gritty details of how to implement those things in a debuggable and maintainable way. There seems to be a huge gap between the things CS professors tell us we should be doing and the actual facilities that are present on most systems, and it’s increasingly apparent that a lot of programmers are falling into that gap. Even if you already have a complete and clear description of the protocols you’ll be implementing, actually turning that into working (and perhaps even efficient) code is a hard problem all by itself. Here’s another example of a “defensive programming technique” that you can use to save time on mere debugging so you can spend more on the fun stuff.
One of the most common problems network programmers face is the prospect of a message arriving later than it should. Often the incoming message is a reply to one you had sent earlier, but by the time it arrives the object it referred to has already been deleted or has changed state. This is one of the reasons why I don’t like timeouts very much, and it’s one of many reasons why it’s a bad idea to send out pointers and expect them to be valid when they’re sent back. Textbooks are full of acknowledgment and sliding-window protocols to handle this for the case of messages sent through a single channel. In some cases you can take advantage of that by closing an old channel and opening a new one, but that’s not a general solution. Frequently closing and reopening channels is disruptive and inefficient; you might be dealing with orders of magnitude too many objects to make such an approach feasible; you might have other reasons for not using a protocol that imposes ordering as well as providing exactly-once semantics; and so on. For these or other reasons, you might need to deal with this issue yourself.
The way to avoid “stale message” problems is to observe that every message has a context. There’s some reason you’re getting it, something the sender knows – or thinks he knows – about the state or information you hold. If your state changes, it might invalidate the sender’s assumptions and that can be reflected by creating a new context to replace the old. The easiest way to do this is to represent the relevant context in the form of an object, and to establish a rule that every message you handle must identify an object to which it pertains. These objects must have three basic attributes:
- A unique ID, so you can look it up.
- A generation number, so you can reject messages from a previous “epoch” of the object’s existence (including another free/allocate cycle). Note that the generation number needs to be large enough to avoid wraparound, just like all that textbook stuff for the single-channel case.
- A state, so you can reject messages that don’t make sense any more (or perhaps never did).
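As a rough illustration of those three attributes, here is a minimal Python sketch. The names (`Context`, `ContextTable`, `State`) and the particular states are made up for this example, and the ID is assumed to be a slot index in a fixed-size table. The one thing it is meant to show is that freeing a slot bumps its generation, so a reference to the slot's previous occupant no longer matches.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    # Hypothetical states, chosen only for illustration.
    FREE = auto()
    ON_QUEUE = auto()
    IN_FLIGHT = auto()


@dataclass
class Context:
    ident: int              # unique ID: here, the slot index in the table
    generation: int = 0     # which "epoch" of this slot we are in
    state: State = State.FREE


class ContextTable:
    def __init__(self, size):
        self.slots = [Context(ident=i) for i in range(size)]

    def allocate(self):
        # Reuse the first free slot; its generation was already bumped on free.
        for ctx in self.slots:
            if ctx.state is State.FREE:
                ctx.state = State.ON_QUEUE
                return ctx
        raise RuntimeError("no free context slots")

    def free(self, ctx):
        # Bumping the generation here is what makes old references stale.
        # Python ints never wrap; a fixed-width wire field would need to be
        # wide enough that wraparound is not a practical concern.
        ctx.generation += 1
        ctx.state = State.FREE
```

Anyone still holding the old (ID, generation) pair after a free/allocate cycle will find the ID still resolves to a live object, but the generation no longer matches.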
The objects used to establish a message’s context might not be objects as they would otherwise exist in your program. They could also represent requests, transactions, asynchronous event handlers, or groups of any of these things – in fact, whatever contexts you use when you think about your program. In many of these cases, the object also provides a handy place to maintain lock state, timestamps, or other information associated with handling the message. It’s OK to create a “global” object to establish context for messages that have a genuinely global effect. The ID can be an index into a table, and in fact that’s where the generation number becomes most useful. Similarly, it can be very useful if the ID that you expose externally is actually an ID+generation internally, so the beginning of your message handler can look like this:
- Extract the external ID from the message header.
- Separate the external ID into the real ID and the generation number.
- Use the real ID to look up the appropriate object.
- Check the message’s generation number against the object’s, and reject it (loudly) if they don’t match.
- Check the message type against the object’s state (more about this in a moment) and reject it just as loudly if that type isn’t valid in that state.
- Process the message, secure in the knowledge that it’s now safe to do so.
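The steps above might be sketched in Python roughly as follows. Everything specific here is an assumption made for the example: the 32-bit generation field, the dict-based message and table layout, the `RejectedMessage` exception, and the `RequestDone` state that stands in for real processing. "Loudly" is interpreted as logging an error and raising.

```python
import logging

log = logging.getLogger("dispatch")

GEN_BITS = 32                       # assumed width of the generation field
GEN_MASK = (1 << GEN_BITS) - 1


class RejectedMessage(Exception):
    """Raised when a message fails the ID, generation, or state check."""


def make_external_id(real_id, generation):
    # External ID = real ID in the high bits, generation in the low bits.
    return (real_id << GEN_BITS) | (generation & GEN_MASK)


def split_external_id(external_id):
    return external_id >> GEN_BITS, external_id & GEN_MASK


# Message type -> states in which that type is acceptable (illustrative).
VALID_STATES = {"RequestComplete": {"RequestInFlight"}}


def reject(reason, message):
    log.error("rejecting message %r: %s", message, reason)
    raise RejectedMessage(reason)


def handle(table, message):
    external_id = message["id"]                            # extract external ID
    real_id, generation = split_external_id(external_id)   # split it
    obj = table.get(real_id)                               # look up the object
    if obj is None:
        reject("unknown ID", message)
    if obj["generation"] & GEN_MASK != generation:         # generation check
        reject("stale generation", message)
    if obj["state"] not in VALID_STATES.get(message["type"], set()):
        reject("wrong state for message type", message)    # state check
    obj["state"] = "RequestDone"    # placeholder for the real processing
    return obj
```

A caller would hand out `make_external_id(real_id, generation)` when sending a request and pass whatever comes back straight into `handle`, which refuses anything that refers to an earlier incarnation of the object.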
The part about checking state is also important. Basically, the idea here is that certain messages are only valid in certain states, and it’s the same idea as in Microsoft’s Singularity project. If a request object should only get a RequestComplete message while it’s in a RequestInFlight state and not when it’s in a RequestOnQueue state, that can be enforced even before real processing begins by assigning the appropriate states. Watch out for synchronization issues around state changes, though. If you’re using the kind of staged execution model that I’ve recommended elsewhere, these checks should happen before releasing the lock that allows another thread to take over the message-dispatch role. Also, keep it simple. States and messages that are uniquely associated with one another are pretty safe, but if you have too many rules more complex than “if type is X then state must (not) be Y” then you’re probably asking for trouble. More complex state models can protect you from more things, but the one thing we’re really concerned with here is stray (i.e. delayed or duplicate) messages, and chasing lots of false alarms from a too-complex state model is a cure worse than the disease.
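To show how small the “if type is X then state must (not) be Y” rule set can stay, here is a hedged Python sketch. The `RequestComplete`, `RequestInFlight`, and `RequestOnQueue` names come from the example above; `RequestCancel` and `REQUEST_DONE` are hypothetical additions, included only to show a “must not be” rule. Locking is left out, so in a real dispatcher this check would have to run while still holding whatever lock protects the object's state.

```python
from enum import Enum, auto


class State(Enum):
    REQUEST_ON_QUEUE = auto()
    REQUEST_IN_FLIGHT = auto()
    REQUEST_DONE = auto()       # hypothetical, for the "must not" example


# One rule per message type: (must_match, state).
#   must_match=True  -> state must be `state`
#   must_match=False -> state must not be `state`
RULES = {
    "RequestComplete": (True, State.REQUEST_IN_FLIGHT),
    "RequestCancel": (False, State.REQUEST_DONE),   # hypothetical message
}


def message_allowed(msg_type, state):
    rule = RULES.get(msg_type)
    if rule is None:
        return False            # unknown message types are always rejected
    must_match, rule_state = rule
    return (state is rule_state) == must_match
```

With one rule per message type, a rejection is easy to explain when it fires, which is much of the point of keeping the state model this simple.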
Adopting this methodology won’t protect you from all stray-message problems. For example, if a message to you is duplicated and the duplicates arrive back to back, then they might get processed in the same generation and state. If that’s a possibility for you, you’ll need to work out another mechanism to deal with it and then subject that mechanism to some more rigorous formal verification. What these simple tricks represent is more of a coding style that inherently avoids some of the most common problems from timeouts and retransmissions and race conditions, complementary to any protocol-specific analysis you might do.