Every so often, people notice that I grimace when RPC is mentioned. I’ll try to explain why. Firstly, let’s get it out of the way that “RPC” can mean many things. Sometimes when people say “RPC” they mean it in a very generic sense, to refer to any facility for calling functions remotely. Other times they very specifically mean Sun RPC version X or DCE RPC version Y or CORBA version Z. I’m going to use “RPC” more in the generic sense, but with a recognition that some particular comments might not apply to some particular kinds of RPC.
So…what’s wrong with RPC? It’s a very useful abstraction, isn’t it? Well, yes it is, but every abstraction has limits. RPC is to a certain extent a victim of its own success. Because it provides such a useful abstraction for such a common and important class of remote interactions, people forget that there are other kinds of remote interactions for which the RPC abstraction is inadequate. For example:
- RPC involves both a request and a response, just like a function is invoked and then returns. But what about messages for which no response is necessary? Many applications can make very effective use of various “hints” that can safely be dropped and therefore need not even be acknowledged, and no state need be maintained on the sender regarding their status. The RPC model forces every request to be tracked and responded to individually, making such “hinting” significantly more expensive than it needs to be.
- Similarly, the RPC model incorporates a “round trip” assumption that can be too limiting. What if the logical message path for an operation is from A to B (who A thinks has a resource) to C (who B knows really has it) and back to A? This pattern appears frequently in distributed applications, but in most kinds of RPC A will be unprepared to get a reply back from C instead of B. In some cases the desired effect can be obtained by issuing a fake multicast RPC, but that’s kludgy at best.
- In addition to the “response from the wrong place” problem above, most forms of RPC are not very “proxy-friendly”. There are many reasons, but by far the most common is reliance on per-request timeouts. These are typically set by the requestor without even knowing whether proxies are present, and start to occur spuriously as proxy hierarchies deepen.
For all these reasons, and probably more that I can’t think of or articulate right now, I’ve learned to distrust RPC. Don’t get me wrong; if your needs happen to match the assumptions emobodied in the RPC model, you’d be crazy not to take advantage of the facilities that exist, and even if you’re not doing true RPC the data serialization/deserialization (a.k.a. marshaling) tools that are often associated with RPC packages are very useful. However, the choice to use RPC is one that should be made with a full awareness of its limitations.
The obvious next question is what to do when RPC is not appropriate. That’s a complicated topic; once you move beyond RPC you’re in the realm of protocol design, about which whole books have been written. The difficulty of designing a correct, robust and efficient protocol is in itself one of the best reasons to stick with RPC. The #1 rule of protocol design is, of course, not to do it if an existing protocol – RPC-based or otherwise – is at all adequate. Nonetheless, it is sometimes necessary and I have some suggestions for when it is:
- Avoid timeouts as much as possible. If you need to use elapsed time to determine that a node or connection is dead, isolate that logic in a heartbeat protocol that provides “node/connection died” events. Have your main protocol retry requests indefinitely until success or until one of these events occurs. In most other situations there are better alternatives to using timeouts at all.
- Keep state transitions to a minimum. I don’t believe “stateless” protocols suffice in all situations, and they’re certainly not always the most efficient alternative, but huge state tables are generally the hallmark of poorly designed protocols. Keeping the state space small helps both in proving the protocol correct and in implementing it later.
- “Future-proof” your protocol by using protocol version numbers, self-identifying message fields, capabilities negotiation, etc. I might write another whole article just on techniques in this area some day.
- Validate your protocol. There are many tools available for this, and using any of them is far better than having to debug every race condition or deadlock/livelock “in the wild”. I’ve written my own protocol validator, but that was more as an exercise than anything else and for serious use I’d recommend Murφ instead.
- When implementing your protocol, be extra-aggressive about adding checks for invalid state transitions (often identified as such during the validation process), so implementation errors can be caught sooner rather than later. This sort of checking is a good idea for any code, but it’s absolutely indispensable for protocol code.
- Add lots of logging, at multiple levels – the messaging level, the “protocol engine” level, the higher-level request level. Leave the logging facilities in production code.
There’s much more, but that should be a good start. I’ve designed successful protocols for everything from cluster membership to distributed lock management to cache coherency over the years, and if it weren’t for these rules I would have gone insane long ago. I hope that, by writing them down, I can help someone else learn from my mistakes instead of having to repeat them.