At the end of my original actor-model post, I suggested that I might post some example code to show how it works. More importantly, I want to show how it avoids some common problems that arise in the lock-based model, and makes others at least a little bit more tractable. Before we dive in, then, I need to explain some of what those problems are and how they happen.

The most basic problem is race conditions. The canonical case is two threads each incrementing a shared variable. If the variable starts at n, it should end up at n+2. However, if the two threads’ read-modify-write sequences overlap at all, one of the updates is lost. Consider the following sequence:

  1. Thread A reads the variable (let’s say it’s zero) into a register.
  2. Thread A increments the in-register value.
  3. Thread B reads the variable (it’s still zero) into a register.
  4. Thread A writes the in-register value (one) into the variable.
  5. Thread B increments its in-register value.
  6. Thread B writes its in-register value (also one) into the variable.

Oops. This is a minimal overlap in a simple case, and still led to an incorrect result. The common solution is to prevent thread A and thread B from executing their sequences concurrently by putting a lock around the variable. Fine so far. Now we hit the second major problem with lock-based programming. What happens if you need to make consistent modifications to two variables protected by two different locks – as part of a doubly linked list, for example, or moving money from one bank account to another? You take both locks, of course . . . but in what order? If two threads try to take two locks in opposite orders, they deadlock. Therefore, you require that any two locks always be taken in the same order. Problem solved, but at a cost.
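
The six-step interleaving above can be replayed deterministically in Python, with local variables standing in for each thread’s register. This is only a sketch for illustration; the function names `lost_update_replay` and `locked_increments` are invented here, and the second function shows the usual lock-around-the-variable fix.

```python
import threading

def lost_update_replay():
    """Replay the six steps by hand; locals play the part of registers."""
    shared = 0
    reg_a = shared      # 1. A reads the variable (zero)
    reg_a += 1          # 2. A increments its register
    reg_b = shared      # 3. B reads the variable (still zero)
    shared = reg_a      # 4. A writes one
    reg_b += 1          # 5. B increments its register
    shared = reg_b      # 6. B writes one, overwriting A's update
    return shared

def locked_increments(n_threads=2, per_thread=1000):
    """Same read-modify-write, but the lock keeps the sequences from overlapping."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(per_thread):
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]

print(lost_update_replay())   # 1, not 2
print(locked_increments())    # 2000
```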

This is all pretty basic stuff. There are myriad better and fuller explanations out there, but I’ve repeated these examples here so I can refer to them. What’s often left out of these explanations is how these problems play out in real code evolving over time in a real development organization. It’s all very well, for example, to state in the abstract that locks must be taken in order, but that’s often excruciatingly inconvenient when you’re already holding one lock because you were called by some other piece of code written by someone else, and you suddenly realize that you need to take another lock that’s earlier in the mandated order. Now you need to do some serious refactoring. In some cases you have to release the first lock, take both in the proper order, re-check conditions, re-calculate values, and then proceed. It’s very easy to introduce new races or other problems when you do this, or to create code paths that don’t release all the locks properly, etc. Then you realize that the next layer has to take yet another lock, and you have to go through an even more difficult exercise. Lock orders are often defined in terms of types, but what happens if you need to take two locks of the same type? Order by address? That’s actually a pretty common approach, but it still leaves you solving the “oops, wrong order” problem over and over. That’s the cost I mentioned previously. I’ve seen work on a piece of code get so bogged down dealing with such issues that almost no developer-hours were left to make any other kinds of improvements. Obviously a better solution is necessary.
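
To make the “release the first lock, take both in the proper order, re-check” dance concrete, here is a minimal sketch using the bank-account example. Everything in it is hypothetical: the `Account` class, the `rank` field (a stand-in for ordering by address), and both transfer functions are invented for illustration.

```python
import threading

class Account:
    _next_rank = 0

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()
        # A fixed rank stands in for "order by address".
        self.rank = Account._next_rank
        Account._next_rank += 1

def _move(src, dst, amount):
    """Caller must hold both locks."""
    if src.balance < amount:
        return False
    src.balance -= amount
    dst.balance += amount
    return True

def transfer(src, dst, amount):
    """Easy case: no locks held on entry, so just take both in rank order."""
    first, second = sorted((src, dst), key=lambda a: a.rank)
    with first.lock, second.lock:
        return _move(src, dst, amount)

def transfer_holding_src(src, dst, amount):
    """Hard case: our caller already holds src.lock and expects it back."""
    if src.rank < dst.rank:
        with dst.lock:              # dst is later in the order: just take it
            return _move(src, dst, amount)
    # Oops, wrong order: let go, retake both in order, then re-check.
    src.lock.release()
    dst.lock.acquire()
    src.lock.acquire()
    try:
        # src was unlocked for a moment, so _move re-checks the balance and
        # the caller must treat anything it read earlier as stale.
        return _move(src, dst, amount)
    finally:
        dst.lock.release()

a, b = Account(100), Account(50)    # a.rank == 0, b.rank == 1
transfer(a, b, 30)                  # a: 70, b: 80
with b.lock:                        # caller holds the later-ranked lock...
    transfer_holding_src(b, a, 20)  # ...so this takes the release-and-retake path
print(a.balance, b.balance)         # 90 60
```

Note that the hard case quietly changes the contract with the caller: the lock it thought it held the whole time was dropped in the middle, so every layer above has to be audited for stale assumptions.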

With that in mind, I’ve constructed an actor-model example. I could use any of dozens from past work experience, but those generally require too much context to explain. This one is contrived, I know, but it will do for the purposes of illustration. Let’s say that you’re a commodity trader, specializing in jellybeans. Periodically, you get an order to buy some number of jellybeans. This is a really weird kind of market, so the way you do this is to pick another trader and try to transfer some of their jellybeans to yourself. They might not actually have enough, though, so they have to execute their own order first by picking yet another trader at random, etc. Lastly, you and the other traders might all receive multiple orders simultaneously. Do you see the problem yet? Each order is a thread. Each trader is an object, with an associated jellybean count. You need to protect against races when there are multiple simultaneous orders, so each trader also has a lock. All of the trader objects are the same type, so lock ordering by type won’t work. Orders might cascade, and each new trader brought into the cascade might have a lock that’s ordered before the ones you already hold. Oh, what a mess of releases and re-checks and unwinds you’ll end up with. Go ahead and work through it a bit, if you don’t believe me. (Don’t worry about requests forming a cycle, by the way, or there not being enough jellybeans in the system. Assume that some higher-level part of the system guarantees that neither can occur.)

So, how does the actor model make this any better? Well, first off, races involving a single trader are simply gone because all actions on that trader are serialized via messages. Thus, there’s no lock either. What we have is three message handlers on each trader. I’ll use pseudo-Python syntax just for convenience, but this won’t work without some surrounding infrastructure. Hopefully it’s enough to illustrate the basic concepts.

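# Pseudo-Python: pick_a_seller() and send() stand in for the surrounding
# infrastructure. send(dest, msg) is a hypothetical name for "deliver msg to
# dest", and the (sender, xid, njb) constructor arguments are assumed as well.
class Trader:
    def __init__(self, njb):
        self.jellybeans = njb
        self.client_map = {}    # xid -> (client, njb) for orders in flight
        self.trader_map = {}    # xid -> (trader, njb) for transfers in flight

    def OrderMsgHandler(self, msg):
        # Leave a "breadcrumb" for when the reply comes back.
        self.client_map[msg.xid] = (msg.sender, msg.njb)
        seller = pick_a_seller()
        send(seller, TransferReq(self, msg.xid, msg.njb))

    def TransferReqHandler(self, msg):
        if self.jellybeans >= msg.njb:
            # We have enough ourselves.
            self.jellybeans -= msg.njb
            send(msg.sender, Transfer(self, msg.xid, msg.njb))
        else:
            # We don't have enough; get more (only the shortfall).
            self.trader_map[msg.xid] = (msg.sender, msg.njb)
            seller = pick_a_seller()
            send(seller, TransferReq(self, msg.xid, msg.njb - self.jellybeans))

    def TransferHandler(self, msg):
        self.jellybeans += msg.njb
        try:
            buyer, njb = self.client_map[msg.xid]
            del self.client_map[msg.xid]
            send(buyer, OrderDoneMsg(self, msg.xid))
        except KeyError:
            # Not a client; must have been another trader.
            buyer, njb = self.trader_map[msg.xid]
            del self.trader_map[msg.xid]
            # Bug alert; see following text.
            self.jellybeans -= njb
            send(buyer, Transfer(self, msg.xid, njb))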
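
The “surrounding infrastructure” need not be large. As a rough sketch of the serialization idea only (not the infrastructure behind the listing above), here is a hypothetical `CounterActor` with its own inbox and thread. Any number of senders can post messages concurrently, but only the actor’s thread ever touches its state, so the increment needs no lock.

```python
import queue
import threading

class CounterActor:
    """Owns its state; only its own thread ever touches self.count."""

    def __init__(self):
        self.count = 0
        self.inbox = queue.Queue()
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:         # sentinel: shut down
                break
            self.count += msg       # no lock: handlers run one at a time

    def send(self, msg):
        self.inbox.put(msg)

    def stop(self):
        self.inbox.put(None)
        self.thread.join()

def hammer(actor, n):
    for _ in range(n):
        actor.send(1)

actor = CounterActor()
senders = [threading.Thread(target=hammer, args=(actor, 1000)) for _ in range(4)]
for t in senders:
    t.start()
for t in senders:
    t.join()
actor.stop()
print(actor.count)   # 4000
```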

Now all a client has to do to place an order is generate a transaction ID, send an OrderMsg to a Trader, and wait for an OrderDoneMsg to come back. Simple, huh? There is one bug in the code, though. In fact it’s a race condition, which I left in partly for the sake of readability and partly to illustrate a point I made in my last post: the actor model is not a panacea. It avoids single-trader race conditions without risking deadlock, but it’s still vulnerable to a multi-trader race condition. Specifically, the bug is that a trader that needs to procure more jellybeans only asks for the difference between what they have and what they need, but some other order might come along and “steal” what they already had between the time they send their TransferReq and the time they get a Transfer back. That’s a higher-level kind of error, the domain of formal methods and protocol validators (like Murphi, which I’ve written about before). The actor model does not ensure error-free code, but it does make it likely that the code will contain fewer errors and that the types of errors that creep in during continued development will cause less trouble – they’ll be easier to catch, easier to fix, easier to verify as fixed. In this case, it’s pretty trivial to divide a trader’s jellybeans into a free pool and a reserved pool associated with in-flight orders, so that the “stealing” cannot occur. I’ll leave that as an exercise for the (highly motivated) reader.
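
For the less motivated, here is one possible shape of that exercise, cut down to the trader-to-trader path. It is a sketch under several assumptions of my own: a single global FIFO (`mailbox`, `send`, `run`) stands in for real per-actor queues, each trader has a fixed `seller` instead of calling pick_a_seller(), and the `free`/`reserved` fields and the `Buyer` class are invented names.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class TransferReq:
    sender: object
    xid: int
    njb: int

@dataclass
class Transfer:
    sender: object
    xid: int
    njb: int

mailbox = deque()   # one global FIFO stands in for per-actor queues

def send(dest, msg):
    mailbox.append((dest, msg))

def run():
    while mailbox:
        dest, msg = mailbox.popleft()
        dest.handle(msg)

class Trader:
    def __init__(self, njb, seller=None):
        self.free = njb          # jellybeans any request may take
        self.reserved = {}       # xid -> (buyer, wanted, set_aside)
        self.seller = seller     # fixed seller instead of pick_a_seller()

    def handle(self, msg):
        if isinstance(msg, TransferReq):
            self.on_transfer_req(msg)
        else:
            self.on_transfer(msg)

    def on_transfer_req(self, msg):
        if self.free >= msg.njb:
            self.free -= msg.njb
            send(msg.sender, Transfer(self, msg.xid, msg.njb))
        else:
            # Set aside what we have so no later request can "steal" it,
            # then ask the seller for the shortfall.
            set_aside, self.free = self.free, 0
            self.reserved[msg.xid] = (msg.sender, msg.njb, set_aside)
            send(self.seller, TransferReq(self, msg.xid, msg.njb - set_aside))

    def on_transfer(self, msg):
        buyer, wanted, set_aside = self.reserved.pop(msg.xid)
        assert set_aside + msg.njb == wanted
        send(buyer, Transfer(self, msg.xid, wanted))

class Buyer:
    """Records what it was sent, keyed by transaction ID."""
    def __init__(self):
        self.received = {}

    def handle(self, msg):
        self.received[msg.xid] = msg.njb

wholesaler = Trader(100)
middleman = Trader(5, seller=wholesaler)
buyer = Buyer()
send(middleman, TransferReq(buyer, xid=1, njb=8))   # needs 3 more
send(middleman, TransferReq(buyer, xid=2, njb=4))   # arrives while xid 1 is in flight
run()
print(buyer.received, middleman.free, wholesaler.free)   # {1: 8, 2: 4} 0 93
```

With the single-count logic from the listing, the second request would have been satisfied from the five jellybeans the first request was counting on, and the middleman’s count would have ended at -4.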

One big disadvantage that the actor model, or for that matter any messaging-based or state-machine-based model, has compared to a shared-memory/locking model is that the flow of control can seem rather hard to follow, and sequences of events leading up to a bug can be hard to reconstruct. For that reason, I think it’s important for any infrastructure supporting such a model to include some kind of message logging facility. If you know something about the sequence of messages/events at each actor, and the messages contain enough information to distinguish causal relationships (e.g. “receipt of message X caused message Y to be generated”) then that usually gets you back to the point where it’s easy to see how things went awry. Consider the bug in the code above, for example. With message logs, it should be pretty easy to see that a new order slipped in while the old one was in progress, at the trader that “unexpectedly” didn’t have enough jellybeans to continue.
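
A message log of this kind can be very simple. The sketch below is hypothetical throughout (`LoggedMsg`, `MessageLog`, and the `cause` field are names invented here): each message carries its own ID plus the ID of the message whose handling generated it, and the log keeps both a per-actor sequence and a way to walk a causal chain back to its root.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)

@dataclass
class LoggedMsg:
    kind: str
    dest: str
    cause: Optional[int] = None   # ID of the message whose handling produced this one
    msg_id: int = field(default_factory=lambda: next(_ids))

class MessageLog:
    def __init__(self):
        self.by_id = {}
        self.by_actor = {}

    def record(self, msg):
        self.by_id[msg.msg_id] = msg
        self.by_actor.setdefault(msg.dest, []).append(msg)
        return msg

    def chain(self, msg_id):
        """Walk cause links back to the root; return oldest first."""
        out = []
        while msg_id is not None:
            msg = self.by_id[msg_id]
            out.append(msg)
            msg_id = msg.cause
        return out[::-1]

log = MessageLog()
first = log.record(LoggedMsg("TransferReq xid=1", "middleman"))
req = log.record(LoggedMsg("TransferReq xid=1", "wholesaler", cause=first.msg_id))
thief = log.record(LoggedMsg("TransferReq xid=2", "middleman"))
xfer = log.record(LoggedMsg("Transfer xid=1", "middleman", cause=req.msg_id))

# What the middleman saw, in arrival order: xid=2 sits between the two xid=1 events.
print([m.kind for m in log.by_actor["middleman"]])
# How the final Transfer came to exist.
print([m.kind for m in log.chain(xfer.msg_id)])
```

The per-actor view is what exposes the interloper; the causal chain is what tells you which earlier message to blame for any given one.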