Model Checking

Fri 27 September 2013

tags: process

Model checking is one of the most effective tools available for reducing the prevalence of bugs in highly concurrent code. Nonetheless, surprisingly few software developers and architects - even very smart and very senior ones - seem to know about it. Of the many such people I've worked with over the years, maybe one in ten have even heard of it, and I can count on one hand the number who've appreciated its value. Seems like a good subject for a blog post, then. ;) Let's start with what the heck it is.

Model checking is a technique for verifying finite state concurrent
systems such as sequential circuit designs and communication protocols.

That's from the blurb for Model Checking - the seminal book on the subject by Clarke, Grumberg, and Peled. The way model checking works is by generating states within a system according to rules you specify (the model), and checking them against conditions that you also specify to ensure that invalid states never occur. Some model checkers also check for deadlock and livelock that might preclude reaching a valid final state, but that's not essential. It should be pretty obvious that the number of states in even a fairly simple system can be quite large, so many of the tools also employ techniques such as symmetry reduction and Monte Carlo sampling.

My favorite set of tools in this space is the Murphi family, of which CMurphi is the one that has been most usable for me recently. Like many such tools, Murphi requires that you specify your model in a language that its authors describe as Pascal-like but which to my eye looks even more Ada-like. That's really not as awful as it sounds; I've actually found writing Murphi code quite enjoyable every time I've done it. The fact that the model is not written in the same language as the implementation is a known problem in the field. On the one hand, it creates a very strong possibility that the model will be correct but the separate implementation will not, reducing the value of the entire effort. On the other hand, traditional languages struggle to express the kinds of things that a model checker needs, and to do so efficiently. I tried to write a real-code model checker once, and didn't get very far.
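
To give a flavor of the syntax, here's a schematic Murphi fragment. This is not from my actual replication model - the names and the failure-budget scheme are made up for illustration - but it shows the general shape: constants and types up top, then state variables, a start state, transition rules, and an invariant to be checked.

Const
  N_NODES: 3;
  MAX_FAILS: 2;

Type
  node_t: 1..N_NODES;

Var
  up: Array [node_t] Of Boolean;   -- which nodes are currently alive
  fails: 0..MAX_FAILS;             -- failure events used so far

Startstate
Begin
  For n: node_t Do
    up[n] := true;
  End;
  fails := 0;
End;

Ruleset n: node_t Do
  Rule "node fails"
    up[n] & (fails < MAX_FAILS)
  ==>
  Begin
    up[n] := false;
    fails := fails + 1;
  End;

  Rule "node recovers"
    !up[n]
  ==>
  Begin
    up[n] := true;
  End;
End;

-- With at most two failure events total, some node must always be up.
Invariant "at least one node up"
  Exists n: node_t Do up[n] End;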

To give you some idea of why it's so hard for model checkers to do what they do, I'll use an example from my own recent experience. I'm developing a new kind of replication for GlusterFS. To make sure the protocol behaved correctly even across multiple failures, I developed a Murphi model for it. This model - consisting of state definitions, rules for transitions between states, and invariant conditions to be checked - comes to 550 lines (72 of them blank or comments). Running this simple model generates the following figures.

172838 states
468981 rules
10.60 seconds

That's for a simple protocol, with a small problem size - three nodes, three writes, two failures. The model was also relentlessly optimized, e.g. by eliminating states that Murphi would see as distinct only because of fields that would never be used again. Still, that's a lot of states. When I introduced a fourth write, the run time tripled. When I introduced a fourth node, I let it run for five minutes (3M states and 10M transitions), but it still showed no signs of starting to converge, so I killed it. BTW, I forgot to mention that the model contains five known shortcuts to make it checkable, plus probably at least as many more that I took without even realizing it.
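
To make that kind of optimization concrete, here's another made-up fragment - again, not from the real model - showing two mechanisms Murphi provides for taming the state count: declaring a type as a scalarset so the checker can exploit symmetry between nodes, and using undefine to clear a field once it's dead so leftover values don't multiply otherwise-identical states.

Const
  N_NODES: 3;

Type
  node_t: Scalarset(N_NODES);  -- symmetric type: states that differ only by
                               -- a permutation of nodes are folded together
  val_t: 1..2;

Var
  pending: Array [node_t] Of val_t;
  applied: Array [node_t] Of val_t;

Startstate
Begin
  For n: node_t Do
    pending[n] := 2;
    applied[n] := 1;
  End;
End;

Ruleset n: node_t Do
  Rule "apply pending write"
    !Isundefined(pending[n])
  ==>
  Begin
    applied[n] := pending[n];
    Undefine pending[n];  -- the field is dead from here on; clearing it keeps
                          -- stale values from making equal states look distinct
  End;
End;

Invariant "applied value is sane"
  Forall n: node_t Do (applied[n] = 1) | (applied[n] = 2) End;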

If it's so hard and you have to take so many shortcuts, is it still worth it? Most definitely. Look at those numbers again. How many people do you think can reason about that many states and transitions - many of them representing dark, unexpected corners of the state space opened up by forgotten possibilities - in their heads? I'm guessing none. Even people who are very good at this will find errors in their protocols, as has happened to me every time I've done the exercise. I actually thought I'd done pretty well this time, with nothing I could characterize as an out-and-out bug in the protocol. Sure, there were things that turned out to be missing, such that only one of the five implementations the model would have allowed was actually bug-free, so the exercise had already paid for itself.

Then I added a third failure. I didn't expect a three-node system to continue working if more than one of those failures were concurrent (the model allows the failures to be any mix of sequential and concurrent), but I expected it to fail cleanly without reaching an invalid state. Surprise! It managed to produce a case where a reader can observe values that go back in time. The trace below might not make much sense without knowing the protocol involved, but it should give some idea of the crazy conditions a model checker will find that you could never have considered on your own.

write #1 happens while node A is the leader
B fails immediately
C completes the write
read #1 happens while A isn't finished yet (but reads newer value)
A fails
B comes back up, becomes leader
C fails while B is still figuring out what went on
A comes back up
read #2 happens, gets older value from B
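
In Murphi terms, a bug like this surfaces as a violated assertion or invariant, and the checker prints the offending sequence of rule firings as a trace like the one above. Here's a hypothetical sketch of the kind of check that catches it - not the real protocol, just a single stored value and a history variable: record the newest value any reader has observed, and assert that later reads never return anything older.

Const
  MAX_VAL: 3;

Type
  val_t: 0..MAX_VAL;

Var
  stored: val_t;     -- the value the system currently serves (a stand-in
                     -- for the real replicated state)
  last_read: val_t;  -- history variable: newest value any reader has seen

Startstate
Begin
  stored := 0;
  last_read := 0;
End;

Rule "write"
  stored < MAX_VAL
==>
Begin
  stored := stored + 1;
End;

Rule "read"
  true
==>
Begin
  -- a read returning something older than a previous read is exactly
  -- the "back in time" bug; if this assertion fails, Murphi dumps the trace
  Assert (stored >= last_read) "read went back in time";
  last_read := stored;
End;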

So now I have a bug to fix, and that's a good thing. Clearly, it involves a very specific set of ill-timed reads, writes, and failures. Could I have found it by inspection or ad-hoc analysis? Hell, no. Could I have found it by testing on live systems? Maybe, eventually, but it probably would have taken months for this particular combination to occur on its own. Forcing it to occur would have required a lot of extra code, plus an exerciser that would amount to a model checker itself - one running across machines at a hundredth of Murphi's speed. With enough real deployments over enough time it would have happened, but the only feasible way to prevent that was with model checking. Try it.

P.S. I fixed the bug.
