Fri 27 September 2013
Model checking is one of the most effective tools available for reducing the
prevalence of bugs in highly concurrent code. Nonetheless, a surprising number
of even very smart and very senior software developers and architects seem
not to know about it. Of the many such people I've worked with over the years, maybe
one in ten have even heard of it, and I can count on one hand the number who've
appreciated its value. Seems like a good subject for a blog post, then. ;)
Let's start with what the heck it is.
"Model checking is a technique for verifying finite state concurrent
systems such as sequential circuit designs and communication protocols."
That's from the blurb for
Model Checking - the seminal book on the
subject by Clarke, Grumberg, and Peled. The way model checking works is by
generating states within a system according to rules you specify (the model),
and checking them against conditions that you also specify to ensure that
invalid states never occur. Some model checkers also check for deadlock and
livelock that might preclude reaching a valid final state, but that's not
essential. It should be pretty obvious that the number of states even in a
fairly simple system can be quite large, so many of the tools also do things
like symmetry reduction or Monte Carlo sampling as well.
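In outline, an explicit-state checker like Murphi is just an exhaustive search: start from the initial state, apply every enabled rule to every state you haven't seen before, and test each new state against the invariants. Here's a minimal Python sketch; the lost-update counter model is my own toy example, not Murphi's input language or anything from the GlusterFS work.

```python
from collections import deque

def check(initial, rules, invariant):
    """Breadth-first search of every reachable state; returns the first
    state that violates the invariant, or None if none does."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for rule in rules:
            for nxt in rule(state):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return None

# Toy model: two processes each do an unlocked read-increment-write on a
# shared counter.  State = (shared, pc0, tmp0, pc1, tmp1); pc 0=read,
# 1=write, 2=done.
def step(i):
    def rule(s):
        shared, *p = s
        pc, tmp = p[2 * i], p[2 * i + 1]
        if pc == 0:                  # read shared into a local temp
            p[2 * i], p[2 * i + 1] = 1, shared
        elif pc == 1:                # write temp + 1 back
            shared = tmp + 1
            p[2 * i] = 2
        else:                        # done: no successor states
            return []
        return [(shared, *p)]
    return rule

# Invariant: once both processes have finished, the counter must be 2.
def both_done_means_two(s):
    shared, pc0, _, pc1, _ = s
    return not (pc0 == 2 and pc1 == 2) or shared == 2

bad = check((0, 0, 0, 0, 0), [step(0), step(1)], both_done_means_two)
# bad is a reachable "lost update" state: both processes read 0 before
# either wrote, so the counter ends at 1 even though both increments ran.
```

Even this five-field toy has a few dozen reachable states; Murphi does the same search, but with hashing, symmetry reduction, and a far richer rule language on top.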
My favorite set of tools in this space is the
Murphi family, of which
CMurphi is the one that has been most usable for me recently. Like many
such tools, Murphi requires that you specify your model in a language that they
describe as Pascal-like but which to my eye looks even more Ada-like. That's
really not as awful as it sounds. I've actually found writing Murphi code
quite enjoyable every time I've done it. The fact that the model is not
written in the same language as the implementation is a known problem in the
field. On the one hand, it creates a very strong possibility that the model
will be correct but the separate implementation will not, reducing the value of
the entire effort. On the other hand, traditional languages struggle to
express the kinds of things that a model checker needs (and furthermore to work
efficiently). I tried to write a real-code model checker once, and
didn't get very far.
To give you some idea of why it's so hard for model checkers to do what they
do, I'll use an example from my own recent experience. I'm developing a new
kind of replication for GlusterFS. To make sure the protocol behaved correctly
even across multiple failures, I developed a Murphi model for it. This model
- consisting of state definitions, rules for transitions between states, and
invariant conditions to be checked - comes to 550 lines (72 blank or comments).
Running this simple model generates the following figures.
That's for a simple protocol, with a small problem size - three nodes, three
writes, two failures. The model was also relentlessly optimized, e.g.
eliminating states that Murphi would see as different only because of fields
that would never be used again. Still, that's a lot of states. When I
introduced a fourth write, the run time tripled. When I introduced a fourth
node, I let it run for five minutes (3M states and 10M transitions) but it
still showed no signs of starting to converge so I killed it. BTW, I forgot to
mention that the model contains five known shortcuts to make it checkable, plus
probably at least as many more shortcuts I didn't even realize I was taking.
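The "fields that would never be used again" optimization can be sketched in Python. The state shape below is a hypothetical toy of my own, not the GlusterFS model: the idea is to canonicalize each state before it goes into the visited set, zeroing any slot the model can no longer read, so that states differing only in dead fields count as one.

```python
# Toy state: (shared, pc0, tmp0, pc1, tmp1) for two processes.  Once a
# process reaches its "done" pc, its scratch tmp can never be read again,
# so we zero it out before the state is recorded as visited.
DONE = 2

def canonical(state):
    shared, pc0, tmp0, pc1, tmp1 = state
    if pc0 == DONE:
        tmp0 = 0   # dead field: its value can't affect any future step
    if pc1 == DONE:
        tmp1 = 0
    return (shared, pc0, tmp0, pc1, tmp1)

# Two states that differ only in a dead field now collapse into a single
# visited-set entry, shrinking the state space the checker must explore.
```

In a Murphi model you'd typically get the same effect by undefining a variable once it's dead, so states that differ only in that variable hash identically.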
If it's so hard and you have to take so many shortcuts, is it still worth it?
Most definitely. Look at those numbers again. How many people do you think
can reason about so many states and transitions, many of them representing dark
unexpected corners of the state space because of forgotten possibilities, in
their heads? I'm guessing
none. Even people who are very good at this
will find errors in their protocols, as has happened to me every time I've done
the exercise. I actually thought I'd done pretty well this time, with nothing
that I could characterize as an out-and-out bug in the protocol. Sure,
there were things that turned out to be missing, so that out of five allowable
implementations only one would actually be bug-free, but I still thought the
exercise was worth it. Then I added a third failure. I didn't expect a
three-node system to continue working if more than one of those were concurrent
(the model allows the failures to be any mix of sequential and concurrent), but
I expected it to fail cleanly without reaching an invalid state. Surprise! It
managed to produce a case where a reader can observe values that go back in
time. This might not make much sense without knowing the protocol involved,
but it might give some idea of the crazy conditions a model checker will find
that you couldn't possibly have considered.
1. write #1 happens while node A is the leader
2. B fails immediately
3. C completes the write
4. read #1 happens while A isn't finished yet (but reads newer value)
5. B comes back up, becomes leader
6. C fails while B is still figuring out what went on
7. A comes back up
8. read #2 happens, gets older value from B
So now I have a bug to fix, and that's a good thing. Clearly, it involves a
very specific set of ill-timed reads, writes, and failures. Could I have found
it by inspection or ad-hoc analysis? Hell, no. Could I have found it by
testing on live systems? Maybe, eventually, but it probably would have taken
months for this particular combination to occur on its own. Forcing it to
occur would require a lot of extra code, plus an exerciser that would amount to
a model checker running across machines, 100x slower than Murphi does. With
enough real deployments over enough time it would have happened, but the only
feasible way to prevent that was with model checking. Try it.
P.S. I fixed the bug.