As I’ve hinted a few times before, one of the projects I’ve really wanted to pursue for a while is a new kind of program to verify code correctness. Better tools to find bugs after the fact are all well and good, but it’s better still if you can find bugs before the code even runs — especially if it’s done as part of an automated process such as a nightly build. Two kinds of checkers currently exist to serve this need.

  • Protocol checkers such as Murphi are really good for detecting errors in high-level logic, including logic that involves a lot of concurrency. The drawback is that they work with an algorithm expressed in their own special-purpose language instead of real code. Since there’s no connection between the version of a protocol that you’ve verified and the one you’ve actually implemented, bugs can still creep in. I once implemented my own protocol checker largely to address this, but it was actually a pretty lame attempt.
  • Code checkers such as the Stanford checker (later commercialized as Coverity’s Prevent and Extend products) or Smatch can be thought of as “lint on steroids,” but that description doesn’t really do them justice. The code- and data-flow analysis they do can actually be very sophisticated; equipped with the right information about when transitions occur and which are valid, they can uncover many kinds of bugs — in real code — that lint never dreamed of. Weaker examples of the breed analyze only within a function, while stronger ones can analyze between functions as well … but still only within a single continuous code path, and that’s the drawback to this class of tool. My favorite example of when this is insufficient involves code that allocates a structure before it sends a network request and then frees it when the reply comes back or a timeout occurs (see the sketch just after this list). This class of tool cannot distinguish between the legitimate case where the structure gets freed in the reply/timeout path (which is in effect a separate execution of the program) and the bug case where it never gets freed at all. Similarly, it can’t deal with control flows that jump between multiple concurrent threads.
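Here’s a minimal C sketch of that request/reply pattern; every name in it (struct pending, send_request, handle_reply, handle_timeout) is invented purely for illustration. The point is that the allocation and the matching free live on paths that never appear together in any single flow the checker can follow.

```c
#include <stdlib.h>

/* Hypothetical structure tracking one outstanding request. */
struct pending {
    int request_id;
    /* ... other per-request bookkeeping ... */
};

/* Path 1: allocate the tracking structure and send the request.  Viewed in
 * isolation, this function looks like it leaks the allocation, because the
 * matching free() lives on a different path. */
struct pending *send_request(int request_id)
{
    struct pending *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    p->request_id = request_id;
    /* ... transmit the request over the network here ... */
    return p;
}

/* Path 2: the reply handler runs later (often on another thread) and is one
 * legitimate place where the structure gets freed. */
void handle_reply(struct pending *p)
{
    /* ... process the reply ... */
    free(p);
}

/* Path 3: the timeout handler is the other legitimate place for the free. */
void handle_timeout(struct pending *p)
{
    free(p);
}
```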

OK, enough background. What do I propose to do about any of this?

My thinking starts with the idea that my earlier attempt at doing protocol analysis of real code took precisely the wrong approach. I was trying to start with a protocol checker and address its weakness, but it’s actually easier to correct the problems with code checkers instead. Basically, what I need is a single-path code checker plus some way of “telling” it about concurrent or discontinuous control flows (the latter including those where the flow passes through another node or program or even a user). I might have to write the code checker myself, because none of the free ones seem general enough for what I want, but in theory this approach should work with an existing checker such as those already mentioned. (This fits quite nicely with my ideas about layering not only on the basis of code structures and communication paths but also with respect for separation between domains of human expertise, by the way.)

The second ingredient is something akin to automatic unit-test generation. If the user can specify a protocol (or API) in some form that connects to the actual code, it becomes possible to generate the hundreds or thousands of possible invocation sequences as code, which is then run through the code checker. The connection has to be fairly specific so that, for example, the message sent by one call is the same one received and processed by the next, but in the end the high-level protocol description does lead to the actual implementation code being checked appropriately. Thus, bugs caused by simple coding mistakes, like the allocate/free example above or yesterday’s infinite loop, could be found automatically and precisely (i.e. without also producing lots of false alarms).
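To make that second ingredient concrete, here’s a hedged sketch of what the generated code might look like, reusing the invented send_request/handle_reply/handle_timeout names from the earlier example: each invocation sequence allowed by the protocol becomes one straight-line function, so a conventional single-path checker sees the allocation and the matching free (or its absence) on a single continuous path.

```c
/* Assumes the hypothetical API from the earlier sketch. */
struct pending;
struct pending *send_request(int request_id);
void handle_reply(struct pending *p);
void handle_timeout(struct pending *p);

/* Hypothetical generator output: one function per legal invocation sequence.
 * A single-path checker analyzing these functions sees allocation and free
 * on the same path, which it could never do with the handlers viewed
 * separately. */
void generated_seq_reply(void)
{
    struct pending *p = send_request(1);
    if (!p)
        return;
    handle_reply(p);     /* freed on the reply path */
}

void generated_seq_timeout(void)
{
    struct pending *p = send_request(2);
    if (!p)
        return;
    handle_timeout(p);   /* freed on the timeout path */
}

/* If the real handle_reply() forgot to free the structure, the checker would
 * report a leak at the end of generated_seq_reply(): one precise report on
 * one generated path, rather than a vague warning or nothing at all. */
```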

I’m leaning heavily toward using some flavor of Petri Net for the protocol specification, for several reasons. At the technical level, they are expressive enough to represent the protocol/code connections I’ll be dealing with while remaining fairly intuitive. At a different level, Petri Nets are already in pretty common use as a specification method and there are many tools available to manipulate them, making it relatively easy for users to create protocol descriptions and then keep them current with respect to both the spec and the code without forcing new language/format restrictions on either. Lastly, there’s a large established body of work regarding the analysis of Petri Nets, which might ease my own task in creating a tool that uses them. Some flavors of state machines might be even more intuitive and familiar to users, but they’re not quite as well suited to expressing the kinds of connections I need and there has been somewhat less work done on using them for tasks like this. I’m still thinking about this part, though, so if anyone has any thoughts to add about representations I’m all ears.
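Purely as a strawman for that representation question, here’s one hypothetical way the request/reply/timeout protocol from the earlier sketches could be encoded as a Petri Net whose transitions are annotated with the implementation functions they correspond to; that annotation is the protocol/code connection the generation step would rely on. All names and the encoding itself are made up for illustration.

```c
/* Hypothetical, hand-written encoding of a tiny request/reply/timeout
 * protocol as a Petri Net. */

#define MAX_ARCS 4

struct place {
    const char *name;
    int         tokens;              /* current marking */
};

struct transition {
    const char   *name;
    const char   *impl_fn;           /* implementation function this models */
    struct place *inputs[MAX_ARCS];  /* NULL-terminated input places */
    struct place *outputs[MAX_ARCS]; /* NULL-terminated output places */
};

/* Places: the request is not yet sent, outstanding, or completed. */
static struct place idle        = { "idle",        1 };
static struct place outstanding = { "outstanding", 0 };
static struct place done        = { "done",        0 };

/* Transitions: sending the request, and the two ways it can complete. */
static struct transition protocol[] = {
    { "send",    "send_request",   { &idle,        NULL }, { &outstanding, NULL } },
    { "reply",   "handle_reply",   { &outstanding, NULL }, { &done,        NULL } },
    { "timeout", "handle_timeout", { &outstanding, NULL }, { &done,        NULL } },
};
```

Walking the reachable markings of a net like this yields exactly the legal firing sequences (send then reply, send then timeout), which in turn is the list of invocation sequences the generation step needs to emit.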

Obviously, if I could afford the time off work and didn’t have that nagging no-bachelor’s-degree problem, I’d seriously consider doing this as a thesis project for a PhD (I think it meets the criteria most would apply to such). As it is, though, this will probably remain no more than a concept I play with during idle moments for at least another year, and then maybe I’ll be able to make some progress on my own.