Yesterday I chased down a bug that turned out to be caused by code approximately like the following.

lock();
for (;;) {
    req = find_completed_request();
    if (req) {
        unlock();
        return req;
    }
    unlock();
    wait_for_next_completion();
    lock();
}

The problem with this code is the case when it’s called with no requests outstanding. Since none are outstanding, none will complete, and there’s no guarantee that any new ones will be issued — and in the corresponding real code none would. Even more annoying was that the wait function (within the Linux kernel) was implemented in a way that would stop the debugger cold when it tried to attach to the process, so now I’d have a hung debugger and still no way of seeing where the process was stuck. Fortunately I had enough context to lead to this function and I’m pretty good at finding bugs by inspection.

This particular anti-pattern is one I’ve encountered many times before, to the point where it would probably make my Ten Least Wanted list. What’s particularly interesting about it is that it’s an example of a bug that static code checkers are very nearly helpless against. (Yeah, I ended a sentence with a preposition. Just try to rewrite that sentence so I don’t … or just get over it.) Firstly, the code checker would need to understand in general about multiple code paths (or instances of the same code path) executing simultaneously, and I’ve yet to see one that does. Protocol checkers are good at that, but this is a code-level problem. Secondly, the code checker would need to understand enough about the possible sequences of calls to recognize this particular indefinite-postponement problem without raising false alarms about similar cases where a request does in fact complete and end the wait. As it turns out, I have some ideas about how to fuse protocol-level and code-level checking to address this kind of thing, and I’ll be writing about those soon, but for now I’ll just gripe about people who unintentionally create infinite loops because they didn’t properly consider whether those loops’ termination conditions would always be met.

By the way, refactoring the above fragment so it doesn’t exhibit this problem is left as an exercise for the reader.