One of the things that often amazes me is how poor most programmers are at debugging. Yeah, I know, it sounds like I’m putting on airs when I say that, but it’s true. Schools basically don’t teach debugging as such, though you’ll pick up a bit of it in the course of doing projects, and a lot of people never get much better at it later in their careers. Many of the articles and books you’ll find on debugging tell you no more than how to use a particular debugger or how to debug a particular program, which is useful but still too narrowly focused. My intent is to introduce some more general approaches to debugging.

Most people who are poor at debugging fall into one of two extremes.

  • Some people think that debugging just means sitting in a debugger all day stepping through code line by line, or maybe speeding up the process a little by setting a breakpoint and then stepping from there. That’s great when it works, but there’s a vast territory of environments and problems for which it doesn’t. For example, it won’t help you find a timing problem or race condition in any kind of parallel or distributed system (including OS kernels), where the rest of the system is likely to continue, and possibly fail due to timeouts, while you sit in the debugger (see the sketch after this list).
  • Other people eschew debuggers entirely, and instead claim that problems should always be solved by examining code. This works OK for code with which you’re intimately familiar (most often because you wrote it), but on large multi-person projects that’s not likely to be the case. It’s also not a very good approach for problems involving concurrency, where you have to understand not only how the code runs by itself but also how it interacts with other code (or other instances of the same code) running elsewhere simultaneously.
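
To make the first point concrete, here’s a minimal sketch of that kind of failure, assuming POSIX threads (the worker, the watchdog, and the timings are all invented for illustration): pause the worker at a breakpoint and the watchdog keeps running, so its timeout fires and the behavior you came to observe changes.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* Shared flag the watchdog polls; atomic only so the sketch is well-defined. */
static atomic_int response_ready = 0;

static void *worker(void *arg)
{
    (void)arg;
    /* Imagine stopping here at a breakpoint.  The watchdog in main()
     * keeps running, its timeout fires, and the failure you wanted to
     * observe is replaced by a different one. */
    usleep(100 * 1000);                 /* pretend to do 100 ms of work */
    atomic_store(&response_ready, 1);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    /* Watchdog: give the worker up to one second, then presume it dead. */
    for (int waited = 0; waited < 10 && !atomic_load(&response_ready); waited++)
        usleep(100 * 1000);

    puts(atomic_load(&response_ready) ? "worker responded"
                                      : "timeout: worker presumed dead");
    pthread_join(t, NULL);
    return 0;
}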

Fortunately, there is a conceptual framework so general and so powerful that it can help even in the situations where these particular approaches fail. It’s called the scientific method. Many have heard the term, but have little appreciation for how it might apply to the problem of debugging a program. In essence, it involves three parts.

  1. Devise a theory to explain observed behavior.
  2. Devise an experiment to test the theory.
  3. Run the experiment and observe the results.

As the saying goes: lather, rinse, repeat. The important thing here is to think of what you’re doing as an experiment. One of the most common mistakes I see people make when debugging is going for a fix too early, and refusing to consider any change that doesn’t look like it will lead directly to a fix. Usually that’s because there’s pressure to provide a fix now now now, and time spent running experiments doesn’t feel as productive as time spent working on a fix. Believe me, I know that pressure. I know the temptation to gamble, to cross your fingers and hope you can be the hero who fixed the bug in record time. Resist it. I’ve seen more valuable debugging time wasted chasing one red herring after another that way than I’ve ever seen wasted on methodical approaches. Even if you do occasionally complete a “hail Mary” into the end zone, all those times you tried and failed will still cost you both time and reputation.

In traditional science, there will often be many people studying a problem, testing multiple theories at once. When you’re debugging, though, you don’t have that luxury. Often you’ll be working alone, or at most with one or two others. That means you have to come up with all of those theories yourself. Then you need to prioritize them according to which seem most likely, and determine which experiments will be most helpful in confirming or eliminating possibilities. The “experiment” might mean examining code, walking through the forest of what calls what and seeing whether there are in fact any code paths that match a theory. Other times it might involve running a live test – of either components in isolation or the whole system – with breakpoints set or extra trace information added to see whether certain theorized conditions actually occur. Either kind of experiment takes time, so it’s important to do the most informative experiments first. Programmers tend to get attached to their favorite theories, so to avoid confirmation bias it’s important to decide, before you run each experiment, what each result would mean. This exercise also helps to identify experiments that might provide more than one useful result in a single run, thus saving time. What you should end up with is a sort of decision tree, much like the ones you’re already familiar with from code:

if (experiment[A] == result[B]) {
    return theory[C];
}
else if (experiment[D] == result[E]) {
    ...

As with code, if you do the cheapest and most informative checks first, your “program” (debugging session) will run faster. By developing a set of theories instead of just one, and prioritizing the experiments that distinguish between them, you keep your average-case time spent experimenting to a minimum. As an added benefit, if you go a day or two without finding the problem and somebody starts asking whether you’ve considered X or why you haven’t tried Y, you’ll be completely prepared to give a satisfactory answer.
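
If it helps to see that planning step spelled out, here’s a hypothetical sketch in the same spirit (every theory, procedure, and number below is invented for illustration): each candidate theory gets an experiment, a rough cost, and an agreed-upon meaning for each outcome, decided before anything is run, and the cheapest checks go first.

#include <stdio.h>
#include <stdlib.h>

struct experiment {
    const char *theory;          /* what we suspect */
    const char *procedure;       /* how we would test it */
    int cost_minutes;            /* rough cost estimate */
    const char *if_positive;     /* agreed meaning of a positive result */
    const char *if_negative;     /* agreed meaning of a negative result */
};

static int by_cost(const void *a, const void *b)
{
    const struct experiment *x = a, *y = b;
    return x->cost_minutes - y->cost_minutes;
}

int main(void)
{
    struct experiment plan[] = {
        { "stale cache entry", "flush the cache, replay the failing request", 5,
          "chase a cache invalidation bug", "the cache is not the culprit" },
        { "race on shared counter", "run the component under a thread sanitizer", 30,
          "narrow down to the reported data race", "rule out the counter" },
        { "bad input from upstream", "capture and diff the request payloads", 15,
          "chase the upstream producer", "input is clean; look inward" },
    };
    size_t n = sizeof plan / sizeof plan[0];

    /* Cheapest first; a real plan would also weigh how much each
     * outcome narrows the search, not just the minutes it costs. */
    qsort(plan, n, sizeof plan[0], by_cost);

    for (size_t i = 0; i < n; i++)
        printf("%2d min: %s (%s)\n        yes -> %s; no -> %s\n",
               plan[i].cost_minutes, plan[i].procedure, plan[i].theory,
               plan[i].if_positive, plan[i].if_negative);
    return 0;
}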

Debugging scientifically might not seem as exciting as relying on hunches, but if you want excitement, go skydiving. The most exciting thing about debugging should be the prospect of being done and getting out of the office, or of shipping products and making money so you don’t have to have an office at all.