In my last technical article, about adaptive readahead, I touched on a subject I’ve been meaning to write more about. Here’s the relevant passage.

Having multiple parts of a system each trying to be smart like this, when being smart means consuming resources while remaining oblivious to the others, often hurts overall performance and robustness as well.

I see this kind of “interference” between components quite a bit; in fact, I’d say it’s one of the most common marks of bad design. For example, as readers of my article on server design surely know, having multiple components each manage their own thread or memory pools for their own maximal benefit, without any kind of coordination, is often a bad thing. As system-wide resources become overcommitted, all sorts of thrashing and starvation and other problems set in. When every component tries too hard to maximize its own performance, the result is often lower performance throughout the system. I’ll resist the temptation to make a political point here.
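To make that concrete, here’s a minimal sketch of the kind of coordination I’m arguing for: instead of each component sizing its own thread pool as though it owned the whole machine, every component draws from one shared budget. The names here (pool_budget_t, budget_request) are hypothetical, not from any real system.

```c
#include <stdio.h>

/* One shared budget for the whole system, instead of N independent
 * pools that collectively overcommit the machine. */
typedef struct {
    int total;    /* threads the whole system can afford */
    int granted;  /* threads handed out so far */
} pool_budget_t;

/* Grant up to 'want' threads without exceeding the system total. */
static int budget_request(pool_budget_t *b, const char *who, int want)
{
    int avail = b->total - b->granted;
    int grant = (want < avail) ? want : avail;
    b->granted += grant;
    printf("%s asked for %d threads, got %d\n", who, want, grant);
    return grant;
}

int main(void)
{
    pool_budget_t budget = { .total = 16, .granted = 0 };

    /* Each component asks for what it wants; the shared budget keeps
     * the sum within what the machine can actually support. */
    budget_request(&budget, "network", 8);
    budget_request(&budget, "storage", 8);
    budget_request(&budget, "frobnicator", 8);  /* budget exhausted: gets 0 */
    return 0;
}
```

If each of those three components had sized its own pool at eight threads instead, the system would be trying to run 24 threads on a machine that can support 16, which is exactly the overcommitment described above.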

Such interference is quite apparent in the data-storage world, in a couple of different ways. For example, every time an operating system changes the way it does readahead, it changes the access pattern seen by a disk array that might also be trying to do readahead. In some cases this can lead the array to make wrong decisions about which blocks to prefetch, wasting cache space and bandwidth that could otherwise be put to better use. Similarly, an operating system’s buffer or page cache does a pretty good job of turning a mostly-read access pattern into a mostly-write one, since most reads are absorbed by the cache while writes must eventually reach the array. If disk arrays’ caches were designed to optimize a mostly-read pattern at the expense of a mostly-write one, that too would be a kind of interference. It doesn’t happen, because disk-array designers are not idiots; in fact, it’s a little-known and under-appreciated fact that the big caches on high-end disk arrays are mostly there to absorb writes.

Caching is also involved in what is probably the most severe kind of interference: the kind that affects not just performance but correctness. In my experience, a frightening percentage of bugs are caused by programmers caching or copying data without any strategy for how changes to the separate copies will be reconciled. The result is either stale data being returned to a user, or outright system failure as different parts of the system make decisions based on what’s supposed to be the same information but isn’t. Caching is one of the most common techniques people use to improve performance, but the interference that often results can be a killer.
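To show what I mean by a reconciliation strategy, here’s a minimal sketch of about the simplest one: a generation counter on the authoritative copy, which lets a cached copy detect that it has gone stale. The types and names here (master_t, cached_view_t, and so on) are hypothetical, purely for illustration.

```c
#include <stdio.h>

typedef struct {
    int value;            /* authoritative data */
    unsigned generation;  /* bumped on every change */
} master_t;

typedef struct {
    int value;            /* private copy of the data */
    unsigned generation;  /* generation at the time of copying */
} cached_view_t;

static void master_update(master_t *m, int v)
{
    m->value = v;
    m->generation++;      /* every writer must bump this */
}

static int cached_read(cached_view_t *c, const master_t *m)
{
    if (c->generation != m->generation) {  /* detect staleness */
        c->value = m->value;               /* refresh the copy */
        c->generation = m->generation;
    }
    return c->value;
}

int main(void)
{
    master_t m = { .value = 1, .generation = 0 };
    cached_view_t c = { .value = 1, .generation = 0 };

    master_update(&m, 2);
    /* Without the generation check, this would return the stale 1. */
    printf("cached read sees %d\n", cached_read(&c, &m));
    return 0;
}
```

In a real system the two copies usually live in different components, written by different people, and somebody forgets to bump the counter on one of the write paths; that’s exactly where those bugs come from.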

If interference is so bad, why is it so common? Usually it’s because one component is trying to do another component’s job, and the reasons for that, in turn, mostly come down to laziness and arrogance. Laziness is a factor because coordination requires work; it’s easier to have your piece of code go off and do its own thing (e.g. with its own cache) than to coordinate with anything else. Even better, if the Frobnicator component is currently the bottleneck in the system, then the guy who bolts a standalone cache onto it gets credit for fixing the bottleneck even though he may have hurt overall performance or introduced serious bugs. “The Frobnicator is no longer the bottleneck. Great job, Joe! Here’s a bonus.” Yeah, thanks, Joe, for making everyone else’s job harder. Attempts to save a few cycles out of millions by making zillions of direct function calls instead of going through the few dispatch tables and callbacks of a well-defined API (see the sketch below) fall into the same category.

The arrogance part is our old friend the “Not Invented Here” syndrome. The most common rationalization for embedding redundant functionality in a component is “this way we can get exactly the XXX functionality we need without all the overhead of stuff that’s there for someone else,” where XXX is functionality that properly belongs in a separate component. Sometimes this too comes down to laziness: just as it’s often easier to rewrite code (and reintroduce already-fixed bugs) than to fix it, it’s often easier to reimplement another component than to extend it. Other times it’s just because the filesystem guys think they can write a better volume manager than the volume-manager guys, or the database guys think they can write a better block cache than the operating-system guys. They’re usually wrong, but programmers, especially those who have developed a high level of skill without the self-discipline to match, love trying to outdo one another. Unfortunately, it’s their fellow programmers and users who suffer the consequences.
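For what it’s worth, the “overhead” being dodged in that few-cycles argument is usually nothing more than an indirect call. Here’s a minimal sketch of such a dispatch table in C; the cache_ops_t interface and the functions behind it are hypothetical.

```c
#include <stdio.h>

/* A small, well-defined interface: callers go through this table
 * instead of calling another component's internals directly. */
typedef struct {
    int  (*lookup)(int key);             /* returns -1 on a miss */
    void (*insert)(int key, int value);
} cache_ops_t;

/* Stub implementations standing in for a real cache. */
static int simple_lookup(int key)
{
    (void)key;
    return -1;  /* always a miss in this sketch */
}

static void simple_insert(int key, int value)
{
    (void)key;
    (void)value;  /* a real cache would store the pair here */
}

static const cache_ops_t simple_cache = {
    .lookup = simple_lookup,
    .insert = simple_insert,
};

/* A caller that knows only the interface, not the implementation. */
static int get_value(const cache_ops_t *ops, int key)
{
    return ops->lookup(key);
}

int main(void)
{
    printf("lookup -> %d\n", get_value(&simple_cache, 42));
    return 0;
}
```

Callers depend only on the table, so the implementation behind it can be replaced, shared, or coordinated with the rest of the system without touching them, and the price is one indirect call out of the millions of cycles each real operation already spends.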

The thing that kills me about all this is that decomposing a complex system into the right set of components, with the right interfaces between them, in a way that minimizes interference (or the potential for it) is the very essence of an architect’s or designer’s job. If you can’t figure out a way to make generally useful functionality generally available, then you’re not doing a very good job as an architect or designer and should have any such title removed from your business card. If a significant part of your code is copied (either literally or conceptually) from somewhere else, or if you often find yourself fixing what’s basically the same bug in multiple places, or if every code change seems to require re-tuning everything to recover performance, you’re probably suffering the effects of interference. If you’re smart, you’ll figure out a way to refactor so that the system “flows” more coherently instead of fighting against itself.