There are many ways to be wrong in a technical discussion, but they usually fall into one of two categories – ways that can easily be corrected, and ways that cannot. In the first case, somebody else can provide a counterexample or explanation that rests on only a few widely agreed-upon axioms, so that even those not particularly familiar with the subject matter can see that the original claim was wrong. In the second case, showing how or why a claim is wrong can require making subtle distinctions or pointing out less well-known facts. In the process, the whole discussion can often be drawn into a morass of contested claims leading to no clear conclusion. The first type of wrongness often isn’t worth addressing at all – the author and/or others will quickly spot the error without any assistance – or can be dealt with quickly. The second kind can persist almost indefinitely, but still needs to be addressed because it can trap the unwary and lead them to waste a lot of time pursuing bad ideas when they could have been pursuing good ones. As futile as it may be, I’m going to address one particularly pernicious example regarding a topic I’ve written about many times – the infamous CAP theorem.

xkcd #386

In his latest missive, Michael Stonebraker claims to be addressing some mis-perceptions of his views on Brewer’s CAP theorem. I would say many of those “mis-perceptions” are entirely accurate, but it’s no surprise that he’d want to back-pedal a bit. Unfortunately, it’s not very effective to fight mischaracterization of one’s views by tossing out mischaracterizations of others. For example (near the end):

In summary, appealing to the CAP theorem exclusively for engineering guidance is, in my opinion, inappropriate.

I’d like to know who’s appealing to the CAP theorem exclusively. Not me. Not Coda Hale, who calls the three properties of CAP “Platonic ideals” and repeatedly refers to design heuristics and compromises involving more than just CAP. Not anyone I’ve seen involved in this discussion. The most extreme view I’ve seen is the complete rejection of CAP by those who just can’t let go of consistency and transactions. Those are easy models, and it’s not hard to understand their appeal, but sometimes they’re just not appropriate for the problem at hand. Just as simpler Newtonian physics had to yield to the “weird” models proposed by Planck and Einstein, computing has moved into the “relativistic” world of Lamport and Lynch. It’s a world where there is no such thing as absolute time in any system with more than one physical clock, and where, for all practical purposes, the same event occurs at different times in different places (even without network partitions). The concepts of time that matter in such systems are before/after and causality, not duration. Because of this, node speed really doesn’t matter except to the extent that it affects the number of nodes you need to perform a task.
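To make the “before/after and causality” point concrete, here is a minimal sketch of a Lamport logical clock – the standard construction, not anything from Stonebraker’s or Hale’s articles: each node keeps a counter, bumps it on local events, stamps outgoing messages, and takes the maximum on receipt. The result is a consistent happens-before ordering that never consults a wall clock.

```python
class LamportClock:
    """Minimal logical clock: ordering without any notion of wall-clock time."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Any local step advances the clock.
        self.time += 1
        return self.time

    def send(self):
        # Stamp outgoing messages with the current logical time.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # On receipt, jump past both our own history and the sender's.
        self.time = max(self.time, msg_time) + 1
        return self.time


# Two nodes, no shared clock: before/after is still well-defined.
a, b = LamportClock(), LamportClock()
t_send = a.send()            # event on node A
t_recv = b.receive(t_send)   # causally later event on node B
assert t_recv > t_send       # causality preserved; duration never enters into it
```

As Stonebraker puts it,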

Next generation DBMS technologies, such as VoltDB, have been shown to run around 50X the speed of conventional SQL engines. Thus, if you need 200 nodes to support a specific SQL application, then VoltDB can probably do the same application on 4 nodes. The probability of a failure on 200 nodes is wildly different than the probability of failure on four nodes.

What he fails to mention is that VoltDB gains that speed mostly by being a memory-based system, with all of the data-protection and capacity limitations that implies. If you need 200 nodes to support a specific SQL application, then you might still need 200 nodes not because of performance but because of capacity, so VoltDB won’t do the same job on 4 nodes. That’s exactly the kind of “trap the unwary” omission I was talking about.
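To see why raw speed alone doesn’t shrink the cluster, here’s a back-of-the-envelope sketch with made-up numbers (not taken from any VoltDB benchmark): node count is set by whichever constraint binds, and for an in-memory system that is often capacity rather than throughput.

```python
def nodes_needed(total_tps, tps_per_node, dataset_gb, usable_ram_gb_per_node):
    """Cluster size is set by the binding constraint, not just by engine speed."""
    for_throughput = -(-total_tps // tps_per_node)             # ceiling division
    for_capacity = -(-dataset_gb // usable_ram_gb_per_node)    # data must fit in RAM
    return max(for_throughput, for_capacity)

# Hypothetical workload: modest transaction rate, large dataset.
# A 50x faster engine collapses the throughput requirement to a couple of nodes...
print(nodes_needed(total_tps=500_000, tps_per_node=5_000 * 50,
                   dataset_gb=20_000, usable_ram_gb_per_node=100))   # -> 200
# ...but the answer is still 200, because 20 TB still has to live in RAM somewhere.
```

With numbers like these, the faster engine needs two nodes for throughput and two hundred for capacity, so two hundred it is.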

He’s right on the last point in that passage, though: the probability of a failure among 200 nodes really is wildly different from the probability among four – exactly the point Coda makes, going one better by providing the actual formula, in his supposedly “misrepresentative” article.
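For concreteness, the formula in question is the standard one: if each node independently fails in some window with probability p, the chance that at least one of n nodes fails is 1 − (1 − p)^n. A quick sketch with an illustrative, made-up p:

```python
def p_any_failure(n, p_node):
    """Probability that at least one of n nodes fails, assuming independent failures."""
    return 1 - (1 - p_node) ** n

p = 0.01  # illustrative per-node failure probability over some window
print(p_any_failure(4, p))     # ~0.039
print(p_any_failure(200, p))   # ~0.866
```

However, it’s worth examining the causes of node failures. Again, Stonebraker’s take.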

The following important sources of outages are not considered in the CAP theorem.

Bohrbugs. These are repeatable DBMS errors that cause the DBMS to crash…

Application errors. The application inadvertently updates (all copies) of the data base…

Human error. A human types the database equivalent of RM * and causes a global outage…

Reprovisioning.

Back here in reality, most Bohrbugs will cause a single node to crash, and the very relativity that makes distributed systems so challenging also makes it very unlikely that other nodes will experience the exact same sequence of events that triggers the failure. Other than overload, bugs that cause such “contagion” to take down the entire system are very rare. That’s why they’re newsworthy. You never see anything about twenty servers at Google or Yahoo failing, because that happens every day and because the people who designed those systems understand how to deal with it. More about that in a moment.

Going down the list, of course CAP doesn’t address application or human errors. Neither does Stonebraker’s approach. Neither can, because neither can control how applications or humans behave. Application errors have to be fixed in the applications, and human errors have to be fixed at a higher level too – e.g. by using automation to minimize the need for human intervention. It’s not worth talking about the cases where no tradeoffs are possible. What do you “trade off” to make human error disappear? Citing these kinds of errors as shortcomings of CAP, without noting their more general intractability, is just another dirty trick. As for reprovisioning as a “stop the world” operation, Benjamin Black and others have already pointed out that it’s simply not so for them… and I’ll add that it need not be so even in a more consistency-oriented world. In any system that can survive the failure of some nodes, those nodes can be taken offline and upgraded while the rest of the system keeps running. The fact that some systems don’t have that property is merely a deficiency in their implementation, not a commentary on CAP.
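To make the “upgrade without stopping the world” point concrete, here’s a toy sketch (not any particular system’s API) of rolling reprovisioning over a quorum-based cluster: take one node out, upgrade it, bring it back, and writes keep flowing as long as a majority stays up.

```python
class Node:
    def __init__(self, name):
        self.name, self.online = name, True

def rolling_upgrade(nodes, upgrade):
    """Toy rolling-upgrade loop: the cluster never drops below quorum."""
    quorum = len(nodes) // 2 + 1
    for node in nodes:
        node.online = False                               # take one node out of service
        assert sum(n.online for n in nodes) >= quorum     # majority still available for writes
        upgrade(node)                                     # reprovision while the rest keep serving
        node.online = True                                # rejoin (and catch up) before the next one

cluster = [Node(f"n{i}") for i in range(5)]
rolling_upgrade(cluster, upgrade=lambda node: None)       # placeholder for the real upgrade step
```

A real system also has to let the rejoining node catch up on whatever it missed, but the availability math really is that simple: as long as you never exceed the number of failures the system already tolerates, reprovisioning is not a global outage.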

What I find most misguided about Stonebraker’s article, though, is this.

In my experience, network partitions do not happen often. Specifically, they occur less frequently than the sum of bohrbugs, application errors, human errors and reprovisioning events. So it doesn’t much matter what you do when confronted with network partitions. Surviving them will not “move the needle” on availability because higher frequency events will cause global outages. Hence, you are giving up something (consistency) and getting nothing in return.

So, because network partitions occur less often than some other kinds of errors, we shouldn’t worry about them? Because more people die in cars than in planes, we shouldn’t try to make planes safer? Also, notice how he says that network partitions are rare in his experience. His experience may be vast, but much of it is irrelevant because the scale and characteristics of networks nowadays are unlike those of even five years ago. People with more recent experience at higher scale seem to believe that network partitions are an important issue, and claiming that partitions are rare in (increasingly common) multi-datacenter environments is just ridiculous. Based on all this plus my own experience, I think dealing with network partitions does “move the needle” on availability and is hardly “nothing in return”. Sure, being always prepared for a partition carries a cost, but so does the alternative, and that’s the whole point of CAP.

Remember how I exempted overload from my comment about systemic failure? I did that because the #1 cause of system overload is parts of the system being too dependent on one another. Sooner or later, even in the best-designed system, some node somewhere is going to become overloaded. The more nodes you have waiting synchronously for each other’s responses, as they must when the system is built around consistency or consensus, the more likely it becomes that a local bottleneck will turn into a global traffic jam. Forcing non-quorum nodes down to preserve consistency among the rest – the other part of the traditional approach that Stonebraker clings to – only makes this even more likely, because it increases the load on the survivors. That’s the other part of why you don’t hear about a few nodes going down at Facebook or Amazon. Their engineers know that consistency has a cost too. Consistency means coupling, and coupling is bad for availability, and availability matters more to some people than consistency.
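The coupling argument is easy to put numbers on (made-up numbers again, purely for illustration): if a request has to wait synchronously on all n replicas, any one slow or overloaded node stalls the whole request, and forcing minority nodes down just concentrates the same load on fewer survivors.

```python
def p_request_stalls(n, p_slow):
    """A fully synchronous request stalls if any one of its n nodes is slow."""
    return 1 - (1 - p_slow) ** n

print(p_request_stalls(1, 0.02))    # ~0.02: one node's problem stays local
print(p_request_stalls(10, 0.02))   # ~0.18: tight coupling spreads it around

# Forcing non-quorum nodes down concentrates load on the survivors.
total_load, nodes = 90_000, 10
print(total_load / nodes)           # 9000 requests/s per node before
print(total_load / (nodes - 4))     # 15000 requests/s per node after four are forced down
```

That is the feedback loop that turns a local bottleneck into a global traffic jam.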

The conclusion, then, is much as it was before. Partitions are real and significant, so we’re left with a tradeoff between consistency and availability. Real engineers make that tradeoff in various ways. Other people try to deny that the tradeoff is necessary, or that any choice other than their own might be valid, and they make up all manner of counterfactual reasons why. Which would you rather be?