Stu Charlton has written an interesting post called Confused CAP Arguments. The tone is mostly civil and reasonable, though it’s a bit disappointing to see some of the negative stereotypes (“cocktails…newly-graduated devs”) and strawmen (“only using CAP”) repeated yet again. Oh, well. The issue of “is CA possible” seems to be rather central to Stu’s post, and it also came up in my last post on the subject, so I’ll delve into that aspect a bit and even share some of the war stories that I promised.

First, though, a bit of background. Back in 1993, before either Stu or the current generation of anti-RDBMS hipsters entered the workforce, I was working on HACMP/6000 at Clam. At first I worked on the lock manager, which can be thought of as a key/value store where the values are very small but have complex semantics. A lot of the hashing and replication techniques it used would be instantly recognizable to NoSQL folks. Later, I was responsible for the network routing and monitoring infrastructure. The “DGSP” (Die, Gravy Sucking Pig) message that we used to enforce quorum was my invention. I became pretty intimately familiar with all of the ways such systems could fail. In particular, I became familiar with how even systems with redundant network hardware and cutting-edge network software could still experience network partitions.

To address one of Stu’s points, about treating network failures as the only kind, one observation I made pretty early on is that for all practical purposes they are the only kind. It’s not hard to promote any local failure to a node failure, and it greatly simplifies the rest of the system if you do.

Anyway, the real point is that I might not have been a database guy myself, but I directly supported them (including time on-site at Oracle HQ), so I have very little patience for database guys trying to pretend that only they have the proper context to understand the application or business value of various tradeoffs in this area. IBM wouldn’t have sold tens of thousands of HACMP licenses if we hadn’t thoroughly understood those same issues. So, back to the issue at hand. Here’s one of the less coherent parts of Stu’s post.

What is Partition Intolerance?

This is never well defined. Like lactose intolerance, the results can be, err, unpredictable. It seems to imply one of: complete system unavailability, data inconsistency, or both.

First, data inconsistency is a sacrifice of C. That’s pretty clearly defined. Gilbert and Lynch defined P thus: the system continues to operate despite arbitrary message loss. That’s still a bit vague, both in terms of “the system” (all or part) and “continues to operate” (as before or degraded somehow). I tried to provide a clearer interpretation a while back, distinguishing P from A by contrasting the effect of a partition on requests vs. nodes. In short, a non-A system fails some nodes, while a non-P system fails some requests. Some might disagree with that definition – several did in the comments to that article – but I think it’s clear enough to make “never well defined” untrue.
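To make that contrast concrete, here’s a tiny sketch of my own (not anything from Gilbert and Lynch, and not Stu’s; the node names and keys are made up, and it assumes data is replicated on both sides of the split so the surviving side could serve anything if allowed to):

```python
MINORITY = {"n4", "n5"}        # the losing side of the partition (hypothetical)
STRANDED = {"y"}               # keys whose locks/primaries sit on that side

def non_A_handle(node, key):
    # Sacrificing A: whole nodes fail. The minority side is forced down and
    # its locks are broken; requests landing there get nothing, but every
    # request that reaches a majority-side node is served.
    if node in MINORITY:
        return f"{node}: node out of service"
    return f"{node}: served {key}"

def non_P_handle(node, key):
    # Sacrificing P: individual requests fail. All nodes stay up, but any
    # request needing a key stranded on the far side fails (or blocks),
    # no matter which node received it.
    if key in STRANDED:
        return f"{node}: request for {key} failed"
    return f"{node}: served {key}"
```

In the first case the damage is per-node and predictable; in the second it’s per-request, and it depends entirely on which data each request happens to touch.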

What Stu does get absolutely right is that the effect of a partition in a partition-intolerant system can be unpredictable. This is why I don’t like CA systems much. If you’ve partitioned your data somehow and some nodes become unreachable, then some requests will be unable to reach the data (or acquire the locks) that they need, and a CA system must either block or fail such requests. If it could satisfy them by eliminating possible sources of conflict, e.g. by forcing nodes down and breaking their locks, it would be a CP system (with reduced capacity). If it could satisfy them by allowing the conflict and hoping to perform some sort of resolution later, it would be an AP system (with reduced consistency). For a CA system, which preserves both capacity and consistency, blocking or failing are the only options. Failing requests is the better of the two, but it’s rarely chosen, because people don’t usually choose CA deliberately. Instead, they fall into it by denying that partitions could occur, putting unbounded faith in the efficacy of redundant networks. Anybody who has seen the dark underbelly of network infrastructure knows how misplaced that faith can be.

In “CAP-naive” systems, the most common behavior I used to see was that some requests would succeed and others would block indefinitely, waiting for some resource they needed. It’s bad enough that hash-based resource distribution makes these blockages completely unpredictable, but there’s an even bigger iceberg to worry about: cascading failures. Database requests often need many resources, and if one blocks unexpectedly while acquiring the fifth, it blocks while still holding the other four. That often causes even more requests to block waiting for those four, and so on until practically every resource is unavailable and the whole system locks up. This isn’t theoretical. I’ve seen it happen many, many times. I’ve seen it happen almost instantaneously when a critical resource became unreachable in a busy system, and I’ve seen it unfold slowly over hours or even days when a less popular resource did so in a lightly loaded system. It’s not some exotic, exceptional behavior in CA systems. It’s the norm when partitions occur – and they do.
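If you want to see how fast that unfolds, here’s a toy simulation (entirely my own sketch, not code from any system I worked on; the resource count, request size, and the choice of which resource is unreachable are all arbitrary): each request needs several resources, and any request that touches either the unreachable one or something pinned by an earlier blocked request blocks too, holding everything it had already acquired.

```python
import random

RESOURCES = [f"r{i}" for i in range(20)]   # resources spread across nodes
UNREACHABLE = {"r13"}                      # pretend this one is on the partitioned node
held_by_blocked = set()                    # resources pinned by blocked requests

blocked = served = 0
random.seed(1)

for _ in range(10_000):
    needed = random.sample(RESOURCES, 5)   # each request needs five resources
    acquired = []
    for r in needed:
        if r in UNREACHABLE or r in held_by_blocked:
            # CA behavior: no forced-down (CP) or resolve-later (AP) escape
            # hatch, so the request just waits, pinning what it already holds.
            held_by_blocked.update(acquired)
            blocked += 1
            break
        acquired.append(r)
    else:
        served += 1

print(f"served={served} blocked={blocked} pinned={len(held_by_blocked)}")
```

Run it and the served count dries up quickly: the set of pinned resources only ever grows, so every blocked request makes the next one more likely to block.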

OK, so this can happen. Is it worth designing around? Implementing redundant networks can reduce the frequency of partitions to very low levels. That means some additional cost for network hardware and software, but without replication to ensure reachability (on top of potentially hardware-assisted replication to ensure durability), the savings on node and storage hardware needed to reach fixed capacity/performance goals might be even greater. Let’s just eat the cost on those rare occasions when the unthinkable happens. Doesn’t that make more economic sense?

The first argument against that is that partitions are more common than most people think, even in redundant networks, so the economic cost of such failures is likewise greater. I think the more compelling argument, though, is that a better tradeoff is practically always available. Any CA system which degrades messily into zero capacity when a partition occurs can easily be replaced with a CP system which degrades cleanly to half capacity under the same conditions. The technology for quorum-enforcing CP systems is mature and readily accessible, and the configuration and running costs are exactly the same, so there’s no good reason to implement a CA system instead. Every CA system you see is the result not of diligent design but of (often willful) ignorance. CA systems can exist; they just shouldn’t.
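For what it’s worth, the quorum rule itself is almost trivially small. This is just an illustrative sketch of mine (not HACMP’s actual implementation; the node names are made up): each node keeps serving only while it can see a strict majority of the cluster, and on the minority side it fails its requests (or fences itself outright) instead of leaving them wedged.

```python
CLUSTER = {"n1", "n2", "n3", "n4", "n5"}

def has_quorum(reachable):
    # `reachable` includes this node itself plus every peer it can still see.
    return len(reachable) > len(CLUSTER) / 2

def handle_request(reachable, request):
    if not has_quorum(reachable):
        # The CP choice: refuse the request (or fence the whole node) rather
        # than block indefinitely while holding locks the other side can't see.
        raise RuntimeError(f"no quorum ({len(reachable)}/{len(CLUSTER)}): request refused")
    return f"served: {request}"

# A 3/2 partition: the majority side keeps serving at reduced capacity,
# the minority side fails cleanly instead of wedging.
print(handle_request({"n1", "n2", "n3"}, "read x"))
try:
    handle_request({"n4", "n5"}, "read x")
except RuntimeError as err:
    print(err)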