One of the few good things to come out of Jim Starkey’s ill-fated attempt to brush aside Brewer’s CAP Theorem with some made-up definitions and MVCC hand-waves was that it afforded an opportunity for people to think and talk about issues of availability, partition tolerance, etc. In his latest message (note the new title – apparently anything that doesn’t draw positive attention becomes boring) Jim constructs a strawman that I’m sure is very familiar to folks who’ve been involved in CAP discussions before.
That slop isn’t good enough for a database system. If somebody in Tokyo
and somebody in New York can each withdraw the last $1M from an account,
there’s going to be an unhappy banker who’s going to think that getting
the right answer trumps a funny definition of availability.
Yawn. There’s nothing about CAP that says all systems need to be AP, and only a very few misguided folks who even remotely suggest it. If strong consistency is a requirement, you still get to choose between CA and CP. Do you make the system completely unavailable for some, or allow some subset of requests hang for anyone who makes them? Most people who’ve worked on high-availability systems really hate having an ill-defined and unpredictable subset of requests hang, so they tend to choose some form of quorum rule. That’s CP according to these definitions, which is confusing but hardly matters; the question and the tradeoffs remain the same.
That brings us to leases. Because they expire in bounded time, leases turn a hang (which breaks partition tolerance) into a finite delay (which preserves it). Combined with MVCC, it’s easy to imagine a system which appears to preserve consistency as well without having to sacrifice availability by forcing any nodes down. Wait, isn’t that the Holy Grail that CAP tells us is impossible? (This is where I half-expect Starkey to say that’s what he was proposing, even though he never got that far in his reasoning. I’ve worked with many DEC alumni of approximately the same age who had the same obsessive credit-stealing behavior; it must have been hell for the truly creative people who worked there at the time.) To see why this idea doesn’t quite work, let’s flesh it out a little. To make a change, you start a transaction. You take out leases on anything you plan to change, and because it’s MVCC all changes you make are private until you commit. When you commit, you check all of your leases; if any have expired, then you abort and retry. That’s all pretty obvious to anyone skilled in the art, right?
The first problem that occurs to me is the lack of forward-progress guarantees and the potential for deadlock, but the real problem is the same one I pointed out to Jim in the original thread: how does a transaction terminate? How does everyone agree that it has terminated, so that they can start a dependent transaction without introducing inconsistency? Leases ensure that everyone capable of accurate timekeeping (including you) will know that your lease expired, but how will they know whether it did so before or after you committed? What if you committed and then immediately became isolated? Where’s the newly committed data? If you’re the only one who has it, then someone might hang waiting for the data instead of the lock and we’re not partition-tolerant any more. If one or more copies of the data must be somewhere else, then you need to provide an unambiguous and simultaneous signal to everyone involved that the new version you wrote is now the committed version. You can go around and around on acknowledgements and counter-acknowledgements and commit protocols until you realize that you need something like Paxos.
The problem with Paxos is that it’s very heavyweight for a widely distributed system. The time for such an algorithm to run is effectively proportional not to the normal message latency but to the maximum time that a node can go to sleep and then continue with its old state intact. In a local environment you can have many redundant connections per node and mostly need to worry about such delays (e.g. caused by an OS glitch) on the order of a few seconds, even for a very sick system. In a wide-area environment – which is where CAP is meant to apply – nodes are likely to have few connections and phenomena such as routing instability can leave them stranded for minutes. A multi-round protocol potentially involving multiple minutes per round just doesn’t preserve partition tolerance in any meaningful sense. Consensus itself is a conceptual sibling of consistency, so you might as well say that the C in CAP stands for Consensus. If you have to wait for consensus, then you have to sacrifice one of the other properties and we’re back to having the same problem we started with. (Hmmm. Global + Agreement + Timely = GAT Theorem? Or TAG?)
I’m not saying that you can’t use leases to create a “good enough” compromise between C and A and P, leading to some extremely useful systems. I love leases. They were the subject of my first patent, which is how I ran into some of their limitations head-on and can write about them now. You can also make that “good enough” compromise using many other mechanisms, without refuting CAP. In the end, as many already knew, it’s about tradeoffs. It’s about knowing they’re necessary, and thinking about them, and choosing the right ones for your needs.