When I’m talking to people about Eric Brewer’s CAP Theorem, one of the things that’s hardest to explain is the operative definitions of availability and partition tolerance. An “intuitive definition” is no definition at all, and a definition borrowed from another context might be even more misleading. I saw an example of this on a mailing list just last night. A genuine cluster expert was using definitions that were relevant for a particular kind of clustering fifteen years ago to cast aspersions on the relevance or import of more recent work on different kinds of systems. His definitions were clearly the wrong ones, and his immature comments dismissing Lynch as a guy from Berkeley (even though she’s at MIT) weren’t exactly endearing either. I guess I should thank him for inspiring this post, though.
The first thing we need to get out of the way is that the actual words “availability” and “partition tolerance” are either meaningless or misleading here. Call them A and P instead. Both A and P have to do with availability. Both A and P have to do with partitions. Got it? The choice was made to use “availability” for one property that a system might have and “partition tolerance” for the other, but if the labels had been swapped they’d make about as much sense. You have to look at the context (for Brewer) or the formal definition (for Lynch) – not just the words.
It’s actually easiest to work backwards a little bit, starting with Nancy Lynch’s more formal 2002 SIGACT paper. Here’s how that paper defines availability and partition tolerance.
For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.
…
In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another.
Lynch also makes the point that unbounded delay is indistinguishable from failure. Time is therefore an essential component of these definitions (a point made even more explicitly in the Dynamo paper). What they leave unclear is the precise dividing line between node and network failure, and therefore between availability and partition-tolerance problems. For that, we go back in time to Brewer’s 2000 PODC keynote. Brewer doesn’t actually provide very clear definitions, but he does leave a very important clue in slide 16.
Forfeit Availability
…
Make minority partitions unavailable
What he’s talking about here is quorum, and this is where a lot of confusion comes from. Somebody from a high-availability clustering background, such as myself or the aforementioned expert, is likely to think of quorum requirements as a way of maintaining availability. We would have interpreted Lynch’s definitions in that light, and yet here Brewer explicitly presents quorum requirements as forfeiting availability. What’s going on? Well, remember what I said about the words being meaningless? The key to resolving this apparent contradiction is to stop thinking about the words, and start thinking in terms of nodes, requests, and bounded time.
- Availability is sacrificed if certain nodes are forced to wait for unbounded time because of a failure. This includes the common approach of forcing non-quorum nodes down, which Brewer alludes to.
- Partition tolerance is sacrificed if certain requests are forced to wait for unbounded time because of a failure. This is most often the case when a node holding a lock cannot be reached, and quorum loss is not used to break the lock.
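To make that distinction concrete, here’s a minimal sketch in Python of the two policies a node might follow when it detects a partition. None of this comes from Brewer or Lynch; the names (QUORUM, reachable_nodes, lock_holder) and the five-node cluster are purely hypothetical.

```python
QUORUM = 3  # majority of a hypothetical 5-node cluster

def sacrifice_availability(reachable_nodes):
    """CP-style policy: a node that cannot see a quorum majority fences
    itself off, so it is the *node* that waits out the partition."""
    if len(reachable_nodes) < QUORUM:
        return "node fenced: refuses all requests until the partition heals"
    return "node up: the quorum side keeps serving"

def sacrifice_partition_tolerance(lock_holder, reachable_nodes):
    """CA-style policy: every node stays up, but a *request* that needs a
    lock held on the far side of the partition waits for unbounded time."""
    if lock_holder in reachable_nodes:
        return "request completes: lock obtained"
    return "request hangs: lock holder unreachable, no quorum rule to break the lock"

print(sacrifice_availability({"B", "C"}))              # minority side of a 3-2 split
print(sacrifice_partition_tolerance("A", {"B", "C"}))  # lock held by unreachable node A
```

Either way something ends up waiting; the difference is whether it’s whole nodes or individual requests.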
This brings us (finally) to the practical implications for different types of systems. Let’s consider the case of a single resource and three nodes interested in that resource when a partition occurs, according to the following diagram.
[Diagram: X holds the lock on the resource; Y is on X’s side of the partition; Z is on the other side, which has the quorum majority.]
What are the practical implications for different CAP systems if Y and Z receive requests for the resource?
- In an available but not partition-tolerant system, Y would be allowed to process its request because it can reach X to obtain the lock. Z’s request would be blocked because X is unreachable.
- In a partition-tolerant but not available system, Z would be allowed to process its request because it is part of the quorum group (X’s lock will be broken). Y’s request would be blocked because it is not part of the quorum group.
- In a system that is both available and partition-tolerant, both requests would be allowed to progress. Y would return current data as possibly modified by X, while Z would return possibly stale data. Consistency is obviously sacrificed in this case. Note that maintaining consistency is possible in the other two kinds of systems (which is not to say that it’s easy).
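If it helps to see all three cases side by side, here’s a toy Python sketch of the scenario above. The facts it encodes – Y can still reach the lock holder X, Z is on the side that holds quorum – come from the bullets; everything else (the table, the messages) is just illustration, not anything from Brewer or Lynch.

```python
# Y shares the lock holder X's side of the partition; Z is on the side with quorum.
SCENARIO = {
    "Y": {"can_reach_lock_holder": True,  "in_quorum_group": False},
    "Z": {"can_reach_lock_holder": False, "in_quorum_group": True},
}

def handle(node, system):
    s = SCENARIO[node]
    if system == "CA":    # available, not partition-tolerant
        return ("proceeds: lock obtained from X" if s["can_reach_lock_holder"]
                else "blocks: lock holder X is unreachable and its lock is never broken")
    if system == "CP":    # partition-tolerant, not available
        return ("proceeds: quorum side breaks X's lock" if s["in_quorum_group"]
                else "blocked: non-quorum side is forced down")
    if system == "AP":    # available and partition-tolerant; consistency is sacrificed
        return ("proceeds: returns current data" if s["can_reach_lock_holder"]
                else "proceeds: returns possibly stale data")

for node in ("Y", "Z"):
    for system in ("CA", "CP", "AP"):
        print(f"{node} in a {system} system -> {handle(node, system)}")
```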
I hope this clarifies what availability and partition tolerance actually mean in a CAP context, and more importantly what the implications are for systems that have to deal with these tradeoffs.
Thanks so much for this very clear explanation. The graphic helps, too. Now I want to go back and read the Dynamo paper again since it will certainly make more sense now.
Thank you for the effort to explain this to us. I was just looking for it.
Unfortunately, I got lost once you introduced a third word that was new to me: quorum (and non-quorum nodes).
Isn’t there an even easier definition of these things?
Going back to google again :(
David
The point of quorum is to avoid “split brain syndrome”, in which two mutually isolated sets of nodes continue running independently after a network partition. The key observation is that only one such set can contain N/2+1 or more nodes. Therefore, if you can reach at least that many nodes (including yourself) then you’re part of the quorum majority and it’s safe to keep running. If you can’t, then you’re part of the minority and it’s not safe, so you shut down. Since the quorum members can see or assume that non-quorum members are no longer alive, they can break locks instead of waiting for operations that require them. It’s a simple and completely reasonable approach in a local environment where you’ve done everything you can to prevent partitions (e.g. redundant networks or even signaling through storage), but in a distributed environment where partitions are inevitable it can mean that all but one site becomes unusable.
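If a few lines of code make it clearer, the whole rule boils down to a simple majority test. This is just a toy sketch with made-up names, not code from any particular cluster manager:

```python
def has_quorum(reachable_count, cluster_size):
    """True if this node can see a strict majority of the cluster,
    counting itself: at least N/2 + 1 out of N nodes."""
    return reachable_count >= cluster_size // 2 + 1

# A 5-node cluster split 3-2: only the 3-node side keeps running.
print(has_quorum(3, 5))  # True  -> keep running, safe to break locks held by the other side
print(has_quorum(2, 5))  # False -> minority side shuts itself down
```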
Thank you, Jeff, for the perfect and quick elaboration. Now it makes so much more sense to me.
I would like to clarify one more thing. Under the figure you have three bullets. The first bullet describes a scenario where we have an available but not partition-tolerant system. In the description you say that Z would be blocked because X is unreachable, which I understand. However, I do not understand why this is called an available system, because when I compare it to the second bullet and look at both situations as a black box from an availability perspective, they appear completely similar to me. In both cases one node is working while the other is blocked. Yes, it is for two different reasons, which I understand, but what I don’t understand is why one is called available while the other is not. To me it appears that both situations are actually not available (for two different reasons, though).
Thank you.
I think your confusion on that point is perfectly understandable. To be quite honest, I really don’t like the way Brewer and Lynch use the terms either. They’re too vague, they overlap too much, and when most people do try to make a distinction a majority would make the distinction exactly opposite to the way Brewer and Lynch did. That’s unfortunate, because I think understanding the “official version” is important when people talk about CA vs. CP vs. AP systems. I would have called the two properties “node availability” and “request availability” or something, but that’s kind of water under the bridge. The terms are what they are, and the definitions are the ones that I think follow from the papers.
Sorry, I’ve just been meaning to get that off my chest for a while. Thanks for the opportunity. ;) Just keep reminding yourself that “availability” means that nodes remain available and “partition tolerance” means that requests survive partitions, if it helps.
Back on topic, I don’t think availability and partition tolerance are the same even from a black-box perspective. In a non-available system, the server I’m using – or perhaps the one I’m running on – is completely non-functional. I get a clear and immediate indication that Stuff Is Broke. In a non-partition-tolerant system, by contrast, some and quite likely most requests will still succeed until one hangs because it needs a lock on the other side of a partition. The classic example is a set of servers mounting directories from one another via NFS. It’s a bad idea, I know. Why? Because you can run for hours thinking nothing is wrong, until you hit that one file on that one failed server and you get the dreaded “NFS server xxx not responding, still trying” in your logs. People don’t like that, which is why high-availability clusters have long tended to go the quorum route. Note that “high availability” here is defined opposite to the Brewer/Lynch usage, which is exactly why I think that usage is unfortunate.
Hi Jeff,
I’m just reading the most recent comments on this (great) blog entry.
Your distinction between node/request “availability” and quorum breaking makes me wonder: is it correct to say that a CA system can be turned into a CP one simply by forcing “partitioned nodes” down, so that:
1) Partitioned nodes are completely non-functional.
2) Live nodes are still functional because locks held by the partitioned ones have been broken.
Thanks for your awesomeness ;)
Sergio B.
Thanks for your kind comments, Sergio. That’s an excellent question. If you don’t mind, I’d like to think about it for a bit and then answer in a separate post. Thanks for the idea.
Great, no worries: I’ll wait for your next post.
In the meantime, I’ll add a bit of context to my observation above.
I’m currently working on a master-based hub-and-spoke distributed system, whose original behavior on spoke-from-hub partitioning was to:
1) Hold locks for a given period of time (to allow reconnection) and then break them after timeout expiration (to allow live nodes to continue working).
2) Keep the partitioned (and disconnected from master) nodes up in an endless reconnection retry, but unable to answer any request.
AFAIU, the system was available for live nodes (after the timeout, obviously), and non-partition-tolerant for disconnected ones, so let’s say CA.
But shutting down disconnected nodes would make all requests to those nodes fail in bounded time, so the system seems to be CP now: hence my comment.
Now, waiting for your post ;)
Cheers,
Sergio B.
Jeff,
Thanks for sharing your thoughts. I think in this post, however, your explanation of availability and partition tolerance as used in Brewer’s CAP theorem is wrong. I can understand your confusion, though; I was stumped for quite a while myself. If you revisit the Gilbert/Lynch document, I think you’ll come to the same conclusion.
Here’s the conclusion I came to after studying the documentation:
Consistent means you always get the correct data from any node.
Available means a request to any non-failing node will be serviced. This means that in your picture, both the “Y” and “Z” nodes will be able to complete requests. “Availability” really means exactly what you’d expect it to mean.
Partition Tolerance means you can handle if nodes in the network get separated. It doesn’t have anything to do with “availability” in particular (I’ll explain more later).
The confusion comes from the fact that Availability, Consistency, and Partition-Tolerance are grouped together, so tightly, yet “Partition-Tolerance” sort-of doesn’t belong. Availability and Consistency are positive features of a distributed system when viewed from the _outside_ as a black box, if you will. Partition-Tolerance is a requirement on the _inside_ of a distributed system.
What does this mean:
CP without A – The distributed service will always return correct results, though some or all of the nodes may not respond at all when there are problems. If the network gets partitioned, this will continue to work.
AP without C – Any non-failed node will always respond to requests, though the data may be stale. If the network gets partitioned, this will continue to work.
CA without P – The distributed service will return consistent results from all non-failed nodes, all the time. Since there is no “Partition-Tolerance”, the distributed system’s network must be perfect – never failing. If the network does get partitioned, all bets are off. You’ll lose consistency, availability, or both. Choosing to forgo “Partition-Tolerance” means that your system will not tolerate network partitions.
Thanks,
Ryan
Consistent means you always get the correct data from any node.
“Correct” by what definition? In Gilbert and Lynch they refer to serializable (“atomic”) consistency on single objects. CAP may be provable for other definitions of consistency as well, but “correct” without elaboration is too vague for this kind of discussion.
Available means a request to any non-failing node will be serviced.
To be more precise, it means the request will be accepted and executed, but not necessarily that it will be able to complete in bounded time.
CA without P – The distributed service will return consistent results from all non-failed nodes, all the time. Since there is no “Partition-Tolerance” the distributed systems network must be perfect – never failing. If the network does get partitioned, all bets are off. You’ll lose consistency, availability, or both. Choosing to forgo “Partition-Tolerance” means that your system will not tolerate network partitions.
Partition tolerance is a property, not a precondition. Treating P as a precondition effectively denies the existence of CA systems outside of a non-existent world where networks never fail, which isn’t very informative.
If the network does get partitioned, all bets are off. You’ll lose consistency, availability, or both.
Without partition tolerance, a system will wait indefinitely for resources that are inaccessible. That’s not “all bets are off”; it’s a well defined failure mode. You can choose to trade away consistency or availability to get partition tolerance but their loss is not an inevitable consequence of being non-P. In a way, that’s the very essence of CAP. CAP says you can have only two – any two – of the three properties. It doesn’t say you can have only two but if you don’t have P then you have to lose one of the others, which is essentially what you seem to be saying.
Thanks for taking the time to respond.
“Correct” by what definition? In Gilbert and Lynch they refer to serializable (“atomic”) consistency on single objects. CAP may be provable for other definitions of consistency as well, but “correct” without elaboration is too vague for this kind of discussion.
Sorry, that was a lazy paraphrase on my part, because this discussion focuses on Availability and Partition Tolerance, and I think “consistency” is already understood by the people who would read this.
Regarding Availability: To be more precise, it means the request will be accepted and executed, but not necessarily that it will be able to complete in bounded time.
That is more precise, but it is also distracting when trying to understand the practical implications of CAP. As pointed out in the Lynch document, “In some ways this is a weak definition of availability: it puts no bounds on how long the algorithm may run before terminating, and therefore allows unbounded computation. On the other hand, when qualified with the need for partition tolerance, this can be seen as a strong definition of availability: even when severe network failures occur, every request must terminate.” In other words, the definition of “availability” doesn’t prescribe how long the algorithm will take to complete, but it will complete. But that’s sort of beside the point; when we talk about CAP, we usually don’t care too much about “the algorithm” or how long it will take (given that it must complete even when there are network problems, which is really the only reason practical algorithms might take a long time).
Partition tolerance is a property, not a precondition. Treating P as a precondition effectively denies the existence of CA systems outside of a non-existent world where networks never fail, which isn’t very informative.
Actually, I think this is very informative, and it’s the reason I replied. I’ll restate it in another way, because I think it’s very important: It is impossible to have a “consistent” and “available” distributed system, unless you can guarantee there will be no network partitions. If you try, and there is a network partition – the result will be the potential for either inconsistency, lack of availability, or both depending on how the system is designed.
Recognizing that perfect CA systems are not practical is one of the real lessons of the CAP theorem. It doesn’t mean there can’t be “almost CA systems” out there that are available most of the time and are consistent most of the time. In fact, this is a very interesting area that I’d like to learn more about. Perhaps an idea for a future post.
Without partition tolerance, a system will wait indefinitely for resources that are inaccessible. That’s not “all bets are off”; it’s a well defined failure mode.
This is not necessarily true. It depends on how you implement the distributed system. You could choose to implement it such that resources remain accessible, but the data is not consistent. With respect to CAP, which talks about distributed systems in the general sense, it is “all bets are off”. Though you’re correct that when you design a system you can have well-defined failure cases. CAP just tells us that without partition tolerance, in the case of a network partition, we will have some kind of problem (inconsistency, lack of availability, or both).
You can choose to trade away consistency or availability to get partition tolerance but their loss is not an inevitable consequence of being non-P.
Except “being non-P” is not very practical.
Being non-P = no partition tolerance = your distributed system cannot tolerate a network partition = you have a perfect network.
So, you can indeed have C and A, if you have a perfect network. Otherwise, when there are network problems your system will not be able to continue to offer both C and A.
It doesn’t say you can have only two but if you don’t have P then you have to lose one of the others, which is essentially what you seem to be saying.
I’m saying, as I stated above, that “not having P” is not practical, and that the definitions of A and P you gave above are not accurate. I attempted to explain where I thought the confusion came from, but that may have just been more confusing.
It is impossible to have a “consistent” and “available” distributed system, unless you can guarantee there will be no network partitions. If you try, and there is a network partition – the result will be the potential for either inconsistency, lack of availability, or both depending on how the system is designed.
I also think this is a very important point, which is why I’ll say again: it is entirely possible to have a CA system, even if partitions are likely. It’s just not very desirable in that case, but “possible” and “desirable” are two different things and CAP addresses only the first. Being CA, such a system will respond to partitions in a well defined way – some requests will succeed, others will be allowed to start but will not complete because a resource they depend upon is unavailable. The combination of requirements that would make such a system seem reasonable (high consistency, high immediacy for some requests but not for indistinguishable others, low data coupling to avoid wait cascades) seems so unlikely that I can scarcely imagine it being useful, but it’s still possible. You can turn such a system into CP or AP by various means, but then it’s a different system. Again, CAP says that it could remain CA, even if neither you nor I can think of why we’d want it to.
In that context, you might find my post on the TAG Conjecture – from this morning, before you commented – interesting. If partitions are unlikely (other than the degenerate case of node failure which is explicitly excluded by the “non-failing node” qualification of availability), then there’s no reason to give up C or A. That’s why CA systems, including not just databases but also filesystems and other kinds of things, are more common even than CP or AP, but that’s not really what CAP is about. It’s only when partitions are likely that tradeoffs between C and A (in CAP, or T and A in TAG) make sense.
[quote]I also think this is a very important point, which is why I’ll say again: it is entirely possible to have a CA system, even if partitions are likely.[/quote]
That statement completely contradicts the CAP theorem. You described a Consistent, Available, and Partition-Tolerant system. The CAP theorem says you can only choose 2 of those 3. This is proven in the Lynch document, section 3.1:
It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following:
Availability,
Atomic consistency
in all fair executions (including those in which messages are lost).
You described a Consistent, Available, and Partition-Tolerant system.
No, I didn’t. I described a system that has well-defined behavior when partitions occur, not one for which that behavior satisfies “partition tolerance” as defined in the paper. Please read what I wrote before you try to tell me things I already know.
I would suggest that your statement “Being CA, such a system will respond to partitions in a well defined way – some requests will succeed, others will be allowed to start but will not complete because a resource they depend upon is unavailable” doesn’t quite make sense given the definition of “availability”. The system you describe would not meet the A requirements because in some scenarios requests will not terminate****.
This is where I really start to dislike the way Gilbert and Lynch play around with their definitions. Look at section 3.2.1, for example, in which they describe a supposedly consistent (“atomic”) and partition tolerant system – which simply fails if there’s a partition. Note that they say “the service” – i.e. as a whole, not just some nodes – stops returning responses. Complete failure when an event occurs doesn’t qualify as tolerating that event in my book.
So, if “partition tolerance” means merely that there can be partitions (not that they’re tolerated in any meaningful sense), then you’re right: CA can only exist locally. I stand corrected. I think that’s a fairly uninteresting result, though. If CA systems exist anywhere at all, then those same systems will also exist in the presence of partitions. When I used to work on high-availability clusters – even those with redundant networks – I still saw “impossible” partitions occur and still had to write code to deal with split-brain scenarios. There are no networks, or combinations of networks, that never fail. CAP is then reduced to a single tradeoff: terminate and return possibly-stale data, or wait indefinitely for known-good data.
On the other hand, if “partition tolerance” means that nodes which remain connected can continue to transfer data and control amongst themselves, and “availability” means that non-failing nodes are allowed to continue accepting requests instead of being forced down, then I think that opens up a much more interesting set of tradeoffs. I think the “two out of three” result is still provable in that case too, even if that doesn’t seem to be what Gilbert and Lynch have done. I wonder what Brewer would think.
Ack. I’m sorry, Ryan. I meant to reply to your comment, but somehow managed to edit it instead and ended up replacing most of it with my own. My apologies. Fortunately, I think the gist of it still comes through.
Jeff, thanks for the interesting discussion; it was helpful for me. For the record, below is my post which was accidentally removed.
Cheers,
Ryan
————
Jeff, hopefully you’re not taking any offense at my disagreement with you; I’m attempting to be respectful and polite in my responses. I, of course, did read everything you wrote before responding.
I think we have some terminology disagreements. I would suggest that your statement “Being CA, such a system will respond to partitions in a well defined way – some requests will succeed, others will be allowed to start but will not complete because a resource they depend upon is unavailable” doesn’t quite make sense given the definition of “availability”. The system you describe would not meet the A requirements because in some scenarios requests will not terminate****. This system, viewed as a black box, could not be called CA because it doesn’t always meet the A requirements. That is why I continue to disagree when you say, for example, “it is entirely possible to have a CA system, even if partitions are likely”.
Though our difference may be in what the term “CA System” means. When you say a system is a “CA system” you might be referring to a system which has CA qualities in the absence of network partitions, and well-defined non-CA behavior when network partitions occur. The term “CA system” to me means a system which _always_ has the CA qualities which is not possible if network partitions can occur.
Thanks,
Ryan
****Although “availability” puts no bounds on how long a request may run before terminating. As I stated in an earlier post, this refers to how long the algorithm is allowed to take. In the definition of availability, taking an unbounded time because of a partition in the network is _not_ allowed. This is stated in section 2.2 of the Lynch document, which says that for a system to be continuously available it must handle network partitions. I know that this sounds like I’m saying you always have to have P if you have A, but that’s not what I mean. This is reiterated in section 2.3: “The above definitions of availability and atomicity are qualified by the need to tolerate partitions”. And also in 2.3: “The availability requirement implies that every node receiving a request from a client must respond, even though arbitrary messages that are sent may be lost. … even if every other node in the network fails … a valid response must be generated.”
I got tired of having to pick this out of the archives every time I need to point someone to it, so I linked it from http://wiki.apache.org/cassandra/ArchitectureInternals. Great article!
Sorry for the stupid question. I still can’t figure out what Partition Tolerance is! Can you give me an example??? Thank you!
In a nutshell, partition tolerance means that requests can still be processed and completed by some nodes in the system even though other nodes are unreachable. In a CP system requests can complete at nodes that have quorum because non-quorum nodes have been shut down, had their locks broken, etc. In an AP system requests can complete at any live node, possibly resulting in inconsistency until the partition ends and the conflicting versions are reconciled.
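As a purely hypothetical illustration of that last part – reconciling conflicting versions once the partition heals – here’s the simplest possible policy, last-write-wins by timestamp. Real systems may use version vectors or application-level merges instead; this is only meant to show what “reconciled” can look like:

```python
def reconcile(versions):
    """versions: (timestamp, value) pairs written on different sides of a
    partition. Keep the value with the most recent timestamp."""
    return max(versions, key=lambda v: v[0])[1]

print(reconcile([(101, "written on Y's side"), (107, "written on Z's side")]))
# -> written on Z's side
```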
In a CP system requests can complete at nodes that have quorum because non-quorum nodes have been shut down, had their locks broken, etc.
I think you meant to describe a CA system above. If I understand your explanation correctly, a P system does not have or care about quorum or non-quorum nodes.
Fantastic thread. If I understand this correctly, can this be summarized as:
1. One can’t have a “Consistent and Available” system that also allows for loss of arbitrarily many messages (or n/w partitions)
2. A partition-tolerant system (that survives network partitions) has to:
– Either sacrifice consistency, in which case both Y and Z would respond and “Request” is always available
– OR sacrifice “Request Availability”, as Y (being a non-quorum node) won’t respond in bounded time