Eventual Consistency

I wrote about availability and partition tolerance in my last post, so I guess I should tackle consistency as well. Informally, “consistency” can refer to two slightly different things: conforming to some set of rules, or getting similar results for similar actions. Both senses apply to consistency in distributed systems: consistency means conforming to rules about who sees what values when. In its strongest form, consistency means a complete global ordering of reads and writes in a distributed system.

Let’s say we’re dealing with an integer value that starts at 6. Two updates occur: X, which adds 4, and Y, which divides by 2. Furthermore, we know that there’s a causal relationship determining the order of the two updates, because some other kind of event – e.g. a lock operation or message exchange – came in between. A complete global ordering ensures two things.

  • No node will ever see the result of Y before X. In other words, they might see 6/10/5 depending on when they read, but they will never see 3.
  • Once any node has seen the result of an update, no node will ever see a state without it. If some node (call it Z) sees 10, nobody can see 6 again; once Z sees 5, nobody can see 10 again.

Eventual consistency breaks both of these rules. A node might in fact see Y before X, temporarily yielding 3. A moment later, another node might see X first and get 10. What remains true, though, is that once any node sees 10 then no other node should see 7. That’s because of the guarantee that eventual consistency does make.
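To make the arithmetic concrete, here’s a trivial sketch of the running example (my own illustration, nothing more):

```python
# The running example: X adds 4, Y divides by 2, starting from 6.
def X(v): return v + 4
def Y(v): return v // 2

print(Y(X(6)))   # X then Y: 6 -> 10 -> 5
print(X(Y(6)))   # Y then X: 6 -> 3 -> 7
```

Because the two updates don’t commute, the order in which they’re applied is visible in the result, which is exactly why ordering guarantees matter.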

For any pair of updates X and Y, all nodes will eventually converge on a result that reflects X and Y applied in the same order.

In other words, it’s OK if different nodes see different values or see them at different times, so long as eventually – meaning when both X and Y have worked their way through the system – they agree on the same value. If a system allows conflicting values to persist indefinitely, it’s not eventually consistent; it’s simply inconsistent. Note that there’s no limit on how long it takes for the system to quiesce. There’s also no restriction on which order everyone agrees on. The eventual agreed-upon result could be 7, representing Y before X and thus violating the causal relationship, but very few actual systems would make that choice. In fact, it’s entirely possible to add additional guarantees on top of eventual consistency. Some of the most common are:

  • Read your own writes. The node that issued X should never thereafter see a result (e.g. 3) without it, even though other nodes might. Some people do actually manage to screw this one up, but it’s rare. (A minimal sketch of one way to provide this follows the list.)
  • Processor consistency. If the same node issues X and Y, then nobody should ever see Y without X.
  • Never go back. A single node in our example will always proceed 6-10-5 even if it does so out of sync with other nodes.
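Here’s one hedged sketch of how “read your own writes” might be layered on top of an eventually consistent store. The store interface – put() returning a version, get() returning a (value, version) pair – is my own assumption for illustration, not any particular system’s API:

```python
class Stale(Exception):
    """Raised when a replica's answer predates our own last write."""

class Session:
    """Track the version of our last write; reject any read older than it."""
    def __init__(self, store):
        self.store = store        # assumed API: put() -> version, get() -> (value, version)
        self.min_version = 0

    def write(self, key, value):
        self.min_version = self.store.put(key, value)

    def read(self, key):
        value, version = self.store.get(key)
        if version < self.min_version:
            raise Stale("replica is behind our own writes; try another")
        return value
```

Similar per-client bookkeeping underlies the other session guarantees as well.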

Preservation of causal relationships is one of the most complex guarantees that is typically provided in eventually-consistent systems. Using mechanisms such as vector clocks, it is possible for a node receiving Y to observe that there might be an as-yet-unreceived X that should go before. It can then defer processing of Y until X arrives, or until it’s no longer possible for any such X to exist. Such careful ordering can be defeated by out-of-band messages that overtake the in-band messages, though, since by definition the information necessary to (re)establish ordering is not present out-of-band. Application designers who depend on such guarantees must therefore exercise some caution (or remove the dependency).
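For the curious, here’s a minimal sketch of the kind of check a vector-clock-based system might make before applying Y. The dict-based clock representation and the function name are mine, purely for illustration:

```python
def deliverable(msg_vc, local_vc, sender):
    """Causal-delivery check: apply a message only if it is the next one
    from its sender and we have already seen everything else the sender
    had seen when it sent it. Otherwise some as-yet-unreceived update
    (an "X") should go first, so we defer."""
    # The message must be the very next one from its sender...
    if msg_vc.get(sender, 0) != local_vc.get(sender, 0) + 1:
        return False
    # ...and must not depend on anything we haven't received yet.
    return all(count <= local_vc.get(node, 0)
               for node, count in msg_vc.items()
               if node != sender)
```

A node receiving Y with a clock that fails this test queues it until the missing X shows up. As noted above, none of this helps with out-of-band channels, which by definition don’t carry the clocks.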

The last point I’d like to make about eventual consistency is one that I recently made via email. Eventual consistency just means updates are asynchronous. It doesn’t mean that they’re slow, or unordered, or non-atomic. In practice, extra guarantees such as the above can be added so that all of the consistency behaviors that matter to a particular application are preserved just as well as with strong consistency. Even when that’s not the case, inconsistency is in practice likely to be rare and fleeting. I vaguely recall seeing some numbers for the frequency of read-repair operations in Dynamo, which should approximately reflect the level of inconsistency that actually occurs, but I can’t find them right now. What matters is that eventual consistency is often strong in practice, even if it’s not guaranteed, and that’s sufficient for many – I think most – applications. There are important categories of applications for which guaranteed strong consistency really is a requirement (anything to do with money or nuclear materials comes to mind), and people writing such applications should consider whether the software they’re using really meets those requirements, but those requirements shouldn’t drive technology choices for absolutely everyone.

Availability and Partition Tolerance

When I’m talking to people about Eric Brewer’s CAP Theorem, one of the hardest things to explain is the operative definitions of availability and partition tolerance. An “intuitive definition” is no definition at all, and a definition borrowed from another context might be even more misleading. I saw an example of this on a mailing list just last night. A genuine cluster expert was using definitions that were relevant for a particular kind of clustering fifteen years ago to cast aspersions on the relevance or import of more recent work on different kinds of systems. His definitions were clearly the wrong ones, and his immature dismissal of Lynch as a guy from Berkeley (even though she’s at MIT) wasn’t exactly endearing either. I guess I should thank him for inspiring this post, though.

The first thing we need to get out of the way is that the actual words “availability” and “partition tolerance” are either meaningless or misleading here. Call them A and P instead. Both A and P have to do with availability. Both A and P have to do with partitions. Got it? The choice was made to use “availability” for one property that a system might have and “partition tolerance” for the other, but if the labels had been swapped they’d make about as much sense. You have to look at the context (for Brewer) or the formal definition (for Lynch) – not just the words.

It’s actually easiest to work backwards a little bit, starting with the more formal treatment in Gilbert and Lynch’s 2002 SIGACT paper. Here’s how that paper defines availability and partition tolerance.

For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.

In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another.

Lynch also makes the point that unbounded delay is indistinguishable from failure. Time is therefore an essential component of these definitions (a point made even more explicitly in the Dynamo paper). What they leave unclear is the precise dividing line between node and network failure, and therefore between availability and partition-tolerance problems. For that, we go back in time to Brewer’s 2000 PODC keynote. Brewer doesn’t actually provide very clear definitions, but he does leave a very important clue in slide 16.

Forfeit Availability

Make minority partitions unavailable

What he’s talking about here is quorum, and this is where a lot of confusion comes from. Somebody from a high-availability clustering background, such as myself or the aforementioned expert, is likely to think of quorum requirements as a way of maintaining availability. We would have interpreted Lynch’s definitions in that light, and yet here Brewer explicitly presents quorum requirements as forfeiting availability. What’s going on? Well, remember what I said about the words being meaningless? The key to resolving this apparent contradiction is to stop thinking about the words, and start thinking in terms of nodes, requests, and bounded time.

  • Availability is sacrificed if certain nodes are forced to wait for unbounded time because of a failure. This includes the common approach of forcing non-quorum nodes down, which Brewer alludes to.
  • Partition tolerance is sacrificed if certain requests are forced to wait for unbounded time because of a failure. This is most often the case when a node holding a lock cannot be reached, and quorum loss is not used to break the lock.

This brings us (finally) to the practical implications for different types of systems. Let’s consider the case of a single resource and three nodes interested in that resource – X, Y, and Z – when a partition occurs. In the scenario that follows, X holds a lock on the resource; the partition leaves Y able to reach X, while Z cannot reach X but sits on the quorum side.

What are the practical implications for different CAP systems if Y and Z receive requests for the resource?

  • In an available but not partition-tolerant system, Y would be allowed to process its request because it can reach X to obtain the lock. Z’s request would be blocked because X is unreachable.
  • In a partition-tolerant but not available system, Z would be allowed to process its request because it is part of the quorum group (X’s lock will be broken). Y’s request would be blocked because it is not part of the quorum group.
  • In a system that is both available and partition-tolerant, both requests would be allowed to progress. Y would return current data as possibly modified by X, while Z would return possibly stale data. Consistency is obviously sacrificed in this case. Note that maintaining consistency is possible in the other two kinds of systems (which is not to say that it’s easy).
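To make the three cases concrete, here’s a toy decision function (entirely my own sketch, not from Brewer’s slides or Lynch’s paper). The reachable and in_quorum predicates are assumed helpers describing the partition:

```python
LOCK_HOLDER = "X"   # the node currently holding the lock on our single resource

def handle_request(node, reachable, in_quorum, system):
    if system == "A-not-P":
        # Serve only if the lock holder is reachable; otherwise the
        # request waits indefinitely (partition tolerance sacrificed).
        return "serve" if reachable(node, LOCK_HOLDER) else "block"
    if system == "P-not-A":
        # The quorum side serves (breaking X's lock); nodes on the
        # minority side are forced down (availability sacrificed).
        return "serve" if in_quorum(node) else "block"
    if system == "A-and-P":
        # Everyone serves, at the cost of possibly stale or divergent data.
        return "serve, possibly stale"
    raise ValueError(system)
```

Plugging in the scenario above: Y can reach X but lacks quorum, so it serves under “A-not-P” and blocks under “P-not-A”; Z is the mirror image.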

I hope this clarifies what availability and partition tolerance actually mean in a CAP context, and more importantly what the implications are for systems that have to deal with these tradeoffs.

Glouds and Crids

As I mentioned in a previous post, the “cloud is just grid” meme tends to pop up rather frequently for me. That can be a bit frustrating, and the contrary “cloud has nothing to do with grid” meme is just as bad. I think the two technologies are in fact very clearly related, though not identical. Cloud folks could certainly help themselves by listening more to grid folks who solved many of the same problems much earlier, and grid folks could do themselves a favor by listening more to cloud folks who have taken that technology in important new directions. Cloud folks: be less arrogant. Grid folks: be less bitter. There should be synergy here, not competition. To understand why, let’s step back a bit and look at what each one really is.

  • The original grid “vision” was of combining previously separate resources to solve problems larger than those resources could practically solve while remaining separate. This was importantly contrasted with the work then being done on purpose-built massively parallel systems like ASCI Red. In one direction, grid computing led to the federation of multiple such systems so that they could be used remotely. In another direction, it led to things like SETI@home or BOINC. In any case, the general direction was toward aggregating resources in a fairly open academic-style environment.
  • There’s still so much churn around definitions of cloud computing that I know I’m stepping into dangerous waters, but I think of cloud as having three essential characteristics and many incidental ones. The essential ones IMO are location transparency, rapid self-service provisioning, and “multi-tenant” isolation of users from one another not only in terms of security but in terms of SLA-protected performance as well. I consider virtualization to be an incidental characteristic, but an important one as it enables both isolation and “infrastructure as a service” deployment where users get to install everything from the operating system on up. By contrast, I think “platform as a service” is a lot more grid-like and “software as a service” is almost indistinguishable from plain old hosted applications. IaaS is the true cloud; how’s that for a comment sure to provoke some controversy?

So, how does that help to identify similarities and differences?

  • Location transparency and rapid self-service provisioning are things grid and cloud have in common. This is where the synergies are greatest.
  • There’s a level difference: grid tends toward platform level and above, while cloud supports and even emphasizes infrastructure level.
  • There’s a granularity difference: grid tends toward “few large” and aggregating resources, while cloud tends toward “many small” and dividing them up (via virtualization) as well.

I think a lot of the confusion is related to the fact that grids and clouds have co-evolved, and many of each are misidentified. For example, many grids have developed cloud-like usage models and protection methods over time; some of them have really become clouds. On the other side, many “private clouds” might actually be devoted to a few tasks and forgo a lot of between-user protection; they’re really grids in practice. What people should do is forget about what “tradition” something came out of, or what the vendor/promoter calls it. Look at the level and granularity of the provided services and at the isolation features; no matter what you call the thing, those will tell you where to look first for examples or for secondary technologies that might help you write your own applications.

P.S. About the title: a gloud is something that was conceived (and perhaps implemented) as a grid but then turned out to be a cloud, and a crid is something that went the other way.

My Christmas List (for 2010)

I remember being very impressed by the Iridigm “butterfly wing” display technology at least as far back as 2004, before they were acquired by Qualcomm. The basic idea as described by Scientific American is to use the same principle as a butterfly’s wing to create color via interference rather than pigment (as in paper) or emission (as in most computer displays). Advantages include a truly paper-like level of ambient-light reflection, and zero power consumption except when changing the display. These are both highly valuable properties for an e-book reader, which is why it’s exciting that Qualcomm intends to come out with one.

Battling Anti-NOSQL Trolls

Originally from Snail in a Turtleneck, who deserves all the credit for a cool name and a good post.

I have a Twitter feed for the term “nosql” and every day I get tweets like:

“What moron came up with #nosql? you’re all fired!”

I hope I meet someone who says this to me someday, though, so I can say: “Boy, what a good point! If only Google and Yahoo and LinkedIn and Twitter and the thousands of other high-traffic websites had listened to you.”

I see a lot of this stuff too, though in real life rather than Twitter. There are at least three points I almost always have to make when one of these trolls pops up.

  1. We already know it’s more about ACID than SQL. Not only is such a “critique” unoriginal, but it’s irrelevant as well, because it really doesn’t change anything about the relationship between traditional and non-traditional databases.
  2. “Not Only SQL” isn’t about saying our flavor should be the only one. It’s about rejecting that attitude from the ACID grognards.
  3. The CAP Theorem and its practical application are real, not just things that only exist in a few caffeine-addled newbies’ heads. The theorem itself was presented by a respected academic in an ACM keynote. Since then it has been the subject of formal proof by other academics, while the value of its application has been empirically demonstrated by everyone from Amazon to Yahoo.

Those three points are usually sufficient to stun the trolls into silence, even if expecting them to grow up and admit their error or apologize for their rudeness is unrealistic. Now, if only I could find a similar deterrent for the “cloud is just grid” trolls.

You Will Be Assimilated

I have to admit that part of my general antipathy toward Java is a form of envy. Don’t get all excited, Java fans; it’s only a small part, and “envy” might not even be the right word. The Java language certainly has its flaws, as others have written, but every language has flaws and I could probably forgive Java’s. The Java programming environment – the vast array of libraries and frameworks and IDEs and other crutches – seems more flawed and less forgivable. The Java culture, with its at-best-misleading claims of “write once run everywhere” and “as fast as C” and its endless swirl of enthusiasm for patterns and “agile” and every other fad to come along, is worse still. What’s worst of all is the way all of these things get bundled together. Both the dependency structure and the culture seem to preclude picking out and benefiting from the good bits while leaving the bad bits behind. I’ve yet to meet a Java programmer who even seemed to try. To be a Java programmer, it seems, means to dive headlong into the entire Java ocean and leave anything else behind. That kind of enforced isolation from the broader world is a defining characteristic of cults, by the way.

So why do I care, and why do I admit to envy? Because there is some cool stuff that happens in the Java world. For example, I recently came across Structure101. Please, go read what they have to say about architecture extraction and enforcement, and especially about complexity debt. Set aside twenty minutes or so and watch the videos. Their analysis of the problems that occur in large-scale software development rings very true to me, and I think they’ve created an awesome tool to help address those problems.

So what’s the problem? If you go to their “versions and pricing” page, you’ll see it’s only for Java. Both the problem and the solution are completely language-independent, but their implementation is Java-specific. I don’t blame Headway for this, or the other vendors (e.g. for static code analysis tools) I’ve seen make the same decision. I’m sure their choice is based on rational analysis of the market. I’d love to be able to use such tools. I can’t, hence the admission of envy. I resent the Java cultists for that, not because Java itself has no value but because of the opportunity cost they’ve imposed on the entire industry. Those grapes aren’t sour. I’m sure they’re very sweet, which is exactly why putting a fence around them is so offensive. Because of the “all or nothing” nature of the Java world, lots of cool things that could benefit the broader programming community – from C to Python to Erlang to Clojure – never will. Theirs is a fundamentally selfish “come play in my sandbox or we won’t play at all” kind of attitude. It’s very similar to the “any kind of database you want as long as it’s ACID and relational” attitude I’ve also written about lately, and quite antithetical to the UNIX/Linux/open-source way of making tools that do their own jobs well and combine well with other tools to do larger jobs.

Java is the Borg of the computing world.

Deja Vu

This should ring a bell with former members of the Revivio platform team: Droid auto-focus fails on a 24.5-day cycle. Here’s a hint for those who weren’t working on BSI code at the time: how long does it take for a signed 32-bit millisecond counter to wrap?
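The arithmetic, for anyone who doesn’t want to do it in their head (a back-of-the-envelope sketch of my own, not Revivio or Motorola code):

```python
# A signed 32-bit millisecond counter overflows at 2**31 ms.
ms_to_wrap = 2**31
days = ms_to_wrap / 1000 / 60 / 60 / 24
print(f"{days:.2f} days")   # ~24.86 days -- right in the neighborhood
                            # of the reported 24.5-day cycle
```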

P.S. If any of you guys can remember which customer got hit by this, or other more precise details of the problem, please send me some email.

One Terabyte per Second

Just a few weeks ago, I was giving a presentation where I mentioned that 1TB/s isn’t that far off. Most people in the audience looked dubious. Well, we got there sooner than I thought. Of course, it costs approximately half a jillion dollars and if you try to do any metadata operations the MDS will probably hang itself, but it’s pretty impressive nonetheless.

A Bit More About SiCortex

I wrote about the demise of SiCortex when it happened, but the subject has surfaced again recently – first in a Network World article, and then on Google’s Cloud Computing group. I’d like to address a few issues these raise, plus some that have come up elsewhere. Please bear in mind that these are all from my perspective as an engineer in SiCortex’s software group, so more informed parties might be able to correct some of the details, but I’m pretty confident about the gist of them.

  • SiCortex was explicitly not trying to sell a supercomputer. We never even cracked the Top500, or intended to except as a prerequisite for the more relevant Green500 (at the time; the requirements have since changed). We even offered the 12-node 72-core $15K SC072 as an entry point, which was lauded for its affordability and accessibility (and predated Cray’s CX-1 by quite a bit). That’s hardly “Formula One” territory.
  • The “green” angle was not just a matter of market positioning, though it was certainly that. Low power and heat were also quite fundamental to achieving the level of density and simplicity (e.g. no active network components other than the compute nodes) that were the real core of the architecture.
  • Using an x86 instruction set was not an option. Even if the appropriate licenses had been available, they would surely have been out of our financial reach. As it was, IP licensing (especially MIPS) was the second largest expense behind payroll.
  • Similarly, using QPI or similar wasn’t an option. QPI itself didn’t even exist at the outset, and licensing it later would have been similarly expensive.
  • Both the x86 and QPI mirages are also affected by the question of technological fit with other key architectural elements (see above). It seems vaguely possible that something could have been worked out with VIA or some other third party that already had the necessary licenses, but then again maybe not. The same question came up with respect to Raza and Cavium in the MIPS space. I was personally never very satisfied with the answers I got when I or others asked about such things, but then I was barely qualified to be asking. Maybe one of my erstwhile colleagues could shed more light here.
  • It was probably not feasible to make and sell machines even smaller than the SC072. I and others did favor this idea, largely based on the four-node “Frost” machines we’d used during development, but the people making such decisions said that we could not afford the sales and support infrastructure necessary for that market. Having seen the margins and support costs for the SC072 up close and personal, I agree that it would not have been profitable. An argument might still be made that such a machine would have been worthwhile even if it had not been profitable in and of itself, as a way of establishing mind share etc., but that’s not clearly the case either.

As I said in my other post, the failure of SiCortex had nothing to do with the technical merit of the product. I don’t think it even had anything to do with the market merit. Don’t believe the disgruntled ex-employee who now works for another player in this space and seems to make a point of going around to every article or story about SiCortex to comment anonymously about CPU or memory performance. Sour grapes, dude. The machine was selling OK, though of course not as well as anyone had hoped. Where we failed was in “selling” the company to investors so we could finish building the next generation, and in any other VC market even that might have been an easy sell. After all of the times I’ve seen charlatans do a better job of selling a story to investors than of selling a product to users, I think SiCortex still stands as an example of the exact opposite. We were for real. It’s the investors who turned out to be fake.

Back on Top

Google finally has me at #1 for my own name again. Take that, Jeff Darcy!

(Seriously, I couldn’t be in better company. Jeff’s a fine cartoonist whose work I enjoy . . . at least when it’s not too Ohio-specific for a Massachusetts guy to understand. I wish him the best of everything . . . except for the top spot on the top search engine.)