Dealing With Distributed State

At the Linux Foundation’s recent End User Summit, I had the pleasure of meeting K.S. Bhaskar from FIS. Recently he wrote an article on his blog about Eventual State Consistency vs. Eventual Path Consistency in which he has some particularly interesting things about different kinds of consistency guarantees.

there are applications where detection of colliding updates does not suffice to ensure Consistency (where Consistency is defined as the database always being within the design parameters of the application logic – for example, the “in balance” guarantee that the sums of the assets and liabilities columns of a balance sheet are always equal).

He then gives an example showing an apparent problem with two financial transactions and their associated service charges, across two sites while a service-charge rate change is still “in flight” between them. I originally responded there, but my reply seems to have disappeared. Maybe it got lost due to a conflict with a subsequent update. ;) In any case, I might as well respond here because I think his example highlights an important issue. I don’t think Bhaskar’s example really demonstrates the problem he had described. In the last step he says that

B detects a collision, since it observes that the data that was read by the application logic on A to compute the result of transaction P has changed on B

How could B observe such a thing? Only if it knew either the data that was read on A (i.e. the service-charge rate in effect for the transaction was included as part of the replication request) or the exact replication state on A at the time P was processed there (e.g. by using vector clocks or similar). Either way, it would have enough information to replicate the transaction in a consistent fashion.

The real problem would be if B didn’t know whether or not the rate change had reached A yet when P was processed there. That would result in B needing to distinguish between two possible states that would have to be handled differently, but with no way to make that distinction. The general rule to avoid these kinds of unresolvable conflicts is: don’t pass around references to values that might be inconsistent across systems. It’s like passing a pointer from one address space to a process in another; you just shouldn’t expect it to work. Either pass around the actual values or do calculations involving those values and replicate the result. For example, consider the following replication requests.

# $var indicates immediate substitution from the original context
# %var indicates a transaction-local variable

# Wrong: sc_rate is passed by reference and interpreted at destination
replicate transaction "transfer #zzz" {
    acct_x -= $amt * (1.0 + sc_rate);
    acct_y += $amt;

# Right: sc_rate is interpreted at source and passed by value
replicate transaction "transfer #zzz" {
    %sc_rate = $sc_rate;
    acct_x -= $amt * (1.0 + %sc_rate);
    acct_y += $amt;

# Right: service charge is calculated at source
# works, but not good for auditing
amt_with_sc = amt * (1.0 + sc_rate)
replicate transaction "transfer #zzz" {
    acct_x -= $amt_with_sc;
    acct_y += $amt;

# Right: service charge as separate transaction
sc = amt * sc_rate;
replicate transaction "transfer #zzz" {
    acct_x -= $amt;
    acct_y += $amt;
replicate transaction "service charge for #zzz" {
    acct_x -= $sc;

In an ideal world, the interface and behavior for the replication subsystem would disallow or strongly discourage the wrong form. For example, it could require that any values meant to be interpreted or modified at the destination must be explicitly listed or tagged, and reject anything that abuses “extraneous” variables as in the first form above. (Auto-conversion of the first form into the second is likely to introduce its own kinds of unexpected behavior.) That would force people to use one of the methods that actually works.

Introduction to Distributed Filesystems

When Alex Popescu wrote about scalable storage solutions and I said that the omission of distributed filesystems made me cry, he suggested that I could write an introduction. OK. Here it is.

All filesystems – even local ones – have a similar data model and API model. The data model consists of files inside directories, where both have user-assigned names. In most modern filesystems directories can be nested, file contents are byte-addressable, and names are free-form character sequences. The API model, commonly referred to as POSIX after the standard of the same name, includes two broad categories of calls – those that operate on files, and those that operate on the “namespace” of files within directories. Examples of the first category include open, close, read, write, and fsync. Examples of the second category include opendir, readdir, link, unlink, and rename. People who actually develop filesystems, especially in a Linux context, often talk in terms of a different three-way distinction (file/inode/dirent operations) but that has more to do with filesystem internals than with the API users see. The other thing about filesystems is that they’re integrated into the operating system. Any program should be able to use any filesystem without using special libraries. That makes real filesystems a bit harder to implement, but it also makes them more generally useful than impostors that just have “FS” in the name to imply functionality they don’t have. There are many ways to categorize filesystems – according to how they’re accessed, how they’re implemented, what they’re good for, and so on. In the context of scalable storage solutions, though, the most important groupings are these.

  • A local filesystem is local to a single machine, in the sense that only a process on the same machine can make POSIX calls to it. That process might in fact be a server for some “higher level” kind of filesystem, and in fact local filesystems are an essential building block for most others, but for this to work the server must make a new local-filesystem call which is not quite the same as continuing the client’s call.
  • A network filesystem is one that can be shared, but where each client communicates with a single server. NFS (versions 2 and 3) and CIFS (a.k.a. SMB which is what gives Samba its name) are the best known examples of this type. Servers can of course be clustered and made highly available and so on, but this must be done transparently – almost behind the clients’ backs or under their noses. This approach fundamentally only allows vertical scaling, and the trickery necessary to scale horizontally within a single-server basic model can become quite burdensome.
  • A distributed filesystem is one in which clients actually know about and directly communicate with multiple servers (of one or more types). Lustre, PVFS2, GlusterFS, and Ceph all fit into this category despite their other differences. Unfortunately, the term “distributed filesystem” makes no distinction between filesystems distributed across a fast and lossless LAN and those distributed across a WAN with exactly opposite characteristics. I sometimes use “near-distributed” and “far-distributed” to make this distinction, but as far as I know there’s no concise and commonly accepted terms. AFS is the best known example of a far-distributed filesystem, and one of the longest-lived filesystems in any category (still in active large-scale use at several places I know of).
  • A parallel filesystem is a distributed filesystem in which a single file, or even a single I/O request, can be striped across multiple servers. This is primarily beneficial in terms of performance, but can also help to distribute capacity more evenly than if every file lives on exactly one server. I’ve often used the term to refer to near-distributed filesystems as distinct from their far-distributed cousins, because there’s a high degree of overlap, but it’s not technically correct. There are near-distributed filesystems that aren’t parallel filesystems (GlusterFS is usually configured this way) and parallel filesystems that are not near-distributed (Tahoe-LAFS and other crypto-oriented filesystems might fit this description).
  • A cluster or shared-storage filesystem is one in which clients are directly attached to shared block storage. GFS2 and OCFS2 are the best known examples of this category, which also includes MPFS. Once touted as a performance- or scale-oriented solution, these are now positioned mainly as availability solutions with a secondary emphasis on strong data consistency (compared to the weak consistency offered by many other network and distributed filesystems). Due to this focus and the general characteristics of shared block storage, the distribution in this case is always near.

This set of distinctions is certainly neither comprehensive nor ideal, as illustrated by pNFS which allows multiple “layout” types. With a file layout, pNFS would be a distributed filesystem by these definitions. With a block layout, it would be a cluster filesystem. With an object layout, a case could be made for either, and yet all three are really the same filesystem with (mostly) the same protocol and (definitely) the same warts.

One of the most important distinctions among network/distributed/cluster filesystems, from a scalability point of view, is whether it’s just data that’s being distributed or metadata as well. One of the issues I have with Lustre, for example, is that it relies on a single metadata server (MDS). The Lustre folks would surely argue that having a single metadata server is not a problem, and point out that Lustre is in fact used at some of the most I/O-intensive sites in the world without issue. I would point out that I have actually watched the MDS melt down many times when confronted with any but the most embarrassingly metadata-light workloads, and also ask why they’ve expended such enormous engineering effort – on at least two separate occasions – trying to make the MDS distributed if it’s OK for it not to be. Similarly, with pNFS you get distributed data but only some pieces of the protocol (and none of any non-proprietary implementation) to distribute metadata as well. Anybody who wants a filesystem that’s scalable in the same way that non-filesystem data stores such as Cassandra/Riak/Voldemort are scalable would and should be very skeptical of claims made by advocates of a distributed filesystem with non-distributed metadata.

A related issue here is of performance. While near-distributed parallel filesystems can often show amazing megabytes-per-second numbers on large-block large-file sequential workloads, as a group they’re notoriously poor for random or many-small-file workloads. To a certain extent this is the nature of the beast. If files live on dozens of servers, you might have to contact dozens of servers to list a large directory, or the coordination among those servers to maintain consistency (even if it’s just metadata consistency) can become overwhelming. It’s harder to do things this way than to blast bits through a simple pipe between one client and one server without any need for further coordination. Can Ma’s Pomegranate project deserves special mention here as an effort to overcome this habitual weakness of distributed filesystems, but in general it’s one of the reasons many have sought alternative solutions for this sort of data.

So, getting back to Alex’s original article and my response to it, when should one consider using a distributed filesystem instead of an oh-so-fashionable key/value or table/document store for one’s scalable data needs? First, when the data and API models fit. Filesystems are good at hierarchical naming and at manipulating data within large objects (beyond the whole-object GET and PUT of S3-like systems), but they’re not so good for small objects and don’t offer the indices or querying of databases (SQL or otherwise). Second, it’s necessary to consider the performance/cost curve of a particular workload on a distributed filesystem vs. that on some other type of system. If there’s a fit for data model and API and performance, though, I’d say a distributed filesystem should often be preferred to other options. The advantage of having something that’s accessible from every scripting language and command-line tool in the world, without needing special libraries, shouldn’t be taken lightly. Getting data in and out, or massaging it in any of half a million ways, is a real problem that isn’t well addressed by any storage system with a “unique” API (including REST-based ones) no matter how cool that storage system might be otherwise.

Morning in Tempe

I’m in Tempe for the Fedora Users and Developers Conference, a.k.a. FUDCon. Here are some random thoughts.

  • Enhanced pat-downs aren’t so bad.
  • The weather’s nice. I should have expected the palm trees, but I totally didn’t expect to see orange trees with ripe fruit hanging just out of arm’s reach (because the ASU students picked everything lower already).
  • The ASU campus is much more interesting and varied architecturally than any other campus I’ve been on. Sure, the color palette is a bit limited – light brown, dark brown, reddish brown – but the shapes and textures make up for it. Actually there was one nice splash of color, which was a gigantic wild rose bush clinging to the side of one building. That ugly bump just north of campus doesn’t do much for me, though.
  • I haven’t seen a single squirrel on campus. I did see two cats, though – fluffy persians who must be very uncomfortable in all this heat. I’ve seen and heard lots of unfamiliar birds, too – mostly grackles, I think
  • Meeting people in person is great. The Fedora crowd is notably casual, international, and friendly – even by technical-conference standards, in all three regards. I’d particularly like to thank Robyn Bergeron and Seth Vidal, very busy leaders in that community who have nonetheless gone out of their way to make me feel welcome and included. It was also especially nice to meet Pete Zaitcev and Major Hayden, because we’ve interacted so much online but never met until now.

Here’s a Flickr set for some of the pictures I’ve taken while here. OK, enough of the fluff. What about the real stuff? More bullet points, because that’s how I roll.

  • The whole “Bar Camp” style of pitching and voting on sessions was new to me. It did seem to work, though.
  • The first talk I attended was Marek Goldmann talking about BoxGrinder. I was pretty familiar with this work from my own involvement with Deltacloud/Aeolus, but Marek deserves kudos for presenting it well and even giving a live demo.
  • After lunch, it was Steven Dake talking about Sheepdog. Again, it’s work I’m familiar with. I think Steven and I will never quite agree on the value/importance of Sheepdog. On the one hand, the notion of distributed block storage has been very appealing to me for a long time. It’s why I went to Conley in 1998, and worked on C3D at EMC a few years later. On the other hand, block storage using a single specialized application interface which isn’t even as complex as the real system-level block device interface seems a bit unambitious to me. It just limits the applicability of the result too much IMO, and that seems a meager payoff for all that work solving the harder distributed-data problems. Of course, in this case it’s all NTT’s effort anyway. As far as the talk, a comparison to RBD would have been nice since anybody who’s interested in one should definitely check out the other as well.
  • Next up was Mike McGrath, talking about how cloud computing is going to displace non-cloud computing. Even as somebody who’s working on cloud stuff, I’m a little bit skeptical. Still, it was a good talk to get people thinking about all the implications.
  • I’ll skip the next talk, since it was mine and I’ll have more to say below.
  • The last talk of the day, for me, was Chris Lalancette talking about cloud management – especially Deltacloud and Aeolus. Having worked for a while on this project (and sitting about twenty feet from Chris most days) this was also pretty familiar territory, and Chris did a good job presenting on a complex subject. I apologize to both him and to Tobias Kunze (with whom I had an awesome chat later in the evening BTW) for putting them on the spot about the relationship between Makara and Aeolus.

So, how did my own presentation go? Somebody pointed out that I’d seemed a bit on edge the night before. Partly that was just the stress of travel and of being an introvert mingling with an unfamiliar group of people, but there’s another factor that I hadn’t even consciously realized until I was writing this post. I’ve presented about CloudFS privately and/or in fairly abstract terms so many times that I’d actually forgotten this was the first truly public presentation about a concrete thing that I’ll actually be delivering in the near future. That’s a big deal. I was a bit concerned at first because they’d put me in the largest room and at five past the hour it was still three-quarters empty. Nobody likes talking to an empty room. Shortly after I started, though, the room was pretty much full – not standing-room-only full, but I don’t remember seeing many empty seats. Not that I was trying too hard to count, of course; I was otherwise occupied. Even better, people were engaged. There were many questions, and they were good questions – questions that to me indicated genuine curiosity and constructive intent, not just the “I’m going to prove I’m smart” or “if you don’t get this one right your project will look silly” kinds of questions that one often gets. The post-presentation chatter even went on so long that Chris had to kick us away from the lectern. Good problem to have. :)

The best part of all, in my opinion, was outside of the talk itself. In at least two other presentations, and in even more hallway conversations, the possibility of using CloudFS to solve some problem or add some functionality came up. Also, at least one person had clearly given the code a pretty detailed look since my talk, asking questions and making comments about internal details that he could not have known about otherwise. That is so cool. It’s all very well to have people’s attention for an hour or so before people move on to the next new thing, but when something you’ve talked about shows up in colleagues’ own thinking about how to solve their own problems that’s an even surer measure of being on the right track. Thank you, everyone, for letting me be part of the broader progress we’re all making together.

All Quiet on the Eastern Front

In case people are wondering why I haven’t been posting here, it’s partly being busy at work, partly being busy with Christmas stuff, partly being sick with what has come to be known in the office as the Creeping Death, and partly because I’ve been writing a lot over at If you’re interested in the why and what of CloudFS, go check it out.

CloudFS on the Horizon

I try to keep blogging about $dayjob to an absolute minimum, but this is kind of a big deal. Today, I pushed the first bits of CloudFS out to a public git repository. A week or so ago I registered to hold non-code content, but it’s not really up yet so for now you can clone that repository and look at the therein, or at the Fedora 15 feature page. Here are a few nuggets to get you started.

  • CloudFS is basically GlusterFS enhanced to allow deployment by a provider as a permanent shared service, rather than as a private thing that users can run within their own compute instances.
  • The enhancements necessary fall into several broad categories: authentication, encryption, isolation (each tenant gets their own namespace and UID/GID space), quota/billing, and some necessary enhancements to existing GlusterFS functionality.
  • This is a very pragmatic and unambitious release, explicitly not including the improved-DHT and multi-site-replication functionality that I think will make CloudFS really cool. Think of it as a warm-up to the main attraction.
  • The code is nowhere near complete. The three translators I’ve written are complete enough to do useful things – and more importantly to be worth reviewing – but all need to be improved in various ways and there are other bits (mostly around configuration and management) that don’t even exist yet. To put it another way, I think the code represents that point on a journey where you’ve climbed the highest hill and can see the destination, but there are many miles yet to be walked.

Once I get set up the rest of the way, I’ll probably start posting more info there. Stay tuned.

Caching vs. Replication II

Since this is a subject that has been much on my mind lately, I’m going to take another shot at discussing and clarifying the differences between caches and replicas. I’ve written about this before, but I’m going to take a different perspective this time and propose a different distinction.

A replica is supposed to be complete and authoritative. A cache can be incomplete and/or non-authoritative.

I’m using “suppose” here in the almost old-fashioned sense of assume or believe, not demand or require. The assumption or belief might not actually reflect reality. A cache might in fact be complete, while a replica might be incomplete – and probably will be, when factors such as propagation delays and conflict resolution are considered. The important thing is how these two contrary suppositions guide behavior when a client requests data. This distinction is most important in the negative case: if you can’t find a datum in a replica then you proceed as if it doesn’t exist anywhere, but if you can’t find a datum in a cache then you look somewhere else. Here are several other possible distinctions that I think do not work as well.

  • Read/write vs. read-only. CPU caches and NoSQL replicas are usually writable, while web caches and secondary replicas used for Disaster Recovery across a WAN are more often read-only.
  • Hierarchy vs. peer to peer. CPU caches are usually hierarchical with respect to (more authoritative) memory, but treat each other as peers in an SMP system – and don’t forget COMA either. On the flip side, the same NoSQL and DR examples show that both arrangements apply to replicas as well.
  • Increasing performance vs. increasing availability. This is the distinction I tried to make in my previous article, and in practice that distinction often applies to the domain where I made it (see the comments). I’m not backing away from the claim that a cache which fails to improve performance or a replica which fails to improve availability has failed, but either a cache or a replica can serve both purposes so I don’t think it’s the most essential distinction between the two.
  • Pull vs. push. Caches usually pull data from a more authoritative source, while replicas are usually kept up to date by pushing from where data was modified, but counterexamples do exist. Many caches do allow push-based warming, and in the case of a content delivery network there’s probably more pushing than pulling. I’m having trouble thinking of a pull-based replica that’s not really more of a proxy (possibly with a cache attached) but the existence of push-based caches already makes this distinction less essential. One thing that does remain true is that a cache can and will fall back to pulling data for a negative result, while a replica either can’t or won’t.
  • Independent operation. Again, replicas are usually set up so that they can operate independently (i.e. without connection to any other replica) while caches usually are not, but this is not necessarily the case. For example, Coda allowed users to run with a disconnected cache.

Now that I’ve said that some of these differences are inessential, I’m going to incorporate them into two pedagogical examples for further discussion. Let’s say that you’re trying to extend your existing primary data store with either a push-based peer-to-peer replica or a pull-based hierarchical cache at a second site. What are the practical reasons for choosing one or the other? The first issue is going to be capacity. A replica requires capacity roughly equal to that of any other replica; a cache may be similarly sized, but may also and most often will be smaller. While storage is cheap, it does add up and for very large data sets a (nearly) complete replica at the second site might not be an economically feasible option. The second issue is going to be at the intersection of network bandwidth/latency/reliability and consistency. Consider the following two scenarios:

  • For a naively implemented push-based replica, each data write is going to generate cross-site traffic. If the update rate is high, then this is going to force a tradeoff between high costs for high bandwidth vs. low consistency at low bandwidth. You can reduce the cross-site bandwidth by delaying and possibly coalescing or skipping updates, but now you’re moving toward even lower consistency.
  • For a pull-based cache, your tradeoff will be high cost for a large cache vs. high latency for a small one.

Yes, folks, we’re in another classic triangle – Price vs. Latency vs. Consistency. Acronyms and further analysis in those terms is left for the reader, except that I’ll state a preference for PLaC over CLaP. The point I’d rather make here is that the choice doesn’t have to be one or another for all of your data. Caches and replicas both have a scope. It’s perfectly reasonable to use caching for one scope and replication for another, even between two sites. The main practical consequences of being a cache instead of a replica are as follows.

  • You can drop data whenever you want, without having to worry about whether that was the only copy.
  • You can ignore (or shut off) updates that are pushed to you.
  • You can fall back to pulling data that’s not available locally.

If you have a data set that’s subdivided somehow, and a primary site that’s permanently authoritative for all subsets, then it’s easy to imagine dynamically converting caches into replicas and vice versa based on usage. All you need to do is add or remove these behaviors on a per-subset basis. In fact, you can even change how the full set is divided into subsets dynamically. Demoting from a replica to a cache is easy (so long as you’ve ensured the existence of an authoritative copy elsewhere); you just give yourself permission to start dropping data and updates. Promoting from a cache to a replica is trickier, since you have to remain in data-pulling cache mode until you’ve allocated space and completed filling the new replica, but it’s still quite possible. With this kind of flexibility, you could have a distributed storage system capable of adapting to all sorts of PLaC requirements as those change over time, instead of representing one fixed point where only the available capacity/bandwidth can change. Maybe some day I’ll implement such a system. ;)

November Tweets

For those (few) who follow me but not there, here are my top ten from November.

  • OH at museum: Now that we’ve appreciated all the diversity, can we please move on? (November 07)
  • If I have publicly and violently clashed with the founders, please pardon my raucous laughter when you try to recruit me. (November 09)
  • Using Inconsolata font for code editing. Quite nice. (November 11)
  • My take on the “your argument is invalid” meme, inspired by driftx on #cassandra. (November 12)
  • Tea Party yard work: borrow a neighbor’s leaf blower, then blow all your leaves onto his yard. (November 14)
  • shows a *negative* correlation between some popular coding-style preferences and actual readability. (November 15)
  • If you work in distributed systems but haven’t read Saito and Shapiro then fix that. (November 16)
  • How many applications have you used today? How many are you personally willing to rewrite to use an “alternative data store”? (November 29)
  • I am the king of . . . just a minute. Where was I? Oh yeah, the king of . . . just a sec . . . multitasking. (November 29)
  • If hand-waving built muscle, I’d know some very buff architects. (November 30)

October Tweets and Links

A little early, but it’s been a good month.

  • There are only two problems with distributed counters. Or maybe three.
  • Problems with .ly domains? Less money for Libyan government, some URL shorteners die. And the downside is . . . ?
  • Self-driving cars, offshore wind farms, embryonic stem cell treatments – all on one glance at the news. What times we live in.
  • Assembly Instructions from Hell
  • Sauron to bid for Tea Party leadership. It would be an improvement.
  • Caching separate from the DB (any type) is not the enterprise version of a DB which is inherently distributed. It’s the buggy version.
  • The last frame is so funny because so many really do think such “answers” are useful. Not Invented Here
  • Matchstick Tirith
  • Simply awesome microphotography.
  • Lord of the Rings + Rudolph the Red Nosed Reindeer
  • Sears for zombies. (“Afterlife. Well spent.”)
  • Anybody who refers to an “internet cable” shouldn’t do distributed programming.
    [Just for you, PeteZ: the person was referring to a physical object, not just internet service from a cable company]
  • App devs shouldn’t create infrastructure . . . not because their time is better spent, but because they suck at it.

When Partitions Attack

Stu Charlton has written an interesting post called Confused CAP Arguments. The tone is mostly civil and reasonable, though it’s a bit disappointing to see some of the negative stereotypes (“cocktails…newly-graduated devs”) and strawmen (“only using CAP”) repeated yet again. Oh, well. The issue of “is CA possible” seems to be rather central to Stu’s post, and it also came up in my last post in the subject, so I’ll delve into that aspect a bit and even share some of the war stories that I promised to.

First, though, a bit of background. Back in 1993, before either Stu or the current generation of anti-RDBMS hipsters entered the workforce, I was working on HACMP/6000 at Clam. At first I worked on the lock manager, which can be thought of as a key/value store where the values are very small but have complex semantics. A lot of the hashing and replication techniques it used would be instantly recognizable to NoSQL folks. Later, I was responsible for the network routing and monitoring infrastructure. The “DGSP” (Die, Gravy Sucking Pig) message that we used to enforce quorum was my invention. I became pretty intimately familiar with all of the ways such systems could fail. In particular, I became familiar with how even systems with redundant network hardware and cutting-edge network software could still experience network partitions. To address one of Stu’s points, about treating network failures as the only kind, one observation I made pretty early on is that for all practical purposes they are the only kind. It’s not hard to promote any local failure to a node failure, and it greatly simplifies the rest of the system if you do. Anyway, the real point is that I might not have been a database guy myself, but I directly supported them (including time on-site at Oracle HQ) so I have very little patience for database guys trying to pretend that only they have the proper context to understand the application or business value of various tradeoffs in this area. IBM wouldn’t have sold tens of thousands of HACMP licenses if we hadn’t thoroughly understood those same issues. So, back to the issue at hand. Here’s one of the less coherent parts of Stu’s post.

What is Partition Intolerance?

This is never well defined. Like lactose intolerance, the results, can be, err, unpredictable. It seems to imply one of: complete system unavailability, data inconsistency, or both.

First, data inconsistency is a sacrifice of C. That’s pretty clearly defined. Gilbert and Lynch defined P thus: the system continues to operate despite arbitrary message loss. That’s still a bit vague both in terms of “the system” (all or part) and “continues to operate” (as before or degraded somehow). I tried to provide a clearer interpretation a while back, distinguishing P from A by contrasting the effect of a partition on requests vs. nodes. In short, a non-A system fails some nodes, while non-P system fails some requests. Some might disagree with that definition – several did in the comments to that article – but I think it’s clear enough to make “never well defined” untrue.

What Stu does get absolutely right is that the effect of a partition in a partition-intolerant system can be unpredictable. This is why I don’t like CA systems much. If you’ve partitioned your data somehow and some nodes become unreachable, then some requests will be unable to reach the data (or acquire the locks) that they need, and a CA system must either block or fail such requests. If it could satisfy them by eliminating possible sources of conflict, e.g. by forcing nodes down and breaking their locks, then it would be a CP system (with reduced capacity). If it could satisfy them by allowing the conflict and hoping to perform some sort of resolution later, it would be an AP system (with reduced consistency). As a CA system, preserving both capacity and consistency, block or fail are the only options. Failing requests is the better option, but is unfortunately rarely chosen because people don’t usually seem to choose CA. Instead, they fall into it because they try to deny that partitions could occur. They have unbounded faith in the efficacy of redundant networks. Anybody who has seen the dark underbelly of network infrastructure knows how misplaced that faith can be. In “CAP-naive” systems, the most common behavior I used to see was that some requests would succeed and others would block indefinitely waiting for some resource they needed. It’s bad enough that hash-based resource distribution would cause these blockages to be completely unpredictable, but there’s an even bigger iceberg to worry about: cascading failures. Database requests often need many resources, and if they block unexpectedly while acquiring the fifth then they block while still holding the other four. That would often cause even more requests to block waiting for those four, and so on until practically every resource is unavailable and the whole system locks up. This isn’t theoretical. I’ve seen it happen many many times. I’ve seen it happen almost instantaneously when a critical resource becomes unreachable in a busy system, and I’ve seen it unfold slowly over hours or even days when a less popular resource did so in a lightly loaded system. It’s not some exceptional exotic behavior in CA systems. It’s the norm when partitions occur – and they do.

OK, so this can happen. Is it worth designing around? Implementing redundant networks can reduce the frequency of partitions to very low levels. That means some additional cost for network hardware and software, but without replication to ensure reachability (on top of potentially hardware-assisted replication to ensure durability) the savings on node and storage hardware to reach fixed capacity/performance goals might be even greater. Let’s just eat the cost on those rare occasions when the unthinkable happens. Doesn’t that make more economic sense? The first argument against that is that partitions even in redundant networks are more common than most people think, so the economic cost of such failures is likewise greater. I think the more compelling argument, though, is that a better tradeoff is practically always available. Any CA system which degrades messily into zero capacity when a partition occurs can be easily replaced with a CP system which degrades cleanly to half capacity under the same conditions. The technology for quorum-enforcing CP systems is mature and readily accessible, the configurations and running costs are exactly the same, so there’s no good reason to implement a CA system instead. Every CA system you see is the result not of diligent design but of (often willful) ignorance. CA systems can exist; they just shouldn’t.

Someone is Wrong on the Internet

There are many ways to be wrong in a technical discussion, but the ways usually fall into one of two categories – ways that can easily be corrected, and ways that can not. In the first case, somebody else can provide a counterexample or explanation that rests on only a few widely agreed-upon axioms, so that even those not particularly familiar with the subject matter can see that the original claim was wrong. In the second case, showing how or why a claim is wrong can require making subtle distinctions or pointing out less well-known facts. In the process, the whole discussion can often be drawn into a morass of contested claims leading to no clear conclusion. The first type of wrongness often isn’t worth addressing at all – the author and/or others will quickly spot the error without any assistance – or can be dealt with quickly. The second kind of wrongness can persist almost indefinitely, but still needs to be addressed because it can trap the unwary and lead them to waste a lot of time pursuing bad ideas when they could have been pursuing good ones. As futile as it may be, I’m going to address one particularly pernicious example regarding a topic I’ve written about many times – the infamous CAP theorem.

xkcd #386

In his latest missive, Michael Stonebraker claims to be addressing some mis-perceptions of his views on Brewer’s CAP theorem. I would say many of those “mis-perceptions” are entirely accurate, but it’s no surprise that he’d want to back-pedal a bit. Unfortunately, it’s not very effective to fight mischaracterization of one’s views by tossing out mischaracterizations of others. For example (near the end):

In summary, appealing to the CAP theorem exclusively for engineering guidance is, in my opinion, inappropriate.

I’d like to know who’s appealing to the CAP theorem exclusively. Not me. Not Coda Hale, who refers to the three properties of CAP as “Platonic ideals” and repeatedly refers to design heuristics or compromises involving more than just CAP. Not anyone I’ve seen involved in this discussion. The most extreme view I’ve seen is the complete rejection of CAP by those who just can’t let go of consistency and transactions. Those are easy models, it’s not hard to understand their appeal, but sometimes they’re just not appropriate for the problem at hand. Just like simpler Newtonian physics had to yield to the “weird” models proposed by Planck or Einstein, computing has moved into the “relativistic” world of Lamport or Lynch. It’s a world where there is no such thing as absolute time in any system with more than one physical clock, and where the same event for all practical purposes occurs at different times in different places (even without network partitions). The only concepts of time that matter in such systems are before/after and causality, but not duration. Because of this, node speed really doesn’t matter except to the extent that it affects the number of nodes you need to perform a task. As Stonebraker puts it,

Next generation DBMS technologies, such as VoltDB, have been shown to run around 50X the speed of conventional SQL engines. Thus, if you need 200 nodes to support a specific SQL application, then VoltDB can probably do the same application on 4 nodes. The probability of a failure on 200 nodes is wildly different than the probability of failure on four nodes.

What he fails to mention is that VoltDB gains that speed mostly by being a memory-based system, with all of the data-protection and capacity limitations that implies. If you need 200 nodes to support a specific SQL application, then you might still need 200 nodes not because of performance but because of capacity, so VoltDB won’t do the same job on 4 nodes. That’s exactly the kind of “trap the unwary” omission I was talking about.

He’s right on the last point, though: the probability of failure of 200 nodes really is wildly different than the same probability on 4 – exactly the point Coda makes, even doing one better by providing the actual formula, in his supposedly “misrepresentative” article. However, it’s worth examining the causes of node failures. Again, Stonebraker’s take.

The following important sources of outages are not considered in the CAP theorem.

Bohrbugs. These are repeatable DBMS errors that cause the DBMS to crash…

Application errors. The application inadvertently updates (all copies) of the data base…

Human error. A human types the database equivalent of RM * and causes a global outage…


Back here in reality, most Bohrbugs will cause a single node to crash, and the very relativity that makes distributed systems so challenging also makes it very unlikely that other nodes will experience the exact same sequence of events that triggers the failure. Other than overload, bugs that cause such “contagion” to take down the entire system are very rare. That’s why they’re newsworthy. You never see anything about twenty servers at Google or Yahoo failing, because that happens every day and because the people who designed those systems understand how to deal with it. More about that in a moment.

Going down the list, of course CAP doesn’t address application or human errors. Neither does Stonebraker’s approach. Neither can, because neither can control how applications or humans behave. Application errors have to be fixed in the applications, and human errors have to be fixed at a higher level too – e.g. by using automation to minimize the need for human intervention. It’s not worth talking about the cases where no tradeoffs are possible. What do you “trade off” to make human error disappear? Citing these kinds of errors as shortcomings of CAP, without noting their more general intractability, is just another dirty trick. As for reprovisioning as a “stop the world” operation, Benjamin Black and others have already pointed out that it’s simply not so for them . . . and I’ll add that it need not be so even in a more consistency-oriented world. In any system that can survive a failure of some nodes, those nodes can be upgraded while they’re offline but the rest of the system keeps running. The fact that some systems don’t have that property is merely a deficiency in their implementation, not a commentary on CAP.

What I find most misguided about Stonebraker’s article, though, is this.

In my experience, network partitions do not happen often. Specifically, they occur less frequently than the sum of bohrbugs, application errors, human errors and reprovisioning events. So it doesn’t much matter what you do when confronted with network partitions. Surviving them will not “move the needle” on availability because higher frequency events will cause global outages. Hence, you are giving up something (consistency) and getting nothing in return.

So, because network partitions occur less than some other kind of error, we shouldn’t worry about them? Because more people die in cars than in planes, we shouldn’t try to make planes safer? Also, notice how he says that network partitions are rare in his experience. His experience may be vast, but much of it is irrelevant because the scale and characteristics of networks nowadays are unlike those of even five years ago. People with more recent experience at higher scale seem to believe that network partitions are an important issue, and claiming that partitions are rare in (increasingly common) multi-datacenter environments is just ridiculous. Based on all this plus my own experience, I think dealing with network partitions does “move the needle” on availability and is hardly “nothing in return” at all. Sure, being always prepared for a partition carries a cost, but so does the alternative and that’s the whole point of CAP.

Remember how I exempted overload from my comment about systemic failure? I did that because the #1 cause of system overload is parts of the system being too dependent on another. Sooner or later, even in the best-designed system, some node somewhere is going to become overloaded. The more nodes you have waiting synchronously for each others’ responses, as they must when the system is built around consistency or consensus, the more likely it becomes that the local bottleneck will turn into a global traffic jam. Forcing non-quorum nodes down to preserve consistency among the rest – the other part of the traditional approach that Stonebraker clings to – only makes this even more likely because it increases load on the survivors. That’s the other part of why you don’t hear about a few nodes going down at Facebook or Amazon. Their engineers know that consistency has a cost too. Consistency means coupling, and coupling is bad for availability, and availability matters more to some people than consistency.

The conclusion, then, is much as it was before. Partitions are real, and significant, so we’re left with a tradeoff between consistency and availability. Real engineers make that tradeoff in various ways. Other people try to deny that the tradeoff is necessary, or that any choice other than their own might be valid, and they make up all manner of counter-factual reasons why. Which would you rather be?