Canned Platypus

Saving the world one byte at a time since 2000

Archive for the ‘distributed’ Category

I try to keep blogging about $dayjob to an absolute minimum, but this is kind of a big deal. Today, I pushed the first bits of CloudFS out to a public git repository. A week or so ago I registered cloudfs.org to hold non-code content, but it’s not really up yet so for now you can clone that repository and look at the README.md therein, or at the Fedora 15 feature page. Here are a few nuggets to get you started.

  • CloudFS is basically GlusterFS enhanced to allow deployment by a provider as a permanent shared service, rather than as a private thing that users can run within their own compute instances.
  • The enhancements necessary fall into several broad categories: authentication, encryption, isolation (each tenant gets their own namespace and UID/GID space), quota/billing, and some necessary enhancements to existing GlusterFS functionality.
  • This is a very pragmatic and unambitious release, explicitly not including the improved-DHT and multi-site-replication functionality that I think will make CloudFS really cool. Think of it as a warm-up to the main attraction.
  • The code is nowhere near complete. The three translators I’ve written are complete enough to do useful things – and more importantly to be worth reviewing – but all need to be improved in various ways and there are other bits (mostly around configuration and management) that don’t even exist yet. To put it another way, I think the code represents that point on a journey where you’ve climbed the highest hill and can see the destination, but there are many miles yet to be walked.

Once I get cloudfs.org set up the rest of the way, I’ll probably start posting more info there. Stay tuned.

Since this is a subject that has been much on my mind lately, I’m going to take another shot at discussing and clarifying the differences between caches and replicas. I’ve written about this before, but I’m going to take a different perspective this time and propose a different distinction.

A replica is supposed to be complete and authoritative. A cache can be incomplete and/or non-authoritative.

I’m using “suppose” here in the almost old-fashioned sense of assume or believe, not demand or require. The assumption or belief might not actually reflect reality. A cache might in fact be complete, while a replica might be incomplete – and probably will be, when factors such as propagation delays and conflict resolution are considered. The important thing is how these two contrary suppositions guide behavior when a client requests data. This distinction is most important in the negative case: if you can’t find a datum in a replica then you proceed as if it doesn’t exist anywhere, but if you can’t find a datum in a cache then you look somewhere else. Here are several other possible distinctions that I think do not work as well.

  • Read/write vs. read-only. CPU caches and NoSQL replicas are usually writable, while web caches and secondary replicas used for Disaster Recovery across a WAN are more often read-only.
  • Hierarchy vs. peer to peer. CPU caches are usually hierarchical with respect to (more authoritative) memory, but treat each other as peers in an SMP system – and don’t forget COMA either. On the flip side, the same NoSQL and DR examples show that both arrangements apply to replicas as well.
  • Increasing performance vs. increasing availability. This is the distinction I tried to make in my previous article, and in practice that distinction often applies to the domain where I made it (see the comments). I’m not backing away from the claim that a cache which fails to improve performance or a replica which fails to improve availability has failed, but either a cache or a replica can serve both purposes so I don’t think it’s the most essential distinction between the two.
  • Pull vs. push. Caches usually pull data from a more authoritative source, while replicas are usually kept up to date by pushing from where data was modified, but counterexamples do exist. Many caches do allow push-based warming, and in the case of a content delivery network there’s probably more pushing than pulling. I’m having trouble thinking of a pull-based replica that’s not really more of a proxy (possibly with a cache attached) but the existence of push-based caches already makes this distinction less essential. One thing that does remain true is that a cache can and will fall back to pulling data for a negative result, while a replica either can’t or won’t.
  • Independent operation. Again, replicas are usually set up so that they can operate independently (i.e. without connection to any other replica) while caches usually are not, but this is not necessarily the case. For example, Coda allowed users to run with a disconnected cache.

Now that I’ve said that some of these differences are inessential, I’m going to incorporate them into two pedagogical examples for further discussion. Let’s say that you’re trying to extend your existing primary data store with either a push-based peer-to-peer replica or a pull-based hierarchical cache at a second site. What are the practical reasons for choosing one or the other? The first issue is going to be capacity. A replica requires capacity roughly equal to that of any other replica; a cache may be similarly sized, but may also and most often will be smaller. While storage is cheap, it does add up and for very large data sets a (nearly) complete replica at the second site might not be an economically feasible option. The second issue is going to be at the intersection of network bandwidth/latency/reliability and consistency. Consider the following two scenarios:

  • For a naively implemented push-based replica, each data write is going to generate cross-site traffic. If the update rate is high, then this is going to force a tradeoff between high costs for high bandwidth vs. low consistency at low bandwidth. You can reduce the cross-site bandwidth by delaying and possibly coalescing or skipping updates, but now you’re moving toward even lower consistency.
  • For a pull-based cache, your tradeoff will be high cost for a large cache vs. high latency for a small one.

Yes, folks, we’re in another classic triangle – Price vs. Latency vs. Consistency. Acronyms and further analysis in those terms is left for the reader, except that I’ll state a preference for PLaC over CLaP. The point I’d rather make here is that the choice doesn’t have to be one or another for all of your data. Caches and replicas both have a scope. It’s perfectly reasonable to use caching for one scope and replication for another, even between two sites. The main practical consequences of being a cache instead of a replica are as follows.

  • You can drop data whenever you want, without having to worry about whether that was the only copy.
  • You can ignore (or shut off) updates that are pushed to you.
  • You can fall back to pulling data that’s not available locally.

If you have a data set that’s subdivided somehow, and a primary site that’s permanently authoritative for all subsets, then it’s easy to imagine dynamically converting caches into replicas and vice versa based on usage. All you need to do is add or remove these behaviors on a per-subset basis. In fact, you can even change how the full set is divided into subsets dynamically. Demoting from a replica to a cache is easy (so long as you’ve ensured the existence of an authoritative copy elsewhere); you just give yourself permission to start dropping data and updates. Promoting from a cache to a replica is trickier, since you have to remain in data-pulling cache mode until you’ve allocated space and completed filling the new replica, but it’s still quite possible. With this kind of flexibility, you could have a distributed storage system capable of adapting to all sorts of PLaC requirements as those change over time, instead of representing one fixed point where only the available capacity/bandwidth can change. Maybe some day I’ll implement such a system. ;)

For those (few) who follow me but not there, here are my top ten from November.

  • OH at museum: Now that we’ve appreciated all the diversity, can we please move on? (November 07)
  • If I have publicly and violently clashed with the founders, please pardon my raucous laughter when you try to recruit me. (November 09)
  • Using Inconsolata font for code editing. Quite nice. (November 11)
  • My take on the “your argument is invalid” meme, inspired by driftx on #cassandra. http://imgur.com/WtiFL (November 12)
  • Tea Party yard work: borrow a neighbor’s leaf blower, then blow all your leaves onto his yard. (November 14)
  • http://www.cs.virginia.edu/~weimer/ shows a *negative* correlation between some popular coding-style preferences and actual readability. (November 15)
  • If you work in distributed systems but haven’t read Saito and Shapiro then fix that. (November 16)
  • How many applications have you used today? How many are you personally willing to rewrite to use an “alternative data store”? (November 29)
  • I am the king of . . . just a minute. Where was I? Oh yeah, the king of . . . just a sec . . . multitasking. (November 29)
  • If hand-waving built muscle, I’d know some very buff architects. (November 30)

A little early, but it’s been a good month.

  • There are only two problems with distributed counters. Or maybe three.
  • Problems with .ly domains? Less money for Libyan government, some URL shorteners die. And the downside is . . . ?
  • Self-driving cars, offshore wind farms, embryonic stem cell treatments – all on one glance at the news. What times we live in.
  • Assembly Instructions from Hell
  • Sauron to bid for Tea Party leadership. It would be an improvement.
  • Caching separate from the DB (any type) is not the enterprise version of a DB which is inherently distributed. It’s the buggy version.
  • The last frame is so funny because so many really do think such “answers” are useful. Not Invented Here
  • Matchstick Tirith
  • Simply awesome microphotography.
  • Lord of the Rings + Rudolph the Red Nosed Reindeer
  • Sears for zombies. (“Afterlife. Well spent.”)
  • Anybody who refers to an “internet cable” shouldn’t do distributed programming.
    [Just for you, PeteZ: the person was referring to a physical object, not just internet service from a cable company]
  • App devs shouldn’t create infrastructure . . . not because their time is better spent, but because they suck at it.

Stu Charlton has written an interesting post called Confused CAP Arguments. The tone is mostly civil and reasonable, though it’s a bit disappointing to see some of the negative stereotypes (“cocktails…newly-graduated devs”) and strawmen (“only using CAP”) repeated yet again. Oh, well. The issue of “is CA possible” seems to be rather central to Stu’s post, and it also came up in my last post in the subject, so I’ll delve into that aspect a bit and even share some of the war stories that I promised to.

First, though, a bit of background. Back in 1993, before either Stu or the current generation of anti-RDBMS hipsters entered the workforce, I was working on HACMP/6000 at Clam. At first I worked on the lock manager, which can be thought of as a key/value store where the values are very small but have complex semantics. A lot of the hashing and replication techniques it used would be instantly recognizable to NoSQL folks. Later, I was responsible for the network routing and monitoring infrastructure. The “DGSP” (Die, Gravy Sucking Pig) message that we used to enforce quorum was my invention. I became pretty intimately familiar with all of the ways such systems could fail. In particular, I became familiar with how even systems with redundant network hardware and cutting-edge network software could still experience network partitions. To address one of Stu’s points, about treating network failures as the only kind, one observation I made pretty early on is that for all practical purposes they are the only kind. It’s not hard to promote any local failure to a node failure, and it greatly simplifies the rest of the system if you do. Anyway, the real point is that I might not have been a database guy myself, but I directly supported them (including time on-site at Oracle HQ) so I have very little patience for database guys trying to pretend that only they have the proper context to understand the application or business value of various tradeoffs in this area. IBM wouldn’t have sold tens of thousands of HACMP licenses if we hadn’t thoroughly understood those same issues. So, back to the issue at hand. Here’s one of the less coherent parts of Stu’s post.

What is Partition Intolerance?

This is never well defined. Like lactose intolerance, the results, can be, err, unpredictable. It seems to imply one of: complete system unavailability, data inconsistency, or both.

First, data inconsistency is a sacrifice of C. That’s pretty clearly defined. Gilbert and Lynch defined P thus: the system continues to operate despite arbitrary message loss. That’s still a bit vague both in terms of “the system” (all or part) and “continues to operate” (as before or degraded somehow). I tried to provide a clearer interpretation a while back, distinguishing P from A by contrasting the effect of a partition on requests vs. nodes. In short, a non-A system fails some nodes, while non-P system fails some requests. Some might disagree with that definition – several did in the comments to that article – but I think it’s clear enough to make “never well defined” untrue.

What Stu does get absolutely right is that the effect of a partition in a partition-intolerant system can be unpredictable. This is why I don’t like CA systems much. If you’ve partitioned your data somehow and some nodes become unreachable, then some requests will be unable to reach the data (or acquire the locks) that they need, and a CA system must either block or fail such requests. If it could satisfy them by eliminating possible sources of conflict, e.g. by forcing nodes down and breaking their locks, then it would be a CP system (with reduced capacity). If it could satisfy them by allowing the conflict and hoping to perform some sort of resolution later, it would be an AP system (with reduced consistency). As a CA system, preserving both capacity and consistency, block or fail are the only options. Failing requests is the better option, but is unfortunately rarely chosen because people don’t usually seem to choose CA. Instead, they fall into it because they try to deny that partitions could occur. They have unbounded faith in the efficacy of redundant networks. Anybody who has seen the dark underbelly of network infrastructure knows how misplaced that faith can be. In “CAP-naive” systems, the most common behavior I used to see was that some requests would succeed and others would block indefinitely waiting for some resource they needed. It’s bad enough that hash-based resource distribution would cause these blockages to be completely unpredictable, but there’s an even bigger iceberg to worry about: cascading failures. Database requests often need many resources, and if they block unexpectedly while acquiring the fifth then they block while still holding the other four. That would often cause even more requests to block waiting for those four, and so on until practically every resource is unavailable and the whole system locks up. This isn’t theoretical. I’ve seen it happen many many times. I’ve seen it happen almost instantaneously when a critical resource becomes unreachable in a busy system, and I’ve seen it unfold slowly over hours or even days when a less popular resource did so in a lightly loaded system. It’s not some exceptional exotic behavior in CA systems. It’s the norm when partitions occur – and they do.

OK, so this can happen. Is it worth designing around? Implementing redundant networks can reduce the frequency of partitions to very low levels. That means some additional cost for network hardware and software, but without replication to ensure reachability (on top of potentially hardware-assisted replication to ensure durability) the savings on node and storage hardware to reach fixed capacity/performance goals might be even greater. Let’s just eat the cost on those rare occasions when the unthinkable happens. Doesn’t that make more economic sense? The first argument against that is that partitions even in redundant networks are more common than most people think, so the economic cost of such failures is likewise greater. I think the more compelling argument, though, is that a better tradeoff is practically always available. Any CA system which degrades messily into zero capacity when a partition occurs can be easily replaced with a CP system which degrades cleanly to half capacity under the same conditions. The technology for quorum-enforcing CP systems is mature and readily accessible, the configurations and running costs are exactly the same, so there’s no good reason to implement a CA system instead. Every CA system you see is the result not of diligent design but of (often willful) ignorance. CA systems can exist; they just shouldn’t.

There are many ways to be wrong in a technical discussion, but the ways usually fall into one of two categories – ways that can easily be corrected, and ways that can not. In the first case, somebody else can provide a counterexample or explanation that rests on only a few widely agreed-upon axioms, so that even those not particularly familiar with the subject matter can see that the original claim was wrong. In the second case, showing how or why a claim is wrong can require making subtle distinctions or pointing out less well-known facts. In the process, the whole discussion can often be drawn into a morass of contested claims leading to no clear conclusion. The first type of wrongness often isn’t worth addressing at all – the author and/or others will quickly spot the error without any assistance – or can be dealt with quickly. The second kind of wrongness can persist almost indefinitely, but still needs to be addressed because it can trap the unwary and lead them to waste a lot of time pursuing bad ideas when they could have been pursuing good ones. As futile as it may be, I’m going to address one particularly pernicious example regarding a topic I’ve written about many times – the infamous CAP theorem.

xkcd #386

In his latest missive, Michael Stonebraker claims to be addressing some mis-perceptions of his views on Brewer’s CAP theorem. I would say many of those “mis-perceptions” are entirely accurate, but it’s no surprise that he’d want to back-pedal a bit. Unfortunately, it’s not very effective to fight mischaracterization of one’s views by tossing out mischaracterizations of others. For example (near the end):

In summary, appealing to the CAP theorem exclusively for engineering guidance is, in my opinion, inappropriate.

I’d like to know who’s appealing to the CAP theorem exclusively. Not me. Not Coda Hale, who refers to the three properties of CAP as “Platonic ideals” and repeatedly refers to design heuristics or compromises involving more than just CAP. Not anyone I’ve seen involved in this discussion. The most extreme view I’ve seen is the complete rejection of CAP by those who just can’t let go of consistency and transactions. Those are easy models, it’s not hard to understand their appeal, but sometimes they’re just not appropriate for the problem at hand. Just like simpler Newtonian physics had to yield to the “weird” models proposed by Planck or Einstein, computing has moved into the “relativistic” world of Lamport or Lynch. It’s a world where there is no such thing as absolute time in any system with more than one physical clock, and where the same event for all practical purposes occurs at different times in different places (even without network partitions). The only concepts of time that matter in such systems are before/after and causality, but not duration. Because of this, node speed really doesn’t matter except to the extent that it affects the number of nodes you need to perform a task. As Stonebraker puts it,

Next generation DBMS technologies, such as VoltDB, have been shown to run around 50X the speed of conventional SQL engines. Thus, if you need 200 nodes to support a specific SQL application, then VoltDB can probably do the same application on 4 nodes. The probability of a failure on 200 nodes is wildly different than the probability of failure on four nodes.

What he fails to mention is that VoltDB gains that speed mostly by being a memory-based system, with all of the data-protection and capacity limitations that implies. If you need 200 nodes to support a specific SQL application, then you might still need 200 nodes not because of performance but because of capacity, so VoltDB won’t do the same job on 4 nodes. That’s exactly the kind of “trap the unwary” omission I was talking about.

He’s right on the last point, though: the probability of failure of 200 nodes really is wildly different than the same probability on 4 – exactly the point Coda makes, even doing one better by providing the actual formula, in his supposedly “misrepresentative” article. However, it’s worth examining the causes of node failures. Again, Stonebraker’s take.

The following important sources of outages are not considered in the CAP theorem.

Bohrbugs. These are repeatable DBMS errors that cause the DBMS to crash…

Application errors. The application inadvertently updates (all copies) of the data base…

Human error. A human types the database equivalent of RM * and causes a global outage…

Reprovisioning.

Back here in reality, most Bohrbugs will cause a single node to crash, and the very relativity that makes distributed systems so challenging also makes it very unlikely that other nodes will experience the exact same sequence of events that triggers the failure. Other than overload, bugs that cause such “contagion” to take down the entire system are very rare. That’s why they’re newsworthy. You never see anything about twenty servers at Google or Yahoo failing, because that happens every day and because the people who designed those systems understand how to deal with it. More about that in a moment.

Going down the list, of course CAP doesn’t address application or human errors. Neither does Stonebraker’s approach. Neither can, because neither can control how applications or humans behave. Application errors have to be fixed in the applications, and human errors have to be fixed at a higher level too – e.g. by using automation to minimize the need for human intervention. It’s not worth talking about the cases where no tradeoffs are possible. What do you “trade off” to make human error disappear? Citing these kinds of errors as shortcomings of CAP, without noting their more general intractability, is just another dirty trick. As for reprovisioning as a “stop the world” operation, Benjamin Black and others have already pointed out that it’s simply not so for them . . . and I’ll add that it need not be so even in a more consistency-oriented world. In any system that can survive a failure of some nodes, those nodes can be upgraded while they’re offline but the rest of the system keeps running. The fact that some systems don’t have that property is merely a deficiency in their implementation, not a commentary on CAP.

What I find most misguided about Stonebraker’s article, though, is this.

In my experience, network partitions do not happen often. Specifically, they occur less frequently than the sum of bohrbugs, application errors, human errors and reprovisioning events. So it doesn’t much matter what you do when confronted with network partitions. Surviving them will not “move the needle” on availability because higher frequency events will cause global outages. Hence, you are giving up something (consistency) and getting nothing in return.

So, because network partitions occur less than some other kind of error, we shouldn’t worry about them? Because more people die in cars than in planes, we shouldn’t try to make planes safer? Also, notice how he says that network partitions are rare in his experience. His experience may be vast, but much of it is irrelevant because the scale and characteristics of networks nowadays are unlike those of even five years ago. People with more recent experience at higher scale seem to believe that network partitions are an important issue, and claiming that partitions are rare in (increasingly common) multi-datacenter environments is just ridiculous. Based on all this plus my own experience, I think dealing with network partitions does “move the needle” on availability and is hardly “nothing in return” at all. Sure, being always prepared for a partition carries a cost, but so does the alternative and that’s the whole point of CAP.

Remember how I exempted overload from my comment about systemic failure? I did that because the #1 cause of system overload is parts of the system being too dependent on another. Sooner or later, even in the best-designed system, some node somewhere is going to become overloaded. The more nodes you have waiting synchronously for each others’ responses, as they must when the system is built around consistency or consensus, the more likely it becomes that the local bottleneck will turn into a global traffic jam. Forcing non-quorum nodes down to preserve consistency among the rest – the other part of the traditional approach that Stonebraker clings to – only makes this even more likely because it increases load on the survivors. That’s the other part of why you don’t hear about a few nodes going down at Facebook or Amazon. Their engineers know that consistency has a cost too. Consistency means coupling, and coupling is bad for availability, and availability matters more to some people than consistency.

The conclusion, then, is much as it was before. Partitions are real, and significant, so we’re left with a tradeoff between consistency and availability. Real engineers make that tradeoff in various ways. Other people try to deny that the tradeoff is necessary, or that any choice other than their own might be valid, and they make up all manner of counter-factual reasons why. Which would you rather be?

Oct
14
Luwak

Basho has announced Luwak, an Erlang library for storing large files in Riak. The original code was contributed by Cliff Moon (@moonpolysoft) so I’m guessing that the slightly scatological name comes from him. I chatted with Cliff on IRC a bit. I also exchanged some email with Bryan Fink (@hobbyist) who wrote the HTTP interface and seems to be the current maintainer at Basho. Many thanks to both of them for taking the time to educate me. What follows might come across as criticism, but I don’t mean it as such. Most of it comes from my background as a filesystem developer, which is most assuredly not the best perspective from which to view Luwak, but it’s the perspective I have. Considered relative to Luwak’s goals and to the stage of its development, most of these apparent criticisms are weak or invalid, even when I managed to fight through my poor knowledge of Erlang to understand what the code’s doing. I can’t stress enough that I think Luwak is cool, and I wouldn’t have spent even as much time as I already have on it otherwise.

The first thing that strikes me about Luwak is that it’s all about what’s inside files and there’s nothing about managing namespaces – no directories, no renaming, no attributes as we filesystem types would expect, etc. That makes perfect sense, since Riak already has plenty of ways to index and connect Luwak files. Who needs directories when you have so many other ways to do the same things? Bryan even points out that “object” might be more accurate than “file” because it doesn’t carry the weight of expectations that Luwak was never intended to meet. This does mean that an application developer accustomed to arranging files into hierarchies will have to come up with their own way of mapping those semantics onto what Luwak provides, and maybe it would be nicer if that mapping were done in common code, but it’s not really that big a deal.

The structure within a file is of blocks arranged into a Merkle tree. The Merkle-tree approach is an interesting one. In the case of rewriting an entire file in which little has changed, it allows the update to be done with very little data transfer. I’m not sure it helps all that much in the case of writing a new file, or rewriting only part of a file, though. It makes me wonder whether the “atomic non-extending write within a single allocated block” optimization I mentioned here would apply to Luwak. The Merkle approach is also related to another interesting feature which isn’t mentioned in the README but does warrant a comment in luwak_io.erl

%% The write will start at the offset specified by
%% Start and overwrite anything at that position with the
%% contents of Data. Writes starting beyond the end of the file
%% will occur at the end of the file. Luwak does not allow for
%% gaps in a file.

I can totally see how this makes the design simpler. It avoids a whole lot of grunge like populating nodes with “holes” instead of pointers to real data, and dealing with reads in the holes, and so on. The part about writes starting beyond the end occurring at the end worries me, though. If an application were to write out of order – few do, but something like BitTorrent comes to mind – the result would be a mangled mess. If gaps aren’t allowed that’s fine, but it would seem safer to reject them outright than to risk rearranging them. I also don’t see any mention of a true append operation, which would imply that appending is a potentially racy process of finding the current EOF and then writing to that offset. What if something else extended the file in between?

Speaking of concurrency, the general approach in Luwak is similar to that in VoldFS and elsewhere: do all writes (including internal data structures) into new space, then write a new root which points to the new bits. In VoldFS this final write is into the inode for data operations or into the root directory for namespace operations, and is done very carefully with a conditional update so that conflicting writes are detected and retried – effectively serialized – instead of taking partial effect. In Luwak the “write into new space” rule does seem to be followed, but not the conditional-update part. That means two simultaneous writes could end up making separate copies of the same node in a common ancestor, and one write could be lost even though there was no actual overlap. As near as my weak Erlang skills can determine, simultaneous updates might even stomp on each others’ ancestor lists, so reconciliation at that level wouldn’t be possible either. Now, don’t get me wrong. It’s entirely reasonable to say that Luwak isn’t intended to handle that kind of concurrent-access regime and that if it had been then it would have been implemented a whole different way. I’m just saying that it’s an area where it might be interesting to experiment some more and see if at least occasional/accidental sharing might be handled more gracefully.

Since I mentioned ancestor lists, I should also point out that they seem to include all previous versions. Similarly, and again according to Cliff, there’s no garbage collection of no-longer-used data blocks. Again that’s totally reasonable for such a young project; there’s no such garbage collection in VoldFS either. Since data blocks are addressed by content hash, the problem might even be a bit more complex, and of course one should never pass up an opportunity to remind people of Valerie Aurora’s excellent HotOS 2003 paper on the dangers of compare-by-hash.

That’s all I can think of right now. All quibbles and disclaimers aside, I think the most important thing is that more people are working on ways to store large objects in some of the modern distributed data stores. Even if we all come up with different semantics and different approaches, that’s definitely a good thing. Progress is messy that way, and thanks to everyone involved with Luwak for contributing to that progress.

Coda Hale’s post on the CAP theorem seems to have set of a flurry of Twitter activity. Justin Sheehy mentioned a talk he recently gave, Henry Robinson mentioned a post he wrote a while ago, Sergio Bossa and Edward Ribeiro have joined in, and so on. There’s a bit of a competitive aspect to it, everybody – most definitely including me – trying to claim the first/last/best word on the subject, but it’s all good.

Edward Ribeiro also points out that Daniel Abadi doesn’t seem to be going along with the consensus I mentioned earlier. I decided to go with the humorous response.

@edward_ribeiro Yeah, @daniel_abadi didn’t get the memo. Updates haven’t propagated everywhere yet. ;)

More seriously, though, there will always be holdouts. There will always be a few who are too ignorant to understand what Brewer was saying. There will always be some whose business interests or academic reputations are too dependent on a contrary view to admit some of the implications. Lastly, of course, there will always be trolls. Among people who actually try to do useful things with distributed systems, though, the consensus seems pretty broad. There are still differences of opinion about some of the minor points, such as whether CA systems are literally impossible or just highly undesirable. I’m in the highly undesirable group, myself. ;) Seriously, I’ve worked on systems that I’d characterize as CA, and their failure modes when faced with the inevitable partition have been more gruesome than total shutdown would have been. I can go into more detail if anyone wants, in a separate post. None of these differences have any practical import, though. They’re fun to argue about, preferably over beers, but regardless of whether people think CA is impossible or impractical nobody in their right mind is going to recommend them in most cases.

The last point is whether CAP really boils down to “two out of three” or not. Of course not, even though I’ve probably said that myself a couple of times. The reason is merely pedagogical. It’s a pretty good approximation, much like teaching Newtonian physics or ideal gases in chemistry. You have to get people to understand the basic shape of things before you start talking about the exceptions and special cases, and “two out of three” is a good approximation. Sure, you can trade off just a little of one for a little of another instead of purely either/or, but only after you thoroughly understand and appreciate why the simpler form doesn’t suffice. The last thing we need is people with learner’s permits trying to build exotic race cars. They just give the doubters and trolls more ammunition with which to suppress innovation.

Coda Hale has written a post entitled You Can’t Sacrifice Partition Tolerance. If you’ve already read my articles about the CAP theorem, or Dan Weinreb’s, or Julian Browne’s, every point Coda makes up to the mention of “harvest” and “yield” should seem very familiar, but even if the post contained only that it’s well worth recommending and Coda does bring a certain panache even to well-trodden ground. I particularly liked his way of poking much-deserved fun at the too-often-heard argument that error responses somehow preserve availability.

A 500 The Bees They’re In My Eyes response does not count as an actual response any more than a network timeout does. A response contains the results of the requested work.

It has been very interesting to watch all the CAP discussions unfold. I was far from the first to write about it; at this point I’m also far from the last. It seems to me that there is a consensus emerging. Even if Gilbert and Lynch only formally proved a narrower version of Brewer’s original conjecture, that conjecture and the tradeoffs it implies are still alive and well and highly relevant to the design of real working systems that serve real business needs. It’s also about time the rest of Brewer’s keynote got some attention. Thanks, Coda.

Several people on Twitter pointed to Kenn Ejima’s Thoughts on Redis, calling it a “good read” and so on. I’m going to take a contrarian view. The article does contain some insight, but also a whole lot of misunderstanding and vitriol. The first few paragraphs, at least, read like something Michael Stonebraker or Dennis Forbes might have written. OK, maybe that was a bit too harsh, but “anti-SQL fanboys” and “nothing really inherent in the NoSQL technologies” are clearly cries for attention so I’ll give Kenn some. I’m going to use the email/Usenet second-person style just to avoid jarring inconsistency between “Kenn says” in some places and “[generic] you should” in others, but I hope people will realize that it’s still meant to be an open conversation.

the relational algebra has a lot more mathematical implications in practice than the CAP theorem

Perhaps so, but does that make the CAP theorem invalid or inapplicable? Of course not. It’s not either/or. You don’t have to understand just one or another. If you’re building a modern database system, you need to understand both. Dismissing either one out of hand is the mark of an immature developer.

I first looked at Cassandra and MongoDB, and got an impression that they were already over-featured enough to obfuscate the true focus on what kind of problems they were trying to solve.

Cassandra, MongoDB, and Redis? That’s a very small sample (though a very fashionable one) and there’s a danger in extrapolating from small samples. That’s probably where some of the later ridiculous statements about “most NoSQL databases” comes from. If the only two alternatives you looked at seemed too complex, then first that’s because you don’t even understand the problems they’re trying to solve and second you need to enlarge your sample. If CouchDB and Riak are also too complex for you to understand, try looking at Voldemort or Kumofs or Membase which have simpler data models.

First, scaling horizontally has little to do with the database engine itself – creating a transparent, consistent hash function is the easiest part. The hard part is choosing a good namespace for keys – you still need to organize keys in some ways, and from time to time, you need to “migrate” keys on refactoring. And when you are set with a solid naming structure, you are able to choose whatever database, including RDBMS, by redirecing (or proxying) requests based on the same partitioning logic.

Consistent hashing is indeed the easy part. The hard part is managing replicas, ensuring that they remain available even during (inevitable) partitions and reach a consistent state after, and that the whole is done efficiently. That’s what the “cache invalidation” Hard Thing is really about, but it extends beyond transient caches to permanent replicas as well. Automatic rebalancing and proxying and such are part and parcel of what those “too complicated” options do for you, because doing those things in ad hoc ways becomes completely untenable for large long-lived systems.

Second, by scaling horizontally, you only get performance gain in O(N), at the cost of decreased MTBF. If you want to double the memory, you need to double the machines in the cluster. In the computer science terminology, an O(N) algorithm is considered “naive”, and in the computer security terminology, it even has a name – “brute force”.

An algorithm is only “naive” or “brute force” if the problem itself is known to have lower complexity than the solution, but that’s not even the worst thing about your appeal to computational complexity. Big-O notation is normally used to plot number of operations vs. number of entities, either to characterize latency (if the operations are serial) or throughput (if they’re parallel). O(N) behavior is only bad in that context – linear speedup is actually a very good result when looking at aggregate performance – and none of the systems you’re criticizing have that kind of O(N) behavior.

As for increased MTBF, that’s also misleading. Yes, the mean time between component failure does increase, but thanks to replication the mean time to system failure (i.e. data loss) in these systems is much lower than any single system. If you’re worried about failures, you need to worry about both durability and replication, and systems that have both shouldn’t be dismissed as “over-featured” or “obfuscated” compared to a system that lacks the second. That complexity exists for a reason, even if you don’t understand it.

I don’t think it’s reasonable to always expect exponential increase in the capital fund – I think adding 1000 servers every week should be considered as a necessary evil, not as a goal.

If you need extra capacity, you need to get it somehow. You can scale up or scale out. Having implemented both kinds of systems, I can say that they’re based on much of the same technology, except that in the scale-up case the connections and algorithms are often proprietary so there’s a huge price premium. Where is the system that can even scale up to 1000x the power of a modern commodity-based 1U server? They exist, I’ve actually worked on them, but the pricing makes them economically feasible only for special purposes which require a completely different CPU/memory/interconnect and performance/reliability balance than exists outside national labs and large universities. For everyone else, scale-out gets you exactly the same or superior (for your purposes) functionality far more economically.

It’s not horizontal scalability at all that interested me in NoSQL.

Here, after several paragraphs of pointless bashing, we get to the real nub of the matter: you don’t care about scalability, thus you’ve never invested the time necessary to understand it, and so you play “sour grapes” by dismissing it as a goal.

By saying “write to disk” it really means: fork a child process for background processing, serialize all data on memory, write it to a temporary file in the background and rename the file atomically to the real one upon finish. Even though the overhead of fork is zero in theory when the OS supports copy-on-write, you still need to turn on the overcommit_memory setting if you want to deal with a dataset more than 1/2 of RAM, which contradicts our habit to battle against OOM killer as a database administrator.

It’s a good habit, and one that should not be broken in this case. If all you’re using this system for is “global variables on steroids” then half of RAM should be more than enough. Also, you’ll probably run out of capacity to handle the requests to all of those global variables before you run out of RAM, and you damn well shouldn’t have a single non-replicated non-sharded system as a SPOF for any application that big. The half-of-RAM thing is only an implementation artifact anyway. Later on you point out that Redis snapshot files are only about the tenth the size of the data in memory. If you’re willing to sacrifice some disk-space efficiency for the sake of memory efficiency, you could write out the snapshot incrementally and only compress/dedup within a chunk instead of globally. With a different snapshot implementation you should be able to use 90% or more of your memory for live data, and basing a design critique on such implementation artifacts is invalid.

First, with AOF, Redis has introduced the possible bottleneck that RDBMS have been suffering from.

Yes, and that bottleneck will always be there for any non-distributed store, no matter how well it’s implemented. That, along with the need for redundancy to ensure availability of all that data, is why those “too complicated” alternatives exist. Are you starting to see, by having these needs placed into the context of your own comments, why I consider your dismissal of them premature?

Finally, I’m going to formulate the 1:10 problem – that is, the ability or inability to handle 10GB of dataset with 1GB of memory.

Even 10GB of memory is nothing nowadays, when even commodity systems have ten times as much and terabyte systems are likely within a year. That might seem like an irrelevant point, that you can just bump the numbers up and have the ensuing argument remain valid, except for what I said above about how big your “global variables on steroids” system needs to be. Even the very largest application shouldn’t need 100GB worth of high-throughput high-semantic-level shared global variables. That’s bad not only from a system-design standpoint, but also from a basic-software-engineering “don’t rely on shared data so damn much” standpoint as well. You need a backing store for durability, sure, but if you need it for capacity in this scenario then You’re Doing It Wrong and you’ll fail anyway because you’ll run out of access to that capacity before you run out of capacity itself.

Usually it is suggested to avoid reinventing yet another virtual memory layer at the application level, but Redis has already crossed the line.

It’s a good line. Reinventing virtual memory uniformly a bad idea, leading to leaky abstractions and unmanaged contention for the resources that both the I/O and VM systems need. Been there, done that, again. Step back from that line, already.

For people who don’t really grok what’s been said in this post (maybe because it was just too long to read), my recommended setup is: “Use Redis for small datasets that don’t grow fast (stay far less than 1GB). Have at least 2x memory than the dataset. Use default snapshotting and disable AOF.”

OK, enough. You get the point. I’d rephrase above as “Use Redis for small datasets (less than 50GB this year) that don’t need to be highly available, have memory at least 2x your actual dataset (until the snapshot implementation improves), use frequent snapshotting or AOF (depending on your need for performance vs. durability – not both) and always avoid overcommit.” I also have nothing against Redis, it’s a fine tool for what it does, but I think its durability story is a bit confused and its reinvented VM can only serve a need that it’s not good for anyway. As always, the real answer is to use multiple data stores to serve multiple needs, with careful consideration of the tradeoffs each represents.