October Tweets and Links

A little early, but it’s been a good month.

  • There are only two problems with distributed counters. Or maybe three.
  • Problems with .ly domains? Less money for Libyan government, some URL shorteners die. And the downside is . . . ?
  • Self-driving cars, offshore wind farms, embryonic stem cell treatments – all on one glance at the news. What times we live in.
  • Assembly Instructions from Hell
  • Sauron to bid for Tea Party leadership. It would be an improvement.
  • Caching separate from the DB (any type) is not the enterprise version of a DB which is inherently distributed. It’s the buggy version.
  • The last frame is so funny because so many really do think such “answers” are useful. Not Invented Here
  • Matchstick Tirith
  • Simply awesome microphotography.
  • Lord of the Rings + Rudolph the Red Nosed Reindeer
  • Sears for zombies. (“Afterlife. Well spent.”)
  • Anybody who refers to an “internet cable” shouldn’t do distributed programming.
    [Just for you, PeteZ: the person was referring to a physical object, not just internet service from a cable company]
  • App devs shouldn’t create infrastructure . . . not because their time is better spent, but because they suck at it.

When Partitions Attack

Stu Charlton has written an interesting post called Confused CAP Arguments. The tone is mostly civil and reasonable, though it’s a bit disappointing to see some of the negative stereotypes (“cocktails…newly-graduated devs”) and strawmen (“only using CAP”) repeated yet again. Oh, well. The issue of “is CA possible” seems to be rather central to Stu’s post, and it also came up in my last post on the subject, so I’ll delve into that aspect a bit and even share some of the war stories that I promised to.

First, though, a bit of background. Back in 1993, before either Stu or the current generation of anti-RDBMS hipsters entered the workforce, I was working on HACMP/6000 at Clam. At first I worked on the lock manager, which can be thought of as a key/value store where the values are very small but have complex semantics. A lot of the hashing and replication techniques it used would be instantly recognizable to NoSQL folks. Later, I was responsible for the network routing and monitoring infrastructure. The “DGSP” (Die, Gravy Sucking Pig) message that we used to enforce quorum was my invention. I became pretty intimately familiar with all of the ways such systems could fail. In particular, I became familiar with how even systems with redundant network hardware and cutting-edge network software could still experience network partitions.

To address one of Stu’s points, about treating network failures as the only kind, one observation I made pretty early on is that for all practical purposes they are the only kind. It’s not hard to promote any local failure to a node failure, and it greatly simplifies the rest of the system if you do.

Anyway, the real point is that I might not have been a database guy myself, but I directly supported them (including time on-site at Oracle HQ) so I have very little patience for database guys trying to pretend that only they have the proper context to understand the application or business value of various tradeoffs in this area. IBM wouldn’t have sold tens of thousands of HACMP licenses if we hadn’t thoroughly understood those same issues. So, back to the issue at hand. Here’s one of the less coherent parts of Stu’s post.

What is Partition Intolerance?

This is never well defined. Like lactose intolerance, the results, can be, err, unpredictable. It seems to imply one of: complete system unavailability, data inconsistency, or both.

First, data inconsistency is a sacrifice of C. That’s pretty clearly defined. Gilbert and Lynch defined P thus: the system continues to operate despite arbitrary message loss. That’s still a bit vague both in terms of “the system” (all or part) and “continues to operate” (as before or degraded somehow). I tried to provide a clearer interpretation a while back, distinguishing P from A by contrasting the effect of a partition on requests vs. nodes. In short, a non-A system fails some nodes, while a non-P system fails some requests. Some might disagree with that definition – several did in the comments to that article – but I think it’s clear enough to make “never well defined” untrue.

What Stu does get absolutely right is that the effect of a partition in a partition-intolerant system can be unpredictable. This is why I don’t like CA systems much. If you’ve partitioned your data somehow and some nodes become unreachable, then some requests will be unable to reach the data (or acquire the locks) that they need, and a CA system must either block or fail such requests. If it could satisfy them by eliminating possible sources of conflict, e.g. by forcing nodes down and breaking their locks, then it would be a CP system (with reduced capacity). If it could satisfy them by allowing the conflict and hoping to perform some sort of resolution later, it would be an AP system (with reduced consistency). As a CA system, preserving both capacity and consistency, block or fail are the only options. Failing requests is the better option, but is unfortunately rarely chosen because people don’t usually seem to choose CA. Instead, they fall into it because they try to deny that partitions could occur. They have unbounded faith in the efficacy of redundant networks. Anybody who has seen the dark underbelly of network infrastructure knows how misplaced that faith can be.

In “CAP-naive” systems, the most common behavior I used to see was that some requests would succeed and others would block indefinitely waiting for some resource they needed. It’s bad enough that hash-based resource distribution would cause these blockages to be completely unpredictable, but there’s an even bigger iceberg to worry about: cascading failures. Database requests often need many resources, and if they block unexpectedly while acquiring the fifth then they block while still holding the other four. That would often cause even more requests to block waiting for those four, and so on until practically every resource is unavailable and the whole system locks up. This isn’t theoretical. I’ve seen it happen many, many times. I’ve seen it happen almost instantaneously when a critical resource becomes unreachable in a busy system, and I’ve seen it unfold slowly over hours or even days when a less popular resource did so in a lightly loaded system. It’s not some exceptional exotic behavior in CA systems. It’s the norm when partitions occur – and they do.
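
To make the cascade concrete, here’s a toy simulation – my own sketch, not modeled on any particular product. Requests acquire several hash-distributed resources (so there’s no global acquisition order), one resource’s home node is unreachable, and any request that blocks keeps holding whatever it already acquired:

```python
import random

random.seed(42)

NUM_RESOURCES = 100
DEAD = 13          # the one resource whose home node became unreachable

held = {}          # resource -> the blocked request stuck holding it
blocked = completed = 0

for req in range(10_000):
    # Hash-based distribution: no global ordering of acquisitions.
    needed = random.sample(range(NUM_RESOURCES), 5)
    acquired = []
    stuck = False
    for r in needed:
        if r == DEAD or r in held:
            # Request blocks here, still holding everything so far.
            for h in acquired:
                held[h] = req
            stuck = True
            break
        acquired.append(r)
    if stuck:
        blocked += 1
    else:
        completed += 1  # a completed request releases its resources

print(f"blocked: {blocked}, completed: {completed}")
print(f"resources wedged: {len(held)} of {NUM_RESOURCES}")
```

Run it and nearly every resource ends up wedged behind a blocked request, even though only one of the hundred was ever unreachable. That’s the lockup, in fifteen lines.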

OK, so this can happen. Is it worth designing around? Implementing redundant networks can reduce the frequency of partitions to very low levels. That means some additional cost for network hardware and software, but without replication to ensure reachability (on top of potentially hardware-assisted replication to ensure durability) the savings on node and storage hardware to reach fixed capacity/performance goals might be even greater. Let’s just eat the cost on those rare occasions when the unthinkable happens. Doesn’t that make more economic sense?

The first argument against that is that partitions even in redundant networks are more common than most people think, so the economic cost of such failures is likewise greater. I think the more compelling argument, though, is that a better tradeoff is practically always available. Any CA system which degrades messily into zero capacity when a partition occurs can be easily replaced with a CP system which degrades cleanly to half capacity under the same conditions. The technology for quorum-enforcing CP systems is mature and readily accessible, and the configurations and running costs are exactly the same, so there’s no good reason to implement a CA system instead. Every CA system you see is the result not of diligent design but of (often willful) ignorance. CA systems can exist; they just shouldn’t.
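
Part of what makes the CP alternative so compelling is how little machinery quorum enforcement actually requires. A sketch of the core rule (obviously not HACMP’s real code):

```python
def has_quorum(reachable, cluster_size):
    """A partition side may keep serving only if it can reach a strict
    majority of the configured cluster."""
    return len(reachable) > cluster_size // 2

# A five-node cluster splits 3/2: the majority side keeps serving at
# reduced capacity, while the minority side shuts itself down rather
# than serve answers it can no longer keep consistent.
majority, minority = {"a", "b", "c"}, {"d", "e"}
print(has_quorum(majority, 5), has_quorum(minority, 5))  # True False

# An even 2/2 split in a four-node cluster leaves neither side with
# quorum, which is why odd cluster sizes (or tie-breakers) are used.
print(has_quorum({"a", "b"}, 4))  # False
```

Everything hard about CP systems is in the membership and failover machinery around this test, and that machinery has been mature for years.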

Someone is Wrong on the Internet

There are many ways to be wrong in a technical discussion, but the ways usually fall into one of two categories – ways that can easily be corrected, and ways that cannot. In the first case, somebody else can provide a counterexample or explanation that rests on only a few widely agreed-upon axioms, so that even those not particularly familiar with the subject matter can see that the original claim was wrong. In the second case, showing how or why a claim is wrong can require making subtle distinctions or pointing out less well-known facts. In the process, the whole discussion can often be drawn into a morass of contested claims leading to no clear conclusion. The first type of wrongness often isn’t worth addressing at all – the author and/or others will quickly spot the error without any assistance – or can be dealt with quickly. The second kind of wrongness can persist almost indefinitely, but still needs to be addressed because it can trap the unwary and lead them to waste a lot of time pursuing bad ideas when they could have been pursuing good ones. As futile as it may be, I’m going to address one particularly pernicious example regarding a topic I’ve written about many times – the infamous CAP theorem.

xkcd #386

In his latest missive, Michael Stonebraker claims to be addressing some mis-perceptions of his views on Brewer’s CAP theorem. I would say many of those “mis-perceptions” are entirely accurate, but it’s no surprise that he’d want to back-pedal a bit. Unfortunately, it’s not very effective to fight mischaracterization of one’s views by tossing out mischaracterizations of others. For example (near the end):

In summary, appealing to the CAP theorem exclusively for engineering guidance is, in my opinion, inappropriate.

I’d like to know who’s appealing to the CAP theorem exclusively. Not me. Not Coda Hale, who refers to the three properties of CAP as “Platonic ideals” and repeatedly refers to design heuristics or compromises involving more than just CAP. Not anyone I’ve seen involved in this discussion. The most extreme view I’ve seen is the complete rejection of CAP by those who just can’t let go of consistency and transactions. Those are easy models, and it’s not hard to understand their appeal, but sometimes they’re just not appropriate for the problem at hand. Just like simpler Newtonian physics had to yield to the “weird” models proposed by Planck or Einstein, computing has moved into the “relativistic” world of Lamport or Lynch. It’s a world where there is no such thing as absolute time in any system with more than one physical clock, and where the same event for all practical purposes occurs at different times in different places (even without network partitions). The only concepts of time that matter in such systems are before/after and causality, not duration. Because of this, node speed really doesn’t matter except to the extent that it affects the number of nodes you need to perform a task. As Stonebraker puts it,
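
That “relativistic” world isn’t just philosophy, either. Lamport’s logical clocks capture exactly the before/after-and-causality notion of time, with no concept of duration anywhere. A minimal sketch:

```python
class LamportClock:
    """Logical clock (Lamport, 1978): counts events, not seconds."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # The timestamp travels with the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Receiving happens "after" both the send and our own past.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()             # a's clock: 1
b.local_event()               # b's clock: 1, concurrent with the send
t_recv = b.receive(t_send)    # b's clock: max(1, 1) + 1 = 2
assert t_recv > t_send        # causal order preserved; duration irrelevant
```

Nothing in there cares how fast any node is – which is exactly why node speed only matters insofar as it changes how many nodes you need.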

Next generation DBMS technologies, such as VoltDB, have been shown to run around 50X the speed of conventional SQL engines. Thus, if you need 200 nodes to support a specific SQL application, then VoltDB can probably do the same application on 4 nodes. The probability of a failure on 200 nodes is wildly different than the probability of failure on four nodes.

What he fails to mention is that VoltDB gains that speed mostly by being a memory-based system, with all of the data-protection and capacity limitations that implies. If you need 200 nodes to support a specific SQL application, then you might still need 200 nodes not because of performance but because of capacity, so VoltDB won’t do the same job on 4 nodes. That’s exactly the kind of “trap the unwary” omission I was talking about.

He’s right on the last point, though: the probability of failure of 200 nodes really is wildly different than the same probability on 4 – exactly the point Coda makes, even doing one better by providing the actual formula, in his supposedly “misrepresentative” article. However, it’s worth examining the causes of node failures. Again, Stonebraker’s take.
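
That formula is essentially the probability that at least one of n independent nodes is down at any given moment. With a per-node availability of 99.9% – a figure I’m picking purely for illustration:

```python
# Assumed per-node availability, chosen purely for illustration.
p = 0.999

def p_any_failed(n):
    """Probability that at least one of n independent nodes is down."""
    return 1 - p ** n

print(f"  4 nodes: {p_any_failed(4):.2%}")    # roughly 0.4%
print(f"200 nodes: {p_any_failed(200):.2%}")  # roughly 18%
```

Nobody disputes that 200 nodes see failures far more often than 4. The disagreement is over whether you respond to that fact by pretending failures away or by designing for them.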

The following important sources of outages are not considered in the CAP theorem.

Bohrbugs. These are repeatable DBMS errors that cause the DBMS to crash…

Application errors. The application inadvertently updates (all copies) of the data base…

Human error. A human types the database equivalent of RM * and causes a global outage…


Back here in reality, most Bohrbugs will cause a single node to crash, and the very relativity that makes distributed systems so challenging also makes it very unlikely that other nodes will experience the exact same sequence of events that triggers the failure. Other than overload, bugs that cause such “contagion” to take down the entire system are very rare. That’s why they’re newsworthy. You never see anything about twenty servers at Google or Yahoo failing, because that happens every day and because the people who designed those systems understand how to deal with it. More about that in a moment.

Going down the list, of course CAP doesn’t address application or human errors. Neither does Stonebraker’s approach. Neither can, because neither can control how applications or humans behave. Application errors have to be fixed in the applications, and human errors have to be fixed at a higher level too – e.g. by using automation to minimize the need for human intervention. It’s not worth talking about the cases where no tradeoffs are possible. What do you “trade off” to make human error disappear? Citing these kinds of errors as shortcomings of CAP, without noting their more general intractability, is just another dirty trick. As for reprovisioning as a “stop the world” operation, Benjamin Black and others have already pointed out that it’s simply not so for them . . . and I’ll add that it need not be so even in a more consistency-oriented world. In any system that can survive a failure of some nodes, those nodes can be upgraded while they’re offline but the rest of the system keeps running. The fact that some systems don’t have that property is merely a deficiency in their implementation, not a commentary on CAP.

What I find most misguided about Stonebraker’s article, though, is this.

In my experience, network partitions do not happen often. Specifically, they occur less frequently than the sum of bohrbugs, application errors, human errors and reprovisioning events. So it doesn’t much matter what you do when confronted with network partitions. Surviving them will not “move the needle” on availability because higher frequency events will cause global outages. Hence, you are giving up something (consistency) and getting nothing in return.

So, because network partitions occur less than some other kind of error, we shouldn’t worry about them? Because more people die in cars than in planes, we shouldn’t try to make planes safer? Also, notice how he says that network partitions are rare in his experience. His experience may be vast, but much of it is irrelevant because the scale and characteristics of networks nowadays are unlike those of even five years ago. People with more recent experience at higher scale seem to believe that network partitions are an important issue, and claiming that partitions are rare in (increasingly common) multi-datacenter environments is just ridiculous. Based on all this plus my own experience, I think dealing with network partitions does “move the needle” on availability and is hardly “nothing in return” at all. Sure, being always prepared for a partition carries a cost, but so does the alternative and that’s the whole point of CAP.

Remember how I exempted overload from my comment about systemic failure? I did that because the #1 cause of system overload is parts of the system being too dependent on one another. Sooner or later, even in the best-designed system, some node somewhere is going to become overloaded. The more nodes you have waiting synchronously for each other’s responses, as they must when the system is built around consistency or consensus, the more likely it becomes that the local bottleneck will turn into a global traffic jam. Forcing non-quorum nodes down to preserve consistency among the rest – the other part of the traditional approach that Stonebraker clings to – only makes this even more likely because it increases load on the survivors. That’s the other part of why you don’t hear about a few nodes going down at Facebook or Amazon. Their engineers know that consistency has a cost too. Consistency means coupling, and coupling is bad for availability, and availability matters more to some people than consistency.

The conclusion, then, is much as it was before. Partitions are real, and significant, so we’re left with a tradeoff between consistency and availability. Real engineers make that tradeoff in various ways. Other people try to deny that the tradeoff is necessary, or that any choice other than their own might be valid, and they make up all manner of counter-factual reasons why. Which would you rather be?

More Cool Pumpkins

Every year at this time, my scary pumpkin post starts getting a lot of hits. This year, I have a couple of pumpkin-carving ideas I’d actually consider implementing myself (though the subject hasn’t come up at home and I won’t be the first to mention it).

Big Mac pumpkin
(from Neatorama)
Death Star pumpkin
(from Fantasy Pumpkins)

Thoughts on Lua

I spent much of my spare time over the weekend playing with the Lua programming language and Kepler web framework. It was amusing, and I wouldn’t mind playing with Lua a bit more some day, but I think I’ve gone about as far with it as I care to for right now. Before I go, I figure I’ll share some thoughts and impressions.

To start with, Lua reminded me a lot of Python. It has the same very visible relationships between tables (dictionaries in Python), objects, and modules. The meta-programming looks very similar too. It’s kind of like a very stripped down, almost primitive, version of Python, though.

The documentation on object orientation or module structure contains not so much descriptions of how those things are done in Lua as examples of how other people have done them in Lua. On the object front, Lua has no established object model; you just use meta-programming to build your own. There are some very small nods (e.g. the colon operator) to object orientation, but it never extends very far. Even operations on built-in types often involve passing them as arguments to library functions instead of calling methods on them. On the module front there is a module function, but using it tends to pollute the global namespace (in this it reminds me of PHP more than Python) so there are multiple suggestions of how else to do the same thing. Opposite to Python, which has a “global” keyword to make things explicitly global, Lua has a “local” keyword to do the exact opposite.

One area where Lua’s drive for simplicity seems to have gone too far is in eschewing “normal” lists or arrays in favor of having tables do that as well – and do it poorly. Having to use pairs(x) for tables which are used as objects/dictionaries and ipairs(x) for tables which are used as arrays/lists is just awkward, and one-based addressing (also used for strings) feels positively COBOLian. Overall, the language seems designed around flexibility and ease of implementation, which are fine goals but a little bit of a shock compared to languages designed more explicitly to maximize programmer productivity.

OK, enough about the language. How about the Kepler framework? I started out trying to use Orbit, but it didn’t take long at all for me to get tired of all the MVC-ish-ness so I mostly just used WSAPI without it. I pretty quickly figured out how to do the kind of request dispatch I wanted on my own, and Cosmo seemed to do fine for templates – although I didn’t exactly stress it much. I barely even noticed the Copas/Coxpcall/Rings parts except for using coroutine.yield in handlers, and Xavante seemed to work just fine as a server.

Then I kind of hit a wall when I tried to figure out how to do PUT of large files. There seem to be several recipes out there for handling POST of files up to 4GB as forms, but neither the documentation nor forum searches nor perusal of the code seemed to show any way to do a chunked upload bigger than that. The obvious ways of reading from wsapi_env.input all yielded poor results, which seemed to have something to do with how WSAPI wraps things in coroutines, but I never quite got to the bottom of it. (This tendency toward too-clever and too-fragile use of meta-programming instead of doing things in a more straightforward way seems quite pervasive in Lua code and had also bugged me in Orbit.)

On a similar subject, Lua seems to have no real native threading, though there are documents describing how to do it (and how you’ll get stuck on locks around the Lua core) in extensions. Coroutines just don’t cut it in a multi-core world, and I couldn’t convince myself that Lua’s implementation even allows for effective use of a single core, which puts a serious dent in my interest level.

In the end, I just don’t feel like Lua’s a good fit for the kind of program I decided to use for this experiment. In a different problem domain the flexibility and the ease of extension/embedding and the awesome performance of LuaJIT might make it a much better choice, but for now I’m debating whether to try JavaScript or Ruby (or maybe Clojure) next.


Looking at Luwak

Basho has announced Luwak, an Erlang library for storing large files in Riak. The original code was contributed by Cliff Moon (@moonpolysoft) so I’m guessing that the slightly scatological name comes from him. I chatted with Cliff on IRC a bit. I also exchanged some email with Bryan Fink (@hobbyist) who wrote the HTTP interface and seems to be the current maintainer at Basho. Many thanks to both of them for taking the time to educate me. What follows might come across as criticism, but I don’t mean it as such. Most of it comes from my background as a filesystem developer, which is most assuredly not the best perspective from which to view Luwak, but it’s the perspective I have. Considered relative to Luwak’s goals and to the stage of its development, most of these apparent criticisms are weak or invalid, even when I managed to fight through my poor knowledge of Erlang to understand what the code’s doing. I can’t stress enough that I think Luwak is cool, and I wouldn’t have spent even as much time as I already have on it otherwise.

The first thing that strikes me about Luwak is that it’s all about what’s inside files and there’s nothing about managing namespaces – no directories, no renaming, no attributes as we filesystem types would expect, etc. That makes perfect sense, since Riak already has plenty of ways to index and connect Luwak files. Who needs directories when you have so many other ways to do the same things? Bryan even points out that “object” might be more accurate than “file” because it doesn’t carry the weight of expectations that Luwak was never intended to meet. This does mean that an application developer accustomed to arranging files into hierarchies will have to come up with their own way of mapping those semantics onto what Luwak provides, and maybe it would be nicer if that mapping were done in common code, but it’s not really that big a deal.

Within a file, the data is stored as blocks arranged into a Merkle tree. The Merkle-tree approach is an interesting one. In the case of rewriting an entire file in which little has changed, it allows the update to be done with very little data transfer. I’m not sure it helps all that much in the case of writing a new file, or rewriting only part of a file, though. It makes me wonder whether the “atomic non-extending write within a single allocated block” optimization I mentioned here would apply to Luwak. The Merkle approach is also related to another interesting feature which isn’t mentioned in the README but does warrant a comment in luwak_io.erl:

%% The write will start at the offset specified by
%% Start and overwrite anything at that position with the
%% contents of Data. Writes starting beyond the end of the file
%% will occur at the end of the file. Luwak does not allow for
%% gaps in a file.

I can totally see how this makes the design simpler. It avoids a whole lot of grunge like populating nodes with “holes” instead of pointers to real data, and dealing with reads in the holes, and so on. The part about writes starting beyond the end occurring at the end worries me, though. If an application were to write out of order – few do, but something like BitTorrent comes to mind – the result would be a mangled mess. If gaps aren’t allowed that’s fine, but it would seem safer to reject them outright than to risk rearranging them. I also don’t see any mention of a true append operation, which would imply that appending is a potentially racy process of finding the current EOF and then writing to that offset. What if something else extended the file in between?
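
To show exactly what worries me, here’s a toy version of that append race, using write-at-offset semantics like those in the comment above. The class and names are mine, purely for illustration – this is not Luwak’s API:

```python
class File:
    """Toy flat file supporting only write-at-offset, with the
    "no gaps" rule described above (hypothetical, not Luwak's API)."""
    def __init__(self):
        self.data = bytearray()

    def size(self):
        return len(self.data)

    def write(self, offset, payload):
        # Writes starting beyond EOF land at EOF ("no gaps" rule).
        offset = min(offset, len(self.data))
        end = offset + len(payload)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

f = File()
f.write(0, b"hello ")

# Two clients "append" by reading EOF first, then writing there.
eof_a = f.size()   # both observe EOF = 6
eof_b = f.size()
f.write(eof_a, b"world")
f.write(eof_b, b"WORLD")   # stomps the first append instead of following it

print(bytes(f.data))  # b'hello WORLD' -- one append silently lost
```

A true append operation would make the find-EOF-and-write step atomic, so the second writer would land after the first instead of on top of it.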

Speaking of concurrency, the general approach in Luwak is similar to that in VoldFS and elsewhere: do all writes (including internal data structures) into new space, then write a new root which points to the new bits. In VoldFS this final write is into the inode for data operations or into the root directory for namespace operations, and is done very carefully with a conditional update so that conflicting writes are detected and retried – effectively serialized – instead of taking partial effect. In Luwak the “write into new space” rule does seem to be followed, but not the conditional-update part. That means two simultaneous writes could end up making separate copies of the same node in a common ancestor, and one write could be lost even though there was no actual overlap. As near as my weak Erlang skills can determine, simultaneous updates might even stomp on each other’s ancestor lists, so reconciliation at that level wouldn’t be possible either. Now, don’t get me wrong. It’s entirely reasonable to say that Luwak isn’t intended to handle that kind of concurrent-access regime and that if it had been then it would have been implemented a whole different way. I’m just saying that it’s an area where it might be interesting to experiment some more and see if at least occasional/accidental sharing might be handled more gracefully.
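
For comparison, the conditional-update discipline looks roughly like this – a sketch of the idea, in Python rather than Erlang, and not either project’s real code:

```python
import threading

class Store:
    """Toy copy-on-write store with a compare-and-swap root pointer,
    in the spirit of the VoldFS approach described above."""
    def __init__(self):
        self.blocks = {0: {}}   # root version -> immutable tree snapshot
        self.root = 0
        self._lock = threading.Lock()

    def read_root(self):
        return self.root

    def commit(self, expected_root, new_tree):
        """Install new_tree only if nobody else committed first."""
        with self._lock:
            if self.root != expected_root:
                return False          # conflict detected: caller retries
            new_root = expected_root + 1
            self.blocks[new_root] = new_tree
            self.root = new_root
            return True

def update(store, key, value):
    while True:                        # optimistic retry loop
        seen = store.read_root()
        tree = dict(store.blocks[seen])  # write into new space
        tree[key] = value
        if store.commit(seen, tree):
            return

s = Store()
update(s, "a", 1)
update(s, "b", 2)
print(s.blocks[s.root])  # {'a': 1, 'b': 2} -- neither update lost
```

If two updaters race, one commit fails and retries against the new root, so nothing is lost; without the conditional check, the second commit would silently discard the first, which is exactly the hazard described above.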

Since I mentioned ancestor lists, I should also point out that they seem to include all previous versions. Similarly, and again according to Cliff, there’s no garbage collection of no-longer-used data blocks. Again that’s totally reasonable for such a young project; there’s no such garbage collection in VoldFS either. Since data blocks are addressed by content hash, the problem might even be a bit more complex, and of course one should never pass up an opportunity to remind people of Valerie Aurora’s excellent HotOS 2003 paper on the dangers of compare-by-hash.

That’s all I can think of right now. All quibbles and disclaimers aside, I think the most important thing is that more people are working on ways to store large objects in some of the modern distributed data stores. Even if we all come up with different semantics and different approaches, that’s definitely a good thing. Progress is messy that way, and thanks to everyone involved with Luwak for contributing to that progress.

Languages/Frameworks for Web Stuff

I’ve been thinking about playing with a new programming language or two, to see if that helps me rediscover some of the fun I used to have when the discovery of new capabilities or ways of doing things was more common than the discovery of new nasty things to work around. The vehicle I’ve chosen for this is sort of a stripped down version of the image-warehouse daemon I’ve been working on at Red Hat, supporting basic bucket/object/attribute operations sort of like S3 but skipping a lot of the more complex query/replication stuff that makes iwhd special. Here are the things I need, besides basic filesystem stuff.

  • Request routing/dispatch. I like the Sinatra model for this, but I’m not tied to it.
  • Templates sufficient to generate simple XML and JSON output (e.g. for bucket listings).
  • Some sort of database interface, but no ORM.
  • A decent parsing library, preferably of the PEG flavor.
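
To be clear about what I mean by the Sinatra model: method-plus-path patterns mapped directly to handler functions. In Python (which I already know, and so won’t be using for the real thing) the core of it might look like this – names are mine, not any framework’s:

```python
import re

routes = []

def route(method, pattern):
    """Register a handler for METHOD and a /bucket/:object style path."""
    regex = re.compile(
        "^" + re.sub(r":(\w+)", r"(?P<\1>[^/]+)", pattern) + "$")
    def wrap(handler):
        routes.append((method, regex, handler))
        return handler
    return wrap

def dispatch(method, path):
    for m, regex, handler in routes:
        match = regex.match(path)
        if m == method and match:
            # Path captures become keyword arguments to the handler.
            return handler(**match.groupdict())
    return "404 not found"

@route("GET", "/:bucket/:obj")
def get_object(bucket, obj):
    return f"object {obj} from bucket {bucket}"

print(dispatch("GET", "/photos/cat.jpg"))  # object cat.jpg from bucket photos
```

Anything that gives me that shape without dragging in an ORM and a pile of MVC scaffolding will do.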

I’m planning to implement the same functionality in three languages, just to compare and contrast lines of code, performance, etc. Since the point is to learn new languages, that rules out the ones (like Python) I already know. Since the point is also to have fun, that rules out anything too enterprise-y or CS-purist-y. I don’t mean to disparage such languages (Erlang or Haskell or Scala might be more worthwhile, and others might even consider them fun), but for this purpose something more “quick and dirty” is called for. So, the candidates are:

  • Ruby, Sinatra, ??? for templates and web server (Thin?)
  • LuaJIT, Kepler
  • Javascript, node.js or ringo.js, ??? for routing and templates

As you can see, I’m looking for suggestions on some parts. Any ideas? I’m not even completely tied to the choices already made above. Just try to remember that I’ll need support for the same database (that’s undecided too) across whatever languages I try, and that I’m trying to have fun so I don’t want anything “heavier” than it really needs to be. Thanks to all who contribute.

    Tax Lies from Mankiw

    My wife, a former math major, freely admits that she’s not particularly good at arithmetic. Knowing group theory and being able to calculate a tip quickly are two different skills. Similarly, knowing economic theory and handling your own money competently are two different things. This is illustrated quite well by Greg Mankiw, who uses some really bad assumptions to claim that his effective marginal income tax rate is 90% . . . and of course it’s all Obama’s fault. Now, I might not be a Harvard economist, but even I know about tax-deferred investments and about half a dozen ways to avoid the estate tax – especially on money saved for a child’s education. I know that very few individuals actually pay their top marginal rate, even fewer corporations do, and that a tax rate on earnings is not the same as a tax on dividends and capital gains. Mankiw pretends to know none of these things, so either he’s not smart enough even to find Harvard on a map or he’s deliberately misrepresenting facts to his readers.

    Behind the mathematical lies, though, lurks an even worse deceit. What applies to Mankiw doesn’t apply to everyone. He may be able to forego $1000 in income out of petulance at what he sees as a too-high tax rate, but anyone at or below median income would have to consider it more seriously even if the tax rates were as high as he claims. Also, economics might not be a zero-sum game but, at an annual growth rate of only a couple of percent it’s damn close and taxes are even closer than income. We’ve all heard about trickle-down and the “bigger pie for everyone” and so on but, Brad DeLong points out, Mankiw also helped shape policies that do a notably poor job of demonstrating any such effect. DeLong also quotes Friedman as saying, “To tax is to spend.” Somebody has to pay for all of the costs Mankiw helped impose on us. If not the rich, then who? If not now, when? If Mankiw pays less in taxes so that his children can have more money – which they’ll hardly need – for college, who else’s children will have less? Almost certainly someone who has benefited less from our system of tax and property and liability and regulatory law than he has. Maybe someone who pays the taxes Harvard doesn’t, or whose taxes in other years were directed into Mankiw’s pocket while he as an adviser to George W. Bush, or who got screwed by one of the policies he promoted.

    Yeah, go take a break, Greg. The fewer hours wealth-destroyers like you work, the better off we’ll all be.

    Don’t worry, I won’t make a habit of writing political posts. It’s been a long time since the last one and it will probably be a long time before the next. I just had to get this off my chest, and it’s my blog. I’ll be returning to the technical content shortly.

    Reactions to Coda’s CAP Post

    Coda Hale’s post on the CAP theorem seems to have set off a flurry of Twitter activity. Justin Sheehy mentioned a talk he recently gave, Henry Robinson mentioned a post he wrote a while ago, Sergio Bossa and Edward Ribeiro have joined in, and so on. There’s a bit of a competitive aspect to it, everybody – most definitely including me – trying to claim the first/last/best word on the subject, but it’s all good.

    Edward Ribeiro also points out that Daniel Abadi doesn’t seem to be going along with the consensus I mentioned earlier. I decided to go with the humorous response.

    @edward_ribeiro Yeah, @daniel_abadi didn’t get the memo. Updates haven’t propagated everywhere yet. ;)

    More seriously, though, there will always be holdouts. There will always be a few who are too ignorant to understand what Brewer was saying. There will always be some whose business interests or academic reputations are too dependent on a contrary view to admit some of the implications. Lastly, of course, there will always be trolls. Among people who actually try to do useful things with distributed systems, though, the consensus seems pretty broad. There are still differences of opinion about some of the minor points, such as whether CA systems are literally impossible or just highly undesirable. I’m in the highly undesirable group, myself. ;) Seriously, I’ve worked on systems that I’d characterize as CA, and their failure modes when faced with the inevitable partition have been more gruesome than total shutdown would have been. I can go into more detail if anyone wants, in a separate post. None of these differences has any practical import, though. They’re fun to argue about, preferably over beers, but regardless of whether people think CA is impossible or merely impractical, nobody in their right mind is going to recommend a CA system for most use cases.

    The last point is whether CAP really boils down to “two out of three” or not. Of course not, even though I’ve probably said as much myself a couple of times. The reason is pedagogical: “two out of three” is a good approximation, much like Newtonian physics or ideal gases in chemistry. You have to get people to understand the basic shape of things before you start talking about the exceptions and special cases. Sure, you can trade off just a little of one property for a little of another instead of making a purely either/or choice, but only after you thoroughly understand and appreciate why the simpler form doesn’t suffice. The last thing we need is people with learner’s permits trying to build exotic race cars. They just give the doubters and trolls more ammunition with which to suppress innovation.
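    The “little of one for a little of another” point can be made concrete with a Dynamo-style quorum system, where N replicas, a read quorum R, and a write quorum W form a dial rather than a switch. This is only an illustrative sketch (the function names are mine, not from any particular system): shrinking R and W buys tolerance of unreachable replicas at the cost of the overlap guarantee.

    ```python
    # Sketch of tunable quorums: consistency vs. availability as a dial.
    # N = replica count, R = read quorum, W = write quorum.

    def quorums_overlap(n: int, r: int, w: int) -> bool:
        """R + W > N means every read quorum intersects every write
        quorum, so a read is guaranteed to see the latest acknowledged
        write."""
        return r + w > n

    def tolerated_failures(n: int, r: int, w: int) -> int:
        """How many replicas can be unreachable while both reads and
        writes still find a quorum among the remaining ones."""
        return n - max(r, w)

    # Slide the dial on a 3-replica system: each step gives up a bit of
    # consistency in exchange for surviving more unreachable replicas.
    for r, w in [(3, 3), (2, 2), (1, 1)]:
        print(f"R={r} W={w}: overlap={quorums_overlap(3, r, w)}, "
              f"tolerates {tolerated_failures(3, r, w)} failure(s)")
    ```

    With R=W=2 on three replicas you keep the overlap guarantee and still ride out one unreachable replica; drop to R=W=1 and you tolerate two, but reads may miss recent writes. That continuum is exactly why “pick two” is a teaching approximation rather than the whole story.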

    Another CAP Article

    Coda Hale has written a post entitled You Can’t Sacrifice Partition Tolerance. If you’ve already read my articles about the CAP theorem, or Dan Weinreb’s, or Julian Browne’s, every point Coda makes up to the mention of “harvest” and “yield” should seem very familiar, but even if the post contained only that it would still be well worth recommending, and Coda brings a certain panache even to well-trodden ground. I particularly liked his way of poking much-deserved fun at the too-often-heard argument that error responses somehow preserve availability.

    A 500 The Bees They’re In My Eyes response does not count as an actual response any more than a network timeout does. A response contains the results of the requested work.
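    Coda’s point is easy to make concrete: from the client’s perspective, availability is measured in successful responses, not in responses of any kind. A minimal sketch (the outcome labels and helper are mine, purely for illustration):

    ```python
    # Client-side availability accounting: an error response counts as a
    # failure, exactly like a timeout, because it carries no result.

    def counts_as_available(outcome: str) -> bool:
        """Only a response containing the requested work counts."""
        return outcome == "success"

    observed = ["success", "http_500", "timeout", "success"]
    availability = sum(counts_as_available(o) for o in observed) / len(observed)
    # The 500 drags availability down just as much as the timeout does.
    ```

    A server that answers every request with a 500 is 100% responsive and 0% available, which is why “we returned an error, so we stayed available” doesn’t survive contact with the definition.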

    It has been very interesting to watch all the CAP discussions unfold. I was far from the first to write about it; at this point I’m also far from the last. It seems to me that there is a consensus emerging. Even if Gilbert and Lynch only formally proved a narrower version of Brewer’s original conjecture, that conjecture and the tradeoffs it implies are still alive and well and highly relevant to the design of real working systems that serve real business needs. It’s also about time the rest of Brewer’s keynote got some attention. Thanks, Coda.