Replication II

In my previous post about replication, I said that a “unified” local and remote replication approach could be harder to implement than a two-layer approach. Not too surprisingly, Jonathan Ellis responded thus:

my experience is the opposite. E.g., compared to Cassandra’s single-kind-of-replication, HBase’s one-kind-of-local-replication and another-for-WAN looks pretty messy and fragile.

My reactions to this fall into two general categories: it depends on what your goals are, and there can be both good and bad implementations of either approach. What Jonathan says about HBase’s replication might well be true. Or it might not. I don’t know enough about it to say. What I do know is that I’ve seen many systems that use a multi-layer replication strategy work quite well. For example, I’ve seen DRBD used for remote replication in conjunction with the local replication provided by RAID or clustered filesystems, and many databases are well known to use this approach. Most relevantly, Yahoo’s PNUTS uses this sort of model. Again, though, it all depends on what you’re trying to achieve. It’s important to consider not just consistency but ordering, not just two sites but several, not just failed links but congested or flaky ones. To see how this could play out with an imperfect implementation of a unified replication model, let’s consider a hypothetical new data store called Galatea, deployed at a company called Mutter.

  • Mutter is spread across five sites, with all-to-all connectivity. Accordingly, the Galatea replication factor was also set to five and a replication strategy was put in place that assigns one node from each site to the preferred set for each datum.
  • The first problem occurred when a developer, having been told that he could get strongly consistent behavior by using R=3 and W=3, did just that. Thus, every request had to wait for at least two remote sites to respond, and the application was dog-slow. Management, at the urging of operations, handed down an edict that only R=1 and W=1 would be acceptable. The developers grumbled about false claims and shifting requirements, but persevered.
  • With R=1 and W=1, each request only had to wait for one response, which was practically always the one from the local replica. Operations complained about the 80% of read results which were putting load on the network only to be thrown away upon arrival, so Galatea was modified to read only from the local server unless that had already been tried and failed.
  • Writes, of course, were still being sent as they happened to all five sites. Because Galatea was based on a synchronous memory-based model instead of an asynchronous disk-based one, when a link became congested then replica requests that used it would stay in memory as the queues got longer and longer . . . until nodes started falling over from memory exhaustion. The Galatea developers responded by adding per-replication-request timeouts, after which requests would be handled via hinted handoff as though the destination had failed.
  • After a congestion or link-failure event, propagating previously-failed writes occurred in such an unpredictable order that the application writers couldn’t do anything useful until they got an “all clear” signal . . . which might not come before the next congestion or link-failure event. They asked for at least approximate ordering so that they could do their own catch-up incrementally. The Galatea developers looked at each other nervously.
  • Management decided to add a sixth data center, but this one would only be connected to three of the original five. The Galatea developers tried to explain that the nodes were all in one big consistent-hashing ring and keys would hash to nodes that weren’t directly reachable, so they’d have to add some sort of internal routing layer and it would be really ugly and and and…
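
For anyone who wants the arithmetic behind the first two complaints spelled out, here is a minimal Python sketch. The function names are mine, invented for illustration, and the model assumes exactly one replica per site, as in the Mutter deployment. It is not code from Galatea or any real store.

```python
def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return r + w > n


def remote_replies_needed(count: int, local_replicas: int = 1) -> int:
    """Replies that must cross a WAN link when only `local_replicas` are on site."""
    return max(0, count - local_replicas)


def wasted_read_fraction(n: int, r: int) -> float:
    """Share of replies discarded when a read goes to all N replicas but waits for R."""
    return (n - r) / n


# Mutter's setup: N=5, one replica per site.
print(quorum_overlaps(5, 3, 3))        # strong-ish consistency, but...
print(remote_replies_needed(3))        # ...each request waits on two remote sites
print(quorum_overlaps(5, 1, 1))        # the R=1/W=1 edict gives up the overlap
print(wasted_read_fraction(5, 1))      # and four of five read replies are discarded
```

The 80% figure operations complained about is just (N-R)/N with N=5 and R=1.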

At this point one of the Mutter developers proposed an alternative design. This design, naturally called Paphos, would use the Galatea replication locally. However, there would be a separate consistent-hashing ring at each site. Each write would go to a local node immediately, and that node would forward to remote nodes asynchronously (based on knowledge of neighboring sites’ rings). Those remote nodes could again replicate asynchronously to their own neighbor sites, if necessary. Remote replication was explicitly epochal; remote replication for one epoch could not begin until the previous epoch was complete. This provided both a form of flow control when links were congested, and approximate ordering for recovery after a failure, without unreasonable overhead (the epoch protocol wasn’t very expensive). The ops folks were happy because from their perspective Paphos was much better behaved than Galatea alone had been. Management was happy because they didn’t have to lease as many OC-12s. The developers were happy about getting the incremental catch-up functionality they’d asked for, though a few did still grumble about weak consistency until everyone else told them to shut up. Most importantly, users were happy because response times were stable even under load, and there were fewer of those “missing update” glitches that had become routine during the Galatea days.
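
Since Paphos is fictional, there is no real implementation to show, but the epoch idea is small enough to sketch. What follows is one possible reading of "remote replication for one epoch could not begin until the previous epoch was complete", written in Python. The class name, the fixed `epoch_size`, and the `send`/`ack` callbacks are all assumptions of mine.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

Write = Tuple[str, str]            # (key, value)
Epoch = Tuple[int, List[Write]]    # (epoch number, writes sealed into it)


class EpochReplicator:
    """Batches local writes into epochs and keeps at most one epoch in flight."""

    def __init__(self, send: Callable[[int, List[Write]], None], epoch_size: int = 2):
        self.send = send                    # hypothetical transport to a neighbor site
        self.epoch_size = epoch_size
        self.open_writes: List[Write] = []  # the epoch still accepting writes
        self.sealed: Deque[Epoch] = deque() # finished epochs waiting their turn
        self.in_flight: Optional[int] = None
        self.next_epoch = 0

    def write(self, key: str, value: str) -> None:
        # The local write is already done by the time we get here; this only queues
        # it for asynchronous forwarding.
        self.open_writes.append((key, value))
        if len(self.open_writes) >= self.epoch_size:
            self.seal()
        self.pump()

    def seal(self) -> None:
        if self.open_writes:
            self.sealed.append((self.next_epoch, self.open_writes))
            self.next_epoch += 1
            self.open_writes = []

    def pump(self) -> None:
        # Flow control: nothing new is sent while an epoch is unacknowledged.
        if self.in_flight is None and self.sealed:
            number, writes = self.sealed.popleft()
            self.in_flight = number
            self.send(number, writes)

    def ack(self, number: int) -> None:
        # Acks for anything other than the outstanding epoch are ignored.
        if number == self.in_flight:
            self.in_flight = None
            self.pump()
```

Epochs arrive in order even though writes within an epoch may not, which is the "approximate ordering" the application writers asked for. A congested link stalls the pipeline instead of piling up per-request state, though this sketch does nothing to bound the `sealed` queue, which a real system would presumably spool to disk.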

I know this fictional story doesn’t make a fair comparison. Galatea was a poor representative of its category, and enhancing it might well have been preferable to replacing it. Then again, the same argument could have been made for enhancing Galatea’s own predecessor instead of replacing it. With very little effort, I could write a version of the above in which Paphos would struggle and Galatea would come along to save the day. It’s always dangerous to compare a best-of-breed implementation of one idea to a sloppy implementation of another, and much of the recent backlash against NoSQL is a reaction to exactly that tendency. (On the other hand, many of the backlash articles I’ve read are themselves fatally flawed by comparing across wildly different use cases.)

It’s entirely possible for two contrary sets of observations, such as Jonathan’s and mine, to be simultaneously valid because of differences in context. Any real deployment decision should be based on separating the relevant from the irrelevant and then comparing good-faith implementations of multiple ideas to solve the problem at hand. If Dynamo-style replication suits your needs, that’s great. If it doesn’t, that’s OK too. There are plenty of building blocks available, and plenty of ways to combine them into useful systems.


Replication is a feature of practically every distributed system out there, but I haven’t seen much writing about it lately so I might as well give it a shot. The first question about replication is: why do it? There are essentially two reasons: to protect against failure, and to spread the load for reads. The protection against failure is the one everyone focuses on. Just today, I saw someone ask about replication to improve read performance, and I saw someone else tell them very firmly that replication was only for availability. In some systems it is. In others it isn’t. By increasing the number of replicas one can increase read performance, but usually at the expense of write performance because updating all those replicas is costly. Nonetheless, it can be highly worthwhile for data that is read far more than it’s written. What do you call a system that makes a whole lot of replicas across a whole lot of sites? A content distribution network. I think a few companies have made a bit of money doing nothing but that, so clearly replication has value beyond improving availability.

The “whole lot of sites” part brings me to my next point: choosing where to put your replicas. A key concept here is of avoiding correlated failures. If availability is a concern – and when isn’t it? – then you’ll want to store replicas in multiple geographic locations if at all possible. Of course, then you have to deal with high latency, low/unpredictable throughput, and the possibility of partitions. This is where all of that eventual consistency, CAP-theorem stuff comes from. Even within a data center, though, you can place replicas to reduce the odds of correlated failure. You can place them in different racks, in different aisles, or even at different heights. Yes, heights. If your racks are all configured the same way, machines at the same height in different racks will occupy the same power/thermal/airflow milieu. I saw something similar with some of the SiCortex systems, where imperfections in airflow made the nodes in certain positions run hotter and fail more often. (Those boards were mounted vertically, but if anything such problems are likely to be worse with horizontally oriented systems. BTW, this is the same system where we discovered the importance of altitude and air density.) The OceanStore guys went even further, and talked about placing replicas to avoid correlated failures attributable to a common hardware design, operating system, etc. Distributing replicas across this many dimensions can get pretty complicated, but as long as it reduces the odds of losing all replicas simultaneously it’s hard to say it’s not worth it. If you’re replicating to improve read performance, there’s a whole different set of graph theory you can apply to minimize the maximum latency from any client to their nearest replica.
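
To make the "many dimensions" idea a little more concrete, here is a rough greedy placement sketch in Python. It assumes each node is described by a few hypothetical attributes (`site`, `rack`, `height`) and simply picks, one at a time, the node that shares the fewest attributes with the replicas already chosen. Real placement policies weigh these dimensions differently and consider capacity as well, so treat this as an illustration only.

```python
from typing import Dict, List, Sequence

Node = Dict[str, str]  # hypothetical attributes: name, site, rack, height


def shared_domains(a: Node, b: Node, dims: Sequence[str]) -> int:
    """Number of failure domains two nodes have in common."""
    return sum(1 for d in dims if a[d] == b[d])


def place_replicas(nodes: List[Node], count: int,
                   dims: Sequence[str] = ("site", "rack", "height")) -> List[Node]:
    """Greedily choose `count` nodes, each sharing as little as possible with earlier picks."""
    chosen: List[Node] = []
    candidates = list(nodes)
    while candidates and len(chosen) < count:
        best = min(candidates,
                   key=lambda n: sum(shared_domains(n, c, dims) for c in chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen


# Rack names are assumed to be unique across sites; height is deliberately
# compared across racks, per the airflow observation above.
nodes = [
    {"name": "a1", "site": "A", "rack": "A-r1", "height": "top"},
    {"name": "a2", "site": "A", "rack": "A-r1", "height": "bottom"},
    {"name": "a3", "site": "A", "rack": "A-r2", "height": "top"},
    {"name": "b1", "site": "B", "rack": "B-r1", "height": "top"},
    {"name": "b2", "site": "B", "rack": "B-r1", "height": "bottom"},
]
print([n["name"] for n in place_replicas(nodes, 3)])
```

Adding a dimension such as hardware model or operating system, in the OceanStore spirit, is just another entry in `dims`.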

The last point I’d like to make about replication is where it should occur. In some systems, notably Dynamo and its descendants, local and remote replication are handled within a single framework. In others, including too many database- and filesystem-replication systems to mention, they’re handled at separate levels. That’s the approach I generally favor, because local and remote replication are simply different. Two particular issues to watch out for (you knew there’d be a bullet list somewhere):

  • Impacting local performance. If one or more links are congested (not failed) then you have to be careful that requests which could still complete locally aren’t unduly affected e.g. because of high R/W/N values.
  • Catch-up ordering. After a partition, it can take a while to bring two sites back in sync. It’s often desirable, during this time, that events be relayed in approximate temporal order – or at least that writes from hours ago don’t overtake those from mere seconds ago.

Users often end up with one set of requirements for local replication, and quite another for remote replication. It’s possible for a system which treats local and remote replication the same to satisfy both sets of requirements simultaneously, but in my experience the implementation ends up being more difficult than using two separate mechanisms. Asynchronous but ordered replication queues are not that hard to implement, and offer a pretty good set of performance/consistency/robustness characteristics. You still need to apply some kind of conflict resolution as things pop out of the queue, but that might well be different than the conflict resolution/avoidance you use within a site. I’m not saying that the integrated replication approach is wrong, but I’ve found that a two-layer approach is usually more flexible and easier to implement.

Lies, Damn Lies, and Benchmarks

Dennis Forbes is at it again: Fighting The NoSQL Mindset. I’d love to write a detailed response, but I’m too busy and too tired right now. Besides, I already got my hands dirty with Dennis once this month. Just be aware that the lack of compelling counterarguments in the comments might not indicate that none have been made; they might just have been moderated into oblivion. All of the comments about “NoSQL folks who just didn’t learn computer science” are allowed to stand, of course, as though indices and storage optimization are more advanced computer science than eventual consistency and phi-accrual fault detectors. Riiight. Anyway, feel free to read and rebut. I don’t intend to, except perhaps to point out that there are right and wrong ways to get a power-law distribution across a large set of values. I guess that CS stuff does come in handy sometimes.

Spring Cleaning

You might notice that things look a little different around here. I stayed up a bit later than I meant to last night, tweaking a new theme to give the place a slightly fresher look. If I broke anything too badly you wouldn’t be able to read this, but for minor damage just leave a comment. Thanks!

Conflict Resolution of a Different Sort

My article on Riak and Cassandra seems to have ignited quite a controversy, involving more than just those two projects. Basho’s Justin Sheehy correctly makes the point that, “When you are disrupting a much larger established industry, aggression amongst peers isn’t a good route to success.” Benjamin Black puts it even more succinctly: it’s a big ocean. Alex Popescu says, “I think the market is still too young and small to see real attacks between the players.” These are all good points, so let me get one thing out there right away.

Riak is not a bad product. Basho people are not bad people.

Yes, I’ve had some issues both with Riak and with their comparison to Cassandra. On the other hand, as far back as October of last year I recommended their descriptions of consistent hashing and vector clocks. Later in the same month, I had this to say about Riak.

I like its feature set and what I’ve read about its architecture passes the “how I would have done it” test

More recently, I have commended Andy Gross and Rusty Klophaus for their efforts to improve the situation. I am neither anti-Riak nor anti-Basho. Heck, we’re not far apart geographically. They’re in a building only a couple of blocks from some of my old stomping grounds. Whatever differences we have could probably be resolved over a couple of beers at Cambridge Brewing Co.

That said, I still think it’s worth addressing the issues that Justin/Benjamin/Alex raise. There will be aggression between competitors. Back when I was at Revivio, when we and others were trying to establish the market for continuous data protection, many of the same issues of cooperation vs. competition played out the same way. In the early days, we were generally happy to share the stage (sometimes literally) with our competitors, all positioning ourselves vs. The Enemy which was single-point-in-time (pejoratively “SPIT”) backup. We each tried to carve out our own niche within this new space, but things were very friendly. Sound familiar? This is very consistent with Justin’s portrayal of NoSQL and craft brewing. Of course, there did come a time when that comity started to dissolve and people started taking cheap shots at one another. Times change, and changing times call for changing tactics.

There’s room for reasonable people to disagree about who started the “real attacks between the players” in this case. I might point to Mark Phillips adding the inflammatory language about performance vs. data integrity to the Basho comparison on March 17 (subsequently retracted). Mark might point to my post on March 19. It doesn’t matter. Let me repeat that: It. Doesn’t. Freaking. Matter. Aggressive marketing will become commonplace sooner or later. The real question is: what should the people who want to maintain a more collegial tone do about it? As with NoSQL itself, there’s no single answer. Sometimes “turn the other cheek” is the right answer. Sometimes “hit back harder” is the right answer. I don’t believe in “do unto others” in either its “…as you would have done” or its “…before they do” forms. Having studied the Iterated Prisoners’ Dilemma and the Evolution of Cooperation more than a bit, I believe in doing unto others as I have seen them do (to me or to others). People learn from both positive and negative reinforcement, to varying degrees.

Here’s my advice, then: don’t shy away from conflict, but don’t make it more than it has to be. In game-theory terms, retaliate proportionally, then forgive and move on. Sometimes people will respond well when you approach them politely. Sometimes they won’t respond at all until they encounter forceful opposition. Be alert for both, and respond to both accordingly. In particular, don’t let people get away with pretending to be innocent victims when they contributed significantly to the conflict, and especially not when in their very appeal to collegiality they continue to push negative characterizations of those who quite justifiably took offense. Also, reward and encourage people who do play the game the way you’d like to see it played. Don’t let people think that the only way to get attention is by going negative. I’ve had very positive interactions and/or have praised not only the Cassandra folks but also (deep breath) Voldemort, HBase, Hypertable, MongoDB, and Keyspace. There are others I’ve seen but not directly interacted with, who seem like fine upstanding folks too. I appreciate that, and I think the NoSQL community in general is still in a good place. We should try to keep that going, which means actively working toward that goal by offering kudos and congratulations as well as criticism either of projects or of what is written about them.

I’m quite willing to move past this. I’m not going to apologize for having responded caustically to the Basho comparison, but I will apologize for having allowed what could have been a minor and short-lived conflict to become something more damaging to all concerned. I’m sorry. I regret the effect my writings have had both on Riak/Basho and on the broader NoSQL community. There, I did it. Now, Justin/Mark/anyone: how about those beers?

Conflict Resolution

In my Riak and Cassandra thread, one issue that came up was the use of vector clocks to resolve conflicts. Vector what? I’ve had this post about vector clocks queued up for a while (check the numbers) but what I had here wasn’t very good so I’ve decided to talk about conflict resolution more generally. The basic issue here is that, in a system where writes can happen independently in more than one place, multiple writes can happen which conflict or overlap, and at some point you have to decide which one survives. This is not just a problem in those newfangled NoSQL thingies either. Distributed filesystems and other systems (e.g. databases) that use asynchronous replication have been dealing with the same issue for decades – and worse, because of things like partial overlaps and data dependencies that don’t exist in the common NoSQL data models.

Vector clocks and closely related version vectors are among the common techniques used to resolve conflicts. The common idea here is that, for each datum, each node keeps track of the last version on every other node which is known to precede the current one. In other words, the version for a datum is not a single value but a vector, with a separate value for each (relevant) node. If I’m X, and I had already seen Y’s version 7 before I wrote my version 3, then I note that fact by including a 7 in Y’s entry within that vector and a 3 in mine. Then anyone who sees that 7 (or later) in my version would know that I had already seen and (presumably) accounted for Y’s update when I made mine, and could treat my update as superseding it. If they already saw Y’s 7 and they see a 6 from me, they’d be able to recognize that as a conflict. There’s a lot more about rules for updating or comparing vectors in either case, but that’s probably enough for now. No post here would be complete without a bullet list, though, so here are a few additional observations.

  • There are a lot of papers about vectors becoming too big, and ways to make them smaller. These are very real concerns when there are many updaters, as in a distributed filesystem. However, in many NoSQL stores the vectors only need to include an entry for each node where a replica might be stored. Since this number is likely to be very small, vector size isn’t an issue in these cases. Thanks to Alex Feinberg of Project Voldemort for reminding me of this recently.
  • Version vectors are usually per-object on a node, while vector clocks are usually global for the node, but this need not be the case. The version-vector update rules can be used for the entire set of objects on a node just as well.
  • There are subtle but important differences between the update rules for version vectors and vector clocks. In particular, message exchanges count as events for vector clocks and the vectors end up being more “advanced” on the receiver than on the sender. This extra ordering information can be useful, e.g. to detect ordering of operations across multiple objects, but can also create problems – e.g. with clocks passed back and forth never converging, or conflict-resolution events generating new vectors and thus potentially new conflicts.
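
The X-and-Y example above translates into just a few lines of code. This is a minimal Python sketch of the comparison rule for version vectors, with a vector represented as a plain dict from node name to counter. The function names are my own, and missing entries are assumed to mean zero.

```python
from typing import Dict

Vector = Dict[str, int]  # node name -> latest version of that node's updates seen


def descends(a: Vector, b: Vector) -> bool:
    """True if `a` has seen everything `b` has, i.e. a >= b in every entry."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: Vector, b: Vector) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict"  # neither has seen the other: order cannot be determined


def merge(a: Vector, b: Vector) -> Vector:
    """Entry-wise maximum, used to stamp a version that accounts for both inputs."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


y7 = {"X": 2, "Y": 7}          # Y's version 7
x3_after_y7 = {"X": 3, "Y": 7}  # I'm X, and I saw Y's 7 before writing my 3
x3_stale = {"X": 3, "Y": 6}     # I'm X, but I had only seen Y's 6

print(compare(x3_after_y7, y7))
print(compare(x3_stale, y7))
```

Note that `compare` only ever reports a conflict. Nothing here says what the merged value should be, only what its vector would look like.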

The most important thing about both vector clocks and version vectors (henceforth “vectors” for both) is that they do not by themselves resolve conflicts. All they can do is detect conflicts, meaning updates whose order cannot be determined. The conflicting versions must all be saved until someone, at some time, looks at them and determines how to resolve the conflict – i.e. turn them into a single combined version. In most cases, this is the user. Both Voldemort and Riak, for example, follow Dynamo’s lead in this way. The Riak folks even claim that vector clocks are easy, but I don’t agree. Using vector clocks to detect conflicts isn’t hard, so they’re sort of right as far as that goes, but detection is not the same as resolution and resolution is where the harder problems live. Even Werner Vogels recently said that Dynamo is not very user friendly and vector clocks are a large part of the reason. Vectors aren’t that hard to deal with once you’re used to them, but there’s a bit of a “last straw” effect; they usually tend to show up when you’re already doing something fairly complex, and the last thing you need is one more $#@! bit of complexity. Also, not all conflicts are equally easy to resolve. Basho’s example (set union) turns out to be particularly easy due to certain mathematical properties, but even something as seemingly simple as incrementing an integer can run into more serious problems. How can increment be difficult? Well, let’s look at what might happen at node zero in a three-node system.

  1. Node zero has a value of 100 for the integer, and a version of {10,20,30}.
  2. An update comes in with a value of 101 and a version of {10,21,30}.
  3. Another update comes in, also with a value of 101 but with a version of {10,20,31}.

Hm. Node one’s update was apparently made while oblivious to node two’s, and vice versa. No problem, they each obviously incremented the variable once, so we just make it 102 with a version of {10,21,31}. Everyone’s happy, right? Not so fast. If all we kept was the first update (including its version) then we no longer have any information to tell us whether {10,21,30} was an update from 100 to 101 or from 99 to 101. We’re missing the same information for {10,20,31}. Thus, we don’t really know whether the correct new value is 100+1+1=102 or 99+2+2=103 or something else. To know that, we’d have to keep {10,20,30} so that we could see what each of the subsequent updates really did. (This is where the special nature of set union, which is not subject to such problems, becomes apparent.) What if the second of those updates had used a version of {9,20,31}? That would indicate a conflict not just with {10,21,30} but with {10,20,30} as well – meaning we’d better have its predecessor(s) handy as well. Basically we’re stuck holding onto all previous versions until we can be sure no new conflict with them can occur – i.e. we’ve heard from every possible updater with a version clearly equal to or later than {10,20,30} in this case. In a system with many updaters, or where long-lived partitions can occur, this can get pretty messy indeed. Not so easy any more, is it?
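
Here is the same point as a small Python sketch. The helper names are hypothetical, and the counter merge assumes both conflicting values really do descend directly from the ancestor you pass in, which is precisely the thing you can't know once that ancestor has been thrown away.

```python
from typing import Iterable, List, Set


def merge_increments(ancestor_value: int, sibling_values: List[int]) -> int:
    """Three-way merge: replay each sibling's delta relative to the common ancestor."""
    return ancestor_value + sum(v - ancestor_value for v in sibling_values)


def merge_sets(siblings: Iterable[Set[str]]) -> Set[str]:
    """Union merge for add-only sets: no ancestor needed, any order gives the same result."""
    merged: Set[str] = set()
    for s in siblings:
        merged |= s
    return merged


# Two conflicting versions both carry the value 101. The right answer depends
# entirely on what they were incremented from.
print(merge_increments(100, [101, 101]))   # if the ancestor was 100
print(merge_increments(99, [101, 101]))    # if the ancestor was 99

# Set union doesn't care about the ancestor at all.
print(sorted(merge_sets([{"a", "b"}, {"a", "c"}])))
```

The two counter merges give 102 and 103 from identical sibling values, which is why the ancestor has to be kept around. The set merge gets away without it only because adds commute and are idempotent; once removals are allowed, sets run into similar trouble.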

But wait, some might say, you can avoid all this by using vectors in a different way – to prevent update conflicts by issuing conditional writes which specify a version (vector) and only succeed if that version is still current. Sorry, but no, or at least not generally. In a partition-tolerant system, nodes on each side of a partition may simultaneously accept conflicting writes against the same initial version, and you’ll still have to resolve the conflict when you resolve the partition. For conditional writes to work, the condition must be evaluated at all update sites before the write can be allowed to succeed. Implications regarding SimpleDB internals are left as an exercise for the reader.
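
A toy model makes the problem visible. In this Python sketch each replica checks the condition only against its own state, and the version is a single integer instead of a vector purely to keep the example short. The `Replica` class is invented for illustration and says nothing about how SimpleDB or anything else is built.

```python
class Replica:
    """One copy of a datum that accepts compare-and-set style writes."""

    def __init__(self, value: str, version: int):
        self.value = value
        self.version = version

    def conditional_write(self, expected_version: int, new_value: str) -> bool:
        # The condition is evaluated locally, against this replica only.
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version = expected_version + 1
        return True


# Both sides of a partition start from the same state.
left = Replica("green", 7)
right = Replica("green", 7)

# While partitioned, each side accepts a conditional write against version 7.
left_ok = left.conditional_write(7, "red")
right_ok = right.conditional_write(7, "blue")

print(left_ok, right_ok)
print(left.value, left.version, right.value, right.version)
```

Both writes succeed, and both replicas now claim version 8 with different values. The condition did its job on each side and the conflict happened anyway, so it still has to be resolved when the partition heals.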

In conclusion, then, vectors are an incredibly powerful and useful tool to support conflict resolution . . . but conflict resolution remains a fundamentally hard problem. The complexity can be contained in many cases, but there will always be a few where it leaks out and gets all over you. This is why stronger forms of consistency, which truly don’t allow such conflicts to occur and therefore don’t require their resolution, have such enduring appeal. If you want a system that remains available even when partitions occur, though, you’ll need to weaken that consistency and therefore you’ll need to deal with conflicts. Understanding vector clocks and their relatives – and I mean really understanding them, not just treating them as some sort of magic pixie dust that will solve all of your problems – will likely be key to that effort. I hope I’ve provided at least a few clues to where both the pitfalls and the safe paths through that minefield are.

Riak and Cassandra

Please don’t read this without also reading about the social context. I stand by the technical content, but it was worded more harshly than it should have been and in particular the attribution of motive was wrong. I’ve already admitted as much to the Basho folks in person, and I’ve already apologized. Let’s move on.

One of the most annoying things about many NoSQL projects is the relentless promotion of certain projects. The competition for users, contributors, potential investors and customers, speaking and consulting engagements, and general attention is fierce. People really really want projects that they’re involved with to succeed, and I can respect that they’re willing to fight for that . . . but sometimes they fight dirty and I don’t respect that so much. Sadly, one example that recently appeared is Basho’s comparison of their own product Riak to Cassandra. Now, I know the Basho guys aren’t stupid. Riak does basically work, and stupid people wouldn’t have gotten it that far. Some of the explanations on their site of things like consistent hashing and vector clocks are quite good. Even the article I’m about to address demonstrates that they actually do know a lot about this stuff . . . so ignorance is not a likely excuse for its misrepresentations. They must have known their comparisons were inaccurate, I know some of the inaccuracies have been addressed by others, they’ve had ample time to make corrections, and yet the misrepresentations remain. Let me address a few so people can see what I mean.

When you add a new node [in Riak], it immediately begins taking an equal share of the existing data from the other machines in the cluster, as well as an equal share of all new requests and data. This works because the available data space for your cluster is divided into partitions (64 by default).

When you add a machine to a Cassandra cluster, by default Cassandra will take on half the key range of whichever node has the largest amount of data stored. Alternatively, you can override this by specifying an InitialToken setting, providing more control over which range is claimed. In this case data is moved from the two nodes on the ring adjacent to the new node. As a result, if all nodes in an N-node cluster were overloaded, you would need to add N/2 new nodes. Cassandra also provides a tool (‘nodetool loadbalance’) that can be run in a rolling manner on each node to rebalance the cluster.

Quick question: does dividing keys into 64 partitions give as much load-balancing flexibility as dividing at any point in the 128-bit space provided by MD5 (which is what you’d have with Cassandra’s RandomPartitioner)? 64 is a particularly problematic number, not only because of what happens if you have more than 64 nodes but because even with fewer nodes the granularity is just too coarse. If a partition is overloaded, too bad. There’s even a comment in the admin documentation that comes with Riak saying that you should set this to several times the number of nodes in your cluster . . . as though that’s static. What if you set it appropriately for a four-node cluster and then grew to forty? You won’t find them discussing those issues in the comparison but, oh, they sure are quick to speculate about needing to add N/2 nodes with Cassandra. At least they mention “nodeprobe loadbalance” but they get the command wrong and don’t seem to appreciate what it really does. Go read CASSANDRA-192 for yourself, and you’ll see that it can actually balance load much better than Riak with its partitions ever could.
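
The granularity complaint is easy to quantify. This Python sketch assumes the most favorable case for a fixed partition count: equal-sized partitions spread as evenly as possible across nodes. It models the arithmetic only and makes no claim about how Riak actually assigns partitions.

```python
import math
from typing import List


def partitions_per_node(partitions: int, nodes: int) -> List[int]:
    """Most even possible split of a fixed number of partitions across nodes."""
    base, extra = divmod(partitions, nodes)
    return [base + 1] * extra + [base] * (nodes - extra)


def imbalance(partitions: int, nodes: int) -> float:
    """Ratio of the busiest node's share to the idlest node's share."""
    counts = partitions_per_node(partitions, nodes)
    return max(counts) / min(counts) if min(counts) else math.inf


for n in (4, 40, 100):
    print(n, imbalance(64, n))
```

With 64 partitions, four nodes split evenly, but at forty nodes some carry two partitions and others one, so the busiest node has double the load of the idlest even in the best case. Past 64 nodes, some nodes hold nothing at all.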

Cassandra has achieved considerably faster write performance than Riak.
That said, the Riak engineering team has spent a lot of time optimizing Riak performance and benchmarking the system to ensure that it stays fast under heavy load, over long periods of time, even in the 99th-percentile case. We like speed, just not at the expense of reliability and scalability.

In other words, “We’re forced to admit that Cassandra is faster but we’ll spread FUD to distract from that.” I’m pretty sure the Cassandra folks are also unwilling to compromise on reliability and scalability for the sake of speed. The authors of the comparison have in no way shown that Riak even has an advantage in reliability or scalability, but their conclusion is worded to imply – without actually having the guts to state outright – that they made a responsible tradeoff and Cassandra made an irresponsible one. That’s just disgusting.

Riak tags each object with a vector clock, which can be used to detect when two processes try to update the same data at the same time, or to ensure that the correct data is stored after a network split.

In contrast, Cassandra tags data with a timestamp, and compares timestamps to determine which data is newer. If a client’s timestamp is out of sync with the rest of the cluster, or if a client waits too long between reading and writing data, then it is possible to lose the data that was written in between.

The difference between timestamps and vector clocks is a legitimate one, and in this case I think the Riak folks are right to bring it up. I personally would prefer a vector-clock-based approach. The “waits too long . . . in between” is kind of FUD-ish, though. This is a potential problem in any eventually consistent system, including those that use vector clocks (which resolve some but not all conflicts). Riak does provide the building blocks for a good solution, in the form of their X-Riak-Vclock extension and user-driven conflict resolution, but they don’t even allude to that. I guess “Cassandra might screw up your data” was easier than discussing a point that might actually have favored them.
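
For readers who haven't seen the timestamp failure mode in action, here is a generic last-write-wins sketch in Python. It is emphatically not Cassandra's code; the `lww_apply` function, the tuple layout, and the strict greater-than tie rule are all assumptions made for illustration.

```python
from typing import Dict, Tuple

Store = Dict[str, Tuple[str, int]]  # key -> (value, client-supplied timestamp)


def lww_apply(store: Store, key: str, value: str, timestamp: int) -> bool:
    """Keep the write only if its timestamp is newer than what is already stored."""
    current = store.get(key)
    if current is None or timestamp > current[1]:
        store[key] = (value, timestamp)
        return True
    return False  # silently discarded: the caller's data is gone


store: Store = {}

# Client B's clock is accurate; client A's clock runs 30 ticks slow.
lww_apply(store, "cart", "B's update", timestamp=1000)

# A writes *after* B in real time (tick 1010), but stamps it with its slow clock.
kept = lww_apply(store, "cart", "A's later update", timestamp=1010 - 30)

print(kept, store["cart"])
```

A's write is the later one in real time, yet it is dropped without any indication that a conflict ever existed. A vector-based scheme would at least have flagged the two writes as concurrent, leaving the question of what to do about it.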

Riak buckets are created on the fly when they are first accessed. This allows an application to evolve its data model easily.

In contrast, the Cassandra Keyspaces and Column Families (akin to Databases and Tables) are defined in an XML file. Changing the data-model at this level requires a rolling reboot of the entire cluster.

Another completely fair point. Supercolumns have their uses, but they’re really no substitute for dynamic bucket/ColumnFamily creation.

In contrast, Cassandra has only one setting for the number of replicas

Absolutely, positively untrue. Cassandra in fact allows you to specify the number of replicas on a per-request basis, to trade off performance vs. protection from failures. This is such a commonly discussed and central feature that I find it impossible to believe the authors weren’t aware of it. Interpreting the abundant information on this topic as “only one setting” is, again, reprehensible.

As I hope readers can see, I’m not just rejecting every criticism of Cassandra. I like Cassandra, I like the developers, but I’m no fanboi even on projects I’m more directly involved in. There’s always room for improvement. What I do object to, though, is criticism given without due diligence. I might even be wrong about Basho’s load distribution, for example, but at least I tried to find the facts. I expect at least that much diligence and objectivity from people who are posting their findings on a company-sponsored website as part of their day jobs, and frankly I don’t think those traits are very evident in Basho’s comparison.

How Not to Moderate a Blog

I’ve never removed a comment on this blog, even in fairly extreme situations. There are many reasons, including a general dislike of censorship and the notion that once I start policing content I become responsible for that which remains. There are also some purely practical concerns related to the near impossibility of such moderation actually being helpful. As the most active moderator on a forum through nearly a million contentious posts, I learned a few lessons that also apply to blogs.

  • Whipping out the moderator hat with no prior attempts to persuade or warn people only convinces them that you’re more interested in directing than in participating.
  • If moderation is necessary, it’s better to moderate an entire line of discussion instead of trying to make and enforce hard-to-defend distinctions between one comment and another. Everybody – participants and observers alike – has their own idea of who initiated a thread’s decline. People’s annoyance at being “caught in the net” is nothing compared to their anger at being singled out.
  • Deleting comments is a bad idea. Often, a comment will contain both a good part and a bad part. At best, deleting both leaves the remainder of the conversation disconnected and nonsensical. At worst, it also tells people that the effort they put into the good part means nothing to you. Close comments, mark bad ones, but don’t delete.

Unfortunately, in the thread elsewhere that inspired my own recent post about standards, Herb Sutter managed to make every one of these mistakes. I’m not saying that he was unjustified in taking action; it’s his blog, he wants traffic, and I for one had stopped visiting that thread because I got tired of being called a “freetard” and such by some of the other participants. What I’m saying is that the action he took was ill considered and poorly executed. For example, I can see that one comment, which was 95% serious commentary with one rude remark, was removed in toto; the “tinfoil hat” insult to which that one remark was a response was allowed to stand. How this fits Herb’s desire for “respectful disagreement” is a mystery, and it’s hard to escape the conclusion that the “moderator” was reacting to what was said rather than how. Whether intentionally or not, he has done less to improve the tone of the discussion than to influence its direction.

That such authoritarian and inconsistent actions were taken in the context of a thread about standards and openness is particularly telling. For too many people, those terms are just marketing hooks and not sincerely held principles. We already have “greenwashing” for the same phenomenon as it applies to ecological concerns. We need a similar term for people who talk the inclusive talk but don’t walk the walk. It’s a shame, really, because I think that overall Herb is a good guy who brings great value to the techie blogosphere, but in this particular instance he seems to have taken a stance against responsible blogging.

The TAG Conjecture

I briefly alluded to this in an earlier post, and figure it’s time for a more complete explanation. Let me just get one thing out of the way first: I am not so presumptuous as to think this can compete with Brewer’s CAP Theorem. It’s almost more of a corollary or elaboration, to clarify certain issues that arise when thinking/talking about CAP. So, what is TAG? It’s Timely Agreement in a Global system. Like CAP, the idea behind TAG is that a system can have only two of the three named properties. Timeliness is pretty intuitive and self-explanatory. Agreement is also intuitive, referring to most of what other terms such as consistency or consensus would mean. “Global” requires a little more explanation. It refers to systems that are distributed not only in the sense of many nodes but also in the sense of many places, where long-lived partitions are a serious concern. Think WAN, not LAN. Far, not near. “Global” does not include distributed systems where latency is low, bandwidth is free, and partitions are rare or transient or both. (We really need two different terms instead of using “distributed” for both, but that’s a rant for another time.) At risk of sounding elitist, I’ll say that distribution within a site is an operational rather than algorithmic problem, or tactical vs. strategic. It’s not really what CAP or TAG are about. So, what tradeoffs does TAG allow?

  • First up is TA – Timely Agreement. Everyone gets quick answers, and they’re the same answers for everyone. This is what a well designed non-Global system is likely to have. It’s also what most people who claim to have refuted CAP turn out to have, because they’re not really thinking about the world that CAP describes.
  • Second is AG, or perhaps GA – Global Agreement. You get consistency, so long as you don’t mind waiting. This is where most people will end up when they just try to run a TA system across sites, without thinking about CAP and such. Locks and timeouts can be made to work in a local environment, but completely fail in a global one. Other approaches such as MVCC offer even better concurrency and fault behavior, but – at least in the forms that would be ideal for local environments – tend to fail at global scale too. Thus, people who just add G often unintentionally lose T in the process.
  • Last is TG – Timely and Global. You get quick answers, no matter where you are or whether partitions exist, but to do that you have to weaken agreement (consistency). Systems in this space tend to be characterized less by at-the-time synchronization, in any form, than by asynchronous queues with after-the-fact repair and conflict resolution.
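The tradeoffs above can be given rough numbers with a small Python sketch. Everything in it is an assumption made for illustration: the latencies are invented, the helper names `ack_latency_ms` and `quorums_overlap` are hypothetical, and the model is simply N replicas contacted in parallel, with a request finishing once the required number have answered. Agreement is approximated by the usual quorum-overlap rule, R + W > N.

```python
from typing import Sequence


def ack_latency_ms(replica_latencies_ms: Sequence[float], required_acks: int) -> float:
    """Time until `required_acks` replicas have answered, assuming all are
    contacted in parallel: the required_acks-th smallest latency."""
    if not 1 <= required_acks <= len(replica_latencies_ms):
        raise ValueError("required_acks must be between 1 and the replica count")
    return sorted(replica_latencies_ms)[required_acks - 1]


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum must intersect every write quorum."""
    return r + w > n


# Invented round-trip times: one replica at the local site, four at remote sites.
LATENCIES_MS = [1, 40, 80, 120, 150]
N = len(LATENCIES_MS)

for r, w in [(3, 3), (1, 1)]:
    print(
        f"R={r} W={w}: write waits {ack_latency_ms(LATENCIES_MS, w)} ms, "
        f"quorums overlap: {quorums_overlap(N, r, w)}"
    )
```

Under these made-up numbers, R=3 and W=3 gives overlapping quorums but every write waits on the third-fastest site (80 ms), which is the AG corner. R=1 and W=1 answers in 1 ms but gives up the overlap, which is the TG corner. Put all five replicas on one LAN, so that every latency is about a millisecond, and you can have both, which is TA.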

Relating TAG to CAP is an interesting exercise. Consistency in CAP and Agreement in TAG are practically the same thing. It’s not quite right to translate CAP’s Partitionable into TAG’s Global, but it’s close. Where the direct mapping really falls down is CAP’s Available vs. TAG’s Timely. They’re not the same thing at all; performance or time-boundedness has always been the hidden fourth dimension of CAP, and is made explicit in TAG while the often misleading “availability” is pushed into the background. Most AP systems in CAP are also TG in TAG. That’s why I don’t think TAG really stands alone, but is more of a commentary on CAP. You can think of CAP plus Time as a tetrahedron, normally viewed from a certain angle and thus appearing as a triangle. By viewing the same shape from a second angle, though, some aspects might come into better focus. Other projections of the same shape are left as an exercise for the reader.

Morning Links

First, Nathan Hurst’s excellent Visual Guide to NoSQL Systems. There are the expected quibbles in the comments about CAP definitions and where particular projects belong, but I’m just going to say that even if some of the details were wrong – and I’m not saying they are – it’s great to have such a handy map of the territory. I particularly like the use of color to represent orthogonal distinctions. Thanks, Nathan!

Second, Joe Landman’s “New” File systems worth watching. Joe’s take on Lustre vs. GlusterFS is pretty similar to my own, which I guess is not too surprising since Joe is one of the few people I’ve met who has real knowledge and experience with both. (Disclosure: I’ve been working with GlusterFS quite a bit at my day job, too.) Bit of a shame that he didn’t include PVFS2, though, since it’s also a worthy contender in this space. On the newer systems, he’s also spot on. Ceph is great stuff, and I look forward to playing with it more in the not-so-distant future. Tahoe-LAFS is the only thing out there addressing very high levels of security/confidentiality in a usable way, and Twisted Storage is a great data integration/consolidation platform. I’m not sure either of them really plays in quite the same space as the others, so much as they complement each other, but both are still worth looking into. I hope Zooko (who certainly reads here) and Chuck Wegrzyn (who probably does) don’t object to that characterization.

Lastly, something from me on Twitter.

First rule of distributed computing: split happens.

That got retweeted quite a bit, and one person even suggested a T-shirt. I’m rather proud of it, in a “still wouldn’t want it to be my claim to fame” kind of way.