In my previous post about replication, I said that a “unified” local and remote replication approach could be harder to implement than a two-layer approach. Not too surprisingly, Jonathan Ellis responded thus:

my experience is the opposite. E.g., compared to Cassandra’s single-kind-of-replication, HBase’s one-kind-of-local-replication and another-for-WAN looks pretty messy and fragile.

My reactions to this fall into two general categories: it depends on what your goals are, and there can be both good and bad implementations of either approach. What Jonathan says about HBase’s replication might well be true. Or it might not. I don’t know enough about it to say. What I do know is that I’ve seen many systems that use a multi-layer replication strategy work quite well. For example, I’ve seen DRBD used for remote replication in conjunction with the local replication provided by RAID or clustered filesystems, and many databases take the same layered approach, pairing their own remote replication (e.g. log shipping) with local redundancy such as RAID. Most relevantly, Yahoo’s PNUTS uses this sort of model. Again, though, it all depends on what you’re trying to achieve. It’s important to consider not just consistency but ordering, not just two sites but several, not just failed links but congested or flaky ones. To see how this could play out with an imperfect implementation of a unified replication model, let’s consider a hypothetical new data store called Galatea, deployed at a company called Mutter.

  • Mutter is spread across five sites, with all-to-all connectivity. Accordingly, the Galatea replication factor was also set to five, and a replication strategy was put in place that assigned one node from each site to the preferred set for each datum.
  • The first problem occurred when a developer, having been told that he could get strongly consistent behavior by using R=3 and W=3 (so that R+W>N with N=5; see the quorum sketch after this list), did just that. Since each key had only one replica per site, every request had to wait for at least two remote sites to respond, and the application was dog-slow. Management, at the urging of operations, handed down an edict that only R=1 and W=1 would be acceptable. The developers grumbled about false claims and shifting requirements, but persevered.
  • With R=1 and W=1, each request only had to wait for one response, which was practically always the one from the local replica. Operations complained that 80% of read responses (the four per request that weren’t needed) were putting load on the network only to be thrown away upon arrival, so Galatea was modified to read only from the local server unless that had already been tried and failed.
  • Writes, of course, were still being sent to all five sites as they happened. Because Galatea was based on a synchronous memory-based model instead of an asynchronous disk-based one, when a link became congested the replica requests that used it would stay in memory as the queues grew longer and longer . . . until nodes started falling over from memory exhaustion. The Galatea developers responded by adding per-replication-request timeouts, after which requests would be handled via hinted handoff as though the destination had failed (a sketch of that fallback also follows the list).
  • After a congestion or link-failure event, previously failed writes were propagated in such an unpredictable order that the application writers couldn’t do anything useful until they got an “all clear” signal . . . which might not come before the next congestion or link-failure event. They asked for at least approximate ordering so that they could do their own catch-up incrementally. The Galatea developers looked at each other nervously.
  • Management decided to add a sixth data center, but this one would only be connected to three of the original five. The Galatea developers tried to explain that the nodes were all in one big consistent-hashing ring and keys would hash to nodes that weren’t directly reachable, so they’d have to add some sort of internal routing layer and it would be really ugly and and and…
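
To make the quorum arithmetic from the first couple of items concrete, here’s a minimal sketch in Python. It’s purely illustrative (Galatea is fictional, and every name here is hypothetical): one replica per site across five sites, plus the classic R+W>N overlap condition, which is why R=3/W=3 has to wait on at least two remote sites while R=1/W=1 never has to leave the building.

```python
# Hypothetical sketch: one replica per site, plus the quorum arithmetic
# that drove the R/W decisions in the story above.

SITES = ["dc1", "dc2", "dc3", "dc4", "dc5"]   # five data centers
N = len(SITES)                                # replication factor of five

def preferred_set(key, nodes_by_site):
    """Pick one node from each site for this key (one replica per site)."""
    return [nodes_by_site[site][hash((key, site)) % len(nodes_by_site[site])]
            for site in SITES]

def overlapping_quorums(r, w, n=N):
    """Classic condition for read and write quorums to overlap."""
    return r + w > n

# R=3, W=3 gives overlapping quorums, but with one replica per site any
# quorum of three must include at least two remote sites, hence the WAN
# round trips.  R=1, W=1 is fast and local, but the quorums no longer
# overlap, so reads can miss recent writes.
assert overlapping_quorums(3, 3)
assert not overlapping_quorums(1, 1)
```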
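
The timeout-and-hinted-handoff fix from a couple of items up is similarly easy to sketch. Again this is hypothetical code, not anything from a real system: send_replica and store_hint are stand-ins for whatever the replication transport and hint store would actually be.

```python
import asyncio

async def replicate_with_timeout(write, replica, send_replica, store_hint,
                                 timeout=2.0):
    """Bound each replication request so a congested link can't pin writes
    in memory forever; on timeout, treat the destination as failed and
    record a hint to be replayed when the link recovers."""
    try:
        await asyncio.wait_for(send_replica(replica, write), timeout)
    except (asyncio.TimeoutError, ConnectionError):
        await store_hint(replica, write)
```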

At this point one of the Mutter developers proposed an alternative design. This design, naturally called Paphos, would use Galatea’s replication locally, but with a separate consistent-hashing ring at each site. Each write would go to a local node immediately, and that node would forward it to remote nodes asynchronously (based on knowledge of neighboring sites’ rings). Those remote nodes could in turn replicate asynchronously to their own neighbor sites, if necessary. Remote replication was explicitly epochal: forwarding for one epoch could not begin until the previous epoch was complete. This provided both a form of flow control when links were congested and approximate ordering for recovery after a failure, without unreasonable overhead (the epoch protocol wasn’t very expensive). A sketch of the epoch mechanism appears below.

The ops folks were happy because, from their perspective, Paphos was much better behaved than Galatea alone had been. Management was happy because they didn’t have to lease as many OC-12s. The developers were happy about getting the incremental catch-up functionality they’d asked for, though a few did still grumble about weak consistency until everyone else told them to shut up. Most importantly, users were happy because response times were stable even under load, and there were fewer of those “missing update” glitches that had become routine during the Galatea days.
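
For what it’s worth, the epoch mechanism is also simple to sketch. This is a hypothetical illustration of the idea rather than any real Paphos code; ship_epoch is an assumed async callback that sends one epoch’s batch to a neighboring site and returns once that site acknowledges it.

```python
import asyncio

class EpochForwarder:
    """Epochal remote replication: local writes accumulate in the current
    epoch, and epoch N+1 is never shipped until epoch N has been fully
    acknowledged.  Keeping at most one epoch in flight per neighbor gives
    flow control under congestion, and replaying whole epochs in order
    gives the approximate ordering needed for incremental catch-up."""

    def __init__(self, ship_epoch):
        self.ship_epoch = ship_epoch   # async fn(epoch_number, batch) -> ack
        self.epoch = 0
        self.current = []              # writes buffered for the current epoch
        self.in_flight = None          # future for the epoch being shipped

    def append(self, write):
        """Called on the local write path; never blocks the client."""
        self.current.append(write)

    async def tick(self):
        """Called periodically: seal the current epoch and ship it, but only
        after the previous epoch has been acknowledged."""
        if self.in_flight is not None:
            await self.in_flight       # back-pressure: one epoch at a time
        if not self.current:
            return                     # nothing new to ship
        batch, self.current = self.current, []
        self.epoch += 1
        self.in_flight = asyncio.ensure_future(
            self.ship_epoch(self.epoch, batch))
```

The point of the one-epoch-in-flight rule is that a congested link slows the sender down instead of letting queues grow without bound, which is exactly the failure mode that bit Galatea.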

I know this fictional story doesn’t make a fair comparison. Galatea was a poor representative of its category, and enhancing it might well have been preferable to replacing it. Then again, the same argument could have been made for enhancing Galatea’s own predecessor instead of replacing it. With very little effort, I could write a version of the above in which Paphos would struggle and Galatea would come along to save the day. It’s always dangerous to compare a best-of-breed implementation of one idea to a sloppy implementation of another, and much of the recent backlash against NoSQL is a reaction to exactly that tendency. (On the other hand, many of the backlash articles I’ve read are themselves fatally flawed by comparing across wildly different use cases.)

It’s entirely possible for two contrary sets of observations, such as Jonathan’s and mine, to be simultaneously valid because of differences in context. Any real deployment decision should be based on separating the relevant from the irrelevant and then comparing good-faith implementations of multiple ideas to solve the problem at hand. If Dynamo-style replication suits your needs, that’s great. If it doesn’t, that’s OK too. There are plenty of building blocks available, and plenty of ways to combine them into useful systems.