Replication is a feature of practically every distributed system out there, but I haven’t seen much writing about it lately, so I might as well give it a shot. The first question about replication is: why do it? There are essentially two reasons: to protect against failure, and to spread the load for reads. The protection against failure is the one everyone focuses on. Just today, I saw someone ask about replication to improve read performance, and I saw someone else tell them very firmly that replication was only for availability. In some systems it is. In others it isn’t. By increasing the number of replicas, you can increase read performance, but usually at the expense of write performance, because updating all those replicas is costly. Nonetheless, it can be highly worthwhile for data that is read far more often than it’s written. What do you call a system that makes a whole lot of replicas across a whole lot of sites? A content distribution network. I think a few companies have made a bit of money doing nothing but that, so clearly replication has value beyond improving availability.
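
To make that trade-off concrete, here’s a toy back-of-the-envelope model. The numbers and the read-one/write-all assumption are mine, purely for illustration: reads can be served by any single replica, while every write has to touch all of them, so adding replicas helps a read-heavy workload far more than a write-heavy one.

    # Back-of-the-envelope model of how replica count trades read capacity
    # against write cost, assuming read-one / write-all.  All numbers are
    # illustrative, not measurements.

    def capacity(replicas, per_node_ops, read_fraction):
        """Aggregate logical throughput across all replicas."""
        write_fraction = 1.0 - read_fraction
        # A read costs one replica's worth of work; a write costs `replicas`.
        cost_per_logical_op = read_fraction * 1 + write_fraction * replicas
        return replicas * per_node_ops / cost_per_logical_op

    for n in (1, 3, 5):
        print(n, round(capacity(n, per_node_ops=10_000, read_fraction=0.95)))

With a 95%-read workload the aggregate capacity climbs nicely as replicas are added; flip the ratio and it barely moves, because the write amplification eats the gain.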

The “whole lot of sites” part brings me to my next point: choosing where to put your replicas. A key concept here is avoiding correlated failures. If availability is a concern – and when isn’t it? – then you’ll want to store replicas in multiple geographic locations if at all possible. Of course, then you have to deal with high latency, low/unpredictable throughput, and the possibility of partitions. This is where all of that eventual consistency, CAP-theorem stuff comes from. Even within a data center, though, you can place replicas to reduce the odds of correlated failure. You can place them in different racks, in different aisles, or even at different heights. Yes, heights. If your racks are all configured the same way, machines at the same height in different racks will occupy the same power/thermal/airflow milieu. I saw something similar with some of the SiCortex systems, where imperfections in airflow made the nodes in certain positions run hotter and fail more often. (Those boards were mounted vertically, but if anything such problems are likely to be worse with horizontally oriented systems. BTW, this is the same system where we discovered the importance of altitude and air density.) The OceanStore guys went even further, and talked about placing replicas to avoid correlated failures attributable to a common hardware design, operating system, etc. Distributing replicas across this many dimensions can get pretty complicated, but as long as it reduces the odds of losing all replicas simultaneously, it’s hard to say it’s not worth it. If you’re replicating to improve read performance, there’s a whole different branch of graph theory you can apply to minimize the maximum latency from any client to their nearest replica.
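
Here’s a minimal sketch of what failure-domain-aware placement might look like. The attribute names and weights (site, rack, aisle, shelf, hardware, OS) are my own invented example, not anything from OceanStore; the idea is just a greedy pass that puts each new replica on the node sharing the fewest failure domains with the replicas already chosen.

    # Toy greedy placement: put each new replica on the node that shares the
    # fewest (weighted) failure-domain attributes with the replicas already
    # chosen.  Attribute names and weights are illustrative only.

    WEIGHTS = {"site": 16, "rack": 8, "aisle": 4, "shelf": 2, "hw": 1, "os": 1}

    def correlation(node, chosen):
        """Weighted count of failure domains shared with already-chosen nodes."""
        return sum(WEIGHTS[k] for other in chosen
                   for k in WEIGHTS if node[k] == other[k])

    def place_replicas(nodes, count):
        chosen = []
        for _ in range(count):
            candidates = [n for n in nodes if n not in chosen]
            chosen.append(min(candidates, key=lambda n: correlation(n, chosen)))
        return chosen

    nodes = [
        {"name": "a1", "site": "east", "rack": 1, "aisle": 1, "shelf": 3, "hw": "x", "os": "linux"},
        {"name": "a2", "site": "east", "rack": 2, "aisle": 1, "shelf": 3, "hw": "x", "os": "linux"},
        {"name": "b1", "site": "west", "rack": 4, "aisle": 9, "shelf": 1, "hw": "y", "os": "linux"},
    ]
    print([n["name"] for n in place_replicas(nodes, 2)])   # ['a1', 'b1']

A real placer would also fold in the latency objective mentioned above (a k-center-style problem), but the greedy diversity pass captures the correlated-failure part.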

The last point I’d like to make about replication is where it should occur. In some systems, notably Dynamo and its descendants, local and remote replication are handled within a single framework. In others, including too many database- and filesystem-replication systems to mention, they’re handled at separate levels. That’s the approach I generally favor, because local and remote replication are simply different. Two particular issues to watch out for (you knew there’d be a bullet list somewhere):

  • Impacting local performance. If one or more inter-site links are congested (not failed), you have to be careful that requests which could still complete locally aren’t unduly delayed, e.g. by high R/W/N values that force them to wait on remote replicas (see the sketch after this list).
  • Catch-up ordering. After a partition, it can take a while to bring two sites back in sync. It’s often desirable, during this time, that events be relayed in approximate temporal order – or at least that writes from hours ago don’t overtake those from mere seconds ago.
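
To illustrate the first bullet, here’s a small sketch of Dynamo-style quorum arithmetic (the numbers are made up): with N replicas and a write quorum of W, a write can finish without touching the remote site only if at least W of the replicas are local, so a congested WAN link suddenly matters for operations that look purely local.

    # Dynamo-style quorum arithmetic: with N replicas and a write quorum of W,
    # a write can complete without waiting on remote replicas only if enough
    # of those N replicas live at the local site.  Numbers are illustrative.

    def completes_locally(n, w, local_replicas):
        """True if a write quorum can be assembled from local replicas alone."""
        return min(n, local_replicas) >= w

    # N=3 with 2 replicas on-site: W=2 stays local, W=3 waits on the remote
    # site, so a congested (not failed) WAN link now delays every write.
    for w in (1, 2, 3):
        print(f"N=3 W={w}: local-only={completes_locally(3, w, local_replicas=2)}")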

Users often end up with one set of requirements for local replication, and quite another for remote replication. It’s possible for a system that treats local and remote replication the same to satisfy both sets of requirements simultaneously, but in my experience the implementation ends up being more difficult than using two separate mechanisms. Asynchronous but ordered replication queues are not that hard to implement, and they offer a pretty good set of performance/consistency/robustness characteristics. You still need to apply some kind of conflict resolution as things pop out of the queue, but that might well be different from the conflict resolution/avoidance you use within a site. I’m not saying that the integrated replication approach is wrong, but I’ve found that a two-layer approach is usually more flexible and easier to implement.
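
For what it’s worth, here’s a minimal sketch of the kind of queue I mean, assuming timestamp ordering and a crude last-writer-wins check as the conflict-resolution hook (both are placeholders, not recommendations): local writes are acknowledged right away, and a drain loop relays them to the remote site oldest-first.

    import heapq
    import time

    # Minimal sketch of an asynchronous but ordered replication queue.  Local
    # writes are acknowledged immediately and queued; drain() relays them to
    # the remote site oldest-first, applying a crude last-writer-wins check
    # as entries pop out.  Placeholder logic only.

    class ReplicationQueue:
        def __init__(self, apply_remote):
            self._heap = []                    # (timestamp, key, value) min-heap
            self._apply_remote = apply_remote  # callable that ships a write

        def enqueue(self, key, value, timestamp=None):
            ts = time.time() if timestamp is None else timestamp
            heapq.heappush(self._heap, (ts, key, value))

        def drain(self, remote_versions):
            """Relay queued writes in temporal order, skipping any the remote
            side already holds a newer version of."""
            while self._heap:
                ts, key, value = heapq.heappop(self._heap)
                if remote_versions.get(key, 0.0) <= ts:
                    self._apply_remote(key, value, ts)
                    remote_versions[key] = ts

    remote = {}
    q = ReplicationQueue(lambda k, v, ts: remote.__setitem__(k, v))
    q.enqueue("x", "old", timestamp=100.0)
    q.enqueue("x", "new", timestamp=200.0)
    q.drain(remote_versions={})   # relays "old" then "new", in order
    print(remote)                 # {'x': 'new'}

A real system would use something sturdier than wall-clock timestamps, and that resolution check is exactly where the cross-site conflict policy mentioned above would plug in.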