Since this is a subject that has been much on my mind lately, I’m going to take another shot at discussing and clarifying the differences between caches and replicas. I’ve written about this before, but I’m going to take a different perspective this time and propose a different distinction.

A replica is supposed to be complete and authoritative. A cache can be incomplete and/or non-authoritative.

I’m using “suppose” here in the almost old-fashioned sense of assume or believe, not demand or require. The assumption or belief might not actually reflect reality. A cache might in fact be complete, while a replica might be incomplete – and probably will be, when factors such as propagation delays and conflict resolution are considered. The important thing is how these two contrary suppositions guide behavior when a client requests data. This distinction is most important in the negative case: if you can’t find a datum in a replica then you proceed as if it doesn’t exist anywhere, but if you can’t find a datum in a cache then you look somewhere else. Here are several other possible distinctions that I think do not work as well.
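
To make that negative-case rule concrete, here’s a minimal sketch in Python of how a lookup miss plays out differently in each case. The names (local_store, authoritative_source) are hypothetical stand-ins for illustration, not any particular system’s API.

```python
_MISSING = object()  # distinguishes "stored value happens to be None" from "not present"

def replica_get(local_store, key):
    # A replica supposes it is complete: a local miss means the datum
    # doesn't exist anywhere, so report that and stop.
    value = local_store.get(key, _MISSING)
    if value is _MISSING:
        raise KeyError(key)   # treated as an authoritative "no such datum"
    return value

def cache_get(local_store, authoritative_source, key):
    # A cache supposes it may be incomplete: a local miss just means
    # "look somewhere else", and remember the answer for next time.
    value = local_store.get(key, _MISSING)
    if value is _MISSING:
        value = authoritative_source[key]   # fall back to the more authoritative copy
        local_store[key] = value            # warm the cache
    return value
```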

  • Read/write vs. read-only. CPU caches and NoSQL replicas are usually writable, while web caches and secondary replicas used for Disaster Recovery across a WAN are more often read-only.
  • Hierarchy vs. peer-to-peer. CPU caches are usually hierarchical with respect to (more authoritative) memory, but treat each other as peers in an SMP system – and don’t forget COMA either. On the flip side, the same NoSQL and DR examples show that both arrangements apply to replicas as well.
  • Increasing performance vs. increasing availability. This is the distinction I tried to make in my previous article, and in practice it often does apply in the domain where I made it (see the comments). I’m not backing away from the claim that a cache which fails to improve performance or a replica which fails to improve availability has failed, but either a cache or a replica can serve both purposes, so I don’t think it’s the most essential distinction between the two.
  • Pull vs. push. Caches usually pull data from a more authoritative source, while replicas are usually kept up to date by pushing from where data was modified, but counterexamples do exist. Many caches do allow push-based warming, and in the case of a content delivery network there’s probably more pushing than pulling. I’m having trouble thinking of a pull-based replica that’s not really more of a proxy (possibly with a cache attached), but the existence of push-based caches already makes this distinction less essential. One thing that does remain true is that a cache can and will fall back to pulling data for a negative result, while a replica either can’t or won’t.
  • Independent operation. Again, replicas are usually set up so that they can operate independently (i.e. without connection to any other replica), while caches usually are not, but this is not necessarily the case. For example, Coda allowed users to run with a disconnected cache.

Now that I’ve said that some of these differences are inessential, I’m going to incorporate them into two pedagogical examples for further discussion. Let’s say that you’re trying to extend your existing primary data store with either a push-based peer-to-peer replica or a pull-based hierarchical cache at a second site. What are the practical reasons for choosing one or the other? The first issue is going to be capacity. A replica requires capacity roughly equal to that of any other replica; a cache may be similarly sized, but it can be, and most often will be, smaller. While storage is cheap, it does add up, and for very large data sets a (nearly) complete replica at the second site might not be an economically feasible option. The second issue is going to be at the intersection of network bandwidth/latency/reliability and consistency. Consider the following two scenarios:

  • For a naively implemented push-based replica, each data write is going to generate cross-site traffic. If the update rate is high, then this is going to force a tradeoff between high costs for high bandwidth vs. low consistency at low bandwidth. You can reduce the cross-site bandwidth by delaying and possibly coalescing or skipping updates (see the sketch after this list), but now you’re moving toward even lower consistency.
  • For a pull-based cache, your tradeoff will be high cost for a large cache vs. high latency for a small one.
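
To make the coalescing idea from the first scenario a bit more concrete, here’s a rough Python sketch of a write-behind pusher that batches updates and lets repeated writes to the same key collapse into one before anything crosses the WAN. The names (send_batch, record_write, flush) are made up for illustration, not taken from any real replication system.

```python
class CoalescingPusher:
    """Delayed, coalescing replication pushes: local writes are acknowledged
    immediately, cross-site traffic is deferred, and only the latest value
    per key is shipped, trading remote consistency for bandwidth."""

    def __init__(self, send_batch):
        self.send_batch = send_batch  # whatever actually moves a dict of updates cross-site
        self.pending = {}             # key -> latest value; later writes overwrite earlier ones

    def record_write(self, key, value):
        # Called on the local write path; nothing goes over the WAN yet.
        self.pending[key] = value

    def flush(self):
        # Called on a timer.  A longer interval means less cross-site
        # bandwidth and a staler remote replica.
        if self.pending:
            batch, self.pending = self.pending, {}
            self.send_batch(batch)
```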

Yes, folks, we’re in another classic triangle – Price vs. Latency vs. Consistency. Acronyms and further analysis in those terms are left for the reader, except that I’ll state a preference for PLaC over CLaP. The point I’d rather make here is that the choice doesn’t have to be one or the other for all of your data. Caches and replicas both have a scope. It’s perfectly reasonable to use caching for one scope and replication for another, even between two sites. The main practical consequences of being a cache instead of a replica are as follows.

  • You can drop data whenever you want, without having to worry about whether that was the only copy.
  • You can ignore (or shut off) updates that are pushed to you.
  • You can fall back to pulling data that’s not available locally.

If you have a data set that’s subdivided somehow, and a primary site that’s permanently authoritative for all subsets, then it’s easy to imagine dynamically converting caches into replicas and vice versa based on usage. All you need to do is add or remove these behaviors on a per-subset basis. In fact, you can even change how the full set is divided into subsets dynamically. Demoting from a replica to a cache is easy (so long as you’ve ensured the existence of an authoritative copy elsewhere); you just give yourself permission to start dropping data and updates. Promoting from a cache to a replica is trickier, since you have to remain in data-pulling cache mode until you’ve allocated space and completed filling the new replica, but it’s still quite possible. With this kind of flexibility, you could have a distributed storage system capable of adapting to all sorts of PLaC requirements as those change over time, instead of representing one fixed point where only the available capacity/bandwidth can change. Maybe some day I’ll implement such a system. ;)
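
In case it helps, here’s a rough Python sketch of what that per-subset switching might look like. All of the names are hypothetical, and start_background_fill stands in for whatever mechanism actually allocates space and copies the subset over.

```python
from dataclasses import dataclass

@dataclass
class SubsetState:
    # The three cache behaviors listed above, as per-subset switches.
    # All three on = cache; all three off = replica.
    may_drop_data: bool = True       # evict without worrying about other copies
    may_ignore_updates: bool = True  # skip or shut off pushed updates
    may_pull_on_miss: bool = True    # fall back to the primary for missing data

def demote_to_cache(state):
    # Easy, provided an authoritative copy exists at the primary site:
    # just give yourself permission to start dropping data and updates.
    state.may_drop_data = True
    state.may_ignore_updates = True
    state.may_pull_on_miss = True

def promote_to_replica(state, start_background_fill):
    # Trickier: keep behaving as a pulling cache until the subset has been
    # fully copied, then stop relying on the primary for it.
    def on_fill_complete():
        state.may_drop_data = False
        state.may_ignore_updates = False
        state.may_pull_on_miss = False
    state.may_ignore_updates = False          # start applying pushed updates right away
    start_background_fill(on_fill_complete)   # hypothetical: allocate space, copy the subset
```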