Ned Batchelder started an interesting discussion about caches. In the course of that discussion, the distinction between caching and replication came up, and I think there are some common misunderstandings about that distinction and its implications, so I’ll attempt to clarify. Here are some definitions:

  • Cache: a data location created/deployed to provide lower request latency than the main data store (either by being located nearer to requesters or by using faster components).
  • Replica: a data store, separate from the main one, that is created/deployed so that service can continue after a failure.

In short, a cache exists to improve performance and a replica exists to improve resilience. A cache that doesn’t improve performance is a failure, as is a replica that doesn’t improve resilience, but the possibility of failure doesn’t turn one into the other. Since defining things in terms of purpose or intent often leaves room for ambiguity, here are some practical implications of the difference.

  • Caches need not be current or complete. They may return stale data, or no data at all, although many caches are designed to avoid stale data and “transparent” caches will re-request data from the main store instead of requiring that the requester do so after a miss (a minimal sketch of such a cache follows this list).
  • Replicas must be both current and complete (perhaps not perfectly but always within defined limits), and authoritative or at least capable of becoming authoritative. “Authoritative” means that they may not be contradicted by alternative sources of information; if a conflict exists, the authoritative source is unconditionally given precedence over any non-authoritative one. (Authority loses its meaning if authorities disagree, of course, but that’s a philosophical issue best left for another time. For now, assume that authorities always agree.)
  • Caches exist to improve request latency, but replication might actually degrade request latency at the nearer data store as messages are exchanged with the further one to preserve the required replica behavior.
  • Replicas exist to improve resilience, but caching might degrade resilience as the number of components (the caches and extra data paths) and logical complexity both increase.
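
To make that first point concrete, here is a minimal read-through cache sketch in Python. It is only an illustration, not any particular product’s API: the `ReadThroughCache` name, the `fetch_from_store` callable, and the simple TTL policy are all my own inventions. The sketch will happily serve data that may be stale until its TTL expires, and on a miss it re-requests from the main store itself rather than making the requester do so.

```python
import time

class ReadThroughCache:
    """A deliberately tiny read-through ("transparent") cache: entries may be
    stale until the TTL expires, and misses are re-fetched from the main store
    on the requester's behalf."""

    def __init__(self, fetch_from_store, ttl_seconds=30.0):
        self._fetch = fetch_from_store   # callable: key -> value, hits the main store
        self._ttl = ttl_seconds
        self._entries = {}               # key -> (value, time it was cached)

    def get(self, key):
        hit = self._entries.get(key)
        if hit is not None:
            value, cached_at = hit
            if time.time() - cached_at < self._ttl:
                return value             # may be stale relative to the main store
        # Miss (or expired entry): transparently re-request from the main store.
        value = self._fetch(key)
        self._entries[key] = (value, time.time())
        return value
```

Nothing here promises currency or completeness; the cache just trades a bounded amount of staleness for lower latency, which is exactly the point.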

Much of the confusion arises because a single pool of data can serve as both a cache and a replica. In fact, if multiple replicas are simultaneously accessible at all (i.e. “live” or “dual active” replication vs. “standby” or “active/passive” replication) then it’s often easy to use the nearest replica as a cache. Doing so can yield big advantages for little work, but it can also be disastrous if such simultaneous access leads to thrashing. I’m sure some of my readers will recognize this phenomenon as it applies to simultaneous access through two disk controllers with “auto-trespass” enabled, leading to 100x performance degradation. Even then, using a replica as a cache is still safe; it’s just terribly inefficient.

By contrast, using a cache as though it were a replica might qualify as one of the classic mistakes in computing. Often, you can get away with it 99% of the time, until that one time when the cache does what a cache does and returns stale data. Then you can have very hard-to-debug data corruption or misbehavior on your hands. This is but one example of a lesson I often have to pound into people, regardless of whether caching or replication is involved. In fact, it’s so important that I’ll put it in a quote box.

Never make copies of data without a strategy for dealing with currency/consistency issues.

As I said in Ned’s thread, “don’t sweat it and live with the consequences” is a valid strategy so long as it’s a conscious choice; simply ignoring or forgetting the issue is not.
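
For a concrete (and deliberately contrived) picture of the failure mode, here is a small Python sketch; the dicts and function names are made up for illustration. The risky version does a read-modify-write that trusts the cache as if it were a replica, quietly reverting other people’s updates when the cached value turns out to be stale; the safer version reads from the authoritative store and invalidates the copy, which is one crude but conscious currency strategy.

```python
# Plain dicts stand in for the cache and the authoritative store. The cached
# value was copied before two later increments reached the store, so it is
# stale: "close enough" 99% of the time.
store = {"counter": 10}
cache = {"counter": 8}

def risky_increment(key, delta):
    # Treats the cached copy as authoritative during a read-modify-write.
    value = cache.get(key, store[key])
    store[key] = value + delta        # silently throws away the two lost increments

def safer_increment(key, delta):
    # Reads from the authoritative store, then invalidates the cached copy.
    # (A real system would also need locking or compare-and-swap.)
    store[key] = store[key] + delta
    cache.pop(key, None)

risky_increment("counter", 1)
print(store["counter"])               # 9, not the 11 you might expect: quiet corruption
```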

…but I digress. Getting back to the topic at hand, another source of confusion seems to revolve around performance. In Ned’s thread, the claim was made that replication can improve performance more than caching. I consider that untrue; what I would say instead is that a hybrid can outperform a “pure cache” (i.e. one which is not also a replica). This can happen for two reasons. The simpler reason is that a replica must be complete, so it’s likely to be larger than a cache and any comparison is really apples to oranges. The more complicated reason is that replication “pushes” data to each replica before any need for it is recognized, whereas caches usually “pull” data in response to a user request. Thus, at the time the user requests data, the hybrid replica/cache will already have it locally while the pure cache might have to request it from the (remote) authoritative store.

This is not really a performance benefit of replication itself, though, which becomes apparent when one considers that push-based caches are also possible, and do exist in the form of web Content Distribution Networks. (CPU and disk caches also tend to have prefetch features which are essentially similar.) Such a cache would, for equal resource levels, outperform a cache/replica hybrid. The supposed performance benefit of replication is really a performance cost associated with how caches are usually implemented. They are implemented that way because it usually involves less complexity, and because there are cases (involving read vs. write ratios and other things I won’t get into) where “push” would be a disaster and “pull” is the correct model. That’s not a mistake, but it does sometimes make it appear that replication is faster than caching when it isn’t. A pure replica will always degrade performance at least somewhat, while a pure cache is supposed to improve it. The advantage accrues not to the function but to devices that happen to implement multiple functions and exploit the synergy between them.
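
To make the push/pull contrast concrete, here is one more minimal Python sketch. The class and method names are invented for illustration and the “remote” store is reduced to a callable; the only point is where the round trip happens.

```python
class PullCache:
    """Pull model: data is fetched from the remote, authoritative store only
    when a local request misses, so the first request pays the round trip."""

    def __init__(self, remote_get):
        self._remote_get = remote_get    # callable that reaches the authoritative store
        self._local = {}

    def get(self, key):
        if key not in self._local:       # miss: fetch on demand
            self._local[key] = self._remote_get(key)
        return self._local[key]


class PushReplica:
    """Push model: the authoritative store propagates each committed write,
    so the data is already local before any request for it arrives."""

    def __init__(self):
        self._local = {}

    def apply_update(self, key, value):
        # Invoked by the authoritative store as part of (or right after) a write.
        self._local[key] = value

    def get(self, key):
        return self._local[key]          # no remote round trip on the read path
```

The pull version pays its latency on the first read; the push version pays it at write time, which is precisely the cost, and the apparent read-side advantage, described above.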