Everyone has their own unique set of interests. Professionally, mine is distributed storage systems. I’m mostly not very interested in systems that are limited to a single machine, which is not to say that I think nobody should be interested but just that it’s not my own personal focus. I believe somewhat more strongly, and I’m sure more controversially, that “in-memory storage” is an oxymoron. Memory is part of a computational system, not a storage system, even though (obviously) even storage systems have computational elements as well. “Storage” systems that are limited to a single system’s memory are therefore doubly uninteresting to me. Any junior in a reputable computer science program should be able to whip up some sort of network service that can provide access to a single system’s memory, and any senior should be able to make it perform fairly well. Yawn. It was in this context that I read the following tweet today (links added).

what does membase do that redis can’t ? #redis #membase

Yeah, I guess you could also ask “what does membase do that iPhoto can’t” and it would make almost as much sense. They’re just fundamentally not the same thing. One is distributed and the other isn’t. I don’t mean that membase is better just because it’s distributed, by the way. It’s not clear whether it’s a real data store or just another “runs from memory with snapshots to/from disk” system targeting durability but not capacity. In fact, many such systems don’t even provide real durability if they’re based on mmap/msync and thus can’t guarantee that writes occur in an order which facilitates later recovery. By failing to make proper use of either rotating or solid-state storage, they also definitely fail to provide a cost-effective capacity solution. In addition to that, membase looks to me like a fairly incoherent collection of agents to paper over the gaping holes in the memcache “architecture” (e.g. rebalancing). No, I’m no particular fan of membase, but the fact that it’s distributed makes it pretty non-comparable to Redis. It might make more sense to compare it to Cassamort or Mongiak. It would make more sense still to compare it to LightCloud or kumofs, which already solved essentially the same set of problems via distribution add-ons to existing projects using the same protocol as membase. Comparing to Redis just doesn’t make sense.
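To make the write-ordering point concrete, here is a minimal sketch of the alternative to mmap/msync: an append-only log where each record is fsynced before the write is acknowledged, so the order of records on disk matches the order of acknowledged writes and recovery is a simple replay. This is only an illustration and not how membase, Redis, or any other project mentioned here actually persists data; the `LoggedStore` class, its JSON-lines record format, and the file layout are all assumptions made up for the example.

```python
import json
import os


class LoggedStore:
    """Toy key-value store (hypothetical, for illustration only).

    Every write is appended to a log and fsynced before the in-memory
    dict is updated, so the on-disk record order matches the order in
    which writes were acknowledged.
    """

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()
        self.log = open(path, "ab")

    def _replay(self):
        """Rebuild memory from the log, dropping a torn final record."""
        if not os.path.exists(self.path):
            return
        good_bytes = 0
        with open(self.path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # incomplete last record from a crash
                try:
                    key, value = json.loads(raw.decode("utf-8"))
                except ValueError:
                    break  # corrupt record: trust nothing after it
                self.data[key] = value
                good_bytes += len(raw)
        # Cut the log back to the last complete record before appending.
        os.truncate(self.path, good_bytes)

    def put(self, key, value):
        record = json.dumps([key, value]).encode("utf-8") + b"\n"
        self.log.write(record)
        self.log.flush()              # push Python's buffer to the OS
        os.fsync(self.log.fileno())   # ask the OS to push it to the device
        self.data[key] = value        # only now is the write visible

    def get(self, key):
        return self.data.get(key)

    def close(self):
        self.log.close()
```

With a plain mmap'd region, by contrast, the kernel is free to write dirty pages back in whatever order it likes, so a crash can leave a later update on disk without the earlier one it depended on. The sketch says nothing about capacity; everything still has to fit in memory.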

But wait, I’m sure someone’s itching to say, there are sharding projects for Redis. Indeed there are, but there are two problems with saying that they make Redis into a distributed system. Firstly, adding a sharding layer to something else doesn’t make that something else distributed; it only makes the combination distributed. Gizzard can add partitioning and replication to all kinds of non-distributed data stores, but that doesn’t make them anything but non-distributed data stores. Secondly, the distribution provided by many sharding layers, and particularly those I’ve seen for Redis, is often of a fairly degenerate kind. If you don’t solve the consistency or data aggregation/dependency problems or node addition/removal problems that come with making data live on multiple machines, it’s a pretty weak distributed system. I’m not saying you have to provide full SQL left-outer-join functionality with foreign-key constraints and full ACID guarantees and partition-tolerant replication across long distances, but you can’t just slap some basic consistent hashing on top of several single-machine data stores and claim to be in the same league as some of the real distributed data stores I’ve mentioned. You need to have a reasonable level of partitioning and replication and membership-change handling integrated into the base project to be taken seriously in this realm.
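For anyone who wants to see how little “some basic consistent hashing on top of several single-machine data stores” actually buys, here is a rough sketch. It is not taken from any real Redis sharding project; `NaiveShardedStore` and `lost_keys` are hypothetical names, plain dicts stand in for the single-machine stores, and the MD5-based ring with 64 virtual points per node is just one common way to do it.

```python
import bisect
import hashlib


def _point(label):
    """Map a string to a position on the hash ring."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class NaiveShardedStore:
    """Consistent hashing over independent stores, and nothing more.

    Hypothetical example: no replication, no data migration when
    membership changes, no cross-node consistency of any kind.
    """

    def __init__(self, node_names, vnodes=64):
        self.vnodes = vnodes
        self.nodes = {}  # node name -> dict standing in for one store
        self.ring = []   # sorted list of (ring point, node name)
        for name in node_names:
            self.add_node(name)

    def add_node(self, name):
        self.nodes[name] = {}
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_point(f"{name}#{i}"), name))
        # Note what is missing: keys the new node now owns are NOT
        # copied over from their previous owners.

    def owner(self, key):
        # First ring entry at or after the key's point, wrapping around.
        i = bisect.bisect_right(self.ring, (_point(key), ""))
        return self.ring[i % len(self.ring)][1]

    def put(self, key, value):
        self.nodes[self.owner(key)][key] = value

    def get(self, key):
        return self.nodes[self.owner(key)].get(key)


def lost_keys(store, keys):
    """Keys that were written earlier but can no longer be read."""
    return [k for k in keys if store.get(k) is None]
```

Write a batch of keys, call `add_node` once, and `lost_keys` comes back non-empty: the data is still sitting on the old nodes, but the ring now routes those reads to the empty newcomer. Handling that properly (plus replication and node removal) is exactly the membership-change work that has to be part of the base project.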

Lest anyone think I’m setting the bar too high, consider this list of projects. That’s a year and a half old, and I count seven projects that meet the standard I’ve described. There are a few more that Richard missed, and more have appeared since then. There are already close to two dozen more-or-less mature projects in this space, not even counting things like distributed filesystems and clustered databases that still meet these criteria even if they don’t offer partition tolerance. It’s already too crowded to justify throwing every manner of non-distributed or naively-sharded system into the same category, even if they have other features in common. Redis or Terrastore, for example, are fine projects that are based on great technology and offer great value to their users, but my phone pretty much fits that description too and I don’t put it in the same category either. Let’s at least compare apples to other fruit.