Comparing Key/Value Stores

Jeff Darcy October 26, 2009 17:16

Over the last few days, I’ve been using some of my spare time to benchmark several of the key/value stores that are out there. I don’t think all that much of limited-to-memory stores, because I think a store designed to be persistent can do everything a memory-based store can do by the simple expedient of running it on top of a RAM disk or filesystem. In a world where horizontal scaling is the name of the game, the argument that a memory-based store can avoid syscall overhead or some such is much less compelling than the argument that the human resource cost of deploying a whole separate infrastructure outweighs the material resource cost of deploying a few more servers. Whether you agree or disagree with that reasoning, what I set out to compare was a set of persistent stores. They varied in lots of other ways, though. Here are the competitors and how they distinguish themselves:

  • Tokyo Tyrant is a single-node data store which would require an extra layer to make it into the kind of multi-node data store I think most people really want/need (more about that later). It also has a very simple data model and practically no security model, but everyone says it’s fast so it’s a good baseline.
  • Chunkd is the lower layer of Project Hail, and is mostly equivalent to Tyrant except that it has a notion of multiple users and channel security (which was turned off for benchmarks).
  • Voldemort differs from Tyrant along the other axis – it still has a simple data model and no security features to speak of, though it does span nodes and supports multiple stores served through a single port.
  • Cassandra and Riak add richer (but different) data models to Voldemort’s functionality, but testing was for a very simple “one key maps to one value” model.
  • Tabled is the upper layer of Project Hail, and is more of an S3-equivalent large object store with full multi-user access control etc. I figured it would perform worse than the others (almost but not quite true) but I was curious to see how much of a penalty was associated with the extra functionality.

To test performance, I wrote a Python script which could speak to the interfaces for any of these stores. I had to create such an interface for chunkd using the ctypes module, but that was no big deal. I then ran tests as follows:

  • Use two virtual machines on my desktop. I’ll re-compare with EC2 instances, including multiple servers where appropriate, soon.
  • Run ten client threads on one machine, vs. all the server parts on the other, for a minute.
  • Test write, warm read, and cold read (i.e. daemon killed, all caches dropped, daemon restarted).
  • Test 100-byte and 1000-byte values. I probably should have tested (in future will test) 10K or larger, but for now I was interested in two small sizes.

Here are the performance highlights. I’ll post full numbers after I re-run the tests on EC2 as I mentioned; all I’ll say for now is that the per-second rates for all tests ranged from low hundreds to low thousands.

  • Most of the candidates were two to four times faster for warm reads than for writes, except for Tyrant which was only about 10% faster.
  • Half of the candidates (Chunkd, Cassandra, Riak) were two to three times as fast for warm reads as for cold. The other half (Tyrant, Voldemort, Tabled) were only 0-20% faster.
  • Tyrant was the clear winner, more than 20% faster than its closest competitor even for the warm-read cases and showing very little drop-off for either writes or cold reads – making it up to four times as fast in some tests.
  • Chunkd almost kept up with Tyrant in the warm-read case, but fell far behind otherwise. It still took second place in every test, though.
  • After that, things slowed down pretty consistently. Cassandra was only 30-85% as fast as chunkd, Voldemort only 30-80% as fast as Cassandra (except for the larger cold-read test where it actually pulled ahead by 25%), and tabled only 50-90% as fast as Voldemort. Trailing the pack by a considerable margin was Riak, which was only 15-45% as fast as tabled.
  • By the time all was said and done, Tyrant was 8-24 times as fast as Riak.

The fairly obvious conclusion is that the performance differences between alternatives in this space are definitely big enough to be worth thinking about. A top-to-bottom ratio of 2:1 or 3:1 might be made up with tuning or by modifying applications to take advantage of each store’s special features, but at 24:1 that starts to seem unlikely. Like Leonard Lin, I find Tyrant’s performance lead compelling enough to think that a consistent-hashing layer on top of Tyrant might kick ass. Like Leonard, I’m not convinced that LightCloud really delivers on that promise – though it’s interesting enough that I’ll probably include it in my next round of comparisons.

As I said, I still need to re-test in a real cloud environment to validate these results, plus I’ll test more sizes. I’ll probably also drop tabled and Riak in favor of LightCloud and maybe Keyspace. I’ll definitely publish the benchmark program and any ancillary pieces (such as my Python interface for chunkd), and go into more detail about exactly how everything was set up. I’ve found way too little information in this area that’s even empirical, let alone quantitative, so maybe I can get something started here to fill the void.

7 Responses to “Comparing Key/Value Stores”

  1. Dave Smithon 27 Oct 2009 at 4:46 pm

    Interesting analysis. I have a random assortment of observations and questions, if you don’t mind:

    * I respectfully posit that load testing on a VM is something of an oxymoron. You’d never (or at least I wouldn’t) run these sorts of apps in a VM for production purposes due to the unpredictable latency introduced. Even more so when both VMs are competing for the same hardware.

    * The warm vs. cold measurements may be misleading due to the caching of the VM and the underlying O/S. To get an accurate picture (assuming VM testing is acceptable), you’d need to reboot both the VMs AND the host machine to get anything meaningful.

    * There are a lot of numbers in your analysis, but no breakdown of actual throughput or latency. IMNSHO, latency at some # of requests per second are the two dimensions that are most informative for this type of performance testing. You also would need a break down of read vs. write requests, since light write load could be significantly faster.

    * Without knowing how big your key space (assuming this is all key/value pairs) is, it’s hard to determine how much load was actually exerted on the data stores. Also, using a fixed-size key can be misleading as it permits some structures unfair advantages since they try to match up similar block sizes.

    * Another big question is how long your load tests ran. Most data stores slow down as the amount of data grows (typically logarithmically, IIRC). The severity of that curve is tightly coupled to the data structures used. Implementation details matter, a lot, when you start talking more than a few gigabytes of data.

    * What steps did you take to ensure your load generator did not introduce a bottleneck? My personal experience is that writing the load generator is as hard, or harder, than writing the original server.

    * Did you use multiple connections to each server? What protocol was used? The client protocol (and number of conns) can make a HUGE dent on performance measurements.

    * Saying that Tyrant is 8-24 times faster than Riak is a bit apples-to-oranges. Tyrant is basically a btree w/ sockets on front. Riak is an eventually-consistent, multiple-replica data store. It’s like comparing a bullet-proof vest to a tank — I know which one I’d prefer to be using when things go south. :)

  2. Jeff Darcyon 27 Oct 2009 at 5:30 pm

    I respectfully posit that load testing on a VM is something of an oxymoron. You’d never (or at least I wouldn’t) run these sorts of apps in a VM for production purposes due to the unpredictable latency introduced.

    I don’t think that’s true at all. These kinds of stores are quite popular in the cloud crowd, where they are often deployed in virtualized environments. Since most of them lack any kind of multi-user security, running them as a non-virtualized service is untenable and anyone who wants to use them in the cloud has little choice but to keep them on-instance behind firewalls.

    The warm vs. cold measurements may be misleading due to the caching of the VM and the underlying O/S.

    If that were the case, I’d expect the warm vs. cold numbers to end up being similar, but instead they are quite different. More significantly, they are consistently different. Hypervisors actually do less caching than you’d think, because they generally want to devote as much memory as possible to guests which do their own caching.

    IMNSHO, latency at some # of requests per second are the two dimensions that are most informative for this type of performance testing.

    IMNSHO and IMX that’s true sometimes. Other times it’s requests/second that matters, still other times it’s MB/second. I picked a value and measured it. Maybe I’ll measure another value some time, with different workloads and all sorts of other methodological twists. Gotta start somewhere.

    Another big question is how long your load tests ran.

    A minute, as I said above. I’m aware that this doesn’t measure effects that kick in when the database is large, and that those can be very important, but it’s also important to measure the starting-line performance and since my time is limited that’s what I did.

    What steps did you take to ensure your load generator did not introduce a bottleneck?

    Not much. OTOH, there seemed to be little sign of it. I do have some experience measuring both network and storage performance, and I know where to look for excess load. There just didn’t seem to be any on the client side, though there was plenty – CPU utilization, I/O wait time, etc. – visible on the server side in pretty much all cases. If I were trying to measure latency I’d need a more sophisticated test program (part of the reason I didn’t go that route) but for testing throughput multiple instances of a simple test program seems to work fine.

    Did you use multiple connections to each server?

    Ten, as I said above.

    What protocol was used?

    Generally “best available” – e.g. native instead of memcached for Tyrant, plain TCP instead of HTTP for some others, etc. I think the quality of the available interfaces is part of the performance equation. Yes, you can do better with some of these by rolling your own hyper-tweaked interface, just as you can do better by using features unique to each, but then the comparisons between them tend to become meaningless. Generally, “out of the box” and “tweaked to the max” are the only configurations worth comparing. This is an “out of the box” comparison and I doubt that I’ll ever have time to compare “tweaked to the max” for more than a couple of alternatives.

    Saying that Tyrant is 8-24 times faster than Riak is a bit apples-to-oranges. Tyrant is basically a btree w/ sockets on front. Riak is an eventually-consistent, multiple-replica data store. It’s like comparing a bullet-proof vest to a tank — I know which one I’d prefer to be using when things go south.

    Bear in mind that these are single-node tests so they’re more directly comparable than you seem to want. As I discussed above (I see a pattern here), a 24:1 difference in single-node performance strongly implies that a distributed and eventually consistent layer on top of the faster alternative might yield a significntly faster solution than a similar layer on top of the slower one – whether or not such a layer already exists. At that level of difference, one could say to hell with eventual consistency and do synchronous mirroring between a half-dozen Tyrant instances, and still be faster than Riak. I actually kind of wanted Riak to do better, because I like its feature set and what I’ve read about its architecture passes the “how I would have done it” test, but I ran the tests and it fared very poorly even compared to other distributed and eventually consistent stores like Cassandra and Voldemort. I know which one of those I’d prefer to be using when things go south. This is why we do tests, to see whether first impressions survive actual experience, and so far the results seem to suggest that Cassandra might be a better choice. Maybe the results will be different when I run multi-server tests, but I sure as hell wouldn’t assume any such thing.

  3. Dave Smithon 27 Oct 2009 at 6:32 pm

    Fair enough re: # of client connections and duration — rookie mistake on my part. :)

    I think we’ll have to agree to disagree about whether or not VMs are good testing grounds. At the root of my dislike is the hypervisor-introduce latency, particularly for disk access; I’ve seen enough in production to make me very leery of using it for low-latency situations. Perhaps that’s acceptable in some use cases.

    I’m not fully sure how to respond to your last comment block. Perhaps I should state my biases for starters. I’ve tried using tokyo-cabinent in production and found it to be fast in the short-term and then increasingly slow/latency-unhappy as the amount of data grew. In addition,there were also some significant issues with recovery of TC data in the scenario where the server or host fails; speed (for me) is pointless if you can’t guarantee (most) of my data will be present in the event of a power failure. FWIW, I’m now using embedded Inno and have found that to exceed 2x+ what TC could do in raw requests/sec, while providing decent data integrity.

    I’ve also benchmarked both Voldemort and Riak. Yes, V is ~4x as fast as Riak, but it crashes when you cross a threshold of data storage/memory usage. I’ve built my own version of the Dynamo system and have been able to approach V in terms of speed. All this to say that it’s harder, maybe far harder, than you seem to think to build something the provides the sort of data integrity, speed and multi-replica storage on top of Tyrant. So there’s a trade-off there of time to implementation vs. raw perf — that 24:1 number dwindles quickly when faced with the harsh realities of actually implementing a solution. Tyrant isn’t a valid solution for distributed data stores right now, maybe it can be with time. To compare it against existing data stores and say it’s 24 times faster is misleading and unfair, in my opinion. I appreciate your comment re: comparing first impressions with actual experience, but would argue that it’s equally important to ensure that the tests are providing meaningful comparisons.

    It’s not my intention to be an ass about this — but perhaps I’m succeeding in spite of myself. :) I wish more people put the time and effort into benchmarking, and I thank you for the effort.

  4. Jeff Darcyon 27 Oct 2009 at 6:57 pm

    Well, Dave, look at it this way: most of the comparisons out there are entirely measurement-free. I know this is just a tiny little step toward something more quantitative, but the longest journey begins with a single step. Instead of sitting around doing nothing, or devising some test plan that would take more time and resources than I have to execute, I decided to start walking and talking as I go instead of waiting. I seriously hope that more steps will be taken, either by myself or by others, but the important thing is that even lame measurements are better than no measurements (and measurements that aren’t shared are effectively no measurements as far as anyone else is concerned).

    As for how difficult it is to implement a more robust distributed system, I’m pretty well aware. There are enough bits of my background all over the site to make any convenient assumption to the contrary quite risible. After all, you say you’ve replicated Dynamo. Why do you assume I couldn’t as well, and BTW is your version available anywhere? The 24:1 ratio is just a single observation, not the be-all and end-all of the observations I intend to make or expect others to make. The fact is that an efficient single-node data store is an essential building block for any distributed data store. If one alternative is 24x as fast as another, it’s not at all unreasonable to consider that it might be worthwhile to make it properly distributed despite the difficulty. In this particular case, it might well be possible to combine the upper layer from one of the already-distributed stores with a more efficient single-node store to get something better than either as it is now. As I said in a previous article, scalability is not the same as performance but it’s no excuse for ignoring performance either.

  5. Justin Sheehyon 28 Oct 2009 at 12:08 am

    (disclaimer: I work on Riak)

    Jeff, I entirely agree with you that measurements that aren’t shared are effectively no measurements. Can you please share your measurements and your benchmarking code? I see no measurements here.

    You are also very right that an efficient single-node data store is an essential building block for any distributed data store. However, you’re comparing basic building blocks with entire distributed systems, which is a little confusing to me. Your point is a good one — if you set aside TC’s problems with recovery, it is among the many things whose shape makes it a possible underlying storage container for Riak.

    You mentioned that this is an “out of the box” comparison, but Riak’s “out of the box” configuration is optimized for its more common use cases… which are deployments on three or more machines. Among other things, your test was almost certainly writing and reading three replicas of each object on Riak while the other systems were likely only using a single copy. If you only wanted single-system behavior this is easy to set up in Riak, but the default behavior was almost certainly costing a great deal of wasted extra work in this situation.

    You said that you used the best available protocols for each system, choosing native interfaces instead of HTTP, etc… but it sounds like you only did that for some systems. It seems that you used Riak’s HTTP interface instead of the native client. While the HTTP interface is very nice and provides some useful features (and excellent interoperability) it certainly isn’t very objective to choose it in this situation. In addition to the protocol marshaling, that interface must do additional computation to produce headers such as Last-Modified and Etag as it goes out of its way to be very well-behaved and cache-friendly HTTP.

    Without you providing any of your benchmark code, system specifications, or your output values and statistical methods, it is impossible to have any useful discussion about the “results”. Just saying “8-24x faster” isn’t just apples-to-oranges, it is also entirely vacuous without both data and repeatable methods backing it up.

  6. Jeff Darcyon 28 Oct 2009 at 7:31 am

    Yes, Justin, I used the JSON/HTTP interface because the descriptions of the lower-level interface were Erlang-specific and inadequate. If you like, in the next round I’ll reverse engineer enough of that interface to use it. Would that make you happy? Would you like to place any bets on how much that will change the overall picture? It’s worth noting that Riak trailed even tabled in the initial results, even though tabled also uses an HTTP interface and offers more functionality. If your HTTP implementation is so bad that it’s the only culprit, then that’s useful information all by itself. I’ve seen people play the “benchmark this, recommend that” game for twenty years and I have no tolerance for it.

    As for sharing more detailed results, I’ve already said I’ll provide them when I’ve run what we all seem to agree are the more meaningful tests. I’m just providing a first look here, not a last word. How are impressions now and details later worse than just details later? How are they worse than nothing now and nothing later, which seems to be what most of the developers for these systems are providing? For example, this is from your own FAQ.

    Performance depends on many factors, including hardware and network parameters as well as a great many tunable parameters in the way you set up your cluster and the way that you use Riak from an application. We’ve found it to be fast enough for our purposes, and our goal is not to be “fastest” but rather to stay “fast enough” as the system grows, as hosts fail, and so on. That said, as soon as we get a chance to produce a general, reproducible benchmarking suite, we’ll share it with you.

    So, how’s that effort going? Would you like to collaborate on developing such tests? You clearly have some expertise to share, and that would be valuable.

  7. kuntharon 31 Oct 2009 at 9:02 am

    I really would like to see second round benchmarks with a little help from owners.
    Just pinning this message to keep fire alive…

Comments RSS

Leave a Reply