Comparing Key/Value Stores, round 2

Jeff Darcy October 29, 2009 11:10

As promised, here are some more detailed results from my comparisons of various key/value stores. As mentioned previously, the methodology is pretty simple: set up a server on one (virtual) machine, fire up ten client threads on another, let it run for one minute, and count how many requests got through. I did this using a Python script per client thread along with a bash script to run all ten and tally results. To test cold reads, I’d stop all the servers and use “echo 3 > /proc/sys/vm/drop_caches” before restarting. As noted in the comments to the previous post, this approach does have some limitations or deficiencies: it measures only throughput rather than latency, it doesn’t include a warmup period to factor out high startup costs, it doesn’t generate enough data to measure high-key-count behavior, etc. I also quite deliberately measure by using the longest run time of the ten spawned threads, because in my experience worst-case component performance is usually the dominant factor in overall performance and inconsistent results are effectively bad results. FWIW, I didn’t actually see all that much variation in thread completion times. Anyway, here are the stores and interfaces I used:

  • tabled (git clone on 10/27) using boto
  • Cassandra 0.4.1 using thrift
  • Riak (hg clone on 10/27) using jiak
  • Voldemort 0.56 using my own voldemort.py (I had to fix a tcp:// URL-parsing glitch, and also fixed it so that it doesn’t disconnect/reconnect on every request)
  • Tokyo Tyrant 1.1.37 (Cabinet 1.4.36) using pytyrant
  • chunkd (git clone on 10/27) using my own chunkd.py based on Python’s ctypes module
  • Keyspace 1.2 using the built-in Python interface

I have a spreadsheet of the results, including some more configuration information. I’ll note that every test was run at least twice, sometimes several times on separate sets of instances, and I never saw much variation to indicate that the results were being affected by contention vs. other instances on the same physical machine. The results did shed some interesting light on differences between virtualized environments, which wasn’t my intent, but that information might be useful to some people in its own right. I also don’t think those differences affected the comparison of the stores; the “finish order” did shift around some, but in general if store X significantly outperformed store Y in one environment it was highly likely to do so in another. The results are what they are, and I think they do reflect the inherent performance characteristics of the stores more than anything else. The last general note I’ll make is that all of the results seemed slow to me. I’ve seen lots of claims about this store doing 20K ops/s/node, that store doing 100K, etc. Bunk. Maybe on a well-provisioned non-virtualized node, tweaked to the max, using more threads than any sane application writer would consider, such numbers are achievable. I very much doubt that’s the case for realistic workloads in a realistic environment. Caveat emptor.

I’ve already discussed the results from my own desktop, so let’s start with the EC2 single-node results. I was quite surprised that the results overall were better than on my machine, since the opposite has been true for other tests (e.g. parallel filesystems). Nonetheless, some interesting patterns started to emerge. Excluding Tyrant, chunkd consistently led for warm reads, while Keyspace did so for cold reads. This probably reflects on how well each does network and disk I/O, with chunkd doing the former well (hardly a surprise considering its authors’ backgrounds) and Keyspace doing the latter well. The write results seem to indicate that Cassandra is doing something pretty effective for small writes, but whatever it is doesn’t work for large writes.

Moving on to the EC2 three-node results, things got even more confusing. The first thing to notice is that very little positive scaling was evident. Many of the results were flat, or even significantly worse on three nodes than on one. Having worked in distributed systems for a long time, I was disappointed but not surprised. What’s interesting is that I would expect that result in the environment with higher network latency, but in my experience EC2 does pretty well vs. Rackspace in that regard (for disk I/O it’s the exact opposite). Whatever the reasons, Keyspace continued to lead for cold reads and did (comparatively) well on everything else this time too. The only test where it didn’t outperform Cassandra and Voldemort was 10K writes, and that test had the least variation. Cassandra trailed the pack in every single test this time.

The Rackspace results are more straightforward. For one thing, they show that Rackspace’s provisioning is much better that Amazon’s for this kind of workload. I think having multi-CPU instances is a major part of that. With this different machine balance, a different kind of pattern emerges. Let’s ignore Riak for now, for reasons I’ll get to in a moment. Keyspace ran well ahead of the pack for small reads (both hot and cold) while Voldemort acquitted itself well for larger ones. Cassandra’s small-write advantage seemed to disappear, but instead it outperformed the others for large writes. Clearly, different applications would get different results depending on their data-size distribution and read/write mix.

OK, so why did I say to ignore Riak? To put it bluntly, because it didn’t really work as a distributed system and the results are for only a single node. This was a very straightforward build, but even when I followed the (meager) instructions for cluster deployment exactly, the “doorbell port” used to put the cluster together would never get opened. I was able to confirm (via net:ping) that Erlang processes running on both nodes could connect properly in the general case, but I’m not very conversant with Erlang and I wasn’t getting anything useful from the highly “unique” Riak logging facilities, so I eventually stopped butting my head against it and ran the single-node tests instead. I’m a bit peeved, because the time I wasted on that breakage meant that I didn’t have enough left to test LightCloud as well and I really wanted to. Grrr. The results were still absymal, so that even with perfect scaling and no distribution overhead Riak would be unlikely to compare well even for its best case of small warm reads. There goes the “you should have compared apples to apples” excuse, because these are all apples in this basket. Riak proponents, beware. You’re on thin ice already.

At this point, I’m really liking Keyspace. It was one of the easiest to set up or to write an interface for, and performance was pretty good compared to the others. More importantly, it was pretty consistently good. Another factor here is resource usage. I was pretty disturbed to see Cassandra consume a few percent of my CPU and a tenth of my memory (on Rackspace, where “top” and friends give really weird but slightly informative results) just sitting there, before I’d even made a single connection. Cassandra CPU usage on the servers spiked very high during tests, while Voldemort CPU usage on the client did likewise. Keyspace usage was certainly noticeable, but moderate compared to both of the others. Well done, Keyspace team!

As I’ve been at pains to point out, these results are pretty lame. They’re just the first step on a long road toward a seriously useful qualitative comparison across alternatives in this fairly new space. I invite others to build on and improve these results because, frankly, I’ve run out of time and this has already become too much of a distraction from other things I should be doing. Maybe with some teamwork we can make some comparisons – not just of performance but of persistence/consistency guarantees which definitely vary and even more definitely matter – that an application writer can use to predict which alternative will work best for their particular situation.

7 Responses to “Comparing Key/Value Stores, round 2”

  1. Jonathan Ellison 29 Oct 2009 at 2:35 pm

    (I’m a Cassandra committer.)

    A few notes on Cassandra:
    – as with any JVM-based app, the jvm will grab as much memory as you let it. That doesn’t mean it actually needs that much… Cassandra defaults to a 1GB heap but you can certainly get away with less, especially on simple benchmarks.
    – Also as with any JVM-based app, you need to do about 10k of each op you are benchmarking (per node) to let it JIT things before you start measuring.
    – You’ll easily double performance by setting the log level from DEBUG to INFO (unclear if you actually did this, so mentioning it for completeness)
    – The CPU load you’re seeing is from bad default GC options. the defaults will be fixed for 0.4.2 and 0.5, but it’s easy to tweak for 0.4.1: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/200910.mbox/%3Ce06563880910222021k6e84262cl912bf80772c1dbb@mail.gmail.com%3E
    – we found that the GIL has a significant impact on doing threaded testing from Python. prefer multiprocessing.
    – Cassandra isn’t designed to handle large blob values; it gives you a column model and expects you to take advantage of that :)

    Finally, some general notes:
    – 10 concurrent clients isn’t a whole lot, for any of these systems. I doubt you can max out the CPU on a single Cassandra node with 10 client threads, even a relatively wimpy EC2 VM (once you fix GC/logging/etc options). (And yes, CPU is almost always the first bottleneck you hit.)
    – 10 clients is especially not enough when hitting a cluster of 3 — you’re not seeing throughput go up because you’re going from 100% of ops being local to 1/3, so intra-node latency is killing you.

  2. Jeff Darcyon 29 Oct 2009 at 4:24 pm

    Thanks for the tips, Jonathan. I’m pretty sure I did fix the log levels (have to remember where I stashed all the configs to check) but I didn’t tweak much else. If/when I get back to this, I’ll look into some of the other tweaks you mention.

    as with any JVM-based app, you need to do about 10k of each op you are benchmarking (per node) to let it JIT things before you start measuring.

    I’m not sure if this is unduly affecting the results right now. In the course of running all those read and write tests I would have done far more than 10K requests, and I didn’t see any significant change with a second set of runs following the first.

    we found that the GIL has a significant impact on doing threaded testing from Python. prefer multiprocessing.

    I didn’t use Python threading for pretty much this reason. The tests were ten separate processes.

    Cassandra isn’t designed to handle large blob values; it gives you a column model and expects you to take advantage of that

    10K isn’t really that large. It’s at the upper end of what should go into a database row or C/C++ structure, but I’ve seen larger items plenty of times. All it takes is a few long strings/arrays/lists. If there’s a data-model issue here, it’s probably that I’m only using one column. Column-oriented stores like Cassandra can make it possible for an application programmer to use fewer operations to get/set related values, or reduce the amount of data that needs to be transferred for a client-side-serialization approach. Then again, there’s a whole range of other features such as enumerations and extra operations besides get/set/delete that distinguish some stores from others. Keeping results directly comparable often means testing the lowest common denominator, unfortunately, but application developers should keep in mind that there’s more to the story.

    10 concurrent clients isn’t a whole lot, for any of these systems. I doubt you can max out the CPU on a single Cassandra node with 10 client threads, even a relatively wimpy EC2 VM (once you fix GC/logging/etc options). (And yes, CPU is almost always the first bottleneck you hit.)

    I won’t argue that it is, but I’m far from convinced that it should be. Especially in a virtualized environment, where computation is almost as fast as native but network and disk I/O are markedly slower (even with paravirtualized drivers), I don’t think I’d necessarily expect CPU to be the first bottleneck. How much computation really needs to be done per packet or per block, and if CPU is really the limiting factor then how did Keyspace get higher transaction rates with less CPU? Nonetheless, if/when I get back to this I’ll try running with more threads to see what kind of difference it makes. It might make things better, but I wouldn’t be much more surprised if it made things worse.

    10 clients is especially not enough when hitting a cluster of 3 — you’re not seeing throughput go up because you’re going from 100% of ops being local to 1/3, so intra-node latency is killing you.

    The client was completely separate, so it was 100% remote either way.

    FWIW, I don’t mean to be disparaging toward Cassandra at all. If I were doing this “for real” as the basis for something I was actually putting into production, I’d probably do my detailed comparisons with Cassandra and Voldemort (taking advantage of Cassandra’s richer data model as much as possible). Why not Keyspace? Even though it did well in these tests, I remain leery of a single-floating-master architecture at serious scale, and there seems to be more evidence that Cassandra in particular looks better as the dataset gets larger. I think you guys have done good work, and I don’t want that to go unsaid.

  3. Jonathan Ellison 29 Oct 2009 at 9:35 pm

    Thanks — and fwiw it’s clear from where I sit that you’re making a good-faith effort to be fair, so I’m not trying to be prickly either, thought it’s always hard to tell in this medium. :) Good luck!

  4. The xixon 30 Oct 2009 at 10:27 am

    These are apparently very well thought tests and benchmarks, but i can’t explain why many sites (facebook,digg..) started using Cassandra by looking at these numbers. And there is a missing benchmark too : MySQL!

  5. Jeff Darcyon 30 Oct 2009 at 11:48 am

    Good points, xix. I hadn’t thought of baselining MySQL, but it’s a good enough idea that I think I’ll make it a first priority when I get back to this. Thanks!

    As for Cassandra, I think those sites are using Cassandra because it’s a great piece of software. I’ll be the first to admit that these benchmark results are just the tip of the iceberg. In particular, they are specific to virtualized environments and they only test “out of the gate performance” with small datasets. A place like Facebook or Digg is likely to run a bunch of servers natively, which represents a very different performance milieu, and they’re much more concerned with huge datasets. A system like Cassandra that’s well proven at that scale, which also presents a data model and features that they can take advantage of, has huge value. I see the Digg story as a crucial point of validation not just for Cassandra – though I offer my hearty congratulations to that team – but for the entire space of alternative data stores. We’re still in the space where a win for one is a win for all. That will change, but for now let’s enjoy it.

  6. Jeff Darcyon 30 Oct 2009 at 4:41 pm

    BTW, another point about Cassandra taking 100M off the bat. Those are resident pages. I understand the bit about the JVM allocating a large heap, but there doesn’t seem to be any particularly good reason to go touching so much of it so that the pages become resident. If it just touches them once and they can easily be reclaimed when needed then I guess there’s little harm done, though. Checking whether that’s actually the case is now on my list of things to do when I get back to this.

  7. Ismael Jumaon 01 Nov 2009 at 12:39 pm

    Hi Jeff,

    A note regarding the Voldemort tests. The Python client was created as a proof of concept and I am not sure if it has had any production usage. The Java client has had the most testing (it’s what LinkedIn uses) and the performance numbers from the C++ client seem to be good. Some numbers here:

    http://groups.google.com/group/project-voldemort/browse_thread/thread/9dd7ee33305da887/371d23a2afe244c4

    Best,
    Ismael

Comments RSS

Leave a Reply