As promised, here are some more detailed results from my comparisons of various key/value stores. As mentioned previously, the methodology is pretty simple: set up a server on one (virtual) machine, fire up ten client threads on another, let it run for one minute, and count how many requests got through. I did this using a Python script per client thread along with a bash script to run all ten and tally results. To test cold reads, I’d stop all the servers and use “echo 3 > /proc/sys/vm/drop_caches” before restarting. As noted in the comments to the previous post, this approach does have some limitations or deficiencies: it measures only throughput rather than latency, it doesn’t include a warmup period to factor out high startup costs, it doesn’t generate enough data to measure high-key-count behavior, etc. I also quite deliberately measure by using the longest run time of the ten spawned threads, because in my experience worst-case component performance is usually the dominant factor in overall performance and inconsistent results are effectively bad results. FWIW, I didn’t actually see all that much variation in thread completion times. Anyway, here are the stores and interfaces I used:

  • tabled (git clone on 10/27) using boto
  • Cassandra 0.4.1 using thrift
  • Riak (hg clone on 10/27) using jiak
  • Voldemort 0.56 using my own (I had to fix a tcp:// URL-parsing glitch, and also fixed it so that it doesn’t disconnect/reconnect on every request)
  • Tokyo Tyrant 1.1.37 (Cabinet 1.4.36) using pytyrant
  • chunkd (git clone on 10/27) using my own based on Python’s ctypes module
  • Keyspace 1.2 using the built-in Python interface

I have a spreadsheet of the results, including some more configuration information. I’ll note that every test was run at least twice, sometimes several times on separate sets of instances, and I never saw much variation to indicate that the results were being affected by contention vs. other instances on the same physical machine. The results did shed some interesting light on differences between virtualized environments, which wasn’t my intent, but that information might be useful to some people in its own right. I also don’t think those differences affected the comparison of the stores; the “finish order” did shift around some, but in general if store X significantly outperformed store Y in one environment it was highly likely to do so in another. The results are what they are, and I think they do reflect the inherent performance characteristics of the stores more than anything else. The last general note I’ll make is that all of the results seemed slow to me. I’ve seen lots of claims about this store doing 20K ops/s/node, that store doing 100K, etc. Bunk. Maybe on a well-provisioned non-virtualized node, tweaked to the max, using more threads than any sane application writer would consider, such numbers are achievable. I very much doubt that’s the case for realistic workloads in a realistic environment. Caveat emptor.

I’ve already discussed the results from my own desktop, so let’s start with the EC2 single-node results. I was quite surprised that the results overall were better than on my machine, since the opposite has been true for other tests (e.g. parallel filesystems). Nonetheless, some interesting patterns started to emerge. Excluding Tyrant, chunkd consistently led for warm reads, while Keyspace did so for cold reads. This probably reflects on how well each does network and disk I/O, with chunkd doing the former well (hardly a surprise considering its authors’ backgrounds) and Keyspace doing the latter well. The write results seem to indicate that Cassandra is doing something pretty effective for small writes, but whatever it is doesn’t work for large writes.

Moving on to the EC2 three-node results, things got even more confusing. The first thing to notice is that very little positive scaling was evident. Many of the results were flat, or even significantly worse on three nodes than on one. Having worked in distributed systems for a long time, I was disappointed but not surprised. What’s interesting is that I would expect that result in the environment with higher network latency, but in my experience EC2 does pretty well vs. Rackspace in that regard (for disk I/O it’s the exact opposite). Whatever the reasons, Keyspace continued to lead for cold reads and did (comparatively) well on everything else this time too. The only test where it didn’t outperform Cassandra and Voldemort was 10K writes, and that test had the least variation. Cassandra trailed the pack in every single test this time.

The Rackspace results are more straightforward. For one thing, they show that Rackspace’s provisioning is much better that Amazon’s for this kind of workload. I think having multi-CPU instances is a major part of that. With this different machine balance, a different kind of pattern emerges. Let’s ignore Riak for now, for reasons I’ll get to in a moment. Keyspace ran well ahead of the pack for small reads (both hot and cold) while Voldemort acquitted itself well for larger ones. Cassandra’s small-write advantage seemed to disappear, but instead it outperformed the others for large writes. Clearly, different applications would get different results depending on their data-size distribution and read/write mix.

OK, so why did I say to ignore Riak? To put it bluntly, because it didn’t really work as a distributed system and the results are for only a single node. This was a very straightforward build, but even when I followed the (meager) instructions for cluster deployment exactly, the “doorbell port” used to put the cluster together would never get opened. I was able to confirm (via net:ping) that Erlang processes running on both nodes could connect properly in the general case, but I’m not very conversant with Erlang and I wasn’t getting anything useful from the highly “unique” Riak logging facilities, so I eventually stopped butting my head against it and ran the single-node tests instead. I’m a bit peeved, because the time I wasted on that breakage meant that I didn’t have enough left to test LightCloud as well and I really wanted to. Grrr. The results were still absymal, so that even with perfect scaling and no distribution overhead Riak would be unlikely to compare well even for its best case of small warm reads. There goes the “you should have compared apples to apples” excuse, because these are all apples in this basket. Riak proponents, beware. You’re on thin ice already.

At this point, I’m really liking Keyspace. It was one of the easiest to set up or to write an interface for, and performance was pretty good compared to the others. More importantly, it was pretty consistently good. Another factor here is resource usage. I was pretty disturbed to see Cassandra consume a few percent of my CPU and a tenth of my memory (on Rackspace, where “top” and friends give really weird but slightly informative results) just sitting there, before I’d even made a single connection. Cassandra CPU usage on the servers spiked very high during tests, while Voldemort CPU usage on the client did likewise. Keyspace usage was certainly noticeable, but moderate compared to both of the others. Well done, Keyspace team!

As I’ve been at pains to point out, these results are pretty lame. They’re just the first step on a long road toward a seriously useful qualitative comparison across alternatives in this fairly new space. I invite others to build on and improve these results because, frankly, I’ve run out of time and this has already become too much of a distraction from other things I should be doing. Maybe with some teamwork we can make some comparisons – not just of performance but of persistence/consistency guarantees which definitely vary and even more definitely matter – that an application writer can use to predict which alternative will work best for their particular situation.