One of the bigger news items in the NoSQL world lately has been the announcement of Bitcask by the folks at Basho. Now, I’ll admit, my first reaction was that the design looks quite a bit like the Memtable/SSTable design already used in Cassandra[1]. When I made this observation on #cassandra, both Jonathan Ellis and Benjamin Black were quick to point out that Bitcask is actually quite a bit simpler than that design. Benjamin, whose involvement with Bitcask was unknown to me until I saw his name in the commit log later, also suggested that I should try it myself to satisfy my curiosity about its performance. Unlike some, I’m not the sort to base claims on bluster or reputation when there are facts to be had, so I decided to spend some time digging into it.

UPDATE: @tlipcon pointed out that the read-latency figures looked unreasonable, and he was right. I had run a version of the reader that used exactly the same sequence of keys as the writer, resulting in sequential reads. See comment #2 for the graphs with this error corrected.

Since @dizzyco had already done some comparisons to innostore[2], I decided to compare against another target – Tokyo Cabinet, another fairly popular local key/value store[3] that also happens to be supported as a Riak back end. I used the same 10KB value size, but tested only the stores themselves, without Riak on top. I tested up to about 2x memory size to factor out the most egregious caching effects, and deliberately used a “stripey” key distribution that’s neither sequential nor random (a rough sketch of the harness follows). Here are the results for writing with Tokyo’s hash-table back end.
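For reference, here’s roughly the shape of the write harness, sketched in Python rather than the code I actually ran. The store object and its put() method are a hypothetical uniform wrapper over whichever back end is being driven, and the stripe counts are illustrative, not the exact values from my runs.

    import os
    import time

    STRIPES = 16               # illustrative: number of interleaved key ranges
    KEYS_PER_STRIPE = 100000   # sized so total data ends up around 2x RAM at 10KB per value
    VALUE = os.urandom(10 * 1024)   # 10KB value, as in the earlier innostore comparison

    def stripey_keys():
        # Walk the stripes round-robin so consecutive writes land far apart
        # in key space -- neither sequential nor random.
        for i in range(KEYS_PER_STRIPE):
            for s in range(STRIPES):
                yield ("key-%02d-%08d" % (s, i)).encode()

    def timed_writes(store):
        # Record per-request latency so percentiles can be plotted later.
        latencies = []
        for key in stripey_keys():
            start = time.time()
            store.put(key, VALUE)
            latencies.append(time.time() - start)
        return latencies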

This is actually a pretty common sort of pattern that I’ve seen with filesystems, databases, and other things – not bad up to a point, then utter crap afterward. The inflection point isn’t quite at memory size, but it’s at the point where the OS likely decides it had better start paging stuff out, and after that things get ugly. Over a fifth of a second for more than 1% of requests? Ouch. For contrast, here’s the same graph for Bitcask.
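(A quick aside on how to read these graphs: each plotted point is the 99th-percentile latency over a window of requests, so “over a fifth of a second for more than 1% of requests” just means that percentile climbed past 200ms. Continuing the sketch above, with an assumed window size rather than the one I actually used:)

    def p99_per_window(latencies, window=1000):
        # Split the latency stream into fixed-size windows and take the
        # 99th percentile of each; these are the points the graphs show.
        points = []
        for i in range(0, len(latencies), window):
            chunk = sorted(latencies[i:i + window])
            if chunk:
                points.append(chunk[int(0.99 * (len(chunk) - 1))])
        return points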

You have to look at the scale on the left to really appreciate how different this is – the 99th-percentile number mostly stays under 2ms, and rises only to about 6ms in the worst samples. Sweet. Of course, given Bitcask’s design we might expect that the read numbers wouldn’t look quite so good. Let’s take a look.

That turns out to be even better. Very sweet. I do have to throw in one caveat, though: I expect that reads will be more sensitive than writes to the size of the index, and I wasn’t driving enough I/O to really exercise that. Maybe when David’s runs finish, they’ll shed some more light on that part of the picture. Still, that’s a nice-looking graph. Let’s complete the square and look at the same thing for Tokyo.

Not quite as bad as the write graph, but still the same sudden degradation when the system starts paging. To be fair, maybe the B-tree flavor would work better, and everybody says that tuning can really help with TC performance, but I pretty deliberately sought the untuned numbers for both systems.

What can we conclude from this? Mostly that Bitcask performs as advertised, at least for these simple tests. Without tuning, both reads and writes seem to be amortizing the cost of seeks over a very large number of operations. Nice job, Justin and David!
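For anyone who hasn’t looked at how Bitcask works: the reason the write curve stays so flat is that every put appends to the currently active data file and updates an in-memory hash (the “keydir”) mapping each key to the file and offset of its latest value, so writes never seek into old data and reads need at most one seek. Here’s a toy model of that idea – my own simplification for illustration, not Bitcask’s actual code, which also handles CRCs, timestamps, file rotation, hint files, and merging:

    class ToyBitcask:
        # Grossly simplified single-file model of the Bitcask idea:
        # an append-only log plus an in-memory key -> (offset, size) index.
        def __init__(self, path):
            self.data = open(path, "a+b")
            self.keydir = {}                     # key -> (offset, size)

        def put(self, key, value):
            self.data.seek(0, 2)                 # find the end of the log
            offset = self.data.tell()
            self.data.write(value)               # append-only: never rewrite old data
            self.data.flush()
            self.keydir[key] = (offset, len(value))

        def get(self, key):
            offset, size = self.keydir[key]      # one in-memory hash lookup...
            self.data.seek(offset)               # ...then at most one seek on disk
            return self.data.read(size)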

[1] Actually what both remind me of, more than anything, is one of several storage designs we worked out at Revivio in ~2003. This one had a similar “containerized” structure, with a two-level filter system that IMO worked better for our purposes than Bloom filters. Unfortunately, it hadn’t yet become clear that the “big sales are just around the corner if we get this out” message coming from VP-land was total BS, so we bowed to (apparent) expediency and stuck with a MySQL-based system. That haunted us for years, but that’s a rant for another time.

[2] I’m not sure what to make of those graphs, actually. They seem to show approximately 400 requests/sec for Bitcask, but mean latency of 20ms for gets and 60ms for puts. If it’s a 10-thread, read-heavy workload (let’s say 80% gets), that might make some sense, but that does seem like an odd place to start measuring. I look forward to seeing more detail.
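The back-of-the-envelope version of that guess, with the 10 threads and the 80/20 split being my assumptions rather than anything stated in the innostore post: each thread waits on one request at a time, so throughput is roughly thread count divided by mean latency.

    threads = 10                                 # assumed concurrency
    mean_latency = 0.8 * 0.020 + 0.2 * 0.060     # 80% gets at 20ms, 20% puts at 60ms = 28ms
    print(threads / mean_latency)                # ~357 requests/sec, in the neighborhood of 400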

[3] Popular doesn’t mean good. I’ve heard more than one person say that TC tends to choke on big datasets (much larger than I tested), that it can crash and leave data unrecoverable, and so on – all good and sufficient reasons for Basho to seek alternatives, quite apart from any performance difference. Maybe some would prefer to interpret this comparison as an indictment of TC rather than a validation of BC, but I think the BC numbers would look good in just about any light.