One of the bigger news items in the NoSQL world lately has been the announcement of Bitcask by the folks at Basho. Now, I’ll admit, my first reaction was that the design looks quite a bit like the Memtable/SSTable design already used in Cassandra[1]. When I made this observation on #cassandra, both Jonathan Ellis and Benjamin Black were quick to point out that Bitcask is actually quite a bit simpler than that design. Benjamin, whose involvement with Bitcask was unknown to me until I saw his name in the commit log later, also suggested that I should give it a try it myself to satisfy my curiosity about its performance. Unlike some, I’m not the sort to base claims on bluster or reputation when there are facts to be had, so I decided to spend some time digging into it.
UPDATE: @tlipcon pointed out that the read-latency figures looked unreasonable, and he was right. I had run a version of the reader that used exactly the same sequence of keys as the writer, resulting in sequential reads. See comment #2 for the graphs with this error corrected.
Since @dizzyco had already done some comparisons to innostore[2], I decided to compare against another target – Tokyo Cabinet, which is another fairly popular local key/value store[3] and also happens to be supported as a Riak back end. I used the same 10KB size, but tested only the stores themselves without Riak on top. I tested up to about 2x memory size to factor out the most egregious caching effects, and deliberately used a “stripey” key distribution that’s neither sequential nor random. Here are the results for writing with Tokyo’s hash tables.

This is actually a pretty common sort of pattern that I’ve seen with filesystems, databases, and other things – not bad up to a point, then utter crap afterward. The inflection point isn’t quite at memory size, but it’s at the point where the OS is likely to decide it better start paging stuff out, and after that things get ugly. Over a fifth of a second for more than 1% of requests? Ouch. For contrast, here’s the same graph for Bitcask.

You have to look at the scale on the left to really appreciate how different this is – the 99th-percentile number mostly stays under 2ms, and only up to 6ms in the worst samples. Sweet. Of course, given Bitcask’s design we might expect that the read numbers wouldn’t look quite so good. Let’s take a look.

That turns out to be even better. Very sweet. I do have to throw in one caveat, though. I expect that reads will be more sensitive than writes to the size of the index, and I wasn’t testing enough I/O to really exercise that. Maybe when David’s runs finish they’ll shed some more light on that part of the picture. Still, that’s a nice-looking graph. Let’s complete the square and look at the same thing for Tokyo.

Not quite as bad as the write graph, but still the same sudden degradation when the system starts paging. To be fair, maybe the B-tree flavor would work better, and everybody says that tuning can really help with TC performance, but I pretty deliberately sought the untuned numbers for both systems.
What can we conclude from this? Mostly that Bitcask performs as advertised, at least for these simple tests. Without tuning, both reads and writes seem to be amortizing the cost of seeks over a very large number of operations. Nice job, Justin and David!
[2] I’m not sure what to make of those graphs, actually. They seem to show approximately 400 requests/sec for Bitcask, but mean latency of 20ms for gets and 60ms for puts. If it’s a 10-thread read-heavy workload (let’s say 80%) that might make some sense, but that does seem like an odd place to start measuring. I look forward to seeing more detail.
[3] Popular doesn’t mean good. I’ve heard more than one person say before that TC tends to choke on big datasets (much larger than I tested), that it can crash and leave data unrecoverable, etc. – all good and sufficient reasons for Basho to seek alternatives quite aside from any performance difference. Maybe some would prefer to interpret this comparison as an indictment of TC rather than a validation of BC, but I think the BC numbers would look good in just about any light.
Here’s some more data using the two stores’ respective “sync on every put” flags, otherwise the same. Tokyo:
And Bitcask:
Whoa. That’s ugly. The Tokyo result isn’t that different than it was before, except for more 99th-percentile samples coming in at 10-20ms after the midpoint – which makes me wonder a bit whether it’s really honoring that sync flag – and a few more going over 100ms. Not sure what to make of that Bitcask graph, though. The numbers are very consistent, but consistently bad. 15ms average? 40ms 99th-percentile? It sort of implies that each put is not doing just one seek but multiple. I’m going to speculate a bit before I look at the code – always a bad idea – to guess that the sync code is doing more work than it really needs to but the fast path for put is amortizing that cost over even more requests than I’d thought. That suggests opportunities for further optimization, which is actually very good news for a project so young. Get the sync numbers down, and the async numbers should be even more awesome.
As mentioned in the update above, @tlipcon pointed out an error in my methodology. Here are the graphs corrected so that the writer and reader use independent key sequences so that they’re forced to do non-sequential (though still not exactly random) I/O. Tokyo:
Bitcask:
Here we see that the initial performance of Tokyo for reads is gone, and where we got 99th-percentile numbers in the 40-50ms range it’s now 50-75ms throughout. The difference for Bitcask is even more dramatic – 99th percentile is now 30-40ms most of the time – but still significantly better than Tokyo. That’s also more in line with @dizzyco’s numbers, depending on what kind of hardware he’s using. While not as stellar as before, these results are still very respectable. The numbers for put remain as they were before.
i like bitcask but it’s a bit unfair. buy keeping index and data separated it’s just to easy to index the sorted list which is in the end the data file.
with TC index and data are mixed, which requires defragmentation & C. This is of course going to be paid in terms of performance.