Bitcask Rocks

One of the bigger news items in the NoSQL world lately has been the announcement of Bitcask by the folks at Basho. Now, I’ll admit, my first reaction was that the design looks quite a bit like the Memtable/SSTable design already used in Cassandra[1]. When I made this observation on #cassandra, both Jonathan Ellis and Benjamin Black were quick to point out that Bitcask is actually quite a bit simpler than that design. Benjamin, whose involvement with Bitcask was unknown to me until I saw his name in the commit log later, also suggested that I should give it a try myself to satisfy my curiosity about its performance. Unlike some, I’m not the sort to base claims on bluster or reputation when there are facts to be had, so I decided to spend some time digging into it.

UPDATE: @tlipcon pointed out that the read-latency figures looked unreasonable, and he was right. I had run a version of the reader that used exactly the same sequence of keys as the writer, resulting in sequential reads. See comment #2 for the graphs with this error corrected.

Since @dizzyco had already done some comparisons to innostore[2], I decided to compare against another target – Tokyo Cabinet, which is another fairly popular local key/value store[3] and also happens to be supported as a Riak back end. I used the same 10KB size, but tested only the stores themselves without Riak on top. I tested up to about 2x memory size to factor out the most egregious caching effects, and deliberately used a “stripey” key distribution that’s neither sequential nor random. Here are the results for writing with Tokyo’s hash tables.
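
For anyone who wants to reproduce the general shape of this test, here’s a minimal sketch of the harness logic in Python. It is not the code I actually ran; the store call is a placeholder for whatever back end you point it at, the 16-way interleave is just one plausible reading of “stripey,” and you should size the key count to roughly 2x your physical memory.

    import time

    VALUE = b"x" * 10240            # 10KB values, as in the test above
    NUM_KEYS = 2 * 1024 * 1024      # size this to roughly 2x physical memory

    def stripey_keys(n, stripes=16):
        # Neither sequential nor random: walk several widely separated
        # "stripes" of the keyspace in an interleaved order.
        stride = n // stripes
        for i in range(stride):
            for s in range(stripes):
                yield "key-%012d" % (s * stride + i)

    def run(op, keys):
        # op is a callable such as lambda k: store.put(k, VALUE), where
        # "store" is whichever engine (Bitcask, TC, ...) is being tested.
        latencies = []
        for k in keys:
            start = time.time()
            op(k)
            latencies.append(time.time() - start)
        latencies.sort()
        for pct in (50.0, 99.0, 99.9):
            idx = min(len(latencies) - 1, int(len(latencies) * pct / 100))
            print("%5.1fth percentile: %.3f ms" % (pct, latencies[idx] * 1000))

    # e.g. run(lambda k: store.put(k, VALUE), stripey_keys(NUM_KEYS))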

This is actually a pretty common sort of pattern that I’ve seen with filesystems, databases, and other things – not bad up to a point, then utter crap afterward. The inflection point isn’t quite at memory size, but it’s at the point where the OS is likely to decide it better start paging stuff out, and after that things get ugly. Over a fifth of a second for more than 1% of requests? Ouch. For contrast, here’s the same graph for Bitcask.

You have to look at the scale on the left to really appreciate how different this is – the 99th-percentile number mostly stays under 2ms, and only up to 6ms in the worst samples. Sweet. Of course, given Bitcask’s design we might expect that the read numbers wouldn’t look quite so good. Let’s take a look.

That turns out to be even better. Very sweet. I do have to throw in one caveat, though. I expect that reads will be more sensitive than writes to the size of the index, and I wasn’t testing enough I/O to really exercise that. Maybe when David’s runs finish they’ll shed some more light on that part of the picture. Still, that’s a nice-looking graph. Let’s complete the square and look at the same thing for Tokyo.

Not quite as bad as the write graph, but still the same sudden degradation when the system starts paging. To be fair, maybe the B-tree flavor would work better, and everybody says that tuning can really help with TC performance, but I pretty deliberately sought the untuned numbers for both systems.

What can we conclude from this? Mostly that Bitcask performs as advertised, at least for these simple tests. Without tuning, both reads and writes seem to be amortizing the cost of seeks over a very large number of operations. Nice job, Justin and David!

[1] Actually what both remind me of, more than anything, is one of several storage designs we worked out at Revivio in ~2003. This one had a similar “containerized” structure, with a two-level filter system that IMO worked better for our purposes than Bloom filters. Unfortunately, it hadn’t yet become clear that the “big sales are just around the corner if we get this out” message coming from VP-land was total BS, so we bowed to (apparent) expediency and stuck with a MySQL-based system. That haunted us for years, but that’s a rant for another time.

[2] I’m not sure what to make of those graphs, actually. They seem to show approximately 400 requests/sec for Bitcask, but mean latency of 20ms for gets and 60ms for puts. If it’s a 10-thread read-heavy workload (let’s say 80%) that might make some sense, but that does seem like an odd place to start measuring. I look forward to seeing more detail.
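
Here’s the back-of-the-envelope arithmetic behind that guess, in Python for convenience; the 10 threads and the 80/20 get/put split are my assumptions, not anything stated in David’s post.

    # Do ~400 req/sec, 20ms mean gets, and 60ms mean puts fit together?
    threads, get_ms, put_ms, get_frac = 10, 20.0, 60.0, 0.8   # assumed workload
    mean_ms = get_frac * get_ms + (1 - get_frac) * put_ms      # 28ms per op
    ops_per_sec = threads * 1000.0 / mean_ms                   # ~357 ops/sec
    print("%.0fms mean, ~%.0f ops/sec" % (mean_ms, ops_per_sec))

That lands in the same ballpark as the reported throughput, which is why the guess seems plausible without being anything like certain.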

[3] Popular doesn’t mean good. I’ve heard more than one person say that TC tends to choke on big datasets (much larger than I tested), that it can crash and leave data unrecoverable, etc. – all good and sufficient reasons for Basho to seek alternatives quite aside from any performance difference. Maybe some would prefer to interpret this comparison as an indictment of TC rather than a validation of BC, but I think the BC numbers would look good in just about any light.

Writing New Code

Just about every programmer prefers writing their own code to maintaining somebody else’s. This is even more true in the open-source world. You get to choose your language, your libraries and frameworks, your style. Do you prefer Erlang or Ruby? Shared state or message passing? Long variable names or short? Many exceptions or few? Whatever you write will use all of your favorite toys and tricks, be sophisticated where you think sophistication is called for and simple where you think simplicity should be the rule. Even if you have to maintain it later, it’s still more fun than fixing bugs in someone else’s code representing someone else’s priorities and aesthetics. Writing a big new blob of code might be considered a negative for a company or project, but to the individual programmer it’s almost always considered a positive. I don’t even have a problem with that if entertainment or self-education is the admitted goal. Heck, pretty much all of the code I’ve published here has been written that way, and I’ve generally been careful to admit as much. If you write code because you enjoy it, that’s great. What does irk me, though, is when people write code for fun and then try to pretend it was for something else. “Scratching an itch” is merely a euphemism for what such people are usually doing.

One common excuse for writing new code, or rewriting old code, is that the new code is supposedly more maintainable. There might even be a grain of truth in that, to the extent that maintenance requires understanding and it’s generally easier to understand your own code than to understand someone else’s. That only works as long as the person who wrote the code is still around, and still interested in helping. Usually the same people who are quick to reimplement one thing are also quick to drop that project and go off to rewrite another. How many of the people reading this have, at some point in their careers, had to pick up a pile of crap code left by some attention-challenged goatee-wearer who left the company to work somewhere more fashionable? Pretty much all of those over 30, I’ll bet. The potential maintenance benefits of locally developed code should always be weighed against both the initial development cost and the risk of it being less maintainable when its author leaves. As a general rule of thumb, the effort required to rewrite or replace code should be estimated and no such project should be undertaken until its advocates have spent at least as long learning what problems the old code really solves. (Side note: this also applies to NoSQL projects in general, and specifically to why I’m not writing my own Dynamo derivative.)

The other popular excuse for writing new code here is performance. “XYZ will be much faster” is one of the most common things I’ve heard from coworkers. Maybe I’m jaded, but my reaction nowadays is likely to be either “Who cares?” or “Why should I believe that?” I say “Who cares?” not only because performance is often not a relevant figure of merit, but because it’s generally system-level performance I care about. Really, if the part of the system addressed by XYZ isn’t the bottleneck, making it faster is just making development slower. Even if it is the system-wide bottleneck, making it faster might not help. It’s often possible to make something faster by letting it use ten times as much memory, for example, but often that’s effectively a resource stolen from some other part of the system and the net effect is not worth the development effort. Squeeze one part of a water balloon and another part bulges out. That might be fun, but it’s not productive.

Even more important than measuring performance at a system level, though, is measuring it at all. The burden of proof for performance claims is on those making the claims, not on those who express skepticism. If performance is purportedly the reason for doing something, you should be able to measure that performance the moment the code is written. Don’t expect others to do it for you. That’s not collaboration; that’s laziness. “You can try it yourself” is particularly offensive when the languages and libraries etc. to do so are lying there in your sandbox but don’t exist in any standard distribution and have to be built from source. Anybody who uses performance as a justification without even being able to define what kind of performance, measured under what conditions and compared to which alternatives, is just a dilettante. Even a back-of-the-envelope calculation – of cycles, messages, seeks, or whatever tends to affect performance most in your particular domain – is worth more than empty claims.
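
To make that concrete, here’s the kind of thirty-second estimate I mean, with made-up but plausible numbers for a seek-bound workload; the point is the shape of the reasoning, not the specific figures.

    # Back-of-the-envelope: is the proposed "faster" component even relevant?
    seek_ms = 8.0                   # one random seek on a commodity disk (assumed)
    seeks_per_request = 3           # e.g. index lookup + data read + journal write
    req_per_sec_per_disk = 1000.0 / (seek_ms * seeks_per_request)   # ~42

    target_req_per_sec = 2000
    disks_needed = target_req_per_sec / req_per_sec_per_disk        # ~48

    # If the system is seek-bound at ~42 requests/sec/disk, a rewrite that
    # only shaves CPU time changes none of these numbers.
    print(round(req_per_sec_per_disk, 1), round(disks_needed))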

As I said at the beginning, if you want to write new code for fun, knock yourself out. If you want to write it first and then explore its characteristics later, that’s fine too. That’s a form of research that I’d never discourage. Just don’t expect me or anyone else to be impressed by the mere fact that you wrote it, or that you wrote it in some particular style. The proof is in the pudding, and I’m hungry.

Yeah, right

This is (just barely) too long to fit in a tweet. Tim O’Reilly says:

The biotorrent project demonstrates that piracy is not the point (or best use) of BitTorrent.

If I take a sword and use it as a can opener, then killing people is not the point of a sword? That’s utterly absurd. One unique use of a thing, never imagined by the thing’s inventor, does not outweigh the multitude of other uses. Either the inventor’s intent or the common usage might determine the point of a thing, but a single unique use sheds no light on either and thus proves nothing.

The RAM-Cloud Zombie is Back

Looks like this is the week to re-hash old arguments. Nati Shalom: Memory is the New Disk for the Enterprise. He even tries to run some numbers, but they didn’t look very convincing to me so I decided to run my own. Here are some data points.

  • 8GB of DDR2-800 ECC RAM costs $249 or a bit over $30/GB.
  • A 147GB 15K RPM SAS drive costs $179 or around $1.20/GB.
  • A fairly generic 1U Xeon-based server with 2GB costs $1435.

Nati was using Cisco UCS systems. Somehow I doubt that either the systems themselves or the memory that goes in them are cheaper than generic stuff at NewEgg, so the above figures should be considered lower bounds for those prices. He also claims that a 4TB data set fits into 4 UCS chassis, which doesn’t seem right considering that the data sheet for the UCS C250 M1 “extended memory” server only claims 384GB. Now, 384GB in 1U is a beautiful thing, but it’s not 1TB. It also costs a lot more than Nati claims. In reality, you’ll need a dozen systems just for capacity. Since most of the data you’re accessing is going to be remote, you’re probably going to need more to get enough bandwidth to all that data. Yes, even with 10GbE. Even with QDR IB, which is 3x as fast for the same price. I’ll be generous, though, and stick with a dozen. In fact, I’m going to ignore NICs and switches and cables altogether. Let’s look at what you really need to satisfy Nati’s 4TB use case.

  • RAM-based: $1435 * 12 servers + $249 * 500 * 8GB RAM = $142K
  • “Lean” disk-based: $1435 * 12 servers + $179 * 28 * 147GB disk = $22K
  • “Beefy” disk-based: $1435 * 12 servers + $249 * 12 extra 8GB RAM + $179 * 56 * 147GB disk = $30K
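
Here’s the arithmetic behind those three lines, spelled out so anyone can plug in their own prices. The unit prices and counts come from the list above; the one extra 8GB DIMM per server in the “beefy” case is my own reading of that configuration.

    # Rough cost comparison for the ~4TB case, using the NewEgg-ish prices above.
    server, dimm_8gb, disk_147gb = 1435, 249, 179
    servers = 12

    ram_based  = servers * server + 500 * dimm_8gb                         # 500 x 8GB ~ 4TB
    lean_disk  = servers * server + 28 * disk_147gb                        # 28 x 147GB ~ 4TB
    beefy_disk = servers * server + servers * dimm_8gb + 56 * disk_147gb   # extra DIMM each, 2x disks

    for name, cost in (("RAM-based", ram_based),
                       ("lean disk-based", lean_disk),
                       ("beefy disk-based", beefy_disk)):
        print("%-17s $%dK" % (name, round(cost / 1000.0)))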

The difference between the “lean” and “beefy” configurations is based on two observations. First is that disk data can be cached in memory. Gasp! It might seem obvious to most of us, but somehow the RAM-cloud advocates always seem to “forget” that in their comparisons. In fact, a RAM-based system that can spill to disk isn’t all that different than a disk-based system that can cache in RAM, and most of the differences are negative. The second observation is that more spindles offer better transfer rates and (more importantly) more ops/second, and are often deployed even when capacity goals have already been met. Still not fast enough for your workload? Spend a few grand on another rank of disks. Repeat as necessary, for trivial cost compared to the all-RAM approach.

The RAM-based system would have to perform a lot better than the disk-based one to justify the 5x price differential. That would require that practically all of the data be hot all the time, which is generally unlikely and particularly so in Nati’s two examples. Both the online retailer and the airline-reservation system are likely to have access patterns that are characterized by a small time-based window onto a much larger data set. It’s also not clear that these systems qualify as large any more, and the comparison only looks worse for RAM as systems get bigger. You don’t have to be Facebook to have millions of users accessing ten times more data than Nati’s use cases. There are at least a couple of dozen applications within Facebook that each have a million daily users, matched by many times more at standalone websites, in turn matched by many times more within corporations. Applications scaling from dozens of terabytes up to petabytes are more common than many people seem to think. There are applications that fit into Nati’s model – I saw a few in my HPC days – but in general if your data set is that large then most of it’s going to be cold and most of it is going to be remote. When you’re already incurring the cost of a network operation, the system-level performance difference between disk cached/prefetched appropriately in RAM and all-RAM is negligible compared to the cost difference (for the vast majority of workloads). Nine out of ten people who think they have a truly RAM-cloud-appropriate access pattern should be spending their money not on extra RAM but on smarter programmers.

In the end, the numbers just don’t add up. Maybe when systems have at least a terabyte each and a network to match, but even then I remain skeptical. Of course making disk do RAM’s job can be slow. And making RAM do disk’s job can be expensive. People shouldn’t harp on one without recognizing the other. If you really want to get serious about optimizing data-access performance, here’s another bonus observation: a two-level operational/archival distinction is too confining and moving the line from disk/tape to RAM/disk is pointless. What people should really be thinking about is local RAM to remote RAM to SSD to fast disk to slow disk to cloud, or similar. Use all of the tools in the box, instead of one for everything.
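
A minimal sketch of what I mean by “use all of the tools in the box”: a read path that falls through an ordered list of tiers and promotes whatever it finds toward the faster ones. The Tier class and the promotion policy are purely illustrative; in real life each tier would wrap an actual store.

    class Tier:
        # Placeholder for a real store: local cache, memcached peer, SSD
        # file, disk-backed database, cloud object store, and so on.
        def __init__(self, name):
            self.name, self.data = name, {}
        def get(self, key):
            return self.data.get(key)
        def put(self, key, value):
            self.data[key] = value

    # Ordered fastest-to-reach first.
    TIERS = [Tier("local RAM"), Tier("remote RAM"), Tier("SSD"),
             Tier("fast disk"), Tier("slow disk"), Tier("cloud")]

    def tiered_get(key):
        for i, tier in enumerate(TIERS):
            value = tier.get(key)
            if value is not None:
                # Promote into the faster tiers so the next access is cheaper.
                for faster in TIERS[:i]:
                    faster.put(key, value)
                return value
        return None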

Tech Field Day

I went to the Tech Field Day party at Fenway Park last night, and have to say I had a great time. I got to meet Stephen Foskett, Greg Knieriemen, Greg Ferro, Robin Harris, and many others. (I mention those four specifically because they’re people with whom I had interacted significantly online without having met them FTF.) None of them particularly seemed to want to throw me over the railing, though Greg and I did almost get into a discussion of iSCSI vs. FC at one point. I got to see an iPad for the first time. I also particularly enjoyed talking to some of the folks from Akorri and Data Robotics. The basic concept of getting bloggers – oops, I meant “independent thought leaders” – together to talk both to one another and to vendors without the usual six layers of filtering seems to have worked out extremely well. Thanks to Stephen and to everyone else who was involved for the chance both to learn and to evangelize some of my own ideas.

The Anti-CAP Zombie is Back

Nothing new here, really. Michael Stonebraker: Errors in Database Systems, Eventual Consistency, and the CAP Theorem. Stonebraker doesn’t like the CAP Theorem or eventual consistency. Who knew? Oh, wait. Here’s the really, really short version of the CAP theorem.

If your network breaks, and it will, then you can preserve A by letting the pieces continue independently (and resolve things later), or you can preserve C by shutting down the non-majority pieces.
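
Here’s a toy illustration of that choice, not any particular system’s implementation: when a partition leaves a node able to see only a minority of its peers, a CP node refuses writes to preserve consistency, while an AP node keeps accepting them and records enough to reconcile later.

    class Node:
        def __init__(self, cluster_size, reachable, mode):
            self.cluster_size = cluster_size   # total nodes in the system
            self.reachable = reachable         # nodes this one can still see
            self.mode = mode                   # "CP" or "AP"
            self.pending = []                  # writes to reconcile after the partition heals

        def write(self, key, value):
            in_majority = self.reachable > self.cluster_size // 2
            if self.mode == "CP" and not in_majority:
                raise RuntimeError("unavailable: minority partition (preserving C)")
            if not in_majority:
                self.pending.append((key, value))   # AP: accept now, resolve later
            return "ok"

    # A 5-node cluster split 3/2: the CP minority rejects writes, the AP
    # minority accepts them and resolves things once the network heals.
    cp_minority = Node(5, reachable=2, mode="CP")
    ap_minority = Node(5, reachable=2, mode="AP")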

The “and it will” part is also addressed by James Hamilton. I’ve worked on systems with several redundant networks, and they were still subject to partitions. Mostly the reasons come down to something Stonebraker alludes to in the context of databases but which is also true of networks: programmers aren’t perfect. If a database programmer screws up, they can take down the most carefully designed database no matter what consistency model it uses (Stonebraker’s case 2). If a network programmer screws up, guess what? Same story, except now it’s your carefully designed network that fails instead. Given enough time and a large enough system, even rare failures become inevitable. If you have a network, it will fail. Unless you’re willing to promote router glitches into whole-system failures, you only get to choose CP or AP. An important point here is that in a consumer-facing system a failure of consistency might be well tolerated but a failure of availability might be disastrous, while in a “back room” system the exact opposite is likely to be true. Whole-system failure might actually be a valid option for a back-room system that’s already protected by redundant switches etc. so that other failures are more likely anyway (and probably more severe). Stonebraker apparently sees these as the only relevant systems, and dismisses anything not fitting that model. I’m not going to say that they’re irrelevant (hi Kirby) but I will say that they’re uninteresting. Whether they’re more common or not, more important or not, they don’t shed much light on the CAP/EC debate – but that debate is central to a different set of systems which are increasingly common and increasingly important.

At least when the network programmer screws up, the database might be able to handle it via eventual consistency. Bradford Stephens points out in JH’s thread that application programmers might not be able to do much with that, but abstractions leak and errors bubble up (in some form or other) and if they bubble up to someone who can’t handle them then maybe you would have been better off promoting to whole-system failure in the first place. Eventual consistency and vector clocks and all the other stuff I discuss here are about giving those higher-level programmers a chance, giving them an option if they want it, not saying it’s the only possibility. Yes, systems based on such principles can have “weird” behavior even in the absence of failures. Don’t like that? Don’t use them. Stonebraker’s right that eventual consistency might represent a poor tradeoff in many cases, and it’s all about tradeoffs. Some people live in an AP world, and must pay attention to these issues. Others live in a CA or CP world, and can ignore them. All three groups can and should coexist amicably. “Your problems don’t matter” just isn’t very constructive, but it’s apparently all some people have to say.

Camera Review: Panasonic Lumix DMC-FS7

And now for something completely different…

I got this camera a few months ago after I’d lost its predecessor, and I think it has become my favorite out of the half-dozen or so digital cameras I’ve had over the years. You can find specs etc. anywhere, but here are some of the highlights from my perspective.

  • It’s small, easy to use, and takes video as well as stills, so it’s very convenient to bring everywhere.
  • Battery life seems excellent.
  • Start-up and between-picture times are better than average. This was one of my main selection criteria.
  • It doesn’t have any of the focus, exposure, or color-balance problems that I’ve seen in other cameras (especially its immediate predecessor).
  • When I zoom in, the pictures seem remarkably noise-free, so they resize well (using iPhoto – Facebook’s auto-resize is amazingly bad so don’t judge by that) and I can often skip a clean-up stage.

That’s really it. I’m no expert, just a casual family photographer, but that’s a description that fits many people so maybe someone will find my positive experience useful.

Vector Clocks Again

Justin Sheehy at Basho has posted a good article about vector clocks, which has led to an interesting discussion. I highly recommend that you read what’s said there, but my own key takeaway is much the same as it was in my own fairly recent article about conflict resolution: when the problem is conflicts in a distributed data system, there is no magic bullet. Vector clocks won’t save you. Client-side vector clocks won’t save you. Conditional writes won’t save you. This is not an issue that can be handled entirely within the storage system, or within a thin isolated layer of your application (which is effectively part of the storage system). Like security or performance, data conflicts are an issue that will pervade any distributed application, and anyone who says otherwise is a charlatan. Application designers should understand the tools available, including the limitations of those tools, and take responsibility for using them to build the application that best satisfies their requirements. It’s a top-to-bottom effort. Get used to it.
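
For anyone who hasn’t internalized the mechanics, here’s a bare-bones vector-clock comparison in plain Python (not Riak’s implementation). Note that all it can ever tell you is that one version descends from the other or that they’re concurrent, and the concurrent case is exactly where the application still has to decide what to do.

    def descends(a, b):
        # True if clock a has seen at least everything clock b has.
        return all(a.get(node, 0) >= count for node, count in b.items())

    def compare(a, b):
        if descends(a, b) and descends(b, a):
            return "identical"
        if descends(a, b):
            return "a supersedes b"          # safe to keep a, discard b
        if descends(b, a):
            return "b supersedes a"
        return "concurrent"                  # conflict: the application must resolve it

    def merge(a, b):
        # Pairwise max, used when a new write resolves two concurrent versions.
        return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

    print(compare({"X": 2, "Y": 1}, {"X": 1, "Y": 1}))   # a supersedes b
    print(compare({"X": 2}, {"Y": 3}))                   # concurrent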

A related point I think is worth making is that I’m not saying it’s OK to lose data. As a storage professional for many years, I would never say such a thing, but I will say that people need to be clear about the difference between losing data and allowing data to be overwritten (or deleted) in response to user actions. I will also say that storage systems should be clear about the promises they make, and rigorously keep those promises. I’m perfectly comfortable saying that “last write wins” is a valid rule, with a long tradition of usefulness in filesystems and databases, and that it can continue to govern writes to a single node in a distributed system. Vector clocks still have a lot of value between nodes, where “last write” or any other concept of relative time can be impossible to pin down, but even Justin’s article mentions some limitations. Ultimately, the application (or the user) still has to deal with conflicts. To enable that, it can sometimes be more important for the storage system to offer easily understood behavior than to offer complex behavior that’s “better” only according to certain assumptions about what the user would want. “Pilot error” isn’t likely to be a very welcome response to a user who misunderstood complex rules and then lost data as a result. The rules must be defined with an eye toward minimizing the probability of data loss or corruption throughout the system, not just within one component.