Rackspace Block Storage

A while ago, Rackspace announced their own block storage. I hesitate to say it’s equivalent to Amazon’s EBS, them being competitors and all, but that’s the quickest way to explain what it is/does. I thought the feature itself was long overdue, and the performance looked pretty good, so I said so on Twitter. I also resolved to give it a try, which I was finally able to do last night. Here are some observations.

  • Block storage is only available through their “next generation” (OpenStack based) cloud, and it’s clearly a young product. Attaching block devices to a server often took a disturbingly long time, during which the web interface would often show stale state. Detaching was even worse, and in one case took a support ticket and several hours before a developer could get it unstuck. If I didn’t already have experience with Rackspace’s excellent support folks, this might have been enough to make me wander off.
  • Still before I actually got to the block storage, I was pretty impressed with the I/O performance of the next-gen servers themselves. In my standard random-sync-write test, I was seeing over 8000 4KB IOPS. That’s a kind of weird number, clearly well beyond the typical handful of local disks but pretty low for SSD. In any case, it’s not bad for instance storage.
  • After seeing how well the instance storage did, I was pretty disappointed by the block storage I’d come to see. With that, I was barely able to get beyond 5000 IOPS, and it didn’t seem to make any difference at all if I was using SATA- or SSD-backed block storage. Those are still respectable numbers at $15/month for a minimum 100GB volume. Just for comparison, at Amazon’s prices that would get you a 25-IOPS EBS volume of the same size. Twenty-five, no typo. With the Rackspace version you also get a volume that you can reattach to a different server, while in the Amazon model the only way to get this kind of performance is with storage that’s permanently part of one instance (ditto for Storm on Demand).
  • Just for fun, I ran GlusterFS on these systems too. I used a replicated setup for comparison to previous results, getting up to 2400 IOPS vs. over 4000 for Amazon and over 5000 for Storm on Demand. To be honest, I think these numbers mostly reflect the providers’ networks rather than their storage. Three years ago when I was testing NoSQL systems, I noticed that Amazon’s network seemed much better than their competitors’ and that more than made up for a relative deficit in disk I/O. It seems like little has changed.
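For anyone who wants to reproduce the kind of random-sync-write test used above: here’s a rough sketch of one in Python. The file path, size, and duration are arbitrary choices of mine, and a real benchmark would use a tool like fio; this just shows the shape of the workload (4KB writes at random offsets, with O_SYNC so each write is durable before the next starts).

```python
import os
import random
import time


def random_sync_write_iops(path, file_size_mb=256, block=4096, duration=10.0):
    """Measure 4KB random synchronous-write IOPS against a preallocated file."""
    size = file_size_mb * 1024 * 1024
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        os.ftruncate(fd, size)
        buf = os.urandom(block)
        blocks = size // block
        ops = 0
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            # Each pwrite hits a random 4KB-aligned offset; O_SYNC makes it
            # count as a durable write, which is what the test measures.
            os.pwrite(fd, buf, random.randrange(blocks) * block)
            ops += 1
        return ops / duration
    finally:
        os.close(fd)
        os.unlink(path)
```

Single-threaded, so it understates what a device can do under queue depth; treat the result as a floor, not a ceiling.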

The bottom line is that Rackspace’s block storage is interesting, but perhaps not enough to displace others in this segment. Let’s take a look at IOPS per dollar for a two-node replicated GlusterFS configuration.

  • Amazon EBS: 1000 IOPS (provisioned) for $225/month or 4.4 IOPS/$ (server not included)
  • Amazon SSD: 4300 IOPS for $4464/month or 1.0 IOPS/$ (that’s pathetic)
  • Storm on Demand SSD: 5500 IOPS for $590/month or 9.3 IOPS/$
  • Rackspace instance storage: 3400 IOPS for $692/month (8GB instances) or 4.9 IOPS/$
  • Rackspace with 4x block storage per server: 9600 IOPS for $811/month or 11.8 IOPS/$ (hypothetical, assuming CPU or network don’t become bottlenecks)
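The IOPS-per-dollar figures above are simple division; a few lines make the comparison easy to recheck or extend as prices change (numbers copied from the list, prices in $/month):

```python
# (IOPS, $/month) pairs from the list above.
configs = {
    "Amazon EBS (provisioned)": (1000, 225),
    "Amazon SSD": (4300, 4464),
    "Storm on Demand SSD": (5500, 590),
    "Rackspace instance storage": (3400, 692),
    "Rackspace 4x block storage": (9600, 811),
}

for name, (iops, dollars) in configs.items():
    print(f"{name}: {iops / dollars:.1f} IOPS/$")
```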

Some time I’ll have to go back and actually test that last configuration, because I seriously doubt that the results would really be anywhere near that good and I suspect Storm would still remain on top. Maybe if the SSD volumes were really faster than the SATA volumes, which just didn’t seem to be the case when I tried them, things would be different. I should also test some other less-known providers such as CloudSigma or CleverKite, which also offer SSD instances at what seem to be competitive prices (though after Storm I’m wary of providers who do monthly billing with “credits” for unused time instead of true hourly billing).

Another Amazon Post Mortem

Amazon has posted an analysis of the recent EBS outage. Here’s what I would consider to be the root cause:

this inability to contact a data collection server triggered a latent memory leak bug in the reporting agent on the storage servers. Rather than gracefully deal with the failed connection, the reporting agent continued trying to contact the collection server in a way that slowly consumed system memory

After that, predictably, the affected storage servers all slowly ground to a halt. It’s a perfect illustration of an important principle in distributed-system design.

System-level failures are more likely to be caused by bugs or misconfiguration than by hardware faults.

It is important to write code that guards not only against external problems but against internal ones as well. How might that have played out in this case? For one thing, something in the system could have required positive acknowledgement of the DNS update (it’s not clear why they relied on DNS updates at all instead of assigning a failed server’s address to its replacement). An alert should have been thrown when such positive acknowledgement was not forthcoming, or when storage servers reached a threshold of failed connection attempts. Another possibility would be from the Recovery Oriented Computing project: periodically reboot apparently healthy subsystems to eliminate precisely the kind of accumulated degradation that something like a memory leak would cause. A related idea is Netflix’s Chaos Monkey: reboot components periodically to make sure the recovery paths get exercised. Any of these measures – I admit they’re only obvious in hindsight, and that they’re all other people’s ideas – might have prevented the failure.
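To make the first idea concrete, here’s a minimal sketch of a reporting agent that bounds its retries instead of looping forever. The `connect` callable and all parameters are hypothetical stand-ins, not anything from Amazon’s actual agent; the point is only the shape of the guard: back off, count failures, and surface an alert rather than silently accumulating state.

```python
import logging
import time


def report_metrics(connect, max_failures=5, base_delay=1.0, max_delay=60.0):
    """Contact a (hypothetical) collection server with bounded, backed-off retries.

    Rather than retrying forever -- the failure mode that let a latent memory
    leak grind the storage servers down -- give up after a threshold and raise,
    so the problem becomes visible to monitoring instead of eating memory.
    """
    failures = 0
    delay = base_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            failures += 1
            if failures >= max_failures:
                # This is where an alert would fire in a real system.
                logging.error("collection server unreachable after %d attempts",
                              failures)
                raise
            time.sleep(min(delay, max_delay))
            delay *= 2  # exponential backoff between attempts
```

The essential property is that the failure path terminates in something observable, instead of degrading the host it runs on.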

There are other more operations-oriented lessons from the Amazon analysis, such as the manual throttling that exacerbated the original problem, but from a developer’s perspective that’s what I get from it.

Thoughts on Multitasking

A lot of people, especially in the geek community, have historically taken pride in their ability to multi-task. More recently, a lot of research has shown that multi-tasking is less effective than people think, leading many to a conclusion that multi-tasking really doesn’t exist. I think both sides are full of bunk.

On a CPU, there is a cost associated with switching from one task to another. Whether the switch is worth the cost depends on which task needs those cycles more. If the old task is likely to be blocked anyway and the new one is ready to go, then the switch is likely to be worth it. Conversely, if the old task still has work to do and the new one isn’t ready yet, then a switch would be a complete waste of time. As it turns out, on a typical computer most tasks are blocked most of the time, waiting for disks or networks or people. It’s relatively easy to detect and distinguish between these conditions, so multi-tasking works really well.

The problem is that we’re not like computers. For one thing, while a lot of things can happen at once in a human brain, we only have one “core” devoted to higher-level activities like coding, writing, or carrying on a conversation. For another, our brains are actually quite slow, so we don’t typically have a lot of idle cycles on either side of the “should we switch” equation. That slowness also means that our task switches are many orders of magnitude more expensive than those on computers – possibly seconds, depending on the complexity of the task we’re setting aside and the task we’re taking up, instead of microseconds. For human multi-tasking to work, we must make much more intelligent decisions about when to switch and when not to, based on much more subtle features of the old and new tasks. Even the decision to accept or reject an interruption takes significant time, which is why interruptions harm productivity so much. People who say they multi-task well usually mean that they can make accept/reject decisions quickly, but that doesn’t mean they make those decisions well – and there’s still the effect of the switch itself to consider. Besides being very slow, for us a task switch often SQUIRREL! Quick, can you remember what I was just saying three sentences ago? I doubt it, and that’s the point: unlike computers, when we switch our recall of where we were can be highly imperfect. We can get through the accept/reject part and the switch part quickly and still lose because in the process we’ve forgotten more context than switching was worth. We would still have been better off single-tasking.

The upshot is that you can train yourself to multi-task more efficiently, but it’s an ability you should be reluctant to exercise. Unless you’re in one of those situations where you really should stop thinking about something because you’re overanalyzing or going around in circles (the infamous “finally figured out that bug while driving home” scenario), you should probably stick to what you’re doing until you’re done. Learn to schedule exactly the amount of work that you’re really able to do well, and do it in an organized way, instead of trying to be a hero by multi-tasking and doing everything poorly.

Today’s Memes

Some people will know what I’m talking about. Some won’t.

crazy-config meme

new-release meme

Second Debate Opening Remarks

“I would like to congratulate Mitt Romney on an excellent performance in the first debate. I would also like to offer him thanks and an apology. The thanks are for validating my belief that my policy positions are the ones that appeal to most Americans. If they didn’t, why would he have spent the entire debate adopting them? In one hour, he changed almost every sincerely held belief that helped to get him the Republican nomination, apparently hoping I’d reciprocate by adopting his positions. The apology is because I must decline. I’m a bit stubborn about standing by my beliefs. I’m sure he’d consider that a weakness, and I hope he can find it in his heart to forgive me.”

Pulling Up Weeds

Some software projects are obviously hard. Nobody thinks writing a compiler or an operating system, especially one comparable to existing production-grade examples, will be easy. Other projects seem to be easy until you get into them. Unfortunately, distributed storage is one of those. It’s really not that hard to put together some basic server/connection management and some consistent hashing, add a simple REST interface, and have a pretty useful type of distributed object store. The problem is that some people do that and then claim to have reinvented Swift, or to have invented something that’s even better than GlusterFS because it’s simpler. Um, no. A real distributed storage system has to go well beyond those simple requirements, and I’m not even talking about all of the complexity imposed by POSIX. I’ve beaten the NoPOSIX drum a few times, calling for simplification of semantics for distributed filesystems. I have nothing against object stores like Swift, which simplify even further. However, NoPOSIX and NoFS and NoSQL don’t mean giving up all requirements and expectations altogether. A minimum standard still includes things like basic security, handling disk-full or node-failure errors gracefully, automating the process of adding/removing/rearranging servers (preferably without downtime), and so on. That complexity is there for a reason. Comparing something that has these features to something that doesn’t isn’t just incorrect; it’s dangerous. Users are even less capable than fellow developers of evaluating claims about such systems. Overstating your capabilities increases the risk that users will choose systems which aren’t really ready for serious use, and might even cost them their valuable data.
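To illustrate just how little code the “easy part” takes, here’s a toy consistent-hash ring of the sort such projects are built around. This is my own sketch, not code from any system mentioned here, and it deliberately has none of the things a real store needs – no security, no rebalancing, no failure handling:

```python
import bisect
import hashlib


class HashRing:
    """A minimal consistent-hash ring -- the easy 20% of a distributed store."""

    def __init__(self, nodes, vnodes=64):
        # Place several virtual points per node on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # A key belongs to the first ring point at or after its hash.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Twenty-odd lines, and removing a node only remaps the keys that lived on it – which is exactly why people stop here and declare victory. Everything after this point is the hard part.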

What brought all of this to mind was some recent Quora spamming by the author of Weed-FS, claiming that it would perform better than existing systems because of its great simplicity. In some ways it’s unfair to pick on Weed-FS specifically, but it represents a general category of “existing data stores are too complex” and “we invented a better data store” BS that I’ve been seeing entirely too much lately. Also, I kind of promised/threatened to run some performance tests myself if the author was too lazy or scared, so here we go.

STRIKE 1: no real packaging I fired up a Rackspace cloud server, and went to see if I could install Weed-FS on it. No such luck. The only build packages are for Windows/Darwin/Linux amd64, but that’s a relative rarity on cloud services, so I cloned the source tree and tried to build from that. Too bad there are no makefiles. Apparently the author builds using Eclipse, and didn’t bother including all of the information from that in the source tree. Nonetheless, it only took me a few minutes to figure out the correct build order and build the single executable using gccgo.

STRIKE 2: barely usable interface Unlike most object-store interfaces, Weed-FS has no buckets/containers/directories and insists on assigning its own keys to objects. Therefore you can’t use keys that are meaningful to you; you have to use theirs and store the mapping in some other kind of database. There also seems to be no enumeration function (I guess we don’t need to bother measuring that kind of performance) so if you ever lose the mapping between your own key and theirs then you’ll never find your data again. Similarly, there are no functions to get/set metadata on objects, so there’s pretty much no way to use Weed-FS except by pairing it with a database and wrapping a library around the whole thing. Oh, and there’s no delete either. Too bad if you ever want to reclaim any space.

STRIKE 3: poor performance Despite all the above, I set about writing some scripts to test performance. Before I could read a million files I had to create a million files. Just to be on the safe side, I decided to try creating 100K files first to make sure it wasn’t going to take forever – both making this exercise very tedious and costing me money in extra instance hours. It didn’t take forever, but it did take over 14 minutes. That’s over 8ms per object create, or over two hours just to set up for a real test. It’s particularly egregious since there doesn’t seem to be any evidence of using O_SYNC/fsync, so it’s not even clear that the index file is sure to have been updated. I tried speeding it up by running five client threads in parallel, but one by one they hung waiting for a response from one of the volume servers – probably related to the “unexpected EOF” errors that the volume server would spit out periodically. I guess concurrency isn’t a strong suit, and neither is error reporting, since I had already noticed that the servers would return an HTTP 200 response even when requests failed. Just for comparison, GlusterFS completed the same setup in about seven minutes. That’s with full data durability, plus a result that has real user-friendly file names (plus extended attributes) and directory listings and actual security. At this point I decided it wasn’t even worth moving on to the test I’d meant to do. I’d seen enough.
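The back-of-envelope arithmetic behind those numbers, for anyone who wants to check it:

```python
# Scale the observed 100K-file create run up to the full million.
files = 100_000
elapsed_s = 14 * 60  # "over 14 minutes", so these are lower bounds

per_create_ms = elapsed_s * 1000 / files
full_run_s = per_create_ms / 1000 * 1_000_000

print(f"{per_create_ms:.1f} ms per object create")    # 8.4 ms
print(f"{full_run_s / 3600:.1f} hours for 1M files")  # ~2.3 hours
```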

Some people might think I’m being too harsh here, but I disagree. As I said in the first paragraph, systems like Weed-FS can be quite useful. As I also said, representing them as more than they are is not only incorrect but dangerous. The author as developer has done some interesting stuff and deserves encouragement. This feedback might not be what he wants to hear, but it’s a kind of feedback that developers need to hear, plus I’ve already done some testing and scripting work that might be useful. On the other hand, the author as social-media marketer deserves nothing but contempt. This is not a system that one can trust with real data, or that is in any significant way comparable to systems that already existed, and yet it was blithely presented as something actually superior to alternatives. That’s not acceptable. Building real storage is hard and often tedious work. The people who do it – including my competitors – don’t deserve to have their efforts trivialized by comparison to half-baked spare-time projects. They deserve better, and users deserve better, and anybody who doesn’t respect that deserves a few harsh words.

Open Source: Doing It Wrong

Hi, here are fifteen patches to make your code work on my platform. I haven’t tested to see if it still works on any other platform. Heck, it might not even build[1]. I know you have a well defined patch submission process, but I didn’t use it because it’s not what I’m used to. Also, I couldn’t be bothered rebasing to the current code, so the patches might not apply cleanly[2] and they all depend on each other. I’m sure you can find someone to shepherd these through your review process, because their time is surely worth less than my own. If you don’t accept these, I’ll tell all my friends how your project isn’t really open to outside contributors and I was forced to create my own fork. Have a nice day[3].

[1] No, it didn’t.
[2] Ditto.
[3] Die in a fire.