This Is Competition?

As I’m sure you’ve all noticed by now, I’ve become a bit sensitive about people bashing GlusterFS performance. It’s really hard to make even common workloads run well when everything has to go over a network. It’s impossible to make all workloads run well in that environment, and when people blame me for the speed of light I get a bit grouchy. There are a couple of alternatives that I’ve gotten particularly tired of hearing about, not because I fear losing in a fair fight but because I feel that their reality doesn’t match their hype. Either they’re doing things that I think a storage system shouldn’t do, or they don’t actually perform all that well, or both. When I found out that I could get my hands on some systems with distributed block storage based on one of these alternatives, it didn’t take me long to give it a try.

The first thing I did was check out the basic performance of the systems, without even touching the new distributed block storage. I was rather pleased to see that my standard torture test (random synchronous 4KB writes) would ramp very smoothly and consistently up to 25K IOPS. That’s more than respectable. That’s darn good – better IOPS/$ than I saw from any of the alternatives I mentioned last time. So I spun up some of the distributed stuff and ran my tests with high hopes.

synchronous IOPS

Ouch. That’s not totally awful, but it’s not particularly good and it’s not particularly consistent. Certainly not something I’d position as high-performance storage. At the higher thread counts it gets into a range that I wouldn’t be too terribly ashamed of for a distributed filesystem, but remember that this is block storage. There’s a local filesystem at each end, but the communication over the wire is all about blocks. It’s also directly integrated into the virtualization code, which should minimize context switches and copies. Thinking that the infrastructure just might not be handling the hard cases well, I tried throwing an easier test at it – sequential buffered 64KB writes.

buffered IOPS

WTF? That’s even worse that the synchronous result! You can’t see it at this scale, but some of those lower numbers are single digit IOPS. I did the test three times, because I couldn’t believe my eyes, then went back and did the same for the synchronous test. I’m not sure if the consistent parts (such as the nose-dive from 16 to 18 threads all three times) or the inconsistent parts bother me more. That’s beyond disappointing, it’s beyond bad, it’s into shameful territory for everyone involved. Remember 25K IOPS for this hardware using local disks? Now even the one decent sample can’t reach a tenth of that, and that one sample stands out quite a bit from all the rest. Who would pay one penny more for so much less?

Yes, I feel better now. The next time someone mentions this particular alternative and says we should be more like them, I’ll show them how the lab darling fared in the real world. That’s a load off my mind.

The “Gather, Prepare, Cook, Eat” Design Pattern

This post is actually about an anti-pattern, which I’ll call the “grazing” pattern. Code wanders around, consuming new bits of information here and there, occasionally excreting new bits likewise. This works well “in the small” because all of the connections between inputs and outputs are easy to keep in your head. In code that has to make complex and important choices, such as what response to a failure will preserve users’ data instead of destroying it, such a casual approach quickly turns into a disaster. You repeatedly find yourself in some later stage of the code, responsible for initiating some kind of action, and you realize that you might or might not have some piece of information you need based on what code path you took to get there. So you recalculate it, or worse you re-fetch it from its origin. Or it’s not quite the information you need so you add a new field that refers to the old ones, but that gets unwieldy so you make a near copy of it instead (plus code to maintain the two slightly different versions). Sound familiar? That’s how (one kind of) technical debt accumulates.

Yes, of course I have a particular example in mind – GlusterFS’s AFR (Advanced File Replication) translator. There, we have dozens of multi-stage operations, which rely on up to maybe twenty pieces of information – booleans, status codes, arrays of booleans or status codes, arrays of indices into the other arrays, and so on. That’s somewhere between one and ten thousand “is this data current” questions the developer might need to ask before making a change. There’s an awful lot of recalculation and duplication going on, leading to something that is (some coughing, something that might rhyme with “butter frappe”) hard to maintain. This is not a phenomenon unique to this code. It’s how all old code seems to grow without frequent weeding, and I’ve seen the pattern elsewhere more times than I can count. How do we avoid this? That’s where the title comes in.

  • Gather
    Get all of the “raw” information that will be relevant to your decision, in any code path.
  • Prepare
    Slice and dice all the information you got from the real world, converting it into what you need for your decision-making process.
  • Cook
    This is where all the thinking goes. Decide what exactly you’re going to do, then record the decision separately from the data that led to it.
  • Eat
    Execute your decision, using only the information from the previous stage.

The key here is never go back. Time is an arrow, not a wheel. The developer should iterate on decisions; the code should not. If you’re in the Cook or Eat phase and you feel a need to revisit the Gather or Prepare stages, it means that you didn’t do the earlier stages properly. If you’re worried that gathering all data for all code paths means always gathering some information that the actual code path won’t need, it probably means that you’re not structuring that data in the way that best supports your actual decision process. There are exceptions, I’m not going to pretend that a structure like this will solve all of your complex-decision problems for you, but what this pattern does is make all of those dependencies obvious enough to deal with. Having dozens of fields that are “private” to particular code flows and ignored by others is how we got into this mess. (Notice how OOP tends to make this worse instead of better?) Having those fields associated with stages is how we get out of it, because the progression between stages is much more obvious than the (future) choice of which code path to follow. All of those artifacts that lead to “do we have that” and “not quite what I wanted” sit right there and stare you in the face instead of lurking in the shadows, so they get fixed.

Use Big Data For Good

There seems to be a growing awareness that there’s something odd about the recent election. “How did Obama win the presidential race but Republicans get control of the House?” seems to be a common question. People who have never said “gerrymandering” are saying it now. What even I hadn’t realized was this (emphasis mine).

Although the Republicans won 55 percent of the House seats, they received less than half of the votes for members of the House of Representatives.
 – Geoffrey R. Stone

What does this have to do with Big Data? This is not a technical problem. Mostly I think it’s a problem that needs to be addressed at the state level, for example by passing ballot measures requiring that district boundaries be set by an independent directly-elected commission. Maybe those members could even be elected via Approval Voting or Single Transferable Vote – systems which IMO should actually be used to elect the congresscritters themselves, but that’s not feasible without establishing voter familiarity in a different context.

Here’s the technical part. Most of the Big Data “success stories” seem to involve the rich (who can afford to buy/run big clusters) getting richer by exploiting consumers and invading their privacy. Very rarely do I hear about good uses, such as tracking drug interactions or disease spread. Where are the “data scientists” doing real science? Here’s an opportunity, while the election and its consequences are fresh in everybody’s minds, for those tools to do some good. How about if we use Big Data tools and Machine Learning techniques to crunch through demographic data and at least come up with congressional-district proposals that meet some rationally debatable definition of fairness? Obviously the results themselves can’t just be used as is, nor can the algorithms or data sets be enshrined into law, but maybe at least the operative definitions and the results they produce can provide decent starting points for a commission or the people themselves to consider. It seems like a lot better goal than targeting ads, anyway.

The Future of Storage for Big Data

I probably shouldn’t post this. It’s unduly inflammatory in general, and unnecessarily insulting to the HDFS developers in particular. It will probably alienate some people, deepening divisions that are already too deep. On the other hand, there’s a “manifest destiny” attitude in the Hadoop community that really needs to be addressed. Throwing a bucket of ice water on someone is an obnoxious thing to do, but sometimes that’s what it takes to get their attention.

The other day I got into a discussion on Quora with Cloudera’s Jeff Hammerbacher, about how Impala uses storage. Now Jeff is someone I like, and respect immensely, but he said something in that discussion that I find deeply disturbing (emphasis mine).

MapReduce and Impala are just the first two of what are likely to be many frameworks for parallel processing that will run well over data stored in HDFS, HBase, and future data storage solutions built above HDFS.

I agree that there will always be multiple data stores within an organization. I suspect that as the storage and computational capabilities of Hadoop expand, the amount of data stored within Hadoop will come to dominate the amount of data stored outside Hadoop, and most analyses will be performed within the cluster.

That’s a pretty bold prediction, and a pretty HDFS-centric strategy for getting there. If HDFS is central to the Hadoop storage strategy, and most data is stored in Hadoop, then what he’s really saying is that most data will be stored in HDFS, and I really hope that’s not the plan because any Hadoop company that has that as a plan will lose. I’m a firm believer in polyglot storage. Some storage is optimized for reads and some for writes, some for large requests and some for small requests, some for semantically complex operations and some for simple operations, some for strong consistency and some for high performance. I don’t think the storage I work on is good for every purpose, and that’s even less the case for HDFS. It’s not even a complete general-purpose filesystem, let alone a good one. It was designed to support exactly one workload, and it shows. There’s no track record to show that the people interested in seeing it used more widely could pull off the fundamental change that would be required to make it a decent general-purpose filesystem. Thus, if the other parts of Hadoop are only tuned for HDFS, only tested with HDFS, stubbornly resistant to the idea of integrating with any other kind of storage, then users will be left with a painful choice.

  • Put data into HDFS even though it’s a poor fit for their other needs, because it’s the only thing that works well with the rest of Hadoop.
  • Put data into purpose-appropriate storage, with explicit import/export between that and HDFS.

Painful choices tend to spur development or adoption of more choices, and there are plenty of competitors ready to pick up that dropped ball. That leaves the HDFS die-hards with their own two choices.

  • Hire a bunch of real filesystem developers to play whack-a-mole with the myriad requirements of real general-purpose storage.
  • Abandon the “one HDFS to rule them all” data-lock-in strategy, and start playing nice with other kinds of storage.

Obviously, I think it’s better to let people use the storage they want, as and where it already exists, to serve their requirements instead of Hadoop developers’. That means a storage API which is stable, complete, well documented, and consistently used by the rest of Hadoop – i.e. practically the opposite of where Impala and libhdfs are today. It means actually testing with other kinds of storage, and actually collaborating with other people who are working on storage out in the real world. There’s no reason those future parallel processing frameworks should be limited to working well over HDFS, or why those future storage systems should be based on HDFS. There’s also little chance that they will be, in the presence of alternatives that are both more open and more attuned to users’ needs.

Perhaps some day most data will be connected to Hadoop, but it will never be within. The people who own that data won’t stand for it.

Stackable Exit Hooks for Bash

I’m just going to leave this here and then quietly back away before the flames start.

# Stackable atexit functionality for bash.
# Bash's "trap ... EXIT" is somewhat similar to libc's "atexit" with the
# limitation that such functions don't stack.  If you use this construct twice,
# the cleanup code in the second invocation *replaces* that in the first, so
# the first actually doesn't happen.  Oops.  This snippet shows a way to get
# stackable behavior by editing the current trap function to incorporate a new
# one, either at the beginning or the end.  That's a really cheesy thing to do,
# but it works.
function atexit_func {
	# Bash doesn't have anything like Python's 'pass' so we do nothing
	# this way instead.
	echo -n
trap "atexit_func" EXIT
# Call this function to have your cleanup called *before* others.
function atexit_prepend {
	tmpfile=$(mktemp atexit.XXXXXX)
	typeset -f atexit_func > $tmpfile
	echo -en "2a\n$1\n.\nw\nq\n" | ed - $tmpfile
	. $tmpfile
	rm $tmpfile
# Call this function to have your cleanup called *after* others.
function atexit_append {
	tmpfile=$(mktemp atexit.XXXXXX)
	typeset -f atexit_func > $tmpfile
	echo -en "\$i\n$1\n.\nw\nq\n" | ed - $tmpfile
	. $tmpfile
	rm $tmpfile
function first_atexit {
	echo "first atexit function"
atexit_append first_atexit
function second_atexit {
	echo "second atexit function"
atexit_append second_atexit
function third_atexit {
	echo "third atexit function"
atexit_prepend third_atexit
# Should see third/first/second here.

Rackspace Block Storage

A while ago, Rackspace announced their own block storage. I hesitate to say it’s equivalent to Amazon’s EBS, them being competitors and all, but that’s the quickest way to explain what it is/does. I thought the feature itself was long overdue, and the performance looked pretty good, so I said so on Twitter. I also resolved to give it a try, which I was finally able to do last night. Here are some observations.

  • Block storage is only available through their “next generation” (OpenStack based) cloud, and it’s clearly a young product. Attaching block devices to a server often took a disturbingly long time, during which the web interface would often show stale state. Detaching was even worse, and in one case took a support ticket and several hours before a developer could get it unstuck. If I didn’t already have experience with Rackspace’s excellent support folks, this might have been enough to make me wander off.
  • Still before I actually got to the block storage, I was pretty impressed with the I/O performance of the next-gen servers themselves. In my standard random-sync-write test, I was seeing over 8000 4KB IOPS. That’s a kind of weird number, clearly well beyond the typical handful of local disks but pretty low for SSD. In any case, it’s not bad for instance storage.
  • After seeing how well the instance storage did, I was pretty disappointed by the block storage I’d come to see. With that, I was barely able to get beyond 5000 IOPS, and it didn’t seem to make any difference at all if I was using SATA- or SSD-backed block storage. Those are still respectable numbers at $15/month for a minimum 100GB volume. Just for comparison, at Amazon’s prices that would get you a 25-IOPS EBS volume of the same size. Twenty-five, no typo. With the Rackspace version you also get a volume that you can reattach to a different server, while in the Amazon model the only way to get this kind of performance is with storage that’s permanently part of one instance (ditto for Storm on Demand).
  • Just for fun, I ran GlusterFS on these systems too. I used a replicated setup for comparison to previous results, getting up to 2400 IOPS vs. over 4000 for Amazon and over 5000 for Storm on Demand. To be honest, I think these numbers mostly reflect the providers’ networks rather than their storage. Three years ago when I was testing NoSQL systems, I noticed that Amazon’s network seemed much better than their competitors’ and that more than made up for a relative deficit in disk I/O. It seems like little has changed.

The bottom line is that Rackspace’s block storage is interesting, but perhaps not enough to displace others in this segment. Let’s take a look at IOPS per dollar for a two-node replicated GlusterFS configuration.

  • Amazon EBS: 1000 IOPS (provisioned) for $225/month or 4.4 IOPS/$ (server not included)
  • Amazon SSD: 4300 IOPS for $4464/month or 1.0 IOPS/$ (that’s pathetic)
  • Storm on Demand SSD: 5500 IOPS for $590/month or 9.3 IOPS/$
  • Rackspace instance storage: 3400 IOPS for $692/month (8GB instances) or 4.9 IOPS/$
  • Rackspace with 4x block storage per server: 9600 IOPS for $811/month or 11.8 IOPS/$ (hypothetical, assuming CPU or network don’t become bottlenecks)

Some time I’ll have to go back and actually test that last configuration, because I seriously doubt that the results would really be anywhere near that good and I suspect Storm would still remain on top. Maybe if the SSD volumes were really faster than the SATA volumes, which just didn’t seem to be the case when I tried them, things would be different. I should also test some other less-known providers such as CloudSigma or CleverKite, which also offer SSD instances at what seem to be competitive prices (though after Storm I’m wary of providers who do monthly billing with “credits” for unused time instead of true hourly billing).

Another Amazon Post Mortem

Amazon has posted an analysis of the recent EBS outage. Here’s what I would consider to be the root cause

this inability to contact a data collection server triggered a latent memory leak bug in the reporting agent on the storage servers. Rather than gracefully deal with the failed connection, the reporting agent continued trying to contact the collection server in a way that slowly consumed system memory

After that, predictably, the affected storage servers all slowly ground to a halt. It’s a perfect illustration of an important principle in distributed-system design.

System-level failures are more more likely to be caused by bugs or misconfiguration than by hardware faults.

It is important to write code that guardds not only against external problems but against internal ones as well. How might that have played out in this case? For one thing, something in the system could have required positive acknowledgement of the DNS update (it’s not clear why they relied on DNS updates at all instead of assigning a failed server’s address to its replacement). An alert should have been thrown when such positive acknowledgement was not forthcoming, or when storage servers reached a threshold of failed connection attempts. Another possibility would be from the Recovery Oriented Computing project: periodically reboot apparently healthy subsystems to eliminate precisely the kind of accumulated degradation that something like a memory leak would cause. A related idea is Netflix’s Chaos Monkey: reboot components periodically to make sure the recovery paths get exercised. Any of these measures – I admit they’re only obvious in hindsight, and that they’re all other people’s ideas – might have prevented the failure.

There are other more operations-oriented lessons from the Amazon analysis, such as the manual throttling that exacerbated the original problem, but from a developer’s perspective that’s what I get froom it.

Thoughts on Multitasking

A lot of people, especially in the geek community, have historically taken pride in their ability to multi-task. More recently, a lot of research has shown that multi-tasking is less effective than people think, leading many to a conclusion that multi-tasking really doesn’t exist. I think both sides are full of bunk.

On a CPU, there is a cost associated with switching from one task to another. Whether the switch is worth the cost depends on which task needs those cycles more. If the old task is likely to be blocked anyway and the new one is ready to go, then the switch is likely to be worth it. Conversely, if the old task still has work to do and the new one isn’t ready yet, then a switch would be a complete waste of time. As it turns out, on a typical computer most tasks are blocked most of the time, waiting for disks or networks or people. It’s relatively easy to detect and distinguish between these conditions, so multi-tasking works really well.

The problem is that we’re not like computers. For one thing, while a lot of things can happen at once in a human brain, we only have one “core” devoted to higher-level activities like coding, writing, or carrying on a conversation. For another, our brains are actually quite slow, so we don’t typically have a lot of idle cycles on either side of the “should we switch” equation. That slowness also means that our task switches are many orders of magnitude more expensive than those on computers – possibly seconds, depending on the complexity of the task we’re setting aside and the task we’re taking up, instead of microseconds. For human multi-tasking to work, we must make much more intelligent decisions about when to switch and when not to, based on much more subtle features of the old and new tasks. Even the decision to accept or reject an interruption takes significant time, which is why interruptions harm productivity so much. People who say they multi-task well usually mean that they can make accept/reject decisions quickly, but that doesn’t mean they make those decisions well – and there’s still the effect of the switch itself to consider. Besides being very slow, for us a task switch often SQUIRREL! Quick, can you remember what I was just saying three sentences ago? I doubt it, and that’s the point: unlike computers, when we switch our recall of where we were can be highly imperfect. We can get through the accept/reject part and the switch part quickly and still lose because in the process we’ve forgotten more context than switching was worth. We would still have been better off single-tasking.

The upshot is that you can train yourself to be multi-task more efficiently, but it’s an ability you should be reluctant to exercise. Unless you’re in one of those situations where you really should stop thinking about something because you’re overanalyzing or going around in circles (the infamous “finally figured out that bug while driving home” scenario), you should probably stick to what you’re doing until you’re done. Learn to schedule exactly the amount of work that you’re really able to do well, and do it in an organized way, instead of trying to be a hero by multi-tasking and doing them all poorly.

Today’s Memes

Some people will know what I’m talking about. Some won’t.

crazy-config meme

new-release meme

Second Debate Opening Remarks

“I would like to congratulate Mitt Romney on an excellent performance in the first debate. I would also like to offer him thanks and an apology. The thanks are for validating my belief that my policy positions are the ones that appeal to most Americans. If they didn’t, why would he have spent the entire debate adopting them? In one hour, he changed almost every sincerely held belief that helped to get him the Republican nomination, apparently hoping I’d reciprocate by adopting his positions. The apology is because I must decline. I’m a bit stubborn about standing by my beliefs. I’m sure he’d consider that a weakness, and I hope he can find it in his heart to forgive me.”