This Is Competition?

As I’m sure you’ve all noticed by now, I’ve become a bit sensitive about people bashing GlusterFS performance. It’s really hard to make even common workloads run well when everything has to go over a network. It’s impossible to make all workloads run well in that environment, and when people blame me for the speed of light I get a bit grouchy. There are a couple of alternatives that I’ve gotten particularly tired of hearing about, not because I fear losing in a fair fight but because I feel that their reality doesn’t match their hype. Either they’re doing things that I think a storage system shouldn’t do, or they don’t actually perform all that well, or both. When I found out that I could get my hands on some systems with distributed block storage based on one of these alternatives, it didn’t take me long to give it a try.

The first thing I did was check out the basic performance of the systems, without even touching the new distributed block storage. I was rather pleased to see that my standard torture test (random synchronous 4KB writes) would ramp very smoothly and consistently up to 25K IOPS. That’s more than respectable. That’s darn good – better IOPS/$ than I saw from any of the alternatives I mentioned last time. So I spun up some of the distributed stuff and ran my tests with high hopes.

[Figure: synchronous IOPS]

Ouch. That’s not totally awful, but it’s not particularly good and it’s not particularly consistent. Certainly not something I’d position as high-performance storage. At the higher thread counts it gets into a range that I wouldn’t be too terribly ashamed of for a distributed filesystem, but remember that this is block storage. There’s a local filesystem at each end, but the communication over the wire is all about blocks. It’s also directly integrated into the virtualization code, which should minimize context switches and copies. Thinking that the infrastructure just might not be handling the hard cases well, I tried throwing an easier test at it – sequential buffered 64KB writes.
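For reference, here is roughly what those two workloads look like as plain pwrite() loops: a minimal single-threaded sketch rather than the actual harness behind the graphs, with an arbitrary file name, request count, and 1GB random-write span. The real tests were obviously run at multiple thread counts.

/* Minimal sketch of the two write workloads described above.  This is an
 * assumed stand-in, not the harness actually used for the graphs. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double run_writes(const char *path, size_t bs, int count,
                         int extra_flags, int sequential)
{
    int fd = open(path, O_WRONLY | O_CREAT | extra_flags, 0644);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    char *buf = malloc(bs);
    memset(buf, 'x', bs);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < count; i++) {
        /* Random writes pick a block-aligned offset (a 1GB span at 4KB
         * blocks); sequential writes just advance block by block. */
        off_t block = sequential ? i : (rand() % (1 << 18));
        if (pwrite(fd, buf, bs, block * (off_t)bs) != (ssize_t)bs) {
            perror("pwrite");
            exit(1);
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    free(buf);
    close(fd);
    return count / secs;    /* IOPS */
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";

    /* "Torture test": random synchronous 4KB writes. */
    printf("sync random 4KB: %.0f IOPS\n",
           run_writes(path, 4096, 10000, O_SYNC, 0));

    /* "Easier" test: sequential buffered 64KB writes. */
    printf("buffered sequential 64KB: %.0f IOPS\n",
           run_writes(path, 65536, 10000, 0, 1));

    return 0;
}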

[Figure: buffered IOPS]

WTF? That’s even worse than the synchronous result! You can’t see it at this scale, but some of those lower numbers are single digit IOPS. I did the test three times, because I couldn’t believe my eyes, then went back and did the same for the synchronous test. I’m not sure if the consistent parts (such as the nose-dive from 16 to 18 threads all three times) or the inconsistent parts bother me more. That’s beyond disappointing, it’s beyond bad, it’s into shameful territory for everyone involved. Remember 25K IOPS for this hardware using local disks? Now even the one decent sample can’t reach a tenth of that, and that one sample stands out quite a bit from all the rest. Who would pay one penny more for so much less?

Yes, I feel better now. The next time someone mentions this particular alternative and says we should be more like them, I’ll show them how the lab darling fared in the real world. That’s a load off my mind.

The “Gather, Prepare, Cook, Eat” Design Pattern

This post is actually about an anti-pattern, which I’ll call the “grazing” pattern. Code wanders around, consuming new bits of information here and there, occasionally excreting new bits likewise. This works well “in the small” because all of the connections between inputs and outputs are easy to keep in your head. In code that has to make complex and important choices, such as what response to a failure will preserve users’ data instead of destroying it, such a casual approach quickly turns into a disaster. You repeatedly find yourself in some later stage of the code, responsible for initiating some kind of action, and you realize that you might or might not have some piece of information you need based on what code path you took to get there. So you recalculate it, or worse you re-fetch it from its origin. Or it’s not quite the information you need so you add a new field that refers to the old ones, but that gets unwieldy so you make a near copy of it instead (plus code to maintain the two slightly different versions). Sound familiar? That’s how (one kind of) technical debt accumulates.

Yes, of course I have a particular example in mind – GlusterFS’s AFR (Automatic File Replication) translator. There, we have dozens of multi-stage operations, which rely on up to maybe twenty pieces of information – booleans, status codes, arrays of booleans or status codes, arrays of indices into the other arrays, and so on. That’s somewhere between one thousand and ten thousand “is this data current” questions the developer might need to ask before making a change. There’s an awful lot of recalculation and duplication going on, leading to something that is (some coughing, something that might rhyme with “butter frappe”) hard to maintain. This is not a phenomenon unique to this code. It’s how all old code seems to grow without frequent weeding, and I’ve seen the pattern elsewhere more times than I can count. How do we avoid this? That’s where the title comes in.

  • Gather
    Get all of the “raw” information that will be relevant to your decision, in any code path.
  • Prepare
    Slice and dice all the information you got from the real world, converting it into what you need for your decision-making process.
  • Cook
    This is where all the thinking goes. Decide what exactly you’re going to do, then record the decision separately from the data that led to it.
  • Eat
    Execute your decision, using only the information from the previous stage.

The key here is to never go back. Time is an arrow, not a wheel. The developer should iterate on decisions; the code should not. If you’re in the Cook or Eat phase and you feel a need to revisit the Gather or Prepare stages, it means that you didn’t do the earlier stages properly. If you’re worried that gathering all data for all code paths means always gathering some information that the actual code path won’t need, it probably means that you’re not structuring that data in the way that best supports your actual decision process. There are exceptions; I’m not going to pretend that a structure like this will solve all of your complex-decision problems for you, but what this pattern does is make all of those dependencies obvious enough to deal with. Having dozens of fields that are “private” to particular code flows and ignored by others is how we got into this mess. (Notice how OOP tends to make this worse instead of better?) Having those fields associated with stages is how we get out of it, because the progression between stages is much more obvious than the (future) choice of which code path to follow. All of those artifacts that lead to “do we have that” and “not quite what I wanted” sit right there and stare you in the face instead of lurking in the shadows, so they get fixed.
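To make the stages concrete, here is a minimal C sketch of how a replicated-write decision might be laid out this way. None of these names or fields come from the actual AFR code, and the quorum-of-three policy is just an assumption; the only point is that each stage reads from the one before it and never reaches back.

/* Hypothetical sketch of the four stages for a replicated-write decision.
 * All names are invented for illustration; nothing here is real AFR code. */
#include <stdbool.h>
#include <stdio.h>

#define NCHILD 3

/* Gather: raw facts from the outside world, collected once, up front. */
struct gathered {
    int  write_errno[NCHILD];   /* per-replica status as reported */
    bool child_up[NCHILD];      /* per-replica liveness as reported */
};

/* Prepare: the raw facts recast into exactly what the decision needs. */
struct prepared {
    int  good_children;         /* replicas that are up and wrote OK */
    bool quorum_met;
};

/* Cook: the decision itself, recorded separately from the data. */
struct cooked {
    bool succeed_the_write;
    bool mark_pending_heal;
};

static void prepare(const struct gathered *g, struct prepared *p)
{
    p->good_children = 0;
    for (int i = 0; i < NCHILD; i++)
        if (g->child_up[i] && g->write_errno[i] == 0)
            p->good_children++;
    p->quorum_met = p->good_children > NCHILD / 2;
}

static void cook(const struct prepared *p, struct cooked *c)
{
    c->succeed_the_write = p->quorum_met;
    c->mark_pending_heal = p->good_children < NCHILD;
}

/* Eat: act only on the recorded decision, never on the raw data. */
static void eat(const struct cooked *c)
{
    printf("write %s%s\n",
           c->succeed_the_write ? "succeeds" : "fails",
           c->mark_pending_heal ? " (heal pending)" : "");
}

int main(void)
{
    struct gathered g = {
        .write_errno = { 0, 5 /* EIO */, 0 },
        .child_up    = { true, true, true },
    };
    struct prepared p;
    struct cooked   c;

    prepare(&g, &p);
    cook(&p, &c);
    eat(&c);
    return 0;
}

The payoff is that the “do we have that” questions disappear: anything Eat needs has to be in the cooked structure already, and if it isn’t, the gap shows up at the stage boundary instead of somewhere down a rarely-taken code path.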

Use Big Data For Good

There seems to be a growing awareness that there’s something odd about the recent election. “How did Obama win the presidential race but Republicans get control of the House?” seems to be a common question. People who have never said “gerrymandering” are saying it now. What even I hadn’t realized was this (emphasis mine).

Although the Republicans won 55 percent of the House seats, they received less than half of the votes for members of the House of Representatives.
 – Geoffrey R. Stone

What does this have to do with Big Data? This is not a technical problem. Mostly I think it’s a problem that needs to be addressed at the state level, for example by passing ballot measures requiring that district boundaries be set by an independent directly-elected commission. Maybe those members could even be elected via Approval Voting or Single Transferable Vote – systems which IMO should actually be used to elect the congresscritters themselves, but that’s not feasible without establishing voter familiarity in a different context.

Here’s the technical part. Most of the Big Data “success stories” seem to involve the rich (who can afford to buy/run big clusters) getting richer by exploiting consumers and invading their privacy. Very rarely do I hear about good uses, such as tracking drug interactions or disease spread. Where are the “data scientists” doing real science? Here’s an opportunity, while the election and its consequences are fresh in everybody’s minds, for those tools to do some good. How about if we use Big Data tools and Machine Learning techniques to crunch through demographic data and at least come up with congressional-district proposals that meet some rationally debatable definition of fairness? Obviously the results themselves can’t just be used as is, nor can the algorithms or data sets be enshrined into law, but maybe at least the operative definitions and the results they produce can provide decent starting points for a commission or the people themselves to consider. It seems like a lot better goal than targeting ads, anyway.
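As a toy example of what a “rationally debatable definition of fairness” might even look like, here is a short C sketch that compares each party’s share of seats to its share of votes, the same seats-versus-votes gap highlighted in the quote above. The numbers in main() are rough placeholders rather than official totals, and a serious analysis would need real district-level data and a far more defensible metric.

/* Toy seats-versus-votes comparison.  Vote and seat totals below are
 * placeholders, not official election data. */
#include <stdio.h>

struct party {
    const char *name;
    long votes;   /* total House votes for this party's candidates */
    int  seats;   /* House seats won */
};

int main(void)
{
    struct party parties[] = {
        { "Party A", 59000000, 201 },
        { "Party B", 58000000, 234 },
    };
    int nparties = sizeof(parties) / sizeof(parties[0]);

    long total_votes = 0;
    int total_seats = 0;
    for (int i = 0; i < nparties; i++) {
        total_votes += parties[i].votes;
        total_seats += parties[i].seats;
    }

    for (int i = 0; i < nparties; i++) {
        double vote_share = (double)parties[i].votes / total_votes;
        double seat_share = (double)parties[i].seats / total_seats;
        printf("%s: %.1f%% of votes, %.1f%% of seats (gap %+.1f)\n",
               parties[i].name, 100 * vote_share, 100 * seat_share,
               100 * (seat_share - vote_share));
    }
    return 0;
}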

The Future of Storage for Big Data

I probably shouldn’t post this. It’s unduly inflammatory in general, and unnecessarily insulting to the HDFS developers in particular. It will probably alienate some people, deepening divisions that are already too deep. On the other hand, there’s a “manifest destiny” attitude in the Hadoop community that really needs to be addressed. Throwing a bucket of ice water on someone is an obnoxious thing to do, but sometimes that’s what it takes to get their attention.

The other day I got into a discussion on Quora with Cloudera’s Jeff Hammerbacher about how Impala uses storage. Now Jeff is someone I like and respect immensely, but he said something in that discussion that I find deeply disturbing (emphasis mine).

MapReduce and Impala are just the first two of what are likely to be many frameworks for parallel processing that will run well over data stored in HDFS, HBase, and future data storage solutions built above HDFS.

I agree that there will always be multiple data stores within an organization. I suspect that as the storage and computational capabilities of Hadoop expand, the amount of data stored within Hadoop will come to dominate the amount of data stored outside Hadoop, and most analyses will be performed within the cluster.

That’s a pretty bold prediction, and a pretty HDFS-centric strategy for getting there. If HDFS is central to the Hadoop storage strategy, and most data is stored in Hadoop, then what he’s really saying is that most data will be stored in HDFS. I really hope that’s not the plan, because any Hadoop company that has that as a plan will lose. I’m a firm believer in polyglot storage. Some storage is optimized for reads and some for writes, some for large requests and some for small requests, some for semantically complex operations and some for simple operations, some for strong consistency and some for high performance. I don’t think the storage I work on is good for every purpose, and that’s even less the case for HDFS. It’s not even a complete general-purpose filesystem, let alone a good one. It was designed to support exactly one workload, and it shows. There’s no track record to show that the people interested in seeing it used more widely could pull off the fundamental change that would be required to make it a decent general-purpose filesystem. Thus, if the other parts of Hadoop are only tuned for HDFS, only tested with HDFS, and stubbornly resistant to the idea of integrating with any other kind of storage, then users will be left with a painful choice.

  • Put data into HDFS even though it’s a poor fit for their other needs, because it’s the only thing that works well with the rest of Hadoop.
  • Put data into purpose-appropriate storage, with explicit import/export between that and HDFS.

Painful choices tend to spur development or adoption of more choices, and there are plenty of competitors ready to pick up that dropped ball. That leaves the HDFS die-hards with their own two choices.

  • Hire a bunch of real filesystem developers to play whack-a-mole with the myriad requirements of real general-purpose storage.
  • Abandon the “one HDFS to rule them all” data-lock-in strategy, and start playing nice with other kinds of storage.

Obviously, I think it’s better to let people use the storage they want, as and where it already exists, to serve their requirements instead of Hadoop developers’. That means a storage API which is stable, complete, well documented, and consistently used by the rest of Hadoop – i.e. practically the opposite of where Impala and libhdfs are today. It means actually testing with other kinds of storage, and actually collaborating with other people who are working on storage out in the real world. There’s no reason those future parallel processing frameworks should be limited to working well over HDFS, and no reason those future storage systems should be based on HDFS. There’s also little chance that they will be, in the presence of alternatives that are both more open and more attuned to users’ needs.

Perhaps some day most data will be connected to Hadoop, but it will never be within. The people who own that data won’t stand for it.

Stackable Exit Hooks for Bash

I’m just going to leave this here and then quietly back away before the flames start.

 
#!/bin/bash
 
# Stackable atexit functionality for bash.
#
# Bash's "trap ... EXIT" is somewhat similar to libc's "atexit" with the
# limitation that such functions don't stack.  If you use this construct twice,
# the cleanup code in the second invocation *replaces* that in the first, so
# the first actually doesn't happen.  Oops.  This snippet shows a way to get
# stackable behavior by editing the current trap function to incorporate a new
# one, either at the beginning or the end.  That's a really cheesy thing to do,
# but it works.
 
function atexit_func {
	# Bash doesn't have anything like Python's 'pass' so we do nothing
	# this way instead.
	echo -n
}
trap "atexit_func" EXIT
 
# Call this function to have your cleanup called *before* others.
function atexit_prepend {
	tmpfile=$(mktemp atexit.XXXXXX)
	typeset -f atexit_func > $tmpfile
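	# The "2a" command appends the new function call right after line 2 of
	# the dumped definition (the opening brace), so it runs before the
	# existing body.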
	echo -en "2a\n$1\n.\nw\nq\n" | ed - $tmpfile
	. $tmpfile
	rm $tmpfile
}
 
# Call this function to have your cleanup called *after* others.
function atexit_append {
	tmpfile=$(mktemp atexit.XXXXXX)
	typeset -f atexit_func > $tmpfile
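	# The "$i" command inserts the new function call just before the last
	# line of the dumped definition (the closing brace), so it runs after
	# the existing body.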
	echo -en "\$i\n$1\n.\nw\nq\n" | ed - $tmpfile
	. $tmpfile
	rm $tmpfile
}
 
function first_atexit {
	echo "first atexit function"
}
atexit_append first_atexit
 
function second_atexit {
	echo "second atexit function"
}
atexit_append second_atexit
 
function third_atexit {
	echo "third atexit function"
}
atexit_prepend third_atexit
 
# Should see third/first/second here.