Yak Electrolysis

While I was reviewing a patch yesterday, I found code that used lots of distinct directory names for a series of tests – test1 would use brick1 and brick2, test2 would use brick3 and brick4, etc. I’ve run into this pattern myself, and it can be a bit of a maintenance problem as tests are added or removed. For example, in the test scripts for iwhd, there were multiple occasions when adding a test led to accidental reuse of names, and much non-hilarity ensued (everything about that project was non-hilarious but that’s a story for another time). The simplest pattern to deal with this is something like the following, which I suggested in a review comment:

sequence=0
...
# Test 1
sequence=$((sequence+1))
srcdir=/foo/bar$sequence
sequence=$((sequence+1))
dstdir=/foo/bar$sequence
...
# Test 2
sequence=$((sequence+1))
thedir=/foo/bar$sequence

This works pretty well, but the inline manipulation of $sequence kind of bugged me so I tried to put it in a function. My first try looked something like this.

sequence=0
 
function next_value {
    sequence=$((sequence+1))
    echo $sequence
}
 
thedir=/foo/bar/$(next_value)

Yeah, I hear the laughter. For those who didn’t get the joke yet, this falls prey to bash’s handling of variable scope and subshells. The $(next_value) construct ends up getting executed in a subshell, so changes it makes to variables aren’t reflected in the parent and you end up with the same value every time. I really should have stopped there, satisfying myself with the original inline version. Sure, that version can still hit the scope/subshell issue, but only if you use functions in your own code and not as a side effect of the idiom itself. I realized that getting around the scope/subshell issue would involve something ugly and inefficient, which is why I should have stopped, but I was intrigued. Surely, I thought, there should be a way to do this in an encapsulated and yet robust way. The first idea was to stick the persistent context in a temporary file.
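In fairness, there is a way to keep the function without tripping over the subshell, if you're willing to have the caller pass in a variable name instead of capturing output. With printf -v (bash 3.1 and later) the function assigns directly to a named variable in the current shell, so no command substitution is involved at all. A sketch (the names here are mine, not from the patch):

```shell
#!/usr/bin/env bash
# No-subshell variant: instead of echoing the value for $(...) to
# capture, assign it to a caller-named variable with printf -v.
sequence=0

function next_value {
    sequence=$((sequence+1))
    printf -v "$1" '%s' "$sequence"    # set the variable named by $1
}

next_value seq; srcdir=/foo/bar$seq
next_value seq; dstdir=/foo/bar$seq
echo "$srcdir $dstdir"
```

Since next_value is never run in a subshell, the increment sticks, and you get bar1, bar2, and so on as intended.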

tmpfile=$(my_secure_tmpfile_generator)
echo 0 > "$tmpfile"   # seed the counter
 
function next_value {
    local prev_value next_value
    prev_value=$(cat "$tmpfile")
    next_value=$((prev_value+1))
    echo "$next_value" > "$tmpfile"
    echo "$next_value"
}

OK, it’s kind of icky, but it should work. Again, I should have stopped there, but that temporary file bothered me. Surely I could do that without the file I/O, perhaps by spawning a subprocess and talking to that through a pipe. Yes, folks, I had embarked on a quest to find the most insanely complicated way to solve a pretty simple problem. The result is generator.sh and here’s an example of how to use it.

source generator.sh
start_generator int_generator 5 6
...
dir1=/foo/bar$(next_value 5 6)
dir2=/foo/bar$(next_value 5 6)

Doesn’t look too bad, does it? OK, now go ahead and look at how it’s done. I dare you. Here are some of the funnier bits.

# start_generator
ctop=$(mktemp -t -u fifoXXXXXX)
mkfifo $ctop || fubar=1

Yes, really. Not polluting the filesystem with a temporary file was part of the point here, but I ended up dropping not one but two orts instead. (Cool word, and yes, I did use a thesaurus.) To be fair, these are only visible in the filesystem momentarily before they’re opened and then deleted, but still. I tried to find a way to do this with anonymous pipes, but there just didn’t quite seem to be a way to get bash to do that right. Here’s the next fun bit.
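To be fair to myself once more, the window really is tiny, and it can be made tinier: a fifo only needs its filesystem name long enough to be opened, and once a descriptor is attached the name can be removed. A minimal sketch of that trick in isolation (fd 5 is an arbitrary choice of mine):

```shell
#!/usr/bin/env bash
# A fifo's name can be unlinked as soon as a descriptor is open on it.
# Opening it read-write in one step avoids the usual fifo open() stall:
# a read-only or write-only open blocks until the other side shows up.
fifo=$(mktemp -t -u fifoXXXXXX)
mkfifo "$fifo" || exit 1
exec 5<>"$fifo"        # read-write open of a fifo never blocks
rm "$fifo"             # no ort left behind; fd 5 still works

echo hello >&5
read -u 5 line
echo "$line"
```

The descriptor keeps the pipe alive after the rm, so nothing lingers in the filesystem beyond the instant between mkfifo and exec.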

# start_generator
$1 < $ptoc >$ctop &
eval "exec $2> $ptoc"
eval "exec $3< $ctop"

The first line invokes the subprocess, with input and output fifos. The two execs are the bash way to create read and write file descriptors for a file. They’re wrapped in evals to satisfy my goal of making things as complicated as possible by allowing the caller to specify both the generator function/program and the file descriptors to use. Eval is very evil, of course, so let’s play Spot The Security Flaw.
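If you can count on bash 4.1 or later, there's a way to get the same flexibility without eval at all: {var}> redirection makes the shell allocate a free descriptor itself and store its number in the named variable, so no caller-supplied string ever reaches the parser. A sketch of the idea in isolation (the names here are mine):

```shell
#!/usr/bin/env bash
# {var}> redirection (bash 4.1+): the shell picks a free fd and stores
# its number in the variable, so no eval and no hardcoded fd numbers.
tmpfile=$(mktemp)

exec {wfd}>"$tmpfile"        # shell allocates a descriptor into $wfd
echo "hello via fd $wfd" >&$wfd
exec {wfd}>&-                # close it again

contents=$(cat "$tmpfile")
echo "$contents"
rm -f "$tmpfile"
```

With that, start_generator could hand the allocated fd numbers back to the caller instead of letting the caller inject them, which closes the eval hole entirely.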

start_generator int_generator "do_something_evil;"
# ...causes us to eval...
exec do_something_evil;> $ptoc

I'm not going to fix this, because it's only an "insider" threat. This code already runs with the same privilege as the caller, and can't do anything the caller couldn't do directly. They could also pass in a totally bogus generator function, and I'm not going to worry about that either, because they'd only be shooting themselves in the foot. On to the next fun piece.

# next_value
echo more 1>&$1
read -u $2 x

Again, this is kind of standard bash stuff to write to and then read from specific file descriptors. Having an example of this is one of the main reasons I didn't just throw away the script. With a little bit of tweaking, the same technique could be used as the basis for a general form of IPC to and from a subprocess, and that might be useful some day.
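For the record, bash 4's coproc builtin comes closest to the anonymous-pipe version I was looking for, with one catch: the descriptors it creates aren't visible in subshells, so they have to be duplicated onto ordinary fds before a $(...) can use them, which puts you right back in the fd-juggling business. A sketch, with a trivial stand-in for the generator (this is not generator.sh's actual int_generator):

```shell
#!/usr/bin/env bash
# The same request/response IPC using bash 4's coproc builtin, which
# uses anonymous pipes instead of fifos. The generator reads a request
# line and answers with the next integer.
int_generator() {
    local n=0
    while read -r _; do
        n=$((n+1))
        echo "$n"
    done
}

coproc GEN { int_generator; }

# Coproc fds vanish in subshells, so copy them onto plain fds first.
exec 5>&"${GEN[1]}" 6<&"${GEN[0]}"

next_value() {
    echo more >&5      # send a request to the generator
    local x
    read -u 6 x        # read its reply
    echo "$x"
}

dir1=/foo/bar$(next_value)
dir2=/foo/bar$(next_value)
echo "$dir1 $dir2"

exec 5>&- 6<&-         # done; close our copies
```

No fifo ever touches the filesystem this way, at the cost of requiring a newer bash and one more round of descriptor gymnastics.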

To reiterate: this is some of the craziest code I've ever written. It's way more complicated than other solutions that better satisfy any likely set of requirements, and the implementation threads its way through some particularly perilous bash minefields. FFS, I might as well have just used mktemp in the first place and skipped all of this. You'd have to be nuts to solve this problem this way, but maybe my documentation of the discoveries I made along the way will help someone solve a similar problem. Or maybe it's just a funny story about bash scripting gone horribly wrong.

Scaling Filesystems vs. Other Things

David Strauss tweeted an interesting comment about using filesystems (actually he said “block devices” but I think he really meant filesystems) for scale and high availability. I thought I was following him (I definitely am now) but in fact I saw the comment when it was retweeted by Jonathan Ellis. The conversation went on a while, but quickly reached a point where it became impossible to fit even a minimally useful response under 140 characters, so I volunteered to extract the conversation into blog form.

Before I start, I’d like to point out that I know both David and Jonathan. They’re both excellent engineers and excellent people. I also don’t know the context in which David originally made his statement. On the other hand, NoSQL/BigData folks pissing all over things they’re too lazy to understand has been a bit of a hot button for me lately (e.g. see Stop the Hate). So I’m perfectly willing to believe that David’s original statement was well intentioned, perhaps a bit hasty or taken out of context, but I also know that others with far less ability and integrity than he has are likely to take such comments even further out of context and use them in their ongoing “filesystems are irrelevant” marketing campaign. So here’s the conversation so far, rearranged to show the diverging threads of discussion and with some extra commentary from me.

DavidStrauss Block devices are the wrong place to scale and do HA. It’s always expensive (NetApp), unreliable (SPOF), or administratively complex (Gluster).

Obdurodon Huh? GlusterFS is *less* administratively complex than e.g. Cassandra. *Far* less. Also, block dev != filesystem.

Obdurodon It might not be the right choice for any particular case, but for reasons other than administrative complexity.
What reasons, then? Wrong semantics, wrong performance profile, redundant wrt other layers of the system, etc. I think David and I probably agree that scale and HA should be implemented in the highest layer of any particular system, not duplicated across layers or pushed down into a lower layer to make it Somebody Else’s Problem (the mistake made by every project to make the HDFS NameNode highly available). However, not all systems have the same layers. If what you need is a filesystem, then the filesystem layer might very well be the right place to deal with these issues (at least as they pertain to data rather than computation). If what you need is a column-oriented database, that might be the right place. This is where I think the original very general statement fails, though it seems likely that David was making it in a context where layering two systems had been suggested.

DavidStrauss GlusterFS is good as it gets but can still get funny under split-brain given the file system approach: http://t.co/nRu1wNqI
I was rather amused by David quoting my own answer (to a question on the Gluster community site) back at me, but also a bit mystified by the apparent change of gears. Wasn’t this about administrative complexity a moment ago? Now it’s about consistency behavior?

Obdurodon I don’t think the new behavior (in my answer) is markedly weirder than alternatives, or related to being a filesystem.

DavidStrauss It’s related to it being a filesystem because the consistency model doesn’t include a natural, guaranteed split-brain resolution.

Obdurodon Those “guarantees” have been routinely violated by most other systems too. I’m not sure why you’d single out just one.
I’ll point out here that Cassandra’s handling of Hinted Handoff has only very recently reached the standard David seems to be advocating, and was pretty “funny” (to use his term) before that. The other Dynamo-derived projects have also done well in this regard, but other “filesystem alternatives” have behavior that’s too pathetic to be funny.

DavidStrauss I’m not singling out Gluster. I think elegant split-brain recovery eludes all distributed POSIX/block device systems.
Perhaps this is true of filesystems in practice, but it’s not inherent in the filesystem model. I think it has more to do with who’s working on filesystems, who’s working on databases, who’s working on distributed systems, and how people in all of those communities relate to one another. It just so happens that the convergence of database and distributed-systems work is a bit further along, but I personally intend to apply a lot of the same distributed-system techniques in a filesystem context and I see no special impediment to doing so.

DavidStrauss #Gluster has also come a long way in admin complexity, but high-latency (geo) replication still requires manual failover.

Obdurodon Yes, IMO geosync in its current form is très lame. That’s why I still want to do *real* wide-area replication.

DavidStrauss Top-notch geo replication requires embracing split-brain as a normal operating mode and having guaranteed, predictable recovery.

Obdurodon Agreed wrt geo-replication, but that still doesn’t support your first general statement since not all systems need that.

DavidStrauss Agreed on need for geo-replication, but geo-repl. issues are just an amplified version of issues experienced in any cluster.
As I’ve pointed out before, I disagree. Even systems that do need this feature need not – and IMO should not – try to do both local/sync and remote/async replication within a single framework. They’re different beasts, most relevantly with respect to split brain being a normal operating mode. I’ve spent my share of time pointing out to Stonebraker and other NewSQL folks that partitions really do occur even within a single data center, but they’re far from being a normal case there and that does affect how one arranges the code to handle it.

Obdurodon I’m loving this conversation, but Twitter might not be the right forum. I’ll extract into a blog post.

DavidStrauss You mean complex, theoretical distributed systems issues aren’t best handled in 140 characters or less? :-)

I think that about covers it. As I said, I disagree with the original statement in its general form, but might find myself agreeing with it in a specific context. As I see it, aggregating local filesystems to provide a single storage pool with a filesystem interface and aggregating local filesystems to provide a single storage pool with another interface (such as a column-oriented database) aren’t even different enough to say that one is definitely preferable to the other. The same fundamental issues, and many of the same techniques, apply to both. Saying that filesystems are the wrong way to address scale is like saying that a magnetic #3 Phillips screwdriver is the wrong way to turn a screw. Sometimes it is exactly the right tool, and other times the “right” tool isn’t as different from the “wrong” tool as its makers would have you believe.