David Strauss tweeted an interesting comment about using filesystems (actually he said “block devices” but I think he really meant filesystems) for scale and high availability. I thought I was following him (I definitely am now) but in fact I saw the comment when it was retweeted by Jonathan Ellis. The conversation went on a while, but quickly reached a point where it became impossible to fit even a minimally useful response under 140 characters, so I volunteered to extract the conversation into blog form.

Before I start, I’d like to point out that I know both David and Jonathan. They’re both excellent engineers and excellent people. I also don’t know the context in which David originally made his statement. On the other hand, NoSQL/BigData folks pissing all over things they’re too lazy to understand has been a bit of a hot button for me lately (e.g. see Stop the Hate). So I’m perfectly willing to believe that David’s original statement was well intentioned, perhaps a bit hasty or taken out of context, but I also know that others with far less ability and integrity than he has are likely to take such comments even further out of context and use them in their ongoing “filesystems are irrelevant” marketing campaign. So here’s the conversation so far, rearranged to show the diverging threads of discussion and with some extra commentary from me.

DavidStrauss Block devices are the wrong place to scale and do HA. It’s always expensive (NetApp), unreliable (SPOF), or administratively complex (Gluster).

Obdurodon Huh? GlusterFS is *less* administratively complex than e.g. Cassandra. *Far* less. Also, block dev != filesystem.

Obdurodon It might not be the right choice for any particular case, but for reasons other than administrative complexity.
What reasons, then? Wrong semantics, wrong performance profile, redundant wrt other layers of the system, etc. I think David and I probably agree that scale and HA should be implemented in the highest layer of any particular system, not duplicated across layers or pushed down into a lower layer to make it Somebody Else’s Problem (the mistake made by every project that has tried to make the HDFS NameNode highly available). However, not all systems have the same layers. If what you need is a filesystem, then the filesystem layer might very well be the right place to deal with these issues (at least as they pertain to data rather than computation). If what you need is a column-oriented database, that might be the right place. This is where I think the original very general statement fails, though it seems likely that David was making it in a context where layering two systems had been suggested.

DavidStrauss GlusterFS is as good as it gets but can still get funny under split-brain given the file system approach: http://t.co/nRu1wNqI
I was rather amused by David quoting my own answer (to a question on the Gluster community site) back at me, but also a bit mystified by the apparent change of gears. Wasn’t this about administrative complexity a moment ago? Now it’s about consistency behavior?

Obdurodon I don’t think the new behavior (in my answer) is markedly weirder than alternatives, or related to being a filesystem.

DavidStrauss It’s related to it being a filesystem because the consistency model doesn’t include a natural, guaranteed split-brain resolution.

Obdurodon Those “guarantees” have been routinely violated by most other systems too. I’m not sure why you’d single out just one.
I’ll point out here that Cassandra’s handling of Hinted Handoff has only very recently reached the standard David seems to be advocating, and was pretty “funny” (to use his term) before that. The other Dynamo-derived projects have also done well in this regard, but other “filesystem alternatives” have behavior that’s too pathetic to be funny.
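For anyone who hasn’t crossed paths with it, hinted handoff is the Dynamo-style mechanism at issue here: when a replica is unreachable, the coordinator remembers the write locally as a “hint” and replays it once the replica comes back. Here’s a bare-bones sketch of the idea in Python; the names and structure are mine, not Cassandra’s, and all of the “funny” behavior lives in what can happen to reads and hints in between.

```python
# Minimal hinted-handoff sketch (illustrative only, not Cassandra's implementation).
# A coordinator writes to every replica; for any replica that is down, it queues a
# "hint" locally and replays it when the replica returns.

from collections import defaultdict

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas              # replica objects with is_up()/store()
        self.hints = defaultdict(list)        # replica -> queued (key, value) writes

    def write(self, key, value):
        for replica in self.replicas:
            if replica.is_up():
                replica.store(key, value)
            else:
                # Replica unreachable: remember the write so it can be delivered later.
                self.hints[replica].append((key, value))

    def replay_hints(self):
        # Called when replicas rejoin. The "funny" part is everything that can
        # happen to reads (and to the hints themselves) in the meantime.
        for replica, queued in list(self.hints.items()):
            if replica.is_up():
                for key, value in queued:
                    replica.store(key, value)
                del self.hints[replica]
```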

DavidStrauss I’m not singling out Gluster. I think elegant split-brain recovery eludes all distributed POSIX/block device systems.
Perhaps this is true of filesystems in practice, but it’s not inherent in the filesystem model. I think it has more to do with who’s working on filesystems, who’s working on databases, who’s working on distributed systems, and how people in all of those communities relate to one another. It just so happens that the convergence of database and distributed-systems work is a bit further along, but I personally intend to apply a lot of the same distributed-system techniques in a filesystem context and I see no special impediment to doing so.
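To make that a bit more concrete, here’s the kind of technique I have in mind, sketched in Python with invented names rather than lifted from any real codebase: per-object version vectors that let you tell, after a partition heals, whether one copy strictly supersedes the other or whether the histories truly diverged and a resolution policy is needed. Nothing about it cares whether the objects are rows or files.

```python
# Version-vector comparison, a common building block for detecting whether two
# copies of an object diverged during a partition. Purely illustrative.

def compare(vv_a, vv_b):
    """Compare two version vectors (dicts of node -> counter).

    Returns "a_newer", "b_newer", "equal", or "conflict" (true split-brain:
    each side saw writes the other did not).
    """
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"      # divergent histories: needs a resolution policy
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

# Example: both sides accepted writes while partitioned, so neither dominates.
print(compare({"nodeA": 3, "nodeB": 1}, {"nodeA": 2, "nodeB": 2}))  # "conflict"
```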

DavidStrauss #Gluster has also come a long way in admin complexity, but high-latency (geo) replication still requires manual failover.

Obdurodon Yes, IMO geosync in its current form is très lame. That’s why I still want to do *real* wide-area replication.

DavidStrauss Top-notch geo replication requires embracing split-brain as a normal operating mode and having guaranteed, predictable recovery.

Obdurodon Agreed wrt geo-replication, but that still doesn’t support your first general statement since not all systems need that.

DavidStrauss Agreed on need for geo-replication, but geo-repl. issues are just an amplified version of issues experienced in any cluster.
As I’ve pointed out before, I disagree. Even systems that do need this feature need not – and IMO should not – try to do both local/sync and remote/async replication within a single framework. They’re different beasts, most relevantly with respect to split-brain being a normal operating mode. I’ve spent my share of time pointing out to Stonebraker and other NewSQL folks that partitions really do occur even within a single data center, but they’re far from being a normal case there and that does affect how one arranges the code to handle it.
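Roughly, the distinction I’m drawing looks like this; again, it’s an illustrative sketch with made-up names, not code from GlusterFS or anything else. Local/sync replication can afford to make the write path wait for its replicas, while remote/async replication has to acknowledge locally and let the other site catch up later, which is exactly why split-brain is a normal operating mode for the latter and an exceptional one for the former.

```python
# Illustrative contrast between local/sync and remote/async replication.
# Names and structure are invented for this sketch.

import queue
import threading

class SyncLocalReplicator:
    """Local replication: don't acknowledge until a quorum of replicas has the write."""
    def __init__(self, replicas, quorum):
        self.replicas = replicas
        self.quorum = quorum

    def write(self, key, value):
        acks = sum(1 for r in self.replicas if r.store(key, value))  # store() returns truthy on ack
        if acks < self.quorum:
            raise IOError("quorum not reached; surface the failure immediately")
        return True                         # caller only sees success after quorum

class AsyncGeoReplicator:
    """Geo-replication: acknowledge locally, ship to the remote site in the background.
    Split-brain is a normal operating mode here, so conflict handling must be explicit."""
    def __init__(self, local, remote):
        self.local = local
        self.remote = remote
        self.outbox = queue.Queue()
        threading.Thread(target=self._ship, daemon=True).start()

    def write(self, key, value):
        self.local.store(key, value)        # acknowledged as soon as the local write lands
        self.outbox.put((key, value))       # remote site catches up later
        return True

    def _ship(self):
        while True:
            key, value = self.outbox.get()
            self.remote.store(key, value)   # may race with writes made at the remote site
```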

Obdurodon I’m loving this conversation, but Twitter might not be the right forum. I’ll extract into a blog post.

DavidStrauss You mean complex, theoretical distributed systems issues aren’t best handled in 140 characters or less? :-)

I think that about covers it. As I said, I disagree with the original statement in its general form, but might find myself agreeing with it in a specific context. As I see it, aggregating local filesystems to provide a single storage pool with a filesystem interface and aggregating local filesystems to provide a single storage pool with another interface (such as a column-oriented database) aren’t even different enough to say that one is definitely preferable to the other. The same fundamental issues, and many of the same techniques, apply to both. Saying that filesystems are the wrong way to address scale is like saying that a magnetic #3 Phillips screwdriver is the wrong way to turn a screw. Sometimes it is exactly the right tool, and other times the “right” tool isn’t as different from the “wrong” tool as its makers would have you believe.