David Strauss tweeted an interesting comment about using filesystems (actually he said “block devices” but I think he really meant filesystems) for scale and high availability. I thought I was following him (I definitely am now) but in fact I saw the comment when it was retweeted by Jonathan Ellis. The conversation went on a while, but quickly reached a point where it became impossible to fit even a minimally useful response under 140 characters, so I volunteered to extract the conversation into blog form.
Before I start, I’d like to point out that I know both David and Jonathan. They’re both excellent engineers and excellent people. I also don’t know the context in which David originally made his statement. On the other hand, NoSQL/BigData folks pissing all over things they’re too lazy to understand has been a bit of a hot button for me lately (e.g. see Stop the Hate). So I’m perfectly willing to believe that David’s original statement was well intentioned, perhaps a bit hasty or taken out of context, but I also know that others with far less ability and integrity than he has are likely to take such comments even further out of context and use them in their ongoing “filesystems are irrelevant” marketing campaign. So here’s the conversation so far, rearranged to show the diverging threads of discussion and with some extra commentary from me.
What reasons, then? Wrong semantics, wrong performance profile, redundant wrt other layers of the system, etc. I think David and I probably agree that scale and HA should be implemented in the highest layer of any particular system, not duplicated across layers or pushed down into a lower layer to make it Somebody Else’s Problem (the mistake made by every project to make the HDFS NameNode highly available). However, not all systems have the same layers. If what you need is a filesystem, then the filesystem layer might very well be the right place to deal with these issues (at least as they pertain to data rather than computation). If what you need is a column-oriented database, that might be the right place. This is where I think the original very general statement fails, though it seems likely that David was making it in a context where layering two systems had been suggested.
I was rather amused by David quoting my own answer (to a question on the Gluster community site) back at me, but also a bit mystified by the apparent change of gears. Wasn’t this about administrative complexity a moment ago? Now it’s about consistency behavior?
I’ll point out here that Cassandra’s handling of Hinted Handoff has only very recently reached the standard David seems to be advocating, and was pretty “funny” (to use his term) before that. The other Dynamo-derived projects have also done well in this regard, but other “filesystem alternatives” have behavior that’s too pathetic to be funny.
Perhaps this is true of filesystems in practice, but it’s not inherent in the filesystem model. I think it has more to do with who’s working on filesystems, who’s working on databases, who’s working on distributed systems, and how people in all of those communities relate to one another. It just so happens that the convergence of database and distributed-systems work is a bit further along, but I personally intend to apply a lot of the same distributed-system techniques in a filesystem context and I see no special impediment to doing so.
As I’ve pointed out before, I disagree. Even systems that do need this feature need not – and IMO should not – try to do both local/sync and remote/async replication within a single framework. They’re different beasts, most relevantly with respect to split brain being a normal operating mode. I’ve spent my share of time pointing out to Stonebraker and other NewSQL folks that partitions really do occur even within a single data center, but they’re far from being a normal case there and that does affect how one arranges the code to handle it.
I think that about covers it. As I said, I disagree with the original statement in its general form, but might find myself agreeing with it in a specific context. As I see it, aggregating local filesystems to provide a single storage pool with a filesystem interface and aggregating local filesystems to provide a single storage pool with another interface (such as a column-oriented database) aren’t even different enough to say that one is definitely preferable to the other. The same fundamental issues, and many of the same techniques, apply to both. Saying that filesystems are the wrong way to address scale is like saying that a magnetic #3 Phillips screwdriver is the wrong way to turn a screw. Sometimes it is exactly the right tool, and other times the “right” tool isn’t as different from the “wrong” tool as its makers would have you believe.
Hi Jeff,
Thanks for taking on the unenviable job of converting a conversation on Twitter into a better medium. I’ll take this as an opportunity to clarify what wouldn’t fit well into 140-character posts on Twitter. But, I’ll start by saying that of the systems I was critiquing, GlusterFS is not nearly as troublesome as DRBD and the NAS/SAN/NetApp world. The engineering work I’ve seen on GlusterFS is damn impressive and takes distributes file systems nearly as far as I can see them going without compromising on what is understood to be a *real* file system.
“block dev != filesystem”
Yes, I realize that they’re not equivalent. But, assuming file system to mean a *POSIX* file system (as GlusterFS implements), they share design constraints that disallow guaranteed resolution of split-brain situations. Clustered, raw block devices are much worse off than clustered file systems because block devices can’t tolerate any divergence and still guarantee consistency. File systems, like GlusterFS, can allow some split-brain divergence, such as two people updating the same spreadsheet, but that may still require manual administrator resolution. Other scenarios, like one user deleting a directory (rmdir /tmp/one) while another user creates a file in that directory (touch /tmp/one/two) don’t have any split-brain-capable, POSIX-compliant solution.
“What reasons, then? Wrong semantics, wrong performance profile, redundant wrt other layers of the system, etc. I think David and I probably agree that scale and HA should be implemented in the highest layer of any particular system, not duplicated across layers or pushed down into a lower layer to make it Somebody Else’s Problem (the mistake made by every project to make the HDFS NameNode highly available). However, not all systems have the same layers. If what you need is a filesystem, then the filesystem layer might very well be the right place to deal with these issues (at least as they pertain to data rather than computation). If what you need is a column-oriented database, that might be the right place. This is where I think the original very general statement fails, though it seems likely that David was making it in a context where layering two systems had been suggested.”
This is *very* well put, and we’re in agreement that HA and scale belong at the highest abstraction level. The problem I see is that a POSIX file system is almost never the highest level. It is Unix’s way for applications to make locking, permissions, and BLOB storage Somebody Else’s Problem. This made a ton of sense for traditional systems, but POSIX file systems are not conducive to distributed implementation because they have all sorts of scenarios where split-brain cannot be guaranteed to be recoverable.
Web applications use file systems for BLOB storage, but they’re really just going for a mapping from file names to binary content. Shared network drives for documents usually serve as browse-able instances of the former use case. Databases just want addressable, persistent storage — no directories really required. Using a POSIX foundation for all of these provides a superset of the required capabilities — and, as a result, a superset of the necessary restrictions on ways to make the system split-brain friendly.
For web applications and shared network drives, it’s fairly simple to design storage that works for users, is distributable, and guarantees that no unrecoverable split-brain scenario exists (without compromising availability). One such system is S3. For databases, HA and scale are best implemented in the database system itself. I just can’t think of a scenario where a distributed, POSIX-like file system is anything but a stop-gap to implementing a distributed system that better matches the actual requirements and allows better distributed capabilities.
“I’ll point out here that Cassandra’s handling of Hinted Handoff has only very recently reached the standard David seems to be advocating, and was pretty ‘funny’ (to use his term) before that. The other Dynamo-derived projects have also done well in this regard, but other ‘filesystem alternatives’ have behavior that’s too pathetic to be funny.”
Hinted handoff has never been used in Cassandra to satisfy consistency requirements for an operation [1]. It’s simply a way to reduce the window of inconsistency when an operation only needs eventual consistency. It’s like an inbox that piles up for a node that’s AWOL.
“Perhaps this is true of filesystems in practice, but it’s not inherent in the filesystem model. I think it has more to do with who’s working on filesystems, who’s working on databases, who’s working on distributed systems, and how people in all of those communities relate to one another. It just so happens that the convergence of database and distributed-systems work is a bit further along, but I personally intend to apply a lot of the same distributed-system techniques in a filesystem context and I see no special impediment to doing so.”
The reason I think high-quality distributed operation eludes file systems is the same reason I think it eludes most relational databases: The design of the interface (POSIX or SQL) assumes global read-after-write, and attempts to distribute that sort of interface and allow operation in a split-brain mode make the abstraction leaky. I feel safer building systems on fresh, non-leaky abstractions.
“[Local/sync and remote/async are] different beasts, most relevantly with respect to split brain being a normal operating mode. I’ve spent my share of time pointing out to Stonebraker and other NewSQL folks that partitions really do occur even within a single data center, but they’re far from being a normal case there and that does affect how one arranges the code to handle it.”
Split-brain in a single data center may not be an everyday thing, but implementing the hardware and software to manage it safely is the source of a lot of complexity, like Heartbeat2, Pacemaker, fencing devices, and management cards. Choosing to not manage it safely is a recipe for mysterious issues later on. Even if software implements its own quorum-style system (as GlusterFS does), things get weird on either side of partitions, in terms of what operations are possible or guaranteed to not cause problems when the split resolves.
I think the vast majority of systems are possible to implement directly on an async, partition-friendly foundation without too much extra effort. The returns to administrators and developers are many-fold in terms of not worrying about fail-over, network hiccups, and recovery after a partition.
[1] http://www.datastax.com/dev/blog/understanding-hinted-handoff
This just keeps getting better. :) If I may Fisk a bit…
Other scenarios, like one user deleting a directory (rmdir /tmp/one) while another user creates a file in that directory (touch /tmp/one/two) don’t have any split-brain-capable, POSIX-compliant solution.
Yes, this is like the filesystem-specific version of the CAP theorem. Because of the complex namespace, some cases – like the one you mention – become extremely difficult to resolve. In a database you can say that if A’s operation on a row went first then B’s operation on the same row is simply applied on top of it. The operations might not be idempotent, but they are reorderable. In a filesystem, even that might not be the case. If A’s directory-removal operation went first then B’s file-creation operation might not even be valid any more, or in the opposite order the presence of a file in the to-be-removed directory should cause the removal to fail. This can only be resolved by falling back to strict serializable consistency for namespace operations – which we all know is incompatible with A+P even if that’s not true for weaker forms of consistency, and is generally bad for performance even when it works.
This is why I want to start the NoPOSIX movement, related to traditional filesystems the same way NoSQL (or NoACID) is related to traditional databases – same general form, selectively jettisoning features that make scale etc. difficult or impossible. Get rid of links in all their flavors. Get rid of cross-directory rename. Clean up the separation between inode and entry operations, especially the way open(2) does both in various orders and combinations depending on its arguments. Replace naive fsync semantics with more flexible and scalable order specification. And for God’s sake get rid of locking. The result would be something that’s not visibly different to the vast majority of users, still far more of a real filesystem than e.g. MogileFS or even HDFS, but that can be implemented much better without all the baggage.
The problem I see is that a POSIX file system is almost never the highest level. . .Databases just want addressable, persistent storage
This pair of statements suggests that you’re thinking of databases as the next level up. Sometimes yes, sometimes no, and even when it’s yes the database isn’t really the highest level either. It’s just where some people find it convenient to draw the line between “system” and “application” but that line can be drawn elsewhere just as easily. It might make sense for database needs to drive evolution of local filesystems, but distributed filesystems are a different matter entirely. Believe me, I know; I’m in a filesystem group, and getting the local-filesystem folks to think about anything distributed is like pulling teeth. Distributed filesystems and distributed databases don’t displace one another. They complement one another. Many people do need a database interface. Forcing them to use a filesystem interface is obnoxious, and building a database on top of a distributed filesystem is stupid. I’ll bet a lot of people think that’s pretty obvious, without realizing that it’s just as true the other way around. Database folks have been obnoxious and stupid in exactly this way for twenty years. Out here in reality, a smart application designer will use either or both depending on the specific needs of their application, not put one above another.
Hinted handoff has never been used in Cassandra to satisfy consistency requirements for an operation
Never said it was. I was addressing the “natural, guaranteed split-brain resolution” claim, and HH is very much a part of that picture.
The design of the interface (POSIX or SQL) assumes global read-after-write
It’s worse than that; for POSIX at least it practically has to be serializable in many cases.
I feel safer building systems on fresh, non-leaky abstractions.
So do I, but I also believe that filesystem interfaces can evolve toward less leakiness without abandoning their essential nature. Just like databases did, and S3-style blob stores did not.
Split-brain in a single data center may not be an everyday thing, but implementing the hardware and software to manage it safely is the source of a lot of complexity
Yes, I’ve implemented some of that software . . . since HACMP in 1992. I would never suggest foregoing that complexity altogether. My point was that the complexity for a rare case can be pushed out of the main code paths, while for a normal case it would have to stay there. This is similar to the fast-path/slow-path pattern that’s used to great effect in many of the most robust and high-performance systems in the world, such as disk arrays and core routers. The result can be a system that’s more modular, more maintainable, more operator-friendly, and faster than one where everything is entangled together on the slow path.
The returns to administrators and developers are many-fold in terms of not worrying about fail-over, network hiccups, and recovery after a partition.
Sounds nice, but those abstractions leak too. The distributed databases I’ve seen don’t seem to isolate the user from those issues any more than the distributed filesystems. They require just as much setup, just as much monitoring, and just as much head-scratching when things do go wrong. Some products in both areas are surely better than others, but there’s no intrinsic reason why a database can do these things better than a filesystem merely by virtue of which category it’s in.
Very interesting conversation guys. Please continue. :)
For next time you need to fetch a Twitter conversation: http://twitter.theinfo.org/154619745709195264
This is why I want to start the NoPOSIX movement
I fully support this idea. In fact, I built a very NoPOSIX file system for my company. It’s called Valhalla [1]. Conflict recovery is fairly primitive; it just keeps the full content of the file written last. However, it’s guaranteed to always recover from split brain.
In case you’re wondering, I handle the deleted-directory/new-file-in-that-directory conflict by keeping the file and implicitly restoring the directory. Every file is tracked by its full path from the root, so some of the operations you’ve cited as difficult are quite trivial, at the cost of directory listings being — in cases of deep nesting — far less efficient.
I was addressing the “natural, guaranteed split-brain resolution” claim, and HH is very much a part of that picture.
I’m still not clear on why you’re saying HH is messy from a consistency-after-split-brain basis. You could make an argument that it’s non-optimal from a computation or bandwidth perspective, but it shouldn’t affect whether Cassandra can automatically recover from a split brain.
It’s like the difference between someone who’s been out at the office checking their inbox versus asking every other employee if there’s something for them to see. Both cases will converge on the person getting everything. It’s just that the former case is usually more efficient. Now, if they’re out of the office long enough, maintaining their inbox might get unwieldy, including many items that aren’t necessary for them to see anymore; it might then be better to actually have them go around to every other employee to ask. In Cassandra, HH is just an organized way to maintain inboxes for offline nodes.
I would never suggest foregoing that complexity altogether.
Neither would I. If anything, I’m happy GlusterFS takes the administrator-friendly approach of embedding the necessary failure-handling mechanisms rather than punting them to a crude, bolt-on solution like Heartbeat2/Pacemaker (like *cough* MySQL).
They require just as much setup, just as much monitoring, and just as much head-scratching when things do go wrong.
In my experience, the burden is mostly on developers to design their schema so that the data store’s conflict resolution results in consistent, desirable results. It’s a challenging skill, especially for folks used to global read-after-write (and stronger) consistency guarantees for everything.
The worst I’ve encountered as an administrator is a clock sync issue. Writes via node A were sometimes followed by an update to the same data, but written via node B. If the writes were close enough, it resulted in node A’s version winning because node A’s clock was fast. Synchronizing the clocks fixed it, but a more robust change would have been to make the updates non-overlapping to avoid being so sensitive to clock synchronization. It also wouldn’t have been an issue if Cassandra used vector clocks, but there are respectable reasons it doesn’t.
–David
P.S. It would be helpful if your blog documented acceptable markup near the comments box.
[1] https://getpantheon.com/news/inside-pantheon-valhalla-filesystem
I’ll also go ahead and read your history of posts on POSIX, since they seem to be numerous and well thought-out. In my original remark on Twitter, I assumed the goal of GlusterFS was to maximize POSIX compliance yet be fully distributed. (I based this on seeing “type storage/posix” in most of the volume configuration documentation and lack of guaranteed conflict resolution in some cases.) Reconciling the two is not really possible, and, having read your remarks here, it seems like we’re actually on the same page there. I’m not sure there’s anything left to debate. :-)
–David
For a drama-free solution take a look at moosefs. POSIX-compliant distributed filesystem (64MB file chunks) and comes with awesome provided tools. With 150TB+ of aggregate moosefs’ storage and growing fast, I’m still in awe of it’s resilience, it’s ease, and it’s glaring SPOF.
I tried MooseFS a while back, along with GlusterFS (of course), Ceph, OrangeFS, XtreemFS, HDFS, Sphere, and probably at least one other that I can’t remember. Of those, only GlusterFS and XtreemFS could even get through a simple “write a few big files and then read them back” test without crashing (Ceph), hanging (MooseFS) or corrupting data (OrangeFS). When I tried to talk to the MooseFS developers about the issues I’d found, they mostly ignored me except when they were being downright rude. It’s a shame, really, because I would welcome more alternatives in this area and I think MooseFS in particular has the potential to be one of the better ones. Unfortunately that’s not likely to happen as long as the number of people working on it remains so small and people who try to help are actively turned away.
Jeff, I really wished I’d listened to you more when you posted this.
Starting to find some scary problems with MooseFS and the mailing list just ignores…you.
Frightening.
“POSIX-compliant distributed filesystem” always strikes fear into me. It either means it’s brittle (SPoF), complex (Paxos-like locking), or the author doesn’t know what POSIX compliance actually means.
BTW, another note about MooseFS, in case anyone else’s Google search leads here. ;) Check out how it handles O_SYNC . . . or more accurately fails to. Basically it just throws the flag away, claiming that your data has been written when in fact it has not and a hardware failure could leave you with an unrecoverable mess. Confirmed both by code inspection and by observing the code as it runs.
Hello Guys,
just interessting and informativ discussion. i came through this by searching in the web for a good cluster solutin and while i use moosefs with more or less happiness. one of my problems is the performance while iam using it as a private cluster for my blu-ray disks. syste 3x6x3TB SWRaid Nodes with an virtual Master and the Metalogger on Master and one Node. The Performance messured with dd was between 30-50MB. how ever you are the pros with experiences and i hope you can help. which one is a good solution for my mediacluster (glusterfs, moosefs???) and in future i will build up a websolution for which i need a fileclustersystem for webservices (mysql) and also virtual machines. hope you can argue for one or give pros and negs.