One of the dangers of making something easier to do is that a lot of less skilled people will start doing it. One familiar example of this is writing multi-threaded code. All of a sudden everyone’s doing it, the vast majority without any understanding of the principles behind writing good multi-threaded code, so an awful lot of them make a complete hash of it. The same is beginning to be true of distributed code.

The example that has been on my mind lately, though, is filesystems. FUSE has made it a lot easier to write filesystems, so a lot more people are doing it. I generally consider that a good thing, and (unlike many of my kernel-filesystem-developer colleagues) I’m not going to look down my nose at FUSE filesystems just because they’re FUSE. After all, I just finished writing CassFS in my spare time.

On the one hand, CassFS illustrates just how easy it can be to slap a basic filesystem interface on top of something else. It took me about twenty hours’ worth of spare time, and I’m not Zed Shaw, so I’ll give credit to FUSE instead of pretending it’s proof of my own awesomeness. On the other hand, CassFS is also an example of how badly a FUSE filesystem can suck. I won’t go into details here, since I already did, but my point is that CassFS is no worse than a bunch of other FUSE filesystems out there, and some of those projects’ authors still act like their little brain-fart is equal to the more mature efforts out there. That does bug me.

It’s great that technologies like FUSE allow people to do something that would previously have been out of reach for them. It’s not so great that the people who’ve been working on the truly hard problems in this area for ten years or more, and who might expect credit or even profit for those efforts, have to “share the stage” with people who just got basic read/write of a single file by a single process working.

That brings me to the real topic of this article. There are a lot of parallel/distributed filesystems and other data stores out there nowadays. Some of their authors are making pretty grandiose claims because their pet project does exactly one thing well, and when they tested that one thing against better-known alternatives it didn’t do too badly. Well, sorry, but that doesn’t cut it. It’s like “racing” the guy in the car next to you who doesn’t even know you’re there, because he’s busy doing what he should be doing: paying attention to conditions up ahead. If you want your parallel/distributed filesystem to be taken seriously, you have to meet at least the following criteria.

  1. Support practically all of the standard filesystem entry points with reasonable behavior – not just read/write but link/symlink operations, chown/chmod, rename, stat returning reasonable info, etc.
  2. Have distributed metadata, not a single metadata-server SPOF/bottleneck.
  3. Provide intra-file striping, both for high-performance access to a single file from one or many nodes/processes (the latter precluding whole-file locks) and for even data distribution across servers.
  4. Support RDMA-style as well as socket-style interconnects, also for high performance.

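To make criterion 3 concrete: the simplest form of intra-file striping is a round-robin mapping from a logical file offset to a server and an offset within that server’s local object. This is only a sketch of the general idea, not any particular filesystem’s layout scheme, and the function name and parameters are mine:

```python
def stripe_location(offset, stripe_size, num_servers):
    """Map a logical file offset to (server index, offset within that
    server's local object) under simple round-robin striping."""
    stripe_index = offset // stripe_size   # which stripe unit the offset falls in
    server = stripe_index % num_servers    # stripe units rotate across servers
    # Offset within the server's local object: complete rounds already
    # stored there, plus the position inside the current stripe unit.
    local_offset = (stripe_index // num_servers) * stripe_size + (offset % stripe_size)
    return server, local_offset

# With 64 KiB stripes across 4 servers: bytes 0-65535 land on server 0,
# 65536-131071 on server 1, and offset 262144 wraps back to server 0.
```

The payoff is exactly what the criterion asks for: a large sequential read fans out across all servers at once, and because different byte ranges live in disjoint stripe units, multiple writers can proceed with range locks instead of a whole-file lock.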
I’m aware of only three open-source alternatives that meet this standard, and dozens that don’t. Lustre failed criterion 2 when I worked on it, but claims to have gotten past that and I’ll give them the benefit of the doubt. PVFS2 also passes; some might quibble about whether their explicit rejection of certain obscure POSIX requirements allows them to meet criterion 1, but I think they’re close enough. GlusterFS also passes, though there’s some room for improvement on criterion 4. Of the rest, I suspect NFS4/pNFS advocates are the most likely to show up and object, but I don’t think NFS4/pNFS are even in the right space. They’re protocols, not implementations, and the existing open-source implementations don’t even address how to use the protocol features that were put in for this sort of thing. As far as I know, most if not all multi-server NFS4/pNFS implementations have used some other parallel filesystem on the back end to handle that, and it’s those other parallel filesystems (PVFS2 in one case but more often proprietary) that I’d consider.

If what you want is a real, mature parallel filesystem to deploy today, these are the ones you should look at. In another year or two, maybe some other very exciting and promising projects will join the list. Ceph is my favorite candidate, along with POHMELFS and HAMMER. Such things are great to play with, but I don’t think I’ll be putting my home directory on one. Come to think of it, I never got around to putting my home directory on any of the Big Three either. Maybe once I’m done with my current subproject I’ll take a big bite of my own dogfood.