Archive for February, 2011

Checking out the Competition

I spent most of last week on the west coast, for a combination of customer/partner visits and FAST’11. It was really cool to be able to reconnect with some people I’d not seen for a long time, meet a whole bunch of new people, and generally see what’s going on both on the academic vanguard and back in the trenches. Here are a few of the papers I particularly enjoyed:

  • A Scheduling Framework That Makes Any Disk Schedulers Non-Work-Conserving Solely Based on Request Characteristics
    Yuehai Xu and Song Jiang, Wayne State University
  • Scale and Concurrency of GIGA+: File System Directories with Millions of Files
    Swapnil Patil and Garth Gibson, Carnegie Mellon University
  • AONT-RS: Blending Security and Performance in Dispersed Storage Systems
    Jason K. Resch, Cleversafe, Inc.; James S. Plank, University of Tennessee

The Acunu folks were there to talk about re-writing the entire storage stack to suit their one use case. Yes, I am rolling my eyes at that. The most interesting thing at the conference for me, though, was OrangeFS. As their FAQ explains, the “orange” and “blue” branches of PVFS diverged a while ago to facilitate research in different directions, and as of “fall 2010” the orange branch became the main branch of PVFS. It’s not entirely clear, but it seems that “PVFS2” still means the blue branch and “OrangeFS” means the orange (main) branch. In any case, many of the features of OrangeFS move it in a direction that makes it much more interesting to me.

  • Scalable huge-directory support (apparently based on the aforementioned GIGA+).
  • Better handling of small unaligned accesses.
  • Cross-server redundancy.
  • Secure access control.

With a company (Omnibond) behind it, I think we can also expect some improvement on the management/configuration and documentation fronts, which weren’t bad for an academic project but left a little to be desired for normal users. The natural question for those also interested in GlusterFS/CloudFS is how OrangeFS compares. One way I can answer that, from my own perspective, is to explain why I chose GlusterFS instead of PVFS2 as the basis for CloudFS. I had worked on PVFS2 some at SiCortex. I liked the code and I liked the people working on it, so it had the advantage of familiarity. There were basically three reasons why I didn’t go with it.

  • The lack of redundancy at the filesystem level really bugged me, and still does. “Use shared RAID and do your own heartbeat/failover if you care that much” is just not an answer to availability concerns nowadays. Never was, really. If OrangeFS did nothing other than address this, it would still be worthwhile.
  • As much as I liked and respected the people working on PVFS2, there just didn’t seem to be enough of them. In August 2009, the month I started working on CloudFS at Red Hat, there were three messages on pvfs2-developers and a big fat zero on pvfs2-users. The freenode IRC channel was dead dead dead. By way of contrast, there was plenty of activity on both the user and developer lists for GlusterFS, and its IRC channel was (and still is) fairly busy. Community health/strength matters a lot in open-source projects, so that became a factor as well.
  • PVFS2 is good, modular code, but not quite as modular as GlusterFS, where there are more layers and they all use the same API. The dizzying array of possible permutations, and the ease with which functionality can be moved between client and server merely by writing different volfiles (see the sketch after this list), has actually become a bit of a liability for Gluster themselves. Greater flexibility almost always means greater support burdens, and this system is very flexible, so it’s no surprise that Gluster is moving toward supporting only the configurations that its own tools would create. That flexibility is exactly what I needed for CloudFS, though. I could implement the CloudFS functionality in a different codebase, even a kernel codebase, but it would take significantly longer.
  • Let me repeat: I find no fault with PVFS. It’s a great piece of work, and the very existence of the orange/blue branches shows that it can be adapted for different environments and uses. It just wasn’t the right vehicle for me to do my work, given the directions and resource levels involved. It’s great that OrangeFS is addressing not only the cross-server redundancy issue but also others that have historically afflicted filesystems of this type, but I think that assessment mostly remains true. The secure-access-control piece is definitely interesting relative to CloudFS’s own goals. I look forward to learning more about that; maybe there’s something I can learn from (or outright steal) and apply in CloudFS.
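
Since volfiles keep coming up, here is a rough sketch of what that flexibility looks like in practice. This is an illustrative client-side volfile, not one from any real deployment; the host names, volume names, and brick layout are all made up. Each volfile stacks translators into a graph, and moving a translator such as cluster/replicate between the client-side and server-side graphs is how functionality migrates between client and server without touching any code:

    # client.vol: hypothetical client-side translator graph
    volume remote1
      type protocol/client
      option transport-type tcp
      option remote-host server1        # assumed server name
      option remote-subvolume brick     # must match a volume exported by server1
    end-volume

    volume remote2
      type protocol/client
      option transport-type tcp
      option remote-host server2
      option remote-subvolume brick
    end-volume

    # Placing replication here makes it a client-side function; the same
    # translator could just as easily sit in the servers’ volfiles instead.
    volume replicate
      type cluster/replicate
      subvolumes remote1 remote2
    end-volume

Rearranging that graph is just a matter of editing a couple of text files rather than patching code, which is exactly the kind of flexibility I needed for CloudFS.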

While I’m talking about other projects, I might as well mention a few more.

  • Despite all the organizational churn around Lustre stemming from the Oracle acquisition, I don’t see a lot of technical movement. Besides, the various HPC sites that serve its audience already cover every tiny development in that space quite well.
  • A less well-known project is MooseFS. It’s broadly similar to GlusterFS in terms of being FUSE-based and having built-in replication, but diverges in other ways. Their “metadata logger” approach to surviving metadata-server death is not quite as good as true distributed metadata IMO, but it’s still far better than some systems’ “one metadata server and if it dies or slows to a crawl then SUX2BU” approach. Built-in snapshots are a feature that really sets them apart. I ran it through some very basic tests on some of my machines, and it fared very well. I don’t mind saying that it was significantly better than GlusterFS for the workloads I tested, so it’s worth checking out.
  • Ceph is, well, Ceph. It’s a great project built on fantastic technology, and it’s making great progress. Sage Weil is one of the people I was most looking forward to meeting at FAST. Only time will tell whether its full in-kernel client implementation (vs. FUSE or NFS bridged to native as in GlusterFS) was really the best way to achieve the broader project goals. Probably the biggest issue with Ceph, though, is its relationship to btrfs. Rightly or wrongly, I think a lot of people see the two as inextricably tied and have adopted a “wait for btrfs to be finished before thinking about Ceph” attitude. That’s a shame, because I don’t think it’s really necessary to wait before giving Ceph a try, but that’s the way it seems.

Finally, there’s pNFS. I was in the pNFS BoF at FAST, and talked to some of the Panasas folks there, plus the subject came up in conversations about Lustre/Ceph/OrangeFS. There’s certainly a lot of interest, but there’s this 800-pound gorilla in the room: pNFS defines a protocol, not an implementation, and only the client implementation is being done out in the open. The serious server work – object layout at Panasas, file layout at NetApp, block layout at EMC – isn’t being done in a way that helps people who want to deploy an end-to-end solution using open-source software on hardware they choose. Add to that the fact that pNFS is all about the data flow and inherently doesn’t even try to solve metadata coordination/availability problems, and I think it’s fair to say that it just doesn’t occupy the same space as the other projects I’ve mentioned even though many seem to think it does. I think pNFS is best considered as an access method for a real distributed filesystem, not as a complete solution in and of itself. The first of these other projects to implement pNFS as an access method – much like GlusterFS already does with NFSv3 – will in my opinion be doing both themselves and the pNFS community a great favor. If I weren’t already busy with CloudFS I’d seriously consider heading in that direction myself.

But I am busy with CloudFS, and I guess I should get back to it now. ;)