Canned Platypus

Making the world better, one byte at a time.

Jan
4

Cassandra Filesystem

Over the holidays I had planned to work on a FUSE interface to Cassandra. Yeah, it’s a silly idea. I’m not doing it because it’s useful. Mostly I’m just doing it because I can. I like to play with code even when I’m not working, so even though this involves two work-related technologies I consider it a form of leisure. As it turns out, I didn’t get much of a chance to work on it. I always thought vacations were supposed to be voluntary and either restful or enjoyable, but when the timing is dictated by my employer and most of the time is spent enabling someone else’s rest or enjoyment then I think a different term is necessary. I was able to squeeze three or four evenings’ worth of coding around my second job, though, and I don’t know when I’ll be able to get back to it, so this is a status update of sorts. Here’s what I have so far.

  • Data structures and key-naming conventions – roughly equivalent to the on-disk format of a disk-based filesystem.
  • Code to manipulate those structures in several important ways, including inode and block allocation.
  • Code to create/mount a filesystem and create/list arbitrarily nested subdirectories.
  • Code to create and read/write string-valued “files” within those subdirectories (including rewrite).
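
To make the first bullet more concrete, here is a minimal sketch of what key-naming conventions for a filesystem on a key/value store could look like. The key formats (`sb`, `i:<n>`, `b:<n>`), the JSON encoding, and the plain dict standing in for Cassandra are all assumptions made for illustration; they are not the actual layout used in this code.

```python
import json

# Stand-in for the key/value store. A real version would talk to Cassandra;
# this dict and every key format below are hypothetical.
store = {}

SUPERBLOCK_KEY = "sb"


def inode_key(num):
    """Key under which inode number `num` is stored."""
    return f"i:{num:08x}"


def block_key(num):
    """Key under which data block number `num` is stored."""
    return f"b:{num:08x}"


def mkfs():
    """Create a superblock and an empty root directory (inode 1)."""
    store[SUPERBLOCK_KEY] = json.dumps({"next_inode": 2, "next_block": 1})
    store[inode_key(1)] = json.dumps({"type": "dir", "size": 0, "blocks": []})


def alloc(kind):
    """Allocate the next inode or block number by rewriting the superblock."""
    sb = json.loads(store[SUPERBLOCK_KEY])
    num = sb[f"next_{kind}"]
    sb[f"next_{kind}"] = num + 1
    store[SUPERBLOCK_KEY] = json.dumps(sb)
    return num


mkfs()
ino = alloc("inode")
blk = alloc("block")
store[block_key(blk)] = "hello"
store[inode_key(ino)] = json.dumps({"type": "file", "size": 5, "blocks": [blk]})
```

Note that every allocation in this sketch reads and rewrites the superblock key, which is the simplest thing that works but not the cheapest.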

That’s really not much, but what’s probably more important than the current functionality is the structure that holds it all together. If I’d set out to implement FTP-like get/put on whole files I would have done that within a much simpler structure, but I just don’t consider that functionality interesting. I very consciously took the slower route of implementing things the way they’ll need to be for FUSE, and that integration is the obvious next step. I should be able to knock out mount, lookup, mkdir, and create/open pretty quickly at this point. I consider incremental read/write of only the affected portions within an arbitrarily large file (as opposed to reading or writing the whole thing) to be the most important feature of this whole project, and I’ve structured things so that full read/write support shouldn’t be difficult – though it’s necessarily a bit tedious. After implementing a few more calls (e.g. stat/fstat, opendir/readdir, maybe even symlink operations) the result might even be useful to someone besides myself. Then there are a lot of other things I could do…

  • I’d like to port the same functionality to other stores such as Voldemort, Tokyo/LightCloud, and/or Hail. Nothing so far particularly precludes that.
  • I probably won’t optimize around Cassandra’s multi-column data model, because that’s largely at odds with porting to other stores. Yes, I could implement yet another layer of abstraction between FUSE and the “block” level so that Cassandra could do certain things using columns and simpler stores could do them using similarly named keys, but it just wouldn’t be any fun. If anybody wants to pay me then my attitude might change, but as long as it’s a leisure-time project this seems unlikely.
  • I do intend to fix certain inefficiencies in how my own code works right now. For example, inode and block allocation hit the “superblock” key way too often. I have a very specific plan for how to do that better, but haven’t bothered to implement it. Similarly, file and directory creation both involve rewriting the entire parent directory and that’s nasty. Incremental directory updates are similar to incremental data updates, so once I have those done I’ll adapt the code.
  • I don’t intend to fix inefficiencies in how Cassandra works right now. The Thrift interface is ludicrously string-centric, forcing all kinds of copies and transformations that really shouldn’t be necessary, but fixing that would require a whole new bunch of work that I wouldn’t enjoy. See previous comment about for-pay work vs. leisure.
  • I do intend to fix some of my own general sloppiness – unchecked return values, probably memory leaks, general lack of modularity in some places.
  • I do not intend to implement any kind of multi-machine or multi-user support. That’s the kind of stuff I do for my day job; unless you want to offer me a new day job (for a lot of money) it’s both too much work and too much conflict of interest. That’s absolutely positively off the table as long as this is a hobby project.
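
One common way to cut down on superblock traffic is to reserve a whole range of numbers per superblock update and hand them out from memory. The sketch below illustrates that general technique only; it is not necessarily the specific plan mentioned above, and the `Store` and `BatchAllocator` names, the `sb:next_inode` key, and the batch size are all hypothetical.

```python
class Store:
    """Dict-backed stand-in for the key/value store that counts superblock writes."""

    def __init__(self):
        self.data = {"sb:next_inode": 1}
        self.sb_writes = 0

    def reserve(self, count):
        """Reserve `count` consecutive numbers with a single superblock update."""
        start = self.data["sb:next_inode"]
        self.data["sb:next_inode"] = start + count
        self.sb_writes += 1
        return start


class BatchAllocator:
    """Hands out numbers from a locally cached range, refilling when it runs dry."""

    def __init__(self, store, batch=64):
        self.store = store
        self.batch = batch
        self.next = 0
        self.limit = 0  # exclusive end of the currently reserved range

    def alloc(self):
        if self.next >= self.limit:
            self.next = self.store.reserve(self.batch)
            self.limit = self.next + self.batch
        num = self.next
        self.next += 1
        return num


store = Store()
allocator = BatchAllocator(store, batch=64)
numbers = [allocator.alloc() for _ in range(100)]
```

The trade-off is that numbers reserved but not yet handed out are lost if the process dies, which leaves gaps in the number space but does not cause duplicates.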

I don’t quite feel ready to post the code yet, though I might be persuaded. If you think it’s something you might actually be interested in working on, then by all means let me know and I’d be glad to let you have it privately. I just don’t see any point in posting it for every wannabe to pick at when at least half of it will probably change soon anyway. When I get to the point where I can mount via FUSE and unpack/build/run iozone within the resulting mountpoint, even if it’s slow and ugly, then I’ll probably put it on SourceForge under AGPL unless someone suggests another site/license.

Comments

  1. Jeff,

    I really enjoy reading about your experience and opinions on filesystems. Please do get this to a point where it can mount!

    If you don’t publish it someplace I would like to see the code myself.

    Cheers,
    John

  2. Mid-week update: not much progress. I’ve spent a couple of evenings on it, but mostly refactoring and cleanup. I decided I do want to get multi-block reads and writes working before I do the FUSE integration, so I’ve rearranged my test tool to support reading/writing to/from local files with arbitrary lengths/offsets on both sides. Now I need to write the code to walk through the relevant parts of the inode’s block list.
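
The block-list walk described here boils down to mapping a byte range onto the blocks it touches. A minimal sketch follows, assuming fixed-size blocks and a flat list of block numbers per inode; the 4096-byte block size and the function names are assumptions, not taken from the actual code.

```python
BLOCK_SIZE = 4096  # assumed block size


def block_spans(offset, length, block_size=BLOCK_SIZE):
    """Yield (block_index, start_in_block, nbytes) for each block a range touches."""
    end = offset + length
    while offset < end:
        index, start = divmod(offset, block_size)
        nbytes = min(block_size - start, end - offset)
        yield index, start, nbytes
        offset += nbytes


def read_range(block_list, blocks, offset, length):
    """Read `length` bytes at `offset`, fetching only the affected blocks.

    `block_list` is the inode's list of block numbers; `blocks` maps block
    number to bytes. Missing blocks and short blocks read back as zeros.
    This sketch does not clamp the read to the file size.
    """
    out = bytearray()
    for index, start, nbytes in block_spans(offset, length):
        data = b""
        if index < len(block_list):
            data = blocks.get(block_list[index], b"")
        out += data[start:start + nbytes].ljust(nbytes, b"\0")
    return bytes(out)
```

A write follows the same walk, except that a partially covered block has to be read, patched, and written back.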

    I also realized my “on-disk” structures weren’t clean wrt word size or byte order. Oh well, just not going to bother with it for now.
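
For reference, the usual fix for word-size and byte-order problems is to serialize every field with an explicit width and endianness instead of dumping in-memory structures. Here is a minimal sketch; the field set (inode number, size, mode, block list) and the choice of little-endian are assumptions, not the real CassFS format.

```python
import struct

# "<" means little-endian with no padding, so the layout is identical on
# every machine: inode number (u64), size (u64), mode (u32), block count (u32).
INODE_HEADER = struct.Struct("<QQII")


def pack_inode(number, size, mode, blocks):
    """Serialize an inode header followed by its block numbers (u64 each)."""
    header = INODE_HEADER.pack(number, size, mode, len(blocks))
    return header + struct.pack(f"<{len(blocks)}Q", *blocks)


def unpack_inode(raw):
    """Inverse of pack_inode; returns (number, size, mode, blocks)."""
    number, size, mode, count = INODE_HEADER.unpack_from(raw, 0)
    blocks = list(struct.unpack_from(f"<{count}Q", raw, INODE_HEADER.size))
    return number, size, mode, blocks
```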

  3. I think this is my eighth coding session, so ~20 hours so far. I split the code into a core library, CLI executable, and FUSE executable. Only read portions of the FUSE module are implemented so far, so all modifications have to be done via the CLI, but I can now mount, list directories, and read files. Implementing write should be trivial, because the FUSE interfaces are almost exactly the same as read. I’d also like to add at least mkdir, clean up some of the worst lameness, and generally package things up a bit more neatly before I publish the code, but that should only take one or two more sessions.

  4. Just curious: why AGPL? And why SourceForge for that matter – bitbucket/github/google code are much more usable these days…

  5. No particular reason for SourceForge, except maybe habit. GitHub would be just as good, I’m sure.

    AGPL, on the other hand, is a more deliberate choice. I’ve historically favored the BSD license, and in an ideal world that’s what I’d do, but I’ve come to believe that we’re too far from that ideal. I want anybody who makes improvements to contribute those back at least to the original project if not to the public at large, and there are too many people who fail to do that unless the license requires them to. The fact that such disclosure also precludes patenting any ideas expressed in derivative works makes such licenses even more appealing. Within the GPL family, I consider AGPL a necessity because the “as a service” loophole is unacceptable. I could also go with Apache, which of course is how Cassandra itself is licensed, and dual licensing is also a viable option IMO.

  6. Another session, another milestone. I can now extract and build iozone, with a couple of caveats. I have to run the FUSE daemon single-threaded, because the Thrift stuff isn’t MT-safe. Blech. I also can’t actually run iozone. When I try to run it directly, I run into problems because I haven’t implemented setattr (needed to chmod). When I copy the file to /tmp it starts but then fails some sort of internal sanity check – probably because of another unimplemented entry point – before it really gets going. One more session should take care of that, and then I’ll clean it up for the v0.01 code release.

  7. I’ve considered doing the same thing recently; I wonder how far you got?

    I am interested in how you store stuff in Cassandra; it’d have to be very different from a “normal” filesystem for sanity / performance reasons. Personally I intend to have two column families, one for directories (one key per directory) and one for files (one key per chunk of file). File metadata (i.e. that conventionally stored in inodes) would then be stored in the directory entries, which would mean that hard links cannot be supported.

    However I think that is probably the simplest way of making it work.
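
The two-column-family layout described in this comment can be sketched with two dicts standing in for the column families. The chunk size, the opaque numeric file IDs, and the function names are assumptions added for illustration; the comment does not specify them.

```python
import itertools

CHUNK_SIZE = 65536  # assumed chunk size

directories = {}  # column family 1: one row per directory, one column per entry
files = {}        # column family 2: one row per (file_id, chunk_number)
_next_id = itertools.count(1)  # hypothetical source of opaque file IDs


def mkdir(path):
    directories[path] = {}


def create(dir_path, name, data, mode=0o644):
    """Store metadata in the directory entry and the contents as chunks."""
    file_id = next(_next_id)
    directories[dir_path][name] = {"id": file_id, "mode": mode, "size": len(data)}
    for n, off in enumerate(range(0, len(data), CHUNK_SIZE)):
        files[(file_id, n)] = data[off:off + CHUNK_SIZE]


def read(dir_path, name):
    """Reassemble a file from its chunks using the size in the directory entry."""
    entry = directories[dir_path][name]
    nchunks = -(-entry["size"] // CHUNK_SIZE)  # ceiling division
    return b"".join(files[(entry["id"], n)] for n in range(nchunks))
```

Because the metadata lives in exactly one directory entry, a second name for the same file would need its own copy of that metadata, which is why hard links do not fit this layout.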

  8. Here’s the announcement, which includes a link to the source. As a professional filesystem developer, whose reputation might be harmed by anyone assuming CassFS is an example of how I do things when I’m serious about them, I cannot stress enough that CassFS is just a toy. It works well enough to demonstrate that such things are possible, and to illustrate some ideas, but is unsuitable for actual use because of all the deficiencies I admit in the announcement.

    With respect to your other point, I think hard links are one of those things that are all pain and no gain for filesystem developers, and not supporting them is an entirely reasonable choice (which Artur Bergman also made with riakfuse). Rename across directories is another example. Without hard links you can’t do link-new/remove-old, so you have to do copy-new/remove-old and leave markers so that you don’t end up with duplicate inodes pointing to the same data after an ill-timed crash. Overall, I favor neither storing inodes in directories nor using whole paths as row keys; the latter is causing Artur some pain with renaming directories. Similarly, relying on the row key to identify file chunks (i.e. without explicit references in the inode or in indirect blocks) can be problematic when it comes to updating or deleting sparse files.

    I’ve actually been thinking some about doing a more-real successor to CassFS, probably using Voldemort because Cassandra and Riak have already been done and Voldemort also has the vector-clock behavior I need. The basic principle would be that directories contain only single-path-element pointers to inodes, and an inode forms the root of a block tree so that writes can be done atomically by writing to “new space” with the write fully isolated until the inode is updated. The vector clocks come into play because that would allow me to fail an inode update if there’s a collision with another write, and then retry until the conflict is resolved. I’m thinking of using the Python FUSE bindings this time, just for convenience. Perhaps you, me, and Artur should put our heads together and see if we can come up with a common format/protocol that can be implemented across multiple languages and multiple back-end stores. Wouldn’t that be cool?
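
The write-to-new-space-then-swing-the-inode idea can be sketched as an optimistic retry loop. This is a simplified illustration only: a single integer version stands in for Voldemort's vector clocks, the inode is a flat block list rather than a block tree, and `VersionedStore`, `Conflict`, and `write_block` are hypothetical names rather than any real client API.

```python
import uuid


class Conflict(Exception):
    """Raised when a put is attempted against a stale version."""


class VersionedStore:
    """In-memory stand-in for a store that rejects writes based on stale versions."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))

    def put(self, key, value, expected_version):
        current, _ = self.get(key)
        if current != expected_version:
            raise Conflict(key)
        self.data[key] = (current + 1, value)


def write_block(store, inode_key, index, payload, max_retries=5):
    """Write a block to new space, then publish it by updating the inode."""
    new_key = f"blk:{uuid.uuid4().hex}"
    # The new block is invisible to readers until an inode references it.
    store.put(new_key, payload, expected_version=0)
    for _ in range(max_retries):
        version, blocks = store.get(inode_key)
        updated = list(blocks or [])
        updated.extend([None] * (index + 1 - len(updated)))
        updated[index] = new_key
        try:
            store.put(inode_key, updated, expected_version=version)
            return new_key
        except Conflict:
            continue  # someone else updated the inode; re-read and retry
    raise Conflict(inode_key)
```

A failed inode update leaves the new block orphaned rather than leaving the file half-written, so a real implementation would also need some way to garbage-collect unreferenced blocks.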
