Those of you who follow me on Twitter have probably seen me mention VoldFS. It’s my latest spare-time coding project, in which I’m using FUSE (and Python) to implement a filesystem interface on top of Voldemort. To satisfy the curious, here’s a first cut at a FAQ.
Why a filesystem?
Voldemort is pretty easy to use, but providing a filesystem interface makes it possible for even “naive” programs that know nothing about Voldemort to access data stored there. It also provides things like directories, UNIX-style owners and groups, and byte-level access to arbitrarily large files. Lastly, it handles some common contention/consistency issues in what I hope is a useful way, instead of requiring the application to deal with those issues itself.
I wanted to try out some ideas for implementing a filesystem on top of a distributed data store where the pieces were addressed by arbitrary keys. I’d already written CassFS, and Artur Bergman already wrote RiakFS, so Voldemort was just kind of next. The main requirement besides key/value access (preferably with binary-safe keys and values) is that the data store support some sort of conditional-put functionality as the basis for how VoldFS handles contended writes (including directory changes). I very much plan to make VoldFS work on top of kumofs as well, and I or others might add support for other back ends too.
How does it perform?
Well enough to be usable, but it’s certainly no speed daemon.
Well, not because it’s FUSE and not because it’s Python (though Python’s string/buffer semantics do force a bit more copying than I’d like). The in-store format has been specifically designed so that a transition to Cython should be easy, and the algorithms could be implemented readily in any other language including C, but I wouldn’t expect dramatic gains from that. Mostly it’s not all that fast because it does practically no caching. Some caching is possible and will improve performance significantly for non-contended use, but handling consistency issues properly in this kind of system will always involve some sacrifice of performance.
Does it scale?
Yes, pretty much as well as the underlying data store does – which is very well in this case.
But aren’t performance and scalability the same thing?
How does it work?
The basic principle of operation is copy-on-write. Instead of modifying data in place, modifications are done by copying and then storing the modified version under a new key. This is done from data “leaves” through indirect blocks until, as the very last operation, the inode “root” is updated in place pointing to the new keys. This is where we use conditional puts, to detect conflicts and retry; nobody sees any of the new keys until we’ve successfully updated the inode. Directories work in a similar way, using hashes instead of block numbers, and there’s even more stuff involving buckets and such that I’ll describe more fully later, but that’s the gist of it.
Doesn’t that approach leave you with a garbage-collection problem?
Yes, it does.
Will it be open source?
Yes. I still have to add the license verbiage – haven’t decided between LGPL/Apache/BSD/custom yet – and do a few other cleanup/packaging things, then I’ll put it on github like I did with CassFS.
Right now I can mount a VoldFS filesystem, create and list directories, read and write files, use cp/tar to move whole directory trees in and out, and expect data to persist in Voldemort so that I can remount later or on another machine and retrieve everything correctly. Many entry points (e.g. chown/chmod, rename, anything to do with symlinks) are stubbed out or completely unimplemented, so lots of other operations or programs (e.g. rsync) still fail. There’s also a bunch of stuff that’s hard-coded that shouldn’t be, so it’s not even ready for casual use yet, but I’ll put the code up anyway so my fellow developers can see where I’m going and offer feedback. After that I’ll keep implementing more of those entry points, adding proper permissions checks, and – most importantly – working on the contended-update cases. Then I’ll test the heck out of it before I declare it ready for prime time.