Basho has announced Luwak, an Erlang library for storing large files in Riak. The original code was contributed by Cliff Moon (@moonpolysoft) so I’m guessing that the slightly scatological name comes from him. I chatted with Cliff on IRC a bit. I also exchanged some email with Bryan Fink (@hobbyist) who wrote the HTTP interface and seems to be the current maintainer at Basho. Many thanks to both of them for taking the time to educate me.

What follows might come across as criticism, but I don’t mean it as such. Most of it comes from my background as a filesystem developer, which is most assuredly not the best perspective from which to view Luwak, but it’s the perspective I have. Considered relative to Luwak’s goals and to the stage of its development, most of these apparent criticisms are weak or invalid, even when I managed to fight through my poor knowledge of Erlang to understand what the code’s doing. I can’t stress enough that I think Luwak is cool, and I wouldn’t have spent even as much time as I already have on it otherwise.

The first thing that strikes me about Luwak is that it’s all about what’s inside files and there’s nothing about managing namespaces – no directories, no renaming, no attributes as we filesystem types would expect, etc. That makes perfect sense, since Riak already has plenty of ways to index and connect Luwak files. Who needs directories when you have so many other ways to do the same things? Bryan even points out that “object” might be more accurate than “file” because it doesn’t carry the weight of expectations that Luwak was never intended to meet. This does mean that an application developer accustomed to arranging files into hierarchies will have to come up with their own way of mapping those semantics onto what Luwak provides, and maybe it would be nicer if that mapping were done in common code, but it’s not really that big a deal.
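To make the "roll your own hierarchy" point concrete, here's a minimal sketch (in Python, with a plain dict standing in for the key/value store) of one way an application might layer directory semantics on top of a flat namespace. All of the names here are hypothetical — Luwak itself provides none of this.

```python
store = {}  # flat keyspace: full path -> data

def put(path, data):
    store[path] = data

def list_dir(prefix):
    # "Listing a directory" is just a prefix scan over the flat keyspace.
    if not prefix.endswith("/"):
        prefix += "/"
    return sorted(k for k in store if k.startswith(prefix))

def rename(old, new):
    # With no rename primitive, a move is copy-then-delete -- and
    # therefore not atomic, one of the semantics the app must own.
    store[new] = store.pop(old)

put("/photos/cat.jpg", b"...")
put("/photos/dog.jpg", b"...")
rename("/photos/cat.jpg", "/photos/kitten.jpg")
print(list_dir("/photos"))  # ['/photos/dog.jpg', '/photos/kitten.jpg']
```

Nothing deep, but it shows where the responsibility lands: listing becomes a scan, and rename becomes a non-atomic two-step, unless common code somewhere makes it otherwise.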

Within a file, blocks are arranged into a Merkle tree, which is an interesting approach. In the case of rewriting an entire file in which little has changed, it allows the update to be done with very little data transfer. I’m not sure it helps all that much in the case of writing a new file, or of rewriting only part of a file, though. It makes me wonder whether the “atomic non-extending write within a single allocated block” optimization I mentioned here would apply to Luwak. The Merkle approach is also related to another interesting feature, one which isn’t mentioned in the README but does warrant a comment in luwak_io.erl:

%% The write will start at the offset specified by
%% Start and overwrite anything at that position with the
%% contents of Data. Writes starting beyond the end of the file
%% will occur at the end of the file. Luwak does not allow for
%% gaps in a file.

I can totally see how this makes the design simpler. It avoids a whole lot of grunge like populating nodes with “holes” instead of pointers to real data, dealing with reads in the holes, and so on. The part about writes starting beyond the end occurring at the end worries me, though. If an application were to write out of order – few do, but something like BitTorrent comes to mind – the result would be a mangled mess. If gaps aren’t allowed, that’s fine, but it would seem safer to reject such writes outright than to risk silently rearranging the data. I also don’t see any mention of a true append operation, which would imply that appending is a potentially racy process of finding the current EOF and then writing at that offset. What if something else extended the file in between?
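The append race is easy to demonstrate in miniature. This sketch (hypothetical Python, not Luwak's API) models the documented rule that writes past EOF land at EOF, and shows two writers that both observe the same EOF before either write completes:

```python
buf = bytearray()  # stand-in for the file contents

def size():
    return len(buf)

def write(offset, data):
    # Mimic the documented rule: writes past EOF happen at EOF instead.
    offset = min(offset, len(buf))
    end = offset + len(data)
    buf[offset:end] = data

# Both writers observe EOF = 0, then write.
eof_a = size()
eof_b = size()
write(eof_a, b"AAAA")
write(eof_b, b"BBBB")  # lands at the same offset, stomping the first
print(bytes(buf))      # b'BBBB', not b'AAAABBBB'
```

With a true append primitive, the store would resolve the offset at write time and both payloads would survive; with find-EOF-then-write, the second "append" silently overwrites the first.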

Speaking of concurrency, the general approach in Luwak is similar to that in VoldFS and elsewhere: do all writes (including internal data structures) into new space, then write a new root which points to the new bits. In VoldFS this final write is into the inode for data operations or into the root directory for namespace operations, and is done very carefully with a conditional update so that conflicting writes are detected and retried – effectively serialized – instead of taking partial effect. In Luwak the “write into new space” rule does seem to be followed, but not the conditional-update part. That means two simultaneous writes could end up making separate copies of the same node in a common ancestor, and one write could be lost even though there was no actual overlap. As near as my weak Erlang skills can determine, simultaneous updates might even stomp on each other’s ancestor lists, so reconciliation at that level wouldn’t be possible either. Now, don’t get me wrong. It’s entirely reasonable to say that Luwak isn’t intended to handle that kind of concurrent-access regime and that if it had been then it would have been implemented a whole different way. I’m just saying that it’s an area where it might be interesting to experiment some more and see if at least occasional/accidental sharing might be handled more gracefully.
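For readers who haven't seen the conditional-update discipline before, here's a sketch of the idea in Python. The store and its compare-and-swap are hypothetical stand-ins (not Riak's or VoldFS's actual API): each writer builds its new state in new space, then installs a new root only if the root it started from is still current, retrying from the fresh root otherwise.

```python
class Store:
    def __init__(self):
        self.root = (0, frozenset())  # (version, set of block ids)

    def cas_root(self, expected_version, new_root):
        # Conditional update: succeed only if no one moved the root.
        if self.root[0] != expected_version:
            return False  # someone else won; caller must retry
        self.root = new_root
        return True

def add_block(store, block_id):
    while True:
        version, blocks = store.root                   # read current root
        new_root = (version + 1, blocks | {block_id})  # write into new space
        if store.cas_root(version, new_root):          # publish atomically
            return                                     # else: retry from new root

s = Store()
add_block(s, "a")
add_block(s, "b")
print(sorted(s.root[1]))  # ['a', 'b'] -- neither update is lost
```

The retry loop is what turns "two writers copied the same ancestor" from a lost update into a brief serialization delay.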

Since I mentioned ancestor lists, I should also point out that they seem to include all previous versions. Similarly, according to Cliff, there’s no garbage collection of no-longer-used data blocks. Again, that’s totally reasonable for such a young project; there’s no such garbage collection in VoldFS either. Since data blocks are addressed by content hash, the problem might even be a bit more complex, and of course one should never pass up an opportunity to remind people of Valerie Aurora’s excellent HotOS 2003 paper on the dangers of compare-by-hash.
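Why content-hash addressing complicates GC: identical data written from different files lands in one shared block, so deleting a file can only reclaim a block once nothing else references it. A sketch of the reference-counting version (hypothetical structures, not Luwak's):

```python
import hashlib

blocks = {}  # content hash -> data
refs = {}    # content hash -> reference count

def put_block(data):
    h = hashlib.sha256(data).hexdigest()
    if h not in blocks:
        blocks[h] = data
    refs[h] = refs.get(h, 0) + 1  # a duplicate write just bumps the count
    return h

def drop_block(h):
    refs[h] -= 1
    if refs[h] == 0:              # reclaim only when no file needs it
        del blocks[h], refs[h]

h1 = put_block(b"same bytes")     # block written for file A
h2 = put_block(b"same bytes")     # file B dedups onto the same block
assert h1 == h2
drop_block(h1)                    # file A goes away...
print(h2 in blocks)               # True -- file B still holds the block
```

Without the counts (or some equivalent mark-and-sweep over reachable roots), deleting one file could yank a block out from under another — which is exactly the extra wrinkle that plain per-file block allocation doesn't have.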

That’s all I can think of right now. All quibbles and disclaimers aside, I think the most important thing is that more people are working on ways to store large objects in some of the modern distributed data stores. Even if we all come up with different semantics and different approaches, that’s definitely a good thing. Progress is messy that way, and thanks to everyone involved with Luwak for contributing to that progress.