Even when I’m happy with my current project, I often like to think about alternative solutions to the same basic set of distributed-data problems that are my area of expertise. So it is with GlusterFS. I think it’s on a good track, with a robust set of features already and a clear path to being the best available solution for an even broader area, but . . . there’s always a “but,” isn’t there?

As I’ve worked on GlusterFS for the last few years, certain patterns of problems and feature additions have proven unexpectedly difficult. A lot of these come down to the issue of write granularity and conflicting simultaneous writes. For example, what if two HekaFS clients using block-based encryption try to write different parts of the same cipher block at the same time? They’re not conflicting from the user’s point of view, so one shouldn’t be allowed to undo the other, but they are conflicting in the sense that every byte written affects every byte of that cipher block. It’s impossible to merge the two writes without keys, and of course the server can’t have keys in our model. Erasure coding runs into much the same problem. In both cases, the solution is to serialize the read-modify-write sequences used for partial writes. We can have all sorts of fun with various optimistic locking schemes and so on, but the net result is still a bit sad.

When these issues are considered in the context of asynchronously replicating those writes across geographic locations, things become even uglier . . . and that’s the train of thought leading to this post. Is there a fundamentally different approach, absolutely infeasible to implement within GlusterFS, that would handle this combination (async replication plus erasure coding plus encryption) more gracefully? If I were to design such a system from the ground up, here’s what it might look like. While I’m at it, I’ll throw in some other ideas that aren’t really related to this fundamental problem, just for fun.
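First, though, to make the cipher-block problem above concrete, here is a minimal sketch of that read-modify-write dance. It’s toy code invented for this post, not anything from HekaFS: AES-ECB from Python’s cryptography package is used only because it operates on exactly one 16-byte block, and helpers like partial_write are hypothetical.

```python
# Toy illustration of why a small write inside one cipher block becomes a
# read-modify-write of the whole block.  Not HekaFS code; AES-ECB is used only
# because it encrypts exactly one 16-byte block at a time, and the helper
# names here are invented for this sketch.
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

CIPHER_BLOCK = 16  # AES block size in bytes

def encrypt_block(key: bytes, plaintext: bytes) -> bytes:
    enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return enc.update(plaintext) + enc.finalize()

def decrypt_block(key: bytes, ciphertext: bytes) -> bytes:
    dec = Cipher(algorithms.AES(key), modes.ECB()).decryptor()
    return dec.update(ciphertext) + dec.finalize()

def partial_write(key: bytes, stored: bytearray, offset: int, data: bytes) -> None:
    """Write `data` at `offset` within one encrypted block held in `stored`.

    Even though the caller touches only a few bytes, the client must read and
    decrypt the whole block, splice in the new bytes, then re-encrypt and write
    the whole block back.  Two clients doing this concurrently against the same
    block each re-encrypt from their own stale copy, so one silently undoes the
    other unless the read-modify-write sequences are serialized.
    """
    plaintext = bytearray(decrypt_block(key, bytes(stored)))  # read + decrypt
    plaintext[offset:offset + len(data)] = data               # modify a few bytes
    stored[:] = encrypt_block(key, bytes(plaintext))          # re-encrypt + write back

if __name__ == "__main__":
    key = b"0" * 16
    block = bytearray(encrypt_block(key, b"A" * CIPHER_BLOCK))
    partial_write(key, block, 0, b"left")    # "client 1" writes bytes 0-3
    partial_write(key, block, 12, b"rght")   # "client 2" writes bytes 12-15
    print(decrypt_block(key, bytes(block)))  # both survive only because the calls were serialized
```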

  • Would I even want the result to be a filesystem? I’ve often said that one of the reasons I got involved with GlusterFS was because I really didn’t want to develop an entire filesystem, and I still feel that way. There’s tremendous value in the result, but if I were to do something on my own maybe I’d go for something less complicated. Object (or “blob”) stores like S3 or Swift are about twenty times easier to implement, and many people seem willing to adopt them instead of filesystems, so maybe I’d do that. With things like HTTP PATCH coming along I’d still have to deal with overlapping byte-aligned writes within a file/object (I’m just going to say “file” from here on for simplicity), but that’s OK.
  • Distribution would still be based on consistent hashing (there’s a tiny hash-ring sketch after this list). I have many of my own favorite tweaks in this area, most of which are equally applicable to GlusterFS and PBFS.
  • The first major change I’d make is to get out of the business of dealing with cluster membership and configuration consensus myself. There’s other software available to do just that, so there’s no need to implement an inferior home-grown version or put up with its deficiencies. (A sketch of the narrow interface I’d actually need from such software follows this list.)
  • The other major change would be a move from a “write in place” model to more of a log-structured or “event sourcing” model. When a client writes data, the entire write is encrypted and erasure coded according to its own internal alignment, regardless of how it’s aligned with respect to the file. The pieces are then distributed among the servers responsible for that file. Voilà: no more server-side merging or read-modify-write sequences. (A write-path sketch follows this list.)
  • As pieces of the write’s contents are being written to several servers, information about the write as a whole is fully replicated to a smaller set of servers. Any one of these “catalog” servers can then, when asked about that write, identify which servers got erasure fragments and also provide any authentication/integrity metadata necessary to validate the combined result. (The write-path sketch below shows the shape of such a catalog record.)
  • Since the entire write is maintained as a single unit, that unit can be transmitted verbatim to another site. It can even be erasure-coded differently on that end, to satisfy different requirements, e.g. at a primary site vs. a DR site. This simplifies matters greatly compared to approaches that require writes to be merged somehow before being propagated.
  • As with the patch-based systems that are well known and understood in the source-control world, merging happens on the client side during reads (see the read-path sketch after this list). To read a byte range within a file, you ask one of its catalog servers for all current writes overlapping that range. What you get back is potentially a list of writes, with extent and ordering information. It’s up to you to decrypt and merge to get the bytes you need. In some cases, it would make sense to re-encrypt and write back the results so that the next reader only gets a single chunk, but that’s optional. In that case, the write-back might cause some or all of the previous chunks to be marked as no longer current and (eventually) garbage-collected.
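Per the consistent-hashing bullet, here is the promised hash-ring sketch. It only shows the placement primitive; the virtual-node count, the SHA-1 truncation, and the names are all choices made for this illustration, not anything taken from GlusterFS.

```python
# A minimal consistent-hash ring: files map to points on a ring of virtual
# nodes, and fragment targets are found by walking clockwise.  Parameters and
# names here are invented for the example.
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=64):
        self._ring = []                              # sorted list of (point, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def servers_for(self, name: str, count: int):
        """Return `count` distinct servers for a file name, walking clockwise."""
        start = bisect.bisect(self._ring, (self._hash(name), ""))
        chosen, i = [], start
        while len(chosen) < count and i < start + len(self._ring):
            server = self._ring[i % len(self._ring)][1]
            if server not in chosen:
                chosen.append(server)
            i += 1
        return chosen

if __name__ == "__main__":
    ring = HashRing([f"server{i}" for i in range(8)])
    print(ring.servers_for("volume1/some-file", count=6))   # e.g. erasure-fragment targets
```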
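For the membership and configuration bullet, this is roughly the narrow interface I’d want from whatever existing coordination software takes over that job (etcd, ZooKeeper, or Consul are the obvious candidates; those are my examples, not a commitment). The interface and the in-memory stand-in are invented for the sketch.

```python
# The narrow slice of "cluster membership and configuration consensus" the data
# path actually needs once that job is delegated to existing coordination
# software.  The interface and the in-memory stand-in are invented for this
# sketch; a real deployment would wrap a consensus store instead.
from typing import Callable, Dict, List

class ClusterDirectory:
    """What the storage servers and clients consume; nothing more."""
    def members(self) -> List[str]:
        raise NotImplementedError
    def get_config(self, key: str) -> str:
        raise NotImplementedError
    def watch_config(self, key: str, callback: Callable[[str], None]) -> None:
        raise NotImplementedError

class InMemoryDirectory(ClusterDirectory):
    """Test stand-in: no consensus, no failure handling, just the shape."""
    def __init__(self) -> None:
        self._members: List[str] = []
        self._config: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callable[[str], None]]] = {}

    def join(self, server: str) -> None:
        self._members.append(server)

    def members(self) -> List[str]:
        return list(self._members)

    def set_config(self, key: str, value: str) -> None:
        self._config[key] = value
        for cb in self._watchers.get(key, []):
            cb(value)

    def get_config(self, key: str) -> str:
        return self._config[key]

    def watch_config(self, key: str, callback: Callable[[str], None]) -> None:
        self._watchers.setdefault(key, []).append(callback)

if __name__ == "__main__":
    directory = InMemoryDirectory()
    directory.join("server0")
    directory.watch_config("replica-count", lambda v: print("config changed:", v))
    directory.set_config("replica-count", "3")
    print(directory.members(), directory.get_config("replica-count"))
```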
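Here is the write-path sketch promised above: the whole write is encrypted as one unit on its own alignment, erasure coded, scattered across data servers, and described by a small catalog record that gets fully replicated. Everything in it is invented for illustration: the stand-in encryption, the toy XOR-parity code, the WriteRecord shape, and the fan-out. A real system would use real cryptography and something like Reed-Solomon.

```python
# Write path under the log-structured model: encrypt the whole write as one
# unit, erasure-code it on its own alignment, scatter the fragments, and record
# a fully replicated catalog entry describing the unit.  All names and the toy
# (k data + 1 XOR parity) code are invented for this sketch.
import hashlib
import itertools
from dataclasses import dataclass
from typing import Dict, List

_version = itertools.count(1)   # stand-in for the ordering assigned by the catalog

def fake_encrypt(data: bytes) -> bytes:
    """Stand-in for client-side encryption of the whole write unit (NOT real crypto)."""
    return bytes(b ^ 0x5A for b in data)

def erasure_encode(unit: bytes, k: int) -> List[bytes]:
    """Split into k data fragments plus one XOR parity fragment (a toy RAID-5-style code)."""
    frag_len = -(-len(unit) // k)                        # ceiling division
    padded = unit.ljust(frag_len * k, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = frags[0]
    for f in frags[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return frags + [parity]                              # lose any one fragment, recompute it

@dataclass
class WriteRecord:
    """Catalog entry, small enough to replicate fully to a few catalog servers."""
    file: str
    offset: int
    length: int
    version: int
    fragments: Dict[int, str]   # fragment index -> data server holding it
    checksum: str               # integrity metadata for the whole encrypted unit

def do_write(file: str, offset: int, data: bytes,
             data_servers: List[str], catalog: List[WriteRecord], k: int = 4) -> WriteRecord:
    unit = fake_encrypt(data)                            # aligned to itself, not to the file
    frags = erasure_encode(unit, k)
    placement = {i: data_servers[i % len(data_servers)] for i in range(len(frags))}
    # ...ship frags[i] to placement[i] here; no server ever merges anything...
    record = WriteRecord(file, offset, len(data), next(_version),
                         placement, hashlib.sha256(unit).hexdigest())
    catalog.append(record)                               # replicated whole, not striped
    return record

if __name__ == "__main__":
    catalog: List[WriteRecord] = []
    servers = [f"server{i}" for i in range(6)]
    print(do_write("volume1/some-file", 4096, b"not aligned to anything" * 5,
                   servers, catalog))
```

Because the encrypted unit stays whole, the catalog record plus the reassembled fragments are also exactly what would be shipped verbatim to another site, where the unit could be re-run through a different erasure code with different parameters to suit that site’s durability requirements, which is the point of the cross-site bullet.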
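And the read-path sketch: the catalog hands back every current write overlapping the requested range, with extent and ordering information, and the client merges them newest-wins after decrypting each one. The Write record and merge_range function are invented here, and decryption is assumed to have already happened.

```python
# Read path: replay every overlapping write in version order to materialize
# the requested byte range.  Holes read as zeroes in this toy; the structures
# are invented for the illustration.
from dataclasses import dataclass
from typing import List

@dataclass
class Write:
    version: int      # total order assigned by the catalog
    offset: int       # where the write lands in the file
    data: bytes       # already-decrypted contents of the whole write unit

def merge_range(writes: List[Write], start: int, length: int) -> bytes:
    """Materialize [start, start+length) by replaying overlapping writes, oldest first."""
    buf = bytearray(length)
    for w in sorted(writes, key=lambda w: w.version):    # newest wins where they overlap
        lo = max(start, w.offset)
        hi = min(start + length, w.offset + len(w.data))
        if lo < hi:
            buf[lo - start:hi - start] = w.data[lo - w.offset:hi - w.offset]
    return bytes(buf)

if __name__ == "__main__":
    writes = [
        Write(version=1, offset=0, data=b"aaaaaaaaaa"),   # older write
        Write(version=2, offset=4, data=b"BBBB"),         # newer partial overwrite
    ]
    print(merge_range(writes, start=0, length=10))        # b'aaaaBBBBaa'
```

The optional write-back described in the last bullet would just be one new write covering the merged range, after which the catalog can mark the older overlapping records as no longer current and eventually garbage-collect them.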

As I’m sure you can see, this is a very different kind of animal than GlusterFS – or, for that matter, any other distributed filesystem. Both the read and write paths are fundamentally different. Writes can be significantly more efficient, but reads are more complicated and rely more on high scale to deliver good aggregate performance. It would certainly be an interesting exercise, if only I had time…