Filesystems are complex. Distributed systems are complex. It should come as no surprise, then, that distributed filesystems are exceedingly complex, and that, in a nutshell, is why there isn’t one in common use already. What about NFS, you say, or CIFS? Well, those aren’t really distributed filesystems in the sense in which I or other researchers in the field use the term. CIFS and NFS are technically referred to as network filesystems, which are distinguished from their distributed brethren as follows:

  • Network filesystems allow remote access to data, but that data resides in one and only one place. That single server can be both a performance bottleneck and a single point of failure limiting availability. Clustering at the server end can ameliorate both problems, but they cannot truly be solved without changing the basic architecture. In addition, network filesystem clients need explicit information about where each datum resides.
  • Distributed filesystems “virtualize” data, so that it can be replicated, moved, etc. transparently to address both performance and availability concerns. Distributed filesystem clients do not need to know exactly where a datum is located; this is discovered dynamically and transparently within the system instead, as sketched below.
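To make the distinction concrete, here is a minimal sketch in Go. The client types, the lookup function, and server names like “nfs1.example.com” are all hypothetical, invented purely for illustration: the network-filesystem client is configured with the one server that holds its data, while the distributed-filesystem client asks the system where a datum currently lives each time it needs it.

```go
package main

import "fmt"

// NetworkFSClient models a network filesystem: the data's location is fixed
// client-side configuration, so that one server is both a potential bottleneck
// and a single point of failure.
type NetworkFSClient struct {
	Server string // e.g. "nfs1.example.com" (hypothetical)
}

func (c NetworkFSClient) Locate(path string) []string {
	return []string{c.Server} // always the same answer, whether or not it's reachable
}

// DistributedFSClient models a distributed filesystem: location is discovered
// dynamically and may change between calls as data is replicated or moved.
type DistributedFSClient struct {
	lookup func(path string) []string // stand-in for the system's internal discovery
}

func (c DistributedFSClient) Locate(path string) []string {
	return c.lookup(path)
}

func main() {
	nfs := NetworkFSClient{Server: "nfs1.example.com"}
	dfs := DistributedFSClient{lookup: func(path string) []string {
		return []string{"nodeA", "nodeC"} // any current replica will do
	}}
	fmt.Println(nfs.Locate("/home/alice/notes.txt")) // [nfs1.example.com]
	fmt.Println(dfs.Locate("/home/alice/notes.txt")) // [nodeA nodeC]
}
```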

There are two traditional approaches to distributed filesystem design:

  • In the volume-based approach, each filesystem is built on a single volume that provides relatively low-level semantics – often mere block read/write – and is responsible for data distribution and replication. A layer running above this data store provides synchronization and higher-level filesystem abstractions such as directories and attributes.
  • In the more common file-based approach, there is no concept of a volume at all. Every file is treated as a completely separate data store, which must be located and managed on its own. (Both approaches are sketched below.)
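As a rough sketch of the two shapes (the interface names below are mine, not taken from any particular system): in the volume-based approach the distributed layer speaks only in blocks and a filesystem layer is stacked on top of it, while in the file-based approach every file is its own independently located object.

```go
package sketch

// Volume-based: the distribution/replication layer exposes nothing but blocks...
type BlockVolume interface {
	ReadBlock(blockNum int64) ([]byte, error)
	WriteBlock(blockNum int64, data []byte) error
}

// ...and a separate layer above it supplies directories, attributes, and
// synchronization, built out of those blocks.
type FilesystemLayer interface {
	Lookup(path string) (inode int64, err error)
	GetAttr(inode int64) (Attr, error)
	Read(inode int64, offset int64, length int) ([]byte, error)
}

// File-based: there is no volume; each file is a separate data store that must
// be located, replicated, and tracked on its own.
type FileStore interface {
	Locate(fileID string) (replicas []string, err error)
	Read(fileID string, offset int64, length int) ([]byte, error)
	Write(fileID string, offset int64, data []byte) error
}

type Attr struct {
	Size int64
	Mode uint32
}
```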

The volume-based approach does a good job of “divide and conquer”; each layer has plenty of complexity, and each can manage its own complexity independently. The downside is that the load and availability characteristics of the overall system are limited by the least-distributed layer. In fact it’s even worse than that: as the two layers adapt to changing conditions they do so without coordination, and might even end up working at cross purposes. Lastly, the layering itself inevitably creates some inefficiency.

The file-based approach has none of these problems. Its problem is that tracking each file separately requires more (and more complex) metadata and – more importantly – more network traffic. This additional overhead has brought many file-based distributed filesystems to their knees, sometimes making them all but unusable in anything but a fast-local-network environment – which in my opinion misses the whole point of having a distributed filesystem in the first place.

What I would like to suggest is that in a filesystem containing N files, we need not limit ourselves to either one or N data stores. Instead, a filesystem could consist of a number of volumes somewhere between one and N, linked together in a way that makes the whole thing transparent. For example, a user could create a volume to contain a few hundred of their own files. If those files are popular, the volume might be mirrored and subsequently moved somewhere else to ensure performance and availability, but the association of those files with that volume would remain intact. At the same time, some directories within that volume might actually be links to other volumes elsewhere, which are managed separately and accessed as needed.
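Here is a minimal sketch of the idea in Go. The Volume type, the Resolve function, and the example volume names are all hypothetical, and a real link record would presumably carry a URI-like address rather than a bare ID; the point is only that path resolution can cross from one volume into another without the caller ever noticing.

```go
package main

import (
	"fmt"
	"strings"
)

// Volume holds its own files plus, for some directory entries, links to other
// volumes. Where a volume's replicas physically live is the volume's problem,
// not the client's.
type Volume struct {
	ID    string
	Files map[string][]byte // path within the volume -> contents
	Links map[string]string // path prefix -> ID of the volume linked there
}

// Resolve walks a path starting in one volume and hops transparently into
// another volume whenever the path crosses a link.
func Resolve(volumes map[string]*Volume, volID, path string) ([]byte, error) {
	vol, ok := volumes[volID]
	if !ok {
		return nil, fmt.Errorf("no such volume %q", volID)
	}
	// If some prefix of the path leads into another volume, follow the link.
	for prefix, target := range vol.Links {
		if strings.HasPrefix(path, prefix+"/") {
			return Resolve(volumes, target, strings.TrimPrefix(path, prefix+"/"))
		}
	}
	data, found := vol.Files[path]
	if !found {
		return nil, fmt.Errorf("%s: not found in volume %q", path, volID)
	}
	return data, nil
}

func main() {
	volumes := map[string]*Volume{
		"alice": {
			ID:    "alice",
			Files: map[string][]byte{"notes.txt": []byte("hello")},
			Links: map[string]string{"projects": "team"},
		},
		"team": {
			ID:    "team",
			Files: map[string][]byte{"plan.txt": []byte("world")},
		},
	}
	// The caller sees a single namespace; the hop into "team" is invisible.
	data, err := Resolve(volumes, "alice", "projects/plan.txt")
	fmt.Println(string(data), err) // world <nil>
}
```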

This approach bears some superficial similarities to mount points and symbolic links, except that the URI-like links in this case are totally transparent and platform-independent, managed by the filesystem internally instead of being exposed to an OS which is – in most cases – ill-prepared to deal with them in the numbers in which they’re likely to occur. A closer similarity exists to “junction points” in Windows NT’s DFS.

It might seem that the “root” volume in such a system would still be more critical than others, creating an undesirable point of centralization. In fact that’s not true, because there need not be any one root. In a sense, every volume is a root volume, potentially using links to create a “personalized view” of the entire filesystem, as in Plan 9. Even loops need not be considered a problem. So what if volume A contains a link to B, while B also contains a link to A? That’s not a problem on the web, nor would it be one here. Certainly some roots might begin to seem more “authoritative” than others because of usage, but that’s entirely voluntary. One would hope that the owners of such roots would adopt more robust replication/distribution policies, just as the providers of more popular content on the web use content distribution networks to meet the same sorts of goals.
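Reusing the hypothetical Volume and Resolve types from the earlier sketch, here is why a link cycle is harmless in this model: every hop into another volume consumes part of the path, so resolving any finite path still terminates, just as circular hyperlinks never trap a web browser.

```go
// Two volumes, each containing a link to the other.
func demoLinkLoop() {
	volumes := map[string]*Volume{
		"A": {
			ID:    "A",
			Files: map[string][]byte{"readme": []byte("stored in A")},
			Links: map[string]string{"b": "B"},
		},
		"B": {
			ID:    "B",
			Files: map[string][]byte{"readme": []byte("stored in B")},
			Links: map[string]string{"a": "A"},
		},
	}
	// The path hops A -> B -> A and resolves normally despite the cycle.
	data, err := Resolve(volumes, "A", "b/a/readme")
	fmt.Println(string(data), err) // "stored in A" <nil>
}
```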

One of the nice things about this sort of “federated filesystem” is that the volume boundaries can match administrative boundaries. Each volume could have its own methods or policies for distribution or authentication, or even its own protocol, within a coherent general framework. The impact of a failure would be limited to a single volume, and recovery would involve only that volume rather than the whole filesystem. At the same time, the prohibitive overhead of finding and tracking every single file would also be avoided. As is so often the case, the middle path might prove preferable to either extreme.
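To make the “coherent general framework” part a little more concrete, here is one possible shape for it, again with hypothetical names: the federation only requires that every volume can authenticate callers, resolve names, and serve data, while everything behind that interface – authentication method, replication policy, even wire protocol – remains each volume’s own business.

```go
package federation

// Credentials is a placeholder; each volume decides what it actually requires.
type Credentials struct {
	User  string
	Token string
}

// LookupResult says whether a name resolves inside this volume or crosses a
// link into another volume, identified here by an opaque ID.
type LookupResult struct {
	IsLink        bool
	TargetVolume  string // set only when IsLink is true
	RemainingPath string
}

// FederatedVolume is the only contract the overall filesystem depends on.
// Everything behind it -- authentication, replication policy, transport,
// failure recovery -- is the individual volume's own business.
type FederatedVolume interface {
	Authenticate(c Credentials) error
	Lookup(path string) (LookupResult, error)
	Read(path string, offset int64, length int) ([]byte, error)
}
```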