After quite a few years of relative quiet, there’s a lot of activity and interest in the filesystem world. Many interesting local filesystems are newly available or well into development. Besides supporting bigger and faster filesystems, these newcomers add features like improved data integrity and built-in snapshot/clone support to the mix. There’s another group of new filesystems oriented toward providing better support for solid-state disks. There’s activity in the parallel-filesystem world too, with both old and new entrants battling for and sometimes gaining acceptance beyond their traditional niches. A lot of old assumptions are being revisited, but there may be others that still need revisiting.
In the old days, a filesystem’s size was determined when it was created, and could never be changed. If you wanted more capacity, you created a new filesystem on a new set of disks or partitions. If you wanted to free up space, perhaps to use it for a new filesystem mounted elsewhere, you were pretty much out of luck. Nowadays, of course, things are a bit better. Filesystems and volume managers often work together to support increasing (and, less often, decreasing) the size of a filesystem. Throughout all of this, however, the filesystem has remained monolithic in the sense that all files within the filesystem have essentially the same characteristics. If the underlying volume is RAID-1, then all files are replicated to the same degree. If it’s RAID-5, then all files are striped to the same degree. If you want files in a particular subdirectory to have different characteristics, such as being more widely striped on faster disks, then you pretty much have to create a whole separate filesystem on a separate volume, and mount it on that subdirectory. In some parallel filesystems, you get a little more flexibility; you can define how many object servers, and perhaps even which ones, a certain file or directory should be striped across. That still doesn’t give you much control over replication, though, and the parameters you specify generally apply only to newly created files (i.e. existing files aren’t re-striped if you change the parameters).
What’s going on here is that a filesystem boundary is not only a capacity boundary but also a layout-policy boundary and a consistency boundary and an everything-else boundary as well. Many people realize that moving a multi-gigabyte file within a filesystem is likely to be quick and atomic, while moving it across filesystems is really a copy-then-delete sequence that will require massive data movement. Somewhat fewer realize that you can force ordering of writes within a filesystem but you have no guarantees when you do writes across more than one. Very few indeed seem to think about whether these boundaries all need to be the same. Valerie Aurora’s chunkfs is one example of breaking up a filesystem into multiple units that can be checked separately. It’s not hard to imagine a local filesystem that offers the same kind of control over striping as parallel filesystems do, or even one that adds re-striping and control over replication as well. In a distributed filesystem, consistency or ordering across sub-filesystem domains could be quite beneficial. In some other environments, implementing quota or security policy across smaller domains would be helpful too.
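To make the move example concrete, here is a minimal Python sketch of how a filesystem boundary shows up to a program. It assumes a POSIX-like system, where a rename across filesystems fails with `EXDEV`; the helper names `same_filesystem` and `move_file` are made up for illustration.

```python
import errno
import os
import shutil


def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def move_file(src, dst):
    """Move src to dst and report which mechanism was used.

    Returns "rename" when the move stayed inside one filesystem and
    "copy" when it had to fall back to copy-then-delete.
    """
    try:
        # Within one filesystem this only updates directory entries:
        # no file data moves, and on POSIX the switch is atomic.
        os.rename(src, dst)
        return "rename"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
    # Crossing a filesystem boundary: every byte is copied, and the
    # copy-then-delete sequence is not atomic.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copy"
```

The standard library's `shutil.move` does roughly the same thing; the sketch just makes the boundary visible instead of hiding it.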
The key in all of these cases is that the concept of a “policy point” – the top of a hierarchy within which a certain layout or consistency or other policy applies – needs to be decoupled from the concept of a mount point. One fairly obvious way to do this would be to treat every directory as a policy point, inheriting from its parent if need be, but that might result in having so many policy points that managing them becomes a problem. It’s probably sufficient to label only specific directories – e.g. a user’s home directory, an application’s working directory – as policy points, however many directory levels exist below each. A large filesystem might therefore have some dozens to hundreds, but not thousands to millions, of policy points within it. It’s a slightly more complex model than what we have now, but I think it’s also one that maps better to users’ needs. I expect that I’ll be applying the concept to any filesystems I work on, but maybe that’s getting ahead of myself.
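As a rough sketch of what that might look like, the Python below resolves the policy for a path by walking up to the nearest labeled directory. Everything in it is hypothetical: the `POLICY_POINTS` table, the `replicas` and `stripe_width` settings, and the choice to let a policy point inherit unset values from labeled ancestors are all assumptions made for illustration, not a description of any existing filesystem. Paths are assumed to be absolute and POSIX-style.

```python
import posixpath

# Hypothetical table of labeled directories. Only a handful of
# directories are policy points; everything beneath them is covered.
POLICY_POINTS = {
    "/": {"replicas": 1, "stripe_width": 1},
    "/home/alice": {"replicas": 2},
    "/scratch/render": {"stripe_width": 8},
}


def find_policy_point(path, points=POLICY_POINTS):
    """Return the nearest labeled ancestor of path (or path itself)."""
    current = posixpath.normpath(path)
    while True:
        if current in points:
            return current
        parent = posixpath.dirname(current)
        if parent == current:  # reached the top without a match
            raise LookupError("no policy point covers " + path)
        current = parent


def effective_policy(path, points=POLICY_POINTS):
    """Merge settings from the root down to the nearest policy point.

    Assumption: settings a policy point leaves unset are inherited
    from the labeled directories above it.
    """
    chain = []
    current = posixpath.normpath(path)
    while True:
        if current in points:
            chain.append(points[current])
        parent = posixpath.dirname(current)
        if parent == current:
            break
        current = parent
    policy = {}
    for layer in reversed(chain):  # outermost first, nearest last
        policy.update(layer)
    return policy
```

With this table, a file anywhere under `/home/alice` gets two replicas regardless of how many directory levels sit in between, and none of those intermediate directories needs to be a mount point.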