Local Filesystems Suck

Distributed filesystems represent an important use case for local filesystems. Local-filesystem developers can't seem to deal with that. That, in a nutshell, is one of the most annoying things about working on distributed filesystems. Sure, there are lots of fundamental algorithmic problems. Sure, networking stuff can be difficult too. However, those problems are "natural" and not caused by the active reality-rejection that afflicts a related community. Even when I was in the same group as the world's largest collection of local-filesystem developers, with the same boss, it was often hard to get past the belief that software dealing with storage in user space was just A Thing That Should Not Exist and therefore its needs could be ignored. That's an anti-progress attitude.

So, what are the problems with local filesystems? Many of the problems I'm going to talk about aren't actually in the local filesystems themselves - they're in the generic VFS layer, or in POSIX itself - but evolution of those things is almost entirely driven by local-filesystem needs so that's a distinction without a difference. Let's look at some examples from my own recent experience.

  • Both the interfaces and underlying semantics for extended attributes still vary across filesystems and operating systems, despite their usefulness and the obvious benefits of converging on a single set of answers. This is true even for the most basic operations; if you want to do something "exotic" like set multiple xattrs at once, you have to use truly FS-specific calls. (A small illustration follows this list.)

  • Mechanisms to deallocate/discard/trim/hole-punch unused space still haven't converged, after $toomany years of being practically essential to deal with SSDs and thin provisioning. (There's a sketch after this list.)

  • Ditto for overlay/union mounts, which have been worked on for years to no useful result. There's a pattern here.

  • The readdir interface is just totally bogus. Besides being barely usable and inefficient, besides having the worst possible consistency model for concurrent reads and writes, it poses a particular problem for distributed filesystems layered on top of their local cousins. It requires the user to remember and return N bits with every call, instead of using a real cursor abstraction. Then the local filesystem at the far end gets to use those N bits however it wants. This leaves a distributed filesystem in between, constrained by its interfaces to those same N bits, with zero left for itself. That means distributed filesystems have to do all sorts of gymnastics to do the same things that local filesystems can do trivially. (A sketch of the interface as it stands follows this list.)

  • Too often, local filesystems implement enhancements (such as aggressive preallocation and request batching) that look great in benchmarks but are actually harmful for real workloads and especially for distributed filesystems. There's another big pile of unnecessary work shoved onto other people.

  • It's ridiculously hard to get even such a simple and common operation as atomically (and durably) replacing a file's contents via rename right. Here's the magic formula that almost nobody knows.
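
A minimal sketch of the usual recipe, assuming POSIX rename() semantics: write the new contents to a temporary file in the same directory, fsync that file, rename it over the target, then fsync the directory so the rename itself survives a crash. The helper name is just for illustration, and error handling is omitted.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Replace "final" with new contents, atomically and durably.
     * "tmp" must live in the same directory (and thus filesystem) as "final". */
    int replace_file(const char *dir, const char *tmp, const char *final,
                     const char *buf, size_t len)
    {
        /* 1. Write the new contents to a temporary file. */
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, buf, len);

        /* 2. fsync the temp file so its data is on stable storage before it
         *    becomes visible under the final name. */
        fsync(fd);
        close(fd);

        /* 3. rename() is atomic with respect to other processes: readers see
         *    either the old contents or the new, never a partial mixture. */
        rename(tmp, final);

        /* 4. fsync the containing directory, or the rename may still be sitting
         *    in dirty directory metadata when the power goes out. */
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        fsync(dfd);
        close(dfd);
        return 0;
    }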
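
Going back to the extended-attribute item above, here is a small illustration of the divergence, assuming Linux and FreeBSD as the two targets (macOS differs yet again, with extra position and options arguments). The wrapper name is hypothetical, and note that nothing here, on any OS, lets you set more than one attribute per call.

    /* One extended attribute, one call, two different APIs. Sketch only. */
    #include <stddef.h>

    #if defined(__linux__)
    #include <sys/xattr.h>
    static int set_one_xattr(const char *path, const char *name,
                             const void *val, size_t len)
    {
        /* Namespace is part of the name, e.g. "user.mydata". */
        return setxattr(path, name, val, len, 0);
    }
    #elif defined(__FreeBSD__)
    #include <sys/types.h>
    #include <sys/extattr.h>
    static int set_one_xattr(const char *path, const char *name,
                             const void *val, size_t len)
    {
        /* Namespace is a separate argument; the name carries no prefix. */
        return extattr_set_file(path, EXTATTR_NAMESPACE_USER, name,
                                val, len) < 0 ? -1 : 0;
    }
    #endif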
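
On the hole-punching item, even staying within Linux the operation is a nonstandard fallocate() flag that not every filesystem implements (older setups wanted an XFS-specific ioctl, and other operating systems have their own spellings). A sketch, assuming a reasonably recent Linux kernel:

    /* Deallocate ("punch a hole" in) a byte range of an already-written file.
     * FALLOC_FL_PUNCH_HOLE is Linux-specific and must be paired with
     * FALLOC_FL_KEEP_SIZE. Sketch only. */
    #define _GNU_SOURCE
    #include <fcntl.h>

    static int punch_hole(int fd, off_t offset, off_t length)
    {
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, length);
    }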
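
And on the readdir item, this is the entire "cursor" the interface gives you: a single opaque long per position (telldir()/seekdir(), or the d_off field on Linux), every bit of which belongs to the local filesystem underneath. A sketch of the interface as it stands:

    /* The readdir "cursor": one opaque long, owned entirely by the local FS.
     * A filesystem stacked on top has nowhere to put its own state. */
    #include <dirent.h>
    #include <stdio.h>

    static void walk(const char *path)
    {
        DIR *d = opendir(path);
        struct dirent *de;
        while ((de = readdir(d)) != NULL) {
            long cookie = telldir(d);   /* position just past this entry */
            printf("%s cookie=%ld\n", de->d_name, cookie);
            /* Resuming later means handing exactly this long back to
             * seekdir(); how it's encoded is up to the local filesystem. */
        }
        closedir(d);
    }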

That last point about rename relates to the really problematic issue: very poor support for specifying things like ordering and durability of requests without taking out the Big Hammer of forcing synchronous operations. By the time we get a request, it has already been cached and buffered and coalesced and so on all to hell and back by the client. Those games have already been played, so our responsibility is to provide immediate durability, while respecting operation order, with minimal performance impact. It's a tall order at the best of times, but the paucity of support from local filesystems makes it far worse.

In a previous life, I worked on some SCSI drivers. There, we had tagged command queuing, which was a bit of a pain sometimes but offered excellent control over which requests overlapped or followed which others. With careful management of your tags and queues, you could enforce the strictest order or provide maximum parallelism or make any tradeoff in between. So what does the "higher level" filesystem interface provide? We get fsync, sync, O_SYNC, O_DIRECT and AIO. That might be enough, except...

  • Fsync is pretty broken in most local filesystems. The "stop the world" entanglement problems in ext4 are pretty well known. What's less well known is that XFS (motto: "at least it's not ext4") has essentially the same problem. An fsync forces everything queued internally before it to complete, but that's completely useless to an application, which still gets no useful information about which other file descriptors no longer need their own fsync. The pattern continues even when you look further afield.

  • O_SYNC has essentially the same problems as fsync, and sync is defined to require "stop the world" behavior.

  • O_DIRECT throws away too much functionality. Sure, we don't want write-back, but a write-through cache would still be nice for subsequent reads and O_DIRECT eliminates even that.

  • AIO still uses a thread pool behind the scenes on Linux, unless you use a low-level interface that even its developers admit isn't ready for prime time, so it fails the efficiency requirement. (A sketch of that interface follows this list.)
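
For reference, the "low-level interface" here is presumably kernel-native AIO, reached through libaio's io_submit() path rather than glibc's thread-pool POSIX AIO. A sketch, assuming Linux and linking with -laio; in practice it only behaves asynchronously for O_DIRECT I/O, which circles back to the previous complaint.

    /* Kernel-native AIO via libaio (compile with -laio). Sketch only;
     * error handling is omitted. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Kernel AIO is generally only truly asynchronous with O_DIRECT,
         * which drags in its buffer-alignment rules. */
        int fd = open("aio-demo.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        void *buf;
        posix_memalign(&buf, 4096, 4096);       /* O_DIRECT wants aligned buffers */
        memset(buf, 'x', 4096);

        io_context_t ctx = 0;
        io_setup(32, &ctx);                     /* create a submission context */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, 4096, 0);  /* describe one async write */
        io_submit(ctx, 1, cbs);                 /* queue it */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);     /* block until it completes */

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }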

Implementing a correct and efficient server is way harder than it needs to be when all you have to work with is broken fsync, broken O_DIRECT, and broken AIO. Apparently btrfs tries to get some of this stuff right, thanks to the Ceph folks, but even they balked at trying to make those changes more generic, so unless you want to use btrfs you're still out of luck. That's why I return the local-filesystem developers' contempt, plus interest. Virtualization systems, databases, and other software all have many of the same needs as distributed filesystems, for many of the same reasons, and are also ignored by the grognards who continue to optimize for synthetic workloads that weren't even realistic twenty years ago. While I still believe that the POSIX abstraction is far from being obsolete, pretty soon it might not be possible to say the same about the people most involved with implementing or improving it.
