Fixing Fsync

When I wrote about how local filesystems suck a while ago, it sparked a bit of debate. Mostly it was just local-filesystem developers being defensive, but Dave Chinner did make the quite reasonable suggestion that I could help by proposing a better solution to the fsync problem. I've owed him an answer since then; here it is.

To recap, the main problem with fsync is that it conflates ordering with synchrony. There's no way to ensure that two writes happen in order except by waiting for the first to complete before issuing the second. This sacrifices any possibility of pipelining requests, which is essential for performance. What's funny is that local filesystems themselves take advantage of a model that does allow such ordered pipelines - tagged command queuing, which I first encountered twenty years ago when I worked with parallel SCSI. The basic idea is that a device has multiple queues. Each request specifies which queue it should go on, plus some bits to specify how it should be queued and dequeued; there's a small sketch in code after the list.

  • SIMPLE means that there are no particular queuing or ordering restrictions. A series of SIMPLE requests can be reordered and/or issued in parallel with respect to each other, but not with respect to non-SIMPLE requests.

  • ORDERED means that the request must wait for all earlier requests on the same queue, and all later requests on the same queue must wait for it. This allows pipelining, but not parallelism or reordering.

  • HEAD is the same as ORDERED, except that the new request is inserted at the head of the queue instead of the tail. This is generally a very bad idea, but it's necessary in certain situations. For example, the drivers I was writing used it to issue the commands for controller failover while leaving the rest of the queue intact.
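
To make that concrete, here's a minimal sketch of the model in C. Nothing below corresponds to any real header or kernel interface; the enum and struct names are made up purely to show the three tag types and the per-request choice of queue.

    /* Illustration only: a made-up encoding of the tagged-queuing model. */
    enum queue_tag {
        TAG_SIMPLE,     /* may be reordered/parallelized among other SIMPLE requests */
        TAG_ORDERED,    /* waits for everything ahead of it; everything behind waits for it */
        TAG_HEAD        /* like ORDERED, but inserted at the head of its queue */
    };

    struct tagged_request {
        unsigned int    queue;  /* which of the device's queues this request goes on */
        enum queue_tag  tag;    /* how it's queued and dequeued relative to the others */
        /* ...description of the actual I/O: buffers, offset, length... */
    };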

The funny thing is that this model has been around so long that it has bubbled up to the OS block layer, where local filesystems can take advantage of it to ensure correct ordering while maintaining performance, but then those same local filesystems don't expose it to anyone else. Seems rather selfish to me.

The obvious solution is simply to add queue/type parameters to writev (and possibly other calls as well). Current behavior is equivalent to SIMPLE queuing. Fsync is equivalent to an ORDERED no-op issued synchronously. That's all very well, but the model provides pretty obvious ways to do even more interesting things.
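
Before getting to those, here's roughly what that basic mapping might look like. To be clear, none of this exists: writevq, the WQ_* flags, and the little example function are invented purely for illustration.

    #include <sys/types.h>
    #include <sys/uio.h>

    /* Hypothetical queuing flags, mirroring the tag types above. */
    #define WQ_SIMPLE   0       /* today's writev behavior */
    #define WQ_ORDERED  1       /* ordered with respect to its queue */
    #define WQ_HEAD     2       /* ordered, inserted at the head of its queue */

    /* Hypothetical writev variant with queue/type parameters. */
    ssize_t writevq(int fd, const struct iovec *iov, int iovcnt,
                    unsigned int queue, int how);

    /* Two records that must reach the log in order, with no blocking
     * fsync between them: */
    static void append_two(int log_fd, const struct iovec *rec1,
                           const struct iovec *rec2)
    {
        writevq(log_fd, rec1, 1, 0, WQ_ORDERED);    /* pipelined...         */
        writevq(log_fd, rec2, 1, 0, WQ_ORDERED);    /* ...but still ordered */
    }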

  • An asynchronous fsync becomes possible simply by issuing an ORDERED no-op (zero-length write?) using AIO. You don't have to wait for it, but you can be assured that it's in the pipeline and order will be maintained (see the sketch after this list).

  • If you only need ordering between some requests, you can use ORDERED on multiple queues.
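
Here's a sketch of the first of those, reusing the made-up WQ_ORDERED flag from above; the io_req structure, submit_io(), and queue_barrier() are equally hypothetical, standing in for whatever AIO-style submission path would actually carry the request.

    #include <stddef.h>
    #include <sys/uio.h>

    #define WQ_ORDERED  1           /* from the earlier sketch */

    /* Hypothetical async submission: queues the request and returns at
     * once, calling done() when it completes. */
    struct io_req {
        int                  fd;
        const struct iovec  *iov;
        int                  iovcnt;
        unsigned int         queue;
        int                  how;
        void               (*done)(struct io_req *);
    };
    int submit_io(struct io_req *req);

    /* An "async fsync": an ORDERED zero-length request.  Everything queued
     * before it becomes durable before anything queued after it, and the
     * caller never blocks. */
    static void queue_barrier(struct io_req *req, int fd, unsigned int queue,
                              void (*done)(struct io_req *))
    {
        req->fd     = fd;
        req->iov    = NULL;         /* zero length: a pure ordering point */
        req->iovcnt = 0;
        req->queue  = queue;
        req->how    = WQ_ORDERED;
        req->done   = done;         /* optional: notice when it's durable */
        submit_io(req);             /* no waiting; ordering still holds */
    }

    /* For the second point, put independently-ordered streams on different
     * queues - say journal records on queue 0 and bulk data on queue 1 -
     * so each stays internally ordered but the two can interleave. */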

This is a clean, powerful, and well-proven model. Unfortunately, local-FS developers will probably argue that it's too hard to implement (even though they've already implemented it for their own uses, e.g. ordering data vs. inode writes). However, most of this complexity has to do with multiple queues. Implementing just SIMPLE/ORDERED without multiple queues would be much easier, and still much better than what we have now.

The other problem with fsync is that it only flushes the pipeline for a single file descriptor (actually in practice it's more likely to be the inode). If you want to flush the pipelines for a bunch of file descriptors, you have to issue fsync for each one separately. This is not just an inconvenience; it means that you either need to wait for N fsyncs in sequence, or have N threads handy to wait for them in parallel. The other alternative is to issue syncfs instead - possibly having to wait for I/O from other applications and other users as well as your own. All of these options are awful.

A better option would be a way to group file descriptors together through a single "special" one, and then issue more powerful combined operations on that. In fact, such an interface already exists - epoll. Some of that same code could probably be reused to implement a way of flushing multiple files instead of waiting for them. At the very least, this would make flushing lots of files at once simpler and less syscall-intensive. Even better, a decent implementation might allow filesystems to reason about a whole bunch of fsyncs as a group and optimize how all of the relevant block-level I/O gets done. I don't expect that to happen soon, but at least the right API makes it possible.
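
For what it's worth, here's the shape of interface I have in mind, deliberately modeled on epoll. None of these calls exist; syncgroup_create, syncgroup_add, and syncgroup_sync are invented names, just to show how one special descriptor could stand in for a whole set of files.

    #include <unistd.h>

    /* Hypothetical "sync group" calls, shaped like epoll_create/epoll_ctl. */
    int syncgroup_create(void);         /* returns a special descriptor           */
    int syncgroup_add(int sg, int fd);  /* add one file to the group              */
    int syncgroup_sync(int sg);         /* flush every member: one call, one wait */

    /* Flush N files with one syscall and one wait, instead of N serial
     * fsyncs or N waiting threads. */
    static int flush_all(const int *fds, int nfds)
    {
        int sg = syncgroup_create();
        if (sg < 0)
            return -1;
        for (int i = 0; i < nfds; i++)
            syncgroup_add(sg, fds[i]);

        int ret = syncgroup_sync(sg);   /* the FS sees the whole batch at once */
        close(sg);
        return ret;
    }

The particular names don't matter; the point is that a single descriptor gives the kernel one place to see the entire set, which is what makes both the syscall savings and the block-level optimization possible.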

Of course, it's always easy to make suggestions for other people to implement. I try not to tell other people their business, because I have quite enough of them telling me mine. Nonetheless, the need is there and I was asked to propose a solution, so I have. If the people whose job it is don't want to do it themselves, maybe I'd even be willing to help.
