Fri 22 November 2013
When I wrote about how
local filesystems suck a while ago, it sparked a
bit of debate. Mostly it was just local-filesystem developers being defensive,
but Dave Chinner did make the quite reasonable suggestion that I could help by
proposing a better alternative to the fsync problem. I've owed him an answer
since then; here it is.
To recap, the main problem with fsync is that it conflates ordering with
synchrony. There's no way to ensure that two writes happen in order except
by waiting for the first to complete before issuing the second. This
sacrifices any possibility of pipelining requests, which is essential for
performance. What's funny is that local filesystems themselves take advantage
of a model that does allow such ordered pipelines - tagged command queuing,
which I first encountered twenty years ago when I worked with parallel SCSI.
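Concretely, here's what "write A before write B" costs today: a full synchronous fsync between the two, so the second request can't even be issued until the first has completed. A minimal sketch using plain os-level calls (the journal/commit payload is just for illustration):

```python
import os, tempfile

# The only portable ordering primitive today: block on fsync between
# the two writes, sacrificing any pipelining of the second request.
def ordered_writes(path, first, second):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, first)
        os.fsync(fd)            # block until A is durable...
        os.write(fd, second)
        os.fsync(fd)            # ...only then is B even issued
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "journal")
ordered_writes(path, b"journal-entry;", b"commit-record")
```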
The basic idea is that a device has multiple queues. Each request specifies
which queue it should go on, plus some bits to specify how it should be queued:
- SIMPLE means that there are no particular queuing or ordering restrictions.
  A series of SIMPLE requests can be reordered and/or issued in parallel with
  respect to each other, but not with respect to non-SIMPLE requests.
- ORDERED means that the request must wait for all earlier requests on the
  same queue, and all later requests on the same queue must wait for it. This
  allows pipelining, but not parallelism or reordering.
- HEAD is the same as ORDERED, except that the new request is inserted at the
  head of the queue instead of the tail. This is generally a very bad idea,
  but it's necessary in certain situations. For example, the drivers I was
  writing used it to issue the commands for controller failover while leaving
  the rest of the queue intact.
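To make those semantics concrete, here is a toy scheduler (pure Python, purely illustrative; real TCQ lives in firmware and the block layer) that decides which queued requests are eligible to issue: consecutive SIMPLE requests at the front may all run in parallel, an ORDERED request runs alone after everything before it, and HEAD jumps straight to the front:

```python
from dataclasses import dataclass

SIMPLE, ORDERED, HEAD = "SIMPLE", "ORDERED", "HEAD"

@dataclass
class Request:
    name: str
    tag: str

class TaggedQueue:
    def __init__(self):
        self.pending = []

    def submit(self, req):
        if req.tag == HEAD:
            self.pending.insert(0, req)     # jump straight to the head
        else:
            self.pending.append(req)

    def dispatchable(self):
        """Requests the device may issue right now.

        SIMPLE requests at the front may all be issued, in any order or
        in parallel; an ORDERED request (HEAD behaves as ORDERED once
        queued) runs alone, after everything ahead of it.
        """
        batch = []
        for req in self.pending:
            if req.tag == SIMPLE:
                batch.append(req)
            else:
                if not batch:
                    batch = [req]           # a barrier runs by itself
                break
        return batch

q = TaggedQueue()
q.submit(Request("w1", SIMPLE))
q.submit(Request("w2", SIMPLE))
q.submit(Request("sync", ORDERED))
q.submit(Request("w3", SIMPLE))
print([r.name for r in q.dispatchable()])   # ['w1', 'w2'] - barrier waits
q.submit(Request("failover", HEAD))
print([r.name for r in q.dispatchable()])   # ['failover'] - now at the head
```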
The funny thing is that this model has been around so long that it has bubbled
up to the OS block layer, where local filesystems can take advantage of it to
ensure correct ordering while maintaining performance, but then those same
local filesystems don't expose it to anyone else. Seems rather selfish to me.
The obvious solution is simply to add queue/type parameters to writev (and
possibly other calls as well). Current behavior is equivalent to SIMPLE
queuing. Fsync is equivalent to an ORDERED no-op issued synchronously. That's
all very well, but the model provides pretty obvious ways to do even more:
- An asynchronous fsync becomes possible simply by issuing an ORDERED no-op
  (a zero-length write?) using AIO. You don't have to wait for it, but you
  can be assured that it's in the pipeline and order will be maintained.
- If you only need ordering between some requests, you can use ORDERED on
  those and SIMPLE on the rest, keeping parallelism wherever ordering
  doesn't actually matter.
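Here's a toy model of what that asynchronous fsync could feel like to a caller. The `OrderedFile` class is entirely hypothetical, a user-space stand-in built on ordinary os calls and a thread pool (the real thing would live in the kernel): SIMPLE writes may complete in any order, while `barrier()` is an ORDERED no-op that completes only after everything submitted before it, without the submitter having to block.

```python
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

class OrderedFile:
    """Hypothetical sketch of the proposed interface, for illustration
    only. SIMPLE writes may complete in any order; barrier() is an
    ORDERED no-op, i.e. an asynchronous fsync: it goes into the
    pipeline and nobody has to wait for it."""

    def __init__(self, path, workers=4):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.pool = ThreadPoolExecutor(workers)
        self.inflight = []          # requests since the last barrier
        self.last_barrier = None

    def write_simple(self, data):
        barrier = self.last_barrier
        def run():
            if barrier is not None:
                barrier.result()    # later requests wait for earlier barriers
            return os.write(self.fd, data)
        fut = self.pool.submit(run)
        self.inflight.append(fut)
        return fut

    def barrier(self):
        prior, self.inflight = self.inflight, []
        def run():
            for fut in prior:
                fut.result()        # everything before the barrier first...
            os.fsync(self.fd)       # ...then make it durable
        self.last_barrier = self.pool.submit(run)
        return self.last_barrier

    def close(self):
        self.barrier().result()
        self.pool.shutdown()
        os.close(self.fd)

path = os.path.join(tempfile.mkdtemp(), "pipelined")
f = OrderedFile(path)
f.write_simple(b"a")
f.write_simple(b"b")
f.barrier()                 # async fsync: queued, not waited for
f.write_simple(b"c")        # guaranteed to land after a and b
f.close()
```

The caller never blocks except at close; "a" and "b" may land in either order, but both are durable before "c" is written.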
This is a clean, powerful, and well-proven model. Unfortunately, local-FS
developers will probably argue that it's too hard to implement (even though
they've already implemented it for their own uses, e.g. ordering data vs. inode
writes). However, most of this complexity has to do with multiple queues.
Implementing just SIMPLE/ORDERED without multiple queues would be much easier,
and still much better than what we have now.
The other problem with fsync is that it only flushes the pipeline for a single
file descriptor (actually in practice it's more likely to be the inode). If
you want to flush the pipelines for a bunch of file descriptors, you have to
issue fsync for each one separately. This is not just an inconvenience; it
means that you either need to wait for N fsyncs in sequence, or have N threads
handy to wait for them in parallel. The other alternative is to issue syncfs
instead - possibly having to wait for I/O from other applications and other
users as well as your own. All of these options are awful. A better option
would be a way to group file descriptors together through a single "special"
one, and then issue more powerful combined operations on that. In fact, such
an interface already exists -
epoll. Some of that same code could probably
be reused to implement a way of flushing multiple files instead of waiting for
them. At the very least, this would make flushing lots of files at once
simpler and less syscall-intensive. Even better, a decent implementation might
allow filesystems to reason about a whole bunch of fsyncs as a group and
optimize how all of the relevant block-level I/O gets done. I don't expect
that to happen soon, but at least the right API makes it possible.
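Until something along those lines exists, the thread-per-fsync workaround mentioned above can at least be packaged cleanly. A sketch using only standard Python, so the N fsyncs overlap instead of running as N sequential synchronous round trips:

```python
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

def flush_all(fds):
    """Flush many descriptors at once: one thread per fsync so the
    waits overlap, instead of N back-to-back synchronous round trips."""
    with ThreadPoolExecutor(max_workers=len(fds)) as pool:
        list(pool.map(os.fsync, fds))

tmp = tempfile.mkdtemp()
fds = [os.open(os.path.join(tmp, "f%d" % i),
               os.O_WRONLY | os.O_CREAT, 0o600) for i in range(4)]
for fd in fds:
    os.write(fd, b"data")
flush_all(fds)              # all four flushed in parallel
for fd in fds:
    os.close(fd)
```

This only hides the latency, of course; the kernel still sees four unrelated fsyncs and can't optimize them as a group, which is exactly the point of wanting a real grouped interface.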
Of course, it's always easy to make suggestions for other people to implement.
I try not to tell other people their business, because I have quite enough of
them telling me mine. Nonetheless, the need is there and I was asked to
propose a solution, so I have. Maybe if the people whose job it is don't want
to do it themselves then I'd even be willing to help.