One of the hottest topics in the Linux world lately has been the issue of atomically updating a file on a filesystem that uses delayed allocation, and whether fsync() is an acceptable solution. This is an issue now because, even though many filesystems have used delayed allocation for a while, ext4 is the first to make it into common enough use to spark the debate. One of the best discussions I’ve seen so far is Alexander Larsson’s (thanks to Wes Felter for the link). It also refers to a proposal from Ted Ts’o regarding the issue, which is worth reading.
One of the things that might not be obvious about Ted’s proposal is that it’s constructed to maintain a separation between files and the directory entries that (might) point to them. The desirability of such separation is a bit of a religious issue which I’m not going to get into; the point here is that, while Ted doesn’t explicitly mention it, this explains many things about his proposal that might otherwise seem strange or unnecessary. It’s actually a good proposal as far as the file/directory separation issue goes, but I think it runs smack into another issue: like the fsync() approach, it tries to fix an ordering issue by forcing synchronous updates. In the same LWN discussion Ted even cites Anton Ertl’s explanation of what’s wrong with synchronous metadata updates, but I would say that synchronous data updates – such as the fsync-like behavior implied by the comment attached to flinkat() in Ted’s proposal – are bad for almost exactly the same reasons. The problem here is that the common open/write/rename idiom represents a clearly intended ordering of both file and directory operations, and that ordering can be preserved for the file operations (the writes) but the directory operation is allowed to “jump the queue” because it’s not a file operation. (Note, BTW, that the open is both a file and a directory operation, with clear ordering semantics wrt the writes. So much for that mythical separation between file and directory operations.) My suggestion is that if you have an ordering problem then you should provide a way to preserve ordering. Forcing certain operations to be done synchronously is not necessary and hurts performance/scalability, which is exactly why people are avoiding or complaining about fsync() in the first place.
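For reference, here is roughly what the open/write/rename idiom looks like once the fsync() is added. This is a minimal sketch in Python using the standard os module; the function name atomic_replace, the sync flag, and the ".tmp" suffix are my own illustration, not anything taken from Ted's proposal or from a particular application.

```python
import os

def atomic_replace(path, data, sync=True):
    """Replace the contents of `path` with `data` via write-to-temp plus rename."""
    tmp_path = path + ".tmp"  # hypothetical naming convention for the temp file
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write may write fewer bytes than asked, so loop until done.
        while data:
            written = os.write(fd, data)
            data = data[written:]
        if sync:
            # The synchronous step under debate: block until the data is on
            # stable storage, so the rename cannot overtake the writes.
            os.fsync(fd)
    finally:
        os.close(fd)
    # The directory operation. Without the fsync above, a filesystem with
    # delayed allocation may commit this before the data itself.
    os.rename(tmp_path, path)
```

With sync=False this is the idiom as many applications actually write it, which expresses the intended ordering but does nothing to enforce it. With sync=True the ordering is enforced, at the cost of stalling the caller on every update.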
Unfortunately, the issue of ordering vs. synchrony highlights a pretty fundamental problem that pervades POSIX: the assumption that synchronous operations are the norm, and asynchronous operations are handled in a second-class kind of way if at all. If not for that, then all metadata calls including rename() could be done asynchronously. Once you’re doing operations asynchronously, it’s a small step to add predicates that must be satisfied before they execute. A solution for some hypothetical system not hobbled by some of the sillier Linux/UNIX/POSIX dogma might therefore look like this:
- token1 = fbarrier(fd);
- Inserts a marker into fd’s I/O stream, and returns a token corresponding to that marker. The token does not become valid until the marker leaves the I/O stream.
- token2 = rename_async(old_name,new_name,token1);
- Queues a rename operation from old_name to new_name, to execute when token1 becomes valid, and returns token2 representing the status of the rename itself (e.g. queued, completed, failed). Note that token1 could represent any kind of pending event, not just a token from fbarrier.
- status = query_token(token2);
- Finds out whether the rename actually completed (optional). There could also be wait_for_token(), epoll() support for generic async tokens, etc., providing a fully generic infrastructure for asynchronous programming.
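To make the intended semantics a bit more concrete, here is a toy user-space simulation of those three calls in Python. None of this is a real kernel or libc interface: fbarrier, rename_async and query_token are the hypothetical calls from the list above, while Token, Stream and drain() are stand-ins I've invented for the token object, a file's pending I/O stream, and background writeback. It assumes a POSIX system, where renaming a file that is still open is allowed.

```python
import os
from collections import deque

QUEUED, COMPLETED, FAILED = "queued", "completed", "failed"

class Token:
    """Hypothetical async token; 'valid' once its event has happened."""
    def __init__(self):
        self.status = QUEUED
        self._waiters = []

    def on_valid(self, callback):
        # Run callback now if already resolved, otherwise when resolved.
        if self.status == QUEUED:
            self._waiters.append(callback)
        else:
            callback(self)

    def resolve(self, status=COMPLETED):
        self.status = status
        waiters, self._waiters = self._waiters, []
        for callback in waiters:
            callback(self)

class Stream:
    """Stand-in for a file's I/O stream: delayed writes plus barrier markers."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        self.pending = deque()

    def write(self, data):
        self.pending.append(data)  # delayed: nothing reaches the file yet

    def drain(self):
        # Plays the role of background writeback, processing entries in order.
        while self.pending:
            item = self.pending.popleft()
            if isinstance(item, Token):
                item.resolve()  # the marker has left the I/O stream
            else:
                while item:
                    item = item[os.write(self.fd, item):]

    def close(self):
        os.close(self.fd)

def fbarrier(stream):
    token = Token()
    stream.pending.append(token)  # marker sits behind all earlier writes
    return token

def rename_async(old_name, new_name, after):
    result = Token()
    def run(dependency):
        if dependency.status != COMPLETED:
            result.resolve(FAILED)
            return
        try:
            os.rename(old_name, new_name)
        except OSError:
            result.resolve(FAILED)
        else:
            result.resolve(COMPLETED)
    after.on_valid(run)  # the rename waits for the token, not the caller
    return result

def query_token(token):
    return token.status
```

In this sketch the caller never blocks: rename_async returns immediately with a queued token, and the rename only happens when drain() carries the marker past the writes that preceded it. Writes queued after the fbarrier call are not covered by that token, which is the point: only the ordering that was asked for is enforced.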
Someone’s bound to point out that such an approach does not lock the parent directory (as Ted’s proposal does). That means it’s still vulnerable to certain kinds of races involving the parent; separate solutions for that problem should be obvious and are left as an exercise for the reader. The particular problem I’m trying to focus on is that of preserving a commonly expected kind of ordering between writes and renames, without forcing any of the operations to be synchronous.