Notes on File System Semantics

Fri 09 January 2015

tags: storage

Just some random thoughts from an email I sent recently, plus a bonus SCSI war story.

As the PVFS folks said long before I came along, some POSIX requirements are inappropriate for a distributed file system. I agree with that, but not with the object-store folks who claim that the entire hierarchical byte-addressable file system model is obsolete. I think most of that model is still valuable for compatibility with the thousands of applications that are out there. Only a few over-specified behaviors need to be retired: the ones that few people will miss (often few even know about them) and that are obviously problematic in a distributed system.


Users can't reason well about consistency guarantees that are conditional on the availability of specific servers. "After a write, readers will see X" is easy to reason about. Adding "...or reads will fail if certain system-wide conditions are met" doesn't make it much worse. Adding "...or they might see Y if some otherwise-invisible event intervenes" kind of leaves them hanging. If writes can disappear, other than in the event of a system-wide failure, then I'd say you effectively have no guarantees at all and that's OK. One of the hard-won lessons from working in this field for a long time is that it's better to make few and simple promises (which you can be sure of keeping) than to get dragged into long discussions of what was or was not promised under what conditions. That's not a good place to be in when users' data is at stake.

The last point is the most important IMO. I first ran into this back in '94, when I was working on one of the earlier multi-pathing SCSI drivers (REACT for the IBM 7135). My code would try really hard to maintain or re-establish contact with a volume, despite any combination of failures. While I was working in England with the people who actually built the hardware, we discovered one case where this persistence meant we'd flap around for five minutes or so, repeatedly switching between controllers before we had finally observed and cleared enough error conditions to continue normally. I thought it was awesome that we were able to recover. One of the older engineers was unimpressed. To him, those five minutes of unpredictable behavior negated any subsequent success. He argued that it would be better to try both controllers, then simply fail. His view prevailed, and in retrospect I think rightly so. Sometimes, "weak promises strongly kept" is better than the alternative, especially when there's a higher layer that can build on that to provide its own guarantees.
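For what it's worth, the "try both controllers, then simply fail" policy is small enough to sketch. This is only an illustration of the idea, not the REACT driver: the names `send_with_failover` and `PathFailed` are made up, and each controller is modeled as a plain callable that either returns a result or raises `IOError`.

```python
class PathFailed(Exception):
    """Raised once every controller has been tried without success."""


def send_with_failover(controllers, request):
    """Try each controller exactly once, in order, then give up.

    `controllers` is a hypothetical list of callables standing in for
    the paths to a volume. There is deliberately no loop back to the
    first controller: the promise is "each path once, then an error".
    """
    errors = []
    for controller in controllers:
        try:
            return controller(request)
        except IOError as exc:
            # Remember why this path failed and move on to the next one.
            errors.append(exc)
    # Bounded and predictable: the caller hears about the failure quickly.
    raise PathFailed(errors)
```

The point of the design is that the worst case is bounded by the number of paths, so a higher layer can decide for itself whether to retry, fail over at its own level, or report the error.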

BTW, the test involved here was the infamous "pen in a fan" which was amusing in its own way. The board had three signal lines to report faults, but more than three faults to report. Therefore, the lines were multiplexed. Sticking a pen in a fan would cause the board to signal fault 0x7 (all three lines asserted). However, the person who wrote the board firmware didn't read the hardware spec properly, and out in SCSI-land this would be reported as three separate faults: 0x4, 0x2, and 0x1. This is what caused us to keep going back and forth so much, clearing one pseudo-fault each time instead of all at once. Now that the wounds have healed, I can look back and laugh. At the time I was not so amused.
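The two readings of those three lines are easy to put side by side. A minimal sketch, assuming made-up function names and assuming that "no lines asserted" means "no fault"; the actual firmware and fault table aren't reproduced here.

```python
def decode_as_code(lines):
    """Intended reading: the three lines together encode one fault number."""
    code = lines & 0x7
    # Assumption: a value of 0 means no fault is being signalled.
    return [code] if code else []


def decode_as_bitmask(lines):
    """Mistaken reading: every asserted line becomes its own fault."""
    return [bit for bit in (0x4, 0x2, 0x1) if lines & bit]


# Pen in the fan: all three lines asserted.
print(decode_as_code(0x7))     # a single fault to observe and clear
print(decode_as_bitmask(0x7))  # three pseudo-faults, cleared one at a time
```

With the bitmask reading, the driver sees a fresh error after each one it clears, which is the back-and-forth described above.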
