OK, I’ve had it with the ZFS crew. Believe it or not, I’ve tried to be nice so far. I’ve taken pains to point out that I respect the excellent technical accomplishment that ZFS represents, and mostly wish that the marketing hype could be matched with good-faith technical exposition. That’s not happening. There are dozens of Sun employees making a concerted effort to flood the blogosphere with effusive praise for ZFS, mostly parroting the same empty “last word in filesystems” hype. To be quite blunt, it’s starting to smell a lot like astroturf, and that’s something I really hate. I also really hate it when people show disrespect for their peers, as I believe Bryan Cantrill exemplifies.

there is no other conclusion left to be had: ZFS is the most important revolution in storage software in two decades — and may be the most important idea since the filesystem itself.

I’m sorry, but there’s no other word for that but bullshit. RAID was a great innovation, on whose coattails ZFS attempts to ride with “RAID-Z” even though it’s in a whole different conceptual space than the standard RAID levels. Volume managers have been around for years, and ZFS embeds one; likewise for journaled and atomic-update filesystems, reflected in ZFS’s intent log. ZFS’s pooling and “vertical integration” aren’t all that new either; GFS did many of the same things, earlier, for a whole cluster. (Does the GFS implementation match ZFS’s? Perhaps not, but they haven’t had the resources that Sun has devoted to ZFS either. The important thing is that they represented the same ideas.) All of these were real innovations, and none of them arrived for the first time in ZFS courtesy of the brilliance of Sun engineers alone. ZFS might be the best synthesis ever of these ideas plus some that really are new, but … most important revolution in two decades? Not even close.

The last thing I really hate is when people claim X solves Y, but the explanation of how turns out to be complete baloney (adjective form: balonious). I’ll get to that below the fold, but first a disclaimer. I work for a company producing storage-related functionality (continuous data protection) that I’m sure the ZFS crew would claim their baby makes obsolete. I admit that I’m not an entirely disinterested party, but I’m no less disinterested than the ZFS folks themselves. Besides, they’re wrong. No number of snapshots that have to be planned and taken ahead of time gives you the ability to restore a volume to a point in time that you only recognize in retrospect as the moment immediately preceding a fault or corruption event.
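To make that distinction concrete, here’s a minimal sketch in Python – the names, timestamps, and data are all invented, and this is neither ZFS code nor our product’s, just the shape of the difference:

    # Discrete snapshots: you can only roll back to instants chosen in advance.
    snapshots = {100: "volume state @ t=100",
                 200: "volume state @ t=200",
                 300: "volume state @ t=300"}

    def restore_from_snapshots(fault_time):
        """Best case: the newest snapshot taken before the fault."""
        candidates = [t for t in snapshots if t < fault_time]
        if not candidates:
            raise RuntimeError("no snapshot precedes the fault")
        return snapshots[max(candidates)]

    # Continuous data protection: every write is journaled with a timestamp,
    # so the volume can be rebuilt as of *any* instant -- including one you
    # only identify after the fact.
    journal = [(t, "block%d" % (t % 4), "data written at t=%d" % t)
               for t in range(100, 400, 7)]

    def restore_from_journal(base_image, fault_time):
        """Replay journaled writes strictly before the fault onto a base image."""
        volume = dict(base_image)
        for t, block, data in journal:
            if t >= fault_time:
                break
            volume[block] = data
        return volume

    # Corruption discovered, in retrospect, to have happened at t=257:
    print(restore_from_snapshots(257))            # stuck with t=200
    print(sorted(restore_from_journal({}, 257)))  # rebuilt to just before t=257

The snapshot path can only land on points somebody chose ahead of time; the journal path can stop one write short of the damage, whenever you happen to discover it.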

I’ll deal with the less balonious example first. Matthew Ahrens has posted a good explanation of how snapshots work, neatly answering my questions about how the system knows when it can fully release a block. Kudos to him for doing that. The design actually looks very elegant. Instead of reference counts, they use a “birth time” for each block and a “dead list” associated with each “generation” (my term) of a filesystem between snapshots, and that’s going to be much more efficient. Of course, there’s a little finagling that has to happen regarding the time before the first snapshot or after the last one, but that’s no big deal. More importantly, if snapshots are to fit into ZFS’s overall reliability picture then birth times and dead lists have to be current on disk, not just in memory. That’s not going to come for free. Maybe the cost disappears in light of performance gains realized elsewhere in ZFS, but it’s still there. Similarly, walking dead lists when a snapshot is deleted could involve non-trivial work if the lists are long. Basically what ZFS has done is amortize some of the cost of snapshot creation over longer periods of normal operation, with the remainder pushed into deletion instead. That’s actually a really smart thing to do, but it’s not quite the same as making the cost go away and that’s what the hype suggests.
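For the curious, here’s a toy in-memory model of that mechanism as I understand it from Ahrens’s description – the class names and structure are mine, not ZFS code, and everything that real ZFS has to keep current on disk lives in ordinary Python lists here, which is exactly the cost I’m pointing at:

    import itertools

    class Block:
        """A block pointer: just an id plus the transaction group ("birth time")
        in which it was written."""
        _ids = itertools.count()
        def __init__(self, birth_txg):
            self.id = next(Block._ids)
            self.birth = birth_txg

    class Pool:
        def __init__(self):
            self.txg = 0            # current transaction group
            self.snapshots = []     # (snap_txg, deadlist) pairs, oldest first
            self.live_deadlist = [] # freed by the live fs, still referenced
                                    # by the most recent snapshot
            self.freed = []         # blocks actually returned to the pool

        def write(self):
            self.txg += 1
            return Block(self.txg)

        def last_snap_txg(self):
            return self.snapshots[-1][0] if self.snapshots else 0

        def free(self, block):
            """The live filesystem stops referencing a block."""
            if block.birth > self.last_snap_txg():
                self.freed.append(block)          # no snapshot can reference it
            else:
                self.live_deadlist.append(block)  # some snapshot still needs it

        def snapshot(self):
            """The live deadlist becomes the new snapshot's deadlist."""
            self.txg += 1
            self.snapshots.append((self.txg, self.live_deadlist))
            self.live_deadlist = []

        def delete_snapshot(self, index):
            """Free blocks only this snapshot kept alive; merge the rest."""
            snap_txg, snap_deadlist = self.snapshots.pop(index)
            prev_txg = self.snapshots[index - 1][0] if index > 0 else 0
            # The deadlist "after" the deleted snapshot belongs either to the
            # next snapshot or to the live filesystem.
            if index < len(self.snapshots):
                next_txg, next_deadlist = self.snapshots[index]
            else:
                next_deadlist = self.live_deadlist
            keep = []
            for b in next_deadlist:
                if b.birth > prev_txg:
                    self.freed.append(b)  # only the deleted snapshot held it
                else:
                    keep.append(b)        # an older snapshot still references it
            merged = snap_deadlist + keep
            if index < len(self.snapshots):
                self.snapshots[index] = (next_txg, merged)
            else:
                self.live_deadlist = merged

    # Write, snapshot, free, and watch where each block ends up.
    pool = Pool()
    a, b = pool.write(), pool.write()
    pool.snapshot()
    c = pool.write()
    pool.free(a)             # born before the snapshot: onto the live deadlist
    pool.free(c)             # born after it: freed immediately
    pool.delete_snapshot(0)  # now a is uniquely dead, so it gets freed too
    assert {blk.id for blk in pool.freed} == {a.id, c.id}

Note that delete_snapshot walks the entire next deadlist; that’s the deletion-time work I was talking about, and in the real thing all of those list updates have to reach disk.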

Now, on to the serious baloney. Jeff Bonwick claims that “RAID-Z” fixes the “write hole” we discussed last time. His “solution” to the write hole centers on the following.

Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write.

The first question that comes to mind is: whose blocksize? In my experience, people who have spent all of their time on the host side of the I/O equation – as seems to be the case for Bonwick – are often sloppy in distinguishing between what the filesystem thinks is a block and what the hardware thinks is a block. People who’ve spent time on the storage-subsystem side are usually more careful, many using “sector” to mean the hardware concept (even though an actual disk-drive person might say that’s often incorrect too) and “block” to mean the software concept. Since most hardware does not support variable-sized blocks beyond format time, Bonwick is obviously referring to the software entity … but the hardware doesn’t know or care about his block size. It makes no guarantee that anything bigger than a sector will be updated (or not) atomically, so his “full stripe” write might be partially complete after all. It’s not even unheard of for a device to write only part of a sector, though it’s not supposed to. Defining your own filesystem blocksize might be convenient conceptually, but it’s pretty meaningless with respect to fixing the RAID-5 write hole. It gets worse, though. Bonwick makes another revealing comment as well.

Where a full-stripe write can simply issue all the writes ascynhronously [sic], a partial-stripe write must do synchronous reads before it can even start the writes.

Note the reference to “all the writes” even for a full stripe. If the user only wrote part of that stripe, either the remainder must have been read in anyway (incurring the same read-modify-write penalty as Bonwick attributes to RAID) or a whole stripe is being written but only part of it is considered meaningful (wasting space). Don’t worry, vendors will always be willing to sell you more disk. As Bonwick himself says, there’s no problem your wallet can’t solve. More to the point, what happens when you issue writes (plural) for a single supposedly contiguous area? Why, they might get reordered, they might succeed or fail independently instead of all together, etc. In short, there are now even more failure cases to account for when trying to deal with the write hole. The only reason you’d even consider doing such a thing would be if multiple discontiguous parts of the stripe were filled with user data and the rest weren’t – i.e. the second alternative above. What it sounds like – and no, I haven’t looked at the code yet to verify – is that ZFS wastes space doing fake full-stripe writes (fake because they really only contain partial data) and in the process makes the write-hole problem worse instead of better. If this is the solution, I think I’ll keep the problem.
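For reference, here’s the read-modify-write penalty mentioned above, spelled out as plain RAID-5 parity arithmetic in Python – nothing ZFS-specific, with chunks shrunk to a few bytes so the XOR identities are easy to check:

    import os
    from functools import reduce

    CHUNK = 16  # bytes per disk in one stripe, kept tiny for the example

    def bxor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def parity(chunks):
        """Full-stripe parity: XOR of every data chunk. No reads are needed
        if the writer already has the whole stripe in hand."""
        return reduce(bxor, chunks)

    # A 4+1 stripe: four data chunks plus one parity chunk.
    stripe = [os.urandom(CHUNK) for _ in range(4)]
    p = parity(stripe)

    # Partial-stripe update of chunk 2 by read-modify-write: read the old data
    # and the old parity, then new_parity = old_parity ^ old_data ^ new_data.
    # Two reads and two writes for what the user saw as one small write.
    new_data = os.urandom(CHUNK)
    new_p_rmw = bxor(bxor(p, stripe[2]), new_data)

    # The same update done reconstruct-write style: read every *other* data
    # chunk and recompute parity from scratch. Still extra reads.
    others = [c for i, c in enumerate(stripe) if i != 2]
    new_p_recon = parity(others + [new_data])

    assert new_p_rmw == new_p_recon
    stripe[2] = new_data
    assert parity(stripe) == new_p_rmw

Either way, a write the user thought was small turns into I/O against disks it never touched; declaring every write a “full-stripe write” doesn’t make that cost vanish, it just decides where to hide it.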

In the end, it doesn’t look like “RAID-Z” really does squat to make the write hole go away. The real magic is in the way ZFS structures its metadata as a tree and does atomic updates by working its way up that tree (presaged by WAFL), and the way it verifies checksums along the way. That’s the really cool stuff. That’s what Bonwick should be presenting as the key to data integrity in ZFS. Maybe, while he’s at it, he can explain why ZFS still needs an intent log if both “RAID-Z” and atomic updates supposedly solve the problem of integrity for asynchronously written data, or why both Neil Perrin and Neelakanth Nadgir make a point of mentioning how the supposedly unnecessary NVRAM would make intent-log operation much faster (or why that should matter). Something there just doesn’t quite add up, but I’m getting used to that.
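To be fair, here’s a toy of the part I do find compelling – copy-on-write updates that push fresh checksummed pointers up a tree, so a change becomes visible only when the single root block (the uberblock, in ZFS-speak) is switched. The classes and names below are mine, a sketch of the concept rather than ZFS’s actual on-disk format:

    import hashlib

    class Node:
        """An immutable tree node; a parent's checksum covers its children's
        checksums, so corruption anywhere below is detectable from above."""
        def __init__(self, data=None, children=()):
            self.data = data
            self.children = tuple(children)
            payload = (data or b"") + b"".join(c.checksum for c in self.children)
            self.checksum = hashlib.sha256(payload).digest()

    def update_leaf(root, path, new_data):
        """Copy-on-write update: rebuild only the nodes along `path`, bottom-up.
        The old tree is never modified; it stays valid until the new root is
        installed, and that single root switch is the atomic step."""
        if not path:
            return Node(data=new_data)
        i = path[0]
        new_child = update_leaf(root.children[i], path[1:], new_data)
        kids = list(root.children)
        kids[i] = new_child
        return Node(data=root.data, children=kids)

    def verify(node):
        """Walk down re-deriving checksums before trusting anything."""
        payload = (node.data or b"") + b"".join(c.checksum for c in node.children)
        assert hashlib.sha256(payload).digest() == node.checksum
        for c in node.children:
            verify(c)

    leaves = [Node(data=b"block%d" % i) for i in range(4)]
    mid = [Node(children=leaves[:2]), Node(children=leaves[2:])]
    old_root = Node(children=mid)

    new_root = update_leaf(old_root, [1, 0], b"new contents")
    verify(new_root)   # the new tree is self-consistent
    verify(old_root)   # the old tree is untouched; crash before the root
                       # switch and it is still there, still verifiable
    assert old_root.checksum != new_root.checksum

That, not “RAID-Z”, is where the integrity story actually lives, and it deserves an honest write-up far more than the stripe-per-block hand-waving does.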