Yes, that’s right, folks. As Wes Felter accurately puts it, “After years of hype, Sun released ZFS.” The first thing I noticed, perusing these pages, is that the people working on it can’t seem to get their stories straight. The front page says this…

the best part - no need for NVRAM in hardware. ZFS loves cheap disks.

…but Neil Perrin says this…

There’s also more work to do. For example, using nvram/solid state disks for the log would turbo-ise it.

Hmmm. Similarly, Bill Moore writes…

A product is only as good as its test suite

…and yet, the FAQ actually includes a section on “What can I do if ZFS panics on every boot?” Panics on every boot? Don’t you think that case could have been covered by the all-singing, all-dancing test suite? That’s what most people would consider a show-stopper, but apparently not the intensely quality-oriented ZFS team. Enough of the sniping, though; let’s move on to more substantial questions.

I found this claim on the ZFS page interesting:

ZFS introduces a new data replication model called RAID-Z. It is similar to RAID-5 but uses variable stripe width to eliminate the RAID-5 write hole (stripe corruption due to loss of power between data and parity updates). All RAID-Z writes are full-stripe writes. There’s no read-modify-write tax, no write hole, and - the best part - no need for NVRAM in hardware. ZFS loves cheap disks.

The write hole is real, but doing full-stripe writes does not (by itself) fix it. The mention of a read-modify-write tax hints at where this claim comes from. The tax occurs when one must read the old data plus parity, calculate the new parity, then write the new data plus parity. The thing is, few hardware-based RAID implementations pay this tax any more. Often either the old data or the old parity is already in cache, and the new parity can be calculated from either of those plus the new data without doing a read at all. Even more significantly, if a write can be acknowledged as soon as it hits a fault-tolerant cache, any necessary reads and both writes can be deferred indefinitely; mid- to high-end arrays are designed with enough reserve power to dump their cache to disk if external power is lost, and with logic to pick up where they left off when it’s restored. (Few of them use true NVRAM, by the way, which also casts a dim light on the ZFS team’s understanding of modern storage.)

The write hole becomes relevant when you try the same sort of early acknowledgement in a RAID system that does not have a properly protected cache, which includes most host-based software implementations such as ZFS. Then the possibility of a failure between the data and parity writes taking out data is very real, and doing full-stripe writes does indeed reduce the danger … but not to zero. It’s still possible for a failure to occur in the middle of a single I/O. Maybe there’s additional magic in “RAID-Z” (which reminds me of the old “RAID-7” marketing term) to address the problem further, but it looks like I’ll have to dig to find that information, and when I do I won’t be surprised if their solution is a little less innovative and/or performance-neutral than claimed. The fact that they did implement an intent log (which somehow never made it into the slides) suggests that the not-very-sexy real solution lies there.

The question of space reclamation, raised in my last post about ZFS, also remains unanswered. Maybe when I have a chance to do that digging I’ll turn up some answers on that as well.
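To make the read-modify-write tax and the write hole a bit more concrete, here’s a minimal sketch in Python. It’s a toy model (made-up function names, a dict standing in for a stripe), not ZFS or any real array code, but it shows the parity arithmetic both ways and where a crash between the data write and the parity write leaves the stripe lying about its contents.

```python
# Toy RAID-5 parity arithmetic; purely illustrative, not real ZFS or array code.
# Parity is the XOR of all data blocks in a stripe.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def full_stripe_parity(data_blocks):
    """Full-stripe write: parity computed from all new data, no reads needed."""
    parity = data_blocks[0]
    for blk in data_blocks[1:]:
        parity = xor(parity, blk)
    return parity

def rmw_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Small write: new parity from old data + old parity + new data.
    This is the read-modify-write tax (two reads, then two writes),
    unless the old data or parity is already sitting in cache."""
    return xor(xor(old_parity, old_data), new_data)

def unsafe_small_write(stripe, idx, new_data, crash_after_data=False):
    """The write hole: data and parity land in separate steps.  If power
    fails after the data write but before the parity write (and there is
    no protected cache or intent log to replay from), the surviving
    parity no longer matches the data it is supposed to protect."""
    old_data, old_parity = stripe['data'][idx], stripe['parity']
    stripe['data'][idx] = new_data                                   # step 1: data
    if crash_after_data:
        return                                                       # simulated power loss
    stripe['parity'] = rmw_parity(old_data, old_parity, new_data)    # step 2: parity

if __name__ == '__main__':
    blocks = [bytes([i] * 4) for i in (1, 2, 3)]
    stripe = {'data': blocks[:], 'parity': full_stripe_parity(blocks)}
    unsafe_small_write(stripe, 0, bytes([9] * 4), crash_after_data=True)
    # Parity is now stale: reconstructing data[0] from the other blocks
    # plus parity silently returns the *old* contents.
    recon = xor(xor(stripe['data'][1], stripe['data'][2]), stripe['parity'])
    print('reconstructed:', recon, 'actual:', stripe['data'][0])
```

Full-stripe writes avoid the two reads, but both paths still issue the data and parity writes as separate steps that a power failure can split; that’s why the array-side answer is a protected cache and the host-side answer is something like an intent log.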

I don’t want it to seem like my feelings about ZFS are negative overall. I think the scale and the administration and the self-checking and self-healing and the performance are all great. I think the I/O scheduler looks pretty interesting and I’ll enjoy exploring its intricacies, though I do sort of share Wes’s befuddlement at why it’s not part of Solaris itself if it really is All That. My problem with ZFS is not with the technical content but with the presentation. When I see people having such trouble staying on the same page I always wonder whether what’s on the page is fiction. It seems like, instead of being confident enough of ZFS’s strengths to admit that it still has some soft spots as well, people are trying to hand-wave about the soft spots in an effort to portray ZFS as perfect. Maybe there’s pressure from Sun’s upper management, which in general seems rather desperate nowadays, for the work product of this many person-years to be the Ultimate Everything. That would be sad, but I know enough about big-company politics to think it’s pretty likely.