Jeff Bonwick, architect of Sun’s ZFS, has been kind enough to offer some clarification about RAID-Z in a comment to my last post on the subject. I’m not sure at this point whether we actually disagree or are talking past one another, but the sticking point is neatly summed up in the following statement by Jeff.

it's the transactional semantics that make full-stripe writes safe, regardless of whether it's RAID-Z or plain old RAID-5.

If the transactional semantics of ZFS make full-stripe writes safe even for RAID-5, then it's clearly not RAID-Z (which isn't in the picture) that's solving the problem. It's the transactional semantics, including the exclusive use of full-stripe writes, that do so. If RAID-Z is as safe as RAID-5 with ZFS's transactional behavior, and as unsafe without (as I believe I explained in my last post), then it can hardly be considered a solution to RAID-5's problems. Jeff's more detailed explanation makes this equivalence even clearer.

RAID-Z addresses this by using variable stripe width. It treats all the blocks as a matrix, where the disks are columns so that entry (M, N) is the Mth sector of disk N. Space allocation is row-major, but I/O is column-major (so that data is in the clear). In (say) a 4+1 RAID-Z setup, this means that a single-sector write will only touch two disks – one data, one parity. A 3-sector write touches 4 disks – 3 data, 1 parity. A 100-sector write touches all 5 disks, with four disks getting 25 sectors of data each and one disk getting 25 sectors of parity. You might infer that RAID-Z uses more space for very small blocks, but quickly approaches the usual 25% parity overhead (in our 4+1 example) for large blocks. That is correct. I'll blog about this in considerably more detail next week.
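To make the arithmetic concrete, here's a minimal Python sketch of the accounting Jeff describes for a 4+1 group. The `raidz_write` helper is hypothetical, not ZFS code; it just models one parity sector per row of the variable-width stripe:

```python
import math

def raidz_write(sectors, data_disks=4):
    """Hypothetical accounting for a (data_disks + 1) RAID-Z-style group
    with variable stripe width: one parity sector per row actually used."""
    rows = math.ceil(sectors / data_disks)        # rows of the matrix consumed
    disks_touched = min(sectors, data_disks) + 1  # data columns used, plus parity
    overhead = rows / sectors                     # parity sectors per data sector
    return disks_touched, rows, overhead

# Matches Jeff's 4+1 examples:
print(raidz_write(1))    # (2, 1, 1.0)   -> 2 disks; 100% overhead for a tiny block
print(raidz_write(3))    # 4 disks touched: 3 data, 1 parity
print(raidz_write(100))  # (5, 25, 0.25) -> all 5 disks; the usual 25% overhead
```

This reproduces the quoted numbers: small writes pay proportionally more for parity, while large writes converge on 25% overhead in the 4+1 case.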

The key point here is that you could apply a very similar technique if you were using RAID-5. However, you’d risk wasting even more space, and doing more I/O to write zeroes to the unused sectors within a stripe. The RAID-Z solution is clearly preferable from those perspectives, but not from that of data integrity. That brings me to another question about RAID-Z, which is the misleading name. RAID-Z might be a useful technique for a filesystem to use, perhaps even a significant innovation, but it’s not a RAID level. That’s an unwarranted attempt, in my opinion, to ride on RAID’s coat-tails because RAID was a truly significant advance in storage technology and is widely recognized as such. In part, I base that statement on something in Jeff’s original blog entry about RAID-Z.
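For comparison, a sketch of what the same trick would cost on fixed-width RAID-5, where a short write must be padded with zeroes out to a full stripe. Again a hypothetical helper, assuming a 4+1 group and full-stripe writes only:

```python
import math

def raid5_fullstripe_write(sectors, data_disks=4):
    """Hypothetical accounting for full-stripe-only writes on fixed-width
    RAID-5: writes shorter than a stripe are padded with zero sectors."""
    rows = math.ceil(sectors / data_disks)
    data_written = rows * data_disks        # includes the zero padding
    parity_written = rows                   # one parity sector per row
    wasted = data_written - sectors         # zero sectors written needlessly
    return data_written + parity_written, wasted

print(raid5_fullstripe_write(1))   # (5, 3): 5 sectors of I/O, 3 of them zeroes
print(raid5_fullstripe_write(100)) # (125, 0): no waste on stripe-aligned writes
```

A single-sector write costs 5 sectors of I/O here versus 2 under the variable-width scheme, which is exactly the space and I/O advantage conceded above.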

You have to traverse the filesystem metadata to determine the RAID-Z geometry.

True RAID levels don’t require knowledge of higher-level “applications” (e.g. filesystems or volume managers) for reconstruction; that’s what we call a layering violation. All they require is knowledge of which disks are members of the RAID group. In some implementations of some RAID levels one further piece of information (the stripe width) is also needed, but that’s still a far cry from the arbitrarily complex metadata ZFS requires. RAID-Z is inseparable from ZFS and is therefore at ZFS’s semantic/operational level – i.e. not that at which RAID operates.
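The layering point can be illustrated with a toy XOR reconstruction: rebuilding a lost RAID-5 member needs nothing but the surviving disks' contents (and, in some implementations, the stripe width), with no filesystem metadata anywhere in sight. A minimal sketch:

```python
def reconstruct(surviving_columns):
    """Rebuild the missing column of a RAID-5 group by XORing the survivors.
    Needs only disk-group membership -- no higher-level metadata."""
    rebuilt = bytearray(len(surviving_columns[0]))
    for col in surviving_columns:
        for i, b in enumerate(col):
            rebuilt[i] ^= b
    return bytes(rebuilt)

# Toy 2+1 group: parity = d0 XOR d1; losing d1, we recover it from d0 and parity.
d0, d1 = b"\x0f\xf0", b"\xaa\x55"
parity = reconstruct([d0, d1])          # compute the parity column
assert reconstruct([d0, parity]) == d1  # rebuild without consulting any filesystem
```

RAID-Z, by contrast, cannot even locate its stripes without walking ZFS's block pointers, which is precisely the dependency being objected to.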

The fact that RAID-Z isn’t really a RAID level, or that it doesn’t (in and of itself) close the write hole, doesn’t mean it’s not cool. In fact I think it is cool. As I’ve said before, I’m not questioning the technology but a presentation that still seems as much based on marketing as on technical reality.