RAID-Z Redux

Jeff Darcy November 25, 2005 22:20

Jeff Bonwick, architect of Sun’s ZFS, has been kind enough to offer some clarification about RAID-Z in a comment to my last post on the subject. I’m not sure at this point whether we actually disagree or are talking past one another, but the sticking point is neatly summed up in the following statement by Jeff.

it’s the transactional semantics that make full-stripe writes safe, regardless of whether it’s RAID-Z or plain old RAID-5.

If the transactional semantics of ZFS make full-stripe writes safe even for RAID-5, then it’s clearly not RAID-Z (which isn’t in the picture) that’s solving the problem. It’s the transactional semantics, including the exclusive use of full-stripe writes, that do so. If RAID-Z is as safe as RAID-5 with ZFS’s transactional behavior, and as unsafe without (as I believe I explained in my last post), then it can hardly be considered a solution to RAID-5’s problems. Jeff’s more detailed explanation makes this equivalence even clearer.

RAID-Z addresses this by using variable stripe width. It treats all the blocks as a matrix, where the disks are columns so that entry (M, N) is the Mth sector of disk N. Space allocation is row-major, but I/O is column-major (so that data is in the clear). In (say) a 4+1 RAID-Z setup, this means that a single-sector write will only touch two disks – one data, one parity. A 3-sector write touches 4 disks – 3 data, 1 parity. A 100-sector write touches all 5 disks, with four disks getting 25 sectors of data each and one disk getting 25 sectors of parity. You might infer that RAID-Z uses more space for very small blocks, but quickly approaches the usual 25% parity overhead (in our 4+1 example) for large blocks. That is correct. I’ll blog about this in considerably more detail next week.
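To make the arithmetic in that description concrete, here is a minimal Python sketch of the 4+1 example. It is only a model of the quoted explanation, not ZFS code: the function name `raidz_write` and its parameters are my own, and it assumes one parity sector per row of up to four data sectors, ignoring any padding or allocation detail a real implementation might add.

```python
from math import ceil

def raidz_write(sectors, data_disks=4):
    """Simplified single-parity model of the description quoted above.

    Assumption: one parity sector is written per row of up to
    `data_disks` data sectors; nothing else (padding, metadata) is modeled.
    """
    rows = ceil(sectors / data_disks)        # parity sectors needed
    data_touched = min(sectors, data_disks)  # data disks that receive I/O
    return {
        "disks_touched": data_touched + 1,   # plus the one parity disk
        "parity_sectors": rows,
        "parity_overhead": rows / sectors,   # parity relative to data written
    }

# The three write sizes used in the quoted example.
for n in (1, 3, 100):
    print(n, raidz_write(n))
```

Under these assumptions a single-sector write pays 100% parity overhead, while the 100-sector write pays the 25% that the quote mentions.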

The key point here is that you could apply a very similar technique if you were using RAID-5. However, you’d risk wasting even more space, and doing more I/O to write zeroes to the unused sectors within a stripe. The RAID-Z solution is clearly preferable from those perspectives, but not from that of data integrity. That brings me to another question about RAID-Z, which is the misleading name. RAID-Z might be a useful technique for a filesystem to use, perhaps even a significant innovation, but it’s not a RAID level. That’s an unwarranted attempt, in my opinion, to ride on RAID’s coat-tails because RAID was a truly significant advance in storage technology and is widely recognized as such. In part, I base that statement on something in Jeff’s original blog entry about RAID-Z.

You have to traverse the filesystem metadata to determine the RAID-Z geometry.

True RAID levels don’t require knowledge of higher-level “applications” (e.g. filesystems or volume managers) for reconstruction; that’s what we call a layering violation. All they require is knowledge of which disks are members of the RAID group. In some implementations of some RAID levels one further piece of information (the stripe width) is also needed, but that’s still a far cry from the arbitrarily complex metadata ZFS requires. RAID-Z is inseparable from ZFS and is therefore at ZFS’s semantic/operational level – i.e. not that at which RAID operates.
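As a rough illustration of what I mean by reconstruction that needs no higher-level knowledge, here is a minimal Python sketch of single-parity (RAID-5 style) rebuild. The helper names `xor_blocks` and `rebuild_member` and the block contents are made up for the example; the only inputs are the blocks at the same offset on the surviving members, which is the point.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_member(stripe, failed):
    """Rebuild the failed member's block from the other members of one stripe.

    `stripe` holds one block per member disk (data and parity alike);
    `failed` is the index of the missing member. No filesystem metadata
    is consulted, only group membership and the block offset.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != failed]
    return xor_blocks(survivors)

# Hypothetical 4+1 stripe: four data blocks plus their XOR parity.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = data + [xor_blocks(data)]
print(rebuild_member(stripe, 2) == data[2])
```

A controller running something like this can rebuild a replaced disk offset by offset without knowing what any block means, which is exactly what RAID-Z's variable geometry rules out.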

The fact that RAID-Z isn’t really a RAID level, or that it doesn’t (in and of itself) close the write hole, doesn’t mean it’s not cool. In fact I think it is cool. As I’ve said before, I’m not questioning the technology but a presentation that still seems as much based on marketing as on technical reality.

6 Responses to “RAID-Z Redux”

  1. Bryan Cantrill on 26 Nov 2005 at 1:09 pm

    What you call a “layering violation” is exactly the innovation in ZFS; the layering between file systems and volume management had become an information divide that resulted in the layers working at cross-purposes. This kind of false layering is one of the prominent anti-patterns of software systems architecture, a canonical example being the idea (now largely dead, thankfully) of having user-level schedulers multiplexed on kernel-level schedulers: with each layer attempting to make scheduling decisions in isolation, the overall system suffered.

    You need to liberate your thinking; dicta like this one reflect a stagnant mind:

    True RAID levels don’t require knowledge of higher-level “applications” (e.g. filesystems or volume managers) for reconstruction.

    Aside from being wrong (this definition wouldn’t even allow for software RAID – of any flavor), this kind of narrowness is debilitating; it’s extraordinarily hard to innovate when one believes that the system is constrained by its definitions, that it’s tautologically finished. And certainly, it’s not at all surprising that you find ZFS so upsetting…

  2. Jeff Darcy on 26 Nov 2005 at 3:08 pm

    What you call a “layering violation” is exactly the innovation in ZFS; the layering between file systems and volume management had become an information divide that resulted in the layers working at cross-purposes.

    Oh, bull. Is that the best you can do, to ignore every specific argument in favor of tilting at such a general strawman? There’s good layering and there’s bad layering. Of course bad layering can be considered an anti-pattern, but if it weren’t for layering done properly we wouldn’t be able to have this conversation because the internet wouldn’t be here. Nobody has shown that the layering between filesystems and volume managers is inherently bad, and many people do pretty well with it. Layering is even alive and well in ZFS, as is quite evident from the diagram on this page. Are you willing, in your zeal to attack me, to cast aspersions on a principle used in ZFS?

    You need to liberate your thinking

    Liberal thinking doesn’t mean accepting every piece of crap you’re told. Is “RAID-Z” recognized as a legitimate RAID level by anyone considered authoritative on the subject, or for that matter anyone but Sun? Is there any realistic chance it ever will be? Sometimes the distinctions others make are useful, even if they’re not convenient for your marketing department. If “RAID Z” qualifies as a RAID level, why not Oracle’s table format or Veritas’s volume-manager data structures? If you say those don’t qualify as well then someone might accuse you of having a “stagnant mind” but if you say they do then you’ve robbed the term of all meaning. If you don’t like my (and the rest of the world’s) definitions or distinctions, propose some of your own.

    Aside from being wrong (this definition wouldn’t even allow for software RAID – of any flavor)

    That definition does not in any way preclude software RAID, and saying so only betrays the shallowness of your understanding. Any RAID implementation, whether hardware or software, needs the information I mentioned to write data at all, and then uses it for recovery. Otherwise none of the RAID hardware that your own company sells would work or provide any value to customers. That’s not the case, as you would have known if you had been interested in educating yourself before spouting off.

    it’s not at all surprising that you find ZFS so upsetting

    I don’t find it upsetting at all; in fact it’s rather exciting. If you replace “upsetting” with “annoying” and “ZFS” with “immature engineers who tout ZFS without understanding it” then your statement might be true, though. Jeff Bonwick came here and answered my points in a way that was both civil and informative. You’re just pissing on the rug in someone else’s virtual home. Maybe if you learn from his example instead of being an anti-ambassador for your product and your company then some day you will be considered worthy of the respect and responsibility that he (or even I) have been granted. As long as you introduce yourself to others, no matter how much you disagree with them, with “hey jackass” that’s unlikely to happen.

  3. Jeff Bonwick on 27 Nov 2005 at 9:00 am

    Jeff,

    You asked whether RAID-Z is a “true” RAID level. If by true you mean “separable from the filesystem”, the answer would clearly be no. But Dave Patterson’s original RAID concept (1988) was decidedly more expansive:

    “Our basic approach will be to break the arrays into reliability groups, with each group having extra ‘check’ disks containing redundant information. When a disk fails we assume that within a short time the failed disk will be replaced and the information will be reconstructed on to the new disk using the redundant information.”

    Note that there’s no mention of mirroring or parity or any other specific scheme here — the requirement is simply to reconstruct the new disk using the contents of the others. RAID-Z, of course, does precisely that.

    Patterson then went on to describe the well-known RAID levels 0-5 as examples of increasing sophistication to illustrate the general concept. They were not intended to be exhaustive. In fact, at the end of the paper he posed a number of open questions, including this one:

    “Can a file system allow different striping policies for different files?”

    This question doesn’t even make sense unless you postulate a richer interface between file systems and storage than the simple block device protocol (which is the level at which RAID is customarily done).

    So I’d say that RAID-Z is not a RAID level in the sense that storage vendors mean today, but it most certainly is a form of RAID in the sense that the man who invented RAID envisioned.

    (Just for completeness, I know Dave, so I’ll ask him for his take on RAID-Z the next time we get together.)

  4. Jeff Darcy on 27 Nov 2005 at 11:59 am

    Thanks again, Jeff. I’m content to disagree on this, so long as the arguments for each side are well presented – as yours have been.

    This question doesn’t even make sense unless you postulate a richer interface between file systems and storage than the simple block device protocol (which is the level at which RAID is customarily done).

    Yes, that’s very true, but that richer interface doesn’t necessarily include traversal of metadata. For example, a single filesystem could span multiple sets of disks each using a different traditional RAID level, answering Dave P’s question in the affirmative without requiring traversal of filesystem metadata or sacrificing the capability for recovery to be done by a relatively “dumb” (but fast) external controller or block-level subsystem on a host. Traditional filesystems have not had this capability, of course, but it would be perfectly reasonable to do so and might even leverage RAID hardware better than RAID-Z does.

    That brings me to yet another point: the ubiquity of RAID hardware. It’s easy to take pot shots at EMC as you have done, and nobody knows that better than someone like me who ended up working there as the result of an acquisition. Everybody knows their gear is overpriced, but treating them as the only alternative to software RAID (of any level) is a bit of the excluded middle. Cheap RAID hardware is everywhere; your own company sells quite a bit. Once someone’s paying for enclosures with hot-swappable drives and redundant power supplies etc., the incremental cost of a low-end RAID controller isn’t that much and in return you get much more than just RAID. You get cache, which can help performance no matter what filesystem you’re using. You get true hot-spare capability, which ZFS still lacks. You get more ports and robust multi-initiator support, both of which are necessary for clustering – and a lot of people use clusters nowadays. Your company even sells a cluster filesystem, and some would say that you missed an important opportunity by not making ZFS a cluster filesystem from the start. Even low-end RAID hardware often supports dual controllers, solving another part of the availability puzzle that ZFS doesn’t touch. Many people even use mirroring (which avoids the write hole entirely) just so they can do split-mirror backups, even though some would argue that’s stupid. In other words, even a modest enterprise can afford a hardware-based RAID solution and typically will do so. Once they have done so, RAID-Z offers very little if any additional protection against data loss.

    Of course, there are plenty of other things that are good about ZFS even in a low-end-RAID environment. The checksumming and self-healing still address data issues that an external array can’t even see. Cheap snapshots/clones provide significant value, especially when they’re available on a per-user basis rather than per-volume – and I don’t mind saying that even though my own company’s marketing department might prefer that I say snapshots are worthless. Then there are the performance and scale manageability advantages that ZFS provides. Personally my favorite is the transactional-update design. I thought it was cool in WAFL, I thought it would have been cool if Tux2 had ever become real, and I think it’s cool that the approach has finally made its way into a general-purpose filesystem. Those are all good and sufficient reasons to tout ZFS as the most advanced local filesystem out there. I just don’t think RAID-Z adds much to all that; touting it as a solution to a problem that only the cheapest of the cheap will ever experience – and then only if a hardware failure occurs during a very narrow timing window – still seems like a marketing move.

    I know Dave, so I’ll ask him for his take on RAID-Z the next time we get together.

    If you do that, tell him I said hi. He might not remember me by name (he meets a lot of people) but he was the one who signed off on my attendance at some of the Tahoe retreats a few years back.

  5. Bryan Cantrill on 27 Nov 2005 at 11:32 pm

    Even low-end RAID hardware often supports dual controllers, solving another part of the availability puzzle that ZFS doesn’t touch. Many people even use mirroring (which avoids the write hole entirely) just so they can do split-mirror backups, even though some would argue that’s stupid. In other words, even a modest enterprise can afford a hardware-based RAID solution and typically will do so. Once they have done so, RAID-Z offers very little if any additional protection against data loss.

    False – and this assertion reflects either a naivete about data-corrupting pathologies, or a desire to deliberately mislead. Hardware-based RAID solves one kind of problem – bit rot – but silently ignores other (more common) pathologies that ZFS detects, like broken firmware, a broken DMA engine, a broken driver, etc. (Or, as the case was for Eric Lowe, a broken power supply.) But I suspect that you either knew this, or this occurred to you as you were writing your screed, as you followed the above paragraph with:

    Of course, there are plenty of other things that are good about ZFS even in a low-end-RAID environment. The checksumming and self-healing still address data issues that an external array can’t even see.

    Yes, of course — so why then make the false claim that “RAID-Z offers very little if any protection against data loss”? Eagerly awaiting your (logorrheic) response…

  6. Jeff Darcy on 28 Nov 2005 at 6:24 am

    Hardware-based RAID solves one kind of problem – bit rot – but silently ignores other (more common) pathologies that ZFS detects

    I was referring specifically to RAID-Z, not to the entire complement of features in ZFS. Even Jeff admitted in his first post here that it was other features in ZFS that really solved this problem. What you try to portray as a false claim about one thing (ZFS) was really a claim about something else (RAID-Z), turning your “rebuttal” into a non sequitur.

    Eagerly awaiting your (logorrheic) response…

    Take that garbage back to the schoolyard, Bryan. That comment barely made it past moderation, and the next one won’t unless you grow up quickly. I pity your coworkers.
