Canned Platypus

Making the world better, one byte at a time.

Nov 19

No More Mr. Nice Guy

OK, I’ve had it with the ZFS crew. Believe it or not, I’ve tried to be nice so far. I’ve taken pains to point out that I respect the excellent technical accomplishment that ZFS represents, and mostly wish that the marketing hype could be matched with good-faith technical exposition. That’s not happening. There are dozens of Sun employees making a concerted effort to flood the blogosphere with effusive praise for ZFS, mostly parroting the same empty “last word in filesystems” hype. To be quite blunt, it’s starting to smell a lot like astroturf, and that’s something I really hate. I also really hate it when people show disrespect for their peers, as I believe Bryan Cantrill exemplifies.

there is no other conclusion left to be had: ZFS is the most important revolution in storage software in two decades — and may be the most important idea since the filesystem itself.

I’m sorry, but there’s no other word for that but bullshit. RAID was a great innovation, on whose coattails ZFS attempts to ride with “RAID-Z” even though it’s in a whole different conceptual space than the standard RAID levels. Volume managers have been around for years, and ZFS embeds one; likewise for journaled and atomic-update filesystems, reflected in ZFS’s intent log. ZFS’s pooling and “vertical integration” aren’t all that new either; GFS did many of the same things, earlier, for a whole cluster. (Does the GFS implementation match ZFS’s? Perhaps not, but they haven’t had the resources that Sun has devoted to ZFS either. The important thing is that they represented the same ideas.) All of these were real innovations, not something new provided for us in ZFS through the brilliance of Sun engineers alone. ZFS might be the best synthesis ever of these ideas plus some that really are new, but … most important revolution in two decades? Not even close.

The last thing I really hate is when people claim X solves Y, but the explanation of how turns out to be complete baloney (adjective form: balonious). I’ll get to that below the fold, but first a disclaimer. I work for a company producing storage-related functionality (continuous data protection) that I’m sure the ZFS crew would claim their baby makes obsolete. I admit that I’m not an entirely disinterested party, but I’m no less disinterested than the ZFS folks themselves. Besides, they’re wrong. No number of snapshots that have to be planned and performed ahead of time is the same as the ability to restore the state of a volume at the point that you only know in retrospect immediately preceded a fault or corruption event.

I’ll deal with the less balonious example first. Matthew Ahrens has posted a good explanation of how snapshots work, neatly answering my questions about how the system knows when it can fully release a block. Kudos to him for doing that. The design actually looks very elegant. Instead of reference counts, they use a “birth time” for each block and a “dead list” associated with each “generation” (my term) of a filesystem between snapshots, and that’s going to be much more efficient. Of course, there’s a little finagling that has to happen regarding the time before the first snapshot or after the last one, but that’s no big deal. More importantly, if snapshots are to fit into ZFS’s overall reliability picture then birth times and dead lists have to be current on disk, not just in memory. That’s not going to come for free. Maybe the cost disappears in light of performance gains realized elsewhere in ZFS, but it’s still there. Similarly, walking dead lists when a snapshot is deleted could involve non-trivial work if the lists are long. Basically what ZFS has done is amortize some of the cost of snapshot creation over longer periods of normal operation, with the remainder pushed into deletion instead. That’s actually a really smart thing to do, but it’s not quite the same as making the cost go away and that’s what the hype suggests.
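
To make the bookkeeping concrete, here is a minimal sketch of the scheme as I read Ahrens's description. It is not ZFS code: `ToyFS`, its method names, and the integer transaction counter are all made up for illustration, and everything lives in memory, so it says nothing about the on-disk cost discussed above.

```python
class ToyFS:
    """Toy model (hypothetical): block names are strings, txg is a counter."""

    def __init__(self):
        self.txg = 1          # current transaction group
        self.birth = {}       # live block -> txg in which it was written
        self.live_dead = {}   # dead list of the live filesystem: block -> birth
        self.snapshots = []   # oldest first; each is {"txg": int, "dead": dict}
        self.freed = []       # blocks actually returned to the allocator

    def write(self, block):
        self.birth[block] = self.txg

    def snapshot(self):
        # The snapshot inherits the current dead list; the live
        # filesystem starts a fresh one for the next generation.
        self.snapshots.append({"txg": self.txg, "dead": self.live_dead})
        self.live_dead = {}
        self.txg += 1

    def kill(self, block):
        # The live filesystem no longer references this block.
        born = self.birth.pop(block)
        last_snap = self.snapshots[-1]["txg"] if self.snapshots else 0
        if born > last_snap:
            self.freed.append(block)       # born after the last snapshot
        else:
            self.live_dead[block] = born   # a snapshot may still need it

    def delete_snapshot(self, index):
        snap = self.snapshots.pop(index)
        prev_txg = self.snapshots[index - 1]["txg"] if index > 0 else 0
        if index < len(self.snapshots):
            nxt = self.snapshots[index]["dead"]
        else:
            nxt = self.live_dead
        # Walk the next generation's dead list: anything born after the
        # previous snapshot was held only by the snapshot being deleted.
        for block, born in list(nxt.items()):
            if born > prev_txg:
                del nxt[block]
                self.freed.append(block)
        # Older dead blocks now hang off the next generation instead.
        nxt.update(snap["dead"])


fs = ToyFS()
fs.write("a")
fs.snapshot()
fs.write("b")
fs.kill("a")            # still held by the snapshot: goes on the dead list
fs.kill("b")            # born after the snapshot: freed immediately
fs.delete_snapshot(0)   # walking the dead list finally frees "a"
```

Note where the work lands: `kill` is a single comparison, while `delete_snapshot` has to walk a list whose length depends on how much changed since the snapshot was taken.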

Now, on to the serious baloney. Jeff Bonwick claims that “RAID-Z” fixes the “write hole” we discussed last time. His “solution” to the write hole centers on the following.

Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write.

The first question that comes to mind is: whose blocksize? In my experience, people who have spent all of their time on the host side of the I/O equation – as seems to be the case for Bonwick – are often sloppy in distinguishing between what the filesystem thinks is a block and what the hardware thinks is a block. People who’ve spent time on the storage-subsystem side are usually more careful, many using “sector” to mean the hardware concept (even though an actual disk-drive person might say that’s often incorrect too) and “block” to mean the software concept. Since most hardware does not support variable-sized blocks beyond format time, Bonwick is obviously referring to the software entity … but the hardware doesn’t know or care about his block size. It makes no guarantee that anything bigger than a sector will be updated (or not) atomically, so his “full stripe” write might be partially complete after all. It’s not even unheard of for a device to write only part of a sector, though they’re not supposed to. Defining your own filesystem blocksize might be convenient conceptually, but it’s pretty meaningless with respect to fixing the RAID-5 write hole. It gets worse, though. Bonwick makes another revealing comment as well.

Where a full-stripe write can simply issue all the writes ascynhronously [sic], a partial-stripe write must do synchronous reads before it can even start the writes.

Note the reference to “all the writes” even for a full stripe. If the user only wrote part of that stripe, either the remainder must have been read in anyway (incurring the same read-modify-write penalty as Bonwick attributes to RAID) or a whole stripe is being written but only part of it is considered meaningful (wasting space). Don’t worry, vendors will always be willing to sell you more disk. As Bonwick himself says, there’s no problem your wallet can’t solve. More to the point, what happens when you issue writes (plural) for a single supposedly contiguous area? Why, they might get reordered, they might succeed or fail independently instead of all together, etc. In short, there are now even more failure cases to account for when trying to deal with the write hole. The only reason you’d even consider doing such a thing would be if multiple discontiguous parts of the stripe were filled with user data and the rest weren’t – i.e. the second alternative above. What it sounds like – and no, I haven’t looked at the code yet to verify – is that ZFS wastes space doing fake full-stripe writes (fake because they really only contain partial data) and in the process makes the write-hole problem worse instead of better. If this is the solution, I think I’ll keep the problem.
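
For anyone who wants to see the "succeed or fail independently" case spelled out, here is a small sketch using plain XOR parity over one stripe. It assumes a 4+1 layout and tiny four-byte "sectors", and `xor_parity` is a helper name I made up; it models no particular RAID implementation, only the arithmetic.

```python
from functools import reduce


def xor_parity(sectors):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*sectors))


# One stripe: four data sectors plus one parity sector.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_parity(data)

# While data and parity agree, any single lost sector can be rebuilt.
assert xor_parity([data[0], data[1], data[2], parity]) == data[3]

# Now rewrite the stripe as five independent writes and lose power
# after only the first two have reached the platters.
new = [b"wxyz", b"1234", b"5678", b"90ab"]
on_disk = [new[0], new[1], data[2], data[3]]
on_disk_parity = parity   # the parity write never landed

consistent = xor_parity(on_disk) == on_disk_parity

# If the fourth disk now fails, reconstruction yields garbage
# instead of the old contents it was still holding.
rebuilt = xor_parity([on_disk[0], on_disk[1], on_disk[2], on_disk_parity])
```

Here `consistent` ends up false and `rebuilt` no longer matches `data[3]`, even though that sector was never touched by a completed write. Whether that matters depends entirely on whether anything still considers those sectors live.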

In the end, it doesn’t look like “RAID-Z” really does squat to make the write hole go away. The real magic is in the way ZFS structures its metadata as a tree and does atomic updates by working its way up that tree (presaged by WAFL), and the way it verifies checksums along the way. That’s the really cool stuff. That’s what Bonwick should be presenting as the key to data integrity in ZFS. Maybe while he’s at it he can explain why ZFS still needs an intent log if both “RAID-Z” and atomic updates supposedly solve the problem of integrity for asynchronously-written data, or why both Neil Perrin and Neelakanth Nadgir make a point of mentioning how the supposedly-unnecessary NVRAM would make intent-log operation much faster (or why that should matter). Something there just doesn’t quite add up, but I’m getting used to that.
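
Here is roughly what I mean by working up the tree, as a sketch rather than anything resembling ZFS's actual layout. The assumptions are mine: blocks are addressed by the SHA-256 of their contents (so a parent's pointer doubles as the child's checksum), nodes are JSON, and `store`, `put`, `get`, `update`, and `read` are hypothetical names.

```python
import hashlib
import json

store = {}   # stand-in for the disk: address -> bytes, never overwritten


def put(obj):
    """Write a new block and return its address (here, its checksum)."""
    raw = json.dumps(obj, sort_keys=True).encode()
    addr = hashlib.sha256(raw).hexdigest()
    store[addr] = raw
    return addr


def get(addr):
    """Read a block, verifying it against the checksum its parent holds."""
    raw = store[addr]
    if hashlib.sha256(raw).hexdigest() != addr:
        raise IOError("checksum mismatch")
    return json.loads(raw)


def update(root, path, value):
    """Return the address of a new root with value stored at path.

    Every node on the path is rewritten to a new location; the old
    tree is left completely intact.
    """
    if not path:
        return put(value)
    node = dict(get(root)) if root else {}
    node[path[0]] = update(node.get(path[0]), path[1:], value)
    return put(node)


def read(root, path):
    node = get(root)
    for name in path:
        node = get(node[name])
    return node


root_v1 = update(None, ["etc", "motd"], "hello")
root_v2 = update(root_v1, ["etc", "motd"], "goodbye")
assert read(root_v1, ["etc", "motd"]) == "hello"
assert read(root_v2, ["etc", "motd"]) == "goodbye"
```

The only step that has to be atomic is switching from `root_v1` to `root_v2`. A crash at any earlier point leaves some unreferenced blocks behind and the old tree fully valid, and none of that depends on how the blocks are striped underneath.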

Comments

  1. Hey Jackass: why don’t you download ZFS and kick the tires a bit before jabbering on about what is and what is not bullshit. The revolutionary bit of ZFS is not RAID-Z — it’s the fact that it has obliterated the distinction between volume management and the filesystem. (ZFS does not merely “embed a volume manager” as you claim; as long as you think of it that way, you won’t understand what ZFS has done.) Ultimately, the ease of management, the performance and the reliability stem from that decision; technologies like RAID-Z are just gravy. Given the fact that you’re hawking a related technology, your complaints remind me of a DTrace t-shirt that someone in the NYC Sun office made: “DTrace: Recommended by ZERO competitors.”

  2. Wow. I thought someone might complain that my remarks were a bit unprofessional, but … well, everyone can see who’s really being unprofessional here. I don’t need to use a product to evaluate claims that fly in the face of basic facts about how the world works. You can’t exceed the speed of light, you can’t boil the ocean, and you can’t fix the write hole the way “RAID-Z” claims to. All anyone needs to know in order to reach that conclusion is what the write hole is and how disk drives work. You and Jeff Bonwick might try to BS your way to some other conclusion so you can get your patents or whatever, but it will only fool the ignorant.

    ZFS does not merely “embed a volume manager” as you claim; as long as you think of it that way, you won’t understand what ZFS has done.

    Why don’t you explain how it’s different than embedding, then? Is it because there’s no longer any clean separation or layering of the two abstractions? If so, then that’s a step backwards. Why does zvol even exist if there’s no longer any need or use for plain block-level service?

    Ultimately, the ease of management, the performance and the reliability stem from that decision

    How can they, when the technology underlying most of those features could be implemented without “pooling” or whatever other trademark you apply to embedding volume-manager functionality? Answer: they can’t. The ease of management, performance, and reliability are owed to other, better decisions. As I’ve said over and over again, there’s plenty of cool technology in ZFS. I don’t for a moment dispute that. The point here is that “RAID-Z” isn’t an example of that. It’s just smoke and mirrors, trying to spread FUD about being vulnerable to horrible data corruption unless you either spend millions of dollars or use ZFS. It’s a marketing idea, not a technical one.

    Given the fact that you’re hawking a related technology

    Mentioning where I work is not hawking related technology (especially when it’s not related). It’s a little thing called full disclosure, which your fellow astroturfer would do well to learn. I felt that I should disclose the precise extent of my interest, even if that extent is pretty small.

  3. Jeff,

    RAID-Z does in fact solve the write hole. I attempted to explain this in my blog entry, but perhaps a little more detail would help. Let’s walk through it.

    First, you’re correct that the solution assumes a transactional filesystem above it. As you surmise, it’s the transactional semantics that make full-stripe writes safe, regardless of whether it’s RAID-Z or plain old RAID-5. Assuming that we’re careful in how we update the root of the tree (and we are), this means that ZFS is free of the write hole whenever it does full-stripe writes exclusively. I think we’re on the same page up to this point.

    Partial-stripe writes pose a problem, however. Let’s say I’ve got a 4+1 RAID-5 configuration with disks A, B, C, D, E. In transaction group 37, I need to store one sector, let’s say on disk C. The corresponding parity is on disk E. If I update C before E, and lose power in between, then E is no longer the XOR of A-D. This means that in addition to failing to complete the write of C, I have now in effect corrupted A, B, and D — because if (say) disk D fails, its contents cannot be correctly reconstructed from A, B, C, and E (because C and E are inconsistent). The same problem exists if I update E before C, so write ordering tricks won’t help. This is where NVRAM comes in.

    Now: the key difference between a full-stripe write and a partial-stripe write, as far as the write hole is concerned, is that none of the blocks in a full-stripe write are live until the transaction group commits. So it doesn’t matter if you lose power and the blocks are inconsistent, because they were free anyway. But in a partial-stripe write, sector A may contain live data from transaction group 36, which was already committed. The act of updating any disk in the stripe that contains A (at the same offset) means that A itself becomes vulnerable to power loss. In a transactional system, that’s unacceptable.

    Of course, the only reason to do partial-stripe writes at all is because you have to — because the RAID-5 stripe width is fixed.

    RAID-Z addresses this by using variable stripe width. It treats all the blocks as a matrix, where the disks are columns so that entry (M, N) is the Mth sector of disk N. Space allocation is row-major, but I/O is column-major (so that data is in the clear). In (say) a 4+1 RAID-Z setup, this means that a single-sector write will only touch two disks — one data, one parity. A 3-sector write touches 4 disks — 3 data, 1 parity. A 100-sector write touches all 5 disks, with four disks getting 25 sectors of data each and one disk getting 25 sectors of parity. You might infer that RAID-Z uses more space for very small blocks, but quickly approaches the usual 25% parity overhead (in our 4+1 example) for large blocks. That is correct. I’ll blog about this in considerably more detail next week.

    The essential point is that there are no partial-stripe writes, ever. All RAID-Z writes are full-stripe writes, and therefore they don’t affect preexisting live data. Given the transaction nature of ZFS, that means there’s no write hole.

    If you find this explanation lacking, please let me know what exactly it is that I’m failing to communicate so I can be more clear in the future.

    Jeff

    PS: Regarding your comments about the intent log: that’s a different thing altogether. It’s just an optimization to make O_DSYNC writes go fast. It has nothing to do with RAID-Z or the transactional integrity of ZFS. You can disable the intent log entirely (set zil_disable=1) and everything still works — RAID-Z, yanking the power cord under load, etc. And I agree that we need to explain this better.

  4. Thanks for stopping by, Jeff; I appreciate the time you’ve taken to reply. I’ve addressed the major part of your response in a new post, but I’m also curious about the last paragraph.

    It’s just an optimization to make O_DSYNC writes go fast. It has nothing to do with RAID-Z or the transactional integrity of ZFS. You can disable the intent log entirely (set zil_disable=1) and everything still works

    What are the performance implications of disabling the intent log? Is it just bad, or is it really bad? If the latter, I’d say that “optimization” might be a bit misleading. Something that exists to overcome a severe enough performance deficit in a common enough case, to the extent that without that thing the whole could not be considered viable, might be considered integral to that whole. Most people consider “optimization” to mean something that improves upon a reasonable baseline for certain cases, not something that is necessary to reach that baseline.

  5. Frederick.Zeng Says: August 6th, 2010 at 2:07 am

    To Jeff Bonwick,

    If Dynamic Stripe Width means Variable Stripe Width, then every read or write request may touch only some of the vdevs; is that right?

    http://blogs.sun.com/roch/entry/when_to_and_not_to
    “A N-way RAID-Z group achieves it’s protection by spreading a ZFS block onto the N underlying devices.”
    What does “ZFS block” mean here?
    Is a full Stripe Size = Stripe Depth X Stripe Width? If so, then what is the Stripe Depth? Is it equal to recordsize? (By default, sector size = 512 bytes, and recordsize = multiple sectors.) If so, when Stripe Width < N, a ZFS block may be spread onto fewer than N underlying devices; is that right?

    http://www.solarisinternals.com/wiki/index.php/ZFS_for_Databases
    "RAIDZ caveat – Since RAIDZ reads all of its underlying elements for every read request, it is generally not advised for use with databases. Databases tend to rely heavily on efficient random reads, and RAIDZ is a bad match for that requirement."
    Similarly, if Stripe Width < N, why must RAIDZ read all of its underlying elements for every read request?

    Please refer to page 32 of "ZFS Tutorial USENIX LISA09 Conference".
    http://www.slideshare.net/relling/zfs-tutorial-usenix-lisa09-conference
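
The variable-stripe arithmetic from comment 3 can be sketched as follows. This is only a model of the numbers quoted there, under the assumption that each row of up to four data sectors gets exactly one parity sector; `raidz_write` is a hypothetical name, and any padding or alignment rules in the real allocator are ignored.

```python
import math


def raidz_write(sectors, disks=5, parity=1):
    """Return (disks touched, parity sectors, parity/data ratio)
    for a write of the given number of data sectors."""
    data_disks = disks - parity
    rows = math.ceil(sectors / data_disks)       # stripe rows needed
    parity_sectors = rows * parity               # one parity sector per row
    disks_touched = min(sectors, data_disks) + parity
    return disks_touched, parity_sectors, parity_sectors / sectors


for n in (1, 3, 100):
    print(n, raidz_write(n))
```

For 1, 3, and 100 sectors this gives 2, 4, and 5 disks touched, matching the figures in the comment, with the parity-to-data ratio falling from 100% for a single sector to 25% for the 100-sector write.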
