Disk Drive Interfaces

In the course of our recent conversations about ZFS and RAID-Z, one particular statement by Jeff Bonwick touched on a common misconception.

This question doesn’t even make sense unless you postulate a richer interface between file systems and storage than the simple block device protocol (which is the level at which RAID is customarily done).

A lot of people, especially those who have only dealt with disks at the filesystem level or above, seem to have the idea that the interface to disks (and disk arrays and other disk-like hardware) consists solely or almost solely of block reads and writes. That actually hasn’t been the case for a long time. SCSI has always had a plethora of other commands such as Inquiry, Test Unit Ready, Reserve, Release, and so on. All of SCSI’s descendants – including but not limited to iSCSI, Serial Attached SCSI, Fibre Channel, and SBP-2 – have inherited these extra commands. Vendors can do arbitrarily complex things with Mode Sense and Mode Select, Read Buffer and Write Buffer, vendor-specific opcodes, or reads and writes to pseudo-devices. Any of these facilities could be used to implement a completely general kind of RPC, so any interface that you could implement within a single system could also be implemented with an external storage device at one end (not that it would necessarily be a good idea to do so). In Jeff’s example, there are many ways that the filesystem could tell the storage device which section of a volume is using which RAID level or stripe size etc. … if it could pass the necessary information through the operating system’s own block-device interface. That’s where the “read and write and not much more” limitation really lives, and it’s not all that innovative to remove a restriction that the OS put there in the first place.
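To make the "more than reads and writes" point concrete, here is a minimal sketch in Python that only builds command descriptor blocks (CDBs) and never sends them anywhere. The opcode values are the standard SCSI ones; the helper names `inquiry_cdb` and `read10_cdb` are made up for illustration, and actually issuing these through an OS pass-through interface is deliberately left out.

```python
import struct

# Standard SCSI operation codes for the commands mentioned above,
# alongside the plain block read and write.
OPCODES = {
    "TEST UNIT READY": 0x00,
    "INQUIRY": 0x12,
    "MODE SELECT(6)": 0x15,
    "RESERVE(6)": 0x16,
    "RELEASE(6)": 0x17,
    "MODE SENSE(6)": 0x1A,
    "READ(10)": 0x28,
    "WRITE(10)": 0x2A,
    "WRITE BUFFER": 0x3B,
    "READ BUFFER": 0x3C,
}


def inquiry_cdb(allocation_length=96):
    """Build a 6-byte standard INQUIRY CDB (EVPD=0, page code 0)."""
    # opcode, flags, page code, allocation length (16-bit big-endian), control
    return struct.pack(">BBBHB", OPCODES["INQUIRY"], 0, 0, allocation_length, 0)


def read10_cdb(lba, blocks):
    """Build a 10-byte READ(10) CDB for `blocks` blocks starting at `lba`."""
    # opcode, flags, LBA (32-bit), group number, transfer length (16-bit), control
    return struct.pack(">BBIBHB", OPCODES["READ(10)"], 0, lba, 0, blocks, 0)
```

Reads and writes are just two entries in that table. Everything else is already part of the wire protocol, which is why the restriction lives in the host's block layer rather than in the device.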

The advantage of putting extra functionality in the device instead of the host is that the functionality then becomes available to any connected host that knows the right commands, regardless of host hardware or operating system, and even simultaneously for many hosts at once with the device providing all necessary coordination instead of requiring external lock managers and such. The disadvantage, of course, is that if you rely on advanced device functionality too much you’ll be up a creek when you have to work with a device that doesn’t provide it (like a dumb SATA drive). That’s where our old friend modularity comes in. If you design your software right, you should be able to use a more advanced device interface where it exists, and emulate it where it doesn’t, without perturbing any of your code other than a pluggable personality module for the device you’re actually using. GFS actually did exactly this for the locking primitives it requires, and it was a smart choice. Designing toward the lowest common denominator is generally a very good way to stifle innovation, not promote it.
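Here is a rough sketch of what such a pluggable personality module might look like for locking. This is not how GFS implements it; every name here (`LockPersonality`, `pick_personality`, and the device's `lock`/`unlock` methods) is hypothetical, and the emulated version only coordinates callers within a single host.

```python
from abc import ABC, abstractmethod
import threading


class LockPersonality(ABC):
    """The only interface the rest of the code is allowed to see."""

    @abstractmethod
    def acquire(self, lock_id, owner): ...

    @abstractmethod
    def release(self, lock_id, owner): ...


class DeviceLockPersonality(LockPersonality):
    """Delegates to a device offering native locks (hypothetical API)."""

    def __init__(self, device):
        self.device = device

    def acquire(self, lock_id, owner):
        return self.device.lock(lock_id, owner)

    def release(self, lock_id, owner):
        return self.device.unlock(lock_id, owner)


class EmulatedLockPersonality(LockPersonality):
    """Host-side emulation for dumb devices; single-host only."""

    def __init__(self):
        self._owners = {}
        self._mutex = threading.Lock()

    def acquire(self, lock_id, owner):
        with self._mutex:
            # Refuse if somebody else already holds this lock.
            if self._owners.get(lock_id, owner) != owner:
                return False
            self._owners[lock_id] = owner
            return True

    def release(self, lock_id, owner):
        with self._mutex:
            # Only the current holder may release.
            if self._owners.get(lock_id) != owner:
                return False
            del self._owners[lock_id]
            return True


def pick_personality(device):
    """Use the device's own locks where they exist, emulate otherwise."""
    if hasattr(device, "lock") and hasattr(device, "unlock"):
        return DeviceLockPersonality(device)
    return EmulatedLockPersonality()
```

Callers only ever hold a `LockPersonality`, so swapping a smart array for a dumb drive changes which class `pick_personality` returns and nothing else.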

Not So Friendly Reminder

Just in case anyone forgot: this is my blog. I pay for it. I run it. I provide most of the content that draws readers here. Get it? If you want to misrepresent facts or arguments, make exaggerated claims on your own or others’ behalf, or sling insults like a whiny little child who’s not getting enough attention, feel free to do it on your own site. Do it on your employer’s, if they don’t mind you using their resources during working hours to sully their image. Just don’t expect to do it here. My house, my rules.

No, I’m not allowing comments on this post. Just this once, there’s no dialogue – only me telling the slow student in the class how it is. Those who harbor even the slightest doubt about whether I’m referring to them can rest assured that I’m not, and after this I hope we can go back to civil discourse without that noise in our ears.

No More Politics

For a long time, I’ve been aware that this site’s rather schizophrenic nature has been confusing for many people. The largest contingent of readers is primarily technical, and gets turned off by my political writings. The people who come for the political commentary are less numerous, but I’m sure their eyes glaze over when they see the jargon-drenched technical posts. Family and friends who come here for news about me and my family probably click right through to the “family life” section if that’s not where their bookmark points already. Lately the percentage of the site devoted to technical content has increased, and so has my traffic. That’s not a coincidence, I’m sure. I do look through my logs occasionally, and the relationship is pretty clear.

In the interests of serving my “audience” better, I’ve been looking for a way to resolve this conundrum. I’ve thought about maintaining two completely separate blogs, but even when all of the categories are added up I barely post enough for one. Instead, I asked my friend who runs It Affects You if he would be interested in having me do my political writing there, and he has graciously accepted. As a result, those who are interested in what I have to say politically should probably go there. This site will become more technical and personal, though I’m sure I’ll still sneak in the occasional bit of more abstract philosophy. The random scientific or humorous or just generally off-the-wall stuff will also still be here, of course, along with book reviews and recipes and everything else.

I hope this change will help people enjoy both this site and my new political “home away from home” more. It’s always hard to be sure, though, so (as always) feedback is certainly appreciated. Please feel free to leave a comment or send me email to let me know what you think.

RAID-Z Redux

Jeff Bonwick, architect of Sun’s ZFS, has been kind enough to offer some clarification about RAID-Z in a comment to my last post on the subject. I’m not sure at this point whether we actually disagree or are talking past one another, but the sticking point is neatly summed up in the following statement by Jeff.

it’s the transactional semantics that make full-stripe writes safe, regardless of whether it’s RAID-Z or plain old RAID-5.

If the transactional semantics of ZFS make full-stripe writes safe even for RAID-5, then it’s clearly not RAID-Z (which isn’t in the picture) that’s solving the problem. It’s the transactional semantics, including the exclusive use of full-stripe writes, that do so. If RAID-Z is as safe as RAID-5 with ZFS’s transactional behavior, and as unsafe without (as I believe I explained in my last post), then it can hardly be considered a solution to RAID-5’s problems. Jeff’s more detailed explanation makes this equivalence even clearer.

RAID-Z addresses this by using variable stripe width. It treats all the blocks as a matrix, where the disks are columns so that entry (M, N) is the Mth sector of disk N. Space allocation is row-major, but I/O is column-major (so that data is in the clear). In (say) a 4+1 RAID-Z setup, this means that a single-sector write will only touch two disks – one data, one parity. A 3-sector write touches 4 disks – 3 data, 1 parity. A 100-sector write touches all 5 disks, with four disks getting 25 sectors of data each and one disk getting 25 sectors of parity. You might infer that RAID-Z uses more space for very small blocks, but quickly approaches the usual 25% parity overhead (in our 4+1 example) for large blocks. That is correct. I’ll blog about this in considerably more detail next week.
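The arithmetic in that description is easy to check with a few lines of Python. This is only a simplified model of what Jeff describes (single parity, one parity sector per row of data sectors), not the actual ZFS allocator, and the function name `raidz_write_layout` is my own.

```python
import math


def raidz_write_layout(sectors, data_disks=4):
    """Simplified per-write layout for a single-parity, variable-width stripe."""
    # One parity sector protects each row of up to `data_disks` data sectors.
    parity_sectors = math.ceil(sectors / data_disks)
    # A small write does not need to span every data disk.
    data_disks_touched = min(sectors, data_disks)
    return {
        "data_sectors": sectors,
        "parity_sectors": parity_sectors,
        "disks_touched": data_disks_touched + 1,  # plus the parity disk
        "parity_overhead": parity_sectors / sectors,
    }


for n in (1, 3, 100):
    print(n, raidz_write_layout(n))
```

Under this model a 1-sector write touches 2 disks with 100% parity overhead, a 3-sector write touches 4 disks, and a 100-sector write touches all 5 with 25 parity sectors, which matches the numbers in the quote.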

The key point here is that you could apply a very similar technique if you were using RAID-5. However, you’d risk wasting even more space, and doing more I/O to write zeroes to the unused sectors within a stripe. The RAID-Z solution is clearly preferable from those perspectives, but not from that of data integrity. That brings me to another question about RAID-Z, which is the misleading name. RAID-Z might be a useful technique for a filesystem to use, perhaps even a significant innovation, but it’s not a RAID level. That’s an unwarranted attempt, in my opinion, to ride on RAID’s coat-tails because RAID was a truly significant advance in storage technology and is widely recognized as such. In part, I base that statement on something in Jeff’s original blog entry about RAID-Z.

You have to traverse the filesystem metadata to determine the RAID-Z geometry.

True RAID levels don’t require knowledge of higher-level “applications” (e.g. filesystems or volume managers) for reconstruction; that’s what we call a layering violation. All they require is knowledge of which disks are members of the RAID group. In some implementations of some RAID levels one further piece of information (the stripe width) is also needed, but that’s still a far cry from the arbitrarily complex metadata ZFS requires. RAID-Z is inseparable from ZFS and is therefore at ZFS’s semantic/operational level – i.e. not that at which RAID operates.
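For contrast, here is a minimal sketch of parity reconstruction at the block level, in Python. It keeps parity in a fixed last position (more like RAID-4 than RAID-5, since rotation is omitted for brevity), and the function names are mine. The point is that the only inputs are the surviving members of the stripe; nothing above the block layer is consulted.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, missing_index):
    """Rebuild the member at `missing_index` from all the other members."""
    survivors = [b for i, b in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
stripe = make_stripe(data)
assert reconstruct(stripe, 1) == data[1]
```

`reconstruct` needs the group membership and a fixed geometry, and that is all. With RAID-Z, by Jeff's own account, the geometry itself has to be read out of filesystem metadata first.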

The fact that RAID-Z isn’t really a RAID level, or that it doesn’t (in and of itself) close the write hole, doesn’t mean it’s not cool. In fact I think it is cool. As I’ve said before, I’m not questioning the technology but a presentation that still seems as much based on marketing as on technical reality.

Drink Up!

What better way to celebrate Thanksgiving than by touting the health benefits of a popular feast component? Long known to have benefits at the other end of the digestive tract, it seems that cranberries are also good for your teeth. Enjoy.

Amy Report – November

This month we have a full multimedia experience – text, still pictures, and video. I’ll do the text first, then the pictures and video after the break. Amy’s walking has continued to improve, and she’s trying new variants all the time. First it was walking while carrying things, starting with small things then moving up to larger ones (e.g. large-format books) then multiple things. For a while she was carrying around a little basket full of puzzle pieces. She can walk fast, though I wouldn’t quite call it running, and she can walk backwards, and she can walk quite well even on uneven ground. She’s not talking yet, but her comprehension is increasing by leaps and bounds. One of her favorite games is to grab a book, plop down in someone’s lap, and go through it pointing at all the objects. Sometimes she leads, pointing and asking (with an emphatic “dah” sound) for identification. Other times she seems to enjoy being quizzed. “Can you find an umbrella? (point) Good! How about blueberries? (point) Yay!” And so on, often for quite a long time. We can go through several books this way, with her accurately identifying over a hundred objects. Specific words – especially “lips” and “umbrella” or anything that sounds like either – also elicit very specific and amusing behavior. I swear she was even trying to tell a non-verbal joke about umbrellas at dinner time yesterday. She’s also becoming quite vocal, as will be apparent in the video. Between the increased comprehension and vocalization, I’m sure she’ll be talking soon. Maybe she’ll start during our trip to Michigan, just as she did with walking.

Amy definitely has some habits that I find amusing. For a while, she enjoyed taking stuff out of containers. Now it’s putting stuff in containers. We can spend half an hour or more going through various games and puzzles with her dumping the pieces out of a box or bag and then studiously putting them all back in. She also enjoys putting random junk (e.g. leaf bits that get tracked in from raking the yard) into the kitchen trash, or anything left on the bed or floor into the laundry hamper. Retrieving her pajamas from the hamper has become a nightly ritual for Cindy. More recently, Amy has discovered sitting on things other than the floor – on steps and stoops and on chairs just like an adult. Most recent of all is the “category” game, in which she goes through a book or a room pointing at every instance of a category that she can find. Bert from Sesame Street is a common example. Last night it was tomatoes. In the real world it’s usually chairs or tables. Once she pointed to a box downstairs, I said “box” and she immediately went up the stairs and all the way across the top floor to the master bathroom to point at a stack of diaper boxes. Another time she went from a hat in a book all the way down the stairs into the entranceway to point at some hats on top of a coat rack – showing a keen grasp not only of categories but of the correspondence between pictures and real things. She also seems to know the difference between planes (which she loves) and helicopters (which she finds uninteresting). There’s clearly a lot going on in there.

Oil Profiteers

This is now exactly a week old. I’ve been thinking of moving my political writing somewhere else so this site can focus more on techie stuff, and this was to be the first article at the new venue, but I haven’t heard back from the proprietor of that venue and I didn’t want this to wait forever.

To nobody’s surprise, Jeff Jacoby takes issue with the idea that the recent record oil-company profits are obscene. Equally unsurprising is the dishonest way in which he presents his case to the contrary. His columns are always a grab bag of cheap rhetorical tricks, devoid of substance, and this one is no exception. What follows is the fisking he so thoroughly deserves.

No More Mr. Nice Guy

OK, I’ve had it with the ZFS crew. Believe it or not, I’ve tried to be nice so far. I’ve taken pains to point out that I respect the excellent technical accomplishment that ZFS represents, and mostly wish that the marketing hype could be matched with good-faith technical exposition. That’s not happening. There are dozens of Sun employees making a concerted effort to flood the blogosphere with effusive praise for ZFS, mostly parroting the same empty “last word in filesystems” hype. To be quite blunt it’s starting to smell a lot like astroturf, and that’s something I really hate. I also really hate it when people show disrespect for their peers, as I believe Bryan Cantrill exemplifies.

there is no other conclusion left to be had: ZFS is the most important revolution in storage software in two decades — and may be the most important idea since the filesystem itself.

I’m sorry, but there’s no other word for that but bullshit. RAID was a great innovation, on whose coattails ZFS attempts to ride with “RAID-Z” even though it’s in a whole different conceptual space than the standard RAID levels. Volume managers have been around for years, and ZFS embeds one; likewise for journaled and atomic-update filesystems, reflected in ZFS’s intent log. ZFS’s pooling and “vertical integration” aren’t all that new either; GFS did many of the same things, earlier, for a whole cluster. (Does the GFS implementation match ZFS’s? Perhaps not, but they haven’t had the resources that Sun has devoted to ZFS either. The important thing is that they represented the same ideas.) All of these were real innovations, not something new provided for us in ZFS through the brilliance of Sun engineers alone. ZFS might be the best synthesis ever of these ideas plus some that really are new, but … most important revolution in two decades? Not even close.

The last thing I really hate is when people claim X solves Y, but the explanation of how turns out to be complete baloney (adjective form: balonious). I’ll get to that below the fold, but first a disclaimer. I work for a company producing storage-related functionality (continuous data protection) that I’m sure the ZFS crew would claim their baby makes obsolete. I admit that I’m not an entirely disinterested party, but I’m no more of an interested party than the ZFS folks themselves. Besides, they’re wrong. No number of snapshots that have to be planned and performed ahead of time is the same as the ability to restore the state of a volume at the point that you only know in retrospect immediately preceded a fault or corruption event.
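A toy model in Python shows the difference. It is not any real product's design, just a write journal with made-up names (`JournaledVolume`, `latest_snapshot_before`) and integer timestamps, assuming writes arrive in time order.

```python
class JournaledVolume:
    """Toy continuous-data-protection model: every write is journaled."""

    def __init__(self):
        self.journal = []  # (timestamp, block_number, data), in time order

    def write(self, timestamp, block, data):
        self.journal.append((timestamp, block, data))

    def state_at(self, timestamp):
        """Volume contents just after the last write at or before `timestamp`."""
        state = {}
        for t, block, data in self.journal:
            if t > timestamp:
                break
            state[block] = data
        return state


def latest_snapshot_before(snapshot_times, timestamp):
    """The best a snapshot scheme can offer: the last planned point in time."""
    earlier = [t for t in snapshot_times if t <= timestamp]
    return max(earlier) if earlier else None


vol = JournaledVolume()
vol.write(1, 0, "a")
vol.write(5, 0, "b")
vol.write(7, 0, "corrupt")

# The corruption at t=7 is discovered later; the last good instant is t=6.
print(vol.state_at(6))
# With snapshots taken at t=0 and t=4, the closest restore point is t=4.
print(vol.state_at(latest_snapshot_before([0, 4], 6)))
```

The journal can return the volume as it stood at t=6, including the write at t=5. Falling back to the t=4 snapshot loses that write, and no snapshot schedule chosen in advance can guarantee otherwise.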

ZFS Released

Yes, that’s right, folks. As Wes Felter accurately puts it, “After years of hype, Sun released ZFS.” The first thing I noticed, perusing these pages, is that the people working on it can’t seem to get their stories straight. The front page says this…

the best part – no need for NVRAM in hardware. ZFS loves cheap disks.

…but Neil Perrin says this…

There’s also more work to do. For example, using nvram/solid state disks for the log would turbo-ise it.

Hmmm. Similarly, Bill Moore writes…

A product is only as good as its test suite

…and yet, the FAQ actually includes a section on “What can I do if ZFS panics on every boot?” Panics on every boot? Don’t you think that case could have been covered by the all-singing all-dancing test suite? That’s what most people would consider a show-stopper, but apparently not the intensely quality-oriented ZFS team. Enough of the sniping, though; let’s move on to more substantial questions.

With Us Or Against Us

Some of the rhetoric being used by those who favor retaining US control of the internet’s DNS (Domain Name System) is pretty sickening. On Tuesday I heard a story about this on NPR, and the Official Line seems to be that giving other nations control over their own DNS entries is a measure being pushed by “countries like Iran and Saudi Arabia” who “don’t have the same attitudes about free speech” that we do. The implication is supposed to be that the change away from an ICANN monopoly is only supported by people who don’t care about repressive regimes and human rights. Well…WRONG. It is also supported by many countries who’ve shown a lot more respect for human rights than the good old US of A has lately – countries that don’t run a network of secret detention centers or send prisoners to countries they know will use torture. There are some semi-valid arguments to be made for letting the US retain control over top-level domains, but this “like Iran and Saudi Arabia” strawman is not among them. It is possible to oppose both having the US Commerce Department control a global resource and having “countries like Iran” deny DNS access to their citizens for political reasons.