About a year and a half ago, Sun’s ZFS was finally released to widespread and well-deserved acclaim. Being a bit of a contrarian, not to mention a hater of dishonest marketing and astroturf, I posted a series of fairly critical articles about the less praiseworthy aspects of ZFS’s design and presentation. One of my criticisms had to do with the way ZFS absorbs the functionality of a filesystem, volume manager, and RAID controller into one big amorphous blob. That led to quite a conversation on Jeff Bonwick’s blog, and some more snarkiness right here. Now I see that Andrew Morton, of Linux kernel fame, has called ZFS a “rampant layering violation,” and Sun’s marketing department asked Bonwick to respond. Leaving aside the irony of someone so deeply involved in the Linux kernel criticizing someone else’s application of software-engineering principles, I’ll add a little more fuel to the fire.

Let’s skip right past the utterly charming but irrelevant mathematical example and look at how Bonwick presents traditional vs. ZFS storage layering.

You can think of any storage stack as a series of translations from one naming scheme to another — ultimately translating a filename to a disk LBA (logical block address). Typically it looks like this:

filesystem(upper): filename to object (inode)
filesystem(lower): object to volume LBA
volume manager: volume LBA to array LBA
RAID controller: array LBA to disk LBA

The overall ZFS translation stack looks like this:

ZPL: filename to object
DMU: object to DVA (data virtual address)
SPA: DVA to disk LBA
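
To make the comparison concrete, both stacks can be modeled as a pipeline of lookups. This is a toy sketch, not code from any real filesystem: the table contents, the address arithmetic, and the reading of a DVA as a simple (vdev, offset) pair are all invented for illustration.

```python
# Toy model of the two translation stacks above. Every table, constant,
# and bit of address arithmetic here is made up for illustration.

def traditional_lookup(filename):
    inode = {"data.txt": 7}[filename]       # filesystem (upper): filename -> inode
    vlba = {7: 1000}[inode]                 # filesystem (lower): inode -> volume LBA
    alba = vlba + 512                       # volume manager: volume LBA -> array LBA
    disk, dlba = divmod(alba, 1024)         # RAID controller: array LBA -> disk LBA
    return disk, dlba

def zfs_lookup(filename):
    obj = {"data.txt": 7}[filename]         # ZPL: filename -> object
    vdev, offset = {7: (0, 8192)}[obj]      # DMU: object -> DVA, read as (vdev, offset)
    return vdev, offset // 512              # SPA: DVA -> disk LBA
```

Seen this way, collapsing vLBA-to-aLBA-to-dLBA into DVA-to-dLBA saves exactly one trivial lookup step.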

Yippee. So instead of object-to-vLBA-to-aLBA-to-dLBA we get object-to-DVA-to-dLBA. Is that a significant change? No, not really. The vLBA-to-aLBA translation is usually dirt simple, with negligible cost in either code complexity or runtime. In the common case, the total translation from vLBA to dLBA, despite being two layers instead of one, is not much more complex than ZFS’s translation from DVA to dLBA (the DMU has plenty of internal complexity of its own).

So why is it split into two layers? Because of the uncommon case. One of the major purposes of layering, or of any other kind of abstraction, is to provide a well-defined and conceptually lucid point at which one can digress from the usual flow when needed. That doesn’t mean the common cases should take the detour, of course, and good layering implementations don’t make that mistake. Does the layering in ZFS satisfy that purpose? I’ll answer in a moment, but first let’s look at Bonwick’s closing comment.
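
To illustrate that "digression point" argument, here is a sketch with entirely hypothetical classes: a separate vLBA-to-aLBA layer whose common case is a trivial pass-through, but whose boundary lets an uncommon case (say, a volume being migrated block-by-block to a new region of the array) slot in without the filesystem above ever changing.

```python
# Hypothetical sketch of the vLBA->aLBA boundary as a digression point.
# Nothing here is real volume-manager code.

class Volume:
    """Common case: volume LBA passes straight through to array LBA."""
    def map(self, vlba):
        return vlba

class MigratingVolume(Volume):
    """Uncommon case: blocks below a watermark have already been moved
    to a new region of the array; the rest are still in place."""
    def __init__(self, watermark, new_base):
        self.watermark = watermark
        self.new_base = new_base
    def map(self, vlba):
        if vlba < self.watermark:           # already migrated: redirect
            return self.new_base + vlba
        return vlba                         # not yet migrated: old location

def fs_read(volume, vlba):
    # The filesystem code above the boundary is identical either way.
    return volume.map(vlba)
```

The common case pays one virtual call for the identity mapping; the uncommon case never touches the layer above.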

I certainly don’t feel violated. Do you?

Of course you don’t, because you’re the guy building the cathedral. Sun has always been severely afflicted by Not Invented Here syndrome, and this is no exception. The layering in ZFS is not better suited to the problem space; it’s better suited to Sun engineers. To see why, ask this question: what’s a DVA? It’s like any other kind of LBA, but subtly different enough to make a difference. Volume and disk LBAs are well-understood abstractions, used by many people working on many different implementations of volume managers and the things that interface with them. A DVA has specific meaning only to someone toiling within the guts of ZFS. In addition to what I said earlier about reasons for layering, one of the main points behind any kind of modularity is to make it possible for different people with different expertise to work on different pieces in a well-coordinated and productive fashion. Volume and disk LBAs have allowed many people at many companies to do all sorts of different and interesting things. ZFS’s DVAs only enable ZFS core developers to do what ZFS core developers want to do.
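
The contrast can be put in code. An LBA is a bare integer that any layer from any vendor can interpret; a DVA, which in ZFS names a vdev plus an offset among other things, is only resolvable with pool-internal state in hand. The field layout and resolution step below are deliberately simplified, not ZFS’s actual on-disk format.

```python
# Simplified, hypothetical contrast between an LBA and a DVA-like address.
from dataclasses import dataclass

@dataclass(frozen=True)
class DVA:
    vdev: int      # which virtual device inside the pool
    offset: int    # byte offset within that vdev's allocatable space

def resolve(dva, vdev_table):
    """A DVA only turns into a disk LBA given pool-internal vdev state."""
    base_lba = vdev_table[dva.vdev]         # hypothetical per-vdev base address
    return base_lba + dva.offset // 512

lba = 123456                                # self-describing to any layer
dva = DVA(vdev=1, offset=8192)              # meaningless without a vdev table
```

The integer travels across module boundaries on its own; the DVA drags ZFS’s internal pool configuration along with it.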

This brings me to another aspect of the DMU that hasn’t gotten much attention: its absorption not only of volume-manager and RAID functionality but of I/O-scheduling functionality as well. Sun engineers present this as a great stride forward, but – like many of their other “innovations” – whatever benefits it provides are far outweighed by the loss of flexibility, and of opportunities for anyone outside that decision to keep innovating in other directions. I/O scheduling has received a lot of attention in the Linux world lately, with many positive results. What enabled all of this was someone doing the exact opposite of what Sun has done – making I/O scheduling more modular than it was before. That opened up a whole new field for experimentation.

It almost seems that the folks working on ZFS have made a deliberate decision to be different not because they needed to be, but because it creates a barrier to entry for anyone else who might try to take ZFS in an unanticipated direction. For example, one of my earlier criticisms of ZFS was that local filesystems aren’t really all that interesting, and that they missed an opportunity to make a cluster/distributed filesystem. The more coupling they introduce within ZFS, the more alien they make its basic paradigms, the more they tie it to every piece of Solaris or anything else anyone on the team ever worked on, the less likely it becomes that anyone else will take that step. I wouldn’t be at all surprised if the next major ZFS announcement is that they’ve taken it themselves, but their achievement will be somewhat tarnished by the fact that they actively shut others out of that effort.
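
For a sense of what that modularity buys, here is a minimal sketch in the spirit of Linux’s pluggable I/O schedulers: a block queue that accepts interchangeable scheduling policies. The interface and the policies are invented for illustration; real schedulers also handle request merging, fairness, and deadlines.

```python
# Hypothetical sketch of pluggable I/O scheduling. A new policy only has
# to implement order(); the queue and everything above it stay unchanged.

class NoopScheduler:
    def order(self, requests):
        return list(requests)               # FIFO: dispatch in arrival order

class ElevatorScheduler:
    def order(self, requests):
        return sorted(requests)             # sort by sector to reduce seeks

class BlockQueue:
    def __init__(self, scheduler):
        self.scheduler = scheduler          # policy chosen independently
        self.pending = []
    def submit(self, sector):
        self.pending.append(sector)
    def dispatch(self):
        batch = self.scheduler.order(self.pending)
        self.pending = []
        return batch
```

Swapping in a new policy is a one-line change at the boundary; folding the scheduler into the allocator, as the DMU does, removes that boundary entirely.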
Their true commitment to the open exchange of ideas is reflected not only in their obfuscatory design but in their license and patent policies as well, which are all geared toward securing the benefits of open source while taking none of the risk that someone else might beat their own storage division to the punch when it comes to capitalizing on an idea. They’re a cathedral-wolf in bazaar-sheep clothing.