In the course of our recent conversations about ZFS and RAID-Z, one particular statement by Jeff Bonwick touched on a common misconception.

This question doesn't even make sense unless you postulate a richer interface between file systems and storage than the simple block device protocol (which is the level at which RAID is customarily done).

A lot of people, especially those who have only dealt with disks at the filesystem level or above, seem to have the idea that the interface to disks (and disk arrays and other disk-like hardware) consists solely or almost solely of block reads and writes. That actually hasn't been the case for a long time. SCSI has always had a plethora of other commands such as Inquiry, Test Unit Ready, Reserve, Release, and so on. All of SCSI's descendants – including but not limited to iSCSI, Serial Attached SCSI, Fibre Channel, and SBP-2 – have inherited these extra commands. Vendors can do arbitrarily complex things with Mode Sense and Mode Select, Read Buffer and Write Buffer, vendor-specific opcodes, or reads and writes to pseudo-devices. Any of these facilities could be used to implement a completely general kind of RPC, so any interface that you could implement within a single system could also be implemented with an external storage device at one end (not that it would necessarily be a good idea to do so).

In Jeff's example, there are many ways that the filesystem could tell the storage device which section of a volume is using which RAID level, stripe size, and so on – if only it could pass the necessary information through the operating system's own block-device interface. That's where the "read and write and not much more" limitation really lives, and it's not all that innovative to remove a restriction that the OS put there in the first place.
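To make the distinction concrete, here is a minimal Python sketch – all names hypothetical, not any real driver API – contrasting the bare read/write block interface with an extended one that can carry per-region layout hints such as RAID level or stripe size, the moral equivalent of a Mode Select or vendor-specific opcode:

```python
class BlockDevice:
    """Models the minimal interface the OS block layer exposes:
    read and write, and not much more."""
    def __init__(self, nblocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * nblocks

    def read(self, lba):
        return self.blocks[lba]

    def write(self, lba, data):
        assert len(data) == self.block_size
        self.blocks[lba] = data


class ExtendedBlockDevice(BlockDevice):
    """Models a device that also accepts out-of-band commands, so a
    filesystem can describe how each region of a volume is laid out.
    set_region_layout is a made-up command, standing in for whatever
    a real host and device would agree on."""
    def __init__(self, nblocks, block_size=512):
        super().__init__(nblocks, block_size)
        self.region_layout = {}  # (start_lba, end_lba) -> layout hints

    def set_region_layout(self, start_lba, end_lba, raid_level, stripe_size):
        # A general "RPC over the storage protocol": the host tells the
        # device which RAID level and stripe size a region should use.
        self.region_layout[(start_lba, end_lba)] = {
            "raid_level": raid_level,
            "stripe_size": stripe_size,
        }


dev = ExtendedBlockDevice(nblocks=1024)
dev.set_region_layout(0, 511, raid_level="RAID-Z", stripe_size=128 * 1024)
dev.write(0, b"\x01" * 512)
print(dev.region_layout[(0, 511)]["raid_level"])  # RAID-Z
```

The point is not the specific command, but that nothing in the wire protocols forbids it; the narrow interface is the one the OS presents to the filesystem.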

The advantage of putting extra functionality in the device instead of the host is that the functionality then becomes available to any connected host that knows the right commands, regardless of host hardware or operating system, and even simultaneously for many hosts at once with the device providing all necessary coordination instead of requiring external lock managers and such. The disadvantage, of course, is that if you rely on advanced device functionality too much you'll be up a creek when you have to work with a device that doesn't provide it (like a dumb SATA drive). That's where our old friend modularity comes in. If you designed your software right you should be able to use a more advanced device interface where it exists, and emulate it where it doesn't, without perturbation to any of your code other than a pluggable personality module for the device you're actually using. GFS actually did exactly this for the locking primitives it requires, and it was a smart choice. Designing toward the lowest common denominator is generally a very good way to stifle innovation, not promote it.
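One way to structure that "pluggable personality module" idea can be sketched in Python (all names hypothetical, not GFS's actual code): callers program against a single locking interface, and a factory picks either the device's native primitive or a host-side emulation:

```python
import threading

class DeviceLock:
    """Abstract locking personality; callers never see which one they got."""
    def acquire(self, name): raise NotImplementedError
    def release(self, name): raise NotImplementedError

class NativeDeviceLock(DeviceLock):
    """Personality for a device that coordinates locks itself via special
    commands, so many hosts can share it with no external lock manager.
    send_lock_command is a stand-in for the real command set."""
    def __init__(self, device):
        self.device = device
    def acquire(self, name):
        self.device.send_lock_command("acquire", name)
    def release(self, name):
        self.device.send_lock_command("release", name)

class EmulatedDeviceLock(DeviceLock):
    """Fallback personality for a dumb device: emulate the same interface
    on the host. A local mutex suffices for one host; a real multi-host
    setup would delegate to an external lock manager instead."""
    def __init__(self):
        self.locks = {}
        self.guard = threading.Lock()
    def acquire(self, name):
        with self.guard:
            lock = self.locks.setdefault(name, threading.Lock())
        lock.acquire()
    def release(self, name):
        self.locks[name].release()

def lock_personality_for(device):
    # Probe once at setup time; the rest of the code never branches
    # on what kind of device it is talking to.
    if hasattr(device, "send_lock_command"):
        return NativeDeviceLock(device)
    return EmulatedDeviceLock()

locks = lock_personality_for(object())  # a "dumb SATA drive" stand-in
locks.acquire("superblock")
locks.release("superblock")
print(type(locks).__name__)  # EmulatedDeviceLock
```

Everything above the factory is the code that stays put; only the personality chosen at the bottom varies with the hardware.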