In the course of finding information about Maui yesterday, I came across the following quote from EMC’s Chuck Hollis.

Presenting storage as blocks (e.g. LUNs) won’t scale. Presenting storage as files won’t scale. You’ll need an object-oriented approach with rich sematics [sic] — nothing else will work at this uber-massive scale.

Bull. Yes, it’s that simple – what Chuck says is not true. To see why, we first have to deal with some ambiguity. Who is presenting storage to whom? At the user level, presenting storage as files must scale because that’s the only interface many users will accept. That doesn’t mean you have to support full POSIX file semantics; in fact you don’t have to support anything but procedural interfaces for whole-file read/write and listing/enumeration. If you don’t provide such interfaces within something that looks basically like a filesystem namespace, though, then your goose is cooked. You can call them something other than files, you can add extra interfaces, but none of that matters. To make a long intro short, Chuck’s statements are so absurd at the user-presentation level that it’s impossible to believe that’s what he meant. What he must be referring to instead is presenting blocks/files/objects for use by other components within the system. Let’s take another look in that context.
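
To make that concrete, here is a minimal sketch (in Python, with hypothetical names) of the kind of user-level interface I mean: whole-file read/write plus enumeration inside a filesystem-like namespace, and nothing else. It is an illustration of the shape of the interface, not anyone’s actual API.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class FileStore(ABC):
    """Hypothetical user-facing interface: whole-file operations plus
    enumeration within a filesystem-like namespace. No POSIX locking,
    byte-range I/O, or other "rich" semantics are required."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the entire contents of the file at `path`."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        """Replace the entire contents of the file at `path`."""

    @abstractmethod
    def list(self, prefix: str) -> Iterator[str]:
        """Enumerate paths under `prefix`, like listing a directory."""
```

Anything that offers these three operations under a hierarchical namespace will look enough like “files” to keep users happy, whatever you call it underneath.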

Files don’t scale, eh? Tell it to the people who make “files upon files” parallel filesystems like Lustre and PVFS. They deal with multi-petabyte installations, millions of files, and higher I/O and metadata-operation rates than anybody, using local files on each I/O node to store pieces of user files. That approach scales just fine. What’s that, you say? It’s a different environment? No, not really. The big issues in cloud storage are geographic latency and reliability in the face of network failures. A chunk of data gets from one side of the world to the other quickly, slowly, or not at all regardless of whether it’s being addressed/tracked as part of a LUN or a file or an object. Metadata operations actually become harder to distribute effectively as they increase in complexity, with blocks offering the simplest interface and objects the most complex. When it comes to crossing the divide from local access to remote, anything objects can do files can do better, and blocks better still.
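
For illustration, here is a simplified sketch of that “files upon files” idea: one user file striped into fixed-size chunks, each chunk stored as an ordinary local file on an I/O node. The chunk size, naming scheme, and round-robin placement are assumptions made for the example, not how Lustre or PVFS actually do it.

```python
import os

CHUNK_SIZE = 1 << 20  # 1MB stripes; real systems make the stripe size tunable


def store_striped(user_path: str, data: bytes,
                  io_node_dirs: list[str]) -> list[str]:
    # Split one user file into fixed-size chunks and store each chunk as an
    # ordinary local file on an "I/O node" (here a local directory stands in
    # for a node). Returns the chunk locations a metadata service would record.
    locations = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk_index = i // CHUNK_SIZE
        node_dir = io_node_dirs[chunk_index % len(io_node_dirs)]  # round-robin
        chunk_path = os.path.join(
            node_dir, f"{user_path.strip('/').replace('/', '_')}.{chunk_index}")
        with open(chunk_path, "wb") as f:
            f.write(data[i:i + CHUNK_SIZE])
        locations.append(chunk_path)
    return locations
```

The user sees one file; each I/O node sees nothing more exotic than its own local files, which is exactly why the approach scales so well.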

That brings us to the challenge of creating a block store that can span the globe. The usual objection is that there are just too many blocks, that tracking them individually requires too much space. This is kind of true . . . if you’re really naive about how you do it. To get a sense of the problem, consider a 1PB data store consisting of 4KB blocks. At a mere one bit of state per block, that’s still about 30GB to track everything (10^15 bytes / 4KB is roughly 2.5 × 10^11 blocks, and 2.5 × 10^11 bits is roughly 30GB). No one server is likely to have that much memory. You can use bigger blocks and divide stuff among more servers (which you’ll need to do for I/O bandwidth reasons anyway), but that barely gets you back to the same point when you consider that in reality you’ll need much more than one bit of state per block. You can store state information on disk and cache it in memory (in fact the multi-level caching mechanisms I used for my block store could have done this for state information just as well as they did for data) and that would help a bit more. The really big bang for the buck, though, is to make the amount of state information proportional to the amount of data in use rather than to the total. After all, that’s essentially what file- and object-based approaches are doing, tracking only the subsets of the total address space that are in use, and keeping lists of active LUN segments is no harder than keeping lists of open file/object handles. Combine all of these approaches, and a global data store based on blocks is quite manageable.
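
Here is a rough sketch of what “state proportional to data in use, cached in memory” might look like: per-segment state kept only for active LUN segments, bounded by an LRU that spills to a backing store. The segment size, eviction policy, and in-memory stand-in for the on-disk store are all assumptions for illustration, not anyone’s real design.

```python
from collections import OrderedDict


class SegmentStateCache:
    """Track per-segment state only for LUN segments that are actually in use,
    with an LRU bound so memory stays proportional to the active set rather
    than to the total address space."""

    def __init__(self, segment_size: int = 1 << 26, max_cached: int = 1 << 20):
        self.segment_size = segment_size        # e.g. 64MB per tracked segment
        self.max_cached = max_cached            # cap on in-memory entries
        self.cache: OrderedDict[int, dict] = OrderedDict()
        self.backing: dict[int, dict] = {}      # stand-in for state kept on disk

    def _segment(self, byte_offset: int) -> int:
        return byte_offset // self.segment_size

    def get_state(self, byte_offset: int) -> dict:
        seg = self._segment(byte_offset)
        if seg in self.cache:
            self.cache.move_to_end(seg)         # LRU touch
        else:
            # Miss: pull from the backing store (or start empty), and evict
            # the least recently used entry if we are over the cap.
            self.cache[seg] = self.backing.get(seg, {})
            if len(self.cache) > self.max_cached:
                old_seg, old_state = self.cache.popitem(last=False)
                self.backing[old_seg] = old_state
        return self.cache[seg]
```

Segments that nobody touches cost nothing in memory, which is the whole point: the footprint follows the working set, not the petabyte.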

Lastly, we come to the “rich semantics” Chuck mentioned. Mapping files to files is surprisingly non-trivial when the “upper files” and “lower files” don’t support exactly the same features, but it’s not rocket science either. Turning objects into files is in the same difficulty range, but turning blocks into files is more complicated. I’m not going to go into great detail on how to solve that set of problems – maybe in another post some day – but I will make a few assorted points about where the complexity associated with those “rich semantics” should reside.

  • Separation between data access and metadata access is a good thing, which is why every viable parallel filesystem has gone that way. Both can be distributed, but in different task-appropriate ways. As HighRoad and subsequently pNFS have shown, once you have a metadata layer that can do data mapping, the clients can access data as either objects or blocks with approximately equal ease (see the sketch after this list). Before anyone from EMC tries to contradict that, consider that your company name is on the pNFS-block proposal.
  • If the complexity lives on a bunch of data servers, those data servers need to communicate amongst themselves to satisfy many kinds of requests, and it’s really easy to get into all sorts of deadlock-prone loops or O(n^2) communication traps.
  • It’s generally a good idea to push as much complexity as possible to the most numerous components, and the most numerous components in a cloud-storage system are clients. Smart objects living on servers are no substitute for delegation methods that make it feasible to let clients do most of the work.
  • Considering all of the above points, turning a global coherent block store into a shared filesystem is almost exactly the same problem as turning a local coherent block store – i.e. a disk array – into one. It’s far from easy, but it’s something I and others have already shown can be done.
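
As a toy illustration of the first and third points (a metadata layer that maps data, then gets out of the way and lets clients do the I/O), here is a sketch in the same hypothetical vein as the earlier ones. It is not the actual pNFS protocol: the types, the `read_block` transport callback, and the extent handling are simplifications assumed for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Extent:
    file_offset: int     # where this piece sits in the file
    length: int
    device: str          # data server, object store, or LUN identifier
    device_offset: int   # block address or object offset on that device


@dataclass
class Layout:
    # A delegation handed to the client: for these extents of this file,
    # talk to the listed devices directly. The metadata server stays out
    # of the data path until the layout is recalled.
    path: str
    extents: list[Extent]


class MetadataServer:
    # Toy metadata service: it only maps names and offsets to locations.
    def __init__(self, fmap: dict[str, list[Extent]]):
        self.fmap = fmap

    def get_layout(self, path: str, offset: int, length: int) -> Layout:
        overlapping = [e for e in self.fmap[path]
                       if e.file_offset < offset + length
                       and offset < e.file_offset + e.length]
        return Layout(path, overlapping)


def client_read(mds: MetadataServer,
                read_block: Callable[[str, int, int], bytes],
                path: str, offset: int, length: int) -> bytes:
    # Client side: fetch a layout once, then do all data I/O directly against
    # the data servers via `read_block` (a stand-in for whatever block or
    # object transport the client speaks). For simplicity this returns whole
    # overlapping extents rather than trimming to the requested byte range.
    layout = mds.get_layout(path, offset, length)
    return b"".join(read_block(e.device, e.device_offset, e.length)
                    for e in layout.extents)
```

The shape is what matters, not the details: the mapping lives in one place, the bulk data path lives somewhere else, and the clients, being the most numerous components, carry most of the per-request work.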

To wrap this up, let’s go back to Chuck’s claims. Presenting data as blocks can scale. Presenting data as files can scale. You don’t need an “object-oriented approach with rich semantics” at all. Remember the early designs for what eventually became Invista? Thankfully, neither does anyone else. They went “object-oriented approach with rich semantics” all the way and it was a fiasco that took years to untangle. One failure of implementation does not condemn an entire approach, though, or else global-scale data stores of any flavor would forever be a dead idea. The OOAWRS approach is not an entirely unreasonable one, but it’s by no means necessary. Other approaches than the one chosen by EMC can also work at this “uber-massive” scale.