During my day-job quest to define what “cloud filesystem” might most usefully mean, I keep coming back to one statement: cloud users need somewhere to store their data. Yes, I know that seems obvious. It’s so obvious that one can scarcely imagine anybody thinking otherwise . . . and yet such thinking (or perhaps lack of thinking) seems all too common among people involved in cloud computing. Here’s a little pie chart of how a lot of cloud folks seem to allocate their time spent thinking about the technical issues.

As a storage guy, I find this not only annoying but stupid. It’s not just cloud storage that gets this treatment, by the way. Storage has long been the red-headed stepchild of computing. Kids come out of college knowing a lot about processors and compilers and AI and all sorts of other computation-oriented stuff. Nowadays they also know a lot about networking, though that knowledge is often specific to IP networking and doesn’t cover the other kinds of communication that occur within and between modern computer systems (e.g. internal or cluster interconnects). They might even learn something about security, but storage? That’s just the black box where you put stuff when you’re done. Never mind that a modern high-end storage system is likely to be more powerful and sophisticated in any dimension than anything connected to it . . . but I digress. The real point here is that cloud people have put off talking (much) about storage for a long time, but they finally seem ready to talk, so let’s talk. First, let’s talk about a distinction that’s already often made between different cloud offerings, and see how that distinction applies to cloud storage.

  • “Infrastructure as a Service” means providing familiar but low-level functionality, close to the nuts and bolts of how clouds are built. Emphasis here is on letting users build their own application “stacks” almost the same way they do outside of the cloud, but on demand and on somebody else’s hardware. Amazon’s EC2 is the obvious example, with Rackspace and GoGrid providing the best known alternatives. In storage, “familiar” means block or filesystem access; examples might include Amazon’s EBS or various vendors’ provision of SAN (especially iSCSI) or NAS facilities to cloud users.
  • “Platform as a Service” is more abstracted from hardware and operating systems, providing its own “stack” which defines (some might say dictates) more of the applications’ structure. Google’s AppEngine, Microsoft Azure, or the entire J2EE ecosystem are examples here. Familiarity is less of an issue here, so a plethora of options have appeared – traditional databases, schema-less and/or “NoSQL” databases, generic key/value stores, in-memory or persistent data grids (especially in the Java world), and so on.
  • I’d also put Amazon’s S3 in the “platform” category. Some might say it’s not quite the same as the other platform options I’ve mentioned, and they’re right, because of a second distinction that I think matters. People in the storage world have been thinking about operational vs. archival storage for a long time, but the concepts and terms are finally entering the cloud conversation. In that context I would also add a third category, as follows.

  • Operational storage is what an application uses while it’s running, perhaps only while a single request is being processed. Emphasis is on low latency and high transaction rates. In traditional storage, this translates into small random I/O, and is often best served by fast disks or SSDs. In cloud storage this would also encompass in-memory caching systems, data grids, and key/value stores.
  • Archival storage is the opposite of operational storage in most regards. Emphasis is on data permanence, often with retention/deletion guarantees and/or rich metadata. Examples in traditional storage include virtual tape libraries or content-addressed storage systems using larger/slower disks or even non-disk media. In the cloud space this is where I’d put Amazon’s S3, EMC’s Atmos, and anything based on the Simple Cloud API.
  • Batch storage is my third category, and a bit of a hybrid. Performance is again a focus, but in this case it’s more about high bandwidth for large sequential I/O. Permanence matters more than for operational storage but less than for archival. Device speed matters less here than the number of devices coupled with fast controllers and interconnects. In traditional storage, many parallel filesystems used in HPC or video (both processing and distribution) address this need. Some do so by design, never intending to support operational-storage access patterns in the first place, while others end up here simply because they’re lousy at those patterns. In the cloud world, this is where I’d put GoogleFS and HDFS.
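The operational/batch split above really comes down to access patterns. Here’s a small sketch of what I mean; the block and chunk sizes are arbitrary numbers I picked for illustration, not anything a particular system mandates.

```python
import os
import random

BLOCK = 4096              # small-block size, typical of operational workloads
CHUNK = 4 * 1024 * 1024   # large chunk, typical of batch/sequential workloads

def operational_reads(path, n=8):
    """Small random reads, the kind a transactional application issues.
    Latency per read dominates, so fast disks or SSDs shine here."""
    size = os.path.getsize(path)
    results = []
    with open(path, "rb") as f:
        for _ in range(n):
            f.seek(random.randrange(0, size - BLOCK))
            results.append(f.read(BLOCK))
    return results

def batch_read(path):
    """One big sequential pass, the kind an HPC or analytics job issues.
    Aggregate bandwidth dominates, so many devices behind fast
    controllers and interconnects matter more than per-device speed."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
    return total
```

A filesystem tuned for the second function can be hopeless at the first, which is exactly how some parallel filesystems end up in the batch category.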

Cross infrastructure/platform with operational/archival/batch, and you get six categories. Cross with traditional/cloud and you get twelve. Where do I sit? Mostly at the intersection of the infrastructure level with the operational type, with a side order of cloud. As infrastructure, a cloud filesystem has to use a familiar interface. As operational storage, it has to provide good performance, especially for small/random I/O patterns. As a cloud component it has to be shared, distributed, dynamically scalable, and multi-tenant. It’s a bit of a gap right now. I think it’s something users might want, but even those who are thinking about storage in the cloud tend to be heading in other directions. Probably the closest you can get right now is a clustered NAS such as those provided by Isilon, BlueArc, or Exanet. The money that has been poured into these companies, and that they make in return, validates the need and interest, but they all charge $$$ for proprietary hardware and software. I think there’s a place and a possibility for something more cost effective, which also more directly addresses cloud needs such as distribution and multi-tenancy.
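For the combinatorially inclined, the cross-product above can be spelled out in a few lines. This is just a sketch of the taxonomy as described; the axis names are mine.

```python
from itertools import product

# The three axes of the taxonomy described above.
levels = ["infrastructure", "platform"]
types = ["operational", "archival", "batch"]
worlds = ["traditional", "cloud"]

# Crossing level with type gives six categories...
six = list(product(levels, types))

# ...and adding traditional vs. cloud gives twelve.
twelve = list(product(levels, types, worlds))

# The spot this post argues is underserved:
gap = ("infrastructure", "operational", "cloud")
```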