In my cloud-storage slides for the cloud forum last week – available at Red Hat but you’ll probably have to register until I get clearance to post a copy here – one of the points I made is that you have many options for storage in the cloud but whatever options you choose should have certain “cloud appropriate” characteristics. Here, I’ll dive more into what I think that term means.

First, though, I have to talk a little bit about cloud-service deployment models. Just about any cloud service can be deployed privately by the users themselves, taking advantage of the elasticity and isolation already provided at the instance and network level. For example, I run GlusterFS this way on Amazon or Rackspace pretty regularly and it works fine. On the other hand, what if the provider offered GlusterFS as a permanent shared resource, like S3 but at a filesystem level or like CloudFiles but with a POSIX interface? The servers could be doing native instead of virtualized I/O, on specially provisioned and optimized hardware. This would much better capture those economies of scale and expertise that James mentions, and also take advantage of his “non-correlated peaks” to bring the cloud advantage of more efficient provisioning to storage as well as computation. That’s the deployment model I have in mind for this discussion.
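
For the "deploy it yourself" case, the mechanics are pretty simple. Here's a minimal sketch, written as Python only so I can annotate each step, of bringing up a small replicated GlusterFS volume across two cloud instances that are already running glusterd; the hostnames, brick paths, and volume name are all made up for illustration.

```python
import subprocess

# Hypothetical hostnames for two cloud instances already running glusterd.
nodes = ["gluster1.example.internal", "gluster2.example.internal"]

def sh(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Join the instances into one trusted pool (run from the first node).
sh("gluster", "peer", "probe", nodes[1])

# Create and start a two-way replicated volume from one brick on each node.
sh("gluster", "volume", "create", "cloudvol", "replica", "2",
   f"{nodes[0]}:/export/brick1", f"{nodes[1]}:/export/brick1")
sh("gluster", "volume", "start", "cloudvol")

# Clients then mount it like any other filesystem:
#   mount -t glusterfs gluster1.example.internal:/cloudvol /mnt/cloudvol
```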

The feature that many cloud storage systems tend to focus on is scale. This focus is actually kind of strange, because scale is not an essential aspect of cloud computing. Clouds don’t have to be large, even though larger clouds can benefit more from economies of scale and concentration of expertise. James Hamilton explains this very well, though I disagree with his main thesis. In fact, a high percentage of the machines used for cloud computing are part of large public clouds, so I’ll just leave that debate for another post. My main point here is that scale has to mean not just machine scale but human scale as well. Cloud users and applications come and go, but the services provided by the cloud itself have to stay running semi-permanently and endure many changes with no (or nearly no) downtime. If your cloud storage system can combine hundreds of machines into one giant pool, but requires massive amounts of human intervention every time you add a server or want to make a change, you haven’t won. A certain parallel filesystem (about which I already have another post queued up) had this problem in a major way. Everyone who deployed it, even in small configurations, had to devote significant resources to troubleshooting and tuning, and every change required major downtime. Fortunately, many cloud storage developers do seem to have learned from others’ mistakes in this area, and the systems they’ve created do pretty well in this regard.

Another feature that many cloud-storage folks seem at least somewhat aware of is security, but I think most don’t consider it enough. I’m tired of seeing cloud-storage vendors talk about their fantastic 256-bit encryption, because they’re usually only talking about encrypting data in flight, and frankly I think they’re being more than a little bit deceptive about what that really means for the user. If you’re using a public cloud, you should be as distrustful of the provider as of some hypothetical “man in the middle” when it comes to your data, and encrypting data in flight isn’t sufficient. You also need to be concerned about encrypting data at rest, meaning that the provider shouldn’t have unencrypted data or keys for encrypted data. Your data should be encrypted gibberish from the moment it leaves your control until the moment it returns to your control, no matter how many times it’s transmitted or stored (including caches) in between, and only you should have the keys. Many who brag about their encryption strength don’t actually allow that.
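
To be concrete about what I mean by “only you should have the keys,” here’s a minimal sketch (Python, using the cryptography library’s AES-GCM support) of client-side encryption: the 256-bit key is generated and kept by the user, and everything the provider stores or relays is already ciphertext. The put/get calls it alludes to are hypothetical placeholders for whatever store you’re actually using.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# The key never leaves the client; the provider only ever sees ciphertext.
key = AESGCM.generate_key(bit_length=256)   # keep this yourself, e.g. in a local keystore
aead = AESGCM(key)

def encrypt_for_upload(plaintext: bytes, object_name: bytes) -> bytes:
    nonce = os.urandom(12)                  # must be unique per object/version
    # Bind the ciphertext to the object name so objects can't be silently swapped.
    return nonce + aead.encrypt(nonce, plaintext, object_name)

def decrypt_after_download(blob: bytes, object_name: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return aead.decrypt(nonce, ciphertext, object_name)

# cloud_store.put / cloud_store.get are hypothetical provider calls;
# everything they handle is already encrypted gibberish.
```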

Having all of your data encrypted protects against improper disclosure either to/through your provider or to others, but what about tampering or denial of service? For that, you need different forms of isolation. You can start with traditional multi-user security, but many cloud-storage systems don’t even have a concept of multiple users. FAIL. You also need to ensure that users not only can’t read or write each other’s files, but – especially when you’re giving them SLAs – that one user cannot crash or overload servers so that other users are affected. However, even cloud storage systems that meet the basic requirement of recognizing users as distinct from one another often fail in this regard; take a little bit of cloud cartography, a little bit of I/O performance observation, and it doesn’t take a genius to figure out that cloud storage users can indeed be very much affected by other users’ activity. In some cases this can amount to a denial of service. This is an area where a lot of work still needs to be done, from providing quality-of-service features in I/O systems and hypervisors up to giving users a way to specify service levels in their management consoles.
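
As an illustration of the kind of I/O-level isolation I mean, here’s a minimal sketch of a per-tenant token bucket such as a storage server might use to keep one user’s flood of requests from starving everyone else. The rates are invented for the example, not real SLA numbers.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-tenant I/O throttle: refuse (or queue) requests beyond the allowed rate."""
    def __init__(self, rate_iops: float, burst: float):
        self.rate = rate_iops       # sustained ops/sec promised in the SLA
        self.burst = burst          # short-term burst allowance
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                # caller queues or rejects instead of letting one tenant flood the server

# One bucket per tenant; the limits here are illustrative only.
buckets = defaultdict(lambda: TokenBucket(rate_iops=500, burst=1000))

def admit(tenant_id: str, op_cost: float = 1.0) -> bool:
    return buckets[tenant_id].allow(op_cost)
```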

The last thing that I think is necessary to make a service “cloud appropriate” is monitoring. What good is an infinitely scalable system if you can’t figure out which part of the system needs to be scaled up to rub out a hot spot? Self-tuning systems that reconfigure themselves would be ideal, but systems that at least detect and report hot spots themselves would be almost as good (and many who are wary of ceding complete control to the software might prefer them). Cloud software should meet at least the same standards as enterprise software in terms of providing logs and statistics that enable a reasonably well-trained human to see why a given subsystem is slow. Again, unfortunately, this is an area where many cloud storage projects make little or no effort. Cloud storage providers need tools to figure out when they should add servers/disks/switches (three separate cases), how they should rearrange their directory structures to split up hot spots, etc. Too often, they have to use external tools like network logging to figure out problems that should be visible using facilities within the cloud software itself.
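
To show the sort of visibility I’m asking for, here’s a minimal sketch of per-server latency tracking with a crude hot-spot check. A real system would export these numbers as logs or metrics rather than holding them in memory, but the idea is the same.

```python
import statistics
from collections import defaultdict

# Rolling per-server latency samples; a real system would export these as
# logs or metrics instead of keeping them in process memory like this.
samples = defaultdict(list)

def record(server: str, latency_seconds: float) -> None:
    samples[server].append(latency_seconds)

def hot_spots(factor: float = 2.0) -> list[str]:
    """Flag servers whose median latency is well above the cluster-wide median."""
    medians = {s: statistics.median(v) for s, v in samples.items() if v}
    if not medians:
        return []
    overall = statistics.median(medians.values())
    return [s for s, m in medians.items() if m > factor * overall]

# e.g. record("server-17", 0.42) after each request, then poll hot_spots()
# to decide where to add disks or rebalance directories.
```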

So there you have it: scale, security/isolation, and monitoring. Software that meets these requirements can be deployed as a public shared resource, while software that doesn’t can’t (or at least shouldn’t). Unfortunately, the software that does tends to be proprietary – often “wrapped in tin” and sold as hardware – while open-source software tends to meet only the scale requirement. If you want, you might consider that a hint about which problems I consider adequately addressed by others and which I’m spending my own time trying to address.