By now, most people interested in NoSQL, cloud storage, and so on have probably seen the story of go-derper, which demonstrates two things.

  1. Memcached has no security of its own.
  2. Many people deploy memcached to be generally accessible.

Obviously, this is a recipe for disaster. Less obviously, the problem is hardly limited to memcached. Most NoSQL stores have no concept of security. They’ll let anyone connect and fetch or overwrite any object. One of the best known doesn’t even check that input is well formed, so “cat /dev/urandom | nc $host $port” from anywhere would crash it quickly. Among all of the other differences between SQL and NoSQL systems – ACID, joins, normalization and referential integrity, scalability and partition tolerance, etc. – the near-total abandonment of security in NoSQL is rarely mentioned. Lest it seem that I’m throwing stones from some other garden, I’d have to say many filesystems hardly fare any better. For example, I generally like GlusterFS, but it provides only the most basic kind of protection against information leakage or tampering. As a POSIX filesystem it at least has a notion of authorization between users, but it does practically nothing to authenticate those users, and authorization without authentication is meaningless. The system-level authorization to connect is trivially crackable, and once I’ve done that I can easily spoof any user ID I want – including root. I’ve had to make the point over and over again in presentations that cloud storage in general – regardless of type – is usually only suitable for deployment within a single user’s instances, protected by those instances’ firewalls and sharing a common UID space. For most such stores, a lot more work needs to be done before a cloud provider can offer them as a public, shared, permanent service separate from compute instances.
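To make the “authorization without authentication” point concrete, here’s a deliberately toy sketch in Python – not GlusterFS’s actual protocol, just the general pattern – of a server that authorizes requests based on whatever user ID the client chooses to claim:

```python
# Toy example only: a "storage server" that authorizes requests based on
# a client-supplied uid, without ever authenticating that claim.
import json

OWNER_UID = 1000    # uid that owns the secret object

def handle_request(raw: bytes) -> bytes:
    req = json.loads(raw)
    # "Authorization": only the owner or root may read...
    if req["uid"] in (OWNER_UID, 0):
        return b"secret contents"
    return b"EPERM"

# ...but nothing stops any client anywhere from simply claiming to be root.
forged = json.dumps({"op": "read", "path": "/secret", "uid": 0}).encode()
print(handle_request(forged))    # b'secret contents'
```

The permission check is real, but it’s checking a value the attacker supplied – which is exactly the trap a filesystem falls into when it trusts the uid in an incoming request without authenticating the user behind it.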

What kind of work? Mostly it falls into two categories: encryption and authentication/authorization (collectively “auth”). For encryption, there’s a further distinction to be made between on-the-wire and at-rest encryption. A lot of cloud-storage vendors make all sorts of noise about their on-the-wire encryption, but they stay quiet or vague about at-rest encryption, and that’s actually more important. The biggest threat to your data is insiders, not outsiders. The insiders aren’t even going over the wire, so all of that AES-256 encryption there doesn’t matter a bit. Insiders should also be assumed to have access to any keys you’ve given the provider, so the only way you can really be sure nobody’s messing with your data is if you never give them unencrypted data or keys for that data. Your data must remain encrypted from the moment it leaves your control until the moment it returns again, using keys that only you possess. I know how much of a pain that is, believe me. I’ve had to work through the details of how to reconcile this level of security with multi-level caching and byte addressability in CloudFS, but it’s the only way to be secure. Vendors’ descriptions of what they’re doing in this area tend to be vague, as I said, but Nasuni is the only one that visibly seems to be on the right track. It sure would be nice if people could get that functionality through open source, instead of paying both a software and a storage provider to get it. Cue appearance by Zooko to plug Tahoe-LAFS in 5, 4, 3, …
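For the sake of concreteness, here’s a minimal sketch of what client-side, at-rest encryption looks like, using Python’s cryptography package. The in-memory dict stands in for whatever put/get API your provider actually exposes, and a real system (CloudFS, Tahoe-LAFS) has to do far more around key management, caching, and byte addressability:

```python
# Sketch: encrypt before upload, decrypt after download, with a key that
# never leaves your side. Requires `pip install cryptography`; the dict
# below is a stand-in for a real provider API.
from cryptography.fernet import Fernet

key = Fernet.generate_key()     # generated and kept on *your* systems
box = Fernet(key)

def put_encrypted(store: dict, name: str, plaintext: bytes) -> None:
    store[name] = box.encrypt(plaintext)      # provider only ever sees ciphertext

def get_decrypted(store: dict, name: str) -> bytes:
    return box.decrypt(store[name])           # decryption happens client-side

cloud = {}                                    # pretend remote object store
put_encrypted(cloud, "report.txt", b"quarterly numbers")
assert get_decrypted(cloud, "report.txt") == b"quarterly numbers"
assert b"quarterly" not in cloud["report.txt"]    # insiders get ciphertext only
```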

The other area where work needs to be done is handling user identities, which covers both auth and identity mapping. For starters, the storage system must internally enforce permissions between users, which of course means it must have a notion of there even being multiple users. For systems that can assume a single connection belongs to a single user, you can then authenticate using SASL or similar and be well on your way to a full solution. For systems that can’t make such an assumption, which includes things like filesystems, that’s not sufficient. You need to identify and authenticate not just the system making a request, but the user as well. I’m not a security extremist, so I can accept the argument that if you can fully authenticate a system and communicate with it through a secure channel then you can trust it to identify users correctly. The alternative is something like GSSAPI, which requires less trust in the remote system but can be a pretty major pain to implement.
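As an illustration of the difference between trusting the connecting system and authenticating individual users, here’s a crude per-request HMAC scheme in Python – a stand-in for what SASL or GSSAPI would actually negotiate, with made-up user names and shared secrets:

```python
# Sketch: authenticate the user on every request instead of trusting the
# connecting machine's word for who it is. HMAC over the request body is
# a crude stand-in for a real SASL/GSSAPI exchange.
import hmac, hashlib

USER_KEYS = {"alice": b"alice-secret", "bob": b"bob-secret"}   # hypothetical

def sign_request(user: str, payload: bytes) -> bytes:
    return hmac.new(USER_KEYS[user], payload, hashlib.sha256).digest()

def verify_request(user: str, payload: bytes, mac: bytes) -> bool:
    expected = hmac.new(USER_KEYS[user], payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, mac)

payload = b"READ /home/alice/notes.txt"
mac = sign_request("alice", payload)
assert verify_request("alice", payload, mac)          # alice proves it's her
assert not verify_request("bob", payload, mac)        # nobody else can claim it
```

The point isn’t this particular scheme; it’s that the storage system checks a per-user proof on each request rather than believing whatever uid the remote host passes along.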

The last issue is identity mapping. Even if you can ensure that a remote system is providing the correct user IDs, those IDs are still only correct in their own context. If you’re a cloud service provider, you really can’t assume that tenant A’s user X is the same as tenant B’s user X. Therefore, you need to map A:X and B:X to some global users P and Q. Because you might need to store these IDs and then return them later (e.g. on a stat() call if you’re a filesystem), you need to be able to do the reverse mapping back to A:X and B:X as well. Lastly, because cloud tenants can and will create new users willy-nilly, you can’t require pre-registration; you need to create new mappings on the fly, whenever you see a new ID. This ends up becoming pretty entangled with the authentication problem because authentication information needs to be looked up based on the global (not per-tenant) ID, so this can all be a big pain but – again – it’s the only way to be secure.
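Here’s a rough sketch of the kind of mapping layer I mean, with on-the-fly creation and reverse lookup. The tenant and user names are made up, and a real implementation would need persistence plus the hook into per-global-ID authentication lookup described above:

```python
# Sketch: map (tenant, local id) pairs to global IDs, created on demand,
# with a reverse table so stored global IDs can be translated back.
import itertools

class IdentityMap:
    def __init__(self):
        self._fwd = {}                        # (tenant, local_id) -> global_id
        self._rev = {}                        # global_id -> (tenant, local_id)
        self._next = itertools.count(100000)  # arbitrary global ID space

    def to_global(self, tenant: str, local_id: str) -> int:
        key = (tenant, local_id)
        if key not in self._fwd:              # no pre-registration required
            gid = next(self._next)
            self._fwd[key] = gid
            self._rev[gid] = key
        return self._fwd[key]

    def to_local(self, global_id: int):
        return self._rev[global_id]           # e.g. for filling in stat() results

idmap = IdentityMap()
p = idmap.to_global("tenant_a", "X")
q = idmap.to_global("tenant_b", "X")
assert p != q                                 # same local name, different users
assert idmap.to_local(p) == ("tenant_a", "X")
```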

To sum up, the lesson of go-derper is not that memcached is uniquely bad. Lots of systems are equally bad, and making them less bad is going to be hard, but it needs to be done before the other promises made by those systems can be realized. For a great many people, systems that are so totally insecure are useless, no matter what other wonderful functionality they might provide.