In my professional life as well as here (is there really any difference?) one of the things I do a lot is “evangelize” the idea that applications need many different kinds of storage. You shouldn’t shoe-horn everything you have into an RDBMS. You shouldn’t shoe-horn everything you have into a filesystem. Ditto for key/value stores, column or document stores, graph DBs, etc. As I’m talking about different kinds of storage that make different CAP/performance/durability tradeoffs, somebody often mentions In Memory Data Grids (henceforth IMDGs). Occasionally references are made to tuple spaces, or to Stonebraker’s or Gray’s writings about data fitting in memory, but the message always seems to be the same: everything can run at RAM speed, so storage doesn’t need to be part of the operational equation. To that I say: bunk. I’ve set up a 6TB shared data store across a thousand nodes, bigger than many IMDG advocates have ever seen or will see for at least a few more years, but 6TB is still nothing in storage terms. It was used purely as scratch space, as a way to move intermediate results between stages of a geophysical workflow. It was a form of IPC, not storage; the actual datasets were orders of magnitude larger, and lived on a whole different storage system.
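For what it’s worth, here’s roughly what that scratch-space pattern looks like, reduced to a toy: one stage writes an intermediate result into shared scratch space, the next stage reads it and then throws it away. The paths and stage names below are invented for illustration; the real workflow was, obviously, a lot bigger than this.

```java
import java.io.IOException;
import java.nio.file.*;

// Toy sketch (names and paths are hypothetical): shared scratch space used
// as IPC between pipeline stages, not as a system of record.
public class ScratchPipeline {
    private static final Path SCRATCH = Paths.get("/scratch/job-42"); // shared, non-durable

    public static void main(String[] args) throws IOException {
        Files.createDirectories(SCRATCH);

        // Stage 1: dump an intermediate result into scratch space.
        Path intermediate = SCRATCH.resolve("stage1.out");
        Files.write(intermediate, "partial result".getBytes());

        // Stage 2: pick it up, process it...
        byte[] data = Files.readAllBytes(intermediate);
        process(data);

        // ...and throw it away. The real dataset lives elsewhere, on durable
        // storage; nothing in here needs a backup story.
        Files.delete(intermediate);
    }

    private static void process(byte[] data) {
        System.out.println("stage 2 consumed " + data.length + " bytes");
    }
}
```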

But wait, the IMDG advocates say, we can spill to disk, so capacity’s not a limitation. Once you have an IMDG that spills to disk, using memory as a cache, you have effectively the same thing as a parallel filesystem, only without the durability characteristics – without a credible backup story, or ILM story, or anything else that has grown up around filesystems. How the heck is that a win? There are sites that generate 25TB of log info per day. Loading that into memory, even with spill-to-disk, is barely feasible and certainly not cost-effective. There are a very few applications that need random access to that much data; the people running those applications are the ones who keep hyper-expensive big SMP (like SGI’s UltraViolet) alive, and a high percentage of them work at a certain government agency. For the rest of us, the typical processing model for big data is sequential, not random. RAM is not so much a random-access cache as a buffer that constantly fills at one end and empties at the other. That’s why the big-data folks are so enchanted with Hadoop, which is really just a larger-scale version of what your video player does. VLC doesn’t load your entire video into memory. It probably can’t, unless it’s a very small video or you have a very large memory, and you don’t need random access anyway. What it does instead is buffer into memory, with one thread keeping the buffer full while another empties it for playback. The point is that memory is used for processing, not storage. The storage for that data, be it 4.7GB of video or 25TB of logs, is still likely to be disks and filesystems.
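To make the buffering point concrete, here’s a minimal sketch of that model: one thread keeps a bounded in-memory buffer full from disk while another drains it sequentially. The file name and chunk size are arbitrary stand-ins, not anything VLC or Hadoop actually does; memory here is a conveyor belt, not a copy of the whole dataset.

```java
import java.io.*;
import java.util.concurrent.*;

// Sketch of streaming through a bounded buffer: a reader thread fills it,
// the main thread empties it. Nothing close to the full file is ever in RAM.
public class StreamBuffer {
    private static final byte[] EOF = new byte[0]; // sentinel marking end of stream

    public static void main(String[] args) throws Exception {
        BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<>(64); // at most ~64 chunks in RAM

        Thread reader = new Thread(() -> {
            try (InputStream in = new FileInputStream("big-video-or-log-file")) { // hypothetical file
                byte[] chunk = new byte[1 << 20]; // 1MB chunks
                int n;
                while ((n = in.read(chunk)) > 0) {
                    byte[] copy = new byte[n];
                    System.arraycopy(chunk, 0, copy, 0, n);
                    buffer.put(copy); // blocks when the buffer is full
                }
                buffer.put(EOF);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        long total = 0;
        byte[] chunk;
        while ((chunk = buffer.take()) != EOF) { // blocks when the buffer is empty
            total += chunk.length;               // "playback" stand-in: purely sequential processing
        }
        System.out.println("processed " + total + " bytes without holding them all in RAM");
    }
}
```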

I’m not saying that IMDGs aren’t valuable. They can be a very valuable part of an application’s computation or communication model. When it comes to that same application’s storage model, though, IMDGs are irrelevant and shouldn’t be presented as alternatives to various kinds of storage. (Aside to the IMDG weenie who derided cloud databases and key/value stores as “hacks”: let’s talk about implementing persistence by monkey-patching the JVM you’re running in before we start talking about what’s a hack, OK?) Maybe when we make the next quantum leap in memory technology, so that individual machines can have 1TB each of non-volatile memory without breaking the bank, IMDGs will be able to displace real storage in some cases. Or maybe not, since data needs will surely have grown too by then and there’s still no IMDG backup/ILM story worth telling. Maybe it’s better to continue treating memory as memory and storage as storage – two different things, each necessary and each with its own challenges.
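If you want that distinction in code form, here’s a hedged sketch of “memory as memory, storage as storage”: an in-memory map used purely as a cache in front of a durable store. The names are mine, the “store” is just files on disk standing in for a real filesystem or database, and no particular IMDG’s API is implied.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache-aside sketch: the map is disposable (memory), the files
// are the system of record (storage). Losing the map loses nothing.
public class CacheAsideStore {
    private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Path storeDir;

    public CacheAsideStore(Path storeDir) throws IOException {
        this.storeDir = Files.createDirectories(storeDir);
    }

    public byte[] get(String key) throws IOException {
        byte[] cached = cache.get(key);
        if (cached != null) return cached;                 // memory: fast path, no durability expected
        byte[] value = Files.readAllBytes(storeDir.resolve(key));
        cache.put(key, value);
        return value;
    }

    public void put(String key, byte[] value) throws IOException {
        Files.write(storeDir.resolve(key), value);         // storage: durable, backed up, subject to ILM
        cache.put(key, value);                             // memory: just an optimization
    }
}
```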