The other day I got into a discussion on Quora with Cloudera’s Jeff Hammerbacher, about how Impala uses storage. Now Jeff is someone I like, and respect immensely, but he said something in that discussion that I find deeply disturbing (emphasis mine).
MapReduce and Impala are just the first two of what are likely to be many frameworks for parallel processing that will run well over data stored in HDFS, HBase, and future data storage solutions built above HDFS.
I agree that there will always be multiple data stores within an organization. I suspect that as the storage and computational capabilities of Hadoop expand, the amount of data stored within Hadoop will come to dominate the amount of data stored outside Hadoop, and most analyses will be performed within the cluster.
That’s a pretty bold prediction, and a pretty HDFS-centric strategy for getting there. If HDFS is central to the Hadoop storage strategy, and most data is stored in Hadoop, then what he’s really saying is that most data will be stored in HDFS, and I really hope that’s not the plan because any Hadoop company that has that as a plan will lose. I’m a firm believer in polyglot storage. Some storage is optimized for reads and some for writes, some for large requests and some for small requests, some for semantically complex operations and some for simple operations, some for strong consistency and some for high performance. I don’t think the storage I work on is good for every purpose, and that’s even less the case for HDFS. It’s not even a complete general-purpose filesystem, let alone a good one. It was designed to support exactly one workload, and it shows. There’s no track record to show that the people interested in seeing it used more widely could pull off the fundamental change that would be required to make it a decent general-purpose filesystem. Thus, if the other parts of Hadoop are only tuned for HDFS, only tested with HDFS, stubbornly resistant to the idea of integrating with any other kind of storage, then users will be left with a painful choice.
- Put data into HDFS even though it’s a poor fit for their other needs, because it’s the only thing that works well with the rest of Hadoop.
- Put data into purpose-appropriate storage, with explicit import/export between that and HDFS.
Painful choices tend to spur development or adoption of more choices, and there are plenty of competitors ready to pick up that dropped ball. That leaves the HDFS die-hards with their own two choices.
- Hire a bunch of real filesystem developers to play whack-a-mole with the myriad requirements of real general-purpose storage.
- Abandon the “one HDFS to rule them all” data-lock-in strategy, and start playing nice with other kinds of storage.
Obviously, I think it’s better to let people use the storage they want, as and where it already exists, to serve their requirements instead of Hadoop developers’. That means a storage API which is stable, complete, well documented, and consistently used by the rest of Hadoop – i.e. practically the opposite of where Impala and libhdfs are today. It means actually testing with other kinds of storage, and actually collaborating with other people who are working on storage out in the real world. There’s no reason those future parallel processing frameworks should be limited to working well over HDFS, or why those future storage systems should be based on HDFS. There’s also little chance that they will be, in the presence of alternatives that are both more open and more attuned to users’ needs.
Perhaps some day most data will be connected to Hadoop, but it will never be within. The people who own that data won’t stand for it.