The other day I got into a discussion on Quora with Cloudera’s Jeff Hammerbacher about how Impala uses storage. Now, Jeff is someone I like and respect immensely, but he said something in that discussion that I find deeply disturbing (emphasis mine).
> MapReduce and Impala are just the first two of what are likely to be many frameworks for parallel processing that will run well over data stored in HDFS, HBase, and future data storage solutions built above HDFS.
> I agree that there will always be multiple data stores within an organization. I suspect that as the storage and computational capabilities of Hadoop expand, the amount of data stored within Hadoop will come to dominate the amount of data stored outside Hadoop, and most analyses will be performed within the cluster.
That’s a pretty bold prediction, and a pretty HDFS-centric strategy for getting there. If HDFS is central to the Hadoop storage strategy, and most data is stored in Hadoop, then what he’s really saying is that most data will be stored in HDFS. I really hope that’s not the plan, because any Hadoop company that has that as a plan will lose. I’m a firm believer in polyglot storage. Some storage is optimized for reads and some for writes, some for large requests and some for small requests, some for semantically complex operations and some for simple operations, some for strong consistency and some for high performance. I don’t think the storage I work on is good for every purpose, and that’s even less the case for HDFS. It’s not even a complete general-purpose filesystem, let alone a good one. It was designed to support exactly one workload, and it shows. There’s no track record to show that the people interested in seeing it used more widely could pull off the fundamental change that would be required to make it a decent general-purpose filesystem. Thus, if the other parts of Hadoop are only tuned for HDFS, only tested with HDFS, and stubbornly resistant to the idea of integrating with any other kind of storage, then users will be left with a painful choice.
- Put data into HDFS even though it’s a poor fit for their other needs, because it’s the only thing that works well with the rest of Hadoop.
- Put data into purpose-appropriate storage, with explicit import/export between that and HDFS.
Painful choices tend to spur development or adoption of more choices, and there are plenty of competitors ready to pick up that dropped ball. That leaves the HDFS die-hards with their own two choices.
- Hire a bunch of real filesystem developers to play whack-a-mole with the myriad requirements of real general-purpose storage.
- Abandon the “one HDFS to rule them all” data-lock-in strategy, and start playing nice with other kinds of storage.
Obviously, I think it’s better to let people use the storage they want, as and where it already exists, to serve their requirements instead of Hadoop developers’. That means a storage API which is stable, complete, well documented, and consistently used by the rest of Hadoop – i.e. practically the opposite of where Impala and libhdfs are today. It means actually testing with other kinds of storage, and actually collaborating with other people who are working on storage out in the real world. There’s no reason why those future parallel processing frameworks should be limited to working well over HDFS, or why those future storage systems should be based on HDFS. There’s also little chance that they will be, in the presence of alternatives that are both more open and more attuned to users’ needs.
Perhaps some day most data will be connected to Hadoop, but it will never be within. The people who own that data won’t stand for it.
Same can be said for processor architectures. We can see how that has worked so far… GPUs may buck the trend, but only because they’ve shown such an extreme advantage in a few revenue-producing areas.
Great points. I’d extend the “Hire a bunch of real file-system developers” point further: the entire Hadoop system is too fragile, poorly documented and admin-hostile in comparison to a modern Linux distribution & filesystem. Things like daemons leaking memory until they crash, causing key services to fail in non-obvious, obscurely logged ways, do not make me jump to entrust primary data to it.
Don’t get me wrong: Hadoop works, and many people perform a huge amount of work with it. But for it to become a true general-purpose system like that, the basics need to be as easy as, say, setting up an Apache server on a modern Linux system – sane defaults, non-specialist-oriented documentation which won’t leave you googling for blog posts, etc.
Some of Impala’s optimizations are specific to HDFS. However, that doesn’t mean the Hadoop ecosystem is restricted. That may be one vendor’s preference, but the extrapolation to the entirety of Hadoop seems unwarranted. I think the problem lies with the question “What is Hadoop?” That’s not so clear because various parties are actively redefining the term. I think it is on its way to becoming a generic term like “UNIX”. There are several HDFS alternatives of varying maturity and credibility, and typical Hadoop client applications, such as MapReduce, Hive, Pig, and the like, use the FileSystem abstraction, which is agnostic about the underlying FS. If the FS can tell Hadoop where data resides local to a MapReduce task runner, all the better. (Ceph has something in this regard. I don’t know about Gluster.) HBase relies on some HDFS semantics for improved recovery times and to avoid data loss, but this could be specialized to other filesystems too, given available capabilities in those filesystems and developer effort.
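To make the "agnostic abstraction plus optional locality hints" idea concrete, here is a minimal sketch in Python. It is not Hadoop's actual FileSystem API; every name in it (`Storage`, `BlockLocation`, `ReplicatedStore`, `OpaqueStore`, `assign_tasks`) is hypothetical and chosen only for illustration. The assumption is that a backend may report which hosts hold each block, and that a scheduler prefers those hosts but still works when no hint is available.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockLocation:
    """One contiguous range of a file and the hosts that hold a copy (may be empty)."""
    offset: int
    length: int
    hosts: tuple


class Storage(ABC):
    """Hypothetical backend-neutral interface; any store can implement it."""

    @abstractmethod
    def block_locations(self, path):
        """Return a list of BlockLocation objects for the given path."""


class ReplicatedStore(Storage):
    """Toy backend that knows which hosts hold each block."""

    def __init__(self, layout):
        self._layout = layout  # path -> list of BlockLocation

    def block_locations(self, path):
        return self._layout[path]


class OpaqueStore(Storage):
    """Toy backend with no locality information: one block, no host hints."""

    def __init__(self, sizes):
        self._sizes = sizes  # path -> file size in bytes

    def block_locations(self, path):
        return [BlockLocation(0, self._sizes[path], ())]


def assign_tasks(storage, path, workers):
    """Map each block offset to a worker, preferring a worker that holds the block.

    Falls back to round-robin over the workers when the backend offers no hint.
    """
    assignment = {}
    for i, block in enumerate(storage.block_locations(path)):
        local = [w for w in workers if w in block.hosts]
        assignment[block.offset] = local[0] if local else workers[i % len(workers)]
    return assignment


if __name__ == "__main__":
    layout = {"/logs/a": [BlockLocation(0, 64, ("n1", "n2")),
                          BlockLocation(64, 64, ("n3",))]}
    print(assign_tasks(ReplicatedStore(layout), "/logs/a", ["n2", "n3"]))
    print(assign_tasks(OpaqueStore({"/logs/a": 128}), "/logs/a", ["n2", "n3"]))
```

The scheduler only ever talks to `Storage`, so swapping the backend changes placement quality, not correctness.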
I thought Hadoop, by definition, is HDFS and MapReduce?
I think the last two comments pretty neatly capture the problem here. Is Hadoop a project or a technology? Is it composed of code, people, infrastructure or legal agreements? Should it be defined minimally (e.g. just MapReduce) or maximally (dozens of subprojects providing and often duplicating dozens of kinds of functionality)? Is it just a marketing term hung on whatever certain people want to promote? I contend that HDFS, despite being an early and core part of Hadoop as a project, isn’t essential. Rip out MapReduce and it’s just not Hadoop any more. It doesn’t do what people set up Hadoop to do. (“What is its nature, Clarice? What does it do?”) Rip out HDFS, replace it with MapR/Ceph/GlusterFS/whatever, and it’s still Hadoop. It still does what people set up Hadoop to do, and might even do it better. HDFS has only become common because it rode MapReduce’s coattails. Some people at some companies want to continue the ride, but neither the other Hadoop components nor Hadoop users are well served by the “everything in HDFS” approach/attitude.
BTW, Andrew, GlusterFS does have such location-query capability. In fact, I’m pretty sure we’ve had it longer than Ceph has.
Hadoop is anything but just Map Reduce and to say otherwise seems a little off the mark. First and foremost, HDFS provides the optimal characteristics needed by Map Reduce for large scale processing and they were designed closely together. If anything, Map Reduce is going to some day go by the wayside and HDFS will likely continue to shine as an open de facto standard for many other algorithms and real-time engines, like YARN, Storm, and Impala. HDFS will continue to benefit from the Linux community as that community, too, improves its local file systems. Effectively, the entire world of open source developers is, directly or indirectly, improving HDFS.
I’m not 100% sure you’ve ever seen a 10 PB HDFS cluster humming along, but it’s a thing of beauty.
Actually I haven’t seen a 10PB HDFS cluster humming along. On the other hand, I’m not 100% sure you’ve ever seen a 10PB storage cluster that wasn’t HDFS, so maybe we should just set the ad hominems aside and discuss the matter at hand.
Your point about HDFS becoming a de facto standard is kind of assuming your conclusion. Yes, it might, but the whole question here is whether it should. Other petabyte-scale storage systems exist, have existed for longer than HDFS, serve more uses and users already, and benefit even more from Linux local-filesystem development because their authors actually talk to (or even are) Linux local-filesystem developers. In what way does HDFS *deserve* to be that de facto standard? Your claim about “optimal characteristics needed by Map Reduce” – even if true as stated – tells us little about whether it also has optimal characteristics needed by any other analysis infrastructure. In fact, since few things are optimal for more than one purpose, it suggests that HDFS will *not* be optimal for those others, and as MapReduce “goes by the wayside” even that solitary point of distinction will become irrelevant.
The other point that I think is critically important is that no system which requires copy in/out from where users would naturally put their data will ever be optimal. Big Data is typically generated by dozens or hundreds of programs. HDFS is optimal for almost none of them, and because it’s missing features expected in a true filesystem[1], some of them might not work with it at all. That leads to the painful choices I mentioned. Systems which avoid painful choices are usually preferable, and federated storage does avoid those choices. “You must embrace our pseudo-standard” is not a winning strategy.
[1] Just today: https://twitter.com/nicktelford/status/266877649169809408
Hmmm, looks like someone at Greenplum gets it.