Archive for July, 2012

Why Does Cloudera *Really* Use HDFS?

Apparently, someone in Hadoop-land is getting worried about alternatives to HDFS, and has decided to address that fear via social media instead of code. Two days ago we had Daniel Abadi casting aspersions on Hadoop adapters. Today we have Charles Zedlewski explaining why Cloudera uses HDFS. He mentions a recent GigaOm article listing eight alternatives, and comes up with a couple more, but still misses at least one. (Hint: what site are you reading right now?) He also makes a really weird comparison to the history of Linux, as though HDFS were the only option that's open and hardware-agnostic. What crap. Practically all of the alternatives share those qualities. He also makes some other misleading claims on HDFS's behalf.

it has excellent … high availability (that’s right folks, drop the SPOF claims, you can download CDH4 here!)

That’s right, folks. After years of denying that the NameNode SPOF mattered, then more time wasted trying to push the problem off on someone else (e.g. shared SAN or NFS), HDFS finally has its very own high availability. Congratulations on reaching parity with the rest of the world. I’d hold off on “excellent” until more than a couple of real-world users have tried it, though.

[HDFS offers] Choice – Customers get to work with any leading hardware vendor and let the best possible price / performer win the decision, not whatever the vendor decided to bundle in.

Um, right. How many of those dozen-or-more alternatives can I not deploy on just as wide a range of hardware and operating systems as HDFS? A couple, maybe? Pure FUD.

Portability – It is possible for customers running Hadoop distributions based on HDFS to move between those different distributions without having to reformat the cluster or copy massive amounts of data.

This is also not an HDFS exclusive. Any of the alternatives that were developed outside the Hadoopiverse have this quality as well. If you have data in Cassandra or Ceph you can keep it in Cassandra or Ceph as you go Hadoop-distro shopping. The biggest data-portability wall here is HDFS's, because it's one of only two such systems (the other being MapR) that's Hadoop-specific. It doesn't even try to be a general-purpose filesystem or database. A tremendous amount of work has gone into several excellent tools to import data into HDFS, but that work wouldn't even be necessary with some of the alternatives. That's not just a waste of machine cycles; it's also a waste of engineer cycles. If they hadn't been stuck in the computer equivalent of shipping and receiving, the engineers who developed those tools might have created something even more awesome. I know some of them, and they're certainly capable of it.

Each application can write the data it generates using some set of interfaces. If HDFS isn't one of those, or if HDFS through that interface is unbearably slow because the HDFS folks treat anything other than their own special snowflake as second class, then you'll be the one copying massive amounts of data before you can analyze it . . . not just once, but every time.

That brings us to performance, which is also interesting because Charles barely mentions it (only tangentially, in his "choice" point). Isn't Hadoop supposed to be all about going fast? Why else would they take shortcuts like telling the caller that a write is complete when in fact it has only been buffered to local disk? I'm not going to say HDFS is not fast, but many of the alternatives are provably faster. That's why they're the ones pushing data faster than any Hadoop installation ever has, on the world's biggest systems. I'm sure Hadoop is there too, and so HDFS probably is as well, but it's not the thing actual applications are using. Why not?

I suspect that the real reason Cloudera uses HDFS is not anything Charles mentions. I don’t know why they built it originally, since some of the alternatives already existed and a few more were at least well along as new projects. Maybe it’s because they wanted something written in the same language/environment as the rest of Hadoop, so instead of contributing patches to an existing filesystem for the one new feature they needed (data-locality queries) they went and built their own not-quite-filesystem. Maybe it’s because they assumed that whatever GoogleFS did was The Right Thing, so they cloned it and then kept thinking that “one more workaround” would make it all that they had originally hoped. Either way, whether it was NIH syndrome or NIAG (Not Invented At Google) syndrome, it developed its own inertia.

At this point the issue is more likely to be familiarity. It’s what their engineers know, and it’s what their customers know. Both groups know how to tune it, and tune other things to work with it. Moving to something else would not only take a lot of effort but take people out of their comfort zones as well, and it would also open them up to questions about why they expended all those resources on a sub-optimal solution before. Of course they’re going to stick with it, and even double down, because they have no choice. The rest of us do.

UPDATE: Eric Baldeschwieler at HortonWorks has posted their response to the GigaOm article. In my opinion he does a much better job of identifying the qualities that HDFS and any would-be competitors must have, and of measuring the alternatives against those qualities. We might disagree on some points (e.g. sequential I/O patterns are common and generally well optimized, so why the hell should I/O systems be designed around Hadoop instead of the other way around?) but kudos for a generally excellent response.


Do Hadoop Adapters Make Sense?

Daniel Abadi described his blog entry about Hadoop connectors as a “Stonebraker-style rant” and then delivered on the threat. Like everything Stonebraker has written in the last five years, it’s based on a fundamentally flawed premise, which is that HDFS stores unstructured data. This assumption is not clearly stated, but it’s pretty clear from context, e.g.

primary storage in Hadoop (HDFS) is a file system that is optimized for unstructured data

Among storage folks, "unstructured" means stuff that is written using any of the millions of applications of the last thirty years that write their data using the storage API that's built into every operating system – i.e. files. HDFS does not store unstructured data. If it did, there would be no need to import true unstructured data into HDFS. HDFS is not quite a real file system, usable by a significant percentage of applications. It's designed to support Hadoop itself, and that's pretty much all it does with any degree of competence. Can you extract and build a Linux kernel source tree in HDFS? Even if you could, how well do you think a system designed around management of 128MB chunks would work for that? (BTW yes, I can think of ways to analyze this kind of data using Hadoop, so it's not an entirely silly example. Use your imagination.)

Going back to Daniel’s argument, RDBMS-to-Hadoop connectors are indeed silly because they incur a migration cost without adding semantic value. Moving from one structured silo to another structured silo really is a waste of time. That is also exactly why filesystem-to-Hadoop connectors do make sense, because they flip that equation on its head – they do add semantic value, and they avoid a migration cost that would otherwise exist when importing data into HDFS. Things like GlusterFS’s UFO or MapR’s “direct access NFS” decrease total time to solution vs. the HDFS baseline. I’d put DataStax’s Brisk into the same category, even though the storage it provides is structured underneath, because it also works “inline” like the native file system alternatives instead of requiring an import phase like HDFS. The fact that you can also use CassandraFS to work with file-based applications is just icing on the cake.

So, are Hadoop adapters a good idea or a bad one? It depends on what kind you're talking about. Daniel and I can agree that an ETL-style RDBMS connector doesn't make any sense at all, but I believe that an inline unstructured-data connector is a different story altogether.


Testing on Amazon’s New SSD Instances

The new hi1.4xlarge instances in EC2 are pretty exciting, not only because they're equipped with SSDs but also because they have 10GbE networking, and placement groups let you create server clusters that are closely colocated with full bandwidth among them. I was just about ready to do another round of GlusterFS testing to see the effects of some recent changes (specifically the multi-threaded SSL transport and Avati's delayed post-op in AFR), so it seemed like a good time to try out the new instances as well.

After firing up my two server instances, the first thing I did was check my local I/O performance. Each volume seemed to top out at approximately 30K IOPS, same as I'd seen at Storm on Demand when I was testing my replication code there, but the Amazon instances have two of those so they should be able to do 60K IOPS per instance (the 100K everyone else keeps quoting is just a marketing number).

I couldn't immediately fire up a third instance in the same placement group because of resource limits, so I fired up a plain old m1.xlarge for the client. I've applied for a resource-limit increase so I can do the test I wanted to do, but for now these results should at least be directly comparable to Storm on Demand. All of these tests were run on a four-brick two-way-replicated GlusterFS volume to take full advantage of the hardware in the servers. Please bear in mind that these are random synchronous writes over a (slow) network, so the numbers will seem very low compared to those you'd get if you were testing async I/O locally. This is all about a worst case; the best case just wasn't interesting enough to report on.

[Graph: Amazon High I/O Performance]

If you compare to the Storm on Demand graph (link above) a few things immediately become apparent. One is that the highest valid number (the unsafe “fast-path” number doesn’t count) has gone up from about 3000 to about 4000. That’s nice, but also bear in mind that the Amazon instances cost $3.10 per hour and the Storm on Demand instances are only $0.41 per hour. Even if the IOPS numbers had doubled, that still doesn’t seem like such a great deal.

The second obvious result is that the same number for “plain old AFR” has gone up from ~1500 IOPS to well over 4000, quite handily overtaking my own hsrepl. I’m not entirely sure why hsrepl actually managed to get worse, but my working theory is that the new handling of “xdata” (where we put the version numbers necessary for correct operation) is considerably less efficient than the handling I’d implemented on my own before. I don’t have hard evidence of that, but the new code will definitely go around in a much longer code path issuing more reads for the same data, and the sudden drop-off for hsrepl in my own local testing corresponds exactly with that change. In the end we seem to be even further from that theoretical maximum, even though the absolute IOPS number has increased.

The other mystery for me is why the multi-threading also seems to make things worse. This isn’t actually doing SSL, even though the two features were inextricably tied together in the same patch, so there’s not a lot more total computational load. These machines have plenty of cores to spare, so it shouldn’t be a thread-thrashing issue either. I expected the multi-threaded numbers to get a bit better, and in all of my other tests that has been the case. Maybe when I get my resource limit increased I’ll see something different in the all-10GbE environment.

That’s pretty much all I have to say about the new instances or GlusterFS running on them. They’re certainly a welcome improvement for this worst-case kind of workload, but I’ve seen their ilk before so the only thing that’s really new to me is the high price tag.


Multi Ring Hashing

Ring-based consistent hashing is one of my favorite algorithms. It's an elegant solution to a common set of problems, and many times I've seen people's eyes light up when they realize that finding data among a set of servers doesn't have to involve central directories or expensive lookups. Systems based on this capability are still emerging and evolving rapidly, because it's so powerful. That said, once you get into actually implementing such a system, you start to realize that there are still some secondary problems that have to be resolved before you have a working system. One of these is the assignment of virtual node IDs or tokens within the hashing ring. Sam Overton at Acunu has recently written up an excellent overview of this issue, rightly pointing out the problems of both one-token-per-server and infinite-token approaches. Of the three approaches he considers, random token assignment with more than one token per server really does seem like the best choice.

However, I think there’s another way that’s even better than that: multiple hash rings. Instead of each node having several tokens in one ring, it has one token in multiple rings. Part of the hash for a key is used to select a ring, then the remainder of the hash is used to find the key in that ring. This is mathematically equivalent to a certain way of allocating multiple tokens within a single ring, such that each of N contiguous ring segments contains a token for each node, as shown in the following diagram (colors represent nodes, three tokens per node).
[Diagram: ring equivalence]
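The two-level lookup described above (part of the key's hash selects a ring, the remainder locates the key within that ring) is easy to sketch. This is a minimal Python illustration under my own assumptions, not any particular system's implementation; the class name, the use of MD5, and the 32-bit token space are all arbitrary choices for the example.

```python
import bisect
import hashlib

class MultiRing:
    """Consistent hashing with one token per node in each of several rings."""

    def __init__(self, num_rings=3):
        # Each ring is a sorted list of (token, node) pairs.
        self.rings = [[] for _ in range(num_rings)]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node, tokens):
        """tokens: one token per ring, e.g. chosen by a stride scheme."""
        for ring, token in zip(self.rings, tokens):
            bisect.insort(ring, (token, node))

    def locate(self, key):
        h = self._hash(key)
        ring = self.rings[h % len(self.rings)]       # low bits select the ring
        point = (h // len(self.rings)) % (2 ** 32)   # remainder picks a spot
        i = bisect.bisect(ring, (point,))            # first token >= point...
        return ring[i % len(ring)][1]                # ...wrapping past the end
```

The wrap-around at the end is what closes each ring on itself, which is exactly the property discussed next.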
Mathematical equivalence isn’t the whole story, though. There are dozens of programming languages that are all Turing-complete and thus mathematically equivalent, and yet some are still better than others either for specific purposes or in general. In this case, one very important difference between the two approaches is that with multiple rings each ring closes itself while with the single ring each segment leads into the next. To see why this matters, consider how such hashing rings are often used to choose replicas as well as initial locations by simply moving around the ring (usually clockwise). It’s highly desirable in such cases for each of a node’s N tokens to have different predecessors and successors, so that when the node fails its load is distributed instead of being concentrated on one alternate. When the replication level R is more than two, it’s desirable for R-1 predecessors and successors to be different, which is even more difficult.
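The replica-overlap concern is easy to check mechanically. Here is a small Python sketch (the token layout below is hypothetical, not the one in the diagram) that finds starting segments in a single ring whose clockwise window of R owners lands on the same node twice:

```python
def replica_sets(ring_owners, r):
    """For each segment of a single ring (listed clockwise), the owners of
    that segment and the next r-1 segments, wrapping around the ring."""
    n = len(ring_owners)
    return [[ring_owners[(i + j) % n] for j in range(r)] for i in range(n)]

def overlapping_starts(ring_owners, r):
    """Starting positions whose replica window hits some node twice."""
    return [i for i, s in enumerate(replica_sets(ring_owners, r))
            if len(set(s)) < r]

# Hypothetical single ring: four nodes, two tokens each, eight segments.
ring = ["blue", "red", "green", "red", "purple", "green", "blue", "purple"]
print(overlapping_starts(ring, 3))  # segments whose R=3 window repeats a node
```

With multiple rings this check is unnecessary by construction, since a window never crosses from one per-node token set into another.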

With that in mind, look at the single ring in the diagram above and consider what happens when R=3. A replica group starting at the purple 7:00 segment would be back on purple at 9:00. Similarly, a group starting on red at 11:00 ends up on red again at 1:00. Now try to fix that, without introducing a similar problem elsewhere. Just for extra fun, consider where you’d have to add the three tokens for a fifth node to preserve proper replication behavior. Token assignment can become quite a puzzle even at this size. It gets considerably more difficult as the node count increases, and the possibility of assigning tokens without dividing them into one-per-node sets doesn’t help (it almost seems to make things worse). By contrast, this problem never even exists in the multi-ring alternative, because the replica sets never overflow from one set of per-node tokens to another set. It’s still necessary to make sure that the rings are in different orders, but even that turns out to be dirt simple. Just assign tokens to each node based on a different “stride width” in each ring, skipping over already-assigned nodes. Here’s how it worked out for the example above.

  • Basic order is: blue, red, green, purple.
  • First ring, stride=1. Just follow the basic order.
  • Second ring, stride=2. First slot is blue, then two steps through the order brings us to green. Another two wraps back around to blue, which is already taken, so we skip ahead to red. After that we revert to counting by two, which gets us purple.
  • Third ring, stride=3. First slot is blue again, then progress three to purple. Three again gets us to green, and finally red.

Ideally, the stride widths (after the first) should be prime with respect both to each other and to the total node count, and greater than the maximum replica count, but that’s not possible with small numbers of nodes. That’s why we end up with failover imbalances such as 2/3 of purple’s load failing over to blue, but there’s never a case where two replicas (even with R=4) end up on the same node in normal operation. In the end, the additional token-assignment flexibility of the single ring provides little practical benefit, while the multi-ring alternative makes life a lot simpler in the quite common case of using the ring(s) to place replicas as well as initial copies.
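The stride-based assignment walked through above fits in a few lines of code. This is my own formulation of the scheme as a Python sketch, not code from any existing implementation:

```python
def stride_order(nodes, stride):
    """Order nodes for one ring by stepping `stride` slots at a time through
    the basic order, skipping any node that has already been placed."""
    n = len(nodes)
    placed, idx = [], 0
    while len(placed) < n:
        while nodes[idx % n] in placed:   # skip over already-assigned nodes
            idx += 1
        placed.append(nodes[idx % n])
        idx += stride                     # revert to counting by `stride`
    return placed

nodes = ["blue", "red", "green", "purple"]
for stride in (1, 2, 3):
    print(stride, stride_order(nodes, stride))
```

Running this reproduces the three ring orderings listed above, one per stride width.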


High Speed Replication Update

A while ago, I wrote about my High Speed Replication translator, comparing it to the existing method of replication. Bottom line: my version was about 50% faster for small synchronous random writes at the time. Well, times have changed. Here’s part of the commit message for my latest update to HSR.

This version works with the official xdata code. Unfortunately, that
code is grossly inefficient so hsrepl is now up to 12% slower than
before. At the same time, the applicability of the various
optimizations in AFR have made it up to 68% faster than before. Net
result is that AFR alone is now up to 8% faster than AFR+hsrepl on my
basic test case. Now that some of the techniques proven in hsrepl are
being imitated in AFR (e.g. asynchronous changelog decrements) the
balance is likely to tip even further in that direction. Its original
purpose fulfilled, this is likely to be the last version of hsrepl as a
separate translator.

Competition is good. If my code is superficially rejected, but my results spur others to make long-overdue improvements to the alternative, I’m not offended. I’m glad. Well done, AFR team.