Archive for January, 2013

Update on Ceph vs. GlusterFS

Since my last post has generated a bit of attention, I want to make sure the most important parts are not lost on anyone. First, let me reiterate: I love Ceph. I value Sage as a colleague and as an ally in the real fight. It would sadden me greatly if my comments had an adverse effect on that relationship. Partly I was writing out of frustration at being constantly compared to a vision. I can’t test a vision. The promise is there, but I can only test the reality. Partly I was also writing out of disappointment, because I know Ceph can do better. I have total faith in the quality of their architecture, and in the talent of their team. If there are glitches, they can be fixed. Maybe I’m trying to light a fire under them, but I don’t intend for anyone to get burned.

Second, how about that real fight? As I said, Ceph and GlusterFS are really on the same side here. The real fight is against proprietary storage, non-scalable storage, and functionally deficient storage. Users deserve better. As I’ve said in person many times, we have to win that battle before we squabble over spoils. The Bad Guys will laugh themselves silly if we tear each other apart. Ceph folks, if I have given offense I apologize. That was not my intent. I want us both to win, but to win we have to be honest with users about where we’re strong and where we still need to improve. So let’s measure, and improve, and kick some storage-industry ass together.

 

GlusterFS vs. Ceph

Everywhere I go, people ask me about Ceph. That’s hardly surprising, since we’re clearly rivals – which by definition means we’re not enemies. In fact I love Ceph and the people who work on it. The enemy is expensive proprietary Big Storage. The other enemy is things like HDFS that were built for one thing and are only good for one thing but get hyped relentlessly as alternatives to real storage. Ceph and GlusterFS, by contrast, have a lot in common. Both are open source, run on commodity hardware, do internal replication, scale via algorithmic file placement, and so on. Sure, GlusterFS uses ring-based consistent hashing while Ceph uses CRUSH, GlusterFS has one kind of server in the file I/O path while Ceph has two, but they’re different twists on the same idea rather than two different ideas – and I’ll gladly give Sage Weil credit for having done much to popularize that idea.

It should be no surprise, then, that I’m interested in how the two compare in the real world. I ran Ceph on my test machines a while ago, and the results were very disappointing, but I wasn’t interested in bashing Ceph for not being ready so I didn’t write anything then. Lately I’ve been hearing a lot more about how it’s “nearly awesome” so I decided to give it another try. At first I tried to get it running on the same machines as before, but the build process seems very RHEL-unfriendly. Actually I don’t see how duplicate include-file names and such are distro-specific, but the makefile/specfile mismatches and hard dependency on Oracle Java seem to be. I finally managed to get enough running to try the FUSE client, at least, only to find that it inappropriately ignores O_SYNC so those results were meaningless. Since the FUSE client was only slightly interesting and building the kernel client seemed like a lost cause, I abandoned that effort and turned to the cloud.

For these tests I used a pair of 8GB cloud servers that I’ve clocked at around 5000 synchronous 4KB IOPS (2400 buffered 64KB IOPS) before, plus a similar client. The very first thing I did was test local performance to verify that local performance was as I’d measured before. Oddly, one of the servers was right in that ballpark, but the other was consistently about 30% slower. That’s something to consider in the numbers that follow. In any case, I installed Ceph “Argonaut” and GlusterFS 3.2 because those were the ones that were already packaged. Both projects have improved since then; another thing to consider. Let’s look at the boring number first – buffered sequential 64KB IOPS.

[Graph: buffered sequential 64KB IOPS, Ceph vs. GlusterFS]

No clear winner here. The averages are quite similar, but of course you can see that the GlusterFS numbers are much more consistent. Let’s look at the graph that will surprise people – synchronous random 4KB IOPS.

[Graph: synchronous random 4KB IOPS, Ceph vs. GlusterFS]

Oh, my. This is a test that one would expect Ceph to dominate, what with that kernel client to reduce latency and all. I swear, I double- and triple-checked to make sure I hadn’t reversed the numbers. My best guess at this point is that the FUSE overhead unique to GlusterFS is overwhelmed by some other kind of overhead unique to Ceph. Maybe it’s the fact that Ceph has to contact servers at both the metadata (MDS) and object-store (RADOS) layers for some operations, while GlusterFS needs only a single round trip. That’s just a guess, though. The important thing here is that a lot of people assume Ceph will outperform GlusterFS because of what’s written in a paper, but what’s written in the code tells a different story.
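For the record, a synchronous 4KB write test of this sort can be approximated with a quick dd loop like the one below. This is a stand-in sketch, not the exact tool I used, and `TESTDIR` is a placeholder for wherever the filesystem under test is mounted.

```shell
# Synchronous 4KB write test: oflag=dsync forces each write to stable
# storage before the next one starts, so COUNT / elapsed time ~ sync IOPS.
# TESTDIR is a hypothetical placeholder -- point it at the mounted filesystem.
TESTDIR=${TESTDIR:-/tmp}
COUNT=200

start=$(date +%s.%N)
dd if=/dev/zero of="$TESTDIR/sync4k.dat" bs=4k count="$COUNT" oflag=dsync 2>/dev/null
end=$(date +%s.%N)

awk -v c="$COUNT" -v s="$start" -v e="$end" \
    'BEGIN { printf "sync 4KB IOPS: %.0f\n", c / (e - s) }'
rm -f "$TESTDIR/sync4k.dat"
```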

Just for fun, I ran one more set of tests to see if the assumptions about FUSE overhead at least held true for metadata operations – specifically directory listings. I created 10K files, did both warm and cold listings, and removed them. Here are the results in seconds.

                 Ceph    GlusterFS
create        109.320      184.241
cold listing    0.889        9.844
warm listing    0.682        8.523
delete         93.748       77.334
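For anyone who wants to replicate this, the test amounts to something like the following sketch. `MNT` is a placeholder for the mount point of the filesystem under test; this is the shape of the test, not the exact script I ran.

```shell
# Create N files, list the directory cold and warm, then delete everything.
# MNT is a hypothetical mount point; N matches the 10K used above.
MNT=${MNT:-/tmp}
N=${N:-10000}
mkdir -p "$MNT/metatest"

time for i in $(seq 1 "$N"); do : > "$MNT/metatest/f$i"; done   # create
# For a genuinely cold listing, drop the page/dentry caches first (root only):
#   sync; echo 3 > /proc/sys/vm/drop_caches
time ls -l "$MNT/metatest" > /dev/null                          # cold listing
time ls -l "$MNT/metatest" > /dev/null                          # warm listing
time rm -rf "$MNT/metatest"                                     # delete
```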

Not too surprisingly, Ceph beat GlusterFS in most of these tests – more than 10x for directory listings. We really do need to get those readdirp patches in so that directory listings through FUSE aren’t quite so awful. Maybe we’ll need something else too; I have a couple of ideas in that area, but nothing I should be talking about yet. The real surprise was the last test, where GlusterFS beat Ceph on deletions. I noticed during the test that Ceph was totally hammering the servers – over 200% CPU utilization for the Ceph server processes, vs. less than a tenth of that for GlusterFS. Also, the numbers at 1K files weren’t nearly as bad. I’m guessing again, but it makes me wonder whether something in Ceph’s delete path has O(n²) behavior.

So, what can we conclude from all of this? Not much, really. These were really quick and dirty tests, so they don’t prove much. It’s more interesting what they fail to prove, i.e. that Ceph’s current code is capable of realizing any supposed advantage due to its architecture. Either those advantages aren’t real, or the current implementation isn’t mature enough to demonstrate them. It’s also worth noting that these results are pretty consistent with both Ceph’s own Argonaut vs. Bobtail performance preview and my own previous measurements of a block-storage system I’ve been told is based on Ceph. I’ve seen lots of claims and theories about how GlusterFS is going to be left in the dust, but as yet the evidence seems to point (weakly) the other way. Maybe we should wait until the race has begun before we start predicting the result.

 

Split and Secure Networks for GlusterFS

Users often ask if there’s a reasonable way to run GlusterFS so that clients connect to the servers over a public front-end network while the servers connect to each other over a separate private back-end network. This is actually a really good idea, because it helps to isolate things like self-heal and rebalance traffic on a separate network from the one the clients are using (though they still contend for resources on the servers so it’s no panacea). The basic problem is that both client-server and server-server connections are done using the same names, which in a normal configuration will resolve for everyone to the same addresses on the same network. The dirty way to deal with this is to use /etc/hosts or “split horizon” DNS so that the names actually resolve differently – to the front end when queried from the client and to the back end when queried from a server. You can also use host routes. These approaches basically work, but they have some flaws.

  • They require extra server configuration (i.e. outside of GlusterFS) that might be forgotten when servers are replaced or rebuilt. If that happens, you won’t get a clean or obvious failure. Instead you’re likely to get an overloaded front-end network.
  • They affect all traffic between the servers, even that not related to GlusterFS.
  • Inconsistent name resolution often leads to all sorts of other hard-to-debug problems.
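For concreteness, the /etc/hosts variant of the dirty way looks something like this. All addresses here are made up for illustration; the point is just that the same names resolve differently on each side.

```shell
# Split-horizon /etc/hosts sketch (addresses are made up for illustration).
# On every client, the server names resolve to the front-end network:
cat >> /etc/hosts <<'EOF'
192.0.2.11  gfs1
192.0.2.12  gfs2
EOF

# On every server, the very same names resolve to the back-end network:
cat >> /etc/hosts <<'EOF'
10.0.0.11   gfs1
10.0.0.12   gfs2
EOF
```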

I’m going to suggest something that doesn’t solve the first problem (in fact it’s even uglier in many ways) but might avoid the other two. Let’s say you have two servers (gfs1 and gfs2), each with interfaces on two separate networks (front and back). If you also have a reasonably modern version of iptables, you can do something like this on gfs1.

iptables -t nat -A OUTPUT -p tcp -d gfs2-front -j DNAT --to-destination gfs2-back
iptables -t nat -A POSTROUTING -p tcp -s gfs1-front -d gfs2-back -j SNAT --to-source gfs1-back

Then you do the inverse on gfs2 and voilà – any connections initiated by one to the other get forced onto the back-end network even though they’re using the front-end names and addresses. You can actually generalize the POSTROUTING rule so that it uses netmasks instead of specific hosts, but you’re pretty much stuck adding a new OUTPUT rule for each peer. I told you it’s uglier than the other approaches. The key thing that makes this approach different is that you can make it specific to GlusterFS traffic, leaving other traffic alone, by adding more iptables rules for specific ports. For example:

iptables -t nat -I OUTPUT -p tcp --dport 655 -j ACCEPT

Because this is an insert instead of an append, this rule is hit first and exempts port 655 from the back-end forcing. This is actually pretty important in certain cases. For example, some readers might have recognized port 655 as the one used by tinc. I don’t actually have two physical networks, so I used tinc to create a sort of VPN with separate interfaces. If I hadn’t added the exception above, tinc’s traffic would get forced onto the VPN that tinc itself is providing. That causes an infinite loop and everything breaks, as I found many times before I came up with the “magic” iptables formula above.
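For completeness, here’s how the rules generalize: the POSTROUTING rule can match whole networks instead of individual hosts, and the DNAT can be narrowed to GlusterFS’s own ports so unrelated server-to-server traffic is left alone. The addresses below are made up, and the port range reflects GlusterFS 3.2 defaults as I recall them (24007 for glusterd, brick ports from 24009 up) – check your own deployment before copying it.

```shell
# Generalized SNAT: match whole networks instead of individual hosts
# (192.0.2.0/24 = hypothetical front end, 10.0.0.0/24 = hypothetical back end)
iptables -t nat -A POSTROUTING -p tcp -s 192.0.2.0/24 -d 10.0.0.0/24 \
         -j SNAT --to-source gfs1-back

# Narrow the DNAT to GlusterFS ports only, still one rule per peer
# (24007:24100 is a guessed range covering 3.2-era glusterd plus bricks)
iptables -t nat -A OUTPUT -p tcp -d gfs2-front --dport 24007:24100 \
         -j DNAT --to-destination gfs2-back
```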

There’s more to using tinc than just simulating an extra network, though. It’s also a secure network. Using this trick, you can have your server-to-server connections encrypted and authenticated, without needing to change anything in GlusterFS. I imagine that tinc’s compression feature could also be useful in some cases, though in general probably not.
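For reference, the tinc side of this amounts to something like the following. This is a from-memory sketch of a minimal configuration, not the exact one I used; the net name "gfsback" and all addresses are made up.

```shell
# Minimal tinc sketch (hypothetical net name "gfsback", made-up addresses).
#
# /etc/tinc/gfsback/tinc.conf on gfs1:
#   Name = gfs1
#   ConnectTo = gfs2
#
# /etc/tinc/gfsback/hosts/gfs1 (public key appended by: tincd -n gfsback -K):
#   Address = 192.0.2.11
#   Subnet = 10.0.0.11/32
#
# /etc/tinc/gfsback/tinc-up:
#   ifconfig $INTERFACE 10.0.0.11 netmask 255.255.255.0
#
# Exchange the hosts/ files between servers, then start the daemon:
#   tincd -n gfsback
```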

None of this is any substitute for having advanced networking features in GlusterFS itself. Messing around with tinc and iptables is still awkward at best, there are significant performance implications, it doesn’t do anything for RDMA because it’s all IP-based, and it still doesn’t handle even more advanced configurations such as multiple front-end networks. Still, some people might like to know that it’s possible to do these things today, without having to wait for the appearance of these features in the code itself.