Archive for November, 2012

How To Build a Distributed Filesystem

This was originally a response on Quora, but it ended up attached to a comment that got downvoted (not by me) and that makes it hard to find so I figured I’d give it some more exposure here. I’ll add some more Gluster-specific commentary at the end, but first I just have to re-use the graphic from my original answer.


Here’s how I outlined some of the steps. They might look familiar.

  1. Implement a simple object server. You just have to be able to create/delete objects and read/write aligned blocks within them at this point.
  2. Decide how objects are going to be placed and found among multiple object servers. Single leader with election and failover? Consensus on each decision? Deterministic algorithm such as consistent hashing or CRUSH? Implement a simple lookup/read/write protocol that can deal with multiple object servers – statically configured for now.
  3. Make the object-server configuration dynamic, so new servers can join the group and new objects can be placed on them. It’s OK for now if a server leaving means its data is lost. At this point you’re mostly just implementing a membership protocol.
  4. Decide how replication is going to work. How are replicas chosen? Is replication sync or async? Do you make replications crash-proof by wrapping them in transactions? By writing to one server and then having it write to others? By using Dynamo-style R/W/N semantics? Implement a basic form of replication so that one server can fail and operations can occur on others. Don’t worry about repair for now.
  5. Worry about repair. If a server comes down and then back up, how does it reconcile state for its objects with state for other copies of those objects on other servers? How are conflicts detected and resolved? This could be a whole semester-long course all by itself, but just implement whatever works for you and make it modular so you can replace it later.
  6. Add rebalancing. As servers are added or removed, data needs to be shifted around to match the new topology. Implement that process.
  7. Add basic filesystem semantics like nested directories and byte-granularity reads/writes. If you chose the Ceph/RADOS (Frangipani/Petal) approach, you’ll need to add a whole metadata layer converting file operations into block operations. If you chose the GlusterFS (PVFS) approach, you’ll need to add that functionality within the object – now file – servers themselves. Either way, you’ll need to deal with things like concurrent reads/writes within a file and listing directories while entries within them are being created/deleted/renamed.
  8. Add basic security – UID/GID and user/group/other permission bits. Make the code check permissions. Might as well add other standard filesystem metadata such as mtime here as well.
  9. Add other operation types – hard and soft links, xattrs, rename, etc.
  10. Add caching and other performance enhancers. Do you want stateful caching with invalidation (or updates)? If so, how will you handle client failures? Or maybe you prefer leases, or time-to-live. Consider your needs for data and metadata (including directory entries) separately. Ditto for prefetching, buffering/batching, etc. Each will be different, so plan and implement accordingly.

At this point you’ll be at approximate parity with the other distributed filesystems out there. Most of these steps should take about half to two months for a suitably skilled individual working full time or nearly so, plus another couple of months for generic stuff like FUSE interfaces and connection management, so you’re looking at a bit over a year. Triple everything if you decide to do it all in the kernel instead, and triple again if you want to make it production quality.

OK, so why did GlusterFS in user space take even longer than that? The first reason is the time in which it was written. Many of the techniques that now seem obvious were very poorly understood (by anyone) at the time, which meant a lot of experimenting and often backtracking. Some components had to be replaced multiple times. Many of the things we depend on, such as FUSE, didn’t work all that well at the time either. Secondly, that same industry-wide unfamiliarity meant that recruiting and training developers and testers was much harder. That meant reduced hands-on productivity both from the new folks (at each iteration) and from the old folks who became mentors. Thirdly, and I think most significantly, the constraints of having been in production and having to deal with actual customer bugs played a part. It’s not easy to drain the swamp when you’re up to your a~~ in alligators, and it’s not easy to write new code when you’re at a customer site fixing the old stuff. If you look at other serious distributed filesystems of approximately the same vintage, you’ll see the same thing. (Don’t even get me started on the planetary waste of brainpower that NFS has been over the same period.) It would be feasible for a single person to develop a semi-decent distributed filesystem in a year or so now (I think about that all the time) but that most certainly wasn’t true in 2006. We’ll all be developing more rapidly now, adding features left and right. Keep that in mind as you evaluate each new announcement and try to guess who’s going to be ahead in another year or two.


Different Forms of Quorum

Several of us had a discussion on IRC about the almost-a-year-old GlusterFS quorum enforcement feature, and what quorum enforcement is likely to look like in the future. I tried to explain briefly some plans that had previously been discussed in email among some of the developers, but hadn’t circulated beyond there, and then I realized I should just write a blog post about that.

First, let’s review what quorum is for. The basic idea is simple: if different sets of nodes in a distributed data-storage system become isolated from one another, it becomes possible for multiple mutually-oblivious groups to make conflicting updates to the same object. The system therefore has two choices: allow this to occur and deal with the conflicts somehow, or prevent it. (Insert all of the usual verbiage about the CAP theorem/conjecture, eventual consistency, etc. here yourself. I’m bored with it.) One way to prevent such conflicts is to enforce some kind of a quorum rule, so that only one still-connected group of nodes can still apply updates. The rest are either limited to reading possibly-stale data, or can’t do anything at all. The key here is that There Can Be Only One such group. It must do something that no other group can do simultaneously. Here are some possible quorum rules:

  • Whichever group can set up disk fencing first has quorum. I know most people see fencing and quorum are separate, but experience has taught me that they’re not. That way lies madness.
  • Whichever group (if any) contains a majority of nodes has quorum. This is by far the most common rule.
  • Whichever group (if any) contains a weighted majority has quorum.
  • Whichever group (if any) has a majority of anchor nodes has quorum. MangoFS used this rule, not that anybody cares.

Yes, I realize that the anchor rules is a special case of the weighted-majority rule. The important thing, as I said, is that all of these rules satisfy the TCBOO condition. If your group has a majority, then no other group does . . . just like our recently concluded election. Even with simple majority quorum, though, there are still some questions that stretch the election analogy even further.

  • Who does the counting? Clients or servers?
  • What kind of objects do you count? Servers, bricks, replica sets?
  • What filter do you apply to those objects? Only those relevant to a particular volume, or all in the cluster?
  • What do you do when N=2, so that the only way to meet quorum is to have all relevant objects present?

Right now the only quorum we have: clients count all bricks within a replica set, and we do nothing special when N=2. That’s useful, better than no quorum leading to massive split-brain problems all the time, but we can do better. What will future quorum look like? It turns out that it’s going to be totally different.

  • Who counts? Servers.
  • What kind of objects are counted? Servers.
  • What filter do we apply? Right now, none; all servers throughout the cluster count toward quorum.
  • What do we do if N=2? Again, nothing.

The above is already represented in a patch. I think it’s awesome, but I’d still like to build on it in two ways. The first has to do with filters. I think it’s unacceptable that a volume which only exists on A+B can go read-only because C+D+E are down. Whole-cluster quorum is fine for management operations, but not for the data path. Therefore, once Pranith’s patch gets merged, I’ll probably submit another on top of it to implement an “only servers with an interest in this volume” filter.

The second issue is trickier. What should we do when N=2? In some cases, allowing a single failure to make a volume read-only (the current behavior) is fine. In others it’s not. One idea would be to “promote” from all bricks/servers in a replica set to all in a volume to all in the cluster. Unfortunately, that gets us nowhere in a two-server two-brick cluster, which is very common especially in the critical case of people trying GlusterFS for the first time and seeing how it responds to failures. The other idea is arbiter nodes, which hold no data but pump up the quorum number for the cluster (or for a volume if that’s how we’re counting). Thus, if we have a volume on two nodes in a cluster with more than two, a third node will be (automatically?) designated as having an interest in that volume so that the effective quorum is two out of three. Since tracking servers’ interests in volumes will already have to be part of the second patch I’ve mentioned, adding an arbiter is just a simple matter of manipulating that information so it should be a pretty straightforward third patch.

So that’s where we are with quorum today, and where we’re likely to be going. Please feel free to add any further questions or suggestions in the comments.


Trying Out MooseFS

In addition to watching the election coverage last night, I spent some time giving MooseFS another try. It’s a project I had high hopes for once, I even considered it as an alternative to GlusterFS as a basis for what was then CloudFS, but I was put off by several things – single metadata server, not modular, inactive/hostile community. When I tested it a couple of years ago it couldn’t even survive a simple test (write ten files concurrently and then read them all back concurrently) without hanging, so I pretty much forgot about it. When somebody recently said it was “better than GlusterFS” I decided it was time to put that to the test. Here are some results. First, let’s look at performance.
GlusterFS vs. MooseFS performance
OK, so it looks way faster than GlusterFS. That’s a bit misleading, though. For one thing, this is with replication turned on. GlusterFS replication is client to both servers; MooseFS is client to one server then that server to others (so the first server can use both halves of a full-duplex link). That means GlusterFS should scale better as the client:server ratio increases, but also that single-client performance will be half that for MooseFS. That’s what the “limit” lines in the chart above show, but those lines show something even more interesting. The MooseFS numbers are higher than is actually possible on the single GigE that client had. These numbers are supposed to include fsync time, and the GlusterFS numbers reflect that, but the MooseFS numbers keep going as data continues to be buffered in memory. That’s neither correct nor sustainable.

The way that MooseFS ignores fsync reminds me of another thing I noticed a while ago: it ignores O_SYNC too. I verified this by looking at the code and seeing where O_SYNC got stripped out, and now my tests show the same effect. Whereas the GlusterFS IOPS numbers with O_SYNC take the expected huge hit relative to those above, the MooseFS numbers actually improve somewhat. POSIX compliant, eh? Nope, not even trying. As a storage guy who cares about user data, I find that totally unacceptable and sufficient to disqualify MooseFS for serious use. The question is: how hard is it to fix? Honoring O_SYNC isn’t just a matter of passing it to the server, which would be easy. It’s also a matter of fixing the fsync behavior to make sure the O_SYNC request actually gets to the server, and – a little more arguably – to all servers that are supposed to hold the data. Those parts might be more difficult.

In any case, let’s take a more detailed look at how MooseFS compares to GlusterFS.

  • GOOD: performance. Despite everything I say above, it looks like MooseFS really does perform better than GlusterFS, at least for some important use cases. I’ll give them credit for that.
  • GOOD: per-object replication levels (“goals”). This is a feature I personally plan to add to GlusterFS some day, but it’s some day in the far future. :(
  • GOOD: snapshots. This one’s actually in the GlusterFS road map, but it’s not there today.
  • MIXED: self-heal and rebalance are more automatic, but also more opaque. I couldn’t find even the sketchiest documentation of this on their website. Apparently the only way to know what they’re doing, and if it’s the right thing, is to read the code. Besides being poor form for an open-source project (if dumping code from a private repository into a public one twice a year even qualifies), that also means it’s unlikely that the user will ever have significant control over when or how these things are done.
  • BAD: clunky configuration. Just a bunch of files in /etc/mfs with very little documentation. This isn’t just cosmetic; if you want to build a really big system, this approach just won’t scale. I don’t think GlusterFS’s approach to managing configuration data scales as well as it should either, but it’s still way better than this.
  • BAD: no online reconfiguration/upgrade. AFAICT, chunk servers can be added or removed on the fly, but anything else requires disruptive restarts. GlusterFS can already do most forms of reconfiguration online, and soon upgrades will be possible that way too.
  • BAD: no other protocols. No built-in NFS (requires re-export so the performance advantage would shift to GlusterFS), no UFO, no Hadoop integration, no qemu integration.
  • BAD: no geo-replication.
  • BAD: not readily extensible, as with GlusterFS translators.

Basically, if performance matters to you more than data integrity (e.g. the data exists elsewhere already), and/or if you really really need snapshots right now, then I don’t see MooseFS as an invalid choice. Go ahead, knock yourself out. You can even post your success stories here if you want. On the other hand, if you have any other needs whatsoever, I’d warn you to be careful. You might get yourself in deep trouble, and I’ve never seen anyone say the developers or other community were any help at all.


The Importance of Staying Sequential

One of the most important aspects of disk performance is the difference between seek latency and rotational latency. To put it simply, the time it takes to seek between tracks is at least an order of magnitude greater than the time it takes for the disk to spin – and that’s already more orders of magnitude greater than just about anything else that happens inside a computer. Thus, even though a disk provides random access, not all random accesses are equal; depending on where the disk head is after one operation, the time for the next can vary a great deal. Memory doesn’t have this behavior. Solid-state disks don’t have this behavior. Even networks don’t have this behavior; if it takes you X milliseconds to send one packet, it will probably take close to X milliseconds to send the next. Disks are special this way, and likely to become more rather than less special. Because of all this, operating systems and even some applications have evolved to generate sequential I/O (which doesn’t require disk seeks) instead of random (which does). What a lot of people tend to miss, even though it’s pretty obvious in retrospect, is that sequential I/O tends to become pathological I/O when there’s a journal on the same disk. That clicking you hear on a busy disk might well be the head seeking from the journal at one end of the disk to the actual data at the other – over and over and over and over again. How bad is it? Here’s a very simple graph.
sequential vs. random I/O
That’s from a couple of my test machines. The disks are pretty old (147GB is almost laughable nowadays) and it’s my usual worst-case test – synchronous, small, random writes. Wait, random? Weren’t we supposed to be comparing random to sequential? Well, yes. This is random I/O from the user perspective, but only for the red “single disk” line is it random when it hits disk. The filesystem actually does a pretty good job of de-randomizing the placement of those data blocks to avoid seeks, and the block I/O scheduler helps too. The journal I/O is inherently sequential, but what’s left is still the data/journal thrashing I mentioned. Thus, the green “separate journal” line shows what happens when we move the journal to a separate disk, so now we have two sequential streams on two separate devices instead of contending for one. Notice how the green line is more than twice as high as the red line. This isn’t just a matter of having twice as much hardware. It’s even more a matter of using that hardware more efficiently.

That brings us to SSDs. I don’t happen to have any SSDs handy, but it should be pretty clear how they might fit into this picture. Maybe you can afford to go all-SSD for your data and maybe you can’t. If you can, then you don’t need to worry about any of this stuff so why did you even read this far? If you can’t, then there’s still something you can do. Because they don’t have the “seek chasm” that spinning disks do, a single SSD can handle the logs for many spinning disks. The fact that SSDs are smaller shouldn’t matter either, because journals are smaller too. Thus, a bunch of disks plus a single SSD might give you better overall performance than more/faster disks without an SSD. Implications regarding the types and costs of block storage in public clouds are left as an exercise for the reader.