Archive for May, 2012

Back Door Async Replication

My standing search for “glusterfs” on Twitter got me into an interesting discussion with Dr. Shawn Tan about an interesting GlusterFS configuration for several workstations. At first my reaction was panic, because I could see potential for data loss in that configuration, but “you’re using it wrong” is rarely a productive response from a developer. Dr. Tan’s use case is actually quite valid. It just happens to be one that GlusterFS doesn’t deal with very well right now, so instead of spending my time feeling superior I spent it thinking about what we could do differently instead. The basic parameters here are:

  • Provide shared access to data across several workstations in a cluster.
  • Use replication to ensure that data can survive a workstation failure.
  • Maximize the percentage of reads that are local.
  • Maximize the percentage of writes that are local.

The first two points are pretty clearly good reasons to use GlusterFS or something like it. It’s the third and particularly fourth points where this configuration runs into trouble. Caching data for reads is fine, as long as you understand your consistency needs and don’t give up more consistency than your applications or users can really tolerate. Caching (actually buffering) writes is much more dangerous. If your goal is to ensure that data can survive a workstation failure, then writing it only on the workstation and leaving it there forever is simply a non-starter. However, writing it initially on the workstation and writing it asynchronously to another machine might shrink the window of vulnerability to an acceptable level. GlusterFS’s “geosync” won’t do many-to-one replication, which would be very space-inefficient anyway, so how else might we get something like the same effect?

One part of the solution would be something like the “NUFA” (Non Uniform File Access) feature that GlusterFS used to have, but which has been deprecated since “DHT” replaced “unify” before I was even seriously involved with the project. The idea behind NUFA was to create files locally whenever possible, and write to those local files instead of via the “normal” kind of distribution. That sounds ideal except that as soon as you add replication you’re writing remotely again, so NUFA’s not doing anything for you . . . or is it? It’s making sure one copy of the file is local. What if we could make that the only copy initially, so that we only do local writes in the main I/O path, but with some control over how long it takes before that data does get replicated somewhere else? That’s where my bypass translator comes in. It’s not production quality yet, but it basically fits the bill of setting up two replicas and then populating only one. Since it works with the regular AFR translator, you can then use regular self-heal to propagate the changes to another replica. With self-heal becoming more precise, thanks to lots of hard work by other members of the GlusterFS team, this will soon be a pretty efficient process. The only thing you lose is strict ordering. Thus, you might not want to do this for parallel applications that depend on such ordering but it’s probably fine for sharing on the time scale of a person working on one workstation before lunch and a different one after. Your I/O flow would be something like this:

  • User at workstation A creates a new file and writes data into it. The file is actually created on A and B, but the data is only written into A (with all of the pending flags set so AFR knows that B needs an update).
  • Every ten minutes, a job runs that “self-heals” partially written files from A to B, A to C, or in any other case where A shares a replica set with one of the other workstations.
  • The same user subsequently sits down at workstation C and reads the same file. It’s read from A and/or B, which is unfortunate, but that’s a case where we can deploy caching to remove some of the sting.

If a user gets up and moves immediately to another workstation, they might see stale data. If they then start modifying the same files from the new workstation, split-brain problems could result. Similarly, if a failure happens on A before a new file has been propagated to B, then that file’s actual contents will be unavailable until A (or at least the relevant disk from A) comes back. These might be absolute show-stoppers for most people. On the other hand, I could see using something like this myself e.g. between my work machine, my home machine, one or more laptops, and my general-purpose server at Rackspace. On the third hand, maybe I’ll just wait until I have a chance to work on my personal Patch Based Filesystem project which would handle this exact use case even better.

As a further optimization, the system could detect reads and/or existing-file writes at C for files living on A and B, then start pro-actively relocating related files so that one copy is at C. Ideally, this would mean adding a replica at C (turning two-way replication into three-way) and then deleting the replica at A or B (going back to two-way) so that data protection is maintained throughout. Unfortunately, current GlusterFS replication doesn’t offer this kind of flexibility to change replication levels on a per-file or temporary basis. We don’t have the detection of changes to trigger this migration, or the policy to identify related files, so this is all clearly in the future anyway. However, it’s a direction we could go, step by incremental step, providing at least a little bit of additional value to at least a few users each time.


GlusterFS License Change

As part of a continuing effort to broaden the potential uses of GlusterFS, the license for (most of) GlusterFS has been changed as follows (some details might vary between files).

  Copyright (c) 2008-2012 Red Hat, Inc. <>
  This file is part of GlusterFS.
  This file is licensed to you under your choice of the GNU Lesser
  General Public License, version 3 or any later version (LGPLv3 or
  later), or the GNU General Public License, version 2 (GPLv2), in all
  cases as published by the Free Software Foundation.

Disclaimer: I am not a lawyer, nor am I representing Red Hat. These are my personal interpretations and predictions as a developer on an open-source project, regardless of qualifications or affiliation.

In practical terms, this means you can choose GPLv2 if you want to link with code already under that license (which FSF itself deems incompatible with GPLv3), or LGPLv3 if you want to link with just about anything else. There are some server-only parts that are still under only GPLv3+ to prevent others from combining our open source with their own closed source to sell a competing product against us, but even that’s still very much open source and always will be. This combination of licenses is a bit weird, but it has been carefully thought out to be consistent with both our ideals and commercial reality. In particular, it should enable users to combine our code with proprietary code for their own private use (which makes me sad but is part of that commercial reality) while also enabling developers to create new translators without having to worry about license conflicts or losing control of their own code. Because enabling translator development is a goal that I’ve spent hundreds of hours pursuing, this is a very welcome change. Yay!


The Quest For Balance

One of the key features of GlusterFS, or any horizontally scalable system like it, is the ability to rebalance data as servers are added, removed, etc. How is that done? Come to think of it, what does “balance” even mean in such a system, and why is it so important to have it? Intuitively, balance has to do with the idea of each node doing a fair share of the total work. If you’re in balance, each node is doing very close to its fair share. If you’re out of balance, some nodes are doing far more while others are doing far less. For us, “work” means hosting files – more files on a server means more work for that server. In steady state, with a large number of files over a long period of time, consistent hashing does a remarkably good job of achieving this balance. However, just after a server has been added or removed, we’re way out of balance. The new server has no new files, and the DHT layouts won’t put any there. Thus, we need to adjust the layouts to include a hash range for the new server so that files will be placed there. In some cases that’s sufficient, and we can let allocation of new files take care of restoring balance, but in most cases we’ll need to move files there more proactively. In any case, the big question becomes how to assign the new server’s hash range. To do that, we need to consider two things.

  • A metric by which we determine the ideal size for each server’s hash range – and therefore whether it’s currently over- or under-loaded.
  • A method by which we determine the ideal position for the new server’s hash range.

For the rebalancing metric, we have several choices. The simplest is to say that each server has the same value, so each server gets an equally sized hash range. (This actually all happens per “brick” rather than per server, but I’m using the more familiar term to avoid having this be unnecessarily GlusterFS-specific.) Another option would be to assign ranges proportional to each server’s total storage capacity. A third option is to assign ranges proportional to each server’s free storage capacity. This has the effect of driving new file allocations toward (relatively) empty servers. For archival workloads where deletes are very uncommon, this will cause the system to converge “naturally” on a near-optimal balance without the need for proactive data migration. There is a slight downside to having the layouts not match the actual current distribution or files, making lookups ever so slightly slower, but I doubt the effect would even be noticeable. More importantly, for workloads where files are being deleted and recreated frequently, this approach could lead to the new servers becoming overloaded and require a data migration back to the old servers. In general, it would be unwise to use this metric without recalculating the layouts periodically.

For the rebalancing method, we have two slightly-conflicting goals. One is to achieve perfect balance in the long term. The other is to minimize data motion in the short term. As it turns out, most methods tend to favor one goal over the other. For example, the simplest method is just to assign the new server’s range at one end of the total range, and shift every other server’s range to make room. This gives “perfect” distribution according to our rebalance metric, but often involves horrendous amounts of unnecessary data motion. To see why this is the case, consider a system as it goes from four servers to five. For simplicity we’ll assume the server’s rebalance metrics are equal and that we have 100 possible hash values.

  • Server A starts with range 0-24, and ends with 0-19, so 5% of the total range is reassigned from A to B.
  • Server B starts with range 25-49, and ends with 20-39, so 10% of the total range is reassigned from B to C.
  • Server C starts with range 50-74, and ends with 40-59, so 15% of the total range is reassigned from C to D.
  • Server D starts with range 75-99, and ends with 60-79, so 20% of the total range is reassigned from D to E.
  • Server E starts with nothing, and ends with 80-99.

Notice how the percent-moved number keeps increasing at every step? That’s the direct result of adding at the end, leading to 50% data motion as we add 25% of the original capacity. Let’s contrast this with another “optimal” method that yields the exact same range sizes but tries to choose a minimally disruptive place to add the new range (which happens to be in the middle this time).

  • Server A starts with range 0-24, and ends with range 0-19, so 5% of the total range is reassigned from A to B.
  • Server B starts with range 25-49, and ends with range 20-39, so 10% of the total range is reassigned from B to E (not C).
  • Server C starts with range 50-74, and ends with range 60-79, so 10% of the total range is reassigned from C to E.
  • Server D starts with range 75-99, and ends with range 80-99, so 5% of the total range is reassigned from D to C.

This gives us exactly the same ultimate result as before, but this time only moving 30% of the data. That could be hundreds of terabytes in many GlusterFS installations, so it’s a pretty big deal. But wait, we can do better. If we wanted to minimize data movement even more, even at the expense of a slightly less optimal final distribution, we could try a couple of other approaches.

  • “Split one” picks the most overloaded current server and splits its hash range proportionally according to our metric for the two servers involved (one old plus one new). This has an almost fractal behavior of dividing the hash space first into halves, then into quarters, etc. The difference between the largest and smallest range can be as much as 2x, but data migration is very low (approximately 1/2^n).
  • “Split two” picks the most overloaded two current servers with adjacent ranges, and splits the combined range proportionally according to our metric for the three servers involved (two old plus one new). This gives more nearly optimal final placement, with hardly any increase in data movement.

There are many more possibilities, and more still if we consider the algorithms one might use for adding more than one server at a time, but this should give some idea of the possibilities. We already have three metrics and four methods, as follows.

  • Metrics: equal, total size, free size
  • Methods: append range, insert range, split one, split two

The good news is that we have nowhere to go but up. GlusterFS currently uses equal+append, which is worse than every other metric and every other method. I already have code to use total or free size as the metric, and insert range or split two as the method. In the not-so-distant future I hope to finish the work that’s necessary to make these available as options in GlusterFS itself (currently the code only does calculations and is not integrated with the actual fix-layout code). I think that most people would be best off using either free+insert or total+split2 most of the time depending on whether their workload matches the “archival” pattern mentioned above, then using total+insert less often (e.g. during planned downtime or predicted slack periods) to fix up the accumulated imperfections in the resulting layouts. However, the real point of both the code and this article is to let users make their own choices about how best to determine what “balance” means for them and how to restore it in a way that best suits their needs.