Archive for June, 2012

Healing Split Brain

As people who attended my recent Red Hat Summit talk are aware, one of the big issues with GlusterFS replication is “split brain”, which occurs when conflicting updates are made to replicas of a file. Self-healing in the wrong direction risks data loss, so we won’t do self-heal if we detect split brain, and we err on the side of caution by treating as split brain some conditions that might actually be resolvable. Unfortunately, this can lead to situations in which we won’t do self-heal at all and the file remains inaccessible until the administrator manually resolves the split brain. Even more unfortunately, manual resolution means poring through logs and manually mapping from subvolume names to physical storage locations. Most unfortunately of all, the last step of manual resolution involves doing exactly what we generally tell people not to do and will soon forbid – modifying the back-end data directly on the servers.

Clearly, the best approach to split brain is to prevent it, for example by enabling the quorum enforcement feature that I implemented a while ago. However, the conditions that can cause split brain are not nearly as rare as we would like them to be, and a little help with manual resolution can go a long way. That’s where my new script comes in. At the very least, it will do some of the drudge work of parsing configurations, fetching extended attributes, etc. for the files you tell it to heal. If it still can’t heal a file, it will at least tell you why, in something approximating human language and without requiring you to search through every log file. That’s not all, though. It also uses algorithms that are a little different than the ones in the regular self-heal, so it can recognize and correct some more conditions:

  • In its “aggressive” mode, it will resolve some “wise fool” and “two fool” conditions that standard self-heal will give up on, if the pending-operation counters give us good reason to believe that some “accusations” should be withdrawn or reversed. (See my article on replication internals for explanations of these strange terms.) This can break some accusation loops that cause us to declare split brain.
  • Regardless of aggressive vs. normal mode, it will detect when file contents are identical and clear the pending-operation counts so that the file becomes accessible again. This is kind of a last-ditch attempt to get the data unblocked, after all of our other methods have failed.
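
The identical-contents check in the second bullet can be sketched roughly as follows. This is only an illustration, not the script's actual code: the function names `file_digest` and `replicas_identical` are hypothetical, and it assumes each replica can be read as an ordinary local path (for example through per-brick mounts). It compares file data only; a real check might also need to compare ownership and permissions before clearing any counters.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicas_identical(paths):
    """Return True if every replica path holds byte-identical data.

    An empty list of paths yields False, since there is nothing to compare.
    """
    digests = {file_digest(p) for p in paths}
    return len(digests) == 1
```

If `replicas_identical` returns True for all copies of a file, there is no "wrong direction" to heal in, which is why clearing the pending-operation counts is safe in that case.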

Obviously, more aggressive self-heal means higher potential for data loss if we make the wrong decision. That’s why I wrote it to look only at files you specify, instead of doing a full scan. That’s why I went a little further than usual in writing tests for it. Think of it the same way as you would think of a wipe and restore from tape, when regular self-heal has definitely failed and regaining access to the file is critical even if the version you end up with is slightly out of date. It’s certainly not supported in any way by Red Hat, and my colleagues would be within their rights to disavow or even condemn it.

The script is designed to be run offline and on a server, though it can run online and on a client (so long as that client has the gluster CLI installed). You’ll need everything in the GitHub directory I linked to above, and then you’d run the script with arguments something like this:

    myvol server1:/export/sdd path/to/broken/file another/broken/file

The second argument can be the path (e.g. from “gluster volume info” or the trusted.glusterfs.pathinfo synthetic extended attribute) for any brick where the affected files reside, and you can specify as many of those as you like. The script will then mount all of the bricks containing replicas, use those to fetch the pending-operation counts on all replicas, and try to figure out what kind of repair to do. If you’re having problems with split brain, it’s one more thing you can try before you go poking around on the back-end storage or give up entirely. Because of the inherent complexity of the problems it’s trying to solve, though, I can’t guarantee that it will fix your particular split-brain problem. Good luck.
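
To make the "fetch the pending-operation counts and figure out what to do" step a bit more concrete, here is a minimal sketch of decoding the counters and spotting the simplest kind of conflict, where two replicas each accuse the other. It is not the script's actual code. The names `parse_pending` and `find_mutual_accusations` are hypothetical, and the assumed on-disk layout (a 12-byte value holding three big-endian 32-bit counters for data, metadata, and entry operations) is my understanding of the replication changelog format rather than something stated in this post.

```python
import struct


def parse_pending(value):
    """Decode one pending-operation xattr value into (data, metadata, entry).

    Assumption: the value is 12 bytes, three big-endian unsigned 32-bit ints.
    """
    if len(value) != 12:
        raise ValueError("expected a 12-byte value, got %d bytes" % len(value))
    return struct.unpack(">III", value)


def find_mutual_accusations(pending):
    """Return pairs (i, j), i < j, where replicas i and j accuse each other.

    pending[i][j] is the count that replica i holds against replica j.
    """
    n = len(pending)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if pending[i][j] > 0 and pending[j][i] > 0
    ]
```

On Linux, the raw values could be read from a file on a brick with `os.getxattr(path, name)`, where the attribute name is assumed to look something like `trusted.afr.<volume>-client-<N>`. A non-empty result from `find_mutual_accusations` corresponds to the case where neither copy can be trusted automatically.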


Never Trust Anyone Over 3.3

Probably everybody who cares already knows that GlusterFS 3.3 has been released. I find it amusing that quorum enforcement is listed as one of the marquee features, since it’s really such a trivial bit of code compared to the other features on the list, or to internal but significant changes such as the new GFID-based back end, but I guess it’s importance to users rather than effort that defines the list. I think it’s also worth mentioning that the release includes a ton of minor fixes that came from running static analysis, and tweaks from the performance team, and other things that are individually small but add up to some pretty major improvements. Maybe some time soon I’ll run some benchmarks of 3.2 vs. 3.3 to show just how dramatic the differences can be. That’s not what I’m here to talk about, though. Now that 3.3 is out the door, a lot of deferred changes have started moving through the queue. Here are some examples of things that are well on their way to becoming part of 3.4, and that I might have mentioned here or in talking to people.

  • User-specified DHT layouts that won’t get stomped when you rebalance: already merged
  • Selection of AFR “read child” via hash to avoid hot spots: already merged
  • Server-side resolution of auxiliary GIDs, to support more than 16: already merged
  • Reliable selection of local AFR “read child”: in active review
  • SSL (for I/O path) and socket multi-threading: refreshed and in review
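
As a rough illustration of the hash-based “read child” idea in the list above, the sketch below picks a replica deterministically from a file identifier, so that reads of different files spread across replicas instead of all landing on the first one. It is not the GlusterFS implementation: the function name `pick_read_child`, the use of SHA-256, and the choice of hashing a GFID-like byte string are all assumptions made for the example.

```python
import hashlib


def pick_read_child(file_id, up_children):
    """Pick one available replica index deterministically from a file ID.

    file_id is a bytes identifier (hypothetically, the file's GFID);
    up_children is a list of indices of replicas that are currently up.
    """
    if not up_children:
        raise ValueError("no replicas available")
    # Hash the identifier so that the choice is stable for a given file
    # but varies from file to file.
    h = int.from_bytes(hashlib.sha256(file_id).digest()[:8], "big")
    candidates = sorted(up_children)
    return candidates[h % len(candidates)]
```

Every client that sees the same set of available replicas chooses the same one for a given file, while the load across many files is split roughly evenly.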

The other HekaFS features are separately getting a bit unstuck. Kaleb is charging ahead with essential infrastructure for the namespace and ID isolation. Edward is cleaning up the at-rest encryption code for submission. So, what am I going to be doing? I have a few more tweaks in mind around replication and distribution (including those I’m maintaining in my own GitHub trees), but mostly I expect to be working on the infamous …and a pony replication. In my not-humble-at-all opinion, that’s the one feature that will really put some distance between GlusterFS and the also-rans. It’s still the #1 idea for making people’s eyes light up, both in public presentations (come see me at Red Hat Summit!) and in private conversations with customers who have petabytes already . . . and that’s before you even consider its applicability to migrating data into or out of public clouds. After nearly three years of being a good little boy and working on stuff that honestly didn’t interest me nearly as much because the need was more immediate, it looks like I’ll finally be free – encouraged, even – to dive into the project I really came here to do. Expect more here as I work out various details over the coming weeks.