GlusterFS 3.5 Features

It's time to let some cats out of some bags. As my loyal readers (yeah right) have surely noticed, things have been quiet around here. Part of that has been the result of vacations and such, but also there's a lot of stuff I just haven't felt ready to write about. Now that I've finished writing my feature proposals for GlusterFS 3.5, I'm ready to write here as well. But first, a bit of philosophy.

On any non-trivial software project, there's likely to be a certain tension between "loyalists" who want to keep things largely the same and "rebels" who keep pushing for radical change. I was tempted to use "liberal" and "conservative" but Steve Yegge beat me to it with a different set of definitions. (Good reading, BTW, though he's a bit loquacious so make sure you have plenty of time.) The distinction I'm trying to make is as follows:

  • A loyalist believes that "one more tweak" or a succession of them will bring the project to greatness. There's no need for radical change, and too much risk associated with it.

  • A rebel believes that you can't cross a chasm with small steps. You have to take bold steps, even if those steps are going to cause regret later.

Most people don't stay at one position on that spectrum, and certainly not at the ends. Often one's position is determined by circumstance. For example, despite my general tendency toward the rebel view, I've probably spent more of my career (especially at Revivio) fighting on the loyalist side. When it comes to GlusterFS, I'm not just a rebel but probably the rebel. There are others more fit and more inclined to take the loyalist position. I see it as my responsibility to take an under-represented rebel position by proposing new ideas and accumulating "technical credit" to remove or balance out the significant technical debt that has accumulated under the loyalist regime. I don't mind at all if my ideas stay "on the shelf" for years, as has happened with most of HekaFS. It's good to have extra ammo in the locker. I mind a little bit more when the loyalists reject ideas with pure FUD or passivity instead of sound technical argument, or when they can't seem to change their mind about an old idea without claiming it was their own, but that's a subject for a different post. The point here is that what I'm proposing is deliberately ambitious, because our competitors aren't exactly standing still. They're often acting more boldly than we are, and we'll never gain or maintain a lead over them (I won't get into which we're doing) with baby steps. With that in mind, here's what I've proposed.

  • Better SSL support. Surprise! This is actually a very loyalist feature. Basically the core code has been there for years, but there are several pieces that still need to be done - mainly on the usability front - before this can really be something we're proud of.

  • New Style Replication. For years, I've been advocating for fundamental change in this area. I even wrote an infamous "Why AFR Must Die" email explaining why I don't believe the current design or implementation will be sufficient going forward (using real user complaints and code examples respectively). There finally seems to be some acceptance of the idea that we should use a log/journal instead of scattered xattrs or hard-link trickery to keep track of incomplete changes, but NSR goes even further than that. For one thing, it's server-based to take better advantage of how NICs and switches work. For another, it has an almost Dynamo-like consistency model which offers users better tradeoffs of consistency vs. availability and/or performance. There's a lot more, which I'll discuss in detail some time soon. For now I'll just say that performance will be better, recovery will be faster, and (best of all IMO) "split brain" will be easier to avoid.

  • Thousand-node scalability. The biggest limitation on our scalability right now is not in the I/O path, which scales just fine, but in our management path. The changes I'm proposing here - based on some of the latest advances that some of my distributed-system friends will surely recognize - represent important steps on the way to an exabyte-scale system. We're already at petabyte scale, TYVM, so I don't think that's an exaggeration.

  • Data tiering/classification. While NSR might be the one that I've spent the most time thinking about, this is the one that might have the most immediate impact and appeal for users. We get queries all the time about how we're going to deal with SSDs. Hybrid drives and "smart" HBAs aren't really very good solutions, because they only have local and limited knowledge. (This is the same issue as global vs. local deduplication BTW.) Being able to combine SSD-based and disk-based bricks in a single volume with smart placement and migration across all of them is a leap far beyond such hacks. Perhaps even more importantly, we also get asked a lot about various features that would be beneficial in some way - e.g. deduplication, bitrot detection, erasure coding - but carry a high performance cost. We need to tier between these in much the same way as between different hardware types. As it turns out, the exact same infrastructure also allows us to implement locality awareness, security-level awareness, and all sorts of other features with relatively little effort. Our modular structure and our existing data-distribution code already do 95% of what's needed, and just need a little nudge.
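To make the "Dynamo-like" consistency vs. availability tradeoff mentioned under New Style Replication a bit more concrete, here's a minimal sketch of the classic quorum rule from Dynamo-style systems: with N replicas, reads that contact R of them and writes that contact W of them are guaranteed to overlap when R + W > N. This is only an illustration of the general idea, not NSR's actual design. The function names and the N/R/W parameters are my own assumptions for the example.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """Brute-force check: does every read set of size r share at least
    one replica with every write set of size w, out of n replicas?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


def describe(n, r, w):
    """Summarize the tradeoff for one (hypothetical) N/R/W setting."""
    return {
        # Reads are guaranteed to see the latest acknowledged write.
        "reads_see_latest_write": r + w > n,
        # How many replicas can be down while writes still succeed.
        "write_failures_tolerated": n - w,
        # How many replicas can be down while reads still succeed.
        "read_failures_tolerated": n - r,
    }


if __name__ == "__main__":
    for r, w in [(1, 1), (2, 2), (1, 3)]:
        print((3, r, w), describe(3, r, w))
```

Turning R and W down buys availability and latency at the cost of possibly stale reads. Turning them up does the opposite. That's the kind of knob I mean by "better tradeoffs."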

Those are just my own proposals. Others have made proposals too, so please go check them out. Personally, I've been eagerly awaiting Xavi's erasure-coding "disperse" translator for ages, and can't wait to see it become a full part of the project. While there's practically no chance that all of this will get into GlusterFS 3.5, a lot of it will, and what's left will become a formidable arsenal of opportunities for years to come.
