Of all the projects I've proposed or worked on for GlusterFS, New Style Replication (NSR) is one of the most ambitious. It has two major goals:
Improved handling of network partitions
Improved performance, both normally and during repair
Personally, I consider the improved partition handling to be the more important problem. NSR's predecessor AFR continues to be plagued by split-brain problems, with new reports on the mailing list almost weekly, despite many claims over the years that this tweak or that tweak will make those problems go away forever. Most other people seem to care more about performance. Fortunately, the two goals do not conflict. Often, as we shall see in a moment, the same techniques that are good for one are also good for the other.
NSR is deliberately designed to be like many other (especially database) replication systems that are known to work pretty well, which means it's very unlike AFR in two particular ways.
Replication is driven by a leader (a server temporarily elected to the role) rather than directly by clients.
Change detection and replica repair are done using a log, instead of per-file markings augmented by an index of recently changed files.
To illustrate the first difference, here are some slides from my 2014 Red Hat Summit talk.
This is where we start to see how our two goals are compatible. Even though the main reason to use chain/leader-based replication is to gain better control over what happens during a network partition, it also turns out to be good for performance. For one thing, it reduces coordination overhead to that needed for leader election and failure detection. That's tiny compared to the coordination that has to happen for every write in the fan-out approach. Basically, we amortize that overhead over a gigantic number of writes, so the per-write cost is next to nothing.
Also, with fan-out replication, each client write has to be transmitted directly from the client to two (or more) replicas. If the client only has a single interface, as is typically the case, its effective outbound bandwidth is thus divided by two (or more). With chain replication, the client and the leader can each use their full outbound bandwidth simultaneously. That's good even if they're on the same network, better if they're on separate networks, and best of all if the back-end server network is faster than the front-end client network. To put it another way, for real networks configured the way real people do it, NSR's chain replication makes much better use of the available resources.
Of course, there is a downside. Or maybe there isn't. As you can see from the slide on the right, the chain method involves two network hops before a write can be acknowledged - client to leader, then leader to follower. This incurs some extra latency, but we'll see shortly that it might not actually be a problem. The only real case where chain replication "loses" is when many clients gang up on a single leader, and the leader's outbound bandwidth becomes more of a bottleneck than the clients' aggregate outbound bandwidth would have been. It can definitely happen. On the other hand, even in setups where clients and servers are similarly equipped, ganging up isn't as easy as you'd think. Some HPC workloads would be susceptible to this effect, but other than that it's far more likely - especially across a volume with many leaders for different replica sets - that only a few clients will be banging on a single leader at any given moment. Again, the way people really use these systems trumps the theoretical possibility of an opposite result.
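The bandwidth and latency tradeoffs above reduce to back-of-the-envelope arithmetic. Here's a toy model (not GlusterFS code; all numbers are illustrative assumptions) of why splitting a client's single interface across N replicas hurts fan-out throughput, and why chain replication pays for its throughput advantage with an extra hop of acknowledgment latency:

```python
# Toy model of fan-out vs. chain replication. Bandwidths are in Gb/s,
# latencies in ms; all values here are made-up examples.

def fanout(client_bw, replicas, hop_latency):
    """Client transmits every copy itself over one interface."""
    throughput = client_bw / replicas          # outbound bandwidth is split
    ack_latency = hop_latency                  # one hop; copies sent in parallel
    return throughput, ack_latency

def chain(client_bw, leader_bw, hop_latency):
    """Client sends one copy; the leader forwards to the follower."""
    throughput = min(client_bw, leader_bw)     # both NICs stream concurrently
    ack_latency = 2 * hop_latency              # client->leader, leader->follower
    return throughput, ack_latency

# Hypothetical 10 Gb/s links, 0.5 ms per hop, 2-way replication.
print(fanout(10.0, 2, 0.5))    # (5.0, 0.5)
print(chain(10.0, 10.0, 0.5))  # (10.0, 1.0)
```

With two replicas, the chain gets the full pipe at the cost of doubling the pre-acknowledgment hop count; a single overloaded leader would show up in this model as a small `leader_bw`.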
So, where am I going with all of this? I've always thought of these two architectural directions - chain replication and logging - as parts of one thing, but they're actually quite separable. I've also encountered massive "not invented here" resistance to NSR throughout its lifetime. Recently, I started to worry about how this resistance would affect my ability to see NSR through an adequate testing phase. Sure, I can write it, but if most of the resources I need for testing are continually diverted to AFR then I might never be able to test it properly. Then I hit on the idea of taking one of NSR's core ideas (and the associated code) and using it with AFR's change-recording and repair mechanisms instead. That way, we can get much more mileage on some parts while we finish the others. Thus, server-side AFR was born.
Long-time GlusterFS users are probably thinking that server-side AFR is an old idea, and they're right. Back in the 2.x days, this was a very common way to deploy GlusterFS. Then along came 3.0, and the acquisition, and server-side AFR was deprecated. Well, it's back. During the short time that we had multiple people working on NSR, all of the infrastructure was developed to elect a leader, have clients use it, fail over when necessary, and manage the resulting I/O flow. That infrastructure is just as applicable to AFR. All we need to do is load AFR on the server side, talking to one local replica through a normal server-side set of translators and to the other over the network, acting as a sort of proxy for the real client. How well does it work? Let's see.
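In volfile terms, "loading AFR on the server side" looks roughly like the following hand-written sketch. This is a simplified illustration, not something glusterd would actually generate - real volfiles include many more translators (locks, io-threads, and so on) - and every name, host, and path here is invented:

```
# Hypothetical server volfile fragment: cluster/replicate (AFR) loaded on
# the server, with one local subvolume and one remote one reached as a client.

volume local-brick
  type storage/posix
  option directory /bricks/brick1
end-volume

volume remote-brick
  type protocol/client
  option remote-host server2.example.com
  option remote-subvolume brick1
end-volume

volume replicate
  type cluster/replicate
  subvolumes local-brick remote-brick
end-volume
```

The leader serves clients through its normal protocol/server stack and forwards writes to the follower through the protocol/client translator, which is exactly the chain data flow described earlier.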
Contrary to the usual habit among most of my colleagues, I like to run a lot of my tests in the cloud. For one thing, it's easier for me to find SSD-equipped instances that way, and I really don't want massive disk latencies in the way when I'm trying to measure the effect of a network data-flow change. More importantly, this makes my results more reproducible than if I used some internal setup specifically purchased and configured to run this code. So I hopped on Rackspace, and on Digital Ocean, and I hacked some volfiles, and I started running some tests. I used 32GB 20- or 24-core machines for each of the two servers, and up to four clients, each about a quarter that size. Each client ran 32 threads' worth of I/O, using fio.
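For reference, an fio job approximating that workload might look like the sketch below. The mount path, block size, file size, and run time are my own assumptions, not the exact parameters used in these tests:

```
; Hypothetical fio job: 32 concurrent writers against a mounted Gluster volume.
[global]
directory=/mnt/glustervol
ioengine=libaio
direct=1
bs=4k
size=1g
runtime=60
time_based=1

[writers]
rw=randwrite
numjobs=32
group_reporting=1
```

A larger sequential block size (e.g. `bs=1m`, `rw=write`) would exercise the bandwidth story; small random writes like these exercise the IOPS story discussed below.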
Networking turned out to be an interesting issue. On Rackspace, each instance type is defined to have a certain amount of network bandwidth, and that does seem to be enforced. Oddly, as far as I can tell, throttling is enforced at the flow (rather than host) level. Unfortunately, that sort of negates the very difference in network-interface use that I was trying to explore. Instead of having to split its bandwidth, a traditional AFR client just gets twice as much. I still saw some differences, but I think the Digital Ocean results were more informative because DO doesn't throttle traffic the way Rackspace does. That's more like how a real network would behave, which makes the results more broadly applicable as well.
I could go on and on about the hardware and software configuration, but instead let's just go straight to the juicy bits. Here's a graph.
Pretty dramatic, huh? While client-side fan-out (traditional) AFR topped out and even started to decline rather early, server-side chain-replicating AFR continued to climb. At a mere four active clients, server-side was ahead by more than 50%. It's also worth noting that the difference would be greater if I took the time to eliminate various locking and other stuff that's no longer necessary when writes go through a leader. That leader only needs to coordinate regular writes with those from self-heal (repair), and only when it knows self-heal is running. It never has to coordinate with writes from another simultaneous leader. That's actually a lot of overhead and code complexity we'll be able to avoid some day.
We already knew that chain replication was likely to win the bandwidth contest. We also expected it to lose the latency contest. Did it? Well, sort of.
Median latency was only 0-8% worse. 99th-percentile latency was 7-28% worse. So yes, latency was negatively affected. How much would that actually matter? Some people care about raw latency but more people care about IOPS, which brings us to the most interesting graph of the bunch.
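One way to see why modestly worse per-operation latency doesn't have to mean worse IOPS is Little's law: sustained IOPS is roughly the number of outstanding operations divided by mean latency. A quick sketch, with purely illustrative numbers of my own (not measurements from these tests):

```python
# Little's law: throughput = concurrency / latency. Chain replication adds
# per-op latency, but if the system sustains more operations in flight
# (e.g. by not saturating client NICs), IOPS can still come out ahead.

def iops(outstanding_ops, mean_latency_s):
    """Steady-state operations per second under Little's law."""
    return outstanding_ops / mean_latency_s

# Hypothetical fan-out: 128 outstanding ops at 1.0 ms each.
print(iops(128, 0.001))   # 128000.0
# Hypothetical chain: 10% worse latency, but 160 ops kept in flight.
print(iops(160, 0.0011))  # ~145454.5
```

In other words, a single-digit-percent latency penalty is easily outweighed if the chain keeps more work in flight, which matches the crossover seen in the graph.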
Slightly worse, then pretty close to even, then 85% better. I probably should have added another client or two to see if the trend continued, but it was getting late and I was spending my own money. Also, I considered the point adequately made already. Chain replication already won the bandwidth contest. Given the latency characteristics of the two approaches, it would have been neither surprising nor fatal for it to lose the IOPS contest by a modest amount. Pulling ahead even on such a limited test seems sufficient to show that the overall tradeoff is still positive.
As it turns out, the two approaches are not mutually exclusive. It should be easy enough to switch between them as a volume-level option. (Even doing it at a replica-set level is possible, but there we have a UI problem because there's currently no way to address a single replica set within a volume.) If a configuration or workload works better with the fan-out approach, users would still be able to do that. For most, switching to server-side chain replication is likely to yield a much-needed boost in both robustness and performance.
The net result here is that users will be able to try a very different approach to replication well before the rest of GlusterFS 4.0 is ready. They should see some benefits, and we (the developers) should be able to learn from their experience. Maybe we'll even be able to free up some people from fixing AFR bugs, and get them to work on the remaining parts of NSR - or of 4.0 more generally - instead. Everybody wins.