Roll Back or Rock On?

For a while now, Kyle Kingsbury has been doing some excellent work evaluating the consistency and other properties of various distributed databases. His latest target is Redis. I agree with most of the points he makes, including that Redis Cluster is subject to inexcusable data loss, but there is one point on which my own position is closer to the opposition's.

"[W]e have to be able to roll back operations which should not have happened in the first place. If those failed operations can make it into our consistent timeline in an unsafe way, perhaps corrupting our successful operations, we can lose data."

Those are strong words, but their strength is not matched by their precision. What does "unsafe" really mean here? Or "corrupting"? I'm the last person to take data corruption or loss lightly, but that's precisely why I think it's important to be crystal clear on what they mean. How is it "corruption" to perform a write that the user asked you to perform? The answer depends very much on what rules we're actually supposed to follow. Let's start with some of the most basic requirements for any distributed storage system.

  • Internal consistency: all nodes will eventually agree on whether each write happened or not. (Note: this is more CAP consistency than ACID consistency).

  • Durability: once a write has completed, it will be reflected in all subsequent reads despite transient loss of all nodes and/or permanent loss of some number of nodes (system-specific, but always less than a quorum).

We're not done yet, because we've only defined an internal kind of consistency. As many have pointed out, a distributed system includes its clients. A system that simply throws away all writes could satisfy our requirements, so let's add a more externally oriented consistency requirement.

  • External consistency: any write that has been acknowledged to the user as successfully completed must be complete according to the durability definition.

That's really about it. The last acknowledged write to a location will eventually become available everywhere, and remain available unless the failure threshold is exceeded (or a user deliberately overwrites it but that's a different matter). There are certainly many more requirements we could add, as we'll see, but these few are sufficient for a usable system.
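To make these rules concrete, here's a toy sketch (not a real database, and all names are illustrative) of majority-quorum replication: a write is acknowledged only once a quorum holds it, which is what makes it durable against loss of any minority of nodes and visible to all subsequent quorum reads. Eventual agreement between nodes (anti-entropy) is omitted for brevity.

```python
class Node:
    def __init__(self):
        self.data = {}        # key -> (version, value)
        self.up = True

    def write(self, key, version, value):
        if not self.up:
            raise ConnectionError("node down")
        cur = self.data.get(key, (0, None))
        if version > cur[0]:  # last-writer-wins by version
            self.data[key] = (version, value)

class Cluster:
    def __init__(self, n=3):
        self.nodes = [Node() for _ in range(n)]
        self.quorum = n // 2 + 1
        self.version = 0

    def write(self, key, value):
        # External consistency: acknowledge (return True) only once a
        # quorum holds the write. Durability then follows: the write
        # survives loss of any minority of nodes.
        self.version += 1
        acks = 0
        for node in self.nodes:
            try:
                node.write(key, self.version, value)
                acks += 1
            except ConnectionError:
                pass
        return acks >= self.quorum

    def read(self, key):
        # Read from a quorum; the highest version wins, so any
        # acknowledged write is always visible.
        best, seen = (0, None), 0
        for node in self.nodes:
            if node.up:
                best = max(best, node.data.get(key, (0, None)))
                seen += 1
        assert seen >= self.quorum, "too many failures"
        return best[1]

c = Cluster(3)
assert c.write("k", "v1")   # acknowledged: reached a quorum
c.nodes[0].up = False       # lose one node (less than quorum)
print(c.read("k"))          # the acknowledged write is still visible
```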

One thing that's noticeably missing from our external-consistency rule is anything to do with unacknowledged writes. Unless we add more rules, the system is free to choose whether they should be completed or rolled back (so long as our other rules are followed). Here's a rule that would force the system to decide a certain way.

  • Any write that has not been acknowledged to the user must not be reflected on subsequent reads.

That should be pretty familiar to database folks as isolation (plus a bit of atomicity), and it's no surprise that database folks would assume it . . . but you know what they say about assumptions. Other kinds of systems, such as filesystems, do not have such a requirement. Instead of appealing to (conflicting) authority or tradition, let's try taking a look at what's actually right for users.

Unacknowledged writes fall into two categories: still in progress or definitively failed. For in-progress writes, isolation can be enforced by storing them "off to the side" in one way or another. This doesn't work for definitively failed writes, because "off to the side" is finite. Those writes have to be actually removed from the system - i.e. rolled back. The problem is that roll-back is subject to the same coordination problems as the original write, and carries its own potential for data loss. In fact, for a write that overlaps with a previous one and succeeded at some nodes but not others, data loss absolutely will occur either way. The difference is only which data - old or new - will be lost.

So, back to what's right for users. Why is it better to lose the data that the user explicitly intended to overwrite than to lose the data that they explicitly intended to put in its place? Trick question: it's not. The careless user who didn't bother checking return values would obviously be better served by moving forward than by rolling back . . . but who cares about them? More importantly, even a diligent user who does check error codes should be aware that lack of acknowledgement does not mean lack of effect. By now "everybody knows" that if you send a network message and don't get a reply you can't assume it had no effect. The same "lost ack" problem exists for storage I/O as well, and always has. In both worlds, the "must have had no effect" assumption is just as dumb as the careless programmer's "must have worked" assumption.
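The "lost ack" problem is easy to demonstrate. In this sketch (names are illustrative; this is not any real database's API), the server applies a write but the reply never arrives, so the client sees a timeout for a write that did in fact take effect.

```python
class FlakyServer:
    def __init__(self):
        self.balance = 100

    def deposit(self, amount, ack_lost=False):
        self.balance += amount     # the effect always happens...
        if ack_lost:
            raise TimeoutError     # ...but the reply may be lost

server = FlakyServer()
try:
    server.deposit(50, ack_lost=True)
except TimeoutError:
    pass  # careless assumption: "it timed out, so it had no effect"

# Wrong -- the write was applied before the ack was lost:
print(server.balance)   # 150, not 100
```

A diligent client treats a timeout as "unknown outcome" and either re-reads the state or retries with an idempotent operation.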

If we exclude the careless and the truly diligent programmers, the only people left who would miss roll-back are those who know enough to check for errors but not enough - or don't care enough - to handle them properly. They must also be comfortable with the performance impact of roll-back support, most often from double writing. I'm not saying these people don't exist or their concerns aren't valid, but clearly roll-back is not the best or only system-design choice for everyone. Building a system that tries to keep as many writes as possible instead of throwing away as many as possible is an entirely valid option.
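The double-writing cost mentioned above comes from the classic undo-log pattern: before a value can be overwritten, the old value must itself be written somewhere so it can be restored. A rough sketch (simplified, in-memory; real systems persist the log before the data):

```python
store = {"k": "old"}
undo_log = []

def write(key, value):
    undo_log.append((key, store.get(key)))  # first write: save old value
    store[key] = value                      # second write: the data itself

def rollback():
    # Restore saved values in reverse order, undoing the writes.
    while undo_log:
        key, old = undo_log.pop()
        if old is None:
            store.pop(key, None)
        else:
            store[key] = old

write("k", "new")
rollback()
print(store["k"])   # back to "old"
```

Every user write costs two stored writes whether or not a roll-back ever happens, which is exactly the overhead a roll-forward design avoids.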

If Redis threw away 45% of acknowledged writes in Kyle's testing, that's a serious problem. It violates our external-consistency rule, or any reasonable alternative, and I have no problem saying that such a system is broken. When Kyle adds that Redis "adds insult to injury" by completing all of the unacknowledged writes instead, he's also correct - but it's only an insult, not a new injury. A new injury would be further loss of data, and whether those successful writes represent loss of data is very open to interpretation. If I accidentally knock some money out of your hands, then bend down and pick up only the pennies for you, it's not the pennies that are the problem. It's the money - or data - that got dropped on the floor and left there.

(NOTE: it has been pointed out to me that what Kyle tested was not Redis Cluster but a proposed WAIT enhancement to Redis replication. Or something like that. Fortunately, those distinctions aren't particularly relevant to the point I'm trying to make here, which is about the supposed necessity of roll-back support in this type of system. Nor does it change the fact that the system under test - whatever it was - failed miserably. Still, I used the wrong term so I've added this paragraph to correct it.)
