In my Riak and Cassandra thread, one issue that came up was the use of vector clocks to resolve conflicts. Vector what? I’ve had this post about vector clocks queued up for a while (check the numbers) but what I had here wasn’t very good so I’ve decided to talk about conflict resolution more generally. The basic issue here is that, in a system where writes can happen independently in more than one place, multiple writes can happen which conflict or overlap, and at some point you have to decide which one survives. This is not just a problem in those newfangled NoSQL thingies either. Distributed filesystems and other systems (e.g. databases) that use asynchronous replication have been dealing with the same issue for decades – and worse, because of things like partial overlaps and data dependencies that don’t exist in the common NoSQL data models.

Vector clocks and closely related version vectors are among the common techniques used to resolve conflicts. The common idea here is that, for each datum, each node keeps track of the last version on every other node which is known to precede the current one. In other words, the version for a datum is not a single value but a vector, with a separate value for each (relevant) node. If I’m X, and I had already seen Y’s version 7 before I wrote my version 3, then I note that fact by including a 7 in Y’s entry within that vector and a 3 in mine. Then anyone who sees that 7 (or later) in my version would know that I had already seen and (presumably) accounted for Y’s update when I made mine, and could treat my update as superseding it. If they already saw Y’s 7 and they see a 6 from me, they’d be able to recognize that as a conflict. There’s a lot more about rules for updating or comparing vectors in either case, but that’s probably enough for now. No post here would be complete without a bullet list, though, so here are a couple of additional observations.

  • There are a lot of papers about vectors becoming too big, and ways to make them smaller. These are very real concerns when there are many updaters, as in a distributed filesystem. However, in many NoSQL stores the vectors only need to include an entry for each node where a replica might be stored. Since this number is likely to be very small, vector size isn’t an issue in these cases. Thanks to Alex Feinberg of Project Voldemort for reminding me of this recently.
  • Version vectors are usually per-object on a node, while vector clocks are usually global for the node, but this need not be the case. The version-vector update rules can be used for the entire set of objects on a node just as well.
  • There are subtle but important differences between the update rules for version vectors and vector clocks. In particular, message exchanges count as events for vector clocks and the vectors end up being more “advanced” on the receiver than on the sender. This extra ordering information can be useful, e.g. to detect ordering of operations across multiple objects, but can also create problems – e.g. with clocks passed back and forth never converging, or conflict-resolution events generating new vectors and thus potentially new conflicts.

The most important thing about both vector clocks and version vectors (henceforth “vectors” for both) is that they do not by themselves resolve conflicts. All they can do is detect conflicts, meaning updates whose order cannot be determined. The conflicting versions must all be saved until someone, at some time, looks at them and determines how to resolve the conflict – i.e. turn them into a single combined version. In most cases, this is the user. Both Voldemort and Riak, for example, follow Dynamo’s lead in this way. The Riak folks even claim that vector clocks are easy, but I don’t agree. Using vector clocks to detect conflicts isn’t hard, so they’re sort of right as far as that goes, but detection is not the same as resolution and resolution is where the harder problems live. Even Werner Vogels recently said that Dynamo is not very user friendly and vector clocks are a large part of the reason. Vectors aren’t that hard to deal with once you’re used to them, but there’s a bit of a “last straw” effect; they usually tend to show up when you’re already doing something fairly complex, and the last thing you need is one more $#@! bit of complexity. Also, not all conflicts are equally easy to resolve. Basho’s example (set union) turns out to be particularly easy due to certain mathematical properties, but even something as seemingly simple as incrementing an integer can run into more serious problems. How can increment be difficult? Well, let’s look at what might happen at node zero in a three-node system.

  1. Node zero has a value of 100 for the integer, and a version of {10,20,30}.
  2. An update comes in with a value of 101 and a version of {10,21,30}.
  3. Another update comes in, also with a value of 101 but with a version of {10,20,31}.

Hm. Node one’s update was apparently made while oblivious to node two’s, and vice versa. No problem, they each obviously incremented the variable once, so we just make it 102 with a version of {10,21,31}. Everyone’s happy, right? Not so fast. If all we kept was the first update (including its version) then we no longer have any information to tell us whether {10,21,30} was an update from 100 to 101 or from 99 to 101. We’re missing the same information for {10,20,31}. Thus, we don’t really know whether the correct new value is 100+1+1=102 or 99+2+2=103 or something else. To know that, we’d have to keep {10,20,30} so that we could see what each of the subsequent updates really did. (This is where the special nature of set union, which is not subject to such problems, becomes apparent.) What if the second update had used a version of {9,20,31}? That would indicate a conflict not with {10,21,30} but with {10,20,30} – meaning we’d better have its predecessor(s) handy as well. Basically we’re stuck holding onto all previous versions until we can be sure no new conflict with them can occur – i.e. we’ve heard from every possible updater with a version clearly equal to or later than {10,20,30} in this case. In a system with many updaters, or where long-lived partitions can occur, this can get pretty messy indeed. Not so easy any more, is it?

But wait, some might say, you can avoid all this by using vectors in a different way – to prevent update conflicts by issuing conditional writes which specify a version (vector) and only succeed if that version is still current. Sorry, but no, or at least not generally. In a partition-tolerant system, nodes on each side of a partition may simultaneously accept conflicting writes against the same initial version, and you’ll still have to resolve the conflict when you resolve the partition. For conditional writes to work, the condition must be evaluated at all update sites before the write can be allowed to succeed. Implications regarding SimpleDB internals are left as an exercise for the reader.

In conclusion, then, vectors are an incredibly powerful and useful tool to support conflict resolution . . . but conflict resolution remains a fundamentally hard problem. The complexity can be contained in many cases, but there will always be a few where it leaks out and gets all over you. This is why stronger forms of consistency, which truly don’t allow such conflicts to occur and therefore don’t require their resolution, have such enduring appeal. If you want a system that remains available even when partitions occur, though, you’ll need to weaken that consistency and therefore you’ll need to deal with conflicts. Understanding vector clocks and their relatives – and I mean really understanding them, not just treating them as some sort of magic pixie dust that will solve all of your problems – will likely be key to that effort. I hope I’ve provided at least a few clues to where both the pitfalls and the safe paths through that minefield are.