How Erasure Coding is Not Like Replication

Fri 13 February 2015

tags: storage

Many people think of erasure coding as equivalent to replication, but with better storage utilization. Want to store 100TB of data with two-failure survivability? With replication you'll need 300TB of physical storage; with 8+2 erasure coding you'll need only 125TB. Yeah, sure, there's a performance downside, but from a user's perspective that's the only other difference.
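
For the record, here's the arithmetic behind those numbers as a quick sketch (the function names are just for illustration, not any particular system's API):

```python
# Back-of-the-envelope storage math for the example above.

def replication_footprint(logical_tb, copies):
    """Physical storage needed when every byte is stored `copies` times."""
    return logical_tb * copies

def erasure_footprint(logical_tb, k, m):
    """Physical storage for a k+m code: each stripe of k data fragments
    gains m parity fragments, so the overhead factor is (k + m) / k."""
    return logical_tb * (k + m) / k

print(replication_footprint(100, 3))   # 300 TB for 3-way replication
print(erasure_footprint(100, 8, 2))    # 125.0 TB for 8+2 erasure coding
```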

From a developer's perspective, there's another subtle but very important difference related to the atomicity of writes. In a replicated system (assuming a reasonable implementation), every replica will contain valid data at the written location after a write. Ideally that will be the just-written data, but in failure scenarios some replicas might not have it yet. Instead you might have stale data, but at least it was something the user wrote at some time. Even if you're down to one replica, you can still retrieve data that was valid at some point, and if the most recent operations can be replayed from somewhere else, you still have a proper base from which to do that.

Erasure-coded systems aren't necessarily like that. Systematic erasure codes (which store the original data fragments verbatim alongside the parity fragments) are, but non-systematic codes (which store only encoded fragments) aren't. If you're doing the aforementioned 8+2 with a non-systematic code and exactly half of your nodes manage to complete the write before lightning strikes the data center, what do you have? Garbage. You have insufficient state to reconstruct either the new or the old version of that data. While it's OK to say that the write wouldn't have been reported as successful in that case, it's not OK for it to have destroyed what was there previously in the course of what should have ended up as a no-op.
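
To make that failure scenario concrete, here's a small sketch (the fragment bookkeeping is mine, not any particular library's) of why a half-completed write leaves nothing recoverable with a non-systematic 8+2 code:

```python
# Illustrative sketch: after a crash mid-write, each of the 10 nodes holds
# either its old fragment or its new one. A non-systematic code needs k
# mutually consistent fragments to reconstruct anything at all.

K, M = 8, 2                      # 8 data-equivalent fragments, 2 extra
N = K + M

def recoverable(fragment_versions, k=K):
    """True if at least k fragments belong to the same write (version)."""
    counts = {}
    for version in fragment_versions:
        counts[version] = counts.get(version, 0) + 1
    return any(c >= k for c in counts.values())

# Exactly half the nodes finished the new write before the failure.
after_crash = ["new"] * 5 + ["old"] * 5

print(recoverable(after_crash))   # False: neither version has 8 fragments
print(recoverable(["old"] * N))   # True: the pre-write state was fine
```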

Because of this, an erasure-coded system must take extra steps that wouldn't be necessary in a replicated system. With replication, each replica can accept the write and then by itself ensure that old data is still available until the new data is fully written. With erasure coding, this atomicity guarantee must be maintained globally - no node may overwrite any old data until all nodes have the new data. You can meet this coordination need with 2PC, with Paxos or Raft, with MVCC, but you can't just punt. Whatever approach you choose, it adds a significant piece of complexity that replicated systems can omit (and usually do for the sake of performance).
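
As a rough illustration of that coordination requirement (a minimal 2PC-flavored sketch with made-up names, not a production protocol; a real system would also need durable coordinator state, timeouts, and recovery), a write might stage new fragments next to the old ones and only retire the old ones once every node has acknowledged:

```python
# Sketch of a two-phase fragment write: no node discards its old fragment
# until every node has staged the new one.

class FragmentNode:
    def __init__(self):
        self.committed = None     # last fully committed fragment
        self.staged = None        # new fragment awaiting commit

    def prepare(self, fragment):
        self.staged = fragment    # written alongside old data; old stays intact
        return True               # ack; a real node could refuse or time out

    def commit(self):
        self.committed = self.staged
        self.staged = None

    def abort(self):
        self.staged = None        # old fragment is still the valid copy


def write_stripe(nodes, fragments):
    """Two-phase write: only overwrite old fragments after all nodes ack."""
    acks = [node.prepare(frag) for node, frag in zip(nodes, fragments)]
    if all(acks):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.abort()              # every node still holds the old stripe
    return False


nodes = [FragmentNode() for _ in range(10)]
write_stripe(nodes, [f"fragment-{i}-v1" for i in range(10)])
print(nodes[0].committed)         # fragment-0-v1
```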

I'm not saying erasure codes - or even non-systematic erasure codes - are bad. The space advantages are still there. The very "need k out of n fragments" property that makes this extra complexity necessary can be combined with a dash of cryptography to create storage with some very desirable security features. (AONT-RS is a very good starting point for understanding how.) I love erasure codes. This is just a tip for implementors of erasure-coded systems - or perhaps even more for would-be implementors - so that they can plan and prepare appropriately.