Amazon has posted their own explanation of the recent EBS failure. Since I had offered some theories earlier, I think it’s worthwhile to close this out by comparing those theories with what Amazon says actually happened. Specifically, I had suggested two things:

  • EBS got into a state where it didn’t know what had been replicated, and fell back to re-replicating everything.
  • There was inadequate flow control on the re-replication/re-mirroring traffic, causing further network overload (a sketch of what I mean by flow control follows the list).
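
Just to make that second theory concrete: even something as simple as a token bucket on the recovery traffic would count as flow control, so that re-mirroring can never crowd out foreground I/O on the network. The sketch below is entirely mine, not anything I know about EBS internals; the rates and names are made up.

```python
import time

class TokenBucket:
    """Token-bucket limiter: admit re-mirroring sends only as fast as the
    configured rate, so recovery traffic can't swamp the network."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def try_send(self, nbytes):
        now = time.monotonic()
        # Refill tokens based on elapsed time, up to the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True   # caller may transmit this re-mirroring chunk now
        return False      # caller should wait and let foreground I/O through
```

A per-node cap like this wouldn’t have prevented the storm by itself, but it would have kept recovery traffic from competing with customer I/O for whatever bandwidth remained.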

It turns out that both theories were slightly correct but mostly incorrect. Here’s the most relevant part of Amazon’s account.

When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.

the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly

The first excerpt refers to the sort of full re-mirroring that I had mentioned, although it was re-mirroring to a new replica instead of an old one. The second is a classic congestion-collapse pattern: a transient failure, followed by too-aggressive retries that turn the transient failure into a persistent one. I had thought this would apply to the data traffic, but according to Amazon it affected the “control plane” instead. This is also what caused it to affect multiple availability zones, since the control plane – unlike the data plane – spans availability zones within a region.
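
The standard cure for that pattern is to make retries progressively less aggressive. Here’s a minimal sketch of what “backing off aggressively enough” might look like: capped exponential backoff with jitter on the search for free replica space. The function, the constants, and the search_cluster callback are all my own inventions, not EBS code.

```python
import random
import time

def find_space_with_backoff(search_cluster, base=0.5, cap=60.0, max_tries=None):
    """Retry a cluster-wide search for free replica space, but back off
    (exponentially, with jitter) between attempts instead of hammering
    the control plane in a tight loop."""
    attempt = 0
    while max_tries is None or attempt < max_tries:
        target = search_cluster()   # returns a node with free space, or None
        if target is not None:
            return target
        # Sleep for a random interval in [0, min(cap, base * 2**attempt)].
        delay = min(cap, base * (2 ** min(attempt, 20)))
        time.sleep(random.uniform(0, delay))
        attempt += 1
    return None
```

The jitter matters as much as the exponent: without it, thousands of nodes that failed together will all come back and retry together.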

The most interesting parts, to me, are the mentions of actual bugs – one in EBS and one in RDS. Here are the descriptions.

There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror.
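
Amazon doesn’t say what the race actually was, but the general shape of “fail when concurrently closing a large number of requests” is familiar: a check-then-act on shared state during teardown. The illustration below is deliberately generic and not EBS code; it just shows the racy version of a close path and the obvious fix.

```python
import threading

class ReplicaSession:
    """Illustrative only: a session that can be closed from two paths at once
    (say, an I/O error handler and a shutdown sweep)."""

    def __init__(self, sock):
        self.sock = sock
        self._lock = threading.Lock()
        self._closed = False

    def close_unsafe(self):
        # Racy: two threads can both see _closed == False and both tear down
        # the socket, which is the kind of thing that crashes a node.
        if not self._closed:
            self._closed = True
            self.sock.close()

    def close(self):
        # Safe: the flag is checked and flipped under a lock, so teardown
        # happens exactly once no matter how many closers race.
        with self._lock:
            if self._closed:
                return
            self._closed = True
        self.sock.close()
```

Under normal load the racy version almost never loses the race, which is exactly why this class of bug only surfaces during a storm like this one.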

Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required.
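
Reading between the lines, the monitoring agent is making a safety decision: promote the secondary only when it can show that no acknowledged write would be lost, and otherwise punt to a human. Here’s a toy version of that decision; every field and check in it is a hypothetical stand-in, since Amazon doesn’t describe the actual logic.

```python
from dataclasses import dataclass

@dataclass
class ReplicaStatus:
    # Hypothetical stand-ins for what a monitoring agent might know.
    healthy: bool
    reachable: bool
    last_acked_lsn: int   # highest log position known to be acknowledged

def decide_failover(primary: ReplicaStatus, secondary: ReplicaStatus) -> str:
    """Toy failover policy: promote the secondary only when it provably has
    every write the primary acknowledged; otherwise require a human."""
    if primary.healthy:
        return "no-op"
    if not secondary.reachable:
        return "page-operator"        # nothing safe to promote
    if secondary.last_acked_lsn >= primary.last_acked_lsn:
        return "promote-secondary"    # no acknowledged write can be lost
    # The primary is wedged *and* the secondary may be missing acknowledged
    # writes: automatic failover here risks data loss, so a human decides.
    return "page-operator"
```

The interesting branch is the last one: refusing to act is the correct behavior when acting could lose data, even though it means manual intervention.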

These bugs represent an important lesson for distributed-system designers: bugs strike without regard for location. Careful sharding and replication across machines and even sites won’t protect you against a bug that exists in every instance of the code. A while back, when I was attending the UCB retreats because of OceanStore, the “Recovery Oriented Computing” folks were doing some very interesting work on correlated failures. I remember some great discussions about distributing a system not just across locations but across software types and versions as well, and that lesson has stuck with me ever since.

For example, in iwhd the extended replication-policy syntax was developed with the specific goal of allowing replication across different back-end types (e.g. S3, OpenStack) or operating systems as well as different locations. Maybe distributing across different software versions wouldn’t have helped in Amazon’s specific case if the bugs involved had been there long enough, but it’s very easy to imagine a related scenario in which having different versions with different mirror-retry strategies in play (the same theory behind multiple hashes in Stochastic Fair Blue, BTW) might at least have avoided one factor contributing to the meltdown.
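
I won’t reproduce the iwhd policy syntax here, but the version-diversity idea is easy to sketch: give different replica groups deliberately different retry behavior, so that no single bug or retry rhythm can fire everywhere at once. Everything below is invented for illustration.

```python
import random

# Hypothetical: each replica group runs a deliberately different retry policy,
# so a pathological retry pattern (or a bug in one implementation) can't
# synchronize across the whole fleet -- the same intuition as using multiple
# hashes in Stochastic Fair Blue.
RETRY_POLICIES = {
    "group-a": lambda attempt: min(60.0, 0.5 * 2 ** min(attempt, 20)),  # exponential
    "group-b": lambda attempt: random.uniform(1.0, 30.0),               # randomized
    "group-c": lambda attempt: 5.0 * (attempt + 1),                     # linear
}

def retry_delay(replica_group, attempt):
    """Pick the delay according to whichever policy this group was assigned."""
    return RETRY_POLICIES[replica_group](attempt)
```

Even if every implementation is buggy, the bugs and the retry rhythms are at least unlikely to be the same ones everywhere.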