Amazon’s Own Post Mortem

Amazon has posted their own explanation of the recent EBS failure. Since I had offered some theories earlier, I think it’s worthwhile to close this out by comparing my theories with Amazon’s explanation. Specifically, I had suggested two things.

  • EBS got into a state where it didn’t know what had been replicated, and fell back to re-replicating everything.
  • There was inadequate flow control on the re-replication/re-mirroring traffic, causing further network overload.

It turns out that both theories were slightly correct but mostly incorrect. Here’s the most relevant part of Amazon’s account.

When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.

the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly

The first part refers to the sort of full re-mirroring that I had mentioned, although it was re-mirroring to a new replica instead of an old one. The last part is a classic congestion-collapse pattern: transient failure, followed by too-aggressive retries that turn the transient failure into a persistent one. I had thought this would apply to the data traffic, but according to Amazon it affected the “control plane” instead. This is also what caused it to affect multiple availability zones, since the control plane – unlike the data plane – spans availability zones within a region.
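
To make the retry side of that pattern concrete, here is a minimal sketch of capped exponential backoff with jitter, the usual remedy for too-aggressive retries. It is not Amazon's code; `backoff_delays`, `find_space`, and every parameter are hypothetical names chosen for illustration.

```python
import random


def backoff_delays(base=0.1, cap=30.0, attempts=8, rng=random.random):
    """Yield randomized delays whose upper bound doubles each attempt.

    Each delay is uniform in [0, min(cap, base * 2**n)], so a crowd of
    clients that failed at the same moment does not retry in lockstep.
    """
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        yield rng() * ceiling


def find_space(try_once, sleep, **kwargs):
    """Call try_once() until it returns something other than None.

    try_once and sleep are passed in (hypothetical hooks) so the loop can
    be exercised without a real cluster or real waiting.
    """
    for delay in backoff_delays(**kwargs):
        result = try_once()
        if result is not None:
            return result
        sleep(delay)  # wait longer, on average, after every failure
    return None  # give up instead of searching forever
```

The important properties are the growing ceiling, the randomization, and the bounded number of attempts. A node that searches again immediately and forever has none of the three.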

The most interesting parts, to me, are the mentions of actual bugs – one in EBS and one in RDS. Here are the descriptions.

There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror.

Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required.

These bugs represent an important lesson for distributed-system designers: bugs strike without regard for location. Careful sharding and replication across machines and even sites won’t protect you against a bug that exists in every instance of the code. A while back, when I was attending the UCB retreats because of OceanStore, the “Recovery Oriented Computing” folks were doing some very interesting work on correlated failures. I remember some great discussions about distributing a system not just across locations but across software types and versions as well. This lesson has stuck with me ever since. For example, in iwhd the extended replication-policy syntax was developed with a specific goal of allowing replication across different back-end types (e.g. S3, OpenStack) or operating systems as well as different locations. Maybe distributing across different software versions wouldn’t have helped in Amazon’s specific case if the bugs involved had been in there long enough, but it’s very easy to imagine a related scenario in which having different versions with different mirror-retry strategies in play (same theory behind multiple hashes in Stochastic Fair Blue BTW) might at least have avoided one factor contributing to the meltdown.

More Fallout from the AWS Outage

Since my last article on the subject, a couple of other folks have tried to use the EBS failure to pimp their own competing solutions. Joyent’s Bryan Cantrill went first, with Network Storage in the Cloud: Delicious but Deadly. He makes some decent points, e.g. about “read-only” mounts not actually being read-only, until he goes off the rails about here.

This whole experience — and many others like it — left me questioning the value of network storage for cloud computing. Yes, having centralized storage allowed for certain things — one could “magically” migrate a load from one compute node to another, for example — but it seemed to me that these benefits were more than negated by the concentration of load and risk in a single unit (even one that is putatively highly available).

What’s that about “concentration of load and risk in a single unit”? It’s bullshit, to put it simply. Note the conflation of “network storage” in the first sentence with “centralized storage” in the second. As Bryan himself points out in the very next paragraph, the fallback to local storage has forced them to “reinvest in technologies” for replication, migration, and backup between nodes. That’s not reinvesting, that’s reinventing – of wheels that work just fine in systems beyond those Bryan knows. Real distributed storage doesn’t involve that concentration of load and risk, because it’s more than just a single server with failover. Those of you who follow me on Twitter probably noticed my tweet about people whose vision of “distributed” doesn’t extend beyond that slight modification to an essentially single-server world view. Systems like RBD/Sheepdog, or Dynamo and its derivatives if you go a little further afield, don’t have the problems that naive iSCSI or DRBD implementations do.

Next up is Heroku, with their incident report which turned into an editorial. They actually make a point I’ve been making for years.

2) BLOCK STORAGE IS NOT A CLOUD-FRIENDLY TECHNOLOGY. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can. Block storage has physical locality that can’t easily be transferred.

OK, that last part isn’t quite right. Block storage has no more or less physical locality than file or database storage; it all depends on the implementation. However, block storage does have another property that makes it cloud-unfriendly: there’s no reasonable way to share it. Yes, cluster filesystems that allow such sharing do exist. I even worked on one a decade ago. There are a whole bunch of reasons why they’ve never worked out as well as anyone hoped, and a few reasons why they’re a particularly ill fit for the cloud. In the cloud you often want your data to be shared, but the only way to share block storage is to turn it into something else (e.g. files, database rows/columns, graph nodes) at which point you’re sharing that something else instead of sharing the block storage itself. Just about every technology you might use to do this can handle its own sharding/replication/etc. so you might as well cut out the middle man and run them on top of local block storage. That’s the only case where local block storage makes sense, because it explicitly does not need to be shared and is destined for presentation to users in some other form. Even in the boot-image case, which might seem to involve non-shared storage, there’s actually sharing involved if your volume is a snapshot/clone of a shared template. Would you rather wait for every block in a multi-GB image to be copied to local disk before your instance can start, or start up immediately and only copy blocks from a snapshot or shared template as needed? In all of these cases, the local block storage is somehow virtualized or converted ASAP instead of being passed straight through to users. 
The only reason for the pass-through approach is performance, but if you’re in the cloud you should be achieving application-level performance via horizontal scaling rather than hyper-optimization of each instance anyway. That makes performance a weak reason to rely on pass-through, except in a few very specialized cases such as virtual appliances that are themselves providing a service to the rest of the cloud.
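
The boot-image case above can be sketched in a few lines. This is only an illustration of copying blocks from a shared template on first touch, not any particular hypervisor's or storage system's implementation; `LazyCloneVolume` and its methods are hypothetical names.

```python
class LazyCloneVolume:
    """A volume cloned from a shared template, populated on demand."""

    def __init__(self, template):
        self.template = template  # shared, treated as read-only
        self.local = {}           # block index -> locally held block

    def read(self, index):
        # Fetch from the template only the first time a block is touched.
        if index not in self.local:
            self.local[index] = self.template[index]
        return self.local[index]

    def write(self, index, data):
        # Writes land locally and never modify the shared template.
        self.local[index] = data


template = (b"boot", b"root", b"swap")
vol = LazyCloneVolume(template)  # usable immediately, nothing copied yet
vol.read(1)                      # copies exactly one block
vol.write(0, b"mine")            # diverges from the template locally
```

The instance can start as soon as the object exists; the cost of copying is paid per block, and only for blocks that are actually used.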

Amazon’s Outage

Apparently the AWS data center in Virginia had some problems today, which caused a bunch of sites to become unavailable. It was rather amusing to see which of the sites I visit are actually in EC2. It was considerably less amusing to see all of the people afraid that cloud computing will make their skills obsolete, taking the opportunity to drum up FUD about AWS specifically and cloud computing in general. Look, people: it was one cloud provider on one day. It says nothing about cloud computing generally, and AWS still has a pretty decent availability record (performance is another matter). Failures occur in traditional data centers too, whether outsourced or run by in-house staff. Whether you’re in the cloud or not, you should always “own your availability” and plan for failure of any resource on which you depend. Sites like Netflix that did this in AWS, by setting up their systems in multiple availability zones, were able to ride out the problems just fine. The problem was not the cloud; it was people being lazy and expecting the cloud to do their jobs for them in ways that the people providing the cloud never promised. Anybody who has never been involved in running a data center with availability at least as good as Amazon’s, but who has nevertheless used this as an excuse to tell people they should get out of the cloud, is just an ignorant jerk.

The other interesting thing about the outage is Amazon’s explanation.

8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones

I find this interesting because of what it implies about how EBS does this re-mirroring. How does a network event trigger an amount of re-mirroring (apparently still in progress as I write this) so far in excess of the traffic during the event? The only explanation that occurs to me, as someone who designs similar systems, is that the software somehow got into a state where it didn’t know what parts of each volume needed to be re-mirrored and just fell back to re-mirroring the whole thing. Repeat for thousands of volumes and you get exactly the kind of load they seem to be talking about. Ouch. I’ll bet somebody at Amazon is thinking really hard about why they didn’t have enough space to keep sufficient journals or dirty bitmaps or whatever it is that they use to re-sync properly, or why they aren’t using Merkle trees or some such to make even the fallback more efficient. They might also be wondering why the re-mirroring isn’t subject to flow control precisely so that it won’t impede ongoing access so severely.
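
For readers who haven't seen the technique, here is a minimal sketch of dirty-block tracking, the kind of bookkeeping that lets a mirror re-sync only what changed while its replica was unreachable. It says nothing about how EBS actually works; `MirroredVolume` is a hypothetical toy, and a Python set stands in for what would normally be a compact bitmap.

```python
class MirroredVolume:
    """Toy two-copy volume that remembers writes made during a partition."""

    def __init__(self, num_blocks, block_size=4096):
        blank = bytes(block_size)
        self.primary = [blank] * num_blocks
        self.replica = [blank] * num_blocks
        self.connected = True
        self.dirty = set()  # indices written while the replica was away

    def write(self, index, data):
        self.primary[index] = data
        if self.connected:
            self.replica[index] = data  # normal synchronous mirroring
        else:
            self.dirty.add(index)       # remember it for later

    def resync(self):
        """Copy only the dirty blocks; return how many were copied."""
        copied = 0
        for index in sorted(self.dirty):
            self.replica[index] = self.primary[index]
            copied += 1
        self.dirty.clear()
        self.connected = True
        return copied
```

If the dirty set is lost or overflows, the only safe fallback is to copy every block, which is the expensive case speculated about above.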

Without being able to look “under the covers” I can’t say for sure what the problem is, but it certainly seems that something in that subsystem wasn’t responding to failure the way it should. Since many of the likely-seeming failure scenarios (“split brain” anyone?) involve a potential for data loss as well as service disruption, if I were a serious AWS customer I’d be planning how to verify the integrity of all my EBS volumes as soon as the network problems allow it.
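
One way to do that verification, sketched under the assumption that you recorded per-chunk checksums of data that was not supposed to change (a static data set or a quiesced volume) before the event. `chunk_digests` and `changed_chunks` are hypothetical helpers, and the in-memory streams stand in for a device or image file.

```python
import hashlib
import io


def chunk_digests(stream, chunk_size=1 << 20):
    """Return a list of SHA-256 hex digests, one per fixed-size chunk."""
    digests = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digests.append(hashlib.sha256(chunk).hexdigest())
    return digests


def changed_chunks(old, new):
    """Return indices where the two digest lists differ or one is missing."""
    longest = max(len(old), len(new))
    return [
        i for i in range(longest)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    ]


before = chunk_digests(io.BytesIO(b"aaaabbbbcccc"), chunk_size=4)
after = chunk_digests(io.BytesIO(b"aaaaXbbbcccc"), chunk_size=4)
suspect = changed_chunks(before, after)  # chunk indices worth inspecting
```

This only detects differences from a known-good baseline. For data that changes legitimately, application-level checks (database consistency tools, filesystem checks) are the more realistic option.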

Using with SSL

The other day, I needed to implement a very simple remote service for something at work. Everything I was doing seemed to map well onto simple HTTP requests, and I had played with a while ago, so it seemed like a good chance to refresh that bit of my memory. Not too much later, I was adding @route decorators to some of my existing functions, and voila! The previously local code had become remotely accessible, almost like magic. That was cool and allowed me to finish what I was doing, but at some point I’ll need to make this – and some other things like it – more secure. So it was that I sat down and tried to figure out how to make this code do SSL. At first I thought this must be extremely well-trodden ground, covered in just about every relevant manual and tutorial, but apparently not. Weird. In any case, here’s what I came up with for a server and a test program.

What I’m mostly interested in here is authenticating the client, but I do both sides because doing only one side seems a bit rude. “You have to show me ID before you can talk to me, but I don’t have to show anything to you.” It kind of annoys me how most people obsess over authenticating servers while allowing clients to remain anonymous, so I’m not going to do the opposite. The key in any case is to wrap WSGIServer.server_activate, which seems to be the last thing that gets called before accept(), so that it can call ssl.wrap_socket with all of the appropriate configuration data. Then, if you want to authenticate clients, you need to wrap WSGIRequestHandler.handle and actually check the incoming client certificate there. Finally, both of these get wrapped up together in an adapter class for to use. Clear as mud, huh?
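
The listing itself is not reproduced here, so what follows is a minimal sketch of the two wrapping steps just described, using only the standard library's wsgiref classes. It is not the original code. It uses `SSLContext.wrap_socket` because the module-level `ssl.wrap_socket` was removed in Python 3.12. `SSLWSGIServer`, `CheckingHandler`, `make_context`, `common_name`, and `ALLOWED_CLIENTS` are all hypothetical names, and the framework-specific adapter class is left out.

```python
import ssl
from wsgiref.simple_server import WSGIRequestHandler, WSGIServer

ALLOWED_CLIENTS = {"test-client"}  # hypothetical list of accepted CNs


def make_context(certfile, keyfile, cafile):
    """Build a server-side context that also demands a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile, keyfile)  # the server's own identity
    ctx.load_verify_locations(cafile)       # CA that signs client certs
    ctx.verify_mode = ssl.CERT_REQUIRED     # handshake fails without one
    return ctx


def common_name(cert):
    """Extract the commonName from a dict returned by getpeercert()."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


class SSLWSGIServer(WSGIServer):
    ssl_context = None  # assign a context before creating the server

    def server_activate(self):
        super().server_activate()  # puts the socket into listening mode
        # Wrap the listening socket so accept() hands back TLS connections.
        self.socket = self.ssl_context.wrap_socket(
            self.socket, server_side=True)


class CheckingHandler(WSGIRequestHandler):
    def handle(self):
        cert = self.request.getpeercert() or {}
        if common_name(cert) not in ALLOWED_CLIENTS:
            return  # drop the connection without serving the request
        super().handle()
```

To try it, set `SSLWSGIServer.ssl_context = make_context(...)` with your own certificate files and pass both classes to `wsgiref.simple_server.make_server` through its `server_class` and `handler_class` arguments. A test client would build its own context, load its certificate with `load_cert_chain`, and trust the server's CA.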

That’s really all there is. It’s no stunning work of genius, that’s for sure, but maybe the next guy searching for this should-be-well-known recipe will be able to save some time.