Apparently the AWS data center in Virginia had some problems today, which caused a bunch of sites to become unavailable. It was rather amusing to see which of the sites I visit are actually in EC2. It was considerably less amusing to see all of the people afraid that cloud computing will make their skills obsolete, taking the opportunity to drum up FUD about AWS specifically and cloud computing in general. Look, people: it was one cloud provider on one day. It says nothing about cloud computing generally, and AWS still has a pretty decent availability record (performance is another matter). Failures occur in traditional data centers too, whether outsourced or run by in-house staff. Whether you’re in the cloud or not, you should always “own your availability” and plan for failure of any resource on which you depend. Sites like Netflix that did this in AWS, by setting up their systems in multiple availability zones, were able to ride out the problems just fine. The problem was not the cloud; it was people being lazy and expecting the cloud to do their jobs for them in ways that the people providing the cloud never promised. Anybody who has never been involved in running a data center with availability at least as good as Amazon’s, but who has nevertheless used this as an excuse to tell people they should get out of the cloud, is just an ignorant jerk.
The other interesting thing about the outage is Amazon’s explanation:
8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones
I find this interesting because of what it implies about how EBS does this re-mirroring. How does a network event trigger an amount of re-mirroring (apparently still in progress as I write this) so far in excess of the traffic during the event? The only explanation that occurs to me, as someone who designs similar systems, is that the software somehow got into a state where it didn’t know what parts of each volume needed to be re-mirrored and just fell back to re-mirroring the whole thing. Repeat for thousands of volumes and you get exactly the kind of load they seem to be talking about. Ouch. I’ll bet somebody at Amazon is thinking really hard about why they didn’t have enough space to keep sufficient journals or dirty bitmaps or whatever it is that they use to re-sync properly, or why they aren’t using Merkle trees or some such to make even the fallback more efficient. They might also be wondering why the re-mirroring isn’t subject to flow control precisely so that it won’t impede ongoing access so severely.
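To make the distinction concrete, here is a toy sketch of the two ideas: tracking dirty blocks while a replica is unreachable, and falling back to comparing per-block digests (a flat stand-in for a Merkle tree) when that tracking state is lost. Nothing here reflects how EBS actually works; `MirroredVolume`, `BLOCK_SIZE`, and `resync_step` are names I made up for illustration, and the per-call `budget` is only a crude stand-in for real flow control.

```python
import hashlib

BLOCK_SIZE = 4  # bytes per block; tiny and hypothetical, for illustration only


class MirroredVolume:
    """Toy primary/replica pair held in memory (not a model of EBS)."""

    def __init__(self, num_blocks):
        self.primary = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.replica = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.replica_online = True
        self.dirty = set()       # stand-in for a dirty bitmap
        self.dirty_valid = True  # False once the tracking state is lost

    def write(self, index, data):
        """Write to the primary; mirror it, or mark it dirty if the replica is down."""
        self.primary[index] = data
        if self.replica_online:
            self.replica[index] = data
        else:
            self.dirty.add(index)

    def diff_by_digest(self):
        """Fallback: find divergent blocks by comparing digests.

        A real system would exchange digests over the network instead of
        data, and a Merkle tree would shrink that exchange further.
        """
        return [
            i
            for i, (p, r) in enumerate(zip(self.primary, self.replica))
            if hashlib.sha256(p).digest() != hashlib.sha256(r).digest()
        ]

    def resync_step(self, budget):
        """Copy at most `budget` divergent blocks; return how many remain."""
        if not self.dirty_valid:
            # Rebuild the dirty set from digests rather than copying everything.
            self.dirty = set(self.diff_by_digest())
            self.dirty_valid = True
        for i in sorted(self.dirty)[:budget]:
            self.replica[i] = self.primary[i]
            self.dirty.discard(i)
        return len(self.dirty)
```

With three blocks written during an outage, either path ends up copying only those three blocks, no matter how large the volume is. The naive alternative, copying every block of every affected volume, is the kind of behavior that would produce the load described above.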
Without being able to look “under the covers” I can’t say for sure what the problem is, but it certainly seems that something in that subsystem wasn’t responding to failure the way it should. Since many of the likely-seeming failure scenarios (“split brain” anyone?) involve a potential for data loss as well as service disruption, if I were a serious AWS customer I’d be planning how to verify the integrity of all my EBS volumes as soon as the network problems allow it.
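One low-tech way to do that kind of check, sketched below, is to compare current SHA-256 digests of files (or of a raw device path read as a file) against digests recorded before the trouble started. This assumes you actually have known-good digests from before the outage, which is a big assumption; `digest_of` and `changed_files` are hypothetical helper names, not part of any AWS tool.

```python
import hashlib


def digest_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(expected, chunk_size=1 << 20):
    """Return the paths whose current digest differs from the recorded one.

    `expected` maps path -> hex digest captured while the data was known good.
    """
    return [
        path
        for path, recorded in sorted(expected.items())
        if digest_of(path, chunk_size) != recorded
    ]
```

A mismatch only tells you that something changed, not whether the change was a legitimate write or corruption, so this is most useful for data that was supposed to be static during the event.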