Amazon has posted an analysis of the recent EBS outage. Here’s what I would consider to be the root cause:
this inability to contact a data collection server triggered a latent memory leak bug in the reporting agent on the storage servers. Rather than gracefully deal with the failed connection, the reporting agent continued trying to contact the collection server in a way that slowly consumed system memory
After that, predictably, the affected storage servers all slowly ground to a halt. It’s a perfect illustration of an important principle in distributed-system design.
System-level failures are more likely to be caused by bugs or misconfiguration than by hardware faults.
It is important to write code that guards not only against external problems but against internal ones as well. How might that have played out in this case? For one thing, something in the system could have required positive acknowledgement of the DNS update (it’s not clear why they relied on DNS updates at all instead of assigning a failed server’s address to its replacement). An alert should have been thrown when such positive acknowledgement was not forthcoming, or when storage servers reached a threshold of failed connection attempts. Another possibility would be from the Recovery Oriented Computing project: periodically reboot apparently healthy subsystems to eliminate precisely the kind of accumulated degradation that something like a memory leak would cause. A related idea is Netflix’s Chaos Monkey: reboot components periodically to make sure the recovery paths get exercised. Any of these measures – I admit they’re only obvious in hindsight, and that they’re all other people’s ideas – might have prevented the failure.
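To make the "threshold of failed connection attempts" idea concrete, here is a minimal sketch in Python. It is not Amazon's agent, and every name in it (ReportingAgent, send, alert, failure_threshold, max_pending) is hypothetical. It assumes the agent queues reports while the collection server is unreachable. The two points it illustrates are that the queue is bounded, so a dead server cannot consume memory without limit, and that an alert fires once when consecutive failures reach the threshold.

```python
from collections import deque
from typing import Callable, Deque


class ReportingAgent:
    """Hypothetical reporting agent; an illustration, not Amazon's code."""

    def __init__(
        self,
        send: Callable[[str], None],   # assumed to raise OSError on failure
        alert: Callable[[str], None],  # assumed hook into a paging system
        failure_threshold: int = 5,
        max_pending: int = 100,
    ) -> None:
        self.send = send
        self.alert = alert
        self.failure_threshold = failure_threshold
        # Bounded buffer: once full, the oldest report is dropped instead of
        # letting memory grow while the collection server is unreachable.
        self.pending: Deque[str] = deque(maxlen=max_pending)
        self.consecutive_failures = 0

    def report(self, message: str) -> bool:
        """Queue a report and try to flush the queue. True if fully flushed."""
        self.pending.append(message)
        try:
            while self.pending:
                self.send(self.pending[0])
                self.pending.popleft()  # discard only after a successful send
        except OSError:
            self.consecutive_failures += 1
            # Alert exactly once, at the moment the threshold is reached.
            if self.consecutive_failures == self.failure_threshold:
                self.alert(
                    f"{self.consecutive_failures} consecutive failures "
                    "contacting the collection server"
                )
            return False
        self.consecutive_failures = 0
        return True
```

Dropping the oldest reports is a design choice made for the sketch; for monitoring data, losing some history is usually a better outcome than taking down the storage server that produces it.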
There are other, more operations-oriented lessons from the Amazon analysis, such as the manual throttling that exacerbated the original problem, but from a developer’s perspective that’s what I get from it.