Is Eventual Consistency Useful?

Every once in a while, somebody comes up with the “new” idea that eventually consistent systems (or AP in CAP terminology) are useless. Of course, it’s not really new at all; the SQL RDBMS neanderthals have been making this claim-without-proof ever since NoSQL databases brought other models back into the spotlight. In the usual formulation, banks must have immediate consistency and would never rely on resolving conflicts after the fact . . . except that they do and have for centuries.

Most recently but least notably, this same line of non-reasoning has been regurgitated by Emin Gün Sirer in The NoSQL Partition Tolerance Myth and You Might Be A Data Radical. I’m not sure you can be a radical by repeating a decades-old meme, but in amongst the anti-NoSQL trolling there’s just enough of a nugget of truth for me to use as a launchpad for some related thoughts.

The first thought has to do with the idea of “partition oblivious” systems. EGS defines “partition tolerance” as “a system’s overall ability to live up to its specification in the presence of network partitions” but then assumes one strongly-consistent specification for the remainder. That’s a bit of assuming the conclusion there; if you assume strong consistency is an absolute requirement, then of course you reach the conclusion that weakly consistent systems are all failures. However, what he euphemistically refers to as “graceful degradation” (really refusing writes in the presence of a true partition) is anything but graceful to many people. In a comment on Alex Popescu’s thread about this, I used the example of sensor networks, but there are other examples as well. Sometimes consistency is preferable and sometimes availability is. That’s the whole essence of what Brewer was getting at all those years ago.

Truly partition-oblivious systems do exist, but they’re only a subset of what EGS lumps under that label. I think it’s a reasonable description of any system that not only allows inconsistency but also has only a weak method of resolving conflicts. “Last writer wins” or “latest timestamp” both fall into this category. However, even those have been useful to many people over the years. From early distributed filesystems to very current file-synchronization services like Dropbox, “last writer wins” has proven quite adequate for many people’s needs. Beyond that there is a whole family of systems that are not so much oblivious to partitions as respond differently to them. Any system that uses vector clocks or version vectors, for example, is far from oblivious. The partition was very much recognized, and very conscious decisions were made to deal with it. In some systems – Coda, Lotus Notes, Couchbase – this even includes user-specified conflict resolution that can accommodate practically any non-immediate consistency need. Most truly partition-oblivious systems – the ones that don’t even attempt conflict resolution but instead just return possibly inconsistent data from whichever copy is closest – never get beyond a single developer’s sandbox, so they’re a bit of a strawman.
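
To show what I mean by a conscious response to partitions, here’s a minimal version-vector sketch (my own toy code, not any particular system’s implementation): comparing the vectors tells you whether two updates are ordered or genuinely concurrent, and only the concurrent case needs conflict resolution.

```python
# Minimal version-vector sketch: detect whether two replicas' updates are
# ordered or genuinely concurrent (i.e. the result of a partition).
# Toy example, not lifted from any particular system.

def compare(vv_a, vv_b):
    """Return 'equal', 'a<=b', 'b<=a', or 'concurrent'."""
    nodes = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(n, 0) <= vv_b.get(n, 0) for n in nodes)
    b_le_a = all(vv_b.get(n, 0) <= vv_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<=b"        # b supersedes a; no conflict
    if b_le_a:
        return "b<=a"        # a supersedes b; no conflict
    return "concurrent"      # divergent updates: invoke conflict resolution

# Two replicas that each accepted a write during a partition:
print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}))  # -> concurrent
```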

Speaking of developers’ sandboxes, I think distributed version control is an excellent example of where eventual consistency does indeed provide great value to users. From RCS and SCCS through CVS and Subversion, version control was a very transactional, synchronous process – lock something by checking it out, work on it, release the lock by checking in. Like every developer, I had to deal with transaction failures by manually breaking these locks many times. As teams scaled up in terms of both number of developers and distribution across timezones/schedules, this “can’t make changes unless you can ensure consistency” model broke down badly. Along came a whole generation of distributed systems – git, hg, bzr, and many others – to address the need. These systems are, at their core, eventually consistent databases. They allow developers to make changes independently, and have robust (though admittedly domain-specific) conflict resolution mechanisms. In fact, they solve the divergence problem so well that they treat partitions as a normal case rather than an exception. Clearly, EGS’s characterization of such behavior as “lobotomized” (technically incorrect even in a medical sense BTW since the operation he’s clearly referring to is actually a corpus callosotomy) is off base since a lot of people at least as smart as he is derive significant value from it.
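
For the non-programmers following along anyway, here’s a toy sketch of the idea (nothing to do with git’s actual internals): history is a graph of commits, divergence is the normal case, and reconciliation starts by finding a common ancestor to merge against – domain-specific conflict resolution rather than “last writer wins”.

```python
# Toy commit DAG: each commit knows its parents. Divergence (a "partition"
# between two developers) is reconciled by finding a common ancestor and
# merging against it. Illustrative only; real DVCS internals differ.

def ancestors(commit, parents):
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen

def merge_base(a, b, parents):
    common = ancestors(a, parents) & ancestors(b, parents)
    # A "best" common ancestor is one that isn't an ancestor of another
    # common ancestor (criss-cross histories can have ties; we pick one).
    best = [c for c in common
            if not any(c != d and c in ancestors(d, parents) for d in common)]
    return best[0] if best else None

parents = {"base": [], "dev1": ["base"], "dev2": ["base"]}
print(merge_base("dev1", "dev2", parents))  # -> base
```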

That example probably only resonates with programmers, though. Let’s find some others. How about the process of scientific knowledge exchange via journals and conferences? Researchers generate new data and results independently, then “commit” them to a common store. There’s even a conflict-resolution procedure, domain-specific just like the DVCS example but nonetheless demonstrably useful. This is definitely better than requiring that all people working on the same problem or dataset remain in constant communication or “degrade gracefully” by stopping work. That has never worked, and could never work, to facilitate scientific progress. An even more prosaic example might be the way police share information about a fleeing suspect’s location, or military units share similar information about targets and threats. Would you rather have possibly inconsistent/outdated information, or no information at all? Once you start thinking about how the real world works, eventual consistency pops up everywhere. It’s not some inferior cousin of strong consistency, some easy way out chosen only by lazy developers. It’s the way many important things work, and must work if they’re to work at all. It’s really strong/immediate consistency that’s an anomaly, existing only in a world where problems can be constrained to fit simplistic solutions. The lazy developers just throw locks around things, over-serialize, over-synchronize, and throw their hands in the air when there’s a partition.

Is non-eventual consistency useful? That might well be the more interesting question.

Use Big Data For Good

There seems to be a growing awareness that there’s something odd about the recent election. “How did Obama win the presidential race but Republicans get control of the House?” seems to be a common question. People who have never said “gerrymandering” are saying it now. What even I hadn’t realized was this (emphasis mine).

Although the Republicans won 55 percent of the House seats, they received less than half of the votes for members of the House of Representatives.
 – Geoffrey R. Stone

What does this have to do with Big Data? This is not a technical problem. Mostly I think it’s a problem that needs to be addressed at the state level, for example by passing ballot measures requiring that district boundaries be set by an independent directly-elected commission. Maybe those members could even be elected via Approval Voting or Single Transferable Vote – systems which IMO should actually be used to elect the congresscritters themselves, but that’s not feasible without establishing voter familiarity in a different context.

Here’s the technical part. Most of the Big Data “success stories” seem to involve the rich (who can afford to buy/run big clusters) getting richer by exploiting consumers and invading their privacy. Very rarely do I hear about good uses, such as tracking drug interactions or disease spread. Where are the “data scientists” doing real science? Here’s an opportunity, while the election and its consequences are fresh in everybody’s minds, for those tools to do some good. How about if we use Big Data tools and Machine Learning techniques to crunch through demographic data and at least come up with congressional-district proposals that meet some rationally debatable definition of fairness? Obviously the results themselves can’t just be used as is, nor can the algorithms or data sets be enshrined into law, but maybe at least the operative definitions and the results they produce can provide decent starting points for a commission or the people themselves to consider. It seems like a lot better goal than targeting ads, anyway.
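
To make “rationally debatable definition of fairness” slightly less hand-wavy, here’s one toy example of the kind of metric such a system might report or optimize: a wasted-vote asymmetry measure computed over entirely made-up district results. The hard part would be the demographic modeling and the districting search, not this arithmetic.

```python
# Wasted-vote asymmetry over made-up two-party district results. Wasted
# votes: all votes for the loser, plus the winner's votes beyond the bare
# 50% needed to win. Hypothetical numbers; illustrative only.

def wasted_vote_gap(districts):
    wasted = {"A": 0.0, "B": 0.0}
    total = 0
    for votes in districts:                      # votes = {"A": n, "B": m}
        t = sum(votes.values())
        total += t
        winner = max(votes, key=votes.get)
        loser = "B" if winner == "A" else "A"
        wasted[winner] += votes[winner] - t / 2  # surplus beyond a bare win
        wasted[loser] += votes[loser]            # every losing vote is wasted
    return (wasted["A"] - wasted["B"]) / total   # positive: A disadvantaged

districts = [{"A": 75, "B": 25}, {"A": 60, "B": 40}, {"A": 43, "B": 57},
             {"A": 48, "B": 52}, {"A": 49, "B": 51}]
print(round(wasted_vote_gap(districts), 3))      # -> 0.2
```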

Another Amazon Post Mortem

Amazon has posted an analysis of the recent EBS outage. Here’s what I would consider to be the root cause:

this inability to contact a data collection server triggered a latent memory leak bug in the reporting agent on the storage servers. Rather than gracefully deal with the failed connection, the reporting agent continued trying to contact the collection server in a way that slowly consumed system memory

After that, predictably, the affected storage servers all slowly ground to a halt. It’s a perfect illustration of an important principle in distributed-system design.

System-level failures are more likely to be caused by bugs or misconfiguration than by hardware faults.

It is important to write code that guards not only against external problems but against internal ones as well. How might that have played out in this case? For one thing, something in the system could have required positive acknowledgement of the DNS update (it’s not clear why they relied on DNS updates at all instead of assigning a failed server’s address to its replacement). An alert should have been thrown when such positive acknowledgement was not forthcoming, or when storage servers reached a threshold of failed connection attempts. Another possibility would be from the Recovery Oriented Computing project: periodically reboot apparently healthy subsystems to eliminate precisely the kind of accumulated degradation that something like a memory leak would cause. A related idea is Netflix’s Chaos Monkey: kill components at random to make sure the recovery paths get exercised. Any of these measures – I admit they’re only obvious in hindsight, and that they’re all other people’s ideas – might have prevented the failure.
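
Here’s a sketch of the kind of defensive pattern I’m talking about. It’s obviously not Amazon’s code, just an illustration of bounded retries with capped backoff, an alert once failures cross a threshold, and no per-attempt state accumulating to leak.

```python
import logging
import time

# Sketch of a defensive reporting agent: capped exponential backoff, an
# alert when failures cross a threshold, and nothing retained per attempt
# that could leak. Purely illustrative; not Amazon's code.

ALERT_THRESHOLD = 10

def report_forever(send, collect, max_backoff=300):
    failures = 0
    backoff = 1
    while True:
        try:
            send(collect())              # one bounded attempt
            failures, backoff = 0, 1     # success resets the failure state
        except ConnectionError:
            failures += 1
            if failures == ALERT_THRESHOLD:
                logging.error("collection server unreachable %d times in a row",
                              failures)
            time.sleep(backoff)          # back off instead of hammering
            backoff = min(backoff * 2, max_backoff)
```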

There are other more operations-oriented lessons from the Amazon analysis, such as the manual throttling that exacerbated the original problem, but from a developer’s perspective that’s what I get from it.

Comments on Parallels “Cloud Storage”

As someone who was once hired to work on a “cloud file system”, I was quite intrigued by this tweet from Kir Kolyshkin.

@kpisman This is our very own distributed fs, somewhat similar to Gluster or CEPH but (of course) better. http://t.co/DlysXvve

Trying to ignore the fact that what the link describes is explicitly not a real filesystem, I immediately responded that the file/block discussion seemed misguided, and – more importantly – the code seemed to be MIA. The link is not to an implementation, in either source or binary form. It’s not even to an architecture or design. It’s just a discussion of high-level requirements, similar to what I did for HekaFS before I even considered writing the first line of code. Naturally, Kir challenged me to elaborate, so I will. Let’s start with what he has to say about scalability.

It’s interesting to note that a 64-node rack cluster with a fast Ethernet switch supporting fabric-switching technology can, using nothing more than 1Gb network cards and fairly run-of-the-mill SATA devices, deliver an aggregate storage bandwidth of around 50GB/s

I’ve actually seen a storage system deliver 50GB/s. I doubt that Kir has, because it’s not that common and if he had I’m pretty sure it would be mentioned somewhere in the document. Even if we assume dual Gb/s full-duplex NICs per node, that’s only 250MB/s/node or 16GB/s total. At 64 nodes per rack I don’t think you’re going to be cramming in more NICs, plus switches, so basically he’s just off by 3x. I work on the same kind of distributed “scale-out” storage he’s talking about, so I’m well aware of how claims like that should and do set off alarm bells for anybody who’s serious about this kind of thing. Let’s move on to the point I originally addressed.
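
The back-of-envelope version, for anyone who wants to check my numbers:

```python
# Sanity check of the claimed 50GB/s from 64 nodes with 1Gb/s NICs.
nodes = 64
nics_per_node = 2                 # generous: dual 1Gb/s NICs per node
gbit_per_nic = 1
bytes_per_node = nics_per_node * gbit_per_nic * 1e9 / 8   # ~250 MB/s/node
print(nodes * bytes_per_node / 1e9)                       # ~16 GB/s, not 50
```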

each VE root contains a large number of small files, and aggregating them in a file environment causes the file server to see a massively growing number of objects. As a result, metadata operations will run into bottlenecks. To explain this problem further: if each root has N objects and there are M roots, tracking the combined objects will require an N times M scaling of effort.

How does this require “N times M” effort any more for M>1 servers than for M=1? The only explanation I can think of is that Kir is thinking of each client needing to have a full map of all objects, but that’s simply not the case. Clients can cache the locations of objects they care about and look up any locations not already in cache. With techniques such as consistent hashing, even those rare lookups won’t be terribly expensive. Servers only care about their own objects, so “N times M” isn’t true for any entity in the system. This is not entirely a solved problem, but both GlusterFS and Ceph (among many others) have been doing things this way for years so anybody claiming to have innovated in this space should exhibit awareness of the possibility. Let’s move on.
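
For anyone who hasn’t run into the technique, here’s a bare-bones consistent-hashing sketch (illustrative only, and not how GlusterFS or Ceph actually do placement): locating an object is a cheap local computation over a small cached ring, so no entity in the system needs anything like an N times M map.

```python
import bisect
import hashlib

# Bare-bones consistent hashing: each server owns the arcs of the hash ring
# ending at its tokens, so locating an object is a local computation over a
# small ring rather than a lookup in a global object map. Sketch only.

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, vnodes=64):
        # Multiple virtual nodes per server smooth out the distribution.
        self.tokens = sorted((h(f"{s}#{v}"), s)
                             for s in servers for v in range(vnodes))
        self.keys = [t for t, _ in self.tokens]

    def locate(self, obj):
        i = bisect.bisect(self.keys, h(obj)) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["server-%d" % i for i in range(8)])
print(ring.locate("volume-1234/object-42"))
```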

use of sparse objects typically is not of interest to hosting providers because they already generally have more storage than they need.

O RLY? My customers – who I’d guess are probably more “enterprise-y” than Kir’s – certainly don’t seem to be wallowing in idle storage. On the contrary, they seem to be buying lots of new storage all the time and are very sensitive to its cost. That’s why one of the most frequently requested features for GlusterFS is “network RAID” or erasure coding instead of full-out replication, and deduplication/compression are close behind. They’re all geared toward wringing the most out of the storage people already have so that they don’t need to buy more. That hardly sounds like “more than they need” does it?
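
The appeal is easy to demonstrate. Here’s a toy single-parity example (real erasure codes such as Reed-Solomon are more general, but the storage math has the same flavor): one extra block protects several data blocks, instead of a whole extra copy of each.

```python
from functools import reduce

# Toy "network RAID" flavor: a single XOR parity block lets any one data
# block be reconstructed, at 1/k extra storage instead of a full extra copy.
# Real erasure codes (Reed-Solomon etc.) tolerate more failures.

def parity(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Lose block 1; rebuild it from the survivors plus parity.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
print(rebuilt)                      # -> b'BBBB'
```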

Because of these misunderstandings, I don’t think Parallels “cloud storage” is really comparable to GlusterFS, so I’m not sure why he mentioned it or why I’d care. It seems a lot more like RBD or Sheepdog, leaving open the question of why Parallels didn’t use one of those. Maybe they specifically wanted something that was closed source (or open source but you’re not supposed to know you’re paying them for something free). What’s really striking is what Kir never even mentions. For example, there’s no mention at all of security, privacy, or multi-tenancy. Surely, if this is supposed to be cloud storage, some mention should be made of accounts and authentication etc. There’s also no mention of management. If this is supposed to be all cloudy, shouldn’t there be something about how easy it is to add capacity or provision user storage from that pooled capacity? Without so much as an architectural overview it’s impossible to tell how well the result meets either the requirements Kir mentions or those he omits, and with such a start it’s hard to be optimistic.

Scaling Filesystems vs. Other Things

David Strauss tweeted an interesting comment about using filesystems (actually he said “block devices” but I think he really meant filesystems) for scale and high availability. I thought I was following him (I definitely am now) but in fact I saw the comment when it was retweeted by Jonathan Ellis. The conversation went on a while, but quickly reached a point where it became impossible to fit even a minimally useful response under 140 characters, so I volunteered to extract the conversation into blog form.

Before I start, I’d like to point out that I know both David and Jonathan. They’re both excellent engineers and excellent people. I also don’t know the context in which David originally made his statement. On the other hand, NoSQL/BigData folks pissing all over things they’re too lazy to understand has been a bit of a hot button for me lately (e.g. see Stop the Hate). So I’m perfectly willing to believe that David’s original statement was well intentioned, perhaps a bit hasty or taken out of context, but I also know that others with far less ability and integrity than he has are likely to take such comments even further out of context and use them in their ongoing “filesystems are irrelevant” marketing campaign. So here’s the conversation so far, rearranged to show the diverging threads of discussion and with some extra commentary from me.

DavidStrauss Block devices are the wrong place scale and do HA. It’s always expensive (NetApp), unreliable (SPOF), or administratively complex (Gluster).

Obdurodon Huh? GlusterFS is *less* administratively complex than e.g. Cassandra. *Far* less. Also, block dev != filesystem.

Obdurodon It might not be the right choice for any particular case, but for reasons other than administrative complexity.
What reasons, then? Wrong semantics, wrong performance profile, redundant wrt other layers of the system, etc. I think David and I probably agree that scale and HA should be implemented in the highest layer of any particular system, not duplicated across layers or pushed down into a lower layer to make it Somebody Else’s Problem (the mistake made by every project that has tried to make the HDFS NameNode highly available). However, not all systems have the same layers. If what you need is a filesystem, then the filesystem layer might very well be the right place to deal with these issues (at least as they pertain to data rather than computation). If what you need is a column-oriented database, that might be the right place. This is where I think the original very general statement fails, though it seems likely that David was making it in a context where layering two systems had been suggested.

DavidStrauss GlusterFS is good as it gets but can still get funny under split-brain given the file system approach: http://t.co/nRu1wNqI
I was rather amused by David quoting my own answer (to a question on the Gluster community site) back at me, but also a bit mystified by the apparent change of gears. Wasn’t this about administrative complexity a moment ago? Now it’s about consistency behavior?

Obdurodon I don’t think the new behavior (in my answer) is markedly weirder than alternatives, or related to being a filesystem.

DavidStrauss It’s related to it being a filesystem because the consistency model doesn’t include a natural, guaranteed split-brain resolution.

Obdurodon Those “guarantees” have been routinely violated by most other systems too. I’m not sure why you’d single out just one.
I’ll point out here that Cassandra’s handling of Hinted Handoff has only very recently reached the standard David seems to be advocating, and was pretty “funny” (to use his term) before that. The other Dynamo-derived projects have also done well in this regard, but other “filesystem alternatives” have behavior that’s too pathetic to be funny.

DavidStrauss I’m not singling out Gluster. I think elegant split-brain recovery eludes all distributed POSIX/block device systems.
Perhaps this is true of filesystems in practice, but it’s not inherent in the filesystem model. I think it has more to do with who’s working on filesystems, who’s working on databases, who’s working on distributed systems, and how people in all of those communities relate to one another. It just so happens that the convergence of database and distributed-systems work is a bit further along, but I personally intend to apply a lot of the same distributed-system techniques in a filesystem context and I see no special impediment to doing so.

DavidStrauss #Gluster has also come a long way in admin complexity, but high-latency (geo) replication still requires manual failover.

Obdurodon Yes, IMO geosync in its current form is tres lame. That’s why I still want to do *real* wide-area replication.

DavidStrauss Top-notch geo replication requires embracing split-brain as a normal operating mode and having guaranteed, predictable recovery.

Obdurodon Agreed wrt geo-replication, but that still doesn’t support your first general statement since not all systems need that.

DavidStrauss Agreed on need for geo-replication, but geo-repl. issues are just an amplified version of issues experienced in any cluster.
As I’ve pointed out before, I disagree. Even systems that do need this feature need not – and IMO should not – try to do both local/sync and remote/async replication within a single framework. They’re different beasts, most relevantly with respect to split brain being a normal operating mode. I’ve spent my share of time pointing out to Stonebraker and other NewSQL folks that partitions really do occur even within a single data center, but they’re far from being a normal case there and that does affect how one arranges the code to handle it.

Obdurodon I’m loving this conversation, but Twitter might not be the right forum. I’ll extract into a blog post.

DavidStrauss You mean complex, theoretical distributed systems issues aren’t best handled in 140 characters or less? :-)

I think that about covers it. As I said, I disagree with the original statement in its general form, but might find myself agreeing with it in a specific context. As I see it, aggregating local filesystems to provide a single storage pool with a filesystem interface and aggregating local filesystems to provide a single storage pool with another interface (such as a column-oriented database) aren’t even different enough to say that one is definitely preferable to the other. The same fundamental issues, and many of the same techniques, apply to both. Saying that filesystems are the wrong way to address scale is like saying that a magnetic #3 Phillips screwdriver is the wrong way to turn a screw. Sometimes it is exactly the right tool, and other times the “right” tool isn’t as different from the “wrong” tool as its makers would have you believe.

Distributed Databases in 1965

While we were in Ann Arbor last month, we stopped by the absolutely amazing Kaleidoscope used and rare bookstore. (I’d link, but can’t find a website.) I knew from our last visit that they have an excellent collection of old sci-fi magazines, so I decided to see if they had any from the month I was born – April 1965. Sure enough, they had a Galaxy from that month. I was surprised how many of the authors I recognized. Here are the stories mentioned on the cover:

  • “Wasted on the Young” by John Brunner
  • “War Against the Yukks” by Keith Laumer
  • “A Wobble in Wockii Futures” by Gordon R. Dickson
  • “Committee of the Whole” by Frank Herbert

That’s an all-star cast right there. However, the story that really made an impression on me was by someone I had never heard of – “The Decision Makers” by Joseph Green. It’s about an alien-contact specialist sent to decide whether a newly discovered species met relevant definitions of intelligence which would interfere with a planned terraforming operation. That’s pretty standard stuff for the SF of the time, but there’s a twist; the aliens, which are called seals, have a sort of collective intelligence which complicates the protagonist’s job. This leads to the passage that might be of interest to my usual technical audience.

Our group memory is an accumulated mass of knowledge which is impressed on the memory areas of young individuals at birth, at least three such young ones for each memory segment. We are a short-lived race, dying of natural causes after eight of your years. As each individual who carries a share of the memory feels death approaching he transfers his part to a newly born child, and thus the knowledge is transferred from generation to generation, forever.

Try to remember that this was written in 1965, long before the networked computer systems of today were even imagined, and that the author wasn’t even writing about computers. He was trying to tell a completely different kind of story; the entire excerpt above could have been omitted entirely without affecting the plot. Nonetheless, he managed to describe a form of what we would now call sharding, with replication and even deliberate re-replication to preserve availability. The result should be instantly recognizable to anyone who has studied modern distributed databases such as Voldemort or Riak or Cassandra. A lot of people think of this stuff as cutting edge, but it’s also an incidental part of a barely-remembered story from 1965. Somehow I find that both humbling and hilarious.
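
Just for fun, here’s the seals’ scheme restated as a toy placement algorithm; any resemblance to the replica-placement logic in a modern sharded data store is entirely the point.

```python
import random

# The seals' scheme, restated: every memory segment is held by at least
# three individuals, and a dying holder hands its segments to new ones so
# the replication factor never drops. Toy sketch, obviously.

REPLICAS = 3

def place(segments, holders):
    return {s: random.sample(holders, REPLICAS) for s in segments}

def handle_death(placement, dying, holders):
    for seg, copies in placement.items():
        if dying in copies:
            survivors = [h for h in holders if h != dying and h not in copies]
            copies[copies.index(dying)] = random.choice(survivors)  # re-replicate

holders = ["seal-%d" % i for i in range(10)]
placement = place(["segment-%d" % i for i in range(5)], holders)
handle_death(placement, "seal-0", holders)
print(placement["segment-0"])
```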

Fighting FUD Again

Tom Trainer wrote what was supposed to be a thoughtful examination of what “cloud storage” should mean, but it came across as a rather nasty anti-Isilon hit piece. I tried to reply there, but apparently my comment won’t go through until I register with “UBM TechWeb” so they can sell me some crap, so I’m posting my response here. Besides being a defense of an unfairly maligned competitor – mine as well as Tom’s unnamed employer’s – it might help clarify some of the issues around what is or is not “real” cloud storage.

As the project lead for CloudFS, which addresses exactly the kinds of multi-tenancy and encryption you mention, I agree with many of your main points about what features are necessary for cloud storage. Where I disagree is with your (mis)characterization of Isilon to make those points.

* First, their architecture is far from monolithic. Yes, OneFS is proprietary, but that’s a *completely* different thing.

* Second, scaling to 144 servers is actually pretty good. When you look closely at what many vendors/projects claim, you find out that they’re actually talking about clients . . . and any idiot can put together thousands of clients. Conflating node counts with server counts was a dishonest trick when I caught iBrix doing it years ago, and it’s a dishonest trick now. Even the gigantic “Spider” system at ORNL only has 192 servers, and damn few installations need even half of that. It’s probably a support limit rather than an architectural limit. No storage vendor supports configurations bigger than they’ve tested, and testing even 144 servers can get pretty expensive – at least if you do it right. I’m pretty sure that Isilon would raise that limit if somebody asked them for a bigger system and let them use that configuration for testing.

* Third, Isilon does have a “global” namespace as that term is usually used – i.e. at a logical level, to mean that the same name means the same thing across multiple servers, just like a “global variable” represents the same thing across multiple modules or processes. Do you expect global variables to be global in a physical sense too? In common usage, people use terms like “WAN” or “multi-DC” or “geo” to mean distribution across physical locations, and critiquing a vendor for common usage of a term makes your article seem like even more of a paid-for attack piece.

Disclaimer: I briefly evaluated and helped deploy some Isilon gear at my last job (SiCortex). I respect the product and I like the people, but I have no other association with either.

Amazon’s Own Post Mortem

Amazon has posted their own explanation of the recent EBS failure. Since I had offered some theories earlier, I think it’s worthwhile to close this out by comparing my theories with Amazon’s explanation. Specifically, I had suggested two things.

  • EBS got into a state where it didn’t know what had been replicated, and fell back to re-replicating everything.
  • There was inadequate flow control on the re-replication/re-mirroring traffic, causing further network overload.

It turns out that both theories were slightly correct but mostly incorrect. Here’s the most relevant part of Amazon’s account.

When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.

the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly

The first part refers to the sort of full re-mirroring that I had mentioned, although it was re-mirroring to a new replica instead of an old one. The last part is a classic congestion-collapse pattern: transient failure, followed by too-aggressive retries that turn the transient failure into a persistent one. I had thought this would apply to the data traffic, but according to Amazon it affected the “control plane” instead. This is also what caused it to affect multiple availability zones, since the control plane – unlike the data plane – spans availability zones within a region.

The most interesting parts, to me, are the mentions of actual bugs – one in EBS and one in RDS. Here are the descriptions.

There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror.

Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required.

These bugs represent an important lesson for distributed-system designers: bugs strike without regard for location. Careful sharding and replication across machines and even sites won’t protect you against a bug that exists in every instance of the code. A while back, when I was attending the UCB retreats because of OceanStore, the “Recovery Oriented Computing” folks were doing some very interesting work on correlated failures. I remember some great discussions about distributing a system not just across locations but across software types and versions as well. This lesson has stuck with me ever since. For example, in iwhd the extended replication-policy syntax was developed with a specific goal of allowing replication across different back-end types (e.g. S3, OpenStack) or operating systems as well as different locations. Maybe distributing across different software versions wouldn’t have helped in Amazon’s specific case if the bugs involved had been in there long enough, but it’s very easy to imagine a related scenario in which having different versions with different mirror-retry strategies in play (same theory behind multiple hashes in Stochastic Fair Blue BTW) might at least have avoided one factor contributing to the meltdown.
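
To make that concrete, here’s a hypothetical sketch (emphatically not iwhd’s actual policy syntax) of treating software version, or back-end type, as just another failure domain when choosing where replicas go.

```python
from itertools import combinations

# Treat software version (or back-end type) as one more failure domain when
# choosing replica locations, so a single code-level bug can't take out
# every copy. Hypothetical sketch; not iwhd's actual policy language.

nodes = [
    {"name": "n1", "site": "east", "version": "2.3"},
    {"name": "n2", "site": "east", "version": "2.4"},
    {"name": "n3", "site": "west", "version": "2.3"},
    {"name": "n4", "site": "west", "version": "2.4"},
]

def choose_replicas(nodes, count=3, domains=("site", "version")):
    best = None
    for combo in combinations(nodes, count):
        # Prefer the combination that spreads across the most distinct
        # values in every listed failure domain.
        score = sum(len({n[d] for n in combo}) for d in domains)
        if best is None or score > best[0]:
            best = (score, combo)
    return [n["name"] for n in best[1]]

print(choose_replicas(nodes))
```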

More Fallout from the AWS Outage

Since my last article on the subject, a couple of other folks have tried to use the EBS failure to pimp their own competing solutions. Joyent (or rather Bryan Cantrill) went first, with Network Storage in the Cloud: Delicious but Deadly. He makes some decent points, e.g. about “read-only” mounts not actually being read-only, until he goes off the rails about here.

This whole experience — and many others like it — left me questioning the value of network storage for cloud computing. Yes, having centralized storage allowed for certain things — one could “magically” migrate a load from one compute node to another, for example — but it seemed to me that these benefits were more than negated by the concentration of load and risk in a single unit (even one that is putatively highly available).

What’s that about “concentration of load and risk in a single unit”? It’s bullshit, to put it simply. Note the conflation of “network storage” in the first sentence with “centralized storage” in the second. As Bryan himself points out in the very next paragraph, the fallback to local storage has forced them to “reinvest in technologies” for replication, migration, and backup between nodes. That’s not reinvesting, that’s reinventing – of wheels that work just fine in systems beyond those Bryan knows. Real distributed storage doesn’t involve that concentration of load and risk, because it’s more than just a single server with failover. Those of you who follow me on Twitter probably noticed my tweet about people whose vision of “distributed” doesn’t extend beyond that slight modification to an essentially single-server world view. Systems like RBD/Sheepdog, or Dynamo and its derivatives if you go a little further afield, don’t have the problems that naive iSCSI or DRBD implementations do.

Next up is Heroku, with their incident report which turned into an editorial. They actually make a point I’ve been making for years.

2) BLOCK STORAGE IS NOT A CLOUD-FRIENDLY TECHNOLOGY. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can. Block storage has physical locality that can’t easily be transferred.

OK, that last part isn’t quite right. Block storage has no more or less physical locality than file or database storage; it all depends on the implementation. However, block storage does have another property that makes it cloud-unfriendly: there’s no reasonable way to share it. Yes, cluster filesystems that allow such sharing do exist. I even worked on one a decade ago. There are a whole bunch of reasons why they’ve never worked out as well as anyone hoped, and a few reasons why they’re a particularly ill fit for the cloud. In the cloud you often want your data to be shared, but the only way to share block storage is to turn it into something else (e.g. files, database rows/columns, graph nodes), at which point you’re sharing that something else instead of sharing the block storage itself. Just about every technology you might use to do this can handle its own sharding/replication/etc., so you might as well cut out the middle man and run them on top of local block storage. That’s the only case where local block storage makes sense, because it explicitly does not need to be shared and is destined for presentation to users in some other form.

Even in the boot-image case, which might seem to involve non-shared storage, there’s actually sharing involved if your volume is a snapshot/clone of a shared template. Would you rather wait for every block in a multi-GB image to be copied to local disk before your instance can start, or start up immediately and only copy blocks from a snapshot or shared template as needed? In all of these cases, the local block storage is somehow virtualized or converted ASAP instead of being passed straight through to users. The only reason for the pass-through approach is performance, but if you’re in the cloud you should be achieving application-level performance via horizontal scaling rather than hyper-optimization of each instance anyway, so that’s a weak reason to rely on it except in a few very specialized cases, such as virtual appliances that are themselves providing a service to the rest of the cloud.
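
The boot-image case deserves a sketch, since it’s the clearest illustration. A copy-on-read/copy-on-write overlay lets the instance start immediately and pull template blocks only as they’re touched; the fetch_template_block callable below is just a stand-in for whatever the real backing store would be.

```python
# Copy-on-read / copy-on-write overlay over a shared template: the instance
# boots immediately and template blocks are fetched only when first touched.
# Rough sketch; fetch_template_block stands in for the real backing store.

class OverlayVolume:
    def __init__(self, fetch_template_block):
        self.fetch = fetch_template_block   # callable: block index -> bytes
        self.local = {}                     # blocks copied or written locally

    def read(self, index):
        if index not in self.local:
            self.local[index] = self.fetch(index)   # copy on first read
        return self.local[index]

    def write(self, index, data):
        self.local[index] = data            # copy on write; template untouched

vol = OverlayVolume(lambda i: b"\0" * 4096)     # toy template: all zeroes
vol.write(7, b"x" * 4096)
print(len(vol.read(0)), len(vol.read(7)))
```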

Amazon’s Outage

Apparently the AWS data center in Virginia had some problems today, which caused a bunch of sites to become unavailable. It was rather amusing to see which of the sites I visit are actually in EC2. It was considerably less amusing to see all of the people afraid that cloud computing will make their skills obsolete, taking the opportunity to drum up FUD about AWS specifically and cloud computing in general. Look, people: it was one cloud provider on one day. It says nothing about cloud computing generally, and AWS still has a pretty decent availability record (performance is another matter). Failures occur in traditional data centers too, whether outsourced or run by in-house staff. Whether you’re in the cloud or not, you should always “own your availability” and plan for failure of any resource on which you depend. Sites like Netflix that did this in AWS, by setting up their systems in multiple availability zones, were able to ride out the problems just fine. The problem was not the cloud; it was people being lazy and expecting the cloud to do their jobs for them in ways that the people providing the cloud never promised. Anybody who has never been involved in running a data center with availability at least as good as Amazon’s, but who has nevertheless used this as an excuse to tell people they should get out of the cloud, is just an ignorant jerk.

The other interesting thing about the outage is Amazon’s explanation.

8:54 AM PDT We’d like to provide additional color on what were working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones

I find this interesting because of what it implies about how EBS does this re-mirroring. How does a network event trigger an amount of re-mirroring (apparently still in progress as I write this) so far in excess of the traffic during the event? The only explanation that occurs to me, as someone who designs similar systems, is that the software somehow got into a state where it didn’t know what parts of each volume needed to be re-mirrored and just fell back to re-mirroring the whole thing. Repeat for thousands of volumes and you get exactly the kind of load they seem to be talking about. Ouch. I’ll bet somebody at Amazon is thinking really hard about why they didn’t have enough space to keep sufficient journals or dirty bitmaps or whatever it is that they use to re-sync properly, or why they aren’t using Merkle trees or some such to make even the fallback more efficient. They might also be wondering why the re-mirroring isn’t subject to flow control precisely so that it won’t impede ongoing access so severely.
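
For the curious, here’s roughly what I mean by “dirty bitmaps”, offered as a guess at the general technique rather than a claim about Amazon’s implementation: track which regions were written while a replica was unreachable, then re-mirror only those instead of the whole volume.

```python
# Sketch of dirty-region re-mirroring: remember which regions were written
# while a replica was unreachable, then re-sync only those rather than
# falling back to copying the entire volume. Not Amazon's implementation.

class DirtyTracker:
    def __init__(self, volume_size, region_size=1 << 20):
        self.region_size = region_size
        self.dirty = set()
        self.nregions = (volume_size + region_size - 1) // region_size

    def record_write(self, offset, length):
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        self.dirty.update(range(first, last + 1))

    def resync(self, copy_region):
        for r in sorted(self.dirty):
            copy_region(r)          # copy just this region to the new mirror
        self.dirty.clear()

t = DirtyTracker(volume_size=100 << 30)          # 100 GiB volume
t.record_write(5 << 20, 3 << 20)                 # 3 MiB write at offset 5 MiB
t.resync(lambda r: print("re-mirroring region", r))
```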

Without being able to look “under the covers” I can’t say for sure what the problem is, but it certainly seems that something in that subsystem wasn’t responding to failure the way it should. Since many of the likely-seeming failure scenarios (“split brain” anyone?) involve a potential for data loss as well as service disruption, if I were a serious AWS customer I’d be planning how to verify the integrity of all my EBS volumes as soon as the network problems allow it.