Archive for May, 2013

Performance Variation in the Cloud

During my talks, I often try to make the point that horizontally scalable systems are often necessary not only to achieve high aggregate performance but also to overcome performance variation. This is true in general, but especially in a public cloud. In that environment, performance variation both between nodes and over time is much greater than in a typical bare-metal environment, and there’s very little you can do about it at the single-system level, so you pretty much have to deal with it at the distributed-system level. I’ve been using a graph that kind of illustrates that point, but it has a few deficiencies – it’s second-hand observation, it’s kind of old, and it’s for network performance whereas I usually care more about disk performance. To illustrate my point even more clearly, I took some measurements of my own recently and created some new graphs. Along the way, I found several other things that might be of interest to my readers.

The methodology here was deliberately simple. I’d get on a node, do whatever disk/volume setup was necessary, and then run a very simple iozone test over and over – eight threads, each doing random 4KB synchronous writes. I then repeated this exercise across three providers. It’s worth noting that each test is on a single machine at a single time. The variation across a broader sample is likely to be even greater, but these samples are already more than sufficient to make my point. Let’s look at the first graph, for a High I/O (hi1.4xlarge) instance.
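For concreteness, the per-run measurement might look something like the iozone invocation below. The exact flags aren't given above, so treat this as an assumed reconstruction rather than the command I actually ran:

```shell
# Assumed reconstruction of the test command, not necessarily the exact one.
# -t 8: throughput mode with eight threads    -r 4k: 4KB records
# -s 512m: per-thread file size (arbitrary choice here)
# -i 0 -i 2: include the write and random-read/write tests
# -o: open files O_SYNC (synchronous writes)  -O: report results in ops/sec
iozone -t 8 -r 4k -s 512m -i 0 -i 2 -o -O
```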

Amazon I/O variation

Ouch. From peaks over 16K down to barely 6K, with little correlation between successive samples. That’s ugly. To be fair, Amazon’s result was the worst of the three, and that’s fairly typical. I also tested a Rackspace 30GB instance, and the measly little VR1G instance (yes, that’s 1GB) that runs this website at Host Virtual. The results were pretty amusing. To see how amusing, let’s look at the same figures in a different way.

IOPS distribution

This time, we’re looking at the number of samples that were “X or better” for any given performance level. This is a left/right mirror image of the more common “X or worse” kind of graph, which might seem a bit strange to some people. I did it this way deliberately so that “high to the right” is better, which I think is more intuitive. Too bad I don’t have comments so you can complain. :-P The way to interpret this graph is to keep in mind that the line always falls; the question is how far and how fast it falls. Let’s consider the three lines, from lowest overall to highest.

  • The Rackspace line is low, but it’s very flat. That’s good. 97% of the samples fall in a range from just under 6000 down to a bit under 4000. That’s pretty easy to plan for, as we’ll discuss in a moment.
  • The Amazon line is awful. It has the highest peak on the left, but drops off continuously and sits below the HV line most of the time. As we’ve already noted, the range is also quite large. A steadily falling line across a large range is exactly the opposite of a flat line across a small range; it’s very hard to plan around.
  • The Host Virtual line is the most interesting. 70% of the time it’s very nice and flat, from 13.5K down to 12K, but then it falls off dramatically. Is this a good result or a bad one? It requires a slightly more complex mental model than a flat line does, but once you’re used to the model it’s actually better for planning purposes.
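The “X or better” curve itself is easy to construct from the raw samples: sort them in descending order and plot each value against the fraction of samples at or above it. A minimal sketch in Python, using made-up sample values rather than the actual measurements behind the graphs:

```python
def x_or_better(samples):
    """Return (iops, fraction) pairs: fraction of samples at or above iops."""
    ordered = sorted(samples, reverse=True)
    n = len(ordered)
    # Sorting descending means "left to right" on the graph; the fraction
    # can only grow as we move to lower IOPS values, so the line always falls.
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

# Invented IOPS samples, for illustration only.
samples = [6100, 5900, 5800, 5500, 5400, 5200, 4900, 4600, 4200, 3900]
curve = x_or_better(samples)
```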

Before I describe how to use this information for planning a deployment, let’s talk a bit about prices. That VR1G costs $20 a month. The Rackspace instance would cost $878 and the Amazon instance would cost $2562 (less with spot/reserved pricing). Pricing isn’t really my point here, but a 128x difference does give one pause. When the effect of variation on deployment size is considered, those numbers only get worse. Even when one considers the benefits of Amazon’s network (some day I’ll write about that because it’s so much better than everyone else’s that I think it’s the real reason to go there) and services and so on, any serious user would have to consider which workloads should be placed where. But I digress. On with the show.

Let’s look now at how to use this information to provision an entire system. Say that we want to reach 100K aggregate IOPS. How many instances would it take to get there in the absolute best case, and how many would it take to achieve a 99% probability based on these distributions?

Provider       Best Case   99% Chance   Ratio
Amazon             7           13        1.86
Rackspace         14           28        2.00
Host Virtual       8           11        1.38
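One straightforward way to turn repeated measurements into numbers like these is Monte Carlo: draw per-node values from the measured samples and count how often n of them add up to the target. A sketch, using an invented per-node distribution rather than the real measurements, so the counts it produces won’t match the table:

```python
import random

def nodes_needed(samples, target, confidence=0.99, trials=10_000, seed=42):
    """Estimate how many nodes give `confidence` odds of `target` aggregate
    IOPS, given repeated single-node measurements. Returns (best_case, n)."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    best_case = -(-target // max(samples))  # ceiling division on the peak
    n = best_case
    while True:
        # Estimate P(sum of n independent per-node draws >= target).
        hits = sum(
            sum(rng.choice(samples) for _ in range(n)) >= target
            for _ in range(trials)
        )
        if hits / trials >= confidence:
            return best_case, n
        n += 1

# Invented distribution: flat most of the time, with a low tail.
samples = [13500] * 5 + [12500] * 2 + [8000, 6000, 5000]
best, safe = nodes_needed(samples, 100_000)
```

The gap between `best` and `safe` is exactly the ratio column above: the flatter the distribution, the smaller the gap.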

Here we see something very interesting – the key point of this entire article, in my opinion. Even though Amazon is potentially capable of satisfying our 100K IOPS requirement with fewer instances than Host Virtual, once we take variation into account it requires more to get an actual guarantee. Instead of provisioning 38% more than the minimum, we need to provision 86% extra. As Jeff Dean points out in his excellent “The Tail at Scale” article, variation in latency (or in our case throughput) is a critical factor in real-world systems; driving it down should be a goal for systems and systems-software implementors.

Before closing, I should explain a bit about how I arrived at these figures. Such figures can only be approximations of one sort or another, because the number of possibilities that must be considered to arrive at a precise answer is samples^nodes. Even at only 100 samples and 10 nodes, we’d be dealing with 10^20 possibilities. Monte Carlo simulation would be one way to arrive at an estimate. Another would be to divide the sorted samples into buckets, collapse the numbers within each bucket to a single number (e.g. the average or minimum), and then treat the results as a smaller number of samples. You can use enumeration within a bucket as well as between buckets, and even do so recursively (which is in fact what I did). When there’s a nice “knee” in the curve, you can do something even simpler: eyeball a number above the knee and a number below, then work out the possibilities using those two numbers and a probability equal to the percentile at which the knee occurs. Whichever approach you use, you can do more work to get more accurate results, but (except for the Monte Carlo option) the numbers tend to converge very quickly, so you’d probably be overthinking it.
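The “knee” shortcut in particular reduces to a short binomial calculation: model each node as landing on either a high or a low value, with the high probability equal to the percentile where the knee occurs, and sum the binomial terms where the aggregate clears the target. A sketch with eyeballed, purely illustrative high/low/percentile numbers (not the exact figures behind the table above):

```python
from math import comb

def p_meets_target(n, target, high, low, p_high):
    """P(aggregate of n two-valued nodes >= target), by binomial enumeration."""
    total = 0.0
    for k in range(n + 1):  # k = number of nodes landing on the high value
        if k * high + (n - k) * low >= target:
            total += comb(n, k) * p_high**k * (1 - p_high) ** (n - k)
    return total

def smallest_n(target, high, low, p_high, confidence=0.99):
    n = -(-target // high)  # best case: every node at the high value
    while p_meets_target(n, target, high, low, p_high) < confidence:
        n += 1
    return n
```

For example, with a knee at the 70th percentile, 12K above it, and 6K below it, `smallest_n(100_000, 12000, 6000, 0.7)` enumerates only n+1 terms per candidate n instead of samples^nodes possibilities.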

OK, so what have we learned here? First, we’ve learned that I/O performance in the cloud is highly variable. Second, we’ve learned a couple of ways to visualize that variation and see the different patterns that it takes for each provider. Third, we’ve learned that consistency might actually matter more than raw performance if you’re trying to provision for a specific performance level. Fourth and last, we’ve learned a few ways to reason about that variation, and use repeated performance measurements to make a provisioning estimate that’s more accurate than if we just used an average or median. I hope this shows why average, median, or even 99th percentile is just not a good way to think about performance. You need to look at the whole curve to get the real story.


Object Mania

Apparently, at RICON East today, Seagate’s James Hughes said something like this.

Any distributed filesystem like GlusterFS or Ceph that tries to preserve the POSIX API will go the way of the dodo bird.

I don’t actually know the exact quote. The above is from a tweet by Basho’s Seth Thomas, and is admittedly a paraphrase. It led to a brief exchange on Twitter, but it’s a common enough meme so I think a fuller discussion is warranted.

The problem here is not the implication that there are other APIs better than POSIX. I’m quite likely to agree with that, and a discussion about ideal APIs could be quite fruitful. Rather, the problem is the implication that supporting POSIX is inherently bad. Here’s a news flash: POSIX is not the only API that either GlusterFS or Ceph support. Both also support object APIs at least as well as Riak (also a latecomer to that space) does. Here’s another news flash: the world is full of data associated with POSIX applications. Those applications can run just fine on top of a POSIX filesystem, but the cost of converting them and/or their data to use some other storage interface might be extremely high (especially if they’re proprietary). A storage system that can speak POSIX plus SomethingElse is inherently a lot more useful than a storage system that can speak SomethingElse alone, for any value of SomethingElse.

A storage system that only supported POSIX might be problematic, but neither system that James mentions is so limited and that’s what makes his statement misleading. The only way such a statement could be more than sour grapes from a vendor who can’t do POSIX would be if there’s something about supporting POSIX that inherently precludes supporting other interfaces as well, or incurs an unacceptable performance penalty when doing so. That’s not the case. Layering object semantics on top of files, as GlusterFS does, is pretty trivial and works well. Layering the other way, as Ceph does, is a little bit harder because of the need for a metadata-management layer, but also works. What really sucks is sticking a fundamentally different set of database semantics in the middle. I’ve done it a couple of times, and the impedance-mismatch issues are even worse than in the Ceph approach.

As I’ve said over and over again in my presentations, there is no one “best” data/operation/consistency model for storage. Polyglot storage is the way to go, and POSIX is an important glot. I’ve probably used S3 for longer than anyone else reading this, and I was setting up Swift the very day it was open-sourced. I totally understand the model’s appeal. POSIX itself might eventually go the way of the dodo, but not for a very long time. Meanwhile, people and systems that try to wish it away instead of dealing with it are likely to go the way of the unicorn – always an ideal, never very useful for getting real work done.