During my talks, I often try to make the point that horizontally scalable systems are often necessary not only to achieve high aggregate performance but also to overcome performance **variation**. This is true in general, but especially in a public cloud. In that environment, performance variation both between nodes and over time is much greater than in a typical bare-metal environment, and there’s very little you can do about it at the single-system level, so you pretty much have to deal with it at the distributed-system level. I’ve been using a graph that kind of illustrates that point, but it has a few deficiencies – it’s second-hand observation, it’s kind of old, and it’s for network performance whereas I usually care more about disk performance. To illustrate my point even more clearly, I took some measurements of my own recently and created some new graphs. Along the way, I found several other things that might be of interest to my readers.

The methodology here was deliberately simple. I’d get on a node, do whatever disk/volume setup was necessary, and then run a very simple iozone test over and over – eight threads, each doing random 4KB synchronous writes. I then repeated this exercise across three providers. It’s worth noting that each test is on a single machine at a single time. The variation across a broader sample is likely to be even greater, but these samples are already more than sufficient to make my point. Let’s look at the first graph, for a High I/O (hi1.4xlarge) instance.

Ouch. From peaks over 16K down to barely 6K, with barely any correlation between successive samples. That’s ugly. To be fair, Amazon’s result was the worst of the three, and that’s fairly typical. I also tested a Rackspace 30GB instance, and the measly little VR1G instance (yes, that’s 1GB) that runs this website at Host Virtual. Note that in the Amazon and Rackspace cases these were some of the beefiest instances available, the sort that should be on their own physical machines and most immune to the “bad neighbor” effect. This is as consistent as it gets, and the results were pretty amusing. To see how amusing, let’s look at the same figures in a different way.

This time, we’re looking at the number of samples that were “X or better” for any given performance level. This is a left/right mirror image of the more common “X or worse” kind of graph, which might seem a bit strange to some people. I did it this way deliberately so that “high to the right” is better, which I think is more intuitive. Too bad I don’t have comments so you can complain. :-P The way to interpret this graph is to keep in mind that **the line always falls**. The question is how far and how fast it falls. Let’s consider the three lines from lowest (overall to highest).

- The Rackspace line is low, but it’s very flat. That’s good. 97% of the samples are in a range from just under 6000 to a bit more under 4000. That’s pretty easy to plan for, as we’ll discuss in a moment.
- The Amazon line is awful. It has the highest peak on the left, but drops off continuously and sits below the HV line most of the time. As we’ve already noted, the range is also quite large. A flat line across a large range is exactly the opposite of a flat line across a small range; it’s very
**hard**to plan around. - The Host Virtual line is the most interesting. 70% of the time it’s very nice and flat, from 13.5K down to 12K, but then it falls off dramatically. Is this a good or bad result? It requires a bit more complex mental model than a flat line, but once you’re used to the model it’s actually better for planning purposes.

Before I describe how to use this information for planning a deployment, let’s talk a bit about prices. That VR1G costs $20 **a month**. The Rackspace instance would cost $878 and the Amazon instance would cost $2562 (less with spot/reserved pricing). Pricing isn’t really my point here, but a 128x difference does give one pause. When the effect of variation on deployment size is considered, those numbers only get worse. Even when one considers the benefits of Amazon’s network (some day I’ll write about that because it’s so much better than everyone else’s that I think it’s the real reason to go there) and services and so on, any serious user would have to consider which workloads should be placed where. But I digress. On with the show.

Let’s look now at how to use this information to provision an entire system. Say that we want to get to 100K aggregate IOPS. How many instances it would take to get there assuming the absolute best case, and how many it would take to achieve a 99% probability based on these distributions?

Provider |
Best Case |
99% Chance |
Ratio |

Amazon | 7 | 13 | 1.86 |

Rackspace | 14 | 28 | 2.00 |

Host Virtual | 8 | 11 | 1.38 |

Here we see something very interesting – the key point of this entire article, in my opinion. Even though Amazon is **potentially** capable of satisfying our 100K IOPS requirement with fewer instances than Host Virtual, once we take variation into account it requires more to get an actual **guarantee**. Instead of provisioning 38% more than the minimum, we need to need to provision 86% extra. As Jeff Dean points out in his excellent Tail At Scale article, variation in latency (or in our case throughput) is a critical factor in real-world systems; driving it down should be a goal for systems and systems-software implementors.

Before closing, I should explain a bit about how I arrived at these figures. Such figures can only be approximations of one sort or another, because the number of possibilities that must be considered to arrive at a precise answer is samples^nodes. Even at only 100 samples and 10 nodes, we’d be dealing with 10^20 possibilities. Monte Carlo would be one way to arrive at an estimate. Another way would be to divide the sorted samples into buckets, collapse the numbers within each bucket to a single number (e.g. average or minimum), then treat the results as a smaller number of samples. You can even use enumeration within a bucket as well as between buckets, and even do so recursively (which is in fact what I did). When there’s a nice “knee” in the curve, you can do something even simpler. Just eyeball a number above the knee and a number below, then work out the possibilities using those numbers and probability equal to the percentile at which the knee occurs. Whichever approach you use, you can do more work to get more accurate results but (except for Monte Carlo option) the numbers tend to converge very quickly so you’d probably be overthinking it.

OK, so what have we learned here? First, we’ve learned that I/O performance in the cloud is highly variable. Second, we’ve learned a couple of ways to visualize that variation and see the different patterns that it takes for each provider. Third, we’ve learned that consistency might actually matter more than raw performance if you’re trying to provision for a specific performance level. Fourth and last, we’ve learned a few ways to reason about that variation, and use repeated performance measurements to make a provisioning estimate that’s more accurate than if we just used an average or median. I hope this shows why average, median, or even 99th percentile is just not a good way to think about performance. You need to look at the **whole curve** to get the real story.

UPDATE: I couldn’t help myself, and ran the same test on Google Compute Engine’s biggest baddest instance type (n1-standard-8-d). Here’s the updated graph. Short version: to reach 100K IOPS in the best case, I’d need 88 of these. For a 99% guarantee as above, I’d need 110 for an overprovisioning ratio of 25%. That’s for the same per-instance cost as Rackspace BTW. Further conclusions are left as an exercise for the reader.