Storage Benchmarking Sins

I've written and talked many times about storage benchmarking. Mostly, I've focused on how to run tests and analyze results. This time, I'd like to focus on the parts that come before that - how you set up the system so that you have at least some chance of getting a fair or informative result later. To start, I'm going to separate the setup into layers.

  • The physical configuration of the test equipment.

  • Base-level software configuration.

  • Tuning and workload selection.

Physical Configuration

The first point about physical configuration is that there's almost never any excuse for testing two kinds of software on different physical configurations. Sure, if you're testing the hardware, that makes some sense, but even then the only comparisons that make sense are the ones that hold something equal, such as the number of machines or total system cost (including licenses). Testing on different hardware is the most egregious kind of dishonest benchmarking, but it's only the first of many.

The second point about physical configuration is that just testing on the same hardware doesn't necessarily make things fair. What if one system can transparently take advantage of RDMA or other kinds of network offload but the other can't? Is it really fair to compare on a configuration with those features, and not even mention the disparity? What if one system can use the brand-new and vendor-specific SSE9 instructions to accelerate certain operations, but the other can't? The answer's less clear, I'll admit, but a respectable benchmark report would at least note these differences instead of trying to bury them. A good rule of thumb is that it's hardware used that counts, not merely hardware present. If the two systems aren't actually using the same hardware, the benchmark's probably skewed.
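As a concrete starting point, here's a minimal sketch (assuming a Linux test host with the standard procfs/sysfs layout; the "interesting" flag list is just an example) of how a benchmark report might record which acceleration features are merely present on the machine, so that what each system actually uses can then be verified and disclosed separately:

```python
# Sketch: capture "hardware present" details on a Linux test host so a
# benchmark report can disclose them. Whether the software under test
# actually *uses* these features still has to be verified separately.
import os

def cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags reported for the first core (x86 layout)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def rdma_devices(path="/sys/class/infiniband"):
    """List RDMA-capable devices, if the drivers are loaded."""
    return sorted(os.listdir(path)) if os.path.isdir(path) else []

if __name__ == "__main__":
    flags = cpu_flags()
    interesting = {"avx2", "avx512f", "aes", "sha_ni"}  # example acceleration features
    print("CPU acceleration flags present:", sorted(flags & interesting))
    print("RDMA devices present:", rdma_devices() or "none")
```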

The third and last point about hardware is that it's still possible to skew benchmark results even when two systems are using the same hardware. How's that? Not all programs benefit equally from the same system performance profile. What if one system made a design decision that saves memory at the expense of using more CPU cycles, and the other made the opposite trade-off? Is it fair to test on machines that are CPU-rich but memory-starved, or vice versa? Of course not. A fair comparison would be on balanced hardware, though it's obviously difficult to determine what "balance" means. This is why it's so important for people who do benchmarks to disclose and even highlight potential confounding factors. Another common trick in storage is "short stroking": using lots of disks and testing across only a small slice of each to reduce seek times. The flash equivalent might be to test one system on clean drives and the other after those same drives have become heavily fragmented. These differences can be harder to identify than the other two kinds, but they can have a similar effect on the validity of results. A rough sketch of the short-stroking effect follows below.
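Here's that sketch in Python, assuming a Linux host and a scratch device or large pre-written file you can safely read (/dev/sdX is a placeholder). It times random reads confined to the first 1% of the device against reads spread across all of it; page-cache effects and queue depth are ignored, so treat it as an illustration rather than a real benchmark:

```python
# Sketch: demonstrate the "short stroking" effect by timing random reads
# confined to a small region of a device (or large file) versus reads
# spread across all of it. Drop caches between runs for a cleaner signal.
import os, random, time

DEV = "/dev/sdX"        # placeholder: a raw device or large pre-written file
BLOCK = 4096            # read size in bytes
READS = 2000

def time_random_reads(fd, region_bytes):
    """Time READS random 4 KiB reads restricted to the first region_bytes."""
    start = time.perf_counter()
    for _ in range(READS):
        offset = random.randrange(0, region_bytes // BLOCK) * BLOCK
        os.pread(fd, BLOCK, offset)
    return (time.perf_counter() - start) / READS

if __name__ == "__main__":
    fd = os.open(DEV, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)
    short = time_random_reads(fd, size // 100)   # "short stroke": first 1% only
    full  = time_random_reads(fd, size)          # whole device
    os.close(fd)
    print(f"avg latency, 1% of device:  {short * 1e6:.0f} us")
    print(f"avg latency, whole device:  {full * 1e6:.0f} us")
```

On a spinning disk the restricted run should show noticeably lower average latency, which is exactly the gap a dishonest benchmark can hide behind.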

Base Software Configuration

For the purposes of this section, "base" effectively means anything but the software under test - notably operating-system stuff. Storage configuration is particularly important. Is it fair to compare performance of one system using RAID-6 vs. another using JBOD? Probably not. (The RAID-6 might actually be faster if it's through a cached RAID controller, but that takes us back to differences in physical configuration so it's not what we're talking about right now.) Snapshots enabled vs. snapshots disabled is another dirty trick, since there's usually some overhead involved. Many years ago, when I worked on networking rather than storage, I even saw people turning compression on and off for similar reasons.

Other aspects of base configuration can be used to cheat as well. Tweaking virtual-memory settings can have a profound effect on performance, one that hurts some systems far more than others. Timer frequency is another frequent target, as are block and process schedulers. In the Java world, I've seen benchmarks that do truly heinous things with GC parameters to give one system an advantage over another. As with physical configuration, base software configuration can easily be made equal without being fair. The rule of thumb here is whether the systems have been set up the way an experienced system administrator might have done it, with or without having read each product's tuning guide. If the configuration seems "exotic" or is undisclosed, somebody's probably trying to pull a fast one.
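As a concrete aid to disclosure, here's a small sketch (assuming a Linux host; the sysctl list is illustrative, not exhaustive) that snapshots a few of the base settings mentioned above so they can be published alongside the results:

```python
# Sketch: snapshot the base software settings that most often skew storage
# benchmarks, so they can be published alongside the results. Paths are
# standard Linux procfs/sysfs locations; extend the lists for your setup.
import glob, json

SYSCTLS = [
    "/proc/sys/vm/swappiness",
    "/proc/sys/vm/dirty_ratio",
    "/proc/sys/vm/dirty_background_ratio",
]

def read_first_line(path):
    try:
        with open(path) as f:
            return f.readline().strip()
    except OSError:
        return "unavailable"

def snapshot():
    config = {p: read_first_line(p) for p in SYSCTLS}
    # I/O scheduler for each block device (the bracketed entry is the active one).
    for sched in glob.glob("/sys/block/*/queue/scheduler"):
        config[sched] = read_first_line(sched)
    return config

if __name__ == "__main__":
    print(json.dumps(snapshot(), indent=2, sort_keys=True))
```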

Tuning

Most of the controversy in benchmarking has to do with tuning of the actual software under test. When I and others have tested GlusterFS vs. Ceph, there have always been complaints that we didn't tune Ceph properly. Those complaints are not entirely without merit, even though I don't feel the results were actually unfair. The core issue is that there are two ways to approach tuning for a competitive benchmark.

  • Measure "out of the box" (OOTB) performance, with no tuning at all. If one system has bad defaults, too bad for them.

  • Measure "tuned to the max" performance, consulting experts on each side on how best to tweak every single parameter.

The problem is that the second approach is almost impossible to pull off in practice. Most competitive benchmarks are paid for by one side, and the other is going to be distinctly uninterested in contributing. Even in cases where the people doing the testing are independent, it's just very rare that competitors' interest and resource levels will align that closely. Therefore, I strongly favor the OOTB approach. Maybe it doesn't fully explore the capabilities of each system, but it's more likely to be fair and representative of what actual users would see.

However, even pure OOTB doesn't quite cut it. What if the systems come out of the box with different default replication levels? It's clearly not fair to compare replicated vs. non-replicated, or even two-way vs. three-way, so I'd say tuning there is a good thing. On the other hand, I'd go the other way for striping. While different replication levels effectively result in using different hardware (different usable capacity), the same is not true of striping which merely uses the same hardware a little differently. That falls into the "too bad for them" category of each project being responsible for putting its own best foot forward.

Another area where I think it's valid to depart from pure OOTB is durability. It's simply not valid or useful to compare a system that actually gets data on disk when it's supposed to against one that leaves it buffered in memory, as at least two of GlusterFS's competitors (MooseFS and HDFS) have been guilty of doing. You have to compare apples to apples, not apples to promises of apples maybe some day. Any deviation from pure OOTB should be judged on whether it corrects a confounding difference between systems or introduces and magnifies one.
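To see why buffered and durable writes aren't comparable, here's a minimal sketch (the path under /mnt/test is a placeholder for the filesystem being tested) that times the same small writes with and without an fsync() after each one:

```python
# Sketch: show why "durable" and "buffered" writes are not comparable, by
# timing the same small writes with and without an fsync() after each one.
import os, time

PATH = "/mnt/test/durability-probe"   # placeholder on the filesystem under test
WRITES = 200
PAYLOAD = b"x" * 4096

def time_writes(sync):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.perf_counter()
    for _ in range(WRITES):
        os.write(fd, PAYLOAD)
        if sync:
            os.fsync(fd)          # force the data to stable storage
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed / WRITES

if __name__ == "__main__":
    buffered = time_writes(sync=False)   # data may still be only in memory
    durable = time_writes(sync=True)     # data should actually be on disk
    print(f"buffered write: {buffered * 1e6:.0f} us per write")
    print(f"durable write:  {durable * 1e6:.0f} us per write")
```

The gap between those two numbers is the "promise of apples" a non-durable system is quietly selling when it's benchmarked against a durable one.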

Conclusion

Benchmarking software is difficult. Benchmarking storage software is particularly difficult. Very few people get it right. Many get it wrong just because they're not aware of how their decisions affect the results. Even with no intent to deceive, it's easy to run a benchmark and only find out after the fact that what seemed like an innocent configuration choice caused the result to be unrepresentative of any real-world scenario. On the other hand, there often is intent to deceive. With some shame, I have to admit that my colleagues in storage often play a particularly dirty game. Many of them, especially at the big companies, have been deliberately learning and applying all of these dirty tricks for decades, since EMC vs. Clariion (both sides) and NetApp vs. Auspex (ditto). None of them will be the first to stop, for obvious game-theoretic reasons.

I've tried to make this article as generic and neutral as I could, because I know that every accusation will be met with a counter-accusation. That's also part of how the dirty game is played. However, I do invite anyone who has read this far to apply what they've learned as they evaluate recent benchmarks, and reach their own conclusions about whether those benchmarks reveal anything more than the venality of those who ran them.
