One of the problems with measuring and comparing performance of scalable systems is that any workload capable of producing meaningful results is going to be highly multi-threaded, and most developers don't know much about how to collect or interpret the results. After all, they hardly ever get any training in that area, and many of the tools don't exactly make it easy (as we'll see in a moment). Considering all the effort spent on complex ways to define the input workload - some tools have entire domain-specific languages for this - you'd think that some effort might have been spent on making the output more meaningful. You'd be wrong.
To see how easy it is to be misled, and how badly, let's consider a simple example. You have a storage system capable of sustaining 1000 IOPS. A single I/O thread can generate a load of 1000 IOPS. What happens when you run four of those?
Scenario 1: the storage system effectively delivers 250 IOPS per thread, continuously. Therefore they each report 250 IOPS, you add those up, and you get a correct sum of 1000 IOPS.
Scenario 2: the storage system effectively serializes the four threads. Thread A completes in one second, reporting 1000 IOPS. Thread B completes in two seconds - the first second sitting idle - and reports 500 IOPS. Threads C and D complete in three and four seconds respectively, reporting 333 and finally 250 IOPS. Add them all up and you get the wildly wrong sum of 2083 IOPS.
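The arithmetic of both scenarios is easy to replay as a quick sketch - each thread issues 1000 I/Os against a system that sustains 1000 IOPS in total:

```python
ios_per_thread = 1000

# Scenario 1: fair sharing -- every thread runs for 4 s at 250 IOPS,
# so the per-thread sum happens to come out right.
fair = [ios_per_thread / 4.0] * 4
print(sum(fair))  # 1000.0

# Scenario 2: serialized threads -- thread N finishes after N seconds,
# but each divides its 1000 I/Os by its own elapsed time.
serialized = [ios_per_thread / t for t in (1, 2, 3, 4)]
print([round(r) for r in serialized])  # [1000, 500, 333, 250]
print(round(sum(serialized)))          # 2083 -- wildly wrong

# Measuring the whole group instead: 4000 I/Os over 4 wall-clock
# seconds gives the correct answer in either scenario.
print(4 * ios_per_thread / 4.0)        # 1000.0
```

The per-thread sum is only trustworthy when every thread spans the same wall-clock interval, which is exactly what Scenario 2 violates.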
The mistake in the second scenario seems obvious when described this way, but I've seen smart people make it again and again and again over the years. One way to avoid it is not to trust reports from individual threads, but to measure the start and end times for the whole group. Unfortunately, you can miss a lot of useful information that way. Most importantly, a single slow worker can drag the entire average down, and unless you're paying pretty close attention you won't even notice that the I/O rate for most of the threads, most of the time, was far higher. Dean and Barroso call this the latency tail, and it's significant in operations as well as in measurement.
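To see how a single straggler poisons the group-wide measurement, consider a hypothetical run (the numbers here are invented purely for illustration): four workers each issue 1000 I/Os, three at 1000 IOPS and one at 100 IOPS.

```python
# Each of four workers issues 1000 I/Os. Three finish in 1 s; one
# straggler takes 10 s. (Numbers invented for illustration.)
ios_per_worker = 1000
finish_times = [1.0, 1.0, 1.0, 10.0]

# Group-wide measurement: total work over the run's wall-clock time.
wall = max(finish_times)
print(4 * ios_per_worker / wall)  # 400.0 -- the straggler drags it down

# Per-worker rates tell the real story: three workers ran 10x faster.
print([ios_per_worker / t for t in finish_times])
# [1000.0, 1000.0, 1000.0, 100.0]
```

The group-wide number (400 IOPS) is arithmetically correct, but it says nothing about the fact that three quarters of the workers ran at 1000 IOPS and then sat idle.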
Another way to avoid the original over-counting problem is "stonewalling" - a term and technique popularized by iozone. This means stopping all threads when the first one finishes - i.e. "first to reach the stone wall" - and collecting the results even from threads that were stopped prematurely. This does avoid over-counting, but it can distort results in even worse ways than the previous method. It fundamentally means that your workers didn't do all of the I/O you meant them to do; they only would have if they had all proceeded at exactly the same pace. If you meant to do more I/O than will fit in cache, or on disks' inner tracks, too bad. If you wanted to see the effects of filesystem or memory fragmentation over a long run, too bad again. The slightest asymmetry in your workers' I/O rates will blow all of that away, and what storage system doesn't present some such asymmetry? None that I've ever seen. Worst of all, as Brian Behlendorf mentions, this approach doesn't even solve the single-slow-worker problem.
"The use of stonewalling tends to hide the stragglers effect rather than explain or address it."
In other words, iozone's stonewalling is worse than the problem it supposedly solves. Turn it off. If you want to see what's really happening to your I/O performance, the solution is neither of the above. Measuring just a start and end time, per worker or per run, is insufficient. To see how much work your system is doing per second, you have to look at each second. Such periodic aggregation can not only give you accurate overall numbers and highlight stragglers, but it can also show you information like:
- Performance percentiles (per thread or overall)
- Global pauses, possibly indicating outside interference
- Per-thread pauses, e.g. due to contention/starvation
- Mode switches as caches/tiers are exhausted or internal optimizations kick in
- Cyclic behavior as timers fire or resources are acquired/exhausted
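A minimal sketch of the idea, assuming each worker records a timestamp per completed I/O (the helper name `per_second_report` and the data shapes are my own invention, not from any existing tool):

```python
import math
from collections import defaultdict

def per_second_report(completions):
    """completions: iterable of (worker_id, timestamp) pairs, with
    timestamps in seconds from the start of the run. Buckets the
    completions into one-second intervals, per worker."""
    buckets = defaultdict(lambda: defaultdict(int))  # second -> worker -> count
    for worker, t in completions:
        buckets[math.floor(t)][worker] += 1
    for sec in sorted(buckets):
        counts = buckets[sec]
        print(f"t={sec}s total={sum(counts.values())} per-worker={dict(counts)}")
    return buckets

# Replay of Scenario 2: worker N's 1000 I/Os all complete during second N.
events = [(w, w + i / 1000.0) for w in range(4) for i in range(1000)]
per_second_report(events)
# Every second totals 1000 IOPS -- the correct aggregate -- and each
# line shows a different worker doing all the work, so the
# serialization that per-thread averages hid is now plainly visible.
```

From the same per-second buckets you can also derive percentiles, spot global or per-thread pauses (empty or missing buckets), and watch rates shift as caches fill.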
This is all really useful information. Do any existing tools provide it? None that I know of. I used to have such a tool at SiCortex, but it was part of their intellectual property and thus effectively died with them. Besides, it depended on MPI. Plain old sockets would be a better choice for general use. Reporting from workers to the controller process could be push or pull, truly periodic or all sent at the end (if you're more concerned about generating network traffic during the run than about clock synchronization). However it's implemented, the data from such a tool would be much more useful than the over-simplified crap that comes out of the current common programs. Maybe when I have some spare time - more about that in a future post - I'll even work on it myself.