Small Synchronous Writes

Sometimes people ask me why I always use small synchronous writes for my performance comparisons. Surely (they say), there are other kinds of operations that are more common or more important. Yes there are (I say), and don't call me Shirley. But seriously, folks, there are definitely other kinds of performance that matter. The problem is that they just don't tell you much about what makes two distributed filesystems different. I'll try to explain why.

Let's start with read-dominated workloads. It's well known that OS (and app) caches can absorb most of the reads in a system. This was the fundamental observation behind Seltzer et al.'s work on log-structured filesystems all those years ago. Reads often take care of themselves, so at the filesystem level it makes sense to focus on writes. Caching matters at least as much in distributed filesystems, where the latencies are higher. The primary exception to this rule is large sequential reads, but those tend to become bandwidth-bound very quickly, and just about every distributed filesystem I've ever seen can easily saturate whatever network connections you have for such workloads. Boring. Between these two effects, it just turns out that read-dominated workloads aren't all that interesting.
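If you want to see the cache effect for yourself, here's a quick standalone sketch (not one of my actual test tools; the file name and buffer size are just placeholders) that reads the same file twice and times each pass. The first pass has to fetch the data from disk or over the network; the second is served almost entirely out of the OS page cache.

    /* Quick illustration: time two passes over the same file.  The second
     * pass is absorbed by the page cache, which is the whole point. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    static double read_all(const char *path)
    {
        char buf[1 << 16];
        struct timespec t0, t1;
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); exit(1); }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (read(fd, buf, sizeof(buf)) > 0)
            ;   /* discard the data; we only care about the timing */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile";
        printf("cold read: %.3fs\n", read_all(path));  /* may hit disk or network */
        printf("warm read: %.3fs\n", read_all(path));  /* served from the page cache */
        return 0;
    }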

Why not different kinds of writes? Mostly because large and/or asynchronous writes tend to follow the same patterns as large reads. Once you have the opportunity to batch and/or coalesce writes, effectively eliminating the effect that network latency might have on most of them, it becomes pretty easy to fill the pipe with huge packets. Boring again. It's important to measure how well the servers handle parallelism among many requests that are still kept separate, but that's a whole different thing. If both reads and large/async writes are uninteresting, what does that leave? Small sync writes, of course.
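For the record, here's roughly what I mean by a small synchronous write. This is a minimal sketch rather than my actual benchmark, and the record size, count, and file name are arbitrary placeholders. The point is that O_SYNC forces every record to reach stable storage (and any replicas) before the next write can start, so there's nothing for the client to batch or coalesce.

    /* Minimal sketch of a small-synchronous-write test: append fixed-size
     * records with O_SYNC so each write must be durable before it returns. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define RECORD_SIZE 4096        /* one small 4 KB record per write */
    #define NUM_RECORDS 1000

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "syncwrite.dat";
        char buf[RECORD_SIZE];
        struct timespec t0, t1;

        memset(buf, 'x', sizeof(buf));
        /* O_SYNC means each write() pays the full latency; nothing can be
         * batched or coalesced on the client side. */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NUM_RECORDS; i++) {
            if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                perror("write");
                return 1;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d x %d-byte sync writes in %.3fs (%.1f IOPS)\n",
               NUM_RECORDS, RECORD_SIZE, secs, NUM_RECORDS / secs);
        return 0;
    }

Drop the O_SYNC flag and run it again, and the numbers change dramatically; that difference is exactly why the synchronous case is the interesting one.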

While I'm here, I might as well address a couple of other issues. One is the question of scale. Does a test with a single client and a single server (or replica pair, if replicating) really tell us anything useful about filesystems that are designed to have many servers? I think it does, for a certain class of such filesystems. In a system that uses algorithmic placement, such as GlusterFS or Ceph, an individual request really will hit only the servers the algorithm selects for it, and performance really will scale pretty linearly until you start hitting the network's scaling limits. It absolutely makes sense to test the network in the context of an actual deployment, but in the context of evaluating technologies the performance of a single server (or replica pair) does work as a proxy for the performance of N. That doesn't mean you should obsess over micro-optimizations or implementation concerns that don't have much measurable effect (e.g. kernel vs. FUSE clients); it's really the data flow and algorithmic efficiency that matter most. This argument doesn't work nearly as well for more outdated architectures that use directory-based placement, such as HDFS or Lustre. In those cases, the need to go through the MDS or NameNode or whatever really does create a bottleneck that limits system-wide scaling. That's something to consider when you're looking at such systems.
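To make the algorithmic-placement point concrete, here's a toy sketch. It bears no resemblance to the real CRUSH or GlusterFS DHT code, and the hash, server count, and replica count are all made up. The idea is just that the client can compute a file's servers from its name alone, so each request touches only those servers and no central metadata service sits in the data path.

    /* Toy algorithmic placement: derive a file's servers from its name. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SERVERS 16
    #define REPLICAS    2

    /* FNV-1a hash of the file name; any well-mixed hash would do */
    static uint64_t hash_name(const char *name)
    {
        uint64_t h = 14695981039346656037ULL;
        while (*name) {
            h ^= (unsigned char)*name++;
            h *= 1099511628211ULL;
        }
        return h;
    }

    static void place(const char *name, int out[REPLICAS])
    {
        uint64_t h = hash_name(name);
        for (int i = 0; i < REPLICAS; i++)
            out[i] = (int)((h + i) % NUM_SERVERS);  /* primary plus the next server */
    }

    int main(void)
    {
        const char *files[] = { "alpha.log", "beta.db", "gamma.img" };
        for (int i = 0; i < 3; i++) {
            int servers[REPLICAS];
            place(files[i], servers);
            printf("%-10s -> servers %d and %d\n", files[i], servers[0], servers[1]);
        }
        return 0;
    }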

Lastly, what about metadata operations? File creation and directory listings are even worse than writes, aren't they? Yes, absolutely, they are. Testing only data operations is kind of a bad habit among filesystem folks, and I'm guilty too. I really should test and report on those things too, even though it probably means developing even more tools myself because the existing tools are even worse for that than they are for testing plain old reads and writes.
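If I ever get around to it, a metadata test would start with something as simple as this sketch: create a batch of empty files and report creates per second. The directory, file-name pattern, and count are placeholders, and a real test would also need to cover listings, stats, renames, and deletes.

    /* Rough sketch of a file-creation metadata test. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define NUM_FILES 1000

    int main(int argc, char **argv)
    {
        const char *dir = argc > 1 ? argv[1] : ".";
        char path[4096];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NUM_FILES; i++) {
            snprintf(path, sizeof(path), "%s/meta_%d", dir, i);
            int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
            if (fd < 0) { perror("open"); return 1; }
            close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d creates in %.3fs (%.1f creates/sec)\n",
               NUM_FILES, secs, NUM_FILES / secs);
        return 0;
    }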

To make a long story . . . no longer, if not actually short, I've found that testing small synchronous writes is simply the best place to start. It's the first result to look at, but absolutely not the only one. If I were actually looking to deploy a system myself, I'd try all sorts of workloads at the same scale as the deployment itself, or as close to it as I could get, and I'd show everyone a detailed report. On the other hand, when I'm doing the tests on my own time and at my own expense (in a public cloud) for a blog post or presentation, that's quite a different story.
