One of the luxuries that I’ve gotten used to at some of the places I’ve worked is being surrounded by people who understand scalability. In my recent job hunt and other between-job activities, I’ve been reminded that a great many people in this industry lack such understanding. So, here’s the thing. If you draw a graph of a system’s component (e.g. server) count on the x axis, and aggregate performance on the y axis, then

Scalability is about the slope, not the height.

The line for a scalable system will have a positive slope throughout the range of interest. By contrast, the line for a non-scalable system will level off or even turn downward as contention effects outweigh parallelism. Note that a non-scalable system might still outperform a scalable system throughout that range of interest, if the scalable system has poor per-unit performance. Scalability is not the same as high performance; it enables higher performance, but does not guarantee it.

Similarly, building a scalable system might also increase single-request latency or decrease single-stream bandwidth due to higher levels of complexity or indirection. It would be wrong, though, to dismiss a scalable design based on such concerns. If low single-request latency or high single-stream bandwidth are hard requirements with specific numbers attached, then the more scalable system might not suit that particular purpose, but in the majority of cases it’s the aggregate requests or bytes per second that matter most so it’s a good tradeoff. The key to scalability is enabling the addition of more servers, more pipes, more disks, more widgets, not in making any one server or pipe etc. faster. Can you make better use of a network by using RDMA instead of messaging? Sure, that’s nice, it might even be all that’s needed to reach some people’s goals, but it’s a complete no-op where scalability is concerned. Ditto for “parallel” filesystems that only make data access parallel but do nothing to address the metadata bottleneck.

Scaling – more precisely what the true cognoscenti would recognize as horizontal scaling – across all parts of a system is an important key to performance in scientific and enterprise computing, in the grid and the cloud. That’s why all the biggest systems in the world, from Google and Amazon to Roadrunner and Jaguar, work that way. Anybody who doesn’t grasp that, in fact anyone who hasn’t internalized it until it’s instinct, is not qualified to be designing or writing about modern large computer systems. (Ditto, by the way, for anybody who thinks a hand-wave about their favorite protocol possibly allowing such scalability is the same as that protocol and implementation explicitly supporting it. Such people are frauds, to put it nicely.)

I’ll probably be writing more about scaling issues for a while, for reasons that will soon become apparent. Watch this space.