About eight years ago, I wrote a series of posts about server design, which I then combined into one post. That was also a time when debates were raging about multi-threaded vs. event-based programming models, about the benefits and drawbacks of TCP, and so on. For a long time, my posts on those subjects constituted my main claim to fame in the tech-blogging community, with the server-design article as the centerpiece of that set, until more recent posts on startup failures and CAP theorem and language wars started reaching an even broader audience. Now some of those old debates have been revived, and Matt Welsh has written a SEDA retrospective, so maybe it’s a good time for me to follow suit and see what I and the rest of the community have learned since then.

Before I start talking about the Four Horsemen of Poor Performance, it’s worth establishing a bit of context. Processors have actually not gotten a lot faster in terms of raw clock speed since 2002 – Intel was introducing a 2.8GHz Pentium 4 then – but they’ve all gone multi-core with bigger caches and faster buses and such. Memory and disk sizes have gotten much bigger; speeds have increased less, but still significantly. Gigabit Ethernet was at the same stage back then that 10GbE is at today. Java has gone from being the cool new kid on the block to being the grumpy old man the new cool kids make fun of, with nary a moment spent in between. Virtualization and cloud have become commonplace. Technologies like map/reduce and NoSQL have offered new solutions to data problems, and created new needs as well. All of the tradeoffs have changed, and of course we’ve learned a bit as well. Has any of that changed how the Four Horsemen ride?

With regard to data copies, I think we’ve lost ground. More people now realize that data copies are bad, but with processors and memory being so much faster they seem less inclined to do anything about it. Many “modern” languages have absolutely atrocious support for the kind of efficient buffer-list methods I recommended. Immutable-data languages inevitably force programmers to copy data into new variables where once they would have updated in place. Sometimes they even force the new variable to be local in a new function called only for that purpose. You can argue all you like about the concurrency or robustness advantages of the immutable-data approach, you can argue that good programmers won’t “subvert” your favorite language that way, but the fact is that real-world programmers do engage in such subversion and it does carry a performance cost. Even if much of that cost is ameliorated by using reference counts instead of true copies, it’s still less efficient than modifiable buffer chains.

The context-switch issue is the one being debated most nowadays, but surprisingly little has actually changed. Pretty much all of what I said earlier – about the ratio of threads to processors, about single-threaded approaches being beneath contempt, about coroutines giving all of the headaches of concurrency with none of the advantages – remains true. I guess that’s not all that much of a surprise, since I’d been working on multiprocessors for years when I wrote the article, even though they weren’t fully mainstream yet. Sure, the scalability of native threads has improved a lot, most notably in Linux, but context switches still aren’t free. In a multiprocessor system, you also have the problem of resuming a thread on a different processor with a stone-cold cache. Some people who have just heard of cohort scheduling think it provides a silver-bullet solution, but it really doesn’t. If you want to worry about cache warmth, you have to think about three kinds of cached data.

  • Instructions (yes, even with a unified I+D cache)
  • The actual data being operated on
  • The secondary data representing request state, global/persistent data structures, etc.

Cohort scheduling mostly addresses the first of these, and somewhat the third. Inevitably, by preserving locality in these cases it tends to make things a bit worse for the second – and largest – category of cache contents. The paper was written in the same era as my own post. It made a certain amount of sense in the context of the type/size/speed of caches in use at the time, and the loads being placed on them. Does it make an equal amount of sense in today’s context? Maybe. Sometimes. There’s no silver-bullet solution here. If you really want to optimize for cache warmth, you’ll still have to think hard about what data is being cached, how it’s moving between cache levels, and what program structures will create access patterns that minimize that movement.

My conclusion, just as before: by all means use multiple native threads, and use multiple coroutines on top of those if it suits you, but use both judiciously and pragmatically. Above all, try to program in a way that maximizes your flexibility to adjust the balance between event-based and multi-threaded and coroutine-based approaches in your program, as the tradeoffs and the program itself continue to change.

On the last two issues – memory allocation and lock contention – there has also been little change. Memory allocation is becoming a bit of a hot issue as more people realize that their oh-so-convenient languages and frameworks are in some cases creating/deleting thousands of objects per request. That’s just obscene. The best response is to avoid the overhead altogether, going as far up the stack as you have to, but for those unwilling to do so there’s still ample opportunity to make those wasteful operations less costly. Similarly, there have been some advances in avoiding lock contention – most notably wider knowledge and acceptance of the actor model – but it remains a very thorny problem that I’d need a whole separate post to address.

Perhaps the biggest change I’d make to the original post, if I were writing it today, is to the grab-bag of items at the end. For example, many people are coming up with code to exploit the vast differences in latency and throughput between sequential and random disk I/O. Log-structured filesystems are relevant again, even if many insist on embedding ad-hoc, informally specified, bug-ridden, slow implementations (with apologies to Greenspun) in other systems before they’ve even read up on the original. There’s a fifth horseman here: failure to account for the motion of data between levels of the storage hierarchy, or to deal properly with the fundamental differences between those levels. Think about the following differences.

  • Cache: extremely fast; seemingly byte-addressable but really more alignment-sensitive than people think; very limited size
  • Memory: slower than cache, but faster than storage by an even greater degree; byte-addressable
  • Direct-connected SSD (e.g. Fusion): very alignment-sensitive; low, flat latency, but some artifacts due to wear leveling and garbage collection
  • External SSD: a tiny bit slower than direct-connected, but otherwise similar
  • Single disk: slower, with more variable latency due to seeks (few people can deal with seek issues well enough to notice rotational latency any more); internal caches which affect both durability and performance ratios
  • Disk array: faster than a single disk, but with different ratios for read vs. write and sequential vs. random; cache is usually battery-backed and large enough that writes very rarely have to wait (but reads often do)

That’s a lot of differences, which can be hard to account for. Failing to understand these different characteristics and the consequences of translating between them is what’s behind misunderstandings about caches and cohort scheduling, behind those loathsome memory-only “data stores” and even behind a lot of SSD mania. In many cases this data motion is the primary determinant of overall program performance. It needs to be carefully managed to ensure that data will be where you want it, when you want it, with a minimum of time spent playing “musical chairs” with it.