Here’s the next section of my ongoing server-architecture article.

Whereas everyone thinks it’s obvious that data copies are bad, I’m often surprised by how many people totally ignore the effect of context switches on performance. In my experience, context switches are actually behind more total “meltdowns” at high load than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it’s totally obvious what causes excessive context switching. The #1 cause of context switches is having more active threads than you have processors. As the ratio of active threads to processors increases, the number of context switches also increases – linearly if you’re lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it’s (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor’s worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.
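To make the thread-to-processor ratio concrete, here is a minimal Python sketch that caps a worker pool at the processor count instead of spawning one thread per connection. The `handle_request` function is a hypothetical stand-in for real request processing, not anything from a particular server:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def handle_request(data):
    # Hypothetical placeholder; a real server would parse and serve here.
    return data.upper()

# Size the pool to the processor count, not the connection count, so the
# number of active threads never exceeds the number of processors.
max_workers = os.cpu_count() or 1
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    results = list(pool.map(handle_request, ["ping", "stat"]))
```

Any number of connections can feed work into a pool like this; the thread count, and hence the context-switch rate, stays bounded no matter how many connections arrive.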

The first thing that a “thread-frugal” program has to do is figure out how it’s going to make one thread handle multiple connections at once. This usually implies a front end that uses select/poll, asynchronous I/O, signals or completion ports, with an event-driven structure behind that. Many “religious wars” have been fought, and continue to be fought, over which of the various front-end APIs is best. Dan Kegel’s C10K paper is a good resource in this area. Personally, I think all flavors of select/poll and signals are ugly hacks, and therefore favor either AIO or completion ports, but it actually doesn’t matter that much. They all – except maybe select() – work reasonably well, and don’t really do much to address the matter of what happens past the very outermost layer of your program’s front end.
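As one illustration of such a front end, here is a minimal echo-server sketch using Python’s `selectors` module, which wraps whichever of epoll/kqueue/poll/select the platform provides. The dispatch structure – one thread multiplexing many connections and handing ready events to callbacks – is the point, not the particular API:

```python
import selectors
import socket

# One thread multiplexes every connection through the OS readiness API.
sel = selectors.DefaultSelector()

def handle(conn):
    # A ready connection: echo its data back, or clean up on EOF.  A real
    # server would dispatch an event into the layers behind the front end.
    data = conn.recv(4096)
    if data:
        conn.sendall(data)
    else:
        sel.unregister(conn)
        conn.close()

def accept(server):
    # The listening socket is ready: register the new connection.
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def serve_once():
    # Process one batch of ready events; a real server would loop forever.
    for key, _ in sel.select(timeout=1.0):
        key.data(key.fileobj)
```

Each registered socket carries its own callback (`accept` or `handle`) as the selector’s data field, so the event loop itself stays a two-line dispatcher.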

The simplest conceptual model of a multi-threaded event-driven server has a queue at its center; requests are read by one or more “listener” threads and put on queues, from which one or more “worker” threads will remove and process them. This is a fine conceptual model, but all too often people actually implement their code this way. Why is this wrong? Because the #2 cause of context switches is transferring work from one thread to another. Some people compound the error even further by requiring that the response to a request be sent by the original thread – guaranteeing not one but two context switches per request. It’s very important to use a “symmetric” approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.
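Here is one way the symmetric approach might look, sketched in Python. Each thread takes a turn holding a “listener” lock, pulls one request from the event source (a plain queue stands in for it here), then releases the lock and processes that request itself instead of handing it to another thread:

```python
import queue
import threading

# Symmetric threads: whichever thread holds listen_lock is "the" listener.
# It pulls one request, releases the lock, and processes that request
# itself, so no hand-off (and no extra context switch) is needed.
listen_lock = threading.Lock()
requests = queue.Queue()   # stands in for the real event source
results = queue.Queue()

def symmetric_thread():
    while True:
        with listen_lock:            # take a turn as listener
            req = requests.get()     # blocks until an event arrives
            if req is None:          # shutdown sentinel, one per thread
                return
        results.put(req * 2)         # now act as worker, outside the lock

threads = [threading.Thread(target=symmetric_thread) for _ in range(4)]
for t in threads:
    t.start()
```

While one thread is busy “working,” the listener lock is free, so another thread steps into the listener role – all threads are interchangeable, which is the essence of the symmetric design.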

Usually, it’s not possible to know how many threads will be active even one instant into the future. After all, requests can come in on any connection at any moment, or “background” threads dedicated to various maintenance tasks could pick that moment to wake up. If you don’t know how many threads are active, how can you limit how many are active? In my experience, one of the most effective approaches is also one of the simplest: use an old-fashioned counting semaphore which each thread must hold whenever it’s doing “real work”. If the thread limit has already been reached then each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked in this way they won’t continue contending for resources until one of the existing threads “retires”, so the system effect is negligible. More importantly, this method handles maintenance threads very gracefully.
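The counting-semaphore gate might be sketched like this in Python. The bookkeeping counters exist only to demonstrate that the cap holds; real code would simply do the work inside the gate:

```python
import os
import threading

# A counting semaphore sized to the processor count caps how many threads
# are doing "real work" at once; surplus threads block briefly at the gate.
MAX_ACTIVE = os.cpu_count() or 1
active_gate = threading.Semaphore(MAX_ACTIVE)

count_lock = threading.Lock()
current = 0        # threads inside the gate right now (for demonstration)
peak = 0           # highest value current ever reached

def do_real_work():
    global current, peak
    with active_gate:              # blocks once MAX_ACTIVE threads are busy
        with count_lock:
            current += 1
            peak = max(peak, current)
        # ... the actual request processing would happen here ...
        with count_lock:
            current -= 1

threads = [threading.Thread(target=do_real_work) for _ in range(MAX_ACTIVE * 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A maintenance thread needs no special treatment: it acquires the same semaphore before doing its work, so it counts against the same limit as every listener-turned-worker thread.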

Savvy readers might already have started to wonder how much what I’ve described differs from Matt Welsh’s SEDA, which is already considered a leading-edge architecture. The answer is: not much. I know I’ve been using most of the approaches discussed here since well before I heard of SEDA, and I’m almost equally sure that Matt Welsh never heard of any of my work, but we clearly are concerned with many of the same problems and have reached many of the same conclusions. Think of it as convergent evolution. Matt, if you’re out there, drop me an email. I’d love to get your input on this article. The points I’d like to make regarding SEDA are as follows:

  1. SEDA has a focus on separating a server into multiple stages, beyond just listener and worker. While I don’t think this is necessary from a performance standpoint, it’s an excellent idea for other reasons such as locking (which I’ll get to in a little bit) and maintainability. Once you’ve implemented a two-stage multi-dispatch event-based system with concurrency control – what I’ve just described – generalizing it to an N-stage system is easy and can be highly beneficial.
  2. SEDA’s one significant flaw, in my opinion, is that it allocates a separate thread pool to each stage with only “background” reallocation of threads between stages in response to load. The #1 and #2 causes of context switches as noted above still apply in this case.
  3. In general I don’t like systems that overuse queues. I’ve said it before and I’ll say it again: putting something on a queue and then immediately taking it right back off is silly. While I can appreciate the convenience and clarity of the queues-everywhere model, the silliness is something I’d rather avoid.
  4. In the context of an academic research project, implementing SEDA in Java might make sense. In the real world, though, I think the choice can be characterized as unfortunate.
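To illustrate the first point, generalizing to N stages can be as simple as carrying each request through an ordered list of stage functions – in a single thread, to preserve the symmetric approach. The stage names below are illustrative placeholders, not from SEDA or any real server:

```python
# Each request flows through every stage in order, carried by one thread,
# so the staging buys structure and maintainability without hand-offs.
def parse(req):
    return req.split()

def validate(tokens):
    if not tokens:
        raise ValueError("empty request")
    return tokens

def execute(tokens):
    return " ".join(t.upper() for t in tokens)

STAGES = [parse, validate, execute]

def run_pipeline(req):
    for stage in STAGES:
        req = stage(req)
    return req
```

Queues between stages remain an option where a stage genuinely needs to run elsewhere, but nothing forces one – which avoids the put-it-on-a-queue-and-take-it-right-back-off silliness noted above.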