Last night, after working around the latest piece of gcc nastiness in my stack ripper, I got to wondering whether the whole effort was really worthwhile. Compiling the ripped code is a bit of a pain, and the result is very hard to recognize or follow in a debugger. Why not just use plain old threads? It’s a very reasonable question, and I’d have to say that in many cases people should use plain old threads. However, there are reasons why stack-ripping might be more appropriate in at least some situations.

  • One thread per connection is simply too limiting. Any serious generic server infrastructure must be prepared to track context per request, not per connection, and handle very large numbers of concurrent requests.
  • In a 1:1 threading model based on OS threads, each thread has to have its own kernel stack and other associated data structures, which becomes a serious issue when the number of threads tracks the number of outstanding requests. There is also a non-trivial performance penalty for scheduling and switching among that many kernel threads, no matter how hard the Linux folks pat themselves on the back for their supposed brilliance.
  • A lightweight N:1 user-level threading model, such as one might implement using the ucontext or signal-stack methods (reference), doesn’t have the same performance problems, and it does allow the user to set the per-thread stack size, so it might seem like an ideal approach (there’s a minimal ucontext sketch after this list). However, the stack size in any threading model needs to cover not only the frames of your own functions but also those of any library functions a thread might call. Do you know how much stack space printf – which you might well be tempted to use for debugging output – consumes, for example? Chances are you’d be surprised. The stack overhead for library functions is problematic both because it’s unknown and because it’s potentially too large to be acceptable for very large numbers of threads. Ripped code associates the internal context with the request, but borrows the stack of the executing thread for library calls.
  • N:1 thread models are also incapable of using multiple processors effectively, whereas it’s trivial for ripped code to do so.
  • In some environments, such as kernels or embedded systems, the sorts of stack manipulation necessary to implement any kind of threading might not be possible or safe. It has always been one of my primary goals to have a method that works even in those environments; that’s definitely true of the stack ripper but highly questionable for threading.

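To make the stack-size point concrete, here’s a minimal sketch of an N:1 fiber built on ucontext. This isn’t from the ripper or any particular library; it assumes a POSIX system that still provides makecontext/swapcontext, and the 16 KB figure and all of the names are purely illustrative. The thing to notice is that the caller has to pick the stack size up front, and every library call the fiber makes (even printf) has to fit inside that fixed allocation.

    /* Minimal N:1 fiber sketch using ucontext.  Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    #define FIBER_STACK_SIZE (16 * 1024)   /* hypothetical per-fiber stack */

    static ucontext_t main_ctx, fiber_ctx;

    static void fiber_body(void)
    {
        /* Even this printf eats an unknown chunk of the 16 KB stack. */
        printf("hello from the fiber\n");
        /* "Yield" back; a real scheduler would resume the fiber later. */
        swapcontext(&fiber_ctx, &main_ctx);
    }

    int main(void)
    {
        char *stack = malloc(FIBER_STACK_SIZE);
        if (!stack)
            return 1;

        /* The caller chooses the stack size -- and lives with the choice. */
        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = stack;
        fiber_ctx.uc_stack.ss_size = FIBER_STACK_SIZE;
        fiber_ctx.uc_link = &main_ctx;
        makecontext(&fiber_ctx, fiber_body, 0);

        /* Run the fiber until it yields back to us. */
        swapcontext(&main_ctx, &fiber_ctx);
        free(stack);
        return 0;
    }

The ripped equivalent keeps its per-request state in a heap-allocated structure and makes any library calls on whatever OS thread happens to be running it, so there’s no fixed per-request stack to size at all.
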
As I said in my main server-design article, “mildly parallel” applications aren’t really my primary concern, and models which only work on one processor are lame. For those kinds of applications it hardly matters what your infrastructure looks like. For a really serious server, though, the stack-ripping approach still seems to have an advantage. It can use multiple processors, its resource usage is conservative enough that it scales to very high concurrency levels, it preserves the essential benefit of expressing complex logic as sequential code instead of breaking it up into a plethora of event handlers with explicit state, and it’s still applicable to kernel or embedded code. The only real drawbacks are that the generated code is hard to read and hard to step through in a debugger. The first shouldn’t be an issue, because only a compiler should really need to read it anyway. With regard to the second, I offer an observation: if you can afford to sit in a debugger and step through code, you’re obviously not dealing with a timing problem or worried about performance. In that case, you can run the original unripped version of your code in a threading system; the block() pseudo-function that the ripper uses to recognize blocking points is very closely analogous to a yield() call in a user-level threading library (there’s a rough sketch of the correspondence below). Once you’re done debugging your code that way, you can go back to building the semantically identical ripped version for performance.
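
For what it’s worth, here’s a hand-written sketch of the correspondence I have in mind. The unripped form reads as straight-line code with block() marking the wait points; the ripped form is roughly the shape a ripper might produce, with the former locals moved into a per-request structure and each block() turned into a resume point. The struct and function names are hypothetical, and the real ripper’s output certainly differs in detail.

    /* Hand-sketched illustration, not actual ripper output. */
    #include <stdio.h>

    /* Unripped form: straight-line code, block() marks the wait points
     * (much like yield() in a user-level threading library). */
    #if 0
    void handle_request(request_t *req)
    {
        read_header(req);
        block();                 /* wait for header bytes */
        read_body(req);
        block();                 /* wait for body bytes */
        send_reply(req);
    }
    #endif

    /* Ripped form: locals live in a per-request struct, and each block()
     * becomes a resume point.  Any OS thread can run the next step,
     * borrowing that thread's stack for library calls. */
    typedef struct request {
        int step;                /* which resume point comes next */
        /* ...former local variables would live here... */
    } request_t;

    /* Returns 1 while the request still has more steps to run. */
    int handle_request_step(request_t *req)
    {
        switch (req->step) {
        case 0:
            printf("reading header\n");   /* stand-in for read_header() */
            req->step = 1;
            return 1;                     /* was: block() */
        case 1:
            printf("reading body\n");     /* stand-in for read_body() */
            req->step = 2;
            return 1;                     /* was: block() */
        case 2:
            printf("sending reply\n");    /* stand-in for send_reply() */
            return 0;                     /* request complete */
        }
        return 0;
    }

    int main(void)
    {
        request_t req = { 0 };
        while (handle_request_step(&req))
            ;   /* an event loop or thread pool would drive this instead */
        return 0;
    }

Because each step is just an ordinary function call against a heap-allocated request, any thread in a pool can pick it up, which is why using multiple processors is trivial for the ripped version.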