Yesterday, Joel Spolsky wrote about how Ruby is slow. One of his key theories is summed up by the following quote.

Without knowing much about the implementation of Ruby, I would guess that the biggest issue is around late binding and especially duck typing, which prevents type inference or strong typing, which means that function calls will always be slow because you can never get something compiled down to the point where a function call is just a single CALL instruction (on x86)… you always have to be exploring the object, possibly even scanning a hash table to find the function you want to call. Calling methods on objects is really, really, really, really common, especially in pure OO languages, and that bit of the code needs to get down to at least a CALL indirect on the CPU for Ruby to offer the same performance as languages where the type of the object can be determined at compile time and therefore the instruction where you’re jumping to can be gotten either at compile time (like in C) or with a single indirection through a single vtable (like in C++).

“Duck typing” is a bit of a sore spot for a platypus, for which LayEgg() works but FlapWings() doesn’t, but Joel’s usually wrong about stuff (like thinking that good developers care more about Aeron chairs than a good salary as though they don’t know their market value and can’t buy their own darn chairs if they’re paid appropriately) so I didn’t think much of it. Then he posted a link to an article about how to make method dispatch faster and that piqued my interest. So, naturally enough, I decided to write a little test program. Basically it tries a bunch of different kinds of method dispatch, doing many method calls and using the cycle counter for timings. The dummy method being called doesn’t really do anything, but does have a standard prologue and epilogue. Here are some results.

  • Direct call (unconditional always-taken jump). This was just to get a baseline, and yielded about 10 cycles per iteration.
  • Dispatch table containing just the target address (indirect always-taken jump). This caused the time to go up to 14 cycles per iteration.
  • Dispatch table containing a JMP instruction to the target (two always-taken jumps). Back down to 12 cycles.
  • Avi Bryant’s method (unconditional always-taken jump, compare, conditional jump). Back up to 14 cycles. Yes, this was worse than the previous well-known and flexible method.
  • Conditional before the jump (compare, conditional jump). Either 12 or 14 cycles, depending on whether it was coded so that the positive or negative branch was actually taken each time.

Observation one is that Avi’s trick doesn’t actually seem beneficial, and that’s before you consider that it relies on compiling the target function in a special way. Compiling the caller differently is at least as good, and the second dispatch-table method is even better. Observation two is that none of this really matters that much. Two cycles? Give me a break. I’ve done plenty of code that really did have to count cycles, and that kind of thing does have its place, but this isn’t it. For any higher-level language, many other internal aspects of how the compiler and run-time libraries work will be much more important than the method-dispatch differences observed so far.

Observation three is the killer. I don’t think Avi’s technique really works for true duck typing. I started to realize this as I was writing the test program, and I started wondering what condition the target routine should test. Avi says it’s just a CMP, but comparing what to what? In a truly duck-typed language the target function is going to have a pointer to an object, and then it has to figure out whether it’s the right method for that object. Even in a statically typed OO language that’s going to involve a bit of pointer chasing to find the right vtbl and so on. In a duck-typed language, where the actual object might have absolutely no inheritance relationship with the class on which the guessed method is defined, it’s trickier. At the very least you’ll have to find the hash table, find the index that would be used if this were the correct method, and do two compares – one for the method-name pointers and one for the method-body pointers. This is the price you pay for the flexibility of duck typing, but it still doesn’t validate Joel’s theory. Even with two compares, method dispatch is not likely to be the problem here.