There are some interesting aspects of working with a 972-node system on a regular basis. (I hardly ever think of the SC5832 as a 5832-processor system; for the stuff I do it’s more often the number of nodes rather than the number of processors that matters.) For one thing, all of that algorithmic-complexity stuff from college has come back with a vengeance. When n=972, the difference between O(n) and O(n²) can be the difference between four seconds and one hour. Even subtler differences can matter. For example, when I was working on high-availability clusters, eight nodes was about as big as things got. At n=8, the difference between O(log(n)) and O(sqrt(n)) is negligible, and with rounding they’re identical. At n=972, sqrt(n) is approximately three times log(n), taking the log base 2. When the difference appears in boot-time coordination or data-structure sizing, you really tend to notice.
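
To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes the logarithm is base 2 and that the O(n) pass is the one taking four seconds at n=972; the function name and the `linear_seconds` parameter are made up for illustration, not taken from any real tool.

```python
import math

def scaling_comparison(n, linear_seconds=4.0):
    """Compare growth rates at node count n.

    linear_seconds is a hypothetical cost for one O(n) pass at this n.
    """
    # An O(n^2) algorithm does roughly n times the work of an O(n) one.
    quadratic_seconds = linear_seconds * n
    log_n = math.log2(n)    # assumption: base-2 logarithm
    sqrt_n = math.sqrt(n)
    return {
        "quadratic_minutes": quadratic_seconds / 60,
        "log2": log_n,
        "sqrt": sqrt_n,
        "sqrt_over_log": sqrt_n / log_n,
    }

small = scaling_comparison(8)
large = scaling_comparison(972)

print("n=8:   log2 =", round(small["log2"], 2), " sqrt =", round(small["sqrt"], 2))
print("n=972: log2 =", round(large["log2"], 2), " sqrt =", round(large["sqrt"], 2))
print("n=972: sqrt/log2 =", round(large["sqrt_over_log"], 2))
print("n=972: O(n^2) pass in minutes =", round(large["quadratic_minutes"], 1))
```

Under those assumptions the quadratic pass comes out at about 65 minutes, and sqrt(n) is a little over three times log(n) at n=972, while at n=8 both round to 3.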

Besides the obvious difference in timing, algorithmic differences can often determine whether something works at all. The performance vs. load curve for many, if not most, programs tends to peak at some point and then drop (not level off) after that due to various forms of thrashing. Worse, many programs’ behavior will degrade very quickly once they get behind by a certain amount, as code that was only intended to handle transient load spikes gets exercised for a high continuous load. We’ve found several commonly used utilities and libraries that either crash or slow down so much that they might as well have crashed in our environment. Even well-written code usually needs at least a little bit of tweaking to handle the numbers of connections and messages that we can throw at it almost accidentally.

Another fun difference at this scale is that every single boot of the system exercises your code a thousand times, and in fact might stress some parts (e.g. communications) far more. This means that bugs that happen only one time in a thousand will show up on some node in the system on most boots, so you can reproduce one in under ten minutes, whereas a single machine booting every five minutes would take (on average) about three and a half days and in normal usage might take years. This has its downside, especially if one node failing holds up the boot for the rest, but it’s kind of nice in terms of smoking out all those deadlocks and race conditions and stuff inherited from programmers who didn’t know how to avoid them.
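
For the curious, the odds can be sketched as follows. This assumes each node hits the bug independently with probability 1/1000 per boot, which is a simplification (a path that gets stressed harder, like communications, would do worse); the function names are hypothetical.

```python
def p_hit_per_boot(nodes, p_bug):
    """Chance that at least one node trips the bug during one system boot."""
    # Assumption: nodes fail independently with the same probability.
    return 1 - (1 - p_bug) ** nodes

def expected_minutes_single_machine(p_bug, minutes_per_boot):
    """Average wait for the first hit on one machine that reboots in a loop."""
    # The number of boots until the first hit is geometric with mean 1 / p_bug.
    return minutes_per_boot / p_bug

p_system = p_hit_per_boot(nodes=972, p_bug=1 / 1000)
single_minutes = expected_minutes_single_machine(p_bug=1 / 1000, minutes_per_boot=5)
single_days = single_minutes / (60 * 24)

print("chance per system boot:", round(p_system, 3))
print("expected system boots until first hit:", round(1 / p_system, 2))
print("single machine, expected days:", round(single_days, 2))
```

With those inputs, a one-in-a-thousand bug shows up on roughly 62% of system boots, so a couple of boots is usually enough to catch it, against about 5000 minutes of continuous rebooting on a single machine.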

Fun fact: my boss has estimated that every time we boot an SC5832 we’re probably booting more Linux-MIPS machines than will be booted that day in the rest of the world. Frightening thought, huh?