Last week, the geek community discovered the “memory wall”: the point at which adding more processing power to a chip (these days in the form of more processor cores) fails to improve performance because memory bandwidth becomes the limiting factor. This phenomenon has been known in some circles for ages, of course. Matt Reilly has been talking about it for years, and I think he has some graphs showing how memory bandwidth per core has actually gotten worse lately as the number of cores per chip has increased (but I’m too lazy to find them right now). In any case, now Ars Technica and Slashdot readers know about the memory wall too, thanks to a recent Sandia report showing performance actually declining beyond ~8 cores per chip.

What it all comes down to, really, is that if you want systems to get faster you have to add more than one kind of capacity. Adding more CPU cycles doesn’t help if you don’t also add more memory, and more bandwidth to that memory. Many would argue that you need to add communications and I/O capacity as well, but I’ll leave that alone for a moment. One approach the young ’uns at Ars and /. pounced on was having chips share each other’s memory controllers in a NUMA system. The problem is, we’ve already been down that path. I was there, hardly the first, and I had lots of company. NUMA is great, but I don’t think it’s going to solve this particular problem. For one thing, the problem is in the ratio between cores and memory controllers, and NUMA doesn’t change that ratio. A system that’s already hitting the memory wall with four-core chips isn’t going to get any better if you add more four-core chips, and might even get worse due to thrashing.
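The ratio argument is easy to see with some back-of-the-envelope arithmetic. Here’s a minimal sketch with made-up numbers (the 10 GB/s controller and four-core chip are hypothetical, not figures from the Sandia report): each chip brings one memory controller along with its cores, so adding chips scales bandwidth and cores together and leaves bandwidth per core flat.

```python
def bandwidth_per_core(chips, cores_per_chip=4, controllers_per_chip=1,
                       gbps_per_controller=10.0):
    """Aggregate memory bandwidth divided evenly across all cores.

    All parameters are illustrative assumptions, not measured values.
    """
    total_bandwidth = chips * controllers_per_chip * gbps_per_controller
    total_cores = chips * cores_per_chip
    return total_bandwidth / total_cores

# Adding more identical chips, NUMA-style, doesn't change the ratio:
print(bandwidth_per_core(chips=1))  # 2.5 GB/s per core
print(bandwidth_per_core(chips=8))  # still 2.5 GB/s per core
```

If each core needs more than that 2.5 GB/s to stay busy, no number of additional identical chips fixes it; only changing the controllers-per-core ratio does.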

The other problem is that NUMA systems never scaled well beyond 32-64 nodes or so, even when nodes were only borrowing from each other’s caches, and I haven’t heard of any fundamental breakthroughs that would change that. What people are talking about nowadays is even harder: nodes borrowing each other’s memory controllers. I’m no chip designer, but my impression is that holding resources on both sides of the transaction for that long (this is a long path in CPU terms), plus all the logic needed to coordinate the two sides through every possible sequence of events, is likely to make life rather difficult for those chip designers.

Last time around, most of the people who actually worked on this stuff eventually reached the conclusion that, for performance, reliability, and cost reasons, explicit communication between processors was preferable to the implicit kind that occurs when you put all processors into one cache/memory address space. Think Beowulf. Think MPI. Think shmem and UPC and GASNet and all the rest. It means programmers have to re-think some of the ways they write code, but this model is already de rigueur on all of the truly big systems solving the world’s hardest computational problems. If we really want to crack the memory wall, we shouldn’t repeat the mistake of building hideously expensive memory systems to prop up a programming model that will fail anyway; that mistake has already killed plenty of companies. Instead, we should build components and systems to support the more mature programming model(s) that professional parallel programmers are already using.
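To make the distinction concrete, here’s a minimal sketch of the explicit-communication style, in the spirit of MPI’s send/recv but written with Python’s multiprocessing module rather than actual MPI (the worker function and pipe setup are illustrative, not any particular library’s API). Each process owns its data outright; nothing crosses process boundaries except a deliberate message.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    # Each process computes entirely on its own private data; there is
    # no shared address space and no implicit cache traffic between peers.
    local_sum = sum(range(100))
    # Results cross the boundary only via an explicit message.
    conn.send(local_sum)
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    # The receive is explicit too: the programmer knows exactly when
    # and what data moves, which is the whole point of this model.
    print(parent_conn.recv())  # 4950
    p.join()
```

The same structure maps directly onto MPI_Send/MPI_Recv across machines; the programmer does more bookkeeping, but every byte of communication is visible and schedulable instead of being an invisible cache-coherence side effect.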