Parade of FAIL

Jeff Darcy November 4, 2009 12:47

One of the items that popped up in my morning scan of the news was a list of top failures in computing. The list is a bit of a FAIL itself, so I’ll continue the train of thought here. The thing about failure is that it can be instructive – often more instructive than success, which probably says something interesting about human psychology but I don’t know what. In fact, many of the things I’ll mention here aren’t really failures in the sense that they were misguided or doomed from the start. Some of them were seminal ideas and great successes in their time, but times changed and now it’s time to move on. Here are some of my suggestions, but first I have to make some exemptions because otherwise the list will just be too big.

  • Networking has generated a particularly impressive string of dead bodies – DECnet, NetWare, OSI, X.25, ATM, FDDI, token ring, on and on and on. How many routing protocols have risen and then fallen again? How many TCP speed tweaks and congestion-control methods? How many autoconfiguration and NAT-traversal methods? How many firewall technologies? That will just have to be a separate list.
  • I’m not even going to start on web/dot-com failures. There are whole blogs, much busier than this one, devoted to that.

On with the real list, after the fold.

  • Microsoft Bob. ‘Nuff said, but any list of computing failures has to include it. I’ll throw in Windows ME as a bonus.
  • Pen computing. That’s a stalwart in these lists too.
  • Proprietary UNIX. Xenix and A/UX in the commodity world. Original Solaris and AIX, Solaris, and IRIX on workstations. Ultrix and Digital UNIX, Dynix, UMAX, DG/UX on servers. Too many others to name. All dead, mostly killed by Linux. For all of its immense size and tremendous vigor, I’m still not sure the Linux community can match the sum of everything that used to go on in this area, and the remaining alternatives (mostly *BSD) barely keep real competition alive. I don’t for a moment long for a return to things being more proprietary, but I do miss some of the diversity that came from having truly separate communities.
  • Sun. A lot of the technology will live on, and in some cases deservedly so, but the company itself ranks among the biggest failures ever. If you think that’s harsh, be glad I edited out the stuff that was in this space before. I might still post it if someone comes here from Sun and gets all whiny, though.
  • RISC. “WTF?!?” I hear everyone say. I know, I know. RISC was a paradigm-changing innovation. Computer Architecture: A Quantitative Approach is one of the best technical books ever written. Hennessy and Patterson are among my idols. How can I say RISC was a failure? Well, look at the market. Only IBM and ARM even compete with the uber-ugly x86 CISC architecture outside of the embedded space. RISC went from being an architecture to being an implementation technique a while ago; we all know by now that what really gets scheduled and executed inside an x86 is RISC instructions. As an architecture, RISC lives on in the embedded space, and sort of inside GPUs, but it has failed in the general market.
  • Big SMP. There used to be tons of big shared-memory machines. Now there are almost none, because the complexity (and hence cost) of the memory system to support them just became unsustainable. Now all the world’s massively parallel, with explicit communication or at least an explicit non-transparent memory hierarchy. Tilera puts more cores on a single chip than Sequent and Encore could ever put inside a refrigerator-sized machine, and those processors are much faster too, but once you go off-chip it’s all clusters. Now ScaleMP and 3leaf claim to be able to build big shared-memory systems again because the interconnect bandwidth is there, but (a) that’s not true except at small scale – it’s an n^2 problem and 32 or 64 machines is useful but not big any more – and (b) the interconnect wasn’t the only problem anyway. The other problem is programmers, who will always abuse the facility to reach out and touch any piece of memory any time by actually doing so – constantly, inappropriately, in ways that result in contention and thrashing and terrible system/application performance. Better programming models and better compiler/framework support (from the PGAS languages to Cilk and such) have helped, but once you have those you can do everything you need to on a massively parallel system. There’s more SMP in a GPU than in these cobbled-together big-SMP systems, and they manage it by having an explicit memory hierarchy. Don’t NUMA me; NUMA is about non-uniform access times, but here the access methods are non-uniform as well. Explicit knowledge of what went where, whether that knowledge resides in the programmer’s head or the compiler/runtime’s, is the way to go.
  • RAID. Another WTF, I’m sure. RAID originally meant Redundant Array of Inexpensive Disks, and was touted as an alternative to Single Large Expensive Disk. Again, it was a fantastic idea for its time. When RAID was no longer so inexpensive the marketrons rewrote history so that RAID now stands for Redundant Array of Independent Disks, but some of us remember. Of course, nowadays storage from the likes of EMC or Hitachi is more SLED than RAID – it’s large, it’s expensive, and even if it’s not single disks it often seems that way from a deployment or business perspective. Many of the originally proposed RAID levels are long gone. RAID-5 and RAID-6 live on, but are increasingly recognized as inadequate ways of protecting data given the current ratio between disk capacities and speeds. I’ve already written about “RAID-Z” so don’t start. People who really care about data protection and understand the current tradeoffs have turned to plain old mirroring (which works because disks are so cheap), distributed “shared nothing” filesystem or database architectures (which protect against server as well as disk failure without incurring large costs in extra hardware), Reed-Solomon coding, and other approaches.
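
To put a rough number on the capacity-versus-speed point in the RAID item, here is a minimal sketch of the best-case time to read or rewrite a whole disk, which is a floor on how long a rebuild leaves the array exposed. The disk figures are assumed round numbers chosen for illustration, not measurements, and `full_pass_hours` is a hypothetical helper name.

```python
def full_pass_hours(capacity_gb: float, throughput_mb_s: float) -> float:
    """Best-case hours to read or rewrite an entire disk sequentially.

    Uses decimal units (1 GB = 1000 MB) and ignores competing I/O,
    so a real rebuild would take at least this long.
    """
    seconds = (capacity_gb * 1000.0) / throughput_mb_s
    return seconds / 3600.0


# Assumed, illustrative figures only: (label, capacity in GB, sustained MB/s).
ASSUMED_DISKS = [
    ("older, smaller disk", 36, 30),
    ("newer, larger disk", 2000, 100),
]

if __name__ == "__main__":
    for label, capacity_gb, throughput in ASSUMED_DISKS:
        hours = full_pass_hours(capacity_gb, throughput)
        print(f"{label}: at least {hours:.2f} hours per full pass")
```

Under these assumptions capacity grows far faster than throughput, so the window during which a second failure can destroy a RAID-5 set keeps widening.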

If we look at these failures, and the great successes that can now enjoy an honorable retirement as well, a couple of patterns start to emerge. One is of smaller pieces, joined together at a higher conceptual level. Often that means more software and less hardware, as with Big SMP vs. clustering or RAID vs. distributed storage. It could also mean something like eventual consistency and client-based reconciliation gaining favor over strong consistency (even that provided by software) in “Not Only SQL” data stores at an even higher conceptual level. There’s a corresponding trend in software development becoming harder, as every real programmer has to grapple directly with concurrency issues and memory/communication hierarchies. The “program as cookie recipe” metaphor no longer applies except for beginners; “program as busy restaurant” (with five cooks and ten servers) is more like it. Even corporate and business structures are affected, with monolithic procedure-bound entities (including formal standards bodies) being out-maneuvered by looser associations and communities of smaller companies and individuals.

Small pieces, loosely joined – only it’s not just the web. It’s evident even in the pieces that make up the web, and in things far removed from the web. The future is fine-grained and distributed, not monolithic or hierarchical.

6 Responses to “Parade of FAIL”

  1. John Cook on 04 Nov 2009 at 6:48 pm

    Not only are RAID disks not “inexpensive,” they’re not independent either, not in the sense of independent probabilities of failure.

  2. Wes Felter on 04 Nov 2009 at 9:01 pm

    Pen computing is back, except now the computer fits inside the pen and the tablet is made of paper. A few people at the office are using these to digitize notes. It’s probably still a gimmick, though.

  3. lacos on 05 Nov 2009 at 9:57 am

    I’ve read that RISC and CISC are continuously alternating, or more exactly, higher and lower levels of abstractions, both in software and hardware, fluctuate.

    If you have a nice abstraction, like CISC, and its specification matches exactly what a user wants to do, then it’s more efficient than hand-called RISC, because the abstract->small-pieces translation happens at a lower level, thus less round-trips.

    If the abstraction leaks or doesn’t match exactly what a user tries to do, then users will start to abuse the high-level API.

    I think the example I’ve read about was GPU’s (mis)used for non-rendering, generic computational tasks, and conversely, high-performance general purpose CPU’s used for graphical rendering, for greater freedom of expression. CPU’s were too slow, thus a few primitives were crystallized and pushed down into hardware (GPU’s). Now, people started wanting to tap that vector architecture for other purposes (coveting an access to the lower levels of the GPU), while simultaneously finding the primitives too restrictive even for graphics, with powerful CPU’s being available.

    I know about these topics only superficially as you can probably tell, but the conversation I seem to remember went something like this.

    Second, the market is fashion driven and just plain unreasonable. If I get your point, you say that big SMP is disappearing rightfully, for the reasons (1) its complexity makes it too expensive in the market, and (2) programmers fail at programming it (the abstraction leaks and one should handle node distances explicitly, for example). This might be the case, but why didn’t the market reward SiCortex then? I may be a bit off here, but I believe SiCortex mainly supported MPI (which appears to be the correct approach with the current swing of “RISC” in parallel programming), and it would have been cheaper as well in the long run, both for being very energy-efficient and more useful to program (more explicit, but with less obscure bugs). Still, the market didn’t keep it alive.

    Sorry, this is wildly incoherent, and I apologize for that. I can’t express it better for now, I just feel that the market is a fickle mistress and no phenomenon can be justified by it.

  4. Jeff Darcy on 05 Nov 2009 at 10:32 am

    Good points about RISC/CISC and CPU/GPU going back and forth.

    “why didn’t the market reward SiCortex then?”

    The market seemed quite willing to; it’s the investors who failed. No matter how good a startup’s technology is, they face an uphill battle in the marketplace. Time after time, we’d hear prospective customers say they loved the product, they loved the people they’d worked with through the evaluation period, but they were afraid to spend that much for a system from a small company that might not survive. Their prophecy of failure turned out to be self-fulfilling, just as a prophecy of success would have been. If those same customers had evaluated products purely on the merits, many of them would have bought our systems and we would have been fine. Building up that trust requires demonstrating that the original technical success is repeatable, but building that second generation – which we were doing – takes more time and money than early-adopter customers put in – hence the need for later-stage investment. More here, and it was a similar story at Revivio and many other places. The lesson from SiCortex is not a technical one, but that startups should worry about their investors’ viability as much as the other way around.

    Getting back to the topic, it’s important to note that all of the very biggest systems in the world – e.g. Roadrunner, Jaguar, Intrepid – are based on essentially the same kind of non-shared-memory structure as SiCortex. The Blue Gene and PowerXCell8i systems from IBM are particularly interesting in this context because they use explicit memory hierarchies even within nodes. Shared memory is great up to its scaling limits, but its scaling limits in the real world are determined by programmer behavior. That behavior is typically to abuse the sharing so that the amount of coherency traffic rises as (approximately) the square of the processor count, rapidly overwhelming even the interconnect resources that are available on-chip. So far the only viable solution has been to constrain use of the interconnect by adopting models like MPI. Hadoop manages to do something like that across larger systems that share storage rather than memory, where similar problems also occur. Perhaps something like Cilk or a distributed Grand Central Dispatch can do that for larger systems that share memory. It’s still an active area of research, and I wish those researchers well.
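
As a rough illustration of the "square of the processor count" remark above, here is a toy model, not a description of any real coherence protocol. It assumes the worst case the comment describes: every processor writes once per round to a line that all the others have cached, so each write must invalidate every other copy. The function name `invalidations_per_round` is hypothetical.

```python
def invalidations_per_round(processors: int) -> int:
    """Toy worst case: each processor writes once to a line cached by all
    the others, so each write triggers (processors - 1) invalidations."""
    return processors * (processors - 1)


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32, 64):
        total = invalidations_per_round(n)
        print(f"{n:3d} processors -> {total:5d} invalidations per round")
```

In this model doubling the processor count roughly quadruples the traffic, which is why constraining sharing (as message-passing models do) matters more than raw interconnect bandwidth.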

  5. lacos on 05 Nov 2009 at 7:53 pm

    Great answer, thanks! The part on the self-fulfilling prophecy was particularly creepy.

  6. lacos on 23 Nov 2009 at 6:50 am

    “Shared memory is great up to its scaling limits, but its scaling limits in the real world are determined by programmer behavior. That behavior is typically to abuse the sharing so that the amount of coherency traffic rises as (approximately) the square of the processor count, rapidly overwhelming even the interconnect resources that are available on-chip. So far the only viable solution has been to constrain use of the interconnect by adopting models like MPI.”

    Not to contradict or anything — I’m not even sure if this could be better used as a PRO or CON argument for your point — I’ll just add this as something possibly (hopefully!) relevant:

    http://lacos.hu/lbzip2-scaling/scaling.html
