Yesterday I saw an article at KernelTrap about adaptive readahead for Linux. This is an area of considerable interest to me, for a couple of reasons. The first is that when I joined Conley, before it got absorbed by EMC, my mandate was to design a next-generation storage system. The design I came up with involved three areas of innovation.

  • Cooperative caching, which I went on to develop further as my last project at EMC and still hope to get back to some day.
  • “Micro-RAID”, which slices and dices disks so it can use different attributes (including location) for different parts and then recombine them into user-visible volumes. Lots of virtualization companies, including Conley founder Ric Calvillo’s own Incipient, are doing this nowadays, but I always thought it was the least interesting of the three parts.
  • Intelligent readahead, which did get revived and worked on by other people at EMC with only slight involvement from me.

The kind of intelligent readahead I was looking at was supposed to be more transparent and more general than Wu Fengguang’s version, detecting arbitrary patterns rather than just variations on sequential access and doing so entirely without user intervention or hints. My feeling was that application access patterns often do not reflect “spatial” concerns so much as program logic. An application might access X/Y/Z in sequence not because of anything to do with the on-disk locations of X/Y/Z but because those happen to be the nodes in some internal search tree or because of the order in which it initializes internal components. In fact, with applications being designed and implemented by many people, or using third-party libraries, there might be nobody who really knows what an application’s access pattern will be. This idea came back to me in a powerful way quite by accident, while I was working on HighRoad. We were working with a potential partner who had just such a patchwork application. We became intimately familiar with its access patterns from looking at I/O traces, and became curious. When the developer of one component was contacted, he seemed quite unaware of (and even a bit mystified by) the I/O patterns that we had observed, even though his own code was directly involved in creating them.
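To make “arbitrary patterns rather than just variations on sequential access” a bit more concrete, here is a minimal sketch of one way such detection could work: remember which block tends to follow which, with no notion of adjacency at all. This is purely illustrative and is not the prototype described above. The `SuccessorPredictor` class, its method names, and the block numbers are all hypothetical.

```python
from collections import Counter, defaultdict


class SuccessorPredictor:
    """Hypothetical first-order predictor: learns which block usually follows which."""

    def __init__(self):
        # For each block, count how often every other block came right after it.
        self.successors = defaultdict(Counter)
        self.last = None

    def record(self, block):
        """Feed one observed access into the model."""
        if self.last is not None:
            self.successors[self.last][block] += 1
        self.last = block

    def predict(self, block):
        """Return the most frequently seen successor of `block`, or None if unknown."""
        seen = self.successors.get(block)
        if not seen:
            return None
        return seen.most_common(1)[0][0]


# An X/Y/Z pattern driven by program logic, not by on-disk location.
predictor = SuccessorPredictor()
for block in [10, 500, 42, 10, 500, 42, 10]:
    predictor.record(block)

candidate = predictor.predict(10)  # block worth prefetching after an access to 10
```

Even a toy like this hints at the difficulty mentioned next: if accesses from several threads are interleaved into a single trace, the successor counts get polluted unless each access can be attributed to the thread that issued it.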

Unfortunately, none of this really led to anything like Wu Fengguang’s 30% performance gain. Between trying to be so general and lacking access to some critical information (such as which thread actually did each I/O) the prototype was never really that good at detecting – let alone exploiting – the patterns it was supposed to. WFG’s approach seems to have yielded much more significant practical advantages, for which I applaud him (or her; I’m not good with Chinese names).

One other interesting point did come up in the email thread about this.

Included with the patches were some benchmarks showing an impressive performance boost with the PostgreSQL database, leading Andrew Morton to comment, “these are nice-looking numbers, but one wonders. If optimising readahead makes this much difference to postgresql performance then postgresql should be doing the readahead itself, rather than relying upon the kernel’s ability to guess what the application will be doing in the future. Because surely the database can do a better job of that than the kernel.” It was noted that PostgreSQL developers want to keep their code portable, much more difficult when implementing custom readahead logic for each OS.

The portability and “stick to one’s knitting” arguments are significant, and probably sufficient, but I think even they matter less than one that’s made further down by David Lang.

do you really want to have every program doing it’s own readahead?

Good question. Having multiple parts of a system each trying to be smart like this, when being smart means consuming resources while remaining oblivious to the others, often hurts overall performance and robustness as well. Adding the proper feedback to a “hint” interface like fadvise would make it quite unwieldy and even less portable. Resource management is the kernel’s job. Letting it decide what (if anything) to do with the hints it receives from multiple sources, balancing the resource needs they imply, is definitely the right way to go on something like this.
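For readers who haven’t used the hint interface in question, here is a small sketch of what an fadvise-style hint looks like from the application side, using Python’s `os.posix_fadvise` wrapper (Unix only). The `hint_then_read` helper is a hypothetical name of my own, and the fallback when the call is unavailable or fails is an assumption about how a portable program might behave.

```python
import os


def hint_then_read(path, offset, length):
    """Tell the kernel a read is coming, then do the read.

    Returns (hinted, data). `hinted` only says the advice call was accepted;
    it says nothing about whether the kernel actually prefetched anything.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        hinted = False
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            try:
                os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
                hinted = True
            except OSError:
                pass  # the hint is advisory, so carry on without it
        data = os.pread(fd, length, offset)
        return hinted, data
    finally:
        os.close(fd)
```

Note that the hint is strictly one-way. The application states its intent and the kernel remains free to act on it, defer it, or ignore it depending on what else is competing for memory and I/O bandwidth, which is exactly the division of labor argued for above.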