As promised, here’s some video.
Yesterday I saw an article at KernelTrap about adaptive readahead for Linux. This is an area of considerable interest to me, for a couple of reasons. The first is that when I joined Conley, before it got absorbed by EMC, my mandate was to design a next-generation storage system. The design I came up with involved three areas of innovation.
- Cooperative caching, which I went on to develop further as my last project at EMC and still hope to get back to some day.
- “Micro-RAID” which slices and dices disks so it can use different attributes (including location) for different parts and then recombine them into user-visible volumes. Lots of virtualization companies, including Conley founder Ric Calvillo’s own Incipient, are doing this nowadays but I always thought it was the least interesting of the three parts.
- Intelligent readahead, which did get revived and worked on by other people at EMC with only slight involvement from me.
The kind of intelligent readahead I was looking at was supposed to be entirely transparent and more general than Wu Fengguang’s version, detecting arbitrary patterns rather than just variations on sequential access and doing so entirely without user intervention or hints. My feeling was that application access patterns often do not reflect “spatial” concerns so much as program logic. An application might access X/Y/Z in sequence not because of anything to do with the on-disk locations of X/Y/Z but because those happen to be the nodes in some internal search tree or because of the order in which it initializes internal components. In fact, with applications being designed and implemented by many people, or using third-party libraries, there might be nobody who really knows what an application’s access pattern will be. This idea came back to me in a powerful way quite by accident, while I was working on HighRoad. We were working with a potential partner who had just such a patchwork application. We became intimately familiar with its access patterns from looking at I/O traces, and became curious. When the developer of one component was contacted, he seemed quite unaware of (and even a bit mystified by) the I/O patterns that we had observed, even though his own code was directly involved in creating them.
Unfortunately, none of this really led to anything like Wu Fengguang’s 30% performance gain. Between trying to be so general and lacking access to some critical information (such as which thread actually did each I/O) the prototype was never really that good at detecting – let alone exploiting – the patterns it was supposed to. WFG’s approach seems to have yielded much more significant practical advantages, for which I applaud him (or her; I’m not good with Chinese names).
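To make the “arbitrary patterns” idea concrete, here is a minimal sketch of one way such a detector could work: count which block tends to follow which, and predict the most frequent successor. This is not the prototype described above; the class name `SuccessorPredictor`, the `min_count` threshold, and the block numbers in the trace are all made up for illustration. It also shows the weakness mentioned above: with no idea which thread issued each I/O, interleaved streams would simply blur the counts.

```python
from collections import Counter, defaultdict


class SuccessorPredictor:
    """Predict the next block from observed (previous -> next) transitions."""

    def __init__(self, min_count=2):
        # transitions[a][b] = number of times block b was read right after a
        self.transitions = defaultdict(Counter)
        self.min_count = min_count  # hypothetical confidence threshold
        self.last = None

    def record(self, block):
        """Note one access and update the transition counts."""
        if self.last is not None:
            self.transitions[self.last][block] += 1
        self.last = block

    def predict(self, block):
        """Return the most frequent successor of block, or None if unsure."""
        followers = self.transitions.get(block)
        if not followers:
            return None
        candidate, count = followers.most_common(1)[0]
        return candidate if count >= self.min_count else None


# Hypothetical trace: the application keeps visiting 907 -> 12 -> 4410,
# which has nothing to do with where those blocks sit on disk.
trace = [907, 12, 4410, 55, 907, 12, 4410, 8, 907, 12, 4410]
predictor = SuccessorPredictor()
for blk in trace:
    predictor.record(blk)

next_after_907 = predictor.predict(907)  # a repeated, non-sequential hop
next_after_55 = predictor.predict(55)    # seen only once, so no prediction
```

A real implementation would also have to bound the table size and decide how much to prefetch on a prediction, which is where the resource questions start.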
One other interesting point did come up in the email thread about this.
Included with the patches were some benchmarks showing an impressive performance boost with the PostgreSQL database, leading Andrew Morton to comment, “these are nice-looking numbers, but one wonders. If optimising readahead makes this much difference to postgresql performance then postgresql should be doing the readahead itself, rather than relying upon the kernel’s ability to guess what the application will be doing in the future. Because surely the database can do a better job of that than the kernel.” It was noted that PostgreSQL developers want to keep their code portable, much more difficult when implementing custom readahead logic for each OS.
The portability and “stick to one’s knitting” arguments are significant, and probably sufficient, but I think even they matter less than one that’s made further down by David Lang.
do you really want to have every program doing it’s own readahead?
Good question. Having multiple parts of a system each trying to be smart like this, when being smart means consuming resources while remaining oblivious to the others, often hurts overall performance and robustness as well. Adding the proper feedback to a “hint” interface like fadvise would make it quite unwieldy and even less portable. Resource management is the kernel’s job. Letting it decide what (if anything) to do with the hints it receives from multiple sources, balancing the resource needs they imply, is definitely the right way to go on something like this.
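For reference, this is roughly what a one-way hint looks like from the application side. It is a minimal sketch using Python’s `os.posix_fadvise` wrapper around the POSIX call; the helper name `hint_willneed` and the temporary file are just for illustration. Note that the call only says “I expect to need this range soon”: nothing comes back about whether the kernel acted on it, and the wrapper is not available on every platform, which is the portability cost in miniature.

```python
import os
import tempfile


def hint_willneed(fd, offset, length):
    """Tell the kernel we expect to read a byte range soon.

    Returns True if the hint was issued, False if the platform lacks the
    call or rejects it. It says nothing about what the kernel then did.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # no such call on this platform
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    except OSError:
        return False  # the kernel or filesystem declined the hint
    return True


# Illustration only: a scratch file standing in for a database segment.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 65536)
    tmp.flush()
    issued = hint_willneed(tmp.fileno(), 0, 65536)
    tmp.seek(0)
    first_bytes = tmp.read(16)  # the read works the same either way
```

The application states what it expects and the kernel remains free to weigh that against everything else competing for memory and disk bandwidth.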
Today is Cindy’s and my tenth wedding anniversary. Wow. Ten years ago I was 31 and she was … younger than that. ;) We had fairly recently moved to Newton, which seems like a distant memory from here in Lexington. I had recently started a new job at Dolphin, to be followed by Mango, Conley/EMC, and Mariko/Revivio. Cindy had recently started at FASTech, to be followed by Lotus/IBM and motherhood. I was still playing volleyball. We had both been reintroduced to the joys of hiking and camping by our mutual friend Scott, but had not yet summited more than a handful of the New Hampshire 4000-footers of which Cindy now has 40+ out of 48 and I have a few less. Amy was only part of an incompletely imagined future – how incompletely we had yet to appreciate – and did not even have a name. Ditto for her cousins Eli and Oliver, and many other new arrivals among our friends. Others have grown up, moved on, passed on, married, divorced, become sick, recovered, received degrees or other honors, changed cities and careers, and made almost every other kind of change imaginable. I don’t feel I’ve changed that much, though it wouldn’t surprise me if someone disagrees, and I don’t think Cindy has either. She’s still the woman I love, still funny and warm and sane and smart and all those other things that led me to propose in the first place.
For those few who haven’t heard the story, it was on top of Mount Chocorua, which is one of the more scenic peaks in the White Mountains. It was Labor Day. Cindy had told me in a previous conversation (half-way down Cadillac Mountain in Acadia National Park) that if I ever proposed on a mountain it had darn well better be at the top, but I was counting on her having forgotten that. I took advantage of the fact that we often go at different speeds on steep rocky scrambles, such as at the top of Chocorua, to get to the top well ahead of her. I then changed into an extra shirt that I had brought, combed my hair, etc. When Cindy joined me we sat for a while. She was probably just enjoying the view and catching her breath; I was going slightly nuts. You’re probably expecting something dramatic here, but that’s really not my style. I didn’t unfurl a “Marry Me” flag or have a ring flown in by helicopter or anything like that. I just asked. She said yes, after what was probably only a moment but seemed like quite long enough to me thankyouverymuch. Then we had someone take a picture, which we should find when I get home, hung around a while longer, and headed down. Needless to say, we had a lot to talk about while we walked. The rest, as they say, is history.
So, that story told, what better way to celebrate than with pictures of one of that marriage’s happiest outcomes? Obviously I mean Amy, so here she is.
Does this sound familiar?
The web is no longer a category that’s useful to lump together with all other sorts of businesses. Prudent advice for getting ready to produce real widgets is likely to be exactly the opposite of what’s sensible for starting a new web service. The cost structure is entirely different, the agility is entirely different, and the priorities should be totally different too.
Sounds a lot like the “old business models don’t apply” self-delusion that led to dot-BOMB, doesn’t it? Here’s a hint for all those who spend too much time in echo chambers like 37signals congratulating each other for being so much like themselves: it’s still business. Sooner or later your money will still have to come not from your own pocket and not from investors but from actual customers who have money but are wary of all this “net changes everything” arrogance. That’s especially true the second time around, whether it’s the same snake-oil salesmen coming around again or a new batch. They’re still going to want actual products and services and commitments in return for paying you. The cost structure is a little different, but not much, and your “agility” is merely a function of being unencumbered by actual engagement with the aforementioned paying customers. It will fade as you move out of your parents’ basement into the real world.
The “everything old is new again” aspect of this really hit me when I saw in the comments that DropSend was being used as an example of how far web-based service delivery has come. It seems to be the brainchild of one Ryan Carson – another cheerleader for the Fubar 2.0 business model. The problem is, it mirrors both the functionality and the “feel” of a company called click2send that I worked with back in 2000. They also wanted to solve the barely-real “firewalls eat large email attachments” problem, but there were a couple of differences:
- Click2send seemed to have some idea about making a pitch to businesses that might actually have both money and the problem they claim to solve, whereas DropSend seems to rely mostly on targeting individual users with the old “sign up for a free trial” bait-and-switch.
- Click2send had transparent email integration, while DropSend doesn’t look like much more than a web interface to an old-fashioned FTP drop.
Yeah, some progress, huh? This is like having the same bad dream two nights in a row. Wake me when this one’s over.
Yes, I’m still alive, and I know May has been the slowest month here in quite some time. The simple fact is that I haven’t had much to say. Part of the reason is that I’ve been incredibly busy at work. Those of you who’ve worked in the software industry are undoubtedly familiar with the idea of “crunch time” and this has been beyond what I normally think of as crunch time. Working nights and weekends once in a while is almost normal, but forgoing sleep and family time and any other kind of respite is something else. In addition to spending more time at work, it has been more intense time than usual. It’s like the difference between taking a couple of short breaks during an hour-long workout (something else I haven’t done in a while) and just pounding straight through at maximum intensity for the entire time . . . then doing the same thing all day, and then the next day, and so on. I don’t recall ever being quite so acutely aware of how short a day is, or seriously worrying about whether I was safe to drive to/from work because of exhaustion.
Obviously I’ve been able to take a breather long enough to write this, but now duty calls again. That next bit of code isn’t going to write itself. There is light at the end of the tunnel and I’m sure I’ll be back soon, but I haven’t even had the energy to queue up anything to write about when the crisis is over so it might be a while before the site gets rolling again. Please bear with me.
The machine formerly known first as “precious” and then as “vilya” has undergone quite a few changes. A while back, I upgraded the hard disk from a 40GB Seagate Barracuda to a 120GB Samsung Spinpoint. Then, on my birthday, I transplanted it to a new Antec NSK2400 desktop-style case, so now it looks like this (it’s the larger one in the middle of the picture). This past weekend, I replaced the CPU, motherboard, and memory. The CPU is an Athlon64 3000+, which is the baby of the “Venice” family. The motherboard is an MSI K8NGM2-FID, which has just about everything. The onboard video (GeForce 6150) is better than what I had and it also has onboard Firewire, so that let me eliminate two PCI cards. It has PCI Express and SATA, which I’m not even using, but those plus the newer CPU socket and the two still-empty memory slots give me upgrade options in several different directions. At this point the only thing that hasn’t changed is the CD-ROM, but Windows still seems to recognize it as the same machine.
Installing the hardware was easy; getting the software to work was a pain. Windows wouldn’t boot, or even crash properly, so I now have a fresh install. On Linux I had to give up on the open-source Ethernet driver and use the vendor’s, which is going to be a pain every time I update the kernel. I can’t even be bothered upgrading to the accelerated video drivers since everything works fine using the “vesa” XFree86/Xorg driver. Since audio and video operations were the main reason for the upgrade, I timed some operations before and after. The result is that the system is about 50% faster than it used to be, which is actually a bit less than I had expected but still OK. Also, I can overclock by about 20% with some simple BIOS settings (CPU temperature is still only about 30C) so it’s roughly an Athlon64 3500+ except that it’s $80 (40%) cheaper. That makes it the fastest system I have regular access to, displacing the systems at work which previously would have held that title. Also, with Cool’n’Quiet turned on and smart fan control – this is apparently one of the only motherboards that has managed to get that working properly with CnQ – the system’s actually quieter than it was before.
Faster, quieter, and not all that expensive. I’d call that a success.
Sigh. I guess he’ll just never get it. Linus is off railing against microkernels again, even though it seems he has never worked on what he’s criticizing and isn’t aware of current reality. Mach (specifically as it came from CMU and not in later forms) is a convenient whipping boy for every OS programmer who’s afraid of change, but it’s not at all representative of what microkernels today can do and have been doing for ages. Here’s the part where Linus goes most completely off into the weeds.
The fundamental result of access space separation is that
you can’t share data structures. That means that you can’t
share locking, it means that you must copy any shared data,
and that in turn means that you have a much harder time
handling coherency. All your algorithms basically end up
being distributed algorithms.
And anybody who tells you that distributed algorithms
are “simpler” is just so full of sh*t that it’s not even
Where to begin? Let’s start with his claim about copying data. What needs to be copied is information, not data, and the information needed to satisfy a single request is often much less than the whole set of data related to that request. For example, if you want to read a file remotely, you don’t need to pass your entire file data structure. You can pass a handle which maps easily to a corresponding structure at the other end (plus offset and length and so on). The amount of actual data copying that needs to be done in a microkernel or distributed system is much less than Linus seems to think.
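A minimal sketch of the handle idea, assuming a toy in-process “server” rather than any real microkernel or network file system API (the `FileServer` class and its methods are hypothetical). The file structure lives only on the server side; the request that crosses the boundary is a small tuple of handle, offset, and length, and the only bulk data copied is the bytes actually asked for.

```python
import itertools


class FileServer:
    """Toy server that owns the file structures; clients only see handles."""

    def __init__(self):
        # handle -> file contents (stand-in for the server-side structure)
        self._open_files = {}
        self._next_handle = itertools.count(1)

    def open(self, contents):
        """Register a file and hand back a small integer handle."""
        handle = next(self._next_handle)
        self._open_files[handle] = contents
        return handle

    def read(self, handle, offset, length):
        """Look up the structure by handle and return only the requested bytes."""
        data = self._open_files[handle]
        return data[offset:offset + length]


server = FileServer()
handle = server.open(b"The quick brown fox jumps over the lazy dog")

# The whole request is three integers, not the file structure itself.
request = (handle, 4, 5)
reply = server.read(*request)
```

Here `reply` is just the five requested bytes; the 43-byte “file” and whatever bookkeeping surrounds it never leave the server.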
Secondly, sharing data structures and locking is often something you shouldn’t be doing anyway. Besides the obvious implication for correctness when you have more code sharing the same data structures, there can actually be performance implications. More code locking the same data structures means more contention for those locks. Linus quite correctly points out that the performance issues in microkernels are less about implementation details than about the algorithms that they impose on you, which is really just an example of the old “optimize algorithms, not code” from your first computer-science class and applies to monolithic kernels as well, but somehow he doesn’t apply that lesson to locking. One of the classic ways to deal with excess lock contention is to reduce sharing by bringing some data “out from under” the lock, which is exactly the same thing that microkernels force you to do.
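As a small illustration of bringing data “out from under” a lock, here is a sketch of a sharded counter. Instead of every thread fighting over one lock around one shared integer, each worker updates its own shard under its own lock, and the shards are combined only when someone asks for the total. The `ShardedCounter` name and the choice of four shards are assumptions for the example, not anything from a real kernel.

```python
import threading


class ShardedCounter:
    """Counter split into shards so writers rarely contend on the same lock."""

    def __init__(self, shards=4):
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [0] * shards

    def increment(self, worker_id):
        # Each worker maps to one shard, so distinct workers usually
        # take distinct locks instead of one global lock.
        i = worker_id % len(self._counts)
        with self._locks[i]:
            self._counts[i] += 1

    def total(self):
        # Reading is the rare, slower path: visit every shard in turn.
        result = 0
        for i, lock in enumerate(self._locks):
            with lock:
                result += self._counts[i]
        return result


def worker(counter, worker_id, n):
    for _ in range(n):
        counter.increment(worker_id)


counter = ShardedCounter(shards=4)
threads = [
    threading.Thread(target=worker, args=(counter, w, 1000)) for w in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

grand_total = counter.total()
```

The price is that the total is assembled from separate pieces rather than read from one shared word, which is the same trade a design with separated components makes as a matter of course.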
Lastly, as regards distributed algorithms, even Linus has observed (elsewhere) that the world is moving more and more toward networks and clusters instead of separate standalone systems. Nobody’s claiming that distributed algorithms are simpler, but they’re unavoidable. They’re part of what every programmer has to know to be effective in the modern programming world, and an operating system which is already designed as separate components is very much easier to make distributed. If Linus wants to worry about memory copying, he should think about how Linux limits process migration to a coarse-grained and heavyweight checkpoint model.
Anybody who has ever done distributed programming should
know by now that when one node goes down, often the rest
comes down too. It’s not always true (but neither is it
always true that a crash in a kernel driver would bring
the whole system down for a monolithic kernel), but it’s
true enough if there is any kind of mutual dependencies,
and coherency issues.
Now Linus is getting bogged down in exactly those implementation issues he said we should avoid. Anybody who has ever become good at distributed programming should know by now that with competent design and implementation the frequency with which one node takes down others can be very low. It becomes very much the exception, not the rule; with monolithic kernels the phenomenon of one component bringing everything down remains the rule, not the exception. Just because the fault-propagation rates in question are neither 0% nor 100% does not mean they’re equal, so Linus’s attempt to paint the problem the same way for distributed systems and monolithic kernels is specious.
Projects such as L4Linux or QNX have already shown that one can gain (at least some of) the robustness and/or distribution benefits of microkernels with minimal impact on performance. That’s manifest reality. For Linus to continue repeating claims that conflict with reality says far more about him and his “leadership” than about reality.
On Saturday the three of us went to Walden Pond, to soak up some of the ambience before the whole place turns into a total zoo. Amy loves the “ba-kak” (baby backpack) and I don’t mind the exercise, so that’s how we started out. A little over half-way around, I decided to let her down so she could stretch her legs and I could stretch my back. Cindy and I were pretty surprised that she actually walked the rest of the way until the very end when it didn’t seem safe near the water – maybe about a mile. That’s a long walk for a less-than-two-year-old! Click below for pictures.