Fungus Among Us

big fungus

When we got home this evening, we spotted this fungus on one of the oak trees i our front yard. It’s hard to tell from the picture, but it’s about two feet wide! There were three others, one almost as big, as well. It’s probably a laetiporus sulphureus, a.k.a. Chicken of the Woods. Some people think it’s rather good eating, though it apparently has a very strong “fungusy” taste and some people have adverse reactions to it. There also seem to be some hints that it might indicate a rot problem, but most sites say nothing about that and the tree is clearly in excellent health so I’m not inclined to worry. I’m also not inclined to eat it, and not just because I’m too lazy to get out the ladder (it’s about 25′ off the ground).

I know it’s not a great picture, BTW, but I’m pleased that I was able to get one even this good. That’s at 8x zoom (2x optical, 4x digital – and digital zoom generally sucks) without any support other than my hand. I used XnView to adjust the color and contrast, and the “enhance focus” filter (another thing that generally sucks) actually worked too. All in all it worked out very well, I think.

Server Design III: Memory Allocation

Here’s the latest section of my ongoing server-design article. I also changed some of the previous section (context switches) from how it appears in the log.

Allocating and freeing memory is one of the most common operations in many applications. Accordingly, many clever tricks have been developed to make general-purpose memory allocators more efficient. However, no amount of cleverness can make up for the fact that the very generality of such allocators inevitably makes them far less efficient than the alternatives in many cases. I therefore have three suggestions for how to avoid the system memory allocator altogether.

Suggestion #1 is simple preallocation. We all know that static allocation is bad when it imposes artificial limits on program functionality, but there are many other forms of preallocation that can be quite beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even when some memory is “wasted” in the process. Thus, if it’s possible to assert that no more than N items could ever be in use at once, preallocation at program startup might be a valid choice. Even when that’s not the case, preallocating everything that a request handler might need right at the beginning might be preferable to allocating each piece as it’s needed; aside from the possibility of allocating multiple items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight then preallocation might not be an option, but in all but the most extreme circumstances it generally turns out to be a net win.

Suggestion #2 is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently-freed objects onto a list instead of actually freeing them, in the hope that if they’re needed again soon they need merely be taken off the list instead of being allocated from system memory. As an additional benefit, transitions to/from a lookaside list can often be implemented to skip complex object initialization/finalization.

It’s generally undesirable to have lookaside lists grow without bound, never actually freeing anything even when your program is idle. Therefore, it’s usually necessary to have some sort of periodic “sweeper” task to free inactive objects, but it would also be undesirable if the sweeper introduced undue locking complexity or contention. A good compromise is therefore a system in which a lookaside list actually consists of separately locked “old” and “new” lists. Allocation is done preferentially from the new list, then from the old list, and from the system only as a last resort; objects are always freed onto the new list. The sweeper thread operates as follows:

  1. Lock both lists.
  2. Save the head for the old list.
  3. Make the (previously) new list into the old list by assigning list heads.
  4. Unlock.
  5. Free everything on the saved old list at leisure.

Objects in this sort of system are only actually freed when they have not been needed for at least one full sweeper interval, but always less than two. Most importantly, the sweeper does most of its work without holding any locks to contend with regular threads. In theory, the same approach can be generalized to more than two stages, but I have yet to find that useful.

One concern with using lookaside lists is that the list pointers might increase object size. In my experience, most of the objects that I’d use lookaside lists for already contain list pointers anyway, so it’s kind of a moot point. Even if the pointers were only needed for the lookaside lists, though, the savings in terms of avoided trips through the system memory allocator (and object initialization) would more than make up for the extra memory.

Suggestion #3 actually has to do with locking, which we haven’t discussed yet, but I’ll toss it in anyway. Lock contention is often the biggest cost in allocating memory, even when lookaside lists are in use. One solution is to maintain multiple private lookaside lists, such that there’s absolutely no possibility of contention for any one list. For example, you could have a separate lookaside list for each thread. One list per processor can be even better, due to cache-warmth considerations, but only works if threads cannot be preempted. The private lookaside lists can even be combined with a shared list if necessary, to create a system with extremely low allocation overhead.

Not Much of a Photographer

Here are the two pictures I took on this weekend’s hiking trip. The first picture is from under the fire tower on Carrigain; the second is from South Hancock. Obviously the views were less excellent than usual, but the company more than made up for it.

Carrigain South Hancock

Look Ma, I’m Syndicated

If you don’t know what RSS is, you can ignore this entry. If you do know, you might be interested to know that I’ve added an RSS version of this log, at (RSS 0.92, or close enough that at least a couple of aggregators seem to get it right).

Content-Based Addressing and Filesystems

The question was actually posed to me via email, and I’m being evil by posting it without permission, but it’s actually a question I’ve been asked quite a few times so I’m pretty optimistic that the author will forgive me.

Have you ever thought of the implications of a file system design that utilizes a content based file handle (a crypto hash of the raw file data)?

Heh. Only all the time. ;-)

Seriously, though, content-based addressing has some interesting advantages and drawbacks. On the one hand, it makes a large class of consistency problems just go away, and that’s a Major Good Thing. The obvious thing to do would be to arrange all data in a filesystem hierarchically, much like a phase tree. To write a block you’d first insert the data block, then a new version of its inode to point to the new block, then a new version of the directory to point to the inode, all the way up to the root. It would be impossible to get an inconsistent state in such a system. Cool, huh?

The problem is that it’s also very difficult to make sure that you’re using the filesystem’s current state. While a certain amount of “slack” is generally allowable, especially in a publication-oriented system, users’ and applications’ tolerances for staleness are limited. The only way to be sure that what I’m reading is current as of a particular point in time is to work all the way down from the root to the block I want, but that has horrendous performance implications. There’s also a logistical problem with all those old versions of inodes and directories remaining in the system essentially forever because nobody can ever really know whether it’s safe to drop them. Lastly, while content-based addressing in some ways makes it easy to continue writing despite disconnection and network partitions, there’s still a very serious problem of how to reconcile the two halves of a split brain properly when everything’s connected again.

Some of these problems strike me as both important and unsolvable in a purely content-based-addressing scheme. At some point it seems that you need some part of the system to be based on a scheme that provides stronger consistency, and that’s where I spend a lot of my “think time” on the subject. I’m pretty sure that there are other people with a perfectly good handle on the content-based part of such a hybrid system, but it’s still pretty unclear exactly what the characteristics of the other part would be.

Random Thoughts about Education

original context

A student who is learning to write turns in their work. They recieve their paper back with corrections and a grade. They are usually still confused. Instead, I think that the teacher should rewrite the entire paper showing the correct way it should be done. This doesn’t have to be done for every paper, but if it were done for a few, the students could relate and understand better since it becomes more personal.

Do you have any idea how much teacher effort this would require, even for a few papers? Have you ever actually had to edit someone else’s writing? A teacher’s life is hard enough already. Lesson plans and grading take a lot more effort than most people think, and when teachers are expected to spend time acting as surrogates for absent or uninvolved parents as well it’s amazing that any teacher ever even has time for a movie. I think the goal of providing better feedback could be far better served by teachers doing two things more aggressively:

  • Keeping track of the errors that many students seem to be making, and specifically addressing those errors for the benefit of the whole class after the papers have been turned in.
  • Encouraging students who still have questions to visit during the teacher’s office hours for further explanation (yes, I believe teachers at all levels should have regular free-form office hours).

Another thing that I would do if I were a teacher – and I’m sure everyone’s glad I’m not – would be to enlist more help. For example, while many parents are truly uninvolved in their children’s education, many others would love to help more but aren’t very good at it. Quarterly parent-teacher meetings don’t cut it. What I’d do is send frequent notes to parents identifying particular areas where the student seems to need more help, and perhaps pointing them to resources that they could use.

In a similar vein, why not enlist the help of the brighter students? Why not offer the more advanced students credit toward their own grades – or perhaps the chance to skip an assignment – for helping others? This doesn’t just help the struggling student. It helps the advanced students learn teaching skills, it keeps the class progressing together, and it has great social benefits (e.g. as an antidote to stratification) as well. I’ve never understood why most schools aren’t more directly involved in helping students helping each other. Every child psychologist knows that children learn as much from their peers as from authority figures; why not make use of that, instead of resisting it?

What Happened to Advogato?

I haven’t been able to reach it since early yesterday at least – not from home, not from work, not through a proxy elsewhere. I haven’t seen anything on any of the other geek-news sites or weblogs I follow either. What happened?

Site-news Quickies

The old site is completely gone. I’ve cancelled my account, I can’t ping the machine, it’s just…gone. The New Era is truly upon us.

I rearranged my topic index a bit. The “tech” section was getting really out of hand, so I’ve split it into several smaller sections, and combined some of the non-tech sections.

Server Design II: Context Switches

Here’s the next section of my ongoing server-architecture article.

Whereas everyone thinks it’s obvious that data copies are bad, I’m often surprised by how many people totally ignore the effect of context switches on performance. In my experience, context switches are actually behind more total “meltdowns” at high load than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it’s totally obvious what causes excessive context switching. The #1 cause of context switches is having more active threads than you have processors. As the ratio of active threads to processors increases, the number of context switches also increases – linearly if you’re lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it’s (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor’s worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.

The first thing that a “thread-frugal” program has to do is figure out how it’s going to make one thread handle multiple connections at once. This usually implies a front end that uses select/poll, asynchronous I/O, signals or completion ports, with an event-driven structure behind that. Many “religious wars” have been fought, and continue to be fought, over which of the various front-end APIs is best. Dan Kegel’s C10K paper is a good resource is this area. Personally, I think all flavors of select/poll and signals are ugly hacks, and therefore favor either AIO or completion ports, but it actually doesn’t matter that much. They all – except maybe select() – work reasonably well, and don’t really do much to address the matter of what happens past the very outermost layer of your program’s front end.

The simplest conceptual model of a multi-threaded event-driven server has a queue at its center; requests are read by one or more “listener” threads and put on queues, from which one or more “worker” threads will remove and process them. Conceptually, this is a good model, but all too often people actually implement their code this way. Why is this wrong? Because the #2 cause of context switches is transferring work from one thread to another. Some people even compound the error even further by requiring that the response to a request be sent by the original thread – guaranteeing not one but two context switches per request. It’s very important to use a “symmetric” approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.

Usually, it’s not possible to know how many threads will be active even one instant into the future. After all, requests can come in on any connection at any moment, or “background” threads dedicated to various maintenance tasks could pick that moment to wake up. If you don’t know how many threads are active, how can you limit how many are active? In my experience, one of the most effective approaches is also one of the simplest: use an old-fashioned counting semaphore which each thread must hold whenever it’s doing “real work”. If the thread limit has already been reached then each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked in this way they won’t continue contending for resources until one of the existing threads “retires” so the system effect is negligible. More importantly, this method handles maintenance threads very gracefully.

Savvy readers might already have started to wonder how much what I’ve described differs from Matt Welsh’s SEDA, which is already considered a leading-edge architecture. The answer is: not much. I know I’ve been using most of the approaches discussed here since well before I heard of SEDA, and I’m almost equally sure that Matt Welsh never heard of any of my work, but we clearly are concerned with many of the same problems and have reached many of the same conclusions. Think of it as convergent evolution. Matt, if you’re out there, drop me an email. I’d love to get your input on this article. The points I’d like to make regarding SEDA are as follows:

  1. SEDA has a focus on separating a server into multiple stages, beyond just listener and worker. While I don’t think this is necessary from a performance standpoint, it’s an excellent idea for other reason such as locking (which I’ll get to in a little bit) and maintainability. Once you’ve implemented a two-stage multi-dispatch event-based system with concurrency control – what I’ve just described – generalizing it to an N-stage system is easy and can be highly beneficial.
  2. SEDA’s one significant flaw, in my opinion, is that it allocates a separate thread pool to each stage with only “background” reallocation of threads between stages in response to load. The #1 and #2 causes of context switches as noted above still apply in this case.
  3. In general I don’t like systems that overuse queues. I’ve said it before and I’ll say it again: putting something on a queue and then immediately taking it right back off is silly. While I can appreciate the convenience and clarity of the queues-everywhere model, the silliness is something I’d rather avoid.
  4. In the context of an academic research project, implementing SEDA in Java might make sense. In the real world, though, I think the choice can be characterized as unfortunate.

Show Me the Money

Yesterday in the Boston Globe’s editorial section there was a cartoon intended to show the disparity in spending between men’s and women’s health problems. I don’t remember the exact details, but it showed a woman seeking – and not getting – some sort of help, in front of signs pointing to the prostate and testicular cancer units. I found myself slightly offended, because I’ve heard this canard about unequal spending and have never quite believed it. I try to have an open mind, though, so I tried to find some statistics.

I limited my searches to various forms of cancer this time because it’s easier to see the numbers side by side. Maybe I’ll do something similar for non-cancer diseases when I have more time. I went to the US National Cancer Institute’s SEER system for information on incidence and mortality for four types of cancer – two affecting women and two affecting only men. I initially intended to look up actual dollar amounts for research into each type, but it turns out that those numbers are really hard to find. Instead, I decided to use number of research papers as a (probably inadequate) substitute, and got the appropriate numbers from the US National Library of Medicine’s PubMed system. Here are the results (incidence and mortality are per 100K).

Incidence Mortality # of Papers
Breast Cancer 73.4 16.4 120805
Cervical Cancer 5.1 1.7 33534
Prostate Cancer 71.5 12.6 37675
Testicular Cancer 2.5 0.3 14504

It does seem odd to me that (for example) prostate cancer occurs at 97% of the rate for breast cancer, and has 77% of the mortality, but only generates 31% as much research. I don’t want do overstate any conclusions, but at least in this case I don’t see much evidence that men are hogging all the research spending. I’d really like to know where the evidence – if any – for such a belief comes from. So far, it seems reasonably likely that the real picture for gender-specific health care is not like we’ve been led to believe, and that my annoyance at the cartoon’s portrayal might have been justified.