Stackable Exit Hooks for Bash

I’m just going to leave this here and then quietly back away before the flames start.

# Stackable atexit functionality for bash.
# Bash's "trap ... EXIT" is somewhat similar to libc's "atexit" with the
# limitation that such functions don't stack.  If you use this construct twice,
# the cleanup code in the second invocation *replaces* that in the first, so
# the first actually doesn't happen.  Oops.  This snippet shows a way to get
# stackable behavior by editing the current trap function to incorporate a new
# one, either at the beginning or the end.  That's a really cheesy thing to do,
# but it works.
function atexit_func {
	# Bash doesn't have anything like Python's 'pass' so we do nothing
	# this way instead.
	echo -n
}
trap "atexit_func" EXIT

# Call this function to have your cleanup called *before* others.
function atexit_prepend {
	tmpfile=$(mktemp atexit.XXXXXX)
	typeset -f atexit_func > "$tmpfile"
	# "2a" appends after line 2 of the dump (the lone opening brace),
	# i.e. at the start of the function body.
	echo -en "2a\n$1\n.\nw\nq\n" | ed - "$tmpfile"
	. "$tmpfile"
	rm "$tmpfile"
}

# Call this function to have your cleanup called *after* others.
function atexit_append {
	tmpfile=$(mktemp atexit.XXXXXX)
	typeset -f atexit_func > "$tmpfile"
	# "$i" inserts before the last line (the closing brace),
	# i.e. at the end of the function body.
	echo -en "\$i\n$1\n.\nw\nq\n" | ed - "$tmpfile"
	. "$tmpfile"
	rm "$tmpfile"
}

function first_atexit {
	echo "first atexit function"
}
atexit_append first_atexit

function second_atexit {
	echo "second atexit function"
}
atexit_append second_atexit

function third_atexit {
	echo "third atexit function"
}
atexit_prepend third_atexit
# Should see third/first/second here.
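The premise is easy to verify on its own: a second "trap ... EXIT" really does replace the first handler rather than stacking on it. A minimal demonstration, assuming nothing from the snippet above:

```shell
#!/bin/bash
# Each "trap ... EXIT" replaces the previous handler outright, so only
# the second message appears when this subshell exits.
out=$(bash -c 'trap "echo first" EXIT; trap "echo second" EXIT; true')
echo "$out"    # prints just "second" -- "first" is gone
```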

CloudFS on the Horizon

I try to keep blogging about $dayjob to an absolute minimum, but this is kind of a big deal. Today, I pushed the first bits of CloudFS out to a public git repository. A week or so ago I registered a domain to hold non-code content, but it’s not really up yet, so for now you can clone that repository and look at the code therein, or at the Fedora 15 feature page. Here are a few nuggets to get you started.

  • CloudFS is basically GlusterFS enhanced to allow deployment by a provider as a permanent shared service, rather than as a private thing that users can run within their own compute instances.
  • The enhancements necessary fall into several broad categories: authentication, encryption, isolation (each tenant gets their own namespace and UID/GID space), quota/billing, and some necessary enhancements to existing GlusterFS functionality.
  • This is a very pragmatic and unambitious release, explicitly not including the improved-DHT and multi-site-replication functionality that I think will make CloudFS really cool. Think of it as a warm-up to the main attraction.
  • The code is nowhere near complete. The three translators I’ve written are complete enough to do useful things – and more importantly to be worth reviewing – but all need to be improved in various ways and there are other bits (mostly around configuration and management) that don’t even exist yet. To put it another way, I think the code represents that point on a journey where you’ve climbed the highest hill and can see the destination, but there are many miles yet to be walked.

Once I get set up the rest of the way, I’ll probably start posting more info there. Stay tuned.

Fixing Linux Bloat

Apparently, Linus himself has come out and said that Linux is getting bloated and huge.

“We’re getting bloated and huge. Yes, it’s a problem,” said Torvalds.

Asked what the community is doing to solve this, he balked. “Uh, I’d love to say we have a plan,” Torvalds replied to applause and chuckles from the audience. “I mean, sometimes it’s a bit sad that we are definitely not the streamlined, small, hyper-efficient kernel that I envisioned 15 years ago…The kernel is huge and bloated, and our icache footprint is scary. I mean, there is no question about that. And whenever we add a new feature, it only gets worse.”

I can’t help but wonder how much of the reason for this bloat is a general aversion among many Linux kernel hackers to stable kernel interfaces in favor of getting code into the mainline Linux tree. Greg Kroah-Hartman has written eloquently on the subject, but I stand by my own dissent from almost two years ago. In addition to the objections I raised before, I believe that the “only one download” attitude is also part of the reason the kernel is bloated. Everyone pays not only to download and configure/build code for platforms and devices they’ll never see, but also to run core-kernel code that’s only there to support environments that they’ll also never see. For example, a lot of core changes have been made to support various kinds of virtualization. Virtualization is a very valuable feature on which I myself often rely, but is it really fair to make everyone carry the baggage for a feature they don’t use? Might that be an example of an anti-patch philosophy contributing to the bloat Linus mentioned?

The problem is that, if you can’t do something as a completely separate module (and BTW it’s pretty amazing what you can do that way), then you have two choices: maintain your own patches forever, or get them into the mainline kernel where they’ll affect users in every environment all over the world. Both approaches are unpleasant. Maintaining your own patches across other people’s random kernel-interface changes is a pain. Dealing with all the LKML politics to get your patches accepted can be a pain too. What if there were a middle ground? What if the community were more patch-friendly, so that functionality requiring patches weren’t treated quite so shabbily? That would mean more stable kernel interfaces – not infinitely stable, but not allowing unilateral change every time one of those senior folks learns a new trick. It would also mean better ways of distributing patches alongside, rather than in, the kernel, such as a well-known area for real-time and virtualization and NUMA and similar major-feature patches. I recognize the problem of maintaining every possible combination, but shouldn’t users at least have a choice between well-tested “plain” and “everything” kernels? Wouldn’t that help address the bloat, with a minimum of pain for all involved? Shouldn’t we at least discuss alternatives to the model that led to things being so bloated that even Linus has commented on it?

This is Going to be Ugly

This bug allows regular users to put whatever code they want at location zero, and then trick the kernel into executing it. Lovely.

“Since it leads to the kernel executing code at NULL, the vulnerability is as trivial as it can get to exploit,” security researcher Julien Tinnes writes here. “An attacker can just put code in the first page that will get executed with kernel privileges.”

Tinnes and fellow researcher Tavis Ormandy released proof-of-concept code that they said took just a few minutes to adapt from a previous exploit they had. They said all 2.4 and 2.6 versions since May 2001 are affected.

Rightly or no, this will blow a big hole in the “Linux is so secure” smugness I often see. Personally, I’ve always thought it was a bad idea to leave user space mapped when you’re in the kernel – for reasons much like this. The model of having completely separate user and kernel/supervisor maps, and special instructions (or instruction variants) to access user space from kernel mode, is less convenient but far more secure.

Open Source Support

Stephanie Zvan wrote an article about how open-source support must be better than closed-source because being motivated by pride is so much better than being motivated by money. Greg Laden and some of her other friends applauded predictably. I’m not going to mince words: I thought it was a singularly uninsightful article. I’m not even sure it was meant to be insightful so much as to be provocative or even outright insulting. Both are common motivations on the web, but attribution of motive is a fallacy, and I have a few choice words later on for those who indulge in it, so I won’t discuss motivation any further than to say that attributing a constructive motive where none is in evidence would be just as fallacious. Here’s a partial list of reasons why I think the article was bad.

  • Motivation is not strongly correlated with open vs. closed source. There are plenty of open-source folks who are motivated primarily by hopes of cashing in their open-source cred for cold hard cash some day, and couldn’t care less about whether their users are well served in the process. Many people working on closed-source projects, on the other hand, do care very much about their users and have passed up more lucrative opportunities so that they could do something they believe in.
  • One particularly important example of this is the people who start companies. People who start computer companies (and to a lesser extent those who choose to work at startups) have willingly consigned themselves to long hours for less immediate reward than they could get working somewhere else or as consultants, risking their retirements and their marriages and much else in the process. That takes a lot of commitment and passion, which is reflected in concern for every single user. Some open-source workers can match that, but they’re far outnumbered by those who wrote something for themselves in their spare time and whose last involvement with it was to post it on Sourceforge or Freshmeat. Zvan’s whole theory falls apart when the “good guys” have the “bad motivation” and vice versa.
  • Open-source programmers’ pride, which Zvan presents as an unalloyed good, can lead to bad behavior as well. The Linux Kernel Mailing List is a notoriously nasty place, the Gentoo project is only one of the most prominent to be ripped apart by developers’ ego wars, bitter disputes over KDE vs. GNOME have raged everywhere, etc. For every example that fits Zvan’s model of kind and diligent open-source programmers being nice to users, there are at least two examples of immature and antisocial open-source programmers demonstrating utter contempt for users. The perception of users as an alien species at best, The Enemy at worst, is common among programmers regardless of their software-distribution model.
  • Discussing what motivates people is a tricky business, especially when discussing people clearly different than yourself. I’m sure the answer would be that the principles involved are universal and scientifically validated, but “it’s science” isn’t just magical pixie dust you can sprinkle on your biases and brainfarts to make them more credible. Real science involves applying appropriate analytic and explanatory models to actual data, accounting for exceptions or confounding factors, not just forcing made-up data into the mold of one pet theory. Programmers are people, not lab rats, engaged in a complex task that often involves conflicting motivations. Simplistic behavioral models do not suffice for them any more than for editors or travel coordinators.
  • If we want real science, and the question is whether open-source or closed-source products have better support, the way to find an answer is not to cook up some pet theory and try to fit data to it. The real scientific way is to look at user-perceived outcomes of support encounters in both realms. Stephanie doesn’t even consider it, and the one guy who offered such data in Greg’s thread was completely ignored except by me.

In the end, I’ll just say what I said in Greg’s thread. Open source is just a way of distributing software, with its own unique advantages and disadvantages. Ease of participation and likelihood of a project continuing beyond its originator’s withdrawal are among open source’s inherent advantages. Kind or level of motivation is not. Open source is not a panacea, and shouldn’t be a religion. As with any religion, those who relentlessly proselytize but don’t even practice what they preach – for example by actually writing some open-source software – are boors and hypocrites.

Warming up to Linus

Linus (one of those people who no longer needs a last name) often says things that seem hasty and/or extreme. I’ve criticized him for it in the past, when he said things about specs or debuggers or typedefs that I considered wrong. Sometimes, though, a bit of hasty and extreme commentary is just what’s called for.

if you write your data _first_, you’re never going to see corruption
at all.

This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It
literally does everything the wrong way around – writing data later than
the metadata that points to it. Whoever came up with that solution was a
moron. No ifs, buts, or maybes about it.

I share his amazement that anybody would think writing metadata before data was a good idea. Contrary to one Very Senior Linux Developer’s assertion, this has not been a problem with FFS/UFS since soft updates were invented (in fact it’s kind of why they were invented). It’s something most filesystem developers have known about, and been careful about, for over a decade. Unless they’re ext* developers, I guess. I find myself similarly amazed at another mis-statement by the same VSLD.

these days, nearly all Linux boxes are single user machines

No. Wait, let me think about that some more. Still no. It might be the case that the majority of Linux systems only have one user logged in, but that’s not relevant for the great number of Linux installations on servers or embedded appliances. The relevant number is not how many users are logged in but how many are requesting service, and that’s far more than enough to make “nearly all” untrue. What’s worse is that this untrue statement was used to brush off a security concern, as though multi-user systems aren’t even worth worrying about.

Keep in mind that this is not some late arrival working on Linux out of the goodness of their heart. This is someone who has been involved with Linux since very early days, who for years has been paid to work on and represent Linux full time. If I had shown such an extreme lack of judgment in my design or coding, then compounded that by making such egregiously false and reckless statements about both my own employer’s and others’ products, I would expect an even stronger reaction than Linus’s.

The Great Fsync Debate

One of the hottest topics in the Linux world lately has been the issue of atomically updating a file on a filesystem that uses delayed allocation, and whether fsync() is an acceptable solution. This is an issue now because, even though many filesystems have used delayed allocation for a while, ext4 is the first to see common enough use to spark the debate. One of the best discussions I’ve seen so far is on Alexander Larsson’s blog (thanks to Wes Felter for the link). It also refers to a proposal from Ted Ts’o regarding the issue, which is worth reading.

One of the things that might not be obvious about Ted’s proposal is that it’s constructed to maintain a separation between files and the directory entries that (might) point to them. The desirability of such separation is a bit of a religious issue which I’m not going to get into; the point here is that, while Ted doesn’t explicitly mention it, this explains many things about his proposal that might otherwise seem strange or unnecessary. It’s actually a good proposal as far as the file/directory separation issue goes, but I think it runs smack into another issue: like the fsync() approach, it tries to fix an ordering issue by forcing synchronous updates. In the same LWN discussion Ted even cites Anton Ertl’s explanation of what’s wrong with synchronous metadata updates, but I would say that synchronous data updates – such as the fsync-like behavior implied by the comment attached to flinkat() in Ted’s proposal – are bad for almost exactly the same reasons. The problem here is that the common open/write/rename idiom represents a clearly intended ordering of both file and directory operations, and that ordering can be preserved for the file operations (the writes) but the directory operation is allowed to “jump the queue” because it’s not a file operation. (Note, BTW, that the open is both a file and a directory operation, with clear ordering semantics wrt the writes. So much for that mythical separation between file and directory operations.) My suggestion is that if you have an ordering problem then you should provide a way to preserve ordering. Forcing certain operations to be done synchronously is not necessary and hurts performance/scalability, which is exactly why people are avoiding or complaining about fsync() in the first place.
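For concreteness, here is the idiom in question sketched as a shell script. This is an illustration of the pattern, not anyone's proposed fix; "sync" stands in for the per-file fsync() an application would issue on the temp file's descriptor.

```shell
#!/bin/bash
# The open/write/rename idiom: build the new contents in a temp file in
# the same directory, flush, then atomically rename over the target.
# Readers see either the old file or the new one, never a partial write.
set -e
tmp=$(mktemp ./config.XXXXXX)
printf 'new contents\n' > "$tmp"
sync               # stand-in for fsync(fd); without it, delayed
                   # allocation may let the rename reach disk first
mv "$tmp" ./config.txt
```

The point of the post is that the flush forces synchrony where a mere ordering guarantee (data before rename) would do.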

Unfortunately, the issue of ordering vs. synchrony highlights a pretty fundamental problem that pervades POSIX: the assumption that synchronous operations are the norm, and asynchronous operations are handled in a second-class kind of way if at all. If not for that, then all metadata calls including rename() could be done asynchronously. Once you’re doing operations asynchronously, it’s a small step to add predicates that must be satisfied before they execute. A solution for some hypothetical system not hobbled by some of the sillier Linux/UNIX/POSIX dogma might therefore look like this:

token1 = fbarrier(fd);
	Inserts a marker into fd’s I/O stream, and returns a token corresponding to that marker. The token does not become valid until the marker leaves the I/O stream.
token2 = rename_async(old_name,new_name,token1);
	Queues a rename operation from old_name to new_name, to execute when token1 becomes valid, and returns token2 representing the status of the rename itself (e.g. queued, completed, failed). Note that token1 could represent any kind of pending event, not just a token from fbarrier.
status = query_token(token2);
	Find out whether the rename actually completed (optional). There could also be wait_for_token(), epoll() support for generic async tokens, etc., providing a fully generic infrastructure for asynchronous programming.

Someone’s bound to point out that such an approach does not lock the parent directory (as Ted’s proposal does). That means it’s still vulnerable to certain kinds of races involving the parent; separate solutions for that problem should be obvious and are left as an exercise for the reader. The particular problem I’m trying to focus on is of preserving a commonly expected kind of ordering between writes and renames, without forcing any of the operations to be synchronous.

Hey, Red Hat!

Somebody at Red Hat is linking to my site, but the two referrers I see are both internal so I can’t see why. One of the links is a search page; the other seems to be a post to a mailing list. Could someone at Red Hat please look at the post and tell me what it’s about? Thanks.

Aspire One: my own kernel

As I mentioned in my article on wireless for the Aspire One, one of the things I did was build my own kernel so that some day I could strip out some of the superfluous crud. Well, yesterday I actually started trying to boot that kernel. It was more of a struggle than I thought it would be, but I eventually succeeded. Well, kind of. Before I get into more detail, let me just make something very clear: don’t try this at home. I’m going to talk about doing some things that could seriously screw up your system – not just the Aspire One but whatever other machine you run some of the steps on. Don’t blame me if you turn one of your machines into a very pretty brick.

At this point, I have concluded that the version of GRUB (the boot loader) that’s installed on the A1 is non-standard in some way. No matter what I did in grub.conf (or menu.lst just in case it was lying to me about which config file it was using) I couldn’t get it to show me an actual boot menu. Come to think of it, I couldn’t get it to show me the splash screen referenced in the as-shipped grub.conf either. It would always show the original who-knows-from-where splash screen, and always boot the original kernel. The only thing that seems to have changed since I first received the box is that now I have to hit the space bar a couple of times at the blank screen where the GRUB menu should be. That was good for a few tense moments, but it also tells me that GRUB isn’t being bypassed entirely.

Until I understand exactly what’s special about the Aspire One version of GRUB, I’m loath to replace it or to modify that first entry. Instead, I chose to set up an external boot loader on my oldest and smallest USB key, using my Dell laptop to set it up and test it because I don’t trust any of the GRUB utilities on the A1. I saved everything I used to have on it, reformatted it as plain ext2 (it had been ext3), copied my Dell’s /boot directory onto it, and ran grub-install etc. The only non-obvious thing was that after grub-install I still had to go into the grub shell and run setup there. That seems redundant, but here are the commands I used.

grub> root (hd1,0)
grub> setup (hd1)

No, not (hd1,0) as you’d think. This allowed me to use the USB key to boot the Dell, so then I tried it on the A1. The boot entry I used looked like this.

title My A1 Kernel
rootnoverify (hd1,0)
kernel /boot/bzimg_jd ro root=LABEL=linpus vga=0x311 splash=silent loglevel=1 console=tty1 quiet nolapic timer
initrd /boot/initrd-splash.img
map (hd0) (hd1)

The most important part is that (hd1,0) when booting from the USB key means the internal SSD, whereas when booting from the internal SSD – or the Dell’s hard disk while I was setting this up – it meant the USB key. Similarly, the “map” command makes it so that post-boot references to /dev/sda refer to the SSD again. This also makes it possible to remove the USB key after booting, despite having used it to boot the system originally.

The other bit of strangeness had to do with loading kernel modules built for one kernel while booted with another. On my first boot, things mostly worked, but some things failed because the version strings weren’t quite the same. To get around this, I used “make menuconfig” to change the local version in my kernel from “lw” to “jd”, then rebuilt and installed my own matching modules – in parallel with the shipped ones instead of replacing them. My next reboot went a lot better, except for ndiswrapper, which had been installed separately. That was just a matter of unpacking it and running “make install” as I had before.
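The version-string issue is mechanical: modules record the release string of the kernel they were built against, and the suffix comes from CONFIG_LOCALVERSION. What menuconfig did for me boils down to a one-line edit of .config, simulated here on a throwaway file (a real build would follow with make and make modules_install):

```shell
#!/bin/bash
# Simulate changing CONFIG_LOCALVERSION from "lw" to "jd" in .config,
# which is what makes freshly built modules' version magic match a
# kernel built with the same suffix.
set -e
printf 'CONFIG_LOCALVERSION="lw"\n' > .config.demo
sed -i 's/CONFIG_LOCALVERSION="lw"/CONFIG_LOCALVERSION="jd"/' .config.demo
grep CONFIG_LOCALVERSION .config.demo    # now shows "jd"
rm .config.demo
```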

So now I have my own kernel that’s functionally identical to the original one. I’m out of time now, but some day soon I’ll actually start trimming the fat from my version.

UPDATE 2008-09-22: While I was able to boot my own fully-functional kernel this way a couple of times, on other occasions not everything came up OK (“HAL failed to initialize”) and on still others the system exhibited random flakiness later (failed to make a WiFi connection or recognize a USB device). Apparently, building from the Acer sources with the Acer config file still doesn’t yield something functionally equivalent to the Acer kernel. It might have been an honest mistake/omission on their part, or it might be like when Sun used to accept my fixes and incorporate them into SunOS but leave them out of the supposedly-equivalent NFS source they shipped back to us. Either way, it irks me.

Aspire One: sshfs and encfs

These are two things I consider essential on my mobile machines, one for accessing my files at work and the other for accessing the ones I keep encrypted on my USB drive. Neither is on the Aspire One by default, and getting them there was a bit tricky. For one thing, the RPMs that are available have lots of false dependencies, and installing them would drag in a ton of stuff including the wrong kernel. Seemed like a good way to break my system, so I aborted that effort. Then I tried to build sshfs from source. To do that, I had to build FUSE from source too, because the FUSE kernel module that’s on the system is OK but the version of the libraries that’s on there is missing stuff that sshfs needs. So I compile/install FUSE, install a few other *-devel packages (the standard way this time), build sshfs, and try to run it. I get this:

sshfs: relocation error: sshfs: symbol fuse_opt_insert_arg, version FUSE_2.6 not defined in file with link time reference

WTF? I start digging, learn a few things about versioned symbols, check out how they’re done in FUSE, but everything looks OK. A few experiments only make things worse. Finally I realize that sshfs is finding the vendor-provided version of libfuse in /lib before the one I built in /usr/local/lib, and that version’s broken. I had to fix that in /etc/ (yuk) but then it worked. After that, encfs was a breeze; I had to install boost and openssl and rlog but those were all very straightforward. And no, I didn’t need dmadm or a new kernel or any of the other junk that the package manager wanted to pull in.
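The underlying failure mode is ordinary search-order shadowing: the loader consults /lib before /usr/local/lib, so the stale vendor library wins over the fresh build. The same effect is easy to reproduce with PATH lookup (a toy stand-in for the dynamic loader, using made-up directory names):

```shell
#!/bin/bash
# Two same-named tools; whichever directory comes first on PATH shadows
# the other, just as the vendor libfuse shadowed the newly built one.
set -e
mkdir -p demo/vendor demo/local
printf '#!/bin/sh\necho vendor\n' > demo/vendor/tool
printf '#!/bin/sh\necho local\n'  > demo/local/tool
chmod +x demo/vendor/tool demo/local/tool
export PATH="$PWD/demo/vendor:$PWD/demo/local:$PATH"
tool        # prints "vendor" -- the "local" copy never runs
rm -rf demo
```

With shared libraries the search order lives in the loader's configuration rather than PATH, which is why the fix ended up in /etc/ instead of the environment.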

Broken wireless, out of date software and repositories, drastically uninformative error messages – where does it all end? Truly, Linux is only free if your time has no value. It’s been worth it to me, to have a highly portable computer that has the software I need, but I hate to think of how someone less experienced would react when they hit these hurdles.