As you all know by now, I’ve changed the way I manage the content for this site. I now write posts in WordPress, then turn the results – after all of the database access, theme application, etc. – into static pages which are then served to you. One of my main reasons was security, which has been a weak point of WordPress for years but has drawn a lot of extra attention recently because of the latest botnet. Therefore, I’ll describe what I’ve done, so that others can try something similar and maybe even improve on my recipe.

In order to understand what’s going on here, you have to know a bit about how a WordPress site is structured. It might surprise you to know that each post can be accessed no fewer than six different ways. Each post by itself is available either by its name or by its number. Posts are also combined on the main page, in per-month lists, and in per-category lists. Lastly, the category lists are themselves reachable either by name or by number. In fact, if a post is in multiple categories it will appear in more than six places. It’s important to preserve all of this structure, or else links will break. This is why I didn’t use a WordPress plugin to generate the static content, by the way. Even in the most cursory testing, every one I tried failed to maintain this structure properly. Most of the work I had to do was related to getting that structure just right, but first you have to generate all of the content, so I’ll start there.
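
To make that six-way structure concrete before moving on, here is everywhere a single hypothetical post can show up. The numbers and names are made up, but the URL patterns are the standard WordPress ones.

http://pl.atyp.us/wordpress/2013/04/some-post-slug/   (the post, by name)
http://pl.atyp.us/wordpress/?p=1234                   (the post, by number)
http://pl.atyp.us/wordpress/                          (the main page)
http://pl.atyp.us/wordpress/2013/04/                  (the month archive)
http://pl.atyp.us/wordpress/category/some-category/   (the category, by name)
http://pl.atyp.us/wordpress/?cat=56                   (the category, by number)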

The basic idea for fetching the content is to crawl your own site with wget. Start with a normally working WordPress installation. Make sure you’ve set your WordPress options to use a name-based (rather than number-based) URL structure, and turned comments off on all posts. Then issue something like the following command.

wget -r -l inf -p -nc -D atyp.us http://pl.atyp.us/wordpress

This might take a while. For me it’s about twenty minutes, but this is an unusually old blog. At this point you’ll have a very simple static version of your site, good enough as an archive or something, but you’ll need to fix it up a bit (as described below) before you can really use it to replace the original. Also, you don’t need to do the full crawl all the time. Usually, when a single post changes, you only need to regenerate that post plus the global/month/category timelines it appears on, without touching other months or categories.
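
Something like the following would do for that kind of targeted refresh. It’s an untested sketch rather than the exact commands I use; the post slug, month, and category are placeholders for whatever just changed, and it assumes you run it from the directory that holds the mirror.

#!/bin/bash
# Untested sketch of a targeted refresh.  The slug, month, and category
# below are placeholders; substitute whichever post just changed and the
# archive pages it appears on.  Run from the directory holding the mirror.
base=http://pl.atyp.us/wordpress

for page in \
	2013/04/some-post-slug/ \
	"" \
	2013/04/ \
	category/some-category/
do
	# Throw away the stale copy first; -nc would otherwise refuse to
	# replace it, and plain wget would save a duplicate next to it.
	rm -f "pl.atyp.us/wordpress/${page}index.html"
	wget -p -nc -D atyp.us "$base/$page"
done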

The first fix has to do with accessing the same page by name or by number. One of my goals was to avoid rewriting the actual contents of the generated pages. I don’t mind copying, adding links, adding web-server rewrite rules, and so on, but rewriting content would only fix things for me; any links from outside would still be broken. My solution here has two parts. The first is a script that finds the name-to-number mapping inside each article and uses that information to create a directory full of symbolic links. Here it is, but be aware that I plan to improve it for reasons I’ll get to in a moment.

#!/bin/bash

function make_one {
	# Canned Platypus posts
	p_expr='single single-post postid-\([0-9]*\)'
	p_id=$(sed -n "/.*$p_expr.*/s//\\1/p" < "$1")
	if [ -n "$p_id" ]; then
		ln -s "$1" "post-$p_id.html"
		return
	fi
	# HekaFS posts
	p_expr=' name=.comment_post_ID. value=.\([0-9]*\). '
	p_id=$(sed -n "/.*$p_expr.*/s//\\1/p" < "$1")
	if [ -n "$p_id" ]; then
		ln -s "$1" "post-$p_id.html"
		return
	fi
	# Category archive pages
	c_expr='archive category category-[^ ]* category-\([0-9]*\)'
	c_id=$(sed -n "/.*$c_expr.*/s//\\1/p" < "$1")
	if [ -n "$c_id" ]; then
		ln -s "$1" "cat-$c_id.html"
		return
	fi
}

find "$1" -name index.html | while read -r f; do
	make_one "$f"
done

Notice how I had to handle the two blogs differently? It turns out that this information is theme-specific, and some themes might not include it at all. What I really should do is get this information from the database (correlate post_title with ID in wp_posts), but it works for now.
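
If I ever get around to that, it might look something like the untested sketch below. It assumes MySQL credentials in ~/.my.cnf and a database named "wordpress" (adjust to taste), and it matches on post_name - the URL slug that actually appears in the mirrored directory names - rather than post_title.

#!/bin/bash
# Untested sketch: build the post-NNN.html links straight from the database
# instead of scraping theme-specific markup.  Assumes MySQL credentials in
# ~/.my.cnf, a database named "wordpress", and "$1" = root of the mirror.
query="SELECT ID, post_name FROM wp_posts
       WHERE post_type = 'post' AND post_status = 'publish'"

mysql -N -B -e "$query" wordpress | while read -r id slug; do
	# With name-based permalinks, the mirrored copy of each post lives
	# at .../<year>/<month>/<slug>/index.html.
	dir=$(find "$1" -type d -name "$slug" | head -1)
	[ -n "$dir" ] && ln -s "$dir/index.html" "post-$id.html"
done

The second part of the solution is a web-server rewrite rule, to redirect CGI requests for an article or category by number to the appropriate link. Here's what I'm using for Hiawatha right now.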

UrlToolkit {
    ToolkitID = cp-wordpress
    RequestURI isfile Return
    # Handle numeric post/category links.
    Match /wordpress/\?p=(.*) Rewrite /wordpress/links/post-$1.html Return
    Match /wordpress/\?cat=(.*) Rewrite /wordpress/links/cat-$1.html Return
    Call static-wordpress
}

What ends up happening here is that Hiawatha rewrites the CGI URL so that it points to the link I just created, which in turn points to the actual article.
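
For a made-up post number and slug, the whole chain looks roughly like this:

/wordpress/?p=1234
    -> /wordpress/links/post-1234.html              (cp-wordpress rewrite)
    -> /wordpress/2013/04/some-post-slug/index.html (symlink created by the script)

The "static-wordpress" URL toolkit handles another dynamic-link issue, this time related to JavaScript and CSS files.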

UrlToolkit {
    ToolkitID = static-wordpress
    # Support multiple versions of CSS and JS files, with the right extensions.
    Match (.*)\.css\?(.*) Rewrite $1_$2.css Return
    Match (.*)\.js\?(.*) Rewrite $1_$2.js Return
    # Anything else gets the arguments lopped off.
    Match (.*)\?(.*) Rewrite $1 Return
}

I had to do this because Firefox would complain about CSS/JS files not having the right content type; Hiawatha serves the wrong MIME type unless the file name actually ends in .css or .js. For example, widgets.css?ver=20121003 wouldn't work. This rule rewrites it to widgets_ver=20121003.css, which does work. To go with that, I also have a second renaming script.

#!/bin/bash
 
workdir=$(mktemp -d)
trap "rm -rf $workdir" EXIT
 
# Skip names containing & or " characters - rare, and not worth handling.
find "$1" -name '*\?*' | grep -Ev '[&"]' > $workdir/all
 
# Use edit-in-place instead of sed to avoid quoting nastiness.
 
# Handle CSS files.
grep '\.css' $workdir/all > $workdir/css
ed - $workdir/css << EOF
g/\([^?]*\)\.css?\([^?]*\)/s//mv '\1.css?\2' '\1_\2.css'/
w
q
EOF
 
# Handle JavaScript files.
grep '\.js' $workdir/all > $workdir/js
ed - $workdir/js << EOF
g/\([^?]*\)\.js?\([^?]*\)/s//mv '\1.js?\2' '\1_\2.js'/
w
q
EOF
 
# Handle everything else.
grep -Ev '\.css|\.js' $workdir/all > $workdir/gen
ed - $workdir/gen << EOF
#g/\([^?]*\)?\([^?]*\)/s//mv '\1?\2' '\1_\2.html'/
g/\([^?]*\)?\([^?]*\)/s//rm '\1?\2'/
w
q
EOF
 
. $workdir/js
. $workdir/css
. $workdir/gen

Note that the script also deletes other (non-CSS non-JS) files with question marks, since wget will leave some of those lying around and (at least in my case) they're invariably useless. Similarly, the static-wordpress rewrite rule just deletes the question mark and anything after it.

At this point you should have a properly fixed-up blog structure, which you can push to your real server and serve as static files (assuming you have the right configuration; a minimal example is below). What's missing? Well, comments for one. I still vaguely plan to add an external comment service like Disqus or Livefyre, but to be honest I'm not in that much of a hurry because - while I do appreciate them - comments have never been a major part of the site. The other thing missing is search, and I'm still pondering what to do about that. Other than that, as the fact that you're reading this shows, the process described above seems to work pretty well. My web server is barely using any CPU or memory to serve up two sites, and my "attack surface" has been drastically reduced by not running MySQL or PHP at all.
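
Speaking of the right configuration, the serving side doesn't need much. A stripped-down Hiawatha virtual host along these lines is enough to serve the generated tree and apply the toolkits above; the hostname and path here are illustrative, not my real setup.

VirtualHost {
    Hostname = pl.atyp.us
    # Root of the generated static tree, wherever you pushed it.
    WebsiteRoot = /var/www/static-blog
    StartFile = index.html
    # Chains to static-wordpress via the Call at the end of cp-wordpress.
    UseToolkit = cp-wordpress
}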

P.S. Hiawatha rocks. It's as easy to set up as nginx, it has at least as good a reputation for performance, and resource usage has been very low. I'd guess I can serve about 60x as much traffic as before, even without flooding protection - and that's the best thing about Hiawatha. I can set a per-user connection limit (including not just truly simultaneous connections but any occurring within N seconds of each other) and ban a client temporarily if that limit is exceeded. Even better, I can temporarily ban any client that makes more than M requests in N seconds; the relevant settings are sketched below. I've already seen this drive off several malware attempts and overly aggressive bots, while well-behaved bots and normal users are unaffected. This probably increases my load tolerance by up to another order of magnitude. This might not be the fastest site of any kind, but for a site that has (almost) all of the power of WordPress behind it I'd say it's doing pretty well.
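
For anyone who wants to try the same thing, the directives involved look roughly like this; the numbers are placeholders to experiment with, not my real limits.

# In the main (global) section of hiawatha.conf.
ConnectionsPerIP = 10
# Ban an IP for this many seconds when it exceeds ConnectionsPerIP.
BanOnMaxPerIP = 60
# More than 25 requests within 5 seconds earns a 300-second ban.
BanOnFlooding = 25/5:300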