Using WordPress to Generate Static Pages

As you all know by now, I’ve changed the way I manage the content for this site. I now write posts in WordPress, then turn the results – after all of the database access, theme application, etc. – into static pages which are then served to you. One of my main reasons was security, which has been a problem with WordPress for years but has drawn a lot of fresh attention recently because of the latest botnet. Therefore, I’ll describe what I’ve done so that others can try something similar, and maybe even improve on my recipe.

In order to understand what’s going on here, you have to know a bit about how a WordPress site is structured. It might surprise you to know that each post can be accessed in no fewer than six different ways. Each post by itself is available either by its name or by its number. Posts are also combined on the main page, in per-month lists, and in per-category lists. Lastly, the category lists themselves are reachable either by name or by number. In fact, if a post is in multiple categories it will appear in more than six places. It’s important to preserve all of this structure, or else links will break. This is why I didn’t use a WordPress plugin to generate the static content, by the way: even in the most cursory testing, every one I tried failed to maintain this structure properly. Most of the work I had to do was related to getting that structure just right, but first you have to generate all of the content, so I’ll start there.
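
To make that concrete, here is a hypothetical example of the URLs a single post can answer to. The exact name-based patterns depend on your permalink settings (the year/month/name form shown here is just a common one), but the numeric ?p= and ?cat= forms are exactly what the rewrite rules later in this post have to deal with.

http://pl.atyp.us/wordpress/?p=1234                  (post by number)
http://pl.atyp.us/wordpress/2013/04/example-post/    (post by name)
http://pl.atyp.us/wordpress/                         (main page)
http://pl.atyp.us/wordpress/2013/04/                 (per-month list)
http://pl.atyp.us/wordpress/category/example/        (category by name)
http://pl.atyp.us/wordpress/?cat=7                   (category by number)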

The basic idea for fetching the content is to crawl your own site with wget. Start with a normally working WordPress installation. Make sure you’ve set your WordPress options to use a name-based (rather than number-based) URL structure, and turned comments off on all posts. Then issue something like the following command.

wget -r -l inf -p -nc -D atyp.us http://pl.atyp.us/wordpress

This might take a while. For me it’s about twenty minutes, but this is an unusually old blog. Also, you don’t need to do this every time. Usually, when you add or edit a post, you only need to regenerate the post itself plus its global/month/category listings, without touching other months or categories. At this point you’ll have a very simple static version of your site, good enough as an archive or something, but you’ll need to fix it up a bit before you can really use it to replace the original.
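
For example, a selective refresh after editing a single post might look something like the following. The post slug, month, and category here are made-up examples, and because -nc refuses to overwrite existing files, the stale copies have to be removed first.

rm -rf pl.atyp.us/wordpress/2013/04/example-post
rm -f pl.atyp.us/wordpress/index.html \
      pl.atyp.us/wordpress/2013/04/index.html \
      pl.atyp.us/wordpress/category/example/index.html
wget -r -l 1 -p -nc -D atyp.us \
    http://pl.atyp.us/wordpress/2013/04/example-post/ \
    http://pl.atyp.us/wordpress/2013/04/ \
    http://pl.atyp.us/wordpress/category/example/ \
    http://pl.atyp.us/wordpress/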

The first fix has to do with accessing the same page by name or by number. One of my goals was to avoid rewriting the actual contents of the generated pages. I don’t mind copying, adding links, adding web-server rewrite rules, and so on, but rewriting content would only fix things for me; any links from outside would still be broken. My solution here has two parts. The first is a script that finds the name-to-number mapping inside each article and uses that information to create a directory full of symbolic links. Here it is, but be aware that I plan to improve it for reasons I’ll get to in a moment.

#!/bin/bash
# Build a directory of post-NNN.html and cat-NNN.html symlinks by scraping
# the numeric IDs out of the generated pages. Run it from the directory where
# the links should live, giving the wget output tree (as a path relative to
# that directory) as $1.

function make_one {
	# Canned Platypus posts
	p_expr='single single-post postid-\([0-9]*\)'
	p_id=$(sed -n "/.*$p_expr.*/s//\\1/p" < "$1")
	if [ -n "$p_id" ]; then
		ln -s "$1" "post-$p_id.html"
		return
	fi
	# HekaFS posts
	p_expr=' name=.comment_post_ID. value=.\([0-9]*\). '
	p_id=$(sed -n "/.*$p_expr.*/s//\\1/p" < "$1")
	if [ -n "$p_id" ]; then
		ln -s "$1" "post-$p_id.html"
		return
	fi
	# Category pages
	c_expr='archive category category-[^ ]* category-\([0-9]*\)'
	c_id=$(sed -n "/.*$c_expr.*/s//\\1/p" < "$1")
	if [ -n "$c_id" ]; then
		ln -s "$1" "cat-$c_id.html"
		return
	fi
}

find "$1" -name index.html | while IFS= read -r f; do
	make_one "$f"
done

Notice how I had to handle the two blogs differently? It turns out that this information is theme-specific, and some themes might not include it at all. What I really should do is get this information from the database (correlate post_title with ID in wp_posts), but it works for now.
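
If I ever do switch to the database-driven approach, it would probably look something like the untested sketch below. The database name, credentials, and table prefix are placeholders, and the find assumes the wget output layout described above.

mysql -N -B -u wpuser -p wordpress -e \
    "SELECT ID, post_name FROM wp_posts WHERE post_type='post' AND post_status='publish'" |
while read -r id slug; do
	# Locate the generated directory for this slug and link it by number.
	dir=$(find pl.atyp.us/wordpress -type d -name "$slug" | head -n 1)
	[ -n "$dir" ] && ln -sf "$dir/index.html" "post-$id.html"
done

The second part is a web-server rewrite rule, to redirect CGI requests for an article or category by number to the appropriate link. Here’s what I’m using for Hiawatha right now.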

UrlToolkit {
    ToolkitID = cp-wordpress
    RequestURI isfile Return
    # Handle numeric post/category links.
    Match /wordpress/\?p=(.*) Rewrite /wordpress/links/post-$1.html Return
    Match /wordpress/\?cat=(.*) Rewrite /wordpress/links/cat-$1.html Return
    Call static-wordpress
}

What ends up happening here is that Hiawatha rewrites the CGI URL so that it points to the link I just created, which in turn points to the actual article. The "static-wordpress" URL toolkit handles another dynamic-link issue, this time related to JavaScript and CSS files.

UrlToolkit {
    ToolkitID = static-wordpress
    # Support multiple versions of CSS and JS files, with the right extensions.
    Match (.*)\.css\?(.*) Rewrite $1_$2.css Return
    Match (.*)\.js\?(.*) Rewrite $1_$2.js Return
    # Anything else gets the arguments lopped off.
    Match (.*)\?(.*) Rewrite $1 Return
}

I had to do this because Firefox would complain about CSS/JS files not having the right content type: Hiawatha would serve the wrong type unless the file name actually ended in .js or .css. For example, widgets.css?ver=20121003 wouldn't work. This rule rewrites it to widgets_ver=20121003.css, which does work. To go with that, I also have a second renaming script.

#!/bin/bash

workdir=$(mktemp -d)
trap "rm -rf $workdir" EXIT

# List the generated files whose names contain a question mark.
find "$1" -name '*\?*' | grep -Ev '&"' > $workdir/all

# Use edit-in-place instead of sed to avoid quoting nastiness.

# Handle CSS files.
grep '\.css' $workdir/all > $workdir/css
ed - $workdir/css <<EOF
g/\([^?]*\)\.css?\([^?]*\)/s//mv '\1.css?\2' '\1_\2.css'/
w
q
EOF

# Handle JavaScript files.
grep '\.js' $workdir/all > $workdir/js
ed - $workdir/js <<EOF
g/\([^?]*\)\.js?\([^?]*\)/s//mv '\1.js?\2' '\1_\2.js'/
w
q
EOF

# Handle everything else.
grep -Ev '\.css|\.js' $workdir/all > $workdir/gen
ed - $workdir/gen <<EOF
#g/\([^?]*\)?\([^?]*\)/s//mv '\1?\2' '\1_\2.html'/
g/\([^?]*\)?\([^?]*\)/s//rm '\1?\2'/
w
q
EOF

# Each edited list is now a series of mv/rm commands; execute them.
. $workdir/js
. $workdir/css
. $workdir/gen

Note that the script also deletes other (non-CSS non-JS) files with question marks, since wget will leave some of those lying around and (at least in my case) they're invariably useless. Similarly, the static-wordpress rewrite rule just deletes the question mark and anything after it.

At this point you should have a properly fixed-up blog structure, which you can push to your real server and serve as static files (assuming you have the right configuration). What's missing? Well, comments for one. I still vaguely plan to add an external comment service like Disqus or Livefyre, but to be honest I'm not in that much of a hurry because - while I do appreciate them - comments have never been a major part of the site. The other thing missing is search, and I'm still pondering what to do about that. Other than that, as you can see from the mere fact that you're reading this, the process described above seems to work pretty well. My web server is barely using any CPU or memory to serve up two sites, and my "attack surface" has been drastically reduced by not running MySQL or PHP at all.
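
As for the "right configuration", the static side doesn't need much beyond a virtual host that points at the wget output and pulls in the URL toolkits shown above. Here's a rough sketch; the hostname and paths are placeholders, and the directive names come from my reading of the Hiawatha manual, so double-check them there rather than trusting my memory.

VirtualHost {
    Hostname = pl.atyp.us
    WebsiteRoot = /var/www/pl.atyp.us
    StartFile = index.html
    # Pull in the rewrite rules defined earlier.
    UseToolkit = cp-wordpress
}

Pushing a freshly regenerated tree into that WebsiteRoot can then be as simple as an rsync or scp of the wget output directory.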

P.S. Hiawatha rocks. It's as easy to set up as nginx, it has at least as good a reputation for performance, and resource usage has been very low. I'd guess I can serve about 60x as much traffic as before, even without flooding protection - and that's the best thing about Hiawatha. I can set a per-user connection limit (including not just truly simultaneous connections but any occurring within N seconds of each other) and ban a client temporarily if that limit is exceeded. Even better, I can temporarily ban any client that makes more than M requests in N seconds. I've already seen this drive off several malware attempts and overly aggressive bots, while well-behaved bots and normal users are unaffected. This probably increases my load tolerance by up to another order of magnitude. This might not be the fastest site of any kind, but for a site that has (almost) all of the power of WordPress behind it I'd say it's doing pretty well.
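
For anyone wondering what those knobs look like, they're just a few lines among the global settings in hiawatha.conf. The numbers below are placeholders and the directive names are from my reading of the Hiawatha manual, so treat this as a sketch rather than a drop-in config.

# Values are placeholders; tune them for your own traffic.
ConnectionsPerIP = 10       # simultaneous connections allowed per client
BanOnMaxPerIP = 60          # seconds to ban a client that exceeds that limit
BanOnFlooding = 25/5:300    # more than 25 requests in 5 seconds: ban for 300 seconds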

Static Site is LIVE

If you’re seeing this, it’s because you’re on the new site, seeing static files served by Hiawatha instead of dynamic files served by nginx. If you notice anything else that’s different, let me know.

Static Site Update

As I mentioned too long ago, I’ve been planning to migrate this site to a different method of operation, for both performance and security reasons. Specifically, my approach allows me to add posts, change themes, etc. with all the power of WordPress and its community at hand, but then serve up the results as static pages. I have most of that working on my test site, with only two things not working: by-category listings (by-date listings work) and comments. I can actually do without comments for a while until I find an external solution that I like, but I feel like I do need to fix the by-category listings before I switch over. For the technically minded, here’s a rough outline of how I’m doing this.

  1. I have two Hiawatha configs – one for dynamic pages and one for static. These are currently on the same machine, but the plan is to run them on separate machines when I’m done.
  2. For editing etc. I just use the dynamic config and everything works just as it has for years.
  3. When I’m done editing, “wget -r -l inf -p -nc -D atyp.us http://pl.atyp.us/wordpress” gets me a static version of the site.
  4. I also have a script to rename some files and deal with a few other site-specific issues.
  5. When I’m all done, I switch over to my static Hiawatha config, which has a couple of URL-rewrite rules to work around the CGI-oriented URLs that WordPress produces.
  6. The live site is running no PHP or MySQL, just Hiawatha serving up static files.

The key point is that this all looks exactly the same as the current site running on a standard setup, even though it’s all very different behind the scenes. When I’m done, I’ll more fully document everything and put up a how-to for other WordPress users to follow.
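
In the meantime, steps 3 through 5 condense into something like the wrapper below. It's an untested sketch: make-links.sh and rename-files.sh are placeholder names for the link-making and renaming scripts shown earlier, and the host and paths are examples.

#!/bin/bash
# Untested end-to-end sketch: crawl, rename, build numeric links, publish.
set -e
wget -r -l inf -p -nc -D atyp.us http://pl.atyp.us/wordpress
rename-files.sh pl.atyp.us/wordpress                  # fix up CSS/JS names, drop junk
mkdir -p pl.atyp.us/wordpress/links
( cd pl.atyp.us/wordpress/links && make-links.sh .. )   # post-NNN.html and cat-NNN.html links
rsync -av --delete pl.atyp.us/ myserver:/var/www/pl.atyp.us/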

EDIT: Since someone else is sure to ask, I will. Why not just switch to a built-for-static system like Octopress? Here’s why.

  • I’d still have to convert the existing content. As long as I automate that process, it doesn’t matter much whether I do it once or many times. Even rebuilding the entire static site from scratch, which I’ve done a lot more while debugging than I’d ever do in normal operation, doesn’t take long enough to bother me. Selective rebuilds would be easy, and even faster.

  • I like the WordPress tools that I use to create, categorize, and present content. When it comes to plugins and themes, even the most popular/sophisticated static systems seem downright primitive by comparison, so I’d be back to doing more stuff by hand.
  • I’m very conservative about breaking links, and none of the static systems are fully compatible with the URL structure that I’ve been using for years.

My only gripes with WordPress are security and performance. Sure, I could make a more drastic change, and the pages would be a bit simpler/smaller if I did (even the simplest WordPress themes generate some horrendous HTML), but I’d need a better reason than that.

EDIT 2: Now that the static site is live, I no longer even run PHP/MySQL on the real server. This edit, and the next post, were added by running a copy of the site on my desktop at home and then copying only the output to the cloud.

Server Design in Serbo-Croatian

Ten and a half years ago, I wrote an article on server design. Considering that I probably worked harder on that than on anything I’ve posted since, I’m pleased that it has continued to be one of the most popular articles on the site despite its increasing age. It’s even more gratifying to know that some people are including it in their academic course materials – maybe half a dozen instances that I know of. Now, thanks to Anja Skrba, it has been translated into Serbo-Croatian. I’m totally unqualified to comment on the accuracy of the translation, but it’s a commendable contribution to the community nonetheless. Feedback on either the content or the translation would surely be appreciated by both of us. Thank you, Anja!

How (Not) To Collaborate

Collaboration is one of the most essential human skills, not just in work but in life generally, and yet it’s poorly taught (if at all) and a lot of people are bad at it. Programmers are especially bad at it, for a whole variety of reasons, and this last week has been like a crash course in just how bad. Collaboration means exchanging ideas. Here’s how I have seen people fail to participate in such exchanges recently.

  • Passive ignoring. No response at all.
  • Active ignoring. Nod, smile, put it on a list to die. This is what a lot of people do when they’ve been told they need to work on their collaboration skills, and want to create an appearance of collaboration without actually working at it.
  • Rejection. All variants of “no” and “what a terrible idea” and “my idea’s better” fall into this category.
  • “It’s my idea now.” The obvious version is just presenting the idea unchanged, as one’s own. The sneakier alternative is to tweak it a little, or re-implement it, so it’s not obvious it’s the same, but still present derivative work without credit to the original.
  • “It’s your problem now.” This is probably the most insidious of all. It presents an appearance of accession, but in fact no exchange of ideas has occurred. Just as importantly, the person doing this has presumed unilateral authority to decide whose problem it is, creating an unequal relationship.

The key to real collaboration is not only to accept a single idea itself, but to facilitate further exchange. Here are some ways to make that work.

  • Accept the context. Respect the priority that the other person gives to the idea along with the idea itself, and assume some responsibility for facilitating it. Don’t force people to remind, re-submit, or nag before you’ll really consider what they’re suggesting. Both active and passive ignoring are wrong because they violate this principle.
  • Don’t attach strings. Don’t make people jump through unnecessary hoops, or demand that they assume responsibility for more than the subject of their idea, just to have their idea considered. Obviously, “your problem now” and its cousin “you touch it you own it” violate this rule. I’ve left more jobs because of this tendency, which leaves people shackled to responsibilities they never asked for, than for any other reason. I don’t think I’m the only one.
  • Be a teacher, not a judge. Every opportunity for rejection is also an opportunity for teaching. If there’s something truly wrong with an idea, you should be able to explain the problem in such a way that everyone benefits. You owe it to your team or your community or even your friends and family to develop this skill.
  • Give credit. It will come back to you. People rarely give freely to notorious thieves and hoarders.

Note that I’m not making any appeals to morality here. I’m not saying it’s right to make collaboration easier. I’m saying it’s practical. When you make collaboration with you easy and pleasant, people want to do it more. That frees you to work on the problems that most interest you, and share credit for a successful project instead of getting no credit at all for a failed or stagnant one. When people try to do you a favor, try to accept graciously.