Using WordPress to Generate Static Pages

As you all know by now, I’ve changed the way I manage the content for this site. I now write posts in WordPress, then turn the results – after all of the database access, theme application, etc. – into static pages which are then served to you. One of my main reasons was increased security; that’s been a problem with WordPress for years, but the latest botnet has drawn fresh attention to it. Therefore, I’ll describe what I’ve done so that others can try something similar and maybe even improve on my recipe.

In order to understand what’s going on here, you have to know a bit about how a WordPress site is structured. It might surprise you to know that each post can be accessed in no fewer than six different ways. Each post by itself is available either by its name or by its number. Posts are also combined on the main page, in per-month lists, and in per-category lists. Lastly, the category lists are also reachable either by name or by number. In fact, if a post is in multiple categories it will appear in more than six places. It’s important to preserve all of this structure, or else links will break. This is why I didn’t use a WordPress plugin to generate the static content, by the way: even in the most cursory testing, every one I tried failed to maintain this structure properly. Most of the work I had to do was related to getting that structure just right, but first you have to generate all of the content, so I’ll start there.
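To make that structure concrete, here is the full set of URL shapes under which a single post can appear, assuming a typical name-based WordPress permalink scheme. The slug, post number, month, and category below are made-up examples, not real posts on this site.

```shell
#!/bin/sh
# The six-plus URLs under which one post can show up. All of the
# specific values here are hypothetical placeholders.
base="http://pl.atyp.us/wordpress"
slug="some-post"
id=1234
month="2013/04"
cat="storage"
cat_id=7

printf '%s\n' "$base/$month/$slug/"   # the post itself, by name
printf '%s\n' "$base/?p=$id"          # the post itself, by number
printf '%s\n' "$base/"                # the main page
printf '%s\n' "$base/$month/"         # the per-month list
printf '%s\n' "$base/category/$cat/"  # the category list, by name
printf '%s\n' "$base/?cat=$cat_id"    # the category list, by number
```

A post in two categories would appear in two category lists under each addressing scheme, which is how the count can exceed six.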

The basic idea for fetching the content is to crawl your own site with wget. Start with a normally working WordPress installation. Make sure you’ve set your WordPress options to use a name-based (rather than number-based) URL structure, and turned comments off on all posts. Then issue something like the following command.

wget -r -l inf -p -nc -D atyp.us http://pl.atyp.us/wordpress

This might take a while. For me it’s about twenty minutes, but this is an unusually old blog. Also, you don’t need to do this all the time. Usually, you should be able to regenerate the post itself plus its global/month/category timelines, but not touch other months or categories. At this point you’ll have a very simple static version of your site, good enough as an archive or something, but you’ll need to fix it up a bit before you can really use it to replace the original.
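A selective refresh after editing one post might look something like the sketch below: re-fetch just the pages that could have changed, rather than recrawling everything. The paths are hypothetical, and the `echo` makes this a dry run; to fetch for real, drop the `echo` (and skip `-nc`, so changed pages actually get re-downloaded).

```shell
#!/bin/sh
# Dry-run sketch of regenerating one post plus the listings it
# appears on. The post path, month, and category are made up.
base="http://pl.atyp.us/wordpress"
for page in 2013/04/some-post/ "" 2013/04/ category/storage/; do
    echo wget -p -D atyp.us "$base/$page"
done
```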

The first fix has to do with accessing the same page by name or by number. One of my goals was to avoid rewriting the actual contents of the generated pages. I don’t mind copying, adding links, adding web-server rewrite rules, and so on, but rewriting content only fixes things for me. Any links from outside would still be broken. My solution here has two parts. The first is a script, which finds the name-to-number mapping inside each article and uses that information to create a directory full of symbolic links. Here it is, but be aware that I plan to improve it for reasons I’ll get to in a moment.

#!/bin/bash

function make_one {
	# Canned Platypus posts
	p_expr='single single-post postid-\([0-9]*\)'
	p_id=$(sed -n "/.*$p_expr.*/s//\\1/p" < "$1")
	if [ -n "$p_id" ]; then
		ln -s "$1" "post-$p_id.html"
		return
	fi
	# HekaFS posts
	p_expr=' name=.comment_post_ID. value=.\([0-9]*\). '
	p_id=$(sed -n "/.*$p_expr.*/s//\\1/p" < "$1")
	if [ -n "$p_id" ]; then
		ln -s "$1" "post-$p_id.html"
		return
	fi
	# Category pages (either blog)
	c_expr='archive category category-[^ ]* category-\([0-9]*\)'
	c_id=$(sed -n "/.*$c_expr.*/s//\\1/p" < "$1")
	if [ -n "$c_id" ]; then
		ln -s "$1" "cat-$c_id.html"
		return
	fi
}

find "$1" -name index.html | while read -r f; do
	make_one "$f"
done

Notice how I had to handle the two blogs differently? It turns out that this information is theme-specific, and some themes might not include it at all. What I really should do is get this information from the database (correlate post_title with ID in wp_posts), but it works for now. The second part is a web-server rewrite rule, to redirect CGI requests for an article or category by number to the appropriate link. Here’s what I’m using for Hiawatha right now.

UrlToolkit {
    ToolkitID = cp-wordpress
    RequestURI isfile Return
    # Handle numeric post/category links.
    Match /wordpress/\?p=(.*) Rewrite /wordpress/links/post-$1.html Return
    Match /wordpress/\?cat=(.*) Rewrite /wordpress/links/cat-$1.html Return
    Call static-wordpress
}

What ends up happening here is that Hiawatha rewrites the CGI URL so that it points to the link I just created, which in turn points to the actual article. The "static-wordpress" URL toolkit handles another dynamic-link issue, this time related to JavaScript and CSS files.

UrlToolkit {
    ToolkitID = static-wordpress
    # Support multiple versions of CSS and JS files, with the right extensions.
    Match (.*)\.css\?(.*) Rewrite $1_$2.css Return
    Match (.*)\.js\?(.*) Rewrite $1_$2.js Return
    # Anything else gets the arguments lopped off.
    Match (.*)\?(.*) Rewrite $1 Return
}

I had to do this because it turned out that Firefox would complain about CSS/JS files not having the right type: Hiawatha would serve the wrong content type unless the file name ended in .js or .css respectively. For example, widgets.css?ver=20121003 wouldn't work. This rule rewrites it to widgets_ver=20121003.css, which does work. To go with that, I also have a second renaming script.

#!/bin/bash
 
workdir=$(mktemp -d)
trap "rm -rf $workdir" EXIT
 
find "$1" -name '*\?*' | grep -Ev '[&"]' > $workdir/all
 
# Use edit-in-place instead of sed to avoid quoting nastiness.
 
# Handle CSS files.
grep '\.css' $workdir/all > $workdir/css
ed - $workdir/css << EOF
g/\([^?]*\)\.css?\([^?]*\)/s//mv '\1.css?\2' '\1_\2.css'/
w
q
EOF
 
# Handle JavaScript files.
grep '\.js' $workdir/all > $workdir/js
ed - $workdir/js << EOF
g/\([^?]*\)\.js?\([^?]*\)/s//mv '\1.js?\2' '\1_\2.js'/
w
q
EOF
 
# Handle everything else.
grep -Ev '\.css|\.js' $workdir/all > $workdir/gen
ed - $workdir/gen << EOF
#g/\([^?]*\)?\([^?]*\)/s//mv '\1?\2' '\1_\2.html'/
g/\([^?]*\)?\([^?]*\)/s//rm '\1?\2'/
w
q
EOF
 
. $workdir/js
. $workdir/css
. $workdir/gen

Note that the script also deletes other (non-CSS non-JS) files with question marks, since wget will leave some of those lying around and (at least in my case) they're invariably useless. Similarly, the static-wordpress rewrite rule just deletes the question mark and anything after it.
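As for the theme-specific scraping I complained about earlier, the database-driven alternative would take the name-to-number mapping straight from the wp_posts table. A rough sketch follows; since credentials vary, the query output is simulated here and the `ln` is echoed as a dry run. A real run would use something like `mysql -N -B wordpress -e "SELECT ID, post_name FROM wp_posts WHERE post_type='post' AND post_status='publish';"` in place of the printf.

```shell
#!/bin/sh
# Build the links directory from (simulated) wp_posts query output.
# Format: tab-separated ID and slug, one post per line. Each crawled
# article lives at <slug>/index.html in the wget output tree.
printf '101\tsome-post\n102\tanother-post\n' |
while read -r id slug; do
    echo ln -s "../$slug/index.html" "links/post-$id.html"
done
```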

At this point you should have a properly fixed-up blog structure, which you can push to your real server and serve as static files (assuming you have the right configuration). What's missing? Well, comments for one. I still vaguely plan to add an external comment service like Disqus or Livefyre, but to be honest I'm not in that much of a hurry because - while I do appreciate them - comments have never been a major part of the site. The other thing missing is search, and I'm still pondering what to do about that. Other than that, as the very fact that you're reading this shows, the process described above seems to work pretty well. My web server is barely using any CPU or memory to serve up two sites, and my "attack surface" has been drastically reduced by not running MySQL or PHP at all.

P.S. Hiawatha rocks. It's as easy to set up as nginx, it has at least as good a reputation for performance, and its resource usage has been very low. I'd guess I can serve about 60x as much traffic as before, even before counting flooding protection - which is the best thing about Hiawatha. I can set a per-user connection limit (including not just truly simultaneous connections but any occurring within N seconds of each other) and ban a client temporarily if that limit is exceeded. Even better, I can temporarily ban any client that makes more than M requests in N seconds. I've already seen this drive off several malware attempts and overly aggressive bots, while well-behaved bots and normal users are unaffected. This probably increases my load tolerance by up to another order of magnitude. This might not be the fastest site of any kind, but for a site that has (almost) all of the power of WordPress behind it I'd say it's doing pretty well.
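For reference, the flooding-protection knobs just described look roughly like this in a Hiawatha configuration. The numbers here are illustrative, not the actual limits in use on this site.

```
# Limit simultaneous connections from a single IP address.
ConnectionsPerIP = 10
# Remember a client's address for 3 seconds after disconnect, so quick
# reconnects count against the per-IP limit too.
ReconnectDelay = 3
# Ban a client for 300 seconds if it makes more than 25 requests
# within 5 seconds.
BanOnFlooding = 25/5:300
```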

Static Site is LIVE

If you’re seeing this, it’s because you’re on the new site, seeing static files served by Hiawatha instead of dynamic files served by nginx. If you notice anything else that’s different, let me know.

Static Site Update

As I mentioned too long ago, I’ve been planning to migrate this site to a different method of operation, for both performance and security reasons. Specifically, my approach allows me to add posts, change themes, etc. with all the power of WordPress and its community at hand, but then serve up the results as static pages. I have most of that working on my test site, with only two things not working: by-category listings (by-date listings work) and comments. I can actually do without comments for a while until I find an external solution that I like, but I feel like I do need to fix the by-category listings before I switch over. For the technically minded, here’s a rough outline of how I’m doing this.

  1. I have two Hiawatha configs – one for dynamic pages and one for static. These are currently on the same machine, but the plan is to run them on separate machines when I’m done.
  2. For editing etc. I just use the dynamic config and everything works just as it has for years.
  3. When I’m done editing, “wget -r -l inf -p -nc -D atyp.us” gets me a static version of the site.
  4. I also have a script to rename some files and deal with a few other site-specific issues.
  5. When I’m all done, I switch over to my static Hiawatha config, which has a couple of URL-rewrite rules to work around the CGI-oriented URLs that WordPress produces.
  6. The live site is running no PHP or MySQL, just Hiawatha serving up static files.

The key point is that this all looks exactly the same as the current site running on a standard setup, even though it’s all very different behind the scenes. When I’m done, I’ll more fully document everything and put up a how-to for other WordPress users to follow.

EDIT: Since someone else is sure to ask, I will. Why not just switch to a built-for-static system like Octopress? Here’s why.

  • I’d still have to convert the existing content. As long as I automate that process, it doesn’t matter much whether I do it once or many times. Even rebuilding the entire static site from scratch, which I’ve done a lot more while debugging than I’d ever do in normal operation, doesn’t take long enough to bother me. Selective rebuilds would be easy, and even faster.

  • I like the WordPress tools that I use to create, categorize, and present content. When it comes to plugins and themes, even the most popular/sophisticated static systems seem downright primitive by comparison, so I’d be back to doing more stuff by hand.
  • I’m very conservative about breaking links, and none of the static systems are fully compatible with the URL structure that I’ve been using for years.

My only gripes with WordPress are security and performance. Sure, I could make a more drastic change, and the pages would be a bit simpler/smaller if I did (even the simplest WordPress themes generate some horrendous HTML), but I’d need a better reason than that.

EDIT 2: Now that the static site is live, I no longer even run PHP/MySQL on the real server. This edit, and the next post, were added by running a copy of the site on my desktop at home and then copying only the output to the cloud.

Changes Coming Soon

I’m going to be making a few changes around the site soon, so I figured I’d give people a bit of warning in case something goes wrong.

  • I’ll be moving. It’s not that I have any problem whatsoever with Rackspace. I must emphasize that they’ve been purely awesome while I’ve been here. The issue is purely one of location. My goal is to reduce my total work-to-home latency, and they’re not in the right place for that. By moving, I can make that latency half of what it is currently, and a third of what it was back when I did things the “recommended” way. I’ll write more some day about the configuration I’ve been using for the last few weeks. For now, it should suffice to say that everything I’ve done here I can do exactly the same way at the new place.
  • I’ll be changing how both blogs (pl.atyp.us and hekafs.org) get served. Between the two of them, this site overall has gone from barely detectable to a small blip. I have no pretensions of this being a truly big or important site, but even a blip needs to consider performance. I have a plan to continue using WordPress as a content-generation system, but I’ll actually be the only one using it directly. What everyone else will see is the result of a script that slurps all of the articles and category/month lists out of WordPress and converts them into (slightly optimized) static files that nginx can serve up by itself with maximum HTTP caching goodness. There will be no need for MySQL or PHP except when I’m adding new posts. There won’t be any need for varnish either, which I consider a good thing since it just screwed me last night by croaking for no reason whatsoever and leaving both blogs dead in the water.
  • As part of the static-page strategy, comments will have to change. Instead of using WordPress comments, I’ll switch to using an external system – probably Livefyre. Old comments will still be visible as part of the static pages, but new comments – including those on old posts – will go through a new system.
  • I’ll probably change the theme too, this time to something as minimal as I can find. No widgets. No sidebars. Just a modest header, a small menu bar, and the articles themselves.

If everything goes well, these changes will have only minimal effect on readers. Comments will look a little different, and load times will generally be faster, but it will still be the same guy writing about the same things in the same style. Stay tuned.

To The Cloud . . . And Beyond!

You might not have noticed, but I just moved. As part of my ongoing project to consolidate my various web “properties” I upgraded and updated my Rackspace cloud server (which I’ve been using for two years), and put nginx + php_fpm + mysql on it to serve my websites. It probably wasn’t the best idea to do the move on the same day I posted something as inflammatory as my last post – there was some virtual-memory tuning I’d forgotten, and I did get bitten by the uber-stupid “OOM Killer” under the Hacker News load – but it all seems to be working out otherwise. One of the nice things is that I can resize my server any time I expect a similar spike, then shrink it again when the spike’s over. If I were really motivated I’d do it all automatically, but I don’t have that kind of spare time.

So, as usual, please let me know if you see any glitches. One of the things the traffic spike did for me was show that normal stuff is working, but some stuff around the edges might still need tweaking. I know FTP access and image links to womb.atyp.us (in old posts) aren’t working. Anything else?

Eleventh Anniversary

Yesterday was this blog’s eleventh birthday. Actually I’m not quite sure when I created it, but that’s the date of the earliest “log entry” I could find; this was before the term “blog” was in common use, so I wasn’t calling them posts yet. Just for fun, here’s a quick overview of how the blog has evolved over time.

Month            Host           Software                                Notable/Typical Post
August 2000      NameZero       Static HTML                             Programmers’ Lifestyles
December 2000                                                           The Mythical Linux Month
October 2001     JTLnet         Home grown                              Multithreading vs. Event-Based Programming
February 2002                   Home grown with comments (“PlatSpot”)   Freenet Thought Experiment
August 2002                                                             Server Design
February 2003    Burton                                                 The Whole Sad Story
April 2003                                                              Ruin Dissed
March 2004       Total Choice                                           Stack the Ripper
May 2004                        pMachine                                Silly Assertions
June 2004                                                               We Interrupt This Weblog…
October 2004                    WordPress                               Waiting for GODOT_RESP
March 2006       Site5                                                  Amazon S3
June 2006        eMax                                                   Net Neutrality
September 2006                                                          Verizon’s DNS Sucks
January 2008     InMotion                                               Linux Pointer Types
September 2008   GlowHost                                               Aspire One Wireless
August 2009                                                             Amazon vs. Rackspace vs. Flexiscale
November 2009                                                           Availability and Partition Tolerance
July 2010                                                               It’s Faster Because It’s C

Thanks to all who have inspired posts, commented on posts, sent email, replied elsewhere, or generally helped to make this a useful place to collect my thoughts.

Stones in the Pond

I’ve been on vacation for the last few days, and while I was (mostly) gone a few interesting things seem to have happened here on the blog. The first is that, after a totally unremarkable first week, my article It’s Faster Because It’s C suddenly had a huge surge in popularity. In a single day it has become my most popular post ever, more than 2x its nearest competitor, and it seems to have spawned a couple of interesting threads on Hacker News and Reddit as well. I’m rather amused that the “see, you can use Java for high-performance code” and the “see, you can’t…” camps seem about evenly matched. Some people seem to have missed the point in even more epic fashion, such as by posting totally useless results from trivial “tests” where process startup dominates the result and the C version predictably fares better, but overall the conversations have been interesting and enlightening.

One particularly significant point several have made is that a program doesn’t have to be CPU-bound to benefit from being written in C, and that many memory-bound programs have that characteristic as well. I don’t think it changes my main point, because memory-bound programs were only one category where I claimed a switch to C wouldn’t be likely to help. Also, programs that store or cache enough data to be memory-bound will continue to store and cache lots of data in any language. They might hit the memory wall a bit later, but not enough to change the fundamental dynamics of balancing implementation vs. design or human cycles vs. machine cycles. Still, it’s a good point and if I were to write a second version of the article I’d probably change things a bit to reflect this observation.

(Side point about premature optimization: even though this article has been getting more traffic than most bloggers will ever see, my plain-vanilla WordPress installation on budget-oriented GlowHost seems to have handled it just fine. Clearly, any time spent hyper-optimizing the site would have been wasted.)

As gratifying as that traffic burst was, though, I was even more pleased to see that Dan Weinreb also posted his article about the CAP Theorem. This one was much less of a surprise, not only because he cites my own article on the same topic but also because we’d had a pretty lengthy email exchange about it. In fact, one part of that conversation – the observation that the C in ACID and the C in CAP are not the same – had already been repeated a few times and taken on a bit of a life of its own. I highly recommend that people go read Dan’s post, and encourage him to write more. The implications of CAP for system designers are subtle, impossible to grasp from reading only second-hand explanations – most emphatically including mine! – and every contribution to our collective understanding of it is valuable.

That brings us to what ties these two articles together – besides the obvious opportunity for me to brag about all the traffic and linkage I’m getting. (Hey, I admit that I’m proud of that.) The underlying theme is dialog. Had I kept my thoughts on these subjects to myself or discussed them only with my immediate circle of friends/colleagues, or had Dan done so, or had any of the re-posters and commenters anywhere, we all would have missed an opportunity to learn together. It’s the open-source approach to learning – noisy and messy and sometimes seriously counter-productive, to be sure, but ultimately leading to something better than the “old way” of limited communication in smaller circles. Everyone get out there and write about what interests you. You never know what the result might be, and that’s the best part.

(Dedication: to my mother, who did much to teach me about writing and even more about the importance of seeing oneself as a writer.)

Spring Cleaning

You might notice that things look a little different around here. I stayed up a bit later than I meant to last night, tweaking a new theme to give the place a slightly fresher look. If I broke anything too badly you wouldn’t be able to read this, but for minor damage just leave a comment. Thanks!

Site Trends

Just for fun, I decided to spend part of my lunch break generating a graph of my top twenty posts. The first graph I did was total hits vs. date posted. I know this site has become more popular lately, and I wasn’t too surprised to see that the increase in popularity outweighs the effect of older posts having had longer to rack up the numbers. What I thought might be more interesting was a graph of hits per day instead of total hits. Since #20 was a too-recent anomaly, I only graphed the top nineteen. Here’s the result.

hits per day vs. date posted

Not bad. Of course, for this graph the general effect of age is the opposite of what it would be for total hits. I’d like to graph “hits in first month” but that would require a lot more log processing than I can do on my lunch break. The thing I find most interesting is what this tells me about my evolving readership. Eight out of the ten most recent posts to make the list are technical, as are six out of the top ten by total hits and seven out of the top ten by hits per day. While I’ve gone through periods of less technical blogging, and some of the results do make the top twenty, the technical stuff is clearly what people come here for. Most of my family and some of my friends have learned to look for other stuff (including family pictures) on Facebook, and I’ve deliberately cut back on the political stuff in general (my recent blip about the election notwithstanding). As I predicted back in my “Unemployed!” post – #15 total and #8 per day – this blog has become and will probably continue to be more technical than it was during the mid-oughties.

Anti-Social Networking

One of the things I’ve really come to dislike about many bloggers is their endless self-promotion. Many people seem to follow up even their most trivial blog post by linking to it on Twitter, on Facebook, on LinkedIn, on several tech-news aggregators (dzone is particularly afflicted by this), and on just as many mailing lists. I don’t see the point. I believe we’ve all become sufficiently well connected that good content will tend to find its audience without such shenanigans. I occasionally write something that gets linked elsewhere, causing a spike in my readership – sometimes well after an article was actually posted. I like that, I take pride in it, but my traffic numbers today are only interesting relative to my own traffic numbers yesterday. Increasing traffic means I’m getting better at writing things that my audience seems to like, and that makes me happy. I feel no need to compare my numbers to anyone else’s, though. If I started pimping my articles everywhere I could, my traffic would start to reflect the effort I put into self-promotion, instead of the effort I put into thinking and writing, and all comparisons to my own historical numbers would be invalid. That seems like a loss to me.

If you like something I write here, and think some other audience would benefit from seeing it, by all means post a link wherever you want. I know some of my readers have already done that many times, and I thank them for it – especially you, Wes, wherever you are. I think such genuine “votes of confidence” from others are worth far more – to me and to readers – than me linking to myself could ever be, which is a large part of why I decline to play that game. I’m opting out of that particular rat race, and any other race that can only be won by the biggest rat. I like my niche.