Using WordPress to Generate Static Pages

As you all know by now, I’ve changed the way I manage the content for this site. I now write posts in WordPress, then turn the results – after all of the database access, theme application, etc. – into static pages which are then served to you. One of my main reasons was security, which has been a problem for WordPress for years and has drawn a lot of attention recently because of the latest botnet. So I’ll describe what I’ve done, in the hope that others can try something similar and maybe even improve on my recipe.

In order to understand what’s going on here, you have to know a bit about how a WordPress site is structured. It might surprise you to know that each post can be accessed no fewer than six different ways. Each post by itself is available either by its name or by its number. Posts are also combined on the main page, in per-month lists, and in per-category lists. Lastly, the category lists are themselves reachable either by name or by number. In fact, if a post is in multiple categories it will appear in more than six places. It’s important to preserve all of this structure, or else links will break. This is why I didn’t use a WordPress plugin to generate the static content, by the way. Even in the most cursory testing, every one I tried failed to maintain this structure properly. Most of the work I had to do was related to getting that structure just right, but first you have to generate all of the content, so I’ll start there.

The basic idea for fetching the content is to crawl your own site with wget. Start with a normally working WordPress installation. Make sure you’ve set your WordPress options to use a name-based (rather than number-based) URL structure, and turned comments off on all posts. Then issue something like the following command.

wget -r -l inf -p -nc -D atyp.us http://pl.atyp.us/wordpress
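# -r -l inf   recurse with no depth limit
# -p          also fetch each page's requisites (CSS, JS, images)
# -nc         "no clobber": skip files that have already been downloaded
# -D atyp.us  keep the crawl within the atyp.us domain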

This might take a while. For me it’s about twenty minutes, but this is an unusually old blog. Also, you don’t need to do a full crawl every time. Usually, when a post is added or changed, you only need to regenerate that post plus the global/month/category index pages that list it, without touching other months or categories. At this point you’ll have a very simple static version of your site, good enough as an archive or something, but you’ll need to fix it up a bit before you can really use it to replace the original.

The first fix has to do with accessing the same page by name or by number. One of my goals was to avoid rewriting the actual contents of the generated pages. I don’t mind copying, adding links, adding web-server rewrite rules, and so on, but rewriting content would only fix the links within my own pages; any links from outside would still be broken. My solution here has two parts. The first is a script that finds the name-to-number mapping inside each article and uses that information to create a directory full of symbolic links. Here it is, but be aware that I plan to improve it for reasons I’ll get to in a moment.

#!/bin/bash

# Create numeric aliases (post-NNN.html, cat-NNN.html) for the name-based
# pages that wget fetched, by scraping the post or category ID that the
# theme embeds in each page. $1 is the root of the static tree.
function make_one {
	# Canned Platypus posts
	p_expr='single single-post postid-\([0-9]*\)'
	p_id=$(sed -n "/.*$p_expr.*/s//\\1/p" < "$1")
	if [ -n "$p_id" ]; then
		ln -s "$1" "post-$p_id.html"
		return
	fi
	# HekaFS posts
	p_expr=' name=.comment_post_ID. value=.\([0-9]*\). '
	p_id=$(sed -n "/.*$p_expr.*/s//\\1/p" < "$1")
	if [ -n "$p_id" ]; then
		ln -s "$1" "post-$p_id.html"
		return
	fi
	# Category pages
	c_expr='archive category category-[^ ]* category-\([0-9]*\)'
	c_id=$(sed -n "/.*$c_expr.*/s//\\1/p" < "$1")
	if [ -n "$c_id" ]; then
		ln -s "$1" "cat-$c_id.html"
		return
	fi
}

find "$1" -name index.html | while read -r f; do
	make_one "$f"
done

Notice how I had to handle the two blogs differently? It turns out that this information is theme-specific, and some themes might not include it at all. What I really should do is pull the mapping straight from the database, correlating each post’s slug (post_name in wp_posts) with its ID, but the scraping approach works for now; a sketch of the database version follows.
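
Something like this would do it, going to MySQL instead of the rendered pages. It’s only a sketch: the database name and user are placeholders, and it assumes each post’s static copy ended up under a directory named after its slug, so adjust it to match your own permalink structure.

#!/bin/bash
# Hypothetical database-driven version of the link script above.
# $1 is the root of the static tree produced by wget.
mysql -N -B -u wpuser -p wordpress -e \
	"SELECT ID, post_name FROM wp_posts
	 WHERE post_status = 'publish' AND post_type = 'post';" |
while read -r post_id slug; do
	post_dir=$(find "$1" -type d -name "$slug" -print -quit)
	if [ -f "$post_dir/index.html" ]; then
		ln -s "$post_dir/index.html" "post-$post_id.html"
	fi
done

The second part of my solution is a web-server rewrite rule, to redirect CGI-style requests for an article or category by number to the appropriate link. Here’s what I’m using for Hiawatha right now.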

UrlToolkit {
    ToolkitID = cp-wordpress
    RequestURI isfile Return
    # Handle numeric post/category links.
    Match /wordpress/\?p=(.*) Rewrite /wordpress/links/post-$1.html Return
    Match /wordpress/\?cat=(.*) Rewrite /wordpress/links/cat-$1.html Return
    Call static-wordpress
}

What ends up happening here is that Hiawatha rewrites the CGI URL so that it points to the link I just created, which in turn points to the actual article. The "static-wordpress" URL toolkit handles another dynamic-link issue, this time related to JavaScript and CSS files.

UrlToolkit {
    ToolkitID = static-wordpress
    # Support multiple versions of CSS and JS files, with the right extensions.
    Match (.*)\.css\?(.*) Rewrite $1_$2.css Return
    Match (.*)\.js\?(.*) Rewrite $1_$2.js Return
    # Anything else gets the arguments lopped off.
    Match (.*)\?(.*) Rewrite $1 Return
}

I had to do this because Firefox would complain about CSS/JS files not having the right content type: Hiawatha serves the wrong MIME type unless the file name actually ends in .css or .js. For example, widgets.css?ver=20121003 wouldn't work, so this rule rewrites the request to widgets_ver=20121003.css, which does. To go with that, I also have a second renaming script so the files on disk match.

#!/bin/bash
 
workdir=$(mktemp -d)
trap "rm -rf $workdir" EXIT
 
find "$1" -name '*\?*' | grep -Ev '&"' > $workdir/all
 
# Use edit-in-place instead of sed to avoid quoting nastiness.
 
# Handle CSS files.
grep '\.css' $workdir/all > $workdir/css
ed - $workdir/css << EOF
g/\([^?]*\)\.css?\([^?]*\)/s//mv '\1.css?\2' '\1_\2.css'/
w
q
EOF
 
# Handle JavaScript files.
grep '\.js' $workdir/all > $workdir/js
ed - $workdir/js << EOF
g/\([^?]*\)\.js?\([^?]*\)/s//mv '\1.js?\2' '\1_\2.js'/
w
q
EOF
 
# Handle everything else.
grep -Ev '\.css|\.js' $workdir/all > $workdir/gen
ed - $workdir/gen << EOF
#g/\([^?]*\)?\([^?]*\)/s//mv '\1?\2' '\1_\2.html'/
g/\([^?]*\)?\([^?]*\)/s//rm '\1?\2'/
w
q
EOF
 
. $workdir/js
. $workdir/css
. $workdir/gen

Note that the script also deletes other (non-CSS non-JS) files with question marks, since wget will leave some of those lying around and (at least in my case) they're invariably useless. Similarly, the static-wordpress rewrite rule just deletes the question mark and anything after it.

At this point you should have a properly fixed-up blog structure, which you can push to your real server and serve as static files (assuming you have the right configuration). What's missing? Well, comments, for one. I still vaguely plan to add an external comment service like Disqus or Livefyre, but to be honest I'm not in that much of a hurry because - while I do appreciate them - comments have never been a major part of the site. The other thing missing is search, and I'm still pondering what to do about that. Other than that, as the fact that you're reading this shows, the process described above seems to work pretty well. My web server is barely using any CPU or memory to serve up two sites, and my "attack surface" has been drastically reduced by not running MySQL or PHP at all.
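
For what it's worth, the virtual-host half of that "right configuration" doesn't need to be much more than the following. The hostname and path here are illustrative, and UseToolkit is what ties in the rewrite rules shown earlier (cp-wordpress already calls static-wordpress itself).

VirtualHost {
    Hostname = pl.atyp.us
    WebsiteRoot = /var/www/pl.atyp.us
    StartFile = index.html
    UseToolkit = cp-wordpress
}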

P.S. Hiawatha rocks. It's as easy to set up as nginx, it has at least as good a reputation for performance, and resource usage has been very low. I'd guess I can serve about 60x as much traffic as before, even without flooding protection - and that's the best thing about Hiawatha. I can set a per-user connection limit (including not just truly simultaneous connections but any occurring within N seconds of each other) and ban a client temporarily if that limit is exceeded. Even better, I can temporarily ban any client that makes more than M requests in N seconds. I've already seen this drive off several malware attempts and overly aggressive bots, while well-behaved bots and normal users are unaffected. This probably increases my load tolerance by up to another order of magnitude. This might not be the fastest site of any kind, but for a site that has (almost) all of the power of WordPress behind it I'd say it's doing pretty well.
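
For the curious, the knobs I'm describing look like this in hiawatha.conf. The numbers are illustrative rather than what I actually run, so check the Hiawatha manual for the exact semantics before copying them.

# Simultaneous connections allowed per client; a connection closed within
# ReconnectDelay seconds still counts against the limit.
ConnectionsPerIP = 10
ReconnectDelay = 3
# Ban a client for 60 seconds if it exceeds that connection limit.
BanOnMaxPerIP = 60
# Ban a client for 300 seconds if it makes more than 25 requests in 5 seconds.
BanOnFlooding = 25/5:300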

My Brother Rocks

In just about every technical community but one, I probably have a higher profile nowadays than my brother Kevin. I say that not out of younger-sibling competitiveness, but almost for the exact opposite reason – to point out that he’s a pretty technical guy too, and largely responsible for my being one. Here are a couple of points of evidence.

  • His interest in computers predates mine. When a friend of the family loaned us what had to be one of the first TRS-80 computers in New Zealand, it was Kevin who really jumped all over the opportunity.
  • He made a lot more effort regarding computers. One of the very first things he did when he came to the US (a year after our mother and I did) was save up and buy an Apple II. Given the price tag and our economic circumstances at the time, that was a pretty major expenditure. He dove right into 6502 programming, still years before I took programming seriously.
  • He was involved in open-source long before I was . . . except it wasn’t called that back then. Kevin was on the NetHack 3 development team, which was a pretty complex global enterprise. If you were to look at the way the developers coordinated, you’d recognize a lot of the patterns in common use today. This was back in 1989, as I was just starting my own programming career.

Since then, I’ve gone on to infamy and misfortune. Kevin is now a DNS guru, which is why I said “every community but one” earlier. As it happens, this knowledge came in handy just recently. I’m trying to consolidate my web “properties” which are currently spread all over the place. I want to use one provider for DNS, one for email, and one for everything else. GlowHost is very soon going to be web-only, and not even that once I get un-stuck enough to set up my own nginx/PHP/etc. configuration on a cloud server I already use for a bunch of other things. As I was trying to move email from GlowHost to FastMail I ran into a glitch. I transferred DNS and email for one of my less-used domains just fine. When I tried to move atyp.us – yes, this domain right here – the DNS part seemed to be OK but I was having trouble with email. I was able to get email on FastMail, but I could see from the headers that it was still going through GlowHost first. I looked at the NS and MX records from a bunch of different places, and everything seemed fine, but even after several days I was still seeing this screwy behavior. Time to call in the DNS expert to see what I was missing.

Pause: can anyone else guess?

The problem turned out to be that mail transfer agents are dumber than I thought, and my silly insistence on using pl.atyp.us instead of atyp.us was confusing the poor babies. Even though I had the MX records for atyp.us and *.atyp.us in place, they’d still fail to find an MX record for pl.atyp.us specifically. Then, they wouldn’t even go “up the tree” and get the MX for atyp.us as I thought they would (and as the SOA for pl.atyp.us makes pretty clear). Instead – and this is the part where Kevin was able to point me in the right direction – they’d fall back to looking for an A record which was still pointing to GlowHost because that’s still where the website is. Bingo. I added the “pl” MX records, and I can already see email flowing in without going through GlowHost.
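
If you want to see this kind of thing for yourself, a few dig queries tell the whole story. These are the real names from above, with the output omitted; the first query is what the MTAs were effectively asking before I added the missing records.

dig pl.atyp.us mx +short   # before the fix: empty, no MX at this exact name
dig atyp.us mx +short      # the MX records I expected to be used
dig pl.atyp.us a +short    # the A record the MTAs fall back to, still pointing at GlowHost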

So thank you, Older Brother. No, not for the MX thing. For every thing.

Amazon’s Own Post Mortem

Amazon has posted their own explanation of the recent EBS failure. Since I had offered some theories earlier, I think it’s worthwhile to close this out by comparing my theories with Amazon’s explanation. Specifically, I had suggested two things.

  • EBS got into a state where it didn’t know what had been replicated, and fell back to re-replicating everything.
  • There was inadequate flow control on the re-replication/re-mirroring traffic, causing further network overload.

It turns out that both theories were slightly correct but mostly incorrect. Here’s the most relevant part of Amazon’s account.

When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.

the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly

The first part refers to the sort of full re-mirroring that I had mentioned, although it was re-mirroring to a new replica instead of an old one. The last part is a classic congestion-collapse pattern: transient failure, followed by too-aggressive retries that turn the transient failure into a persistent one. I had thought this would apply to the data traffic, but according to Amazon it affected the “control plane” instead. This is also what caused it to affect multiple availability zones, since the control plane – unlike the data plane – spans availability zones within a region.
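
To make that pattern concrete, the missing ingredient is the kind of client-side damping shown below: retries with capped exponential backoff and jitter, so a burst of failures doesn’t turn into a permanent storm. This is a toy shell sketch, nothing to do with Amazon’s actual code, and try_to_remirror is a made-up stand-in for “search the cluster for free space.”

#!/bin/bash
# Toy illustration of capped exponential backoff with jitter.
try_to_remirror() {
	# Stand-in for the real operation; pretend it succeeds about 10% of the time.
	(( RANDOM % 10 == 0 ))
}

delay=1
until try_to_remirror; do
	sleep $(( delay + RANDOM % delay ))          # jitter spreads the retries out
	(( delay < 64 )) && delay=$(( delay * 2 ))   # cap the backoff
done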

The most interesting parts, to me, are the mentions of actual bugs – one in EBS and one in RDS. Here are the descriptions.

There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror.

Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required.

These bugs represent an important lesson for distributed-system designers: bugs strike without regard for location. Careful sharding and replication across machines and even sites won’t protect you against a bug that exists in every instance of the code. A while back, when I was attending the UCB retreats because of OceanStore, the “Recovery Oriented Computing” folks were doing some very interesting work on correlated failures. I remember some great discussions about distributing a system not just across locations but across software types and versions as well. This lesson has stuck with me ever since. For example, in iwhd the extended replication-policy syntax was developed with a specific goal of allowing replication across different back-end types (e.g. S3, OpenStack) or operating systems as well as different locations. Maybe distributing across different software versions wouldn’t have helped in Amazon’s specific case if the bugs involved had been in there long enough, but it’s very easy to imagine a related scenario in which having different versions with different mirror-retry strategies in play (the same theory behind multiple hashes in Stochastic Fair Blue, BTW) might at least have avoided one factor contributing to the meltdown.

“Anonymous” is Bad For Anonymity

I’ll probably get in trouble for writing this, but somebody has to. Feeling full of themselves after the Wikileaks affair, Anonymous has started going after other worthy targets. The problem is, they’re doing it in a way that almost guarantees a bad outcome. For example, look at their letter to Westboro Baptist Church.

We, the collective super-consciousness known as ANONYMOUS

Might as well stop there. This introduction, plus the hyperbole and contorted sentence structure throughout, makes me think of nothing so much as James T. Kirk’s painfully melodramatic speeches in old Star Trek episodes. This is not the image you want to project when you’re fighting for a cause. For an even worse example, consider the antics at last week’s RSA conference.

loooooooooooooooooool
owned by anonymous. niiiice.

Again, stop there. Already we have text/internet shorthand, no caps, general swagger, etc. It looks like a child drunk on power, not an adult making a serious policy point. “In it 4 the lulz” indeed . . . and that’s the problem. I don’t object to Anonymous’s choice of targets here. Westboro Baptist definitely deserves some karmic payback, and evidence suggests the same of HBGary Federal. I don’t even object to their tactics, though some might. The problem is the kind of attention this will get them, and how that attention might turn into policy changes that adversely affect all of us. Anonymous clearly wields great power. Power can be used by heroes, and it can be used by bullies. The difference often lies in two things.

  • Identifying yourself. There is no way to tell who’s really Anonymous and who’s just some totally unrelated internet cretin using the name and cause as an excuse for random acts of vandalism. This is kind of ironic, since the real members of Anonymous are clearly experts in technologies such as secure anonymous publishing that would allow them to take or deny credit for any particular act without having to reveal their identities. Anonymous is really pseudonymous, not anonymous, and should take care to preserve the distinction.
  • Defining yourself. Real freedom fighters have identifiable goals and methods. We might not approve of either, but without any explanations (beyond generic “freedom of information” blather that could mean anything) or apparent limits nobody will see the nobility of the cause. Why is Anonymous more prominently taking on Westboro and HBGary, or even Visa and Mastercard, instead of Qaddafi? To extend that thought a little, how are their methods really distinct from Qaddafi’s? Without Anonymous taking a clear stand, “in it 4 the lulz” effectively becomes their credo.

I’m not suggesting that Anonymous should behave differently to satisfy my or anyone else’s comfort level. I’m suggesting they should do so for the sake of the very goals they (vaguely) claim to value. When people see a group with far more power than self-control, which fails to distinguish itself from any other band of bullies, then the Powers That Be will start to see bands of bullies on the internet as a Real Problem. Those who are already looking for any excuse to require ID before connecting to the internet, or to give security agencies more power to invade our privacy in the name of tracking down the Bad People, will be all over that. Policy makers aren’t listening to us, the people. They’re listening to the people with money – like HBGary Federal or worse – who stand to make even more money in such a world. They’re also listening to people like the RIAA/MPAA who would also dearly love a more controlled internet. A very likely outcome of all this is much less privacy and potential for anonymity on the internet. Thanks a lot, Anonymous.

November Tweets

For those (few) who follow me but not there, here are my top ten from November.

  • OH at museum: Now that we’ve appreciated all the diversity, can we please move on? (November 07)
  • If I have publicly and violently clashed with the founders, please pardon my raucous laughter when you try to recruit me. (November 09)
  • Using Inconsolata font for code editing. Quite nice. (November 11)
  • My take on the “your argument is invalid” meme, inspired by driftx on #cassandra. http://imgur.com/WtiFL (November 12)
  • Tea Party yard work: borrow a neighbor’s leaf blower, then blow all your leaves onto his yard. (November 14)
  • http://www.cs.virginia.edu/~weimer/ shows a *negative* correlation between some popular coding-style preferences and actual readability. (November 15)
  • If you work in distributed systems but haven’t read Saito and Shapiro then fix that. (November 16)
  • How many applications have you used today? How many are you personally willing to rewrite to use an “alternative data store”? (November 29)
  • I am the king of . . . just a minute. Where was I? Oh yeah, the king of . . . just a sec . . . multitasking. (November 29)
  • If hand-waving built muscle, I’d know some very buff architects. (November 30)

October Tweets and Links

A little early, but it’s been a good month.

  • There are only two problems with distributed counters. Or maybe three.
  • Problems with .ly domains? Less money for Libyan government, some URL shorteners die. And the downside is . . . ?
  • Self-driving cars, offshore wind farms, embryonic stem cell treatments – all on one glance at the news. What times we live in.
  • Assembly Instructions from Hell
  • Sauron to bid for Tea Party leadership. It would be an improvement.
  • Caching separate from the DB (any type) is not the enterprise version of a DB which is inherently distributed. It’s the buggy version.
  • The last frame is so funny because so many really do think such “answers” are useful. Not Invented Here
  • Matchstick Tirith
  • Simply awesome microphotography.
  • Lord of the Rings + Rudolph the Red Nosed Reindeer
  • Sears for zombies. (“Afterlife. Well spent.”)
  • Anybody who refers to an “internet cable” shouldn’t do distributed programming.
    [Just for you, PeteZ: the person was referring to a physical object, not just internet service from a cable company]
  • App devs shouldn’t create infrastructure . . . not because their time is better spent, but because they suck at it.

August Tweets

Some of my (and others’) favorites from August.

  • Based on a mispronunciation by my daughter, thinking of renaming my blog to CyberVinegar.
  • The programmer equivalent of age/sex/location is OS/language/editor.
  • Laws should affect human behavior, not router behavior.
  • Using virtualization on the desktop means that crappy desktop software can force you to reboot an entire cluster.
  • Never say “I’m game” at a hunt club.
  • “Wisdom of crowds” is a misnomer. Crowds aggregate *knowledge*; wisdom is usually excluded or distorted.

Don’t want to miss my next pithy comment? Here I am.

Spread of a Meme

I find it fascinating how links get distributed over time. Here’s an example involving the amazing pencil-tip sculptures by Dalton Ghetti, and the times I’ve been presented with the same link in Google Reader.

I predict that it will show up on at least one more website I follow. In maybe a week or so, I’ll see it in print media for the first time. Another week or two after that, I’ll see it in the Boston Globe. That’s the usual pattern, anyway.

Twitter Quality Ratio

I just thought of a new metric: followers per tweet. I’m at about 0.43, which is pretty middle of the road. I see some people who are flirting with the 0.1 mark. At the other end of the scale I see some who are at 3.0 or better. Not too surprisingly, the first group are disproportionately likely to end up on my “whale-jumpers” list which I check less often, and a similarly disproportionate number of my favorite tweeple seem to be in the second group. I could therefore expect to improve the “quality” of my own personal Twitter stream by checking this ratio for people I’m thinking of following . . . and you could do the same for your own personal stream as well.

BTW, if you want to help me pump up my own ratio, I’m @Obdurodon. ;)

Stones in the Pond

I’ve been on vacation for the last few days, and while I was (mostly) gone a few interesting things seem to have happened here on the blog. The first is that, after a totally unremarkable first week, my article It’s Faster Because It’s C suddenly had a huge surge in popularity. In a single day it has become my most popular post ever, more than 2x its nearest competitor, and it seems to have spawned a couple of interesting threads on Hacker News and Reddit as well. I’m rather amused that the “see, you can use Java for high-performance code” and the “see, you can’t…” camps seem about evenly matched. Some people seem to have missed the point in even more epic fashion, such as by posting totally useless results from trivial “tests” where process startup dominates the result and the C version predictably fares better, but overall the conversations have been interesting and enlightening. One particularly significant point several have made is that a program doesn’t have to be CPU-bound to benefit from being written in C, and that many memory-bound programs have that characteristic as well. I don’t think it changes my main point, because memory-bound programs were only one category where I claimed a switch to C wouldn’t be likely to help. Also, programs that store or cache enough data to be memory-bound will continue to store and cache lots of data in any language. They might hit the memory wall a bit later, but not enough to change the fundamental dynamics of balancing implementation vs. design or human cycles vs. machine cycles. Still, it’s a good point and if I were to write a second version of the article I’d probably change things a bit to reflect this observation.

(Side point about premature optimization: even though this article has been getting more traffic than most bloggers will ever see, my plain-vanilla WordPress installation on budget-oriented GlowHost seems to have handled it just fine. Clearly, any time spent hyper-optimizing the site would have been wasted.)

As gratifying as that traffic burst was, though, I was even more pleased to see that Dan Weinreb also posted his article about the CAP Theorem. This one was much less of a surprise, not only because he cites my own article on the same topic but also because we’d had a pretty lengthy email exchange about it. In fact, one part of that conversation – the observation that the C in ACID and the C in CAP are not the same – had already been repeated a few times and taken on a bit of a life of its own. I highly recommend that people go read Dan’s post, and encourage him to write more. The implications of CAP for system designers are subtle, impossible to grasp from reading only second-hand explanations – most emphatically including mine! – and every contribution to our collective understanding of it is valuable.

That brings us to what ties these two articles together – besides the obvious opportunity for me to brag about all the traffic and linkage I’m getting. (Hey, I admit that I’m proud of that.) The underlying theme is dialog. Had I kept my thoughts on these subjects to myself or discussed them only with my immediate circle of friends/colleagues, or had Dan done so, or had any of the re-posters and commenters anywhere, we all would have missed an opportunity to learn together. It’s the open-source approach to learning – noisy and messy and sometimes seriously counter-productive, to be sure, but ultimately leading to something better than the “old way” of limited communication in smaller circles. Everyone get out there and write about what interests you. You never know what the result might be, and that’s the best part.

(Dedication: to my mother, who did much to teach me about writing and even more about the importance of seeing oneself as a writer.)