I’ve been preparing to write the direct-to-S3 site backup script that I talked about before. It really shouldn’t take long – probably only an hour or two, maybe a bit more because I can’t do it uninterrupted – but when approaching any new technology I like to do a reasonably thorough survey of the tools, techniques, and issues first. That habit comes from having been burned too many times watching others jump into something too quickly, unaware that better approaches existed; by the time I came along it was too late, because too much code had already been written. Here are some observations for others who might choose to go down this path.

The limitation to a single level of naming hierarchy (bucket/name instead of directory/directory/…/name) is slightly annoying. More annoying still is the lack of a rename: if you want to change the name of a large file/object, you must delete it and re-upload it under the new name. Even if you fake a hierarchy by keeping the slashes in the paths of files you upload from multiple directories elsewhere, moving a file to a different “directory” is still a rename and has to be handled the same way. The only way around this is to give files permanent unique names and maintain some kind of directory yourself. That works OK for a backup type of application, except that it violates Jeremy Zawodny’s “can I easily find my backed-up files even if I’ve lost the program that put them there?” criterion, and it doesn’t work at all with Amazon’s virtual web-hosting facility. I can understand the lack of a rename in the REST API, but the lack of such an obviously important feature even in the SOAP API makes me wonder whether they made the mistake of basing an object’s location on its name, so that a rename implies relocating the object’s data. Whatever the reason, this is one of the biggest warts on S3.
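To make that concrete, here’s roughly what a “rename” amounts to – re-upload the data under the new name, then delete the old object. This is only a sketch, in Python with the boto3 library rather than anything I’d actually use for the PHP script, and the bucket, keys, and local path are all made up:

```python
import boto3

s3 = boto3.client("s3")

def rename_object(bucket, old_key, new_key, local_path):
    """'Rename' an object the only way S3 allows: re-upload it, then delete the old one."""
    with open(local_path, "rb") as f:
        s3.put_object(Bucket=bucket, Key=new_key, Body=f)  # full re-upload of the data
    s3.delete_object(Bucket=bucket, Key=old_key)            # then drop the old name

rename_object("my-backup-bucket",
              "site/2006-12/archive-old.tar.gz",
              "site/2006-12/archive-new.tar.gz",
              "/tmp/archive-new.tar.gz")
```

For a large object that’s a lot of wasted transfer and time just to change a name, which is exactly why the missing rename stings.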

Speaking of the virtual-hosting facility, it’s nice, but you do lose some functionality compared to what you’d have on a regular website. For example, the logging facilities are rudimentary and hard to access (more about this in a moment). You also lose whatever functionality you’re used to getting from .htaccess under Apache. I’d like to host my Australia and platypus pictures through Amazon, but I use my .htaccess to refuse links from web forums and such, because I don’t want to pay for extra bandwidth just so some teenager can post my wombat-poop picture to a thread or (worse) use one of my platypus pictures as an avatar. Just one such usage could easily generate thousands of hits and up to a gigabyte of consumed bandwidth, so losing the Referer filtering could actually cost me money.
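For the curious, the kind of rule I’m talking about looks something like this – not my exact configuration, and example.com stands in for my own domain – using Apache’s mod_rewrite to refuse image requests whose Referer is neither empty nor one of my own pages:

```apache
RewriteEngine On
# Allow requests that carry no Referer at all (direct visits, some proxies)...
RewriteCond %{HTTP_REFERER} !^$
# ...and requests referred from my own pages...
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# ...but refuse to serve images to everyone else (forum posts, avatars, etc.).
RewriteRule \.(gif|jpe?g|png)$ - [F]
```

There’s simply no place to hang anything like that off a bucket served through S3’s virtual hosting.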

What was that about “rudimentary and hard to access”? Part of that is the state of the tools available to monitor, maintain, and generally tinker with S3. I used NS3 Manager for a while (it’s how I’ve uploaded most of my pictures and videos so far) but it’s not very stable and it leaves annoying little “placeholder” files everywhere for no apparent reason. Jets3t Cockpit seems a lot better, but still doesn’t support all of the S3 functionality such as logging. Anyone working with S3 needs to get used to doing a lot of things the hard way, which fortunately is not all that hard but can be a bit tedious nonetheless.
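As an example of “the hard way”: since the GUI tools don’t cover logging, you end up making the API call yourself. Here’s a sketch – again Python and boto3 purely for illustration, with invented bucket names – that tells S3 to deposit access logs for one bucket into another:

```python
import boto3

s3 = boto3.client("s3")

# Sketch: enable server access logging on "my-site-bucket", writing log files
# into "my-log-bucket" under the "access-logs/" prefix. (The target bucket
# also has to be set up to accept log deliveries, which is its own chore.)
s3.put_bucket_logging(
    Bucket="my-site-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-log-bucket",
            "TargetPrefix": "access-logs/",
        }
    },
)
```

Even then the logs arrive as raw files in the target bucket that you have to fetch and parse yourself, which is a big part of what I mean by “rudimentary.”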

Be wary of web-based tools such as Cockpit Online, Openfount S3 Manager, or AWS Zone. I have no particular reason to distrust or cast aspersions on any of their authors, who are after all doing the community a service by providing these tools, but bear in mind that to use them you must provide your “secret” S3 key to a third party you don’t know. Caution is always called for in such cases.

If you’re using PHP, there really doesn’t seem to be much difference between neurofuzzy’s library and the semi-official Amazon library. One difference might be crucial, however: support for streaming uploads, instead of having to load an entire file into memory to transfer it. Many web hosts would not appreciate a script loading a gigabyte file into memory, and in fact it often won’t even work. Fortunately, Mission Data has published a patch to the Amazon library to do streaming transfers. That, combined with Christopher Shepherd’s S3 backup script, will probably form the basis for my own site-backup tool. The libraries available for Ruby actually seem a bit better, and using them might be a good way to learn Ruby, but Ruby is simply less ubiquitous on web hosts than PHP, and that would make my script a bit less generally useful. Maybe, if I have enough spare time, I’ll do both PHP and Ruby versions just for fun.
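To show why that one difference matters, here’s the contrast in sketch form – Python and boto3 once more standing in for the PHP libraries, with made-up bucket and file names:

```python
import boto3

s3 = boto3.client("s3")
backup = "/backups/site-backup.tar.gz"   # hypothetical local backup file

# Memory-hungry version: reads the entire file into RAM before sending it.
# For a gigabyte-sized backup this is exactly what a shared host won't tolerate.
with open(backup, "rb") as f:
    s3.put_object(Bucket="my-backup-bucket",
                  Key="site-backup.tar.gz",
                  Body=f.read())

# Streaming version: hand the library an open file handle and let it read and
# send the data in chunks (multipart for large files), so memory use stays low.
with open(backup, "rb") as f:
    s3.upload_fileobj(f, "my-backup-bucket", "site-backup.tar.gz")
```

The Mission Data patch gives the PHP library the second kind of behavior, which is what makes it plausible for backing up a whole site from a shared host.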