After reading Jeremy Zawodny’s analysis of backup-system costs, I decided to give Amazon’s Simple Storage Service (S3) another look. Shortly after that, he posted a list of resources that was also very helpful.

I immediately tried Carbonite and JungleDisk. (Yes, I know Carbonite might or might not be S3-based. Speculation abounds, but there doesn’t seem to be any hard evidence either way. In any case, it plays in the same space.) I found both intolerably slow, each using less than a tenth of the bandwidth I know I have either at work or at home. Carbonite is also a for-pay service (with a free trial), so it’s a lot less interesting. JungleDisk, on the other hand, is free for now, though the author clearly means to profit from it some day. It’s also a little careless about things like leaving your Amazon “secret” key lying around in plaintext on your PC, and I don’t appreciate that kind of thing.

I gave S3Drive a look, intrigued by the fact that it’s implemented as a true filesystem that works for all programs, rather than as an Explorer-limited “web folder” or namespace extension, or (most limiting of all) as a separate standalone program. Unfortunately, even the author admits that it’s slow and uses a lot of memory, and doesn’t recommend storing more than 5MB. Sorry, but that’s not even enough for testing. I’ll pass for now; maybe I’ll come back and check it out again in a few months.

Currently I’m giving S3 Backup a try. It does suffer from being a standalone program, but it transfers data much faster and it’s more respectful of users’ privacy. The pace of development also seems high, so there’s promise of it getting better.

The idea that really interests me, though, is using S3 to back up this website. Yeah, the one you’re looking at right now. You see, I have decent bandwidth at home and at work, but it’s still nowhere near what either my web host or Amazon has. Why suck all that data through the thin pipe to do a backup when there’s a fat pipe between where the data reside now and a secure backup location? What if I need to change hosts yet again? Why bounce everything through my home connection instead of through S3? I think it would be far better to use the thin pipe only for control, setting up a “third party transfer” directly between the website and S3. That product space seems a lot more thinly populated, so instead of looking for programs I’ve been looking for libraries that would let me write my own simple backup program. The language of interest here is PHP, for two reasons:

  • I sort of know PHP, and I don’t know Ruby (yet). I’m far more interested right now in getting things done and possibly learning something about S3 than in learning yet another language, thankyouverymuch.
  • A lot of other people who might find anything I produce useful are likely to have web hosts who support PHP but not Ruby.

Amazon has an example of using S3 from PHP, but it’s pretty basic. As far as I can tell, the canonical PHP interface to S3 is neurofuzzy’s library, so that’s probably where I’ll start. If nothing else, it should be a helpful guide to what the underlying S3 API looks like in real life and not just on paper.
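
Just to make the shape of the thing concrete, here’s a rough sketch of what I think a single authenticated PUT looks like using nothing but curl and PHP’s built-in hash functions. This is my own reading of the REST docs rather than anything lifted from Amazon’s sample or neurofuzzy’s code, and the bucket, key, and credentials below are obviously placeholders:

```php
<?php
// Rough sketch of a raw S3 PUT.  Bucket, key, and credentials are placeholders.
$accessKey = 'YOUR_ACCESS_KEY';
$secretKey = 'YOUR_SECRET_KEY';
$bucket    = 'my-backup-bucket';
$key       = 'backups/site.tar.gz';
$file      = '/tmp/site.tar.gz';

$data        = file_get_contents($file);   // fine for a sketch; don't do this for huge files
$contentType = 'application/x-gzip';
$date        = gmdate('D, d M Y H:i:s T');

// S3's REST authentication: sign "verb\nmd5\ntype\ndate\nresource" with HMAC-SHA1
// using your secret key, then send it in the Authorization header.
$stringToSign = "PUT\n\n{$contentType}\n{$date}\n/{$bucket}/{$key}";
$signature    = base64_encode(hash_hmac('sha1', $stringToSign, $secretKey, true));

$ch = curl_init("https://s3.amazonaws.com/{$bucket}/{$key}");
curl_setopt_array($ch, array(
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_POSTFIELDS     => $data,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array(
        "Date: {$date}",
        "Content-Type: {$contentType}",
        "Authorization: AWS {$accessKey}:{$signature}",
    ),
));
$response = curl_exec($ch);
$status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

echo ($status == 200) ? "uploaded {$key}\n" : "upload failed ({$status})\n";
```

Every request follows the same pattern: build the string-to-sign from the verb, headers, and resource, sign it with the secret key, and send the result along with the request. A library mostly just hides that bookkeeping.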

One last thought, included here just for the sake of having a place to put it. If it turns out that S3 is too heavily optimized toward storing large objects (i.e. that performance is limited by the number of object operations rather than the number of bytes) then it seems like a one-to-one mapping of user files to S3 objects might not be a very good idea. The question, then, is how to aggregate user files into larger S3 objects without having to rewrite an entire large object whenever one small file within it changes. One approach I’ve been toying with is to use something like a log-structured or atomic-update filesystem, with S3 objects representing (an infinite supply of) disk slices instead of files. As you write, you actually write into a new slice. When it’s full, or at other “strategic” times, it gets linked into the overall filesystem hierarchy to supplant earlier versions. The ratio between user actions and S3 actions can therefore be extremely high without sacrificing data integrity. I don’t know yet whether such an approach is really a good idea, but maybe it’s something other filesystem geeks would like to chew on.
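
For fellow filesystem geeks, here’s a very rough PHP sketch of the slice idea. The s3_put() helper, the 4MB threshold, and the serialized-array manifest are placeholders of mine rather than a worked-out design; the point is only the shape of it — many file writes per S3 operation, with the manifest written last so the update is effectively atomic:

```php
<?php
// Hypothetical sketch of the "slice" idea: small user files get appended to an
// in-memory slice, whole slices become S3 objects, and a manifest maps each
// file to (slice name, offset, length).  s3_put($key, $data) stands in for
// whatever library call actually stores an object.
class SliceWriter
{
    const SLICE_LIMIT = 4194304;      // flush a slice at ~4MB (arbitrary)

    private $slice = '';              // current slice contents
    private $seq = 0;                 // slice counter
    private $manifest = array();      // path => array(slice name, offset, length)

    private function sliceName()
    {
        return 'slice-' . $this->seq;
    }

    // Append one user file; many of these calls add up to a single S3 PUT.
    public function addFile($path)
    {
        $data = file_get_contents($path);
        $this->manifest[$path] = array($this->sliceName(), strlen($this->slice), strlen($data));
        $this->slice .= $data;
        if (strlen($this->slice) >= self::SLICE_LIMIT) {
            $this->flushSlice();
        }
    }

    private function flushSlice()
    {
        if ($this->slice === '') {
            return;
        }
        s3_put($this->sliceName(), $this->slice);   // one object per slice
        $this->slice = '';
        $this->seq++;
    }

    // A "strategic" point: flush the last slice, then write the manifest last,
    // so a reader never sees a manifest pointing at slices that don't exist yet.
    // That's the atomic-update part of the idea.
    public function commit()
    {
        $this->flushSlice();
        s3_put('manifest-' . gmdate('YmdHis'), serialize($this->manifest));
    }
}
```

Usage would just be a walk over the site’s files calling addFile() on each one, then commit() at the end. Restoring a single file would mean reading the newest manifest and then fetching the relevant portion of whichever slice holds it.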