What Is It?

OK, who wants to guess what this is?

I know it’s a poor picture – it’s hard to keep my hand steady enough while I’m holding something up to the light like that – but that shouldn’t stop anyone from hazarding a guess.

Whole Files vs. Blocks

Brandon Wiley writes:

Replicate only whole files. This will dramatically increase the probability of the successful retrieval of a file.

I’ve gone over this before, and I’m amazed that otherwise-intelligent people are still clinging to that misconception. One thing that’s wrong with it is that it assumes only whole files are useful. That might be true for an MP3, but it most certainly is not for a database file. Breaking a file up and storing the pieces in different locations drastically increases the probability of retrieving at least part of it even after a failure, and in the majority of cases that’s a good thing.

The other problem with Brandon’s statement is that it’s just not true even for retrieving the whole file, if replication is taken into account – as it should be. The mathematical magic behind erasure codes and their ilk is based on the relationships between blocks; they really don’t work when there’s only one block (the whole file) or the block size is extremely large or variable. While your one file is being replicated in its entirety to make sure that two node failures are necessary for data loss to occur, someone else might be using erasure codes and the same amount of storage to make sure that half a dozen failures would be necessary in their system. Alternatively, they could reduce the storage overhead by 30% and have exactly the same level of data protection that you do.
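To put rough numbers behind that comparison, here's a minimal Python sketch. It assumes an idealized (n, k) erasure code (any k of the n stored blocks are enough to rebuild the file) and independent node failures with the same probability everywhere; the specific codes and the 5% failure rate are illustrative picks of mine, not numbers from any real system.

# A rough sketch of the storage/durability tradeoff described above. Assumes an
# idealized (n, k) erasure code (any k of the n stored blocks rebuild the file)
# and independent node failures; the parameters below are illustrative only.
from math import comb

def loss_probability(n: int, k: int, p: float) -> float:
    """Probability that more than n - k nodes fail, i.e. the file is lost."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n - k + 1, n + 1))

def describe(name: str, n: int, k: int, p: float = 0.05) -> None:
    overhead = n / k          # total storage relative to the original file size
    tolerated = n - k         # node failures survivable with no data loss
    print(f"{name:24s} overhead={overhead:.2f}x  "
          f"tolerates {tolerated} failure(s)  P(loss)={loss_probability(n, k, p):.1e}")

# Whole-file replication with two copies is the degenerate case n=2, k=1.
describe("2x whole-file replicas", n=2, k=1)
# The same 2x storage spent on a 6-of-12 code survives six failures, not one.
describe("(12, 6) erasure code", n=12, k=6)
# Or spend noticeably less storage and still beat plain 2x replication.
describe("(10, 7) erasure code", n=10, k=7)

Run with these toy numbers, the 6-of-12 code loses data orders of magnitude less often than two whole-file replicas, and the (10, 7) code still beats them while using roughly 30% less storage.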

Lastly, breaking a file up into pieces increases the effectiveness of “swarming”; instead of downloading from two nodes at once you could be downloading from a hundred. There’s a reason that so many people design systems to distribute files in blocks rather than whole files, and it’s not because they’re the ones who haven’t thought things through.
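And a back-of-the-envelope version of the swarming point, with a file size and bandwidth numbers I made up purely to show the shape of the effect:

# Toy swarming arithmetic; the file size and bandwidth figures are invented.
# The point is just that download time shrinks with the number of sources,
# until your own downlink becomes the bottleneck.
FILE_MB = 700                 # size of the file being fetched, in megabytes
PER_SOURCE_MBPS = 0.5         # upload bandwidth each source gives you, Mbit/s
MY_DOWNLINK_MBPS = 20         # your own connection's ceiling, Mbit/s

def download_minutes(sources: int) -> float:
    rate = min(sources * PER_SOURCE_MBPS, MY_DOWNLINK_MBPS)
    return (FILE_MB * 8) / rate / 60

for sources in (2, 10, 100):
    print(f"{sources:3d} sources: {download_minutes(sources):6.1f} minutes")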

Update November 11, 9:35am: Zooko has posted a response to this, but totally ignored my point about replication and assumed the opposite of my point about the value of partial files.

Pencil in October 15-18 2003

Colossal Colon Tour. Really. It’s a colon-cancer education thing, and it actually does look kinda fun. I can’t wait; I haven’t seen an expletive deleted this big since I left EMC.

Clean Power (sort of)

Maybe other people know about them already, but I haven’t seen them mentioned here yet: Splashpower claims to’ve developed a device that can charge up any number of electronic gadgets just by placing them on a plain-looking pad, instead of having a zillion separate AC adapters and cords all over the place. It works via inductive coupling; the per-device “SplashModule” is claimed to be “sub-millimeter thin” and “customized to the shape, size and power requirements of the device”. Needing SplashModules at all makes this less than perfect, but IMO it’d still be worth it if I could replace all of those power warts.

This was submitted as a story on Slashdot. They rejected it, and then posted a story on the colors of LEDs in optical mice instead. Need I say more?

Meta-intelligence

This idea has been bubbling around in my head since about lunch-time on Saturday, so now is probably as good a time as any to write it down. As I sat there munching on a double-decker taco, I started thinking about…well, thinking about thinking, and why some people seem to be better at it than others. In one sense thinking is the process of storing, retrieving, and rearranging facts, and can be likened to the operation of a warehouse. (Yes, I know there are aspects of the thought process that don’t seem to fit that metaphor, but it’ll do for current purposes.) This warehouse has a bunch of forklifts driving around picking up boxes and putting them down somewhere else. If you want the warehouse to handle more traffic, you have a few options:

  • Buy more forklifts (and operators, of course).
  • Buy faster forklifts.
  • Manage box placement and forklift scheduling better.

Note that making the warehouse bigger doesn’t really help. You still need to move the boxes, and making the warehouse bigger might only mean you have to move them further; this is about moving the boxes around, not how many you have just sitting there. The first two improvements above have a pretty straightforward relevance to differences in intelligence: some people just plain think faster, and some multitask better, and some do both. The third improvement, though, is a little less obvious. It’s sort of like being smart about being smart; you know you only have so many forklifts and they only move so fast, so how do you utilize them most effectively? Within the warehouse metaphor, there are several strategies:

  • If a box is often used in a particular place, keep it near that place (caching).
  • If two boxes are often used together, keep them together (affinity).
  • If a particular area keeps getting too crowded, either move some of its contents elsewhere (partitioning) or make the lanes in that area wider so more forklifts can get through (provisioning).
  • Arrange pickups and dropoffs to minimize total distance the forklifts need to go (scheduling).

The terms in parentheses are those that computer/network folks apply to analogous tools or techniques that they use to build systems. What’s interesting here, though, is not the commonality between warehouses and computer systems; it’s the commonality between both and human thought processes. Just as some warehouses and some computer systems operate more efficiently – using the same number and speed of components – some people also manage their thinking more efficiently. For example:

  • Caching: deliberately forget, or put out of your mind, irrelevant facts/ideas so they don’t crowd out stuff that’s more important.
  • Affinity: use “mnemonic hooks” to connect ideas that often go together.
  • Scheduling: keep an eye out for shortcuts, don’t go around in circles, learn to recognize dead ends, etc.

Another comparison that comes to mind here is chess. Merely thinking fast helps in chess, but multitasking hardly helps at all, and the real mark of a good player is not so much how fast they think as how well they organize and prioritize the lines of play they examine. Managing time so that the best lines get the first and deepest exploration is what makes the difference between a regular player and a master.

What is perhaps most important is that these mental techniques can be learned and practiced. Some people know this stuff intuitively, but there’s also an established body of knowledge about “teaching people how to think”. That’s what our colleges and universities supposedly do, though they rarely address it as directly as I’ve tried to do here. Usually it’s more of a skill you learn on your own as you write papers and take tests in other subjects, and either survive if you learn it well or leave if you don’t. Maybe it should be taught more directly in the first semester or two, and not just in the test-prep courses that rich kids take in high school.

Election Day

It should surprise nobody that I went to vote today. I might be the only person who voted for four different parties – Democrat, Republican, Green and even Libertarian. You can’t get much more non-partisan than that. ;-)

The coolest thing actually happened before I even entered the polling place, and had nothing to do with voting. A red-tailed hawk flew right over me, no more than ten feet away, in the parking lot, and then perched in a tree about thirty feet away. We spent a good five minutes watching each other before s/he flew off – in one case to catch a rodent, in the other to vote for some.

Bad Days

There was a flipped SUV on my way to work today, and it looks like it took out a power/phone/cable line as well. There were already other people stopped to help, so there wasn’t anything for me to do, but I hope everyone’s OK.

Caching

I’m not a computer architect. Yeah, my business card says “software architect” but that’s either a meaningless phrase or something completely different. Nonetheless, Hennessy and Patterson’s Computer Architecture: A Quantitative Approach (I read it before Goldberg became the third co-author) has had a greater effect on my professional life than any other book. The reason all comes down to one formula – one which seems so obvious once it’s stated, and yet I find myself quoting it constantly because a lot of people Just Don’t Get It. Basically it goes like this:

Value = (Prob * Benefit) – ((1 – Prob) * Cost)

Value = value of an optimization
Prob = probability (zero to one) that the optimization will apply in any specific case
Benefit = benefit when the optimization does apply
Cost = cost when the optimization does not apply

As an example, consider a processor cache, and the following parameters:

  • An uncached memory access takes 5 cycles
  • A cached access takes a single cycle (Benefit = 4)
  • Cache lookups cost a single cycle even for uncached accesses (Cost = 1)
  • 90% of accesses are served from the cache (Prob = 0.9)

The result is ((0.9*4)-(0.1*1)), or an average savings of 3.5 cycles per memory access. That’s pretty darn good; memory accesses with the cache will be on average more than three times as fast as they would be without (1.5 cycles on average instead of 5). Clearly, a cache with these characteristics would be a Good Thing To Have…and BTW that would generally be considered a lousy cache hit rate.
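Here's the same arithmetic as a small Python sketch, using only the numbers from the bullet list above:

# The value-of-an-optimization formula, with the processor-cache numbers above.
def optimization_value(prob: float, benefit: float, cost: float) -> float:
    """Expected gain per operation: (Prob * Benefit) - ((1 - Prob) * Cost)."""
    return prob * benefit - (1 - prob) * cost

UNCACHED_CYCLES = 5      # plain memory access
HIT_CYCLES = 1           # access served from the cache
LOOKUP_CYCLES = 1        # lookup overhead paid even on a miss
HIT_RATE = 0.9

benefit = UNCACHED_CYCLES - HIT_CYCLES        # 4 cycles saved on a hit
cost = LOOKUP_CYCLES                          # 1 extra cycle wasted on a miss
savings = optimization_value(HIT_RATE, benefit, cost)
avg_with_cache = UNCACHED_CYCLES - savings    # 1.5 cycles on average

print(f"average savings: {savings:.1f} cycles per access")    # 3.5
print(f"speedup: {UNCACHED_CYCLES / avg_with_cache:.2f}x")     # ~3.33x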

What brought this to mind recently was a discussion on IRC, where I found myself defending Freenet’s caching of data. Yes, that was me defending Freenet; take a breath or two and let the shock subside. ;-) Let’s take a look at a couple of the Freenet numbers that would go into the formula above:

  • Latency for a request served from nearby: 2-20ms
  • Latency for a request that goes to the other side of the world: 80-200ms
  • Average benefit: 129ms (the midpoint of the remote range minus the midpoint of the nearby range)
  • Penalty from routing through other nodes that don’t have the data you want in cache: what the heck, let’s say 100ms

Applying the formula backwards, we can see that Freenet’s caching is a net win any time the cache hit rate is greater than about 45%, even if we’re very pessimistic about the penalty; the breakeven arithmetic is sketched after the notes below. Here are a few secondary observations:

  • The penalty is almost entirely an implementation artifact, and doesn’t reflect at all on Freenet’s basic design. There’s no reason the penalty couldn’t be reduced to a couple of milliseconds per node with a decent implementation, and then my case would be even more compelling.
  • I’m actually leaving out a lot of complexity because in Freenet there are lots of caches at different “distances” from the requester, but I don’t think that changes the basic arguments or conclusions.
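Here's the breakeven calculation mentioned above, as a quick sketch; the latencies are the rough figures from the list, and the 100ms penalty is the same deliberately pessimistic guess:

# Solve Value = Prob*Benefit - (1 - Prob)*Cost = 0 for Prob, using the rough
# Freenet latency figures above. The 100ms penalty is the deliberately
# pessimistic guess from the list, not a measured number.
NEARBY_MS = (2 + 20) / 2          # ~11ms for a request served from nearby
REMOTE_MS = (80 + 200) / 2        # ~140ms for a request that crosses the world
BENEFIT_MS = REMOTE_MS - NEARBY_MS    # ~129ms saved on a cache hit
PENALTY_MS = 100                      # extra routing cost on a miss

breakeven = PENALTY_MS / (BENEFIT_MS + PENALTY_MS)
print(f"caching wins above a {breakeven:.0%} hit rate")   # ~44%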

So, the real question becomes: can we get a 45% cache hit rate? That depends a lot on who’s accessing what data, and who’s providing how much cache space, and other factors, but I don’t think it’s at all unreasonable to expect 45% overall. A lot of content has sufficient “universal appeal” that it will quickly get cached almost everywhere, and those copies will satisfy a pretty high percentage of user requests. A lot of people are willing to “donate” fairly large amounts of space for caching. Obviously, if you deliberately set out to request obscure/unpopular content you’re going to see only the penalty and not the benefit, but that’s measuring the uncommon case and is pretty worthless as a guide to overall system design. With a slightly better implementation, Freenet with caching will likely provide average access times a whole order of magnitude better than a similar system without caching. This is especially true because caches not only reduce access time but they also spread the load around, and access times tend to increase exponentially on an overloaded node. Proponents of similar projects such as GNUnet should carefully consider effects such as these, and other knowledge about real large-scale systems, before they conclude based on a smaller-scale system that caching is not beneficial.