Background
The basic idea rests on the observation (or hope) that the inaccuracy in the official ratings is basically random, as opposed to being systematic or predictable. In fact, I know the inaccuracy on Yahoo is not totally random (we'll get to ICC later). Most of the long-time players there eventually adopt opponent-selection strategies that, whether or not that is their conscious intent, take advantage of the rating system's flaws to boost their ratings by 50-200 points relative to everyone else. The prevalence of cheaters also eventually affects everyone's rating. The cheaters themselves get higher ratings at the expense of the people they play, anyone who plays either a cheater or one of a cheater's victims gets "infected" by the inaccuracy, and so on in an ever-expanding ripple effect.
All in all, though, the cheaters' effect on ratings becomes less predictable as you move further away from the cheaters themselves, eventually making things even more random than they were before, and the old-timer effect is small enough that the overall picture is still one of random rather than systematic inaccuracy. The proof is in the pudding, of course, and I'll discuss that some more when I get to talking about results.
How It Works
How do we use a larger sample to get a more accurate rating for a single individual when the number of games that individual has played doesn't change? The answer lies in recursion. Let's say we want a rating for X. X's official rating is, in effect, based on how they did in their last 20 or so games, which is to say that N=20. What we do is look at who X played in those games and how X did against them. If X played 20 people and each of their ratings is based on N=20, then our N=20x20=400. If we want an even more accurate rating, we go one level deeper: we calculate each of X's opponents' ratings the same way we calculated X's in the previous case, then combine the results to get a total N=20x20x20=8000. In practice, we assign weights to the ratings at each level to account properly for people who've played fewer than 20 games, so the actual value of N for a "level 2" rating (two levels of recursion; level 0 is the official rating) is usually more like 6000.
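The recursion can be sketched in a few lines of Python. This is only an illustration, not the script itself: the names (effective_n, derived_rating, the official and games dictionaries) are made up for the example, the per-level weighting for players with short histories is left out, and the step from "opponents' average plus results" to a rating uses the common linear performance-rating approximation (average opponent rating plus 400 x (wins - losses) / games), which is an assumption and may not match the formula the script actually uses.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: player -> list of (opponent, score),
# where score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
Games = Dict[str, List[Tuple[str, float]]]


def effective_n(level: int, games_per_player: int = 20) -> int:
    """Upper bound on the games feeding a rating at a given level.

    Level 0 is the official rating (N=20); each extra level multiplies
    N by the number of games per player. Weighting is ignored here.
    """
    return games_per_player ** (level + 1)


def derived_rating(player: str, level: int,
                   official: Dict[str, float], games: Games) -> float:
    """Re-derive a player's rating from recursively derived opponent ratings.

    ASSUMPTION: uses the linear performance-rating approximation,
    not necessarily the formula used by the real script.
    """
    history = games.get(player, [])
    if level == 0 or not history:
        # Base case: fall back to the official rating.
        return official[player]
    n = len(history)
    # Average of the opponents' ratings, each derived one level shallower.
    opp_avg = sum(derived_rating(opp, level - 1, official, games)
                  for opp, _ in history) / n
    score = sum(s for _, s in history)
    # wins - losses == 2 * score - n when draws count as half a point.
    return opp_avg + 400.0 * (2.0 * score - n) / n


if __name__ == "__main__":
    for level in range(3):
        print(level, effective_n(level))  # 20, 400, 8000
```

Note that the naive recursion revisits the same players many times at deeper levels; caching results per (player, level) would be the obvious refinement.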
The way this cancels out the "noise" to get an accurate rating is fairly simple. Let's say that X played A, B, C, etc. all the way through T. Maybe A's rating is 50 points higher than it should be. Fine. B's rating might also be 50 points high, or it might be 50 points low, or 200 points low. Because the inaccuracy is basically random, the chance that all 20 of A through T will be overrated is very low (about one in a million, if each is equally likely to be overrated or underrated). It's much more likely that roughly half will be overrated and half will be underrated, so the errors largely cancel: the overall average of X's opponents will be far more accurate than any single rating, and therefore so will X's derived rating based on results against that average.
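Both halves of that argument are easy to check numerically. The sketch below assumes that each opponent's rating error is independent, symmetric around zero, and (for the simulation only) normally distributed with a standard deviation of 100 points. Those numbers are hypothetical, chosen just to show the effect; the function names are likewise made up for the example.

```python
import random
import statistics


def prob_all_overrated(n_opponents: int = 20) -> float:
    """Chance that every opponent is overrated, if each one is
    independently overrated with probability 1/2."""
    return 0.5 ** n_opponents


def spread_of_average_error(n_opponents: int = 20, error_sd: float = 100.0,
                            trials: int = 5000, seed: int = 0) -> float:
    """Simulate the error in the *average* opponent rating.

    ASSUMPTION: individual errors are independent draws from a normal
    distribution with mean 0 and standard deviation error_sd.
    Returns the standard deviation of the averaged error over many trials.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        errors = [rng.gauss(0.0, error_sd) for _ in range(n_opponents)]
        averages.append(statistics.fmean(errors))
    return statistics.pstdev(averages)


if __name__ == "__main__":
    print(1 / prob_all_overrated())   # 1048576.0, i.e. about one in a million
    print(spread_of_average_error())  # close to 100 / sqrt(20), roughly 22
```

Under these assumptions, averaging over 20 opponents shrinks the typical error by a factor of about the square root of 20, or roughly 4.5. If the errors were systematic rather than random, they would not cancel this way, which is why the randomness assumption matters so much.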
Results
How well does it work? Well, it turns out that the script's "level 2" ratings have significantly higher predictive value than Yahoo's own ratings, even though the former are based on the latter. Too bad I can't bet on Yahoo chess games. ;-) I haven't exactly done a rigorous statistical analysis, but I have spot-checked random players' records, and predictions based on level 2 ratings are about 15-20 percentage points more accurate than those based on official ratings. In other words, in a sample of 20 games, Yahoo ratings will typically predict about 12-14 of the games correctly (60-70%) while level 2 ratings will predict about 16-17 (80-85%).
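For anyone who wants to repeat this kind of spot check, here is a minimal sketch of one way to score it. The scoring rule is an assumption, since nothing above says exactly how a "correct" prediction was counted: here the higher-rated player is predicted to win, and draws and games between equally rated players are skipped. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def prediction_accuracy(games: List[Tuple[str, str, float]],
                        ratings: Dict[str, float]) -> float:
    """Fraction of decisive games won by the higher-rated player.

    Each game is (player_a, player_b, score_for_a) with score 1.0, 0.5 or 0.0.
    ASSUMPTION: draws and equal-rating pairings are skipped, not counted.
    """
    correct = 0
    counted = 0
    for player_a, player_b, score_a in games:
        if score_a == 0.5 or ratings[player_a] == ratings[player_b]:
            continue  # no clear prediction or no clear result
        counted += 1
        predicted_a_wins = ratings[player_a] > ratings[player_b]
        if predicted_a_wins == (score_a == 1.0):
            correct += 1
    return correct / counted if counted else 0.0


if __name__ == "__main__":
    # Made-up record for one player, scored against two sets of ratings.
    record = [("X", "A", 1.0), ("X", "B", 0.0), ("X", "C", 1.0), ("X", "D", 0.5)]
    official = {"X": 1500, "A": 1400, "B": 1450, "C": 1600, "D": 1500}
    level2 = {"X": 1550, "A": 1400, "B": 1600, "C": 1600, "D": 1500}
    print(prediction_accuracy(record, official))
    print(prediction_accuracy(record, level2))
```

Running the same record through both rating sets and comparing the two fractions gives the kind of comparison described above.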
On ICC the story is different. (Yes, this works for ICC as well; in fact, the ICC version is considerably more sophisticated at this point.) It turns out that ICC ratings are pretty accurate to begin with; I rarely see more than a 50-point difference between their ratings and mine. In terms of predictive value, it's a wash. My ratings seem to do a little better, but the difference is less than 5%, and for a small random sample that's not very significant.
Change Log