The Elo system was originally developed for ranking chess players, but has since been used to rank performance in many other contexts as well. According to the Wikipedia article, this even includes the Bowl Championship Series in college football, which was a surprise to me. The article also mentions several common criticisms of Elo, some of which are addressed by existing alternatives, and more recently there has been a competition to improve the state of the art still further (warning: site is being Slashdotted as I write this).
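For reference, since the sketches further down build on it: the standard Elo prediction and update fit in a few lines. Here it is in Python (the K-factor of 32 is just a common convention; real servers vary it with a player’s rating and number of games played):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score for player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """A's new rating after scoring score_a (1 win, 0.5 draw, 0 loss) vs B."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))
```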

I should probably enter, since I’ve done a lot of thinking about this exact problem over the years, but I’m oversubscribed already. Before this site even became a blog (which BTW was a little over ten years ago) I implemented a Yahoo/ICC rating script based on the principle of examining not just a person’s opponents but also their opponents’ opponents, and so on to whatever depth one wants; the first sketch below shows the idea. It often found non-trivial anomalies in the ratings of players who were getting better or worse, or more often of those who had lately changed their preferred time limits or criteria for selecting opponents. I used to run it to cut through the ratings manipulation then (and probably still) rampant on both sites, to find opponents who were truly at my own level for a satisfying game.

Another idea I’ve developed since then is to represent a person’s rating as not just a strength but a strength plus an internally calculated “style” measured along one or more axes; see the second sketch below. This has nothing to do with actual playing style, but it can account for the often persistent and reproducible anomaly where one player just seems to beat another more often than their ongoing records against other opponents would ever indicate. For example, as an endgame-oriented Caro-Slav type of player, I can genuinely expect to win more than I lose against easily-frustrated tacticians even if they’re rated 100-200 points above me. There are other players rated below me against whom I’ll fare poorly, but I’m not about to tell you those secrets.

A third idea, which I haven’t explored as much, is to use a “coupled oscillator” approach similar to that used in time-synchronization protocols, to account for the fact that an opponent’s rating was probably changing even as you played them, and that the resulting inaccuracy can be detected and corrected in retrospect. In other words, subsequent analysis could show that a player you thought was at 1500 when you played them was probably closer to 1600, but their rating hadn’t caught up yet; you could then account for that when using the result to modify your own rating. The process could even be repeated until the results stabilized, as in the third sketch below, but most of the bang for the buck would probably come in the first couple of iterations.
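The core principle of that script is easy to sketch, though. Here’s a minimal reconstruction, not the original code: it assumes a flat list of (player, opponent, score) tuples plus the elo_expected helper above, and all of the names are invented for illustration.

```python
from collections import defaultdict, deque

def rating_anomalies(games, ratings, start, depth=2, threshold=0.15):
    """Flag suspicious ratings in the opponent network around `start`.

    games    -- iterable of (player_a, player_b, score_for_a) tuples
    ratings  -- dict of the current published rating for every player
    depth    -- how many hops of opponents-of-opponents to examine
    Returns {player: average (actual - expected) score} for flagged players.
    """
    opponents = defaultdict(set)
    results = defaultdict(list)            # player -> [(opponent, score)]
    for a, b, score_a in games:
        opponents[a].add(b)
        opponents[b].add(a)
        results[a].append((b, score_a))
        results[b].append((a, 1.0 - score_a))

    # Breadth-first walk out to `depth` hops from the starting player.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        player, d = frontier.popleft()
        if d < depth:
            for opp in opponents[player]:
                if opp not in seen:
                    seen.add(opp)
                    frontier.append((opp, d + 1))

    # A large average surplus means the player is probably underrated
    # (positive) or overrated (negative) relative to their actual results.
    flagged = {}
    for player in seen:
        if results[player]:
            surplus = sum(s - elo_expected(ratings[player], ratings[opp])
                          for opp, s in results[player]) / len(results[player])
            if abs(surplus) > threshold:
                flagged[player] = surplus
    return flagged
```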
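One way to make the style idea concrete, and it’s only one way, is to give each player a small style vector and feed an antisymmetric interaction between the two vectors into the usual Elo expectation as a matchup bonus. The 2-D cross product below is the simplest such interaction; the function names and learning rates are invented for illustration.

```python
def style_expected(r_a, r_b, s_a, s_b):
    """Expected score for A, with a matchup bonus from 2-D style vectors.

    The cross product is antisymmetric: whatever it adds to A's effective
    rating against B, it subtracts in the reverse pairing. That is exactly
    the rock-paper-scissors structure that plain Elo cannot express.
    """
    matchup = s_a[0] * s_b[1] - s_a[1] * s_b[0]
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a - matchup) / 400.0))

def style_update(r_a, r_b, s_a, s_b, score_a, k=32.0, k_style=4.0):
    """One Elo-like gradient step on both the ratings and the style vectors."""
    err = score_a - style_expected(r_a, r_b, s_a, s_b)
    # Gradients of the matchup term with respect to each style vector.
    grad_a = (s_b[1], -s_b[0])
    grad_b = (-s_a[1], s_a[0])
    new_s_a = (s_a[0] + k_style * err * grad_a[0],
               s_a[1] + k_style * err * grad_a[1])
    new_s_b = (s_b[0] + k_style * err * grad_b[0],
               s_b[1] + k_style * err * grad_b[1])
    return r_a + k * err, r_b - k * err, new_s_a, new_s_b
```

With k_style set to zero this reduces to plain Elo, so the style axes can only pick up whatever signal the ordinary ratings leave unexplained.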
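And a minimal sketch of the retrospective correction, again with invented names: replay the whole game history several times, scoring each game against the opponent’s rating from the previous full pass rather than the possibly stale value they carried at the time, until the numbers stop moving. It reuses elo_expected from the first sketch, and a real version would presumably interpolate the opponent’s rating at the date of each game instead of using a single end-of-pass value.

```python
def rerate(games, initial=1500.0, k=24.0, passes=3):
    """Iteratively re-rate a chronological game history.

    games -- chronological list of (player_a, player_b, score_for_a) tuples
    After the first pass, each game is scored against the opponent's rating
    from the previous pass: a retrospective estimate of what that opponent
    was really worth, even if their published rating hadn't caught up yet.
    """
    players = {p for a, b, _ in games for p in (a, b)}
    prev = {p: initial for p in players}
    for _ in range(passes):
        cur = {p: initial for p in players}
        for a, b, score_a in games:
            # My own rating evolves within the pass; the opponent is
            # credited at their retrospective (previous-pass) strength.
            exp_a = elo_expected(cur[a], prev[b])
            exp_b = elo_expected(cur[b], prev[a])
            cur[a] += k * (score_a - exp_a)
            cur[b] += k * ((1.0 - score_a) - exp_b)
        prev = cur                # feed this pass's results into the next
    return prev
```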

All of these approaches can be combined, of course, and I’m pretty sure that with sufficient tweaking of code I could come up with something that would outperform Elo on the competition’s sample. They’re all a bit out of line with the competition’s goal of using and maintaining roughly the same amount of data as Elo, though. On the other hand, the organizers seem to be allowing Glicko, so there’s clearly room for a couple more values per player. Perhaps some simplified version of the last idea would work. In any case, it’ll be interesting to see what comes out of this.