I’m not entirely sure what got me going, but I’ve been thinking a lot lately about the challenge of distinguishing text written by one person from text written by someone else. It appears to be a much-studied research problem, with a lot of the research oriented toward detecting plagiarism, but there is a surprising lack of available tools. I suspect a lot of people treat this as a competition, with reputations and quite possibly money on the line, and nobody wants to reveal their “secret sauce”.

In the classic “scratch an itch” tradition, I decided to whip up a text-fingerprinting program of my own. The approach roughly falls into the category of “stylometry”, but it’s pretty naive so far. I’m not going for the whole natural-language-processing thing. So far the program has absolutely no knowledge of words beyond the fact that they are sequences of alphabetic characters. All it does is look at things like the number of letters per word and words per sentence, occurrence rates of certain kinds of punctuation, etc. to generate a “fingerprint” (currently only seven numbers). As brain-dead as the program is, though, the fingerprints it creates already seem pretty accurate at distinguishing my own text from other people’s – based on a sample of only a few kilobytes. I have lots of ideas about other “low-hanging fruit” that I can pick, including some very rudimentary aspects of word usage and things like emphasis or quoting styles, and I think I can make it even better. Maybe it won’t be good enough for a doctoral thesis, but it might actually be useful instead.
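
For illustration only, here is a minimal sketch of what a fingerprint along these lines might look like in Python. The specific features, their names, and the `distance` function are all assumptions made up for this example. They are not the actual seven numbers, nor the comparison method, used by the program described above.

```python
import re


def fingerprint(text: str) -> dict:
    """Compute a small, naive stylistic fingerprint of `text`.

    The feature set is hypothetical; it only mirrors the kinds of
    measurements mentioned in the post (letters per word, words per
    sentence, punctuation rates).
    """
    # A "word" is just a run of alphabetic characters, nothing smarter.
    # (So "don't" counts as two words: "don" and "t".)
    words = re.findall(r"[A-Za-z]+", text)
    # A "sentence" is any chunk between ., ! or ? that contains a letter.
    sentences = [s for s in re.split(r"[.!?]+", text)
                 if re.search(r"[A-Za-z]", s)]
    n_words = max(len(words), 1)          # avoid division by zero
    n_sentences = max(len(sentences), 1)
    return {
        "letters_per_word": sum(len(w) for w in words) / n_words,
        "words_per_sentence": len(words) / n_sentences,
        "commas_per_word": text.count(",") / n_words,
        "semicolons_per_word": text.count(";") / n_words,
        "parens_per_word": text.count("(") / n_words,
    }


def distance(a: dict, b: dict) -> float:
    """Sum of relative differences per feature (an assumed metric).

    Relative differences keep large-valued features such as
    words_per_sentence from swamping the small punctuation rates.
    """
    total = 0.0
    for key in a:
        denom = abs(a[key]) + abs(b[key])
        if denom > 0:
            total += abs(a[key] - b[key]) / denom
    return total


if __name__ == "__main__":
    mine = fingerprint("Short words here. Many small bits, too; see?")
    theirs = fingerprint("Extraordinarily elaborate constructions predominate throughout.")
    print(mine)
    print(distance(mine, theirs))
```

A smaller distance would suggest two samples are stylistically closer, though with features this crude it would take a few kilobytes of text per sample before the numbers settle down enough to mean anything.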

No, I’m not releasing code yet. :-P I plan to, but first I want to implement some of these other ideas, make the interface a little nicer, and clean up and comment the code. If anyone’s interested, ping me in a couple of weeks.