As part of my backup/synchronization project, I put some thought into which hash algorithm to use, and in fact into whether to rely so entirely on hashing at all. Yes, I have read Val Henson’s Analysis of Compare-by-hash and disagree somewhat with her conclusions. I think it’s entirely reasonable to compare the probability of a hash collision against the probability of errors arising from other causes. It doesn’t much matter whether media degradation is silent or deterministic; I’m more likely to lose two or even three copies of my data, in separate locations, to media failure than to a hash collision. As long as that’s the case, expending extra effort reducing a risk that’s already over a dozen orders of magnitude lower is basically a waste of time. Henson’s hand-wavy “keep more state” alternative doesn’t just make code a tiny bit more complex; it can make the code much more complex, introduce new logistical problems such as user registration or history pruning, or degrade performance to unacceptable levels scanning through change lists. These can all basically defeat the purpose of a program that’s intended to be low-overhead in terms of both setup complexity and run time.
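To put a rough number on that “dozen orders of magnitude” intuition, here’s a back-of-the-envelope sketch using the standard birthday bound. The one billion files and the media error rate are purely illustrative assumptions on my part, not measurements; uncorrectable bit error rates around 10⁻¹⁴ to 10⁻¹⁵ per bit read are the sort of figures drive vendors commonly quote.

```python
import math

def birthday_collision_prob(n_items: int, hash_bits: int) -> float:
    """Approximate probability of any collision among n_items random
    values drawn from a 2**hash_bits space (the birthday bound)."""
    space = 2.0 ** hash_bits
    # 1 - exp(-n(n-1)/(2*space)); a good approximation while n << 2**(bits/2).
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))

# Illustrative: one billion distinct files, each with a 128-bit (MD5-sized) digest.
p_collision = birthday_collision_prob(10**9, 128)

# Illustrative media figure: roughly one undetected bit error per 1e15 bits read.
p_media_per_bit = 1e-15

print(f"accidental collision across all files: {p_collision:.3e}")
print(f"undetected media error, per single bit read: {p_media_per_bit:.0e}")
```

Even with a billion files, the collision probability comes out around 10⁻²¹ for the whole collection, while the media figure above is per individual bit read, so the gap in practice is far wider than the raw comparison suggests.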

This brings us to my choice of hash algorithms. I used MD5. Yes, I heard that gasp of astonishment and disapproval from all the SHA-1 purists, but let me explain. MD5 is much quicker to compute, and I don’t care that it’s (theoretically) less secure. I need a hash that’s collision-resistant but not necessarily one that’s cryptographically secure, and I don’t subscribe to the belief that the two are equivalent. To see why, let’s look at two criteria usually given for a secure and/or collision-resistant hash:
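The speed difference is easy to check on any given machine with Python’s standard hashlib; this is a quick benchmark sketch, not a universal claim, since relative throughput varies with the implementation and CPU (some modern processors even have SHA instructions that flip the ordering).

```python
import hashlib
import timeit

payload = b"x" * (1 << 20)  # 1 MiB of data to hash

def seconds_per_hash(name: str, reps: int = 50) -> float:
    """Average seconds to digest the 1 MiB payload with the named algorithm."""
    return timeit.timeit(lambda: hashlib.new(name, payload).digest(),
                         number=reps) / reps

for algo in ("md5", "sha1"):
    print(f"{algo}: {seconds_per_hash(algo) * 1e3:.2f} ms per MiB")
```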

  • Reversibility refers to the ease or difficulty with which the input (or an input, since the pigeon-hole principle tells us there must be many) corresponding to a given output can be found.
  • Collision resistance refers to the ease or difficulty with which two inputs yielding the same output can be calculated.

Astute readers will note that both definitions sort of assume that someone’s trying to reverse the algorithm or produce a collision. That’s a non-issue in my case; all I’m interested in is the accidental occurrence of duplicates. I could embed the entire contents of a file up to fifteen bytes within a sixteen-byte hash, making an easily recognizable subset of output values trivially reversible, and it wouldn’t matter. It might even be useful. As for collision resistance, let’s take the Gödel approach. Assume that I’m hashing data produced by some other program. There’s some very small subset of programs that will produce hash collisions with significantly greater than random probability (which is very small). There’s also some very small subset of programs that actually do something useful. The intersection of these two very tiny subsets is so small (relative to the set of all possible programs) that the probability of finding one rivals the probability of a random hash collision in its smallness. Sure, someone might be able to deliberately devise a program that can generate collisions, but that program will almost certainly not be useful for anything else.

When the issue is resistance to accidental error rather than deliberate attack, it’s important to look at collision resistance within the space of actually useful data producers, not programs that serve no other purpose. In that context I’d say that MD5’s collision resistance is just fine, and the difference in calculation speed does matter. If I were designing a protocol for the express purpose of running over a network, where the possibility of deliberate attack must always be considered, I definitely would use a more secure kind of hash despite the cost. For something that will mostly run within a single machine, though, that would be silly.
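The “embed short files in the hash itself” idea above can be sketched concretely. This is a hypothetical scheme of my own devising for illustration, not the format my project actually uses: files of up to fifteen bytes are stored verbatim behind a length byte, everything longer gets a real MD5 digest. (A genuine digest could, with tiny probability, happen to look like an embedded value; a production format would reserve a tag bit to rule that out.)

```python
import hashlib
from typing import Optional

DIGEST_LEN = 16  # MD5 digests are 16 bytes

def file_id(data: bytes) -> bytes:
    """Hypothetical 16-byte identifier: content up to 15 bytes is embedded
    (one length byte, the content, zero padding); longer content is hashed."""
    if len(data) <= DIGEST_LEN - 1:
        return bytes([len(data)]) + data + b"\x00" * (DIGEST_LEN - 1 - len(data))
    return hashlib.md5(data).digest()

def recover(identifier: bytes) -> Optional[bytes]:
    """Reverse the embedded case; returns None when the identifier does not
    match the easily recognizable embedded pattern (length byte <= 15,
    zero-padded tail)."""
    n = identifier[0]
    if n <= DIGEST_LEN - 1 and identifier[n + 1:] == b"\x00" * (DIGEST_LEN - 1 - n):
        return identifier[1:1 + n]
    return None
```

For a backup tool this doubles as a small-file optimization: tiny files never need a second read to restore, since the “hash” is the file.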