Data Gravity

In the last few days, I had an interesting exchange on Twitter about the concept of data gravity. For convenience, I'll include the relevant parts here.

  • Mat Ellis: Interesting piece by @mjasay link … @randybias is right on the money, data gravity is already a big deal on the cloud

  • me: Data gravity will continue to be a big deal, no matter how fast the network. Can't beat the speed of light.

  • Randy Bias: Data gravity and speed of light are entirely unrelated.

  • me: No matter how much bandwidth you have, latency-bound sync and coordination limit total data velocity.

I think this is an important point, and Randy is hardly the first to get it wrong, but the explanation is a little longer than Twitter's 140-character limit. If you have data that you want to access from multiple places, you have two choices.

  • Keep a copy in one location and access it remotely from everywhere else. Besides being extremely latency-bound, this does nothing for availability.

  • Keep multiple copies, and keep them in sync. The sync process/protocol still tends to be quite latency-bound, and as the number of replicas increases you get increasingly poor storage utilization. Even Google doesn't have an infinite budget for disks.
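To put rough numbers on why both choices are latency-bound, here's a back-of-envelope sketch in Python. Everything specific in it is my own assumption for illustration: the function names, the 4,000 km distance, and the common rule of thumb that light in fiber travels at about two thirds of its vacuum speed. It ignores routing, queuing, and protocol overhead, all of which only make things worse.

```python
# Back-of-envelope sketch; the constants and names below are illustrative
# assumptions, not measurements of any real network.

C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3             # assumed: light in fiber moves at roughly 2/3 c


def min_rtt_seconds(distance_km: float) -> float:
    """Lower bound on round-trip time over a fiber path of this length."""
    return 2 * distance_km / (C_VACUUM_KM_PER_S * FIBER_FACTOR)


def max_sync_ops_per_second(distance_km: float, round_trips_per_op: int = 1) -> float:
    """Upper bound on strictly serialized operations that each wait for
    an acknowledgment from the remote site. Bandwidth does not appear here."""
    return 1 / (min_rtt_seconds(distance_km) * round_trips_per_op)


def transfer_seconds(size_bytes: int, bandwidth_bps: float,
                     distance_km: float, round_trips: int = 1) -> float:
    """Time to move a payload: serialization time plus latency-bound round trips."""
    return size_bytes * 8 / bandwidth_bps + round_trips * min_rtt_seconds(distance_km)


if __name__ == "__main__":
    distance = 4_000  # km, hypothetical distance between two sites
    print(f"minimum RTT: {min_rtt_seconds(distance) * 1000:.1f} ms")
    print(f"max serialized sync ops/s: {max_sync_ops_per_second(distance):.1f}")
    for gbps in (10, 100):
        t = transfer_seconds(4096, gbps * 1e9, distance)
        print(f"4 KiB update at {gbps} Gb/s: {t * 1000:.3f} ms")
```

Under these assumptions the round trip alone costs about 40 ms, so a chain of dependent updates tops out at a couple dozen per second, and a tenfold bandwidth increase barely changes the time for a small update. Batching and pipelining help throughput, but anything that has to wait for a remote acknowledgment still pays the round trip.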

Either way, no matter how much bandwidth you have, latency (which is bounded below by the speed of light) is an issue. This is exactly the point I made in my "Dude, Where's My Data" talk at LISA'12: making that initial copy is easy, but keeping it up to date is hard. Sooner or later you're back to this.


That's data gravity, despite high bandwidth. Computing is full of "if you just do/have X" pipe dreams, of which "throw hardware at it" is just a subcategory. People who've actually tried X have usually found that there are tons of secondary issues that have to be solved, and even then X isn't the panacea it was imagined to be. This is such a case. Having tons of bandwidth is nice, and it does allow Google to do things that others can't, but it simply doesn't make data gravity disappear.
