Tiers Without Tears

A lot of people have asked when GlusterFS is going to have support for tiering or Hierarchical Storage Management, particularly to stage data between SSDs and spinning disks. This is a pretty hot topic for these kinds of systems, and many of them (Ceph, HDFS, and Swift, for example) have announced upcoming support in one form or another. However, tiering is just one part of a larger story. What do the following all have in common?

  • Migrating data between SSDs and spinning disks.
  • Migrating data between replicated storage and deduplicated, compressed, erasure-coded storage.
  • Placing certain types of data in a certain rack to increase locality relative to the machines that will be using it.
  • Segregating data in a multi-tenant environment, including between tenants at different service levels requiring different back-end configurations.

While these might seem like different things, they're all mostly the same except for one part that decides where to place a file/object. It doesn't really matter whether the criteria include file activity, type, owner, or physical location of servers. The mechanics of actually placing it there, finding it later, operating on it, or moving it somewhere else are all pretty much the same. We already have all those parts in GlusterFS, in the form of the DHT (consistent hashing) translator. We've even added tweaks to it before, such as the ill-named NUFA. Therefore, it makes perfect sense to use that as the basis for our own tiering strategy, but I call it "data classification" because the same enhancements will allow it to do far more than tiering alone.
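NUFA, incidentally, is a nice illustration of how small such a tweak can be: it's the same DHT code with a different placement rule (new files go to the brick local to the client), selected purely by configuration. A hand-written volfile fragment for it might look roughly like this, with made-up names and the usual brick plumbing omitted:

  # Same code as cluster/distribute, but new files are created on the
  # subvolume local to this node instead of by pure hashing.
  volume my-nufa
      type cluster/nufa
      subvolumes local-brick remote-brick-0 remote-brick-1
  end-volume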

The key idea behind data classification is reflected in its earlier name: DHT over DHT. Our "translator" abstraction allows us to have multiple instances of the same code active at once, differing only in their parameters and relationship to one another. It's just one of many ways that GlusterFS is more modular than its closest competitors, even though those are implemented in more object-oriented languages. To see how this kind of setup works, let's start with an example that doesn't use it, and is capable of implementing only the simplest form of tiering.

[Image: four bricks, each a hybrid of a smaller SSD portion and a larger spinning-disk portion, combined into replica pairs and then into a single volume with DHT]

In this example, we have four bricks, each consisting of a smaller SSD component (red) and a larger spinning-disk component (blue). This can easily be done using something like dm-cache, Bcache, FlashCache, or various hardware solutions. Those hybrid bricks are then combined, first into replica pairs and finally into a volume using the DHT (a.k.a. "distribute") translator. This approach actually works pretty well and is easy to implement, but it's less than ideal. If your working set is concentrated on anything less than the entire set of bricks, then you could fill up their SSD parts and either become network-bound or have accesses spill over to the spinning-disk components, even though potentially usable resources on other bricks remain idle. This approach doesn't deal well with adding more resources in anything but a totally symmetric fashion across all bricks, and in particular it precludes concentrating those SSDs on a separate set of beefier servers with extra-fast networking. Lastly, it doesn't support tiering across different encoding methods or replication levels, let alone the other non-tiering functions mentioned above.
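For concreteness, here's roughly what the translator stack for this setup might look like as a hand-written volfile. It's only a sketch: the names are made up, and the protocol/client plumbing that glusterd would normally generate between clients and bricks is left out. Note that the SSD/disk hybridization happens entirely below GlusterFS, so each brick is just a directory on a dm-cache (or similar) device.

  # Each brick is a plain directory on a hybrid SSD+disk device
  # (dm-cache, Bcache, etc.); GlusterFS never sees the split.
  volume brick-a
      type storage/posix
      option directory /bricks/hybrid-a
  end-volume

  volume brick-b
      type storage/posix
      option directory /bricks/hybrid-b
  end-volume

  volume brick-c
      type storage/posix
      option directory /bricks/hybrid-c
  end-volume

  volume brick-d
      type storage/posix
      option directory /bricks/hybrid-d
  end-volume

  # Combine the bricks into replica pairs...
  volume replica-0
      type cluster/replicate
      subvolumes brick-a brick-b
  end-volume

  volume replica-1
      type cluster/replicate
      subvolumes brick-c brick-d
  end-volume

  # ...and then hash files across the pairs to form one volume.
  volume my-volume
      type cluster/distribute
      subvolumes replica-0 replica-1
  end-volume

Now, consider this different kind of setup.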

[Image: the same bricks rearranged into two pools: a fast working-storage tier on the left and an archival tier on the right, joined by a tiering translator on top]

Here, the left half is our fast working-storage tier and the right half is our archival tier optimized for storage efficiency and survivability instead of performance. Note that this is a logical/functional view, not a physical one. A1 and A2 might still be on the same server, but now their logical relationship has changed and so they could also be moved separately.

Our performance tier looks much like the whole system did before, with bricks arranged into replica sets and then combined with DHT (as it is today). However, we've split off the spinning disks into a whole separate pool, and put a new "tiering" translator (a modified version of DHT) on top. Here's the cool part: that "replicate 3" layer might actually be erasure coding instead of normal replication. That would suck for performance, but since it's only used for our slow tier that's OK. 90% of accesses going to the fast tier + 90% of the data living in the storage-efficient tier = goodness. We could also toss in deduplication, compression, or bit-rot detection on that side only, for extra fun. Note that we couldn't do this in the other model, because you can't put non-distributed (per-brick) tiering on top of distributed erasure coding. Most other tiering proposals I've seen do the tiering at too low a level, and they're far less useful as a result.
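Here's a sketch of what that stack could look like, in the same hedged spirit as before: the names are made up, the brick definitions are omitted (they'd look like the ones in the previous sketch), and the "tier" translator at the top is shown with a hypothetical cluster/tier type since it's a proposal rather than shipping code. The important part is that the slow side uses cluster/disperse (erasure coding) where the fast side uses cluster/replicate, and each side gets its own cluster/distribute instance; that's "DHT over DHT" in volfile form.

  # Fast tier: SSD bricks, replicated, with its own DHT instance.
  volume fast-replica-0
      type cluster/replicate
      subvolumes ssd-a ssd-b
  end-volume

  volume fast-replica-1
      type cluster/replicate
      subvolumes ssd-c ssd-d
  end-volume

  volume fast-pool
      type cluster/distribute
      subvolumes fast-replica-0 fast-replica-1
  end-volume

  # Slow tier: spinning-disk bricks, erasure coded instead of replicated
  # (e.g. 4+2), then hashed by a second DHT instance.
  volume slow-ec-0
      type cluster/disperse
      option redundancy 2
      subvolumes disk-a disk-b disk-c disk-d disk-e disk-f
  end-volume

  volume slow-ec-1
      type cluster/disperse
      option redundancy 2
      subvolumes disk-g disk-h disk-i disk-j disk-k disk-l
  end-volume

  volume slow-pool
      type cluster/distribute
      subvolumes slow-ec-0 slow-ec-1
  end-volume

  # Hypothetical tiering translator (a modified DHT) that decides which
  # pool a file lives in and migrates it as its activity changes.
  volume my-tiered-volume
      type cluster/tier
      subvolumes fast-pool slow-pool
  end-volume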

Finally, let's consider those other functions that aren't tiering. In the second diagram above, it would be trivial to replace the "distribute" component with one that makes decisions based on rack location instead of random hashing. Similarly, it would be trivial to replace the top-level "tier" component with one that makes decisions based on tenant identity or service level instead of file activity. It's almost as easy to add even more layers, doing all of these things at once in a fully compatible way. No matter what, migrating data based on new policies or conditions can still use the same machinery we've worked so hard to debug and optimize for DHT rebalancing.
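Purely to show the shape of that, here are two hypothetical fragments; none of these translator types or options exist, and the names are invented. The first swaps the hashing policy for a rack-location one, the second swaps the top-level tiering policy for a tenant/service-level one, reusing the pools from the previous sketch.

  # Hypothetical: same distribution machinery, placement driven by rack
  # location instead of pure hashing.
  volume by-rack
      type cluster/rack-aware
      subvolumes rack1-pool rack2-pool
  end-volume

  # Hypothetical: top-level policy keyed on tenant or service level
  # instead of file activity, over the same fast and slow pools.
  volume by-tenant
      type cluster/data-classifier
      option gold-tenants fast-pool
      option bronze-tenants slow-pool
      subvolumes fast-pool slow-pool
  end-volume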

Over the last few years I've come up with a lot of ways to improve GlusterFS, or distributed filesystems in general, but this is one of my favorites. It can add so much functionality in return for so little deep-down "heavy lifting", and that's pretty exciting.
