Archive for October, 2012

Fun With Renaming Files

One of the “fun” things about consistent hashing in an actual filesystem, as opposed to an object or key/value store, is that you have to support rename operations. Supporting all of the required POSIX semantics around rename can introduce many kinds of complexity, which I’ll probably write about some time, but right now I’ll just focus on one kind of complexity that affects even simple rename operations. Consider the following quite-common sequence to update a file fubar.txt:

  1. Create .fubar.txt.tmp0123
  2. Write everything into the new file
  3. Rename .fubar.txt.tmp0123 over fubar.txt
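In C, that sequence looks roughly like the sketch below (error handling trimmed, filenames illustrative, and assuming `path` is a bare name in the current directory):

```c
/* Sketch of the atomic-update idiom: write everything into a temp file
 * in the same directory, then rename() it over the target. */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
update_file (const char *path, const char *data, size_t len)
{
        char tmp[4096];
        int  fd;

        /* Illustrative temp-name scheme, matching the idiom above. */
        snprintf (tmp, sizeof (tmp), ".%s.tmp%d", path, (int) getpid ());
        fd = open (tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return -1;
        if (write (fd, data, len) != (ssize_t) len ||
            fsync (fd) != 0 ||          /* flush data before the rename */
            close (fd) != 0) {
                unlink (tmp);
                return -1;
        }
        /* rename() atomically replaces the old file; readers see either
         * the old contents or the new, never a partial write. */
        return rename (tmp, path);
}
```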

The idea here is to rely on the atomic nature of rename to make the entire update atomic[1]. The problem this poses for GlusterFS has to do with the way we use consistent hashing to locate files. A file’s hash is based on its name, so when the name changes the hash might change too. Therefore, after the rename it might be on the “wrong” server according to the hash, but we really don’t want to incur the expense of moving it. What we do instead is create a linkfile on the new correct server, pointing to where the file really is. The problem is that there’s an extra cost involved in creating the link, and then another cost involved in following it. This cost can be surprisingly high for some workloads, and it turns out that two of our own components – UFO and geosync – are affected by it.
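To make the hashing problem concrete, here’s a toy model of name-based placement. The djb2 hash and modulo scheme below are illustrative stand-ins for what GlusterFS actually does (a Davies-Meyer hash mapped onto per-directory layout ranges); the point is only that the chosen server depends on the *name*, so a rename can change which server is “correct.”

```c
/* Toy model of name-based file placement: hash the name, pick a server.
 * djb2-modulo is a stand-in for the real hash/layout machinery. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSERVERS 4

static uint32_t
name_hash (const char *name)
{
        uint32_t h = 5381;      /* djb2 string hash */
        while (*name)
                h = h * 33 + (unsigned char) *name++;
        return h;
}

static int
server_for (const char *name)
{
        return (int) (name_hash (name) % NSERVERS);
}
```

Whenever `server_for(".fubar.txt.tmp0123")` differs from `server_for("fubar.txt")`, the renamed file ends up on the “wrong” server and a linkfile has to be created there to point at the real location.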

We already had a fix in place for rsync (which is used by geosync). However, that doesn’t help anything else. Therefore, over the weekend I created a patch to handle the more general case. What the patch does is allow the user to specify a regular expression that separates the final form of the name from the transient prefix/suffix associated with the temp-file idiom above. We then apply that expression internally to put the temp file in the right place to start with, so the subsequent rename all occurs within a single server and doesn’t require a linkfile.
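The prediction amounts to something like the following sketch, using POSIX regexes. The pattern, the function name, and the djb2 stand-in hash are all mine for illustration; the real patch exposes the expression as a user-settable option rather than hard-coding it.

```c
/* Sketch of rename-target prediction: if a name matches the temp-file
 * idiom, strip the transient prefix/suffix and hash the final name
 * instead, so the temp file lands on the same server from the start. */
#include <assert.h>
#include <regex.h>
#include <stdint.h>
#include <string.h>

static uint32_t
name_hash (const char *name, size_t len)
{
        uint32_t h = 5381;      /* djb2 stand-in for the real hash */
        while (len--)
                h = h * 33 + (unsigned char) *name++;
        return h;
}

static uint32_t
predicted_hash (const char *name)
{
        regex_t    re;
        regmatch_t m[2];
        uint32_t   h;

        /* Capture group 1 is the final name inside ".NAME.tmpNNNN".
         * This pattern is illustrative; in the patch it's user-supplied. */
        if (regcomp (&re, "^\\.(.+)\\.tmp[0-9]+$", REG_EXTENDED) != 0)
                return name_hash (name, strlen (name));
        if (regexec (&re, name, 2, m, 0) == 0)
                h = name_hash (name + m[1].rm_so,
                               (size_t) (m[1].rm_eo - m[1].rm_so));
        else
                h = name_hash (name, strlen (name));
        regfree (&re);
        return h;
}
```

With this in place, `.fubar.txt.tmp0123` hashes identically to `fubar.txt`, so the later rename stays on one server and never needs a linkfile.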

It’s worth noting that we wouldn’t have this problem if we located files by hashing immutable GFIDs instead of mutable names, and that for directory traversals etc. we do have the GFID in hand. However, not all lookups occur that way. For the remainder, we’d have to fall back to querying every server, which would have a catastrophic effect on performance. We could go to a scheme where we always query the “right” server for a file’s parent directory if we don’t already have a GFID, and make sure that server has a linkfile pointing to the file’s real location, but then we’re kind of right back where we started. Actually we’d be even worse off, because the extra linkfile would be needed even for files that were never renamed. We’re better off with name-based hashing plus linkfiles plus rename-target prediction, as inelegant as that combination might be.

[1] The rename might actually not be reflected on storage unless/until you call fsync on the parent directory, but that’s another future post.

Efficiency Still Matters

One of the most common knocks against GlusterFS is that it eats too many CPU cycles. For example:

Really not liking GlusterFS now. Performance is quite poor and CPU usage way too high for what it does.

I’ve talked about performance issues and expectations many times. With regard to CPU usage, I can’t resist saying that most people who say “too high for what it does” don’t actually understand what it does, but Frank still has a good point. I’ll be the first to admit that the GlusterFS code is pretty inefficient. Over on my other blog I obliquely mention one example, which I discuss as a correctness issue but which also affects performance. Every time you write a variable you don’t actually need, you don’t just use up a few CPU cycles. You might also generate more cache-coherency traffic, or push something else out of cache/TLB. The effects of any single instance are minor, but for code that executes millions of times a second they add up. If an idiom that drives extra memory accesses is repeated across thousands of functions, it has a measurable effect on overall cache efficiency. You owe it to yourself, your colleagues, and your users to avoid sins like the following (all pet peeves of mine in the GlusterFS code).

  • Overuse of macros and inlines that hide their true cost from profilers etc. I particularly despise macro abuse, because macros are also not type-safe and are annoying to debug through.
  • Too much locking and unlocking, often hidden (see above) and/or forced by bad calling conventions. Lock cycles are expensive.
  • Too many memory-allocator trips per request for stack frames, translator-local structures, dictionaries, etc. Also, custom allocators that don’t implement thread affinity.
  • Linear searches in dictionaries and fd/inode context lists.
  • Repeating arguments through every stage of a deep translator chain, instead of using a request-block idiom that keeps arguments in one place.
  • Calling functions/macros on every iteration of a loop instead of just once (compilers can catch these some of the time but not always).
  • Functions that are only called from one place, but defined in a separate module to maximize linkage difficulty and cache effects.
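To pick one concrete item from that list, here’s the loop-invariant-call sin in miniature (the counting functions are hypothetical, not from the GlusterFS tree):

```c
/* A loop-invariant call left inside the loop: strlen(s) is recomputed
 * on every iteration, turning an O(n) scan into O(n^2). */
#include <assert.h>
#include <stddef.h>
#include <string.h>

static size_t
count_dots_slow (const char *s)
{
        size_t n = 0;
        for (size_t i = 0; i < strlen (s); i++)   /* strlen every pass */
                if (s[i] == '.')
                        n++;
        return n;
}

static size_t
count_dots_fast (const char *s)
{
        size_t n   = 0;
        size_t len = strlen (s);                  /* hoisted: done once */
        for (size_t i = 0; i < len; i++)
                if (s[i] == '.')
                        n++;
        return n;
}
```

A compiler will sometimes hoist the call itself, but only when it can prove the string isn’t modified inside the loop, which is exactly the “some of the time but not always” caveat above.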

As I said, these things add up. For some of these I can point to one specific place where an improvement could be made, but most of these issues don’t live in a particular piece of code. They’re aspects/idioms that appear everywhere, in every piece of code, not causing a distinct pause but just slowing down the “clock rate” at which every line executes. In addition to all of the other things I’m doing with specific code, one of my long-term goals on the project is to start changing habits away from these inefficient calling conventions and coding styles toward something a bit “tighter,” as befits system code. Let’s see how that goes.