Efficiency, Performance, and Locality

While it might have been overshadowed by events on my other blog, my previous post on Solid State Silliness did lead to some interesting conversations. I’ve been meaning to clarify some of the reasoning behind my position that one should use SSDs for some data instead of all data, and that reasoning applies to much more than just the SSD debate, so here goes.

The first thing I’d like to get out of the way is the recent statement by everyone’s favorite SSD salesman that “performant systems are efficient systems”. What crap. There are a great many things that people do to get more performance (specifically in terms of latency) at the expense of wasting resources. Start with every busy-wait loop in the world. Another good example is speculative execution. There, the waste is certain – you know you’re not going to execute both sides of a branch – but it’s often done anyway because it lowers latency. It’s not efficient in terms of silicon area, it’s not efficient in terms of power, it’s not efficient in terms of dollars, but it’s done anyway. (This is also, BTW, why a system full of relatively weak low-power CPUs really can do some work more efficiently than one based on Xeon power hogs, no matter how many cores you put on each hog.) Other examples of increased performance without increased efficiency include most kinds of pre-fetching, caching, or replication. Used well, these techniques actually can improve efficiency as requests need to “penetrate” fewer layers of the system to get data, but used poorly they can be pure waste.

If you’re thinking about performance in terms of throughput rather than latency, then the equation of performance with efficiency isn’t quite so laughable, but it’s still rather simplistic. Every application has a certain ideal balance of CPU/memory/network/storage performance. It might well be the case that thinner “less performant” systems with those performance ratios are more efficient – per watt, per dollar, whatever – than their fatter “more performant” cousins. Then the question becomes how well the application scales up to the higher node counts, and that’s extremely application-specific. Many applications don’t scale all that well, so the “more performant” systems really would be more efficient. (I guess we can conclude that those pushing the “performance = efficiency” meme are used to dealing with systems that scale poorly. Hmm.) On the other hand, some applications really do scale pretty well to the required node-count ranges, and then the “less performant” systems would be more efficient. It’s a subject for analysis, not dogmatic adherence to one answer.

The more important point I want to make isn’t about efficiency. It’s about locality instead. As I mentioned above, prefetch and caching/replication can be great or they can be disastrous. Locality is what makes the difference, because these techniques are all based on exploiting locality of reference. If you have good locality, fetching the same data many times in rapid succession, then these techniques can seem like magic. If you have poor locality, then all that effort will be like the effort you make to save leftovers in the refrigerator to save cooking time . . . only to throw those leftovers away before they’re used. One way to look at this is to visualize data references on a plot, using time on the X axis and location on the Y axis, using Z axis or color or dot size to represent density of accesses . . . like this.

time/location plot

It’s easy to see patterns this way. Vertical lines represent accesses to a lot of data in a short amount of time, often in a sequential scan. If the total amount of data is greater than your cache size, your cache probably isn’t helping you much (and might be hurting you) because data accessed once is likely to get evicted before it’s accessed again. This is why many systems bypass caches for recognizably sequential access patterns. Horizontal lines represent constant requests to small amounts of data. This is a case where caches are great. It’s what they’re designed for. In a multi-user and/or multi-dataset environment, you probably won’t see many thick edge-to-edge lines either way. You’ll practically never see the completely flat field that would result from completely random access either. What you’ll see the most of are partial or faint lines, or (if your locations are grouped/sorted the right way) rectangles and blobs representing concentrated access to certain data at certain times.

Exploiting these blobs is the real fun part of managing data-access performance. Like many things, they tend to follow a power-law distribution – 50% of the accesses are to 10% of the data, 25% of the accesses are to the next 10%, and so on. This means that you very rapidly reach the point of diminishing returns, and adding more fast storage – be it more memory or more flash – is no longer worth it. When you consider time, this effect becomes even more pronounced. Locality over short intervals is likely to be significantly greater than that over long intervals. If you’re doing e-commerce, certain products are likely to be more popular at certain times and you’re almost certain to have sessions open for a small subset of your customers at any time. If you can predict such a transient spike, you can migrate the data you know you’ll need to your highest-performance storage before the spike even begins. Failing that, you might still be able to detect the spike early enough to do some good. What’s important is that the spike is finite in scope. Only a fool, given such information, would treat their hottest data exactly the same as their coldest. Only a bigger fool would fail to gather that information in the first place.

Since this all started with Artur Bergman’s all-SSD systems, let’s look at how these ideas might play out at a place like Wikia. Wikia runs a whole lot of special-interest wikis. Their top-level categories are entertainment, gaming, and lifestyle, though I’m sure they host wikis on other kinds of subjects as well. One interesting property of these wikis is that each one is separate, which seems ideal for all kinds of partitioning and differential treatment of data. At the very grossest level, it seems like it should be trivial to keep some of the hottest wikis’ data on SSDs and relegate others to spinning disks. Then there’s the temporal-locality thing. The access pattern for a TV-show wiki must be extremely predictable, at least while the show’s running. Even someone as media-ignorant as me can guess that there will be a spike starting when an episode airs (or perhaps even a bit before), tailing off pretty quickly after the next day or two. Why on Earth would someone recommend the same storage for content related to a highly rated and currently running show as for a show that was canceled due to low ratings a year ago? I don’t know.

Let’s take this a bit further. Using Artur’s example of 80TB and a power-law locality pattern, let’s see what happens. What if we have a single 48GB machine, with say 40GB available for caching? Using the “50% of accesses to 10% of the data” pattern, that means 3.125% of accesses are even out of memory. No matter what the latency difference between flash and spinning disks might be, it’s only going to affect that 3.125% of accesses so it’s not going to affect your average latency that much. Even if you look at 99th-percentile latency, it’s fairly easy to see that adding SSD up to only a few times memory size will reduce the level of spinning-disk accesses to noise. Factor in temporal locality and domain-specific knowledge about locality, and the all-SSD case gets even weaker. Add more nodes – therefore more memory – and it gets weaker. Sure, you can assume a flatter access distribution, but in light of all these other considerations you’d have to take that to a pretty unrealistic level before the all-SSD prescription starts to look like anything but quackery.

Now, maybe Artur will come along to tell me about how my analysis is all wrong, how Wikia really is such a unique special flower that principles applicable to a hundred other systems I’ve seen don’t apply there. The fact is, though, that those other hundred systems are not well served by using SSDs profligately. They’ll be wasting their owners’ money. Far more often, if you want to maximize IOPS per dollar, you’d be better off using a real analysis of your system’s locality characteristics to invest in all levels of your memory/storage hierarchy appropriately.

Solid State Silliness

Apparently Artur Bergman did a very popular talk about SSDs recently. It’s all over my Twitter feed, and led to a pretty interesting discussion at High Scalability. I’m going to expand a little on what I said there.

I was posting to comp.arch.storage when Artur was still a little wokling, so I’ve had ample opportunity to see how a new technology gets from “exotic” to mainstream. Along the way there will always be some people who promote it as a panacea and some who condemn it as useless. Neither position requires much thought, and progress always comes from those who actually think about how to use the Hot New Thing to complement other approaches instead of expecting one to supplant the other completely. So it is with SSDs, which are a great addition to the data-storage arsenal but cannot reasonably be used as a direct substitute either for RAM at one end of the spectrum or for spinning disks at the other. Instead of putting all data on SSDs, we should be thinking about how to put the right data on them. As it turns out, there are several levels at which this can be done.

  • For many years, operating systems have implemented all sorts of ways to do prefetching to get data into RAM when it’s likely to be accessed soon, and bypass mechanisms to keep data out of RAM when it’s not (e.g. for sequential I/O). Processor designers have been doing similar things going from RAM to cache, and HSM folks have been doing similar things going from tape to disk. These basic approaches are also applicable when the fast tier is flash and the slow tier is spinning rust.
  • At the next level up, filesystems can evolve to take better advantage of flash. For example, consider a filesystem designed to keep not just journals but actual metadata on flash, with the actual data on disks. In addition to the performance benefits, this would allow the two resources to be scaled independently of one another. Databases and other software at a similar level can make similar improvements.
  • Above that level, applications themselves can make useful distinctions between warm and cool data, keeping the former on flash and relegating the latter to disk It even seems that the kind of data being served up by Wikia is particularly well suited to this, if only they decided to think and write code instead of throwing investor money at their I/O problems.

Basically what it all comes down to is that you might not need all those IOPS for all of your data. Don’t give me that “if you don’t use your data” false-dichotomy sound bite either. Access frequency falls into many buckets, not just two, and a simplistic used/not-used distinction is fit only for a one-bit brain. If you need a lot of machines for their CPU/memory/network performance anyway, and thus don’t need half a million IOPS per machine, then spending more money to get them is just a wasteful ego trip. By putting just a little thought into using flash and disk to complement one another, just about anyone should be able to meet their IOPS goals for lower cost and use the money saved on real operational improvements.

Fun With Dynamic Loading

My current project is based on GlusterFS, which relies heavily on dlopen/dlsym to load its ‘translators” and other modules. Mostly this works great, and this modularity is the main reason I’m using GlusterFS in the first place, but yesterday I had to debug an interesting glitch that might be of interest to other people using similar plugin-based architectures.

The immediate problem had to do with GlusterFS’s quota feature. It turns out that the functionality is split across two translators, with most of it in the quota translator but some parts in the marker translator instead. I’m not sure why this is the case and suspect it shouldn’t be, but that pales in comparison to the fact that quota is run on clients. Huh? That makes the system trivially prone to quota cheating, which is simply going to be unacceptable in many situations – especially the “cloud billing” situation that’s promoted as a primary use of this feature. One of the nice things about the translator model is that you can move translators up or down in the “stack” or even move them across the server/client divide with relative ease, so I decided to try running quota on the server. I was a bit surprised when it blew up immediately, before the first client even finished mounting, but this kind of surprise is exactly why we do testing so I started to debug the problem.

The crash was a segfault in quota_lookup_cbk, which is called on the way out of the first lookup on the volume root. It looked like we were trying to free the “local” structure associated with this call, as we should be, but one of the component structures contained a bogus pointer – 0×85, which has never been a valid pointer on any operating system I’ve used and isn’t a common ASCII character either. Weird. Since GlusterFS is in user space, I have the rare (to me) luxury of using a debugger to step through the first part of the lookup code, but that only showed that everything seemed to be initialized properly. Then I started reading the quota code to see where the structure involved might get set to anything but zero. There didn’t seem to be any. Breakpoints on the functions that would have been involved in such a thing were never hit. I went back and stepped through the initialization again to see if there was anything I’d missed, and that’s when I realized that dynamic loading was involved.

What I noticed, as I stepped through the code, was that at one moment I was in quota_lookup at quota.c:688, about to call quota_local_new. When I stepped into that function, though, I found myself at marker-quota-helper.c:322 – part of a whole different translator. It didn’t take long from there to see that marker has its very own function named quota_local_new, so the problem clearly related to the duplicate symbol names. These duplicate symbols wouldn’t occur unless the two translators were both loaded in the same process, so now I knew why nobody at Gluster had seen the problem, but how exactly do the duplicate symbols cause it and what could I do to fix it? After more investigation, I saw that the dlopen(3) call that GlusterFS uses to load translators specifies the RTLD_GLOBAL flag. What does this mean? Here’s the man page:

The symbols defined by this library will be made available for
symbol resolution of subsequently loaded libraries.

Oh. A quick check verified that quota_local_new became valid when marker was loaded first, and remained at the same value even when quota was loaded subsequently, so quota_lookup was using the wrong version of quota_local_new. The marker version of this function does the wrong kind of initialization on the wrong kind of structure as far as quota is concerned, but this is all way after the compiler does all of its type checking so even if that checking were stronger it wouldn’t catch this. We get back a pointer to the wrong kind of structure, initialized the wrong way, and the only surprise is that we don’t blow up before we try to free it in quota_lookup_cbk.

So much for diagnosis. How about a fix? Most translator functions and dispatch tables are explicitly looked up using dlsym, and loading multiple translators wouldn’t work at all if RTLD_GLOBAL caused the wrong symbol to be returned in that case. I can’t think of any cases where code in one translator intentionally depends on a symbol exported from another instead of using the dispatch tables and such provided by the translator framework, so maybe using RTLD_LOCAL instead of RTLD_GLOBAL would help. Rather surprisingly, it doesn’t; quota_local_new still retains its marker value even after quota is loaded. That seems like a bug, but I can’t be bothered debugging dlopen. Another flag-based solution that initially seemed promising was RTLD_DEEPBIND. Here’s the man page again.

RTLD_DEEPBIND (since glibc 2.3.4)
Place the lookup scope of the symbols in this library ahead of
the global scope. This means that a self-contained library
will use its own symbols in preference to global symbols with
the same name contained in libraries that have already been
loaded. This flag is not specified in POSIX.1-2001.

Whether it works or not seems a bit irrelevant, though, since it’s non-portable and introduces a few problems of its own. Even the guy who added it seems to think it’s a bad idea. In the end, I adopted what many probably thought was the obvious solution: rename the conflicting symbols. I actually found four of them using “nm” and renamed the versions in marker, so now everything works.

The moral of the story, and the reason I’m writing about this instead of filing it away as just another among hundreds of other debugging stories that nobody else will ever care about, is that plugins and dynamic loading can be trickier than you think. This one would have been easy to miss if I had just stepped over quota_local_new instead of stepping into it, or if I hadn’t happened to notice the line numbers, or if I hadn’t already tangled with linkers and loaders enough to know that the way symbols get resolved can lead to some pretty “spooky” results. Maybe somebody reading this, or searching for terms like dlopen or RTLD_GLOBAL, will find this and be saved some tedious debugging.

Fighting FUD Again

Tom Trainer wrote what was supposed to be a thoughtful examination of what “cloud storage” should mean, but it came across as a rather nasty anti-Isilon hit piece. I tried to reply there, but apparently my comment won’t go through until I register with “UBM TechWeb” so they can sell me some crap, so I’m posting my response here. Besides being a defense of an unfairly maligned competitor – mine as well as Tom’s unnamed employer’s – it might help clarify some of the issues around what is or is not “real” cloud storage.

As the project lead for CloudFS, which addresses exactly the kinds of multi-tenancy and encryption you mention, I agree with many of your main points about what features are necessary for cloud storage. Where I disagree is with your (mis)characterization of Isilon to make those points.

* First, their architecture is far from monolithic. Yes, OneFS is proprietary, but that’s a *completely* different thing.

* Second, scaling to 144 servers is actually pretty good. When you look closely at what many vendors/projects claim, you find out that they’re actually talking about clients . . . and any idiot can put together thousands of clients. Conflating node counts with server counts was a dishonest trick when I caught iBrix doing it years ago, and it’s a dishonest trick now. Even the gigantic “Spider” system at ORNL only has 192 servers, and damn few installations need even half of that. It’s probably a support limit rather than an architectural limit. No storage vendor supports configurations bigger than they’ve tested, and testing even 144 servers can get pretty expensive – at least if you do it right. I’m pretty sure that Isilon would raise that limit if somebody asked them for a bigger system and let them use that configuration for testing.

Third, Isilon does have a “global” namespace as that term is usually used – i.e. at a logical level, to mean that the same name means the same thing across multiple servers, just like a “global variable” represents the same thing across multiple modules or processes. Do you expect global variables to be global in a physical sense too? In common usage, people use terms like “WAN” or “multi-DC” or “geo” to mean distribution across physical locations, and critiquing a vendor for common usage of a term makes your article seem like even more of a paid-for attack piece.

Disclaimer: I briefly evaluated and helped deploy some Isilon gear at my last job (SiCortex). I respect the product and I like the people, but I have no other association with either.