Canned Platypus

Saving the world one byte at a time since 2000

Archive for the ‘tech’ Category

I have the best readers. One sent me email expressing a hope that I’d write about Martin Fowler’s LMAX Architecture. I’d be glad to. In fact I had already thought of doing so, but the fact that at least one reader has already expressed an interest makes it even more fun. The architecture seems to incorporate three basic ideas.

  • Event sourcing, or the idea of using a sequentially written log as the “system of record” with the written-in-place copy as a cache – an almost direct inversion of roles compared to standard journaling.
  • The “disruptor” data/control structure.
  • Fitting everything in memory.

I don’t really have all that much to say about fitting everything in memory. I’m a storage guy, which almost by definition means I don’t get interested until there’s more data than will fit in memory. Application programmers should IMO strive to use storage only as a system of record, not as an extension of memory or networking (“sending” data from one machine to another through a disk is a pet peeve). If they want to cache storage contents in memory that’s great, and if they can do that intelligently enough to keep their entire working set in memory that’s better still, but if their locality of reference doesn’t allow that then LMAX’s prescription just won’t work for them and that’s that. The main thing that’s interesting about the “fit in memory” part is that it’s a strict prerequisite for the disruptor part. LMAX’s “one writer many readers” rule makes sense because of how cache invalidation and so forth work, but disks don’t work that way so the disruptor’s advantage over queues is lost.

With regard to the disruptor structure, I’ll also keep my comments fairly brief. It seems pretty cool, not that dissimilar to structures I’ve used and seen used elsewhere; some of the interfaces to the SiCortex chip’s built-in interconnect hardware come to mind quickly. I think it’s a mistake to contrast it with Actors or SEDA, though. I see them as complementary, with Actors and SEDA as high-level paradigms and the disruptor as an implementation alternative to the queues they often use internally. The idea of running these other models on top of disruptors doesn’t seem strange at all, and given the familiar nature of disruptors the combination doesn’t even seem all that innovative to me. It’s rather disappointing to see useful ideas dismissed because of a false premise that they’re alternatives to one another instead of being complementary.
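For anyone who hasn’t seen this kind of structure, here’s a minimal single-threaded sketch of the general idea in Python: one writer publishes into a fixed-size ring by sequence number, and each reader advances its own cursor. The class and method names are my own inventions for illustration, not the LMAX API, and a real implementation depends on memory barriers and cache-line padding that this toy ignores.

```python
class RingBuffer:
    """Toy single-writer, multi-reader ring (hypothetical names, not LMAX's API)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.next_seq = 0          # next sequence number the writer will publish
        self.cursors = {}          # reader name -> next sequence number to read

    def add_reader(self, name):
        # New readers start at the writer's current position.
        self.cursors[name] = self.next_seq

    def publish(self, event):
        # The writer must not lap the slowest reader, or unread data is lost.
        slowest = min(self.cursors.values(), default=self.next_seq)
        if self.next_seq - slowest >= self.size:
            return False           # ring full; caller has to retry later
        self.slots[self.next_seq % self.size] = event
        self.next_seq += 1
        return True

    def poll(self, name):
        # Each reader touches only its own cursor; there is no shared queue head.
        seq = self.cursors[name]
        if seq == self.next_seq:
            return None            # nothing new for this reader
        self.cursors[name] = seq + 1
        return self.slots[seq % self.size]


ring = RingBuffer(4)
ring.add_reader("journal")
ring.add_reader("business_logic")
for n in range(4):
    ring.publish(n)
full = not ring.publish(99)        # rejected: the slowest reader has consumed nothing
seen = [ring.poll("journal") for _ in range(4)]
```

Notice that nothing here is specific to any one programming model. An actor mailbox or a SEDA stage could sit on top of `publish` and `poll` just as easily as on top of a queue.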

The really interesting part for me, as a storage guy, is the event-sourcing part. Again, this has some pretty strong antecedents. This time it recalls Seltzer et al’s work on log-structured filesystems, which is even based on many of the same observations, e.g. about locality of reference and relative costs of random vs. sequential access. That work’s twenty years old, by the way. Because event sourcing is so similar to log-structured file systems, it runs into some of the same problems. Chief among these are the potentially high cost of reads that aren’t absorbed by the cache, and the complexity involved with pruning no-longer-relevant logs. Having to scan logs to find the most recent copy of a block/record/whatever can be extremely expensive, and building indices carries its own expense in terms of both resources and complexity. It’s not a big issue if your system has very high locality of reference, which time-oriented systems such as LMAX or several other types of systems tend to have, but it can be a serious problem in the general case. Similarly, the cleanup problem doesn’t matter if you can simply drop records from the tail or at least stage them to somewhere else, but it’s a big issue for files that need to stay online – with online rather than near-line performance characteristics – indefinitely.
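To make the trade-offs concrete, here’s a rough sketch of the pattern in Python. It’s my own toy, not anything from LMAX or Fowler’s article: a list stands in for the sequentially written log, a dict stands in for the written-in-place copy, and all of the names are made up.

```python
import json


class EventSourcedStore:
    """The append-only log is the system of record; the dict is only a cache."""

    def __init__(self):
        self.log = []       # stand-in for a sequentially written file
        self.cache = {}     # in-place copy, always rebuildable from the log

    def put(self, key, value):
        self.log.append(json.dumps({"key": key, "value": value}))
        self.cache[key] = value

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        return self.scan(key)

    def scan(self, key):
        # Cache miss: walk the log backward to find the newest matching record.
        for line in reversed(self.log):
            record = json.loads(line)
            if record["key"] == key:
                return record["value"]
        return None

    def replay(self):
        # Recovery: throw the cache away and rebuild it from the log alone.
        self.cache = {}
        for line in self.log:
            record = json.loads(line)
            self.cache[record["key"]] = record["value"]

    def compact(self):
        # Pruning: keep only the newest record per key, preserving log order.
        latest = {}
        for pos, line in enumerate(self.log):
            latest[json.loads(line)["key"]] = pos
        self.log = [self.log[pos] for pos in sorted(latest.values())]


store = EventSourcedStore()
store.put("acct-1", 100)
store.put("acct-2", 50)
store.put("acct-1", 75)
store.cache.clear()                # simulate the record falling out of cache
uncached = store.get("acct-1")     # has to scan the log
store.compact()                    # drops the superseded acct-1 record
```

The `scan` and `compact` methods are where the pain lives. Both are linear in the size of the log here, and making either one cheaper means building an index or a cleaner, with all the complexity that implies.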

In conclusion, then, the approach Fowler describes seems like a good one if your data characteristics are similar to LMAX’s, but probably not otherwise. Is it an innovative approach? Maybe in some ways. Two out of the three main features seem strongly reminiscent of technology that already existed, and combinations of existing ideas are so common in this business that this particular combination doesn’t seem all that special. On the other hand, there might be more innovation in the low-level details than one would even expect to find in Fowler’s high-level overview. It’s interesting, well worth reading, but I don’t think people who have dealt with high data volumes already will find much inspiration there.

While we were in Ann Arbor last month, we stopped by the absolutely amazing Kaleidoscope used and rare bookstore. (I’d link, but can’t find a website.) I knew from our last visit that they have an excellent collection of old sci-fi magazines, so I decided to see if they had any from the month I was born – April 1965. Sure enough, they had a Galaxy from that month. I was surprised how many of the authors I recognized. Here are the stories mentioned on the cover:

  • “Wasted on the Young” by John Brunner
  • “War Against the Yukks” by Keith Laumer
  • “A Wobble in Wockii Futures” by Gordon R. Dickson
  • “Committee of the Whole” by Frank Herbert

That’s an all-star cast right there. However, the story that really made an impression on me was by someone I had never heard of – “The Decision Makers” by Joseph Green. It’s about an alien-contact specialist sent to decide whether a newly discovered species met relevant definitions of intelligence which would interfere with a planned terraforming operation. That’s pretty standard stuff for the SF of the time, but there’s a twist; the aliens, which are called seals, have a sort of collective intelligence which complicates the protagonist’s job. This leads to the passage that might be of interest to my usual technical audience.

Our group memory is an accumulated mass of knowledge which is impressed on the memory areas of young individuals at birth, at least three such young ones for each memory segment. We are a short-lived race, dying of natural causes after eight of your years. As each individual who carries a share of the memory feels death approaching he transfers his part to a newly born child, and thus the knowledge is transferred from generation to generation, forever.

Try to remember that this was written in 1965, long before today’s networked computer systems were even imagined, and that the author wasn’t even writing about computers. He was trying to tell a completely different kind of story; the entire excerpt above could have been omitted without affecting the plot. Nonetheless, he managed to describe a form of what we would now call sharding, with replication and even deliberate re-replication to preserve availability. The result should be instantly recognizable to anyone who has studied modern distributed databases such as Voldemort or Riak or Cassandra. A lot of people think of this stuff as cutting edge, but it’s also an incidental part of a barely-remembered story from 1965. Somehow I find that both humbling and hilarious.
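Just for fun, here’s roughly what the seals’ scheme looks like as code. This is a toy sketch in Python with names I made up (the memory segments and the individual seals are hypothetical), and it bears no resemblance to how any of those real databases place data. It only shows the two ideas from the passage: three holders per segment, and a hand-off to a newborn when a holder is about to die.

```python
REPLICAS = 3  # "at least three such young ones for each memory segment"


def assign(segments, members):
    """Place each segment on REPLICAS distinct members, round-robin.

    Assumes there are at least REPLICAS members.
    """
    placement = {}
    for i, seg in enumerate(segments):
        placement[seg] = [members[(i + r) % len(members)] for r in range(REPLICAS)]
    return placement


def hand_off(placement, dying, newborn):
    """Re-replicate: everything the dying member holds moves to a newborn."""
    for holders in placement.values():
        for idx, holder in enumerate(holders):
            if holder == dying:
                holders[idx] = newborn
    return placement


seals = ["a", "b", "c", "d", "e"]
placement = assign(["history", "hunting", "songs"], seals)
hand_off(placement, dying="b", newborn="f")
```

After the hand-off every segment still has three holders, which is the whole point: the replica count is restored before the old copy is lost, not after.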

A few days ago, I got my Eee Pad Transformer. I guess you could say I didn’t get the Transformer part, because I only got the tablet part without the accompanying keyboard. So far, I’m loving it. It’s a very convenient size, the screen is gorgeous, performance and battery life seem excellent. I’ve had to spend some time getting used to Android, but so far I like that too. There’s also something much more satisfying about interacting by touch than through a keyboard and/or mouse, but maybe that’s just novelty. Here are some of the apps I’ve been using.

  • Built-in mail client. Not bad, so far it seems much better than the one that’s on my (non-Snow Leopard) Mac.
  • Seesmic for Twitter. I tried a few others, but none offered the combination of features I want. Twimbow for Android would rock.
  • Mobo Player (plus codecs) for playing unconverted video. I’m never going back to watching video on my iPod Touch while I exercise. Having a screen this large, but flat enough to put on the elliptical’s console, is fantastic. Playback has been smooth as butter, too.
  • ConnectBot (ssh client), plus Hacker’s Keyboard so I can have control keys.
  • Dungeon Defender. Yeah, I got this for entertainment as well as work.
  • Google Reader, Calendar, etc. through the browser.

All of this software was easily installed from the app market, and cost me zero. The ASUS on-screen keyboard seems a little nicer than the stock Android one, and it seems to suffice for short messages, but I don’t think I’d want to sit through an extended session of trying to use bash and vi that way. If I wanted to take this as my sole computer on a trip, I’d definitely get the keyboard – not only for faster typing but for extended battery life as well. I’d probably want to get VPN and IRC clients as well, though to be honest I’ve gotten tired of VPN glitches so I’ve switched to doing everything through ssh tunnels and SOCKS proxies when I’m working from home (like now) so maybe I’d skip the VPN.

Overall, I’d say my initial impression has been very positive. Maybe I’ll check back in a month or two and tell people how it fares over the longer term.

Charles Hooper wrote an interesting article about Letting Tech People be Socially Inept, in response to a recent incident at ThisWebHost where a technical director got mad at a customer and deleted data. There’s so much wrong here that it’s hard to know where to begin. “Jules” at This* was totally in the wrong. Charles is also wrong when he says that every position is customer-facing. Some people can do really great work only interacting with one person – usually their boss – and referring to that one person as a “customer” seems rather facile to me. Most of all, though, the puntwits on Hacker News are wrong too.

Probably the most common theme in the HN responses is that high technical skill and eschewing social niceties are closely and necessarily related because they both involve pruning away extraneous detail to get to the heart of something complex. For example, the very first comment there refers to “seemingly-meaningless boilerplate and social grease that we call people skills” and sets the tone for many that follow. The problem I have with this is two-fold. First, a lot of the bad behavior I see from my colleagues has nothing to do with social niceties. Go look at How To Ruin a Project and you’ll see that many of the items listed there don’t have anything to do with social skills. You don’t need a fine understanding of social cues to realize it’s wrong to miss the point, focus on the trivial, or wage “guerilla warfare” against a decision you don’t like. Those things might well have social effects, but they’re wrong even for purely practical reasons as well.

My other objection to the “boilerplate and grease” meme is that social cues actually do serve a practical purpose. They help to identify how strongly someone holds a belief, and how much importance they attach to the subject. Without that information, it’s easy to waste enormous amounts of time fighting wars that didn’t really need to be fought, and sometimes that leaves little time or energy – or poisons the atmosphere – for the debates that really do need to occur. One of the real problems I see with my fellow techies is a general lack of perspective, proportion, or priority. People who fail to give or read cues regarding these three important factors are demonstrating a deficiency that can affect even the most purely technical decision making. Being socially inept is not just cosmetic; it has a real and tangible effect on overall competence.

That brings me to the second most common theme in the HN responses: Asperger’s Syndrome. I’ve known quite a few real Aspies. I’ve known even more people who self-diagnose or self-identify that way as an excuse for being lazy about social interaction, so I won’t claim that I’m that way myself, but I will say I’m close enough (as is visibly the case for most of the other males in my family) to understand the challenges they face. I understand the “gravity” that always pulls one’s thoughts inward, and I know the pain of having to pull one’s attention away to deal with the “noise” that other people can generate. I will gladly do everything I can to accommodate people for whom these burdens truly are greater. I’ll teach them coping strategies I’ve learned from others, act as interpreter, run interference for them, whatever. However, there’s a difference here. It’s one thing to have a hard time understanding social cues. It’s another thing to understand those cues perfectly well and use that knowledge to troll more effectively. It’s really not that hard to tell the difference between an Aspie and an asshole; those who throw spitballs from behind a shield meant for others are not only jerks but cowards as well.

My conclusion is that Charles was wrong about everyone being customer facing, but right about the more fundamental reality that we techies in general need to stop being such jerks. We need to stop enabling the jerks by applauding when they act in deliberately offensive ways on HN, on Twitter, in conference presentations, etc. We need to stop pretending that the combative style prevalent on HN or LKML is the best way to facilitate progress; there’s no empirical evidence that “culling the herd” or “honing one’s weapons” or other such bloody metaphors really apply. We need to stop encouraging young techies to expend their energy emulating those styles instead of developing real people skills. Social skills really do serve a useful purpose, and anyone can improve them. That doesn’t make you less technical; it makes you more adult.

I think a lot of medium-to-senior programmers think it would be cool to run their own open-source project, particularly if they have chafed under the leadership of someone else on another project. I’m not going to say it’s not fun, but after having done this for a while I think a word of warning is in order. You’re going to spend a lot of your time, for long stretches practically all of it, doing things besides the design/code/test cycle that individual contributors get to focus on. Here’s an assuredly partial list.

  • Be your own IT staff. Set up a source code repository, mailing list, bug tracker, website, etc. Sure, there are distros and forges that will be glad to help with some of this, and it’s definitely better than having to set up every bit of software and keep up with every security issue on your own leased machines yourself, but even then you get to deal with someone else’s infrastructure team and release schedule.
  • Be your own release engineer. The second biggest time-suck in your life is going to be managing branches and shepherding patches. As much as you might hate rigid coding standards and checkin policies, you’re almost certainly going to be the one defining and enforcing those for any project with more than (in my experience) two developers because otherwise things very quickly get out of control. On top of that, you get to deal with packaging as well and packaging issues can often take more time to resolve than almost any technical issue.
  • Be your own HR department. Whether you’re actually hiring or just attracting developers for your project, you’re going to have to spend some time recruiting others’ participation. It’s actually harder for pure open source, because you can neither offer money nor do a proper interview. Some people who work in easy/popular specialties act as though interviews are passé anyway, but that’s total BS. There are still fields where most of the advanced work is being done behind closed doors. If you relied on GitHub profiles to hire people for work on distributed replication, SSD tiering, or especially de-duplication, you’d practically guarantee that you’re getting the wannabes instead of the true experts (not that the “GitHub or git out” crowd is qualified to tell the difference). You’re going to end up interviewing and evaluating people either before they go through the door or after. Since it’s really hard to get rid of an incompetent or toxic person after, you owe it to yourself and the other members of your team to do the filtering before.
  • Be your own marketing department. If you work at an actual company, whether startup or established, you might have full-time marketing folks to attract business interest, but they’ll be all but totally useless (and uninterested) when it comes to attracting technical interest so this is going to be the biggest expenditure of your time. Blog, tweet, attend conferences, respond to inquiries in email and IRC. Lather, rinse, repeat. If you’re lucky, you’ll get to spend time presenting to customers/partners as well, which means even more time away from the code but is obviously worth it. “Evangelism” is necessary partly for your recruiting strategy – see above – but also so that when you do talk to people about deploying your code you don’t find yourself taking a torpedo in the side from some techie who’s either unfamiliar with your project or already inclined to use something else.
  • Write your own documentation. Developers hate writing documentation. They even hate reviewing documentation. However, if you’re running your own project you probably won’t have any trained and dedicated doc writers until very late in the game if ever, so you get to write it all yourself. No, a wiki doesn’t cut it, and neither does a FAQ on your website. If you’re really serious about letting your users figure things out for themselves instead of bugging you on mailing lists or IRC (which is even more time away from coding) then you’ll need to write not just technical documentation but end-user documentation as well. Writing man pages in nroff might seem like chipping flakes off a piece of flint, but you’ll still have to do it.
  • Deal with legal issues. If you’re lucky you can avoid patents, but – as I’ve found out – you can’t avoid trademarks. You might also deal with contributor agreements and such where your project relates to others, even if you eschew them on your own.

So now you’re spending 30% of your time on recruiting and evangelism, 25% of your time playing release engineer, 20% of your time doing all these other things. What do you do with the other 25% of your time that you actually get to spend on code? Probably more than half of that time will be spent on “peripheral” pieces of the code – selecting libraries and debugging their problems, writing config/argument parsers, running static code analysis (including memory-leak analysis) tools, or generally filling in wherever there are gaps. Maybe you’ll get to spend 10% of your time doing the things that you started the project to do. Maybe not.

I don’t mean to seem like a total downer. Even spending 10% of your time on the parts you really enjoy can be worth it if that allows you to prove a point or make the world a better place. It’s better than working on someone else’s dream (or merely lining their pockets) during the day, and getting to spend only tiny scraps of your “spare” time pursuing your own dream. I’m just trying to sound a note of caution here. Be aware that when you turn your tinkering into an actual project you lose a lot of control over both its direction and your own involvement. Personally I don’t subscribe to the open-source mantra that you should start inviting others to participate as soon as you have an idea. That’s great for the kibitzers, but it’s not so great for the players who might actually be better off keeping an idea to themselves for a while before giving up the chance to work on it the way they want to.

(Yes, I’m playing devil’s advocate a bit here. There’s enough “big happy family” rah-rah out there already. Average that with the deliberately negative view I present here, and you might end up somewhere near reality.)

I read about this a few days ago on the Green Data Center Blog, and tried to comment there, but apparently Dave Ohara either isn’t checking his moderation queue or doesn’t like skeptical comments about companies whose press releases he’s re-publishing. I’ll try to reconstruct the gist of that comment here.

Since working at SiCortex, I’ve kept a bit of an eye on other companies trying to produce high-density high-efficiency systems, like SeaMicro and Calxeda. Frankly, I find SeaMicro’s “throughput computing” spin very unconvincing. A little over a year ago I took a look at their architecture and reached the conclusion that all those processors were going to starve – not enough memory to let data rest, not enough network throughput to keep it moving. Now they have somewhat more memory and a lot more CPU cycles, but as far as I can tell the exact same network and storage capabilities, so the starvation problem is going to be even more severe. Even worse, the job posting Dave forwards for them (if he doesn’t get a fee for that he should) seems to indicate that they’re still pursuing an Ethernet-centric interconnect strategy. That will keep them well behind where SiCortex, Cray, IBM and others were years ago in terms of internal bandwidth for similar systems.

On the other hand, even if Calxeda’s more radical departure from “me too” computing seems more likely to yield something useful, it would be unfair to contrast their still-hoped-for systems to SeaMicro’s actually-shipping ones. Come on, Calxeda, ship something so we can actually make that comparison.

While it might have been overshadowed by events on my other blog, my previous post on Solid State Silliness did lead to some interesting conversations. I’ve been meaning to clarify some of the reasoning behind my position that one should use SSDs for some data instead of all data, and that reasoning applies to much more than just the SSD debate, so here goes.

The first thing I’d like to get out of the way is the recent statement by everyone’s favorite SSD salesman that “performant systems are efficient systems”. What crap. There are a great many things that people do to get more performance (specifically in terms of latency) at the expense of wasting resources. Start with every busy-wait loop in the world. Another good example is speculative execution. There, the waste is certain – you know you’re not going to execute both sides of a branch – but it’s often done anyway because it lowers latency. It’s not efficient in terms of silicon area, it’s not efficient in terms of power, it’s not efficient in terms of dollars, but it’s done anyway. (This is also, BTW, why a system full of relatively weak low-power CPUs really can do some work more efficiently than one based on Xeon power hogs, no matter how many cores you put on each hog.) Other examples of increased performance without increased efficiency include most kinds of pre-fetching, caching, or replication. Used well, these techniques actually can improve efficiency as requests need to “penetrate” fewer layers of the system to get data, but used poorly they can be pure waste.

If you’re thinking about performance in terms of throughput rather than latency, then the equation of performance with efficiency isn’t quite so laughable, but it’s still rather simplistic. Every application has a certain ideal balance of CPU/memory/network/storage performance. It might well be the case that thinner “less performant” systems with those performance ratios are more efficient – per watt, per dollar, whatever – than their fatter “more performant” cousins. Then the question becomes how well the application scales up to the higher node counts, and that’s extremely application-specific. Many applications don’t scale all that well, so the “more performant” systems really would be more efficient. (I guess we can conclude that those pushing the “performance = efficiency” meme are used to dealing with systems that scale poorly. Hmm.) On the other hand, some applications really do scale pretty well to the required node-count ranges, and then the “less performant” systems would be more efficient. It’s a subject for analysis, not dogmatic adherence to one answer.

The more important point I want to make isn’t about efficiency. It’s about locality instead. As I mentioned above, prefetch and caching/replication can be great or they can be disastrous. Locality is what makes the difference, because these techniques are all based on exploiting locality of reference. If you have good locality, fetching the same data many times in rapid succession, then these techniques can seem like magic. If you have poor locality, then all that effort will be like the effort you make to save leftovers in the refrigerator to save cooking time . . . only to throw those leftovers away before they’re used. One way to look at this is to visualize data references on a plot, using time on the X axis and location on the Y axis, using Z axis or color or dot size to represent density of accesses . . . like this.


[Figure: time/location plot]

It’s easy to see patterns this way. Vertical lines represent accesses to a lot of data in a short amount of time, often in a sequential scan. If the total amount of data is greater than your cache size, your cache probably isn’t helping you much (and might be hurting you) because data accessed once is likely to get evicted before it’s accessed again. This is why many systems bypass caches for recognizably sequential access patterns. Horizontal lines represent constant requests to small amounts of data. This is a case where caches are great. It’s what they’re designed for. In a multi-user and/or multi-dataset environment, you probably won’t see many thick edge-to-edge lines either way. You’ll practically never see the completely flat field that would result from completely random access either. What you’ll see the most of are partial or faint lines, or (if your locations are grouped/sorted the right way) rectangles and blobs representing concentrated access to certain data at certain times.
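If you want to see the difference between those two kinds of lines in numbers, here’s a small Python sketch that replays an access trace through a plain LRU cache. The trace shapes and sizes are arbitrary values I picked for illustration, and real caches use smarter policies than pure LRU.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Replay a sequence of block numbers through an LRU cache; return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(accesses)


# "Vertical lines": five sequential scans of 1000 blocks through a 100-block cache.
scan = list(range(1000)) * 5
# "Horizontal line": constant requests to the same 10 blocks.
hot = [i % 10 for i in range(5000)]

scan_rate = lru_hit_rate(scan, capacity=100)
hot_rate = lru_hit_rate(hot, capacity=100)
```

The scan never hits at all, because every block is evicted long before the next pass comes back around to it, while the hot set misses only on its first ten accesses. Same cache, same size, completely different value.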

Exploiting these blobs is the real fun part of managing data-access performance. Like many things, they tend to follow a power-law distribution – 50% of the accesses are to 10% of the data, 25% of the accesses are to the next 10%, and so on. This means that you very rapidly reach the point of diminishing returns, and adding more fast storage – be it more memory or more flash – is no longer worth it. When you consider time, this effect becomes even more pronounced. Locality over short intervals is likely to be significantly greater than that over long intervals. If you’re doing e-commerce, certain products are likely to be more popular at certain times and you’re almost certain to have sessions open for a small subset of your customers at any time. If you can predict such a transient spike, you can migrate the data you know you’ll need to your highest-performance storage before the spike even begins. Failing that, you might still be able to detect the spike early enough to do some good. What’s important is that the spike is finite in scope. Only a fool, given such information, would treat their hottest data exactly the same as their coldest. Only a bigger fool would fail to gather that information in the first place.
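Here’s that diminishing-returns curve as a few lines of Python. It takes the halving pattern above completely literally, which is a simplification of my own; real distributions are messier, and this one doesn’t even quite sum to 100% over all ten tenths of the data.

```python
def hit_fraction(deciles_cached):
    """Fraction of accesses served by the hottest `deciles_cached` tenths of the data.

    Assumes each successive 10% of the data draws half the accesses of the
    one before it: 50%, 25%, 12.5%, and so on.
    """
    return sum(0.5 ** (k + 1) for k in range(deciles_cached))


for deciles in range(1, 6):
    hits = hit_fraction(deciles)
    print(f"hottest {deciles * 10}% of data: {hits:.3%} hits, {1 - hits:.3%} misses")
```

Under this model the first tenth of the data buys you 50 points of hit rate, while going from four tenths to five buys you just over 3 and leaves 3.125% of accesses as misses. Each step costs the same amount of fast storage and delivers half as much as the one before.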

Since this all started with Artur Bergman’s all-SSD systems, let’s look at how these ideas might play out at a place like Wikia. Wikia runs a whole lot of special-interest wikis. Their top-level categories are entertainment, gaming, and lifestyle, though I’m sure they host wikis on other kinds of subjects as well. One interesting property of these wikis is that each one is separate, which seems ideal for all kinds of partitioning and differential treatment of data. At the very grossest level, it seems like it should be trivial to keep some of the hottest wikis’ data on SSDs and relegate others to spinning disks. Then there’s the temporal-locality thing. The access pattern for a TV-show wiki must be extremely predictable, at least while the show’s running. Even someone as media-ignorant as me can guess that there will be a spike starting when an episode airs (or perhaps even a bit before), tailing off pretty quickly after the next day or two. Why on Earth would someone recommend the same storage for content related to a highly rated and currently running show as for a show that was canceled due to low ratings a year ago? I don’t know.

Let’s take this a bit further. Using Artur’s example of 80TB and a power-law locality pattern, let’s see what happens. What if we have a single 48GB machine, with say 40GB available for caching? Using the “50% of accesses to 10% of the data” pattern, that means only 3.125% of accesses even go out of memory. No matter what the latency difference between flash and spinning disks might be, it’s only going to affect that 3.125% of accesses so it’s not going to affect your average latency that much. Even if you look at 99th-percentile latency, it’s fairly easy to see that adding SSD up to only a few times memory size will reduce the level of spinning-disk accesses to noise. Factor in temporal locality and domain-specific knowledge about locality, and the all-SSD case gets even weaker. Add more nodes – therefore more memory – and it gets weaker. Sure, you can assume a flatter access distribution, but in light of all these other considerations you’d have to take that to a pretty unrealistic level before the all-SSD prescription starts to look like anything but quackery.

Now, maybe Artur will come along to tell me about how my analysis is all wrong, how Wikia really is such a unique special flower that principles applicable to a hundred other systems I’ve seen don’t apply there. The fact is, though, that those other hundred systems are not well served by using SSDs profligately. They’ll be wasting their owners’ money. Far more often, if you want to maximize IOPS per dollar, you’d be better off using a real analysis of your system’s locality characteristics to invest in all levels of your memory/storage hierarchy appropriately.

Apparently Artur Bergman did a very popular talk about SSDs recently. It’s all over my Twitter feed, and led to a pretty interesting discussion at High Scalability. I’m going to expand a little on what I said there.

I was posting to comp.arch.storage when Artur was still a little wokling, so I’ve had ample opportunity to see how a new technology gets from “exotic” to mainstream. Along the way there will always be some people who promote it as a panacea and some who condemn it as useless. Neither position requires much thought, and progress always comes from those who actually think about how to use the Hot New Thing to complement other approaches instead of expecting one to supplant the other completely. So it is with SSDs, which are a great addition to the data-storage arsenal but cannot reasonably be used as a direct substitute either for RAM at one end of the spectrum or for spinning disks at the other. Instead of putting all data on SSDs, we should be thinking about how to put the right data on them. As it turns out, there are several levels at which this can be done.

  • For many years, operating systems have implemented all sorts of ways to do prefetching to get data into RAM when it’s likely to be accessed soon, and bypass mechanisms to keep data out of RAM when it’s not (e.g. for sequential I/O). Processor designers have been doing similar things going from RAM to cache, and HSM folks have been doing similar things going from tape to disk. These basic approaches are also applicable when the fast tier is flash and the slow tier is spinning rust.
  • At the next level up, filesystems can evolve to take better advantage of flash. For example, consider a filesystem designed to keep not just journals but actual metadata on flash, with the actual data on disks. In addition to the performance benefits, this would allow the two resources to be scaled independently of one another. Databases and other software at a similar level can make similar improvements.
  • Above that level, applications themselves can make useful distinctions between warm and cool data, keeping the former on flash and relegating the latter to disk. It even seems that the kind of data being served up by Wikia is particularly well suited to this, if only they decided to think and write code instead of throwing investor money at their I/O problems.

Basically what it all comes down to is that you might not need all those IOPS for all of your data. Don’t give me that “if you don’t use your data” false-dichotomy sound bite either. Access frequency falls into many buckets, not just two, and a simplistic used/not-used distinction is fit only for a one-bit brain. If you need a lot of machines for their CPU/memory/network performance anyway, and thus don’t need half a million IOPS per machine, then spending more money to get them is just a wasteful ego trip. By putting just a little thought into using flash and disk to complement one another, just about anyone should be able to meet their IOPS goals for lower cost and use the money saved on real operational improvements.
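To make the "many buckets" point concrete, here is a minimal sketch of frequency-based placement across three tiers. Everything in it is an assumption for illustration: the function names, the slot counts, and above all the access log, which is made up to be skewed and is not measured from any real system.

```python
from collections import Counter

def place_by_frequency(access_log, ram_slots, flash_slots):
    """Assign the most-accessed keys to RAM, the next to flash, the rest to disk."""
    counts = Counter(access_log)
    # Sort by descending access count; break ties by key for determinism.
    ranked = sorted(counts, key=lambda k: (-counts[k], k))
    placement = {}
    for rank, key in enumerate(ranked):
        if rank < ram_slots:
            placement[key] = "ram"
        elif rank < ram_slots + flash_slots:
            placement[key] = "flash"
        else:
            placement[key] = "disk"
    return placement, counts

def hits_per_tier(placement, counts):
    """Fraction of all accesses that each tier would have served."""
    total = sum(counts.values())
    served = Counter()
    for key, n in counts.items():
        served[placement[key]] += n
    return {tier: served[tier] / total for tier in ("ram", "flash", "disk")}

# Made-up, skewed access log: a few hot items and a long cool tail.
log = ["a"] * 50 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5 + list("efghijklmn")
placement, counts = place_by_frequency(log, ram_slots=1, flash_slots=2)
fractions = hits_per_tier(placement, counts)
print(fractions)
```

With this particular toy log, three of the fourteen items sit on the fast tiers and the other eleven only ever see disk. Real workloads will be less tidy, which is exactly why the locality analysis has to be done on real access data rather than assumed.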

My current project is based on GlusterFS, which relies heavily on dlopen/dlsym to load its “translators” and other modules. Mostly this works great, and this modularity is the main reason I’m using GlusterFS in the first place, but yesterday I had to debug an interesting glitch that might be of interest to other people using similar plugin-based architectures.

The immediate problem had to do with GlusterFS’s quota feature. It turns out that the functionality is split across two translators, with most of it in the quota translator but some parts in the marker translator instead. I’m not sure why this is the case and suspect it shouldn’t be, but that pales in comparison to the fact that quota is run on clients. Huh? That makes the system trivially prone to quota cheating, which is simply going to be unacceptable in many situations – especially the “cloud billing” situation that’s promoted as a primary use of this feature. One of the nice things about the translator model is that you can move translators up or down in the “stack” or even move them across the server/client divide with relative ease, so I decided to try running quota on the server. I was a bit surprised when it blew up immediately, before the first client even finished mounting, but this kind of surprise is exactly why we do testing so I started to debug the problem.

The crash was a segfault in quota_lookup_cbk, which is called on the way out of the first lookup on the volume root. It looked like we were trying to free the “local” structure associated with this call, as we should be, but one of the component structures contained a bogus pointer – 0x85, which has never been a valid pointer on any operating system I’ve used and isn’t a common ASCII character either. Weird. Since GlusterFS is in user space, I have the rare (to me) luxury of using a debugger to step through the first part of the lookup code, but that only showed that everything seemed to be initialized properly. Then I started reading the quota code to see where the structure involved might get set to anything but zero. There didn’t seem to be any. Breakpoints on the functions that would have been involved in such a thing were never hit. I went back and stepped through the initialization again to see if there was anything I’d missed, and that’s when I realized that dynamic loading was involved.

What I noticed, as I stepped through the code, was that at one moment I was in quota_lookup at quota.c:688, about to call quota_local_new. When I stepped into that function, though, I found myself at marker-quota-helper.c:322 – part of a whole different translator. It didn’t take long from there to see that marker has its very own function named quota_local_new, so the problem clearly related to the duplicate symbol names. These duplicate symbols wouldn’t occur unless the two translators were both loaded in the same process, so now I knew why nobody at Gluster had seen the problem, but how exactly do the duplicate symbols cause it and what could I do to fix it? After more investigation, I saw that the dlopen(3) call that GlusterFS uses to load translators specifies the RTLD_GLOBAL flag. What does this mean? Here’s the man page:

RTLD_GLOBAL
The symbols defined by this library will be made available for
symbol resolution of subsequently loaded libraries.

Oh. A quick check verified that quota_local_new became valid when marker was loaded first, and remained at the same value even when quota was loaded subsequently, so quota_lookup was using the wrong version of quota_local_new. The marker version of this function does the wrong kind of initialization on the wrong kind of structure as far as quota is concerned, but this is all way after the compiler does all of its type checking so even if that checking were stronger it wouldn’t catch this. We get back a pointer to the wrong kind of structure, initialized the wrong way, and the only surprise is that we don’t blow up before we try to free it in quota_lookup_cbk.
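The lookup order behind this can be sketched with a toy model. This is not the real dynamic linker and not GlusterFS code: ToyLoader and marker_helper are hypothetical names, and the model simply assumes that symbols already in the global scope are consulted before a newly loaded library's own definitions, which is what the behavior described above suggests.

```python
class ToyLoader:
    """Toy model of symbol lookup order; NOT the real dynamic linker."""

    def __init__(self):
        # Libraries loaded with the global flag, in load order.
        self.global_scope = []

    def load(self, name, defines, make_global=True):
        lib = {"name": name, "defines": set(defines)}
        if make_global:
            self.global_scope.append(lib)
        return lib

    def resolve(self, lib, symbol):
        """Return the name of the library whose definition `lib` ends up calling."""
        # Assumption: the global scope is searched first, in load order...
        for candidate in self.global_scope:
            if symbol in candidate["defines"]:
                return candidate["name"]
        # ...and only then the library's own definitions.
        if symbol in lib["defines"]:
            return lib["name"]
        raise LookupError(symbol)

loader = ToyLoader()
# marker is loaded first, so its quota_local_new reaches the global scope first.
marker = loader.load("marker", {"quota_local_new", "marker_helper"})
quota = loader.load("quota", {"quota_local_new", "quota_lookup"})
winner = loader.resolve(quota, "quota_local_new")
print(winner)
```

In this model the call from quota resolves to marker's definition purely because of load order, and nothing resembling a type check ever gets a chance to object.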

So much for diagnosis. How about a fix? Most translator functions and dispatch tables are explicitly looked up using dlsym, and loading multiple translators wouldn’t work at all if RTLD_GLOBAL caused the wrong symbol to be returned in that case. I can’t think of any cases where code in one translator intentionally depends on a symbol exported from another instead of using the dispatch tables and such provided by the translator framework, so maybe using RTLD_LOCAL instead of RTLD_GLOBAL would help. Rather surprisingly, it doesn’t; quota_local_new still retains its marker value even after quota is loaded. That seems like a bug, but I can’t be bothered debugging dlopen. Another flag-based solution that initially seemed promising was RTLD_DEEPBIND. Here’s the man page again.

RTLD_DEEPBIND (since glibc 2.3.4)
Place the lookup scope of the symbols in this library ahead of
the global scope. This means that a self-contained library
will use its own symbols in preference to global symbols with
the same name contained in libraries that have already been
loaded. This flag is not specified in POSIX.1-2001.

Whether it works or not seems a bit irrelevant, though, since it’s non-portable and introduces a few problems of its own. Even the guy who added it seems to think it’s a bad idea. In the end, I adopted what many probably thought was the obvious solution: rename the conflicting symbols. I actually found four of them using “nm” and renamed the versions in marker, so now everything works.
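For anyone who wants to hunt for this kind of clash before it bites, a rough sketch of the "nm" check might look like the following. It is not the procedure I actually followed, just one way to automate it. It assumes GNU nm is installed for the scan function, and the sample nm output, addresses, file names and marker_helper are hypothetical.

```python
import subprocess
from collections import defaultdict

def defined_symbols(nm_output):
    """Parse nm text output and return names of defined global text symbols (type 'T')."""
    names = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols have three fields: address, type, name.
        if len(parts) == 3 and parts[1] == "T":
            names.add(parts[2])
    return names

def find_duplicates(symbols_by_lib):
    """Map each symbol defined by more than one library to the sorted list of definers."""
    owners = defaultdict(list)
    for lib, names in symbols_by_lib.items():
        for name in names:
            owners[name].append(lib)
    return {name: sorted(libs) for name, libs in owners.items() if len(libs) > 1}

def scan(paths):
    """Run nm on real shared objects; requires binutils to be installed."""
    table = {}
    for path in paths:
        out = subprocess.run(["nm", "-D", "--defined-only", path],
                             capture_output=True, text=True, check=True).stdout
        table[path] = defined_symbols(out)
    return find_duplicates(table)

# Hypothetical nm output, for illustration only.
marker_nm = """0000000000001a10 T quota_local_new
0000000000001b40 T marker_helper
                 U malloc
"""
quota_nm = """0000000000002c00 T quota_local_new
0000000000002d80 T quota_lookup
"""
dups = find_duplicates({"marker.so": defined_symbols(marker_nm),
                        "quota.so": defined_symbols(quota_nm)})
print(dups)
```

On real plugins a scan like this will probably also flag entry points that every module defines on purpose and that the framework fetches with dlsym, so the output is a list of candidates to inspect rather than a list of bugs.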

The moral of the story, and the reason I’m writing about this instead of filing it away as just another among hundreds of other debugging stories that nobody else will ever care about, is that plugins and dynamic loading can be trickier than you think. This one would have been easy to miss if I had just stepped over quota_local_new instead of stepping into it, or if I hadn’t happened to notice the line numbers, or if I hadn’t already tangled with linkers and loaders enough to know that the way symbols get resolved can lead to some pretty “spooky” results. Maybe somebody reading this, or searching for terms like dlopen or RTLD_GLOBAL, will find this and be saved some tedious debugging.

Tom Trainer wrote what was supposed to be a thoughtful examination of what “cloud storage” should mean, but it came across as a rather nasty anti-Isilon hit piece. I tried to reply there, but apparently my comment won’t go through until I register with “UBM TechWeb” so they can sell me some crap, so I’m posting my response here. Besides being a defense of an unfairly maligned competitor – mine as well as Tom’s unnamed employer’s – it might help clarify some of the issues around what is or is not “real” cloud storage.

As the project lead for CloudFS, which addresses exactly the kinds of multi-tenancy and encryption you mention, I agree with many of your main points about what features are necessary for cloud storage. Where I disagree is with your (mis)characterization of Isilon to make those points.

* First, their architecture is far from monolithic. Yes, OneFS is proprietary, but that’s a *completely* different thing.

* Second, scaling to 144 servers is actually pretty good. When you look closely at what many vendors/projects claim, you find out that they’re actually talking about clients . . . and any idiot can put together thousands of clients. Conflating node counts with server counts was a dishonest trick when I caught iBrix doing it years ago, and it’s a dishonest trick now. Even the gigantic “Spider” system at ORNL only has 192 servers, and damn few installations need even half of that. It’s probably a support limit rather than an architectural limit. No storage vendor supports configurations bigger than they’ve tested, and testing even 144 servers can get pretty expensive – at least if you do it right. I’m pretty sure that Isilon would raise that limit if somebody asked them for a bigger system and let them use that configuration for testing.

* Third, Isilon does have a “global” namespace as that term is usually used – i.e. at a logical level, to mean that the same name means the same thing across multiple servers, just like a “global variable” represents the same thing across multiple modules or processes. Do you expect global variables to be global in a physical sense too? In common usage, people use terms like “WAN” or “multi-DC” or “geo” to mean distribution across physical locations, and critiquing a vendor for common usage of a term makes your article seem like even more of a paid-for attack piece.

Disclaimer: I briefly evaluated and helped deploy some Isilon gear at my last job (SiCortex). I respect the product and I like the people, but I have no other association with either.