Eee Pad Thoughts

A few days ago, I got my Eee Pad Transformer. I guess you could say I didn’t get the Transformer part, because I only got the tablet part without the accompanying keyboard. So far, I’m loving it. It’s a very convenient size, the screen is gorgeous, performance and battery life seem excellent. I’ve had to spend some time getting used to Android, but so far I like that too. There’s also something much more satisfying about interacting by touch than through a keyboard and/or mouse, but maybe that’s just novelty. Here are some of the apps I’ve been using.

  • Built-in mail client. Not bad, so far it seems much better than the one that’s on my (non-Snow Leopard) Mac.
  • Seesmic for Twitter. I tried a few others, but none offered the combination of features I want. Twimbow for Android would rock.
  • Mobo Player (plus codecs) for playing unconverted video. I’m never going back to watching video on my iPod Touch while I exercise. Having a screen this large, but flat enough to put on the elliptical’s console, is fantastic. Playback has been smooth as butter, too.
  • ConnectBot (ssh client), plus Hacker’s Keyboard so I can have control keys.
  • Dungeon Defender. Yeah, I got this for entertainment as well as work.
  • Google Reader, Calendar, etc. through the browser.

All of this software was easily installed from the app market, and cost me zero. The ASUS on-screen keyboard seems a little nicer than the stock Android one, and it seems to suffice for short messages, but I don’t think I’d want to sit through an extended session of trying to use bash and vi that way. If I wanted to take this as my sole computer on a trip, I’d definitely get the keyboard – not only for faster typing but for extended battery life as well. I’d probably want to get VPN and IRC clients too, though to be honest I’ve gotten tired of VPN glitches and have switched to doing everything through ssh tunnels and SOCKS proxies when I’m working from home (like now), so maybe I’d skip the VPN.

Overall, I’d say my initial impression has been very positive. Maybe I’ll check back in a month or two and tell people how it fares over the longer term.

Don’t Be a Jerk

Charles Hooper wrote an interesting article about Letting Tech People be Socially Inept, in response to a recent incident at ThisWebHost where a technical director got mad at a customer and deleted data. There’s so much wrong here that it’s hard to know where to begin. “Jules” at This* was totally in the wrong. Charles is also wrong when he says that every position is customer-facing. Some people can do really great work only interacting with one person – usually their boss – and referring to that one person as a “customer” seems rather facile to me. Most of all, though, the puntwits on Hacker News are wrong too.

Probably the most common theme in the HN responses is that high technical skill and eschewing social niceties are closely and necessarily related because they both involve pruning away extraneous detail to get to the heart of something complex. For example, the very first comment there refers to “seemingly-meaningless boilerplate and social grease that we call people skills” and sets the tone for many that follow. The problem I have with this is two-fold. First, a lot of the bad behavior I see from my colleagues has nothing to do with social niceties. Go look at How To Ruin a Project and you’ll see that many of the items listed there don’t have anything to do with social skills. You don’t need a fine understanding of social cues to realize it’s wrong to miss the point, focus on the trivial, or wage “guerilla warfare” against a decision you don’t like. Those things might well have social effects, but they’re wrong for purely practical reasons as well.

My other objection to the “boilerplate and grease” meme is that social cues actually do serve a practical purpose. They help to identify how strongly someone holds a belief, and how much importance they attach to the subject. Without that information, it’s easy to waste enormous amounts of time fighting wars that didn’t really need to be fought, and sometimes that leaves little time or energy – or poisons the atmosphere – for the debates that really do need to occur. One of the real problems I see with my fellow techies is a general lack of perspective, proportion, or priority. People who fail to give or read cues regarding these three important factors are demonstrating a deficiency that can affect even the most purely technical decision making. Being socially inept is not just cosmetic; it has a real and tangible effect on overall competence.

That brings me to the second most common theme in the HN responses: Asperger’s Syndrome. I’ve known quite a few real Aspies. I’ve known even more people who self-diagnose or self-identify that way as an excuse for being lazy about social interaction, so I won’t claim that I’m that way myself, but I will say I’m close enough (as is visibly the case for most of the other males in my family) to understand the challenges they face. I understand the “gravity” that always pulls one’s thoughts inward, and I know the pain of having to pull one’s attention away to deal with the “noise” that other people can generate. I will gladly do everything I can to accommodate people for whom these burdens truly are greater. I’ll teach them coping strategies I’ve learned from others, act as interpreter, run interference for them, whatever. However, there’s a difference here. It’s one thing to have a hard time understanding social cues. It’s another thing to understand those cues perfectly well and use that knowledge to troll more effectively. It’s really not that hard to tell the difference between an Aspie and an asshole; those who throw spitballs from behind a shield meant for others are not only jerks but cowards as well.

My conclusion is that Charles was wrong about everyone being customer facing, but right about the more fundamental reality that we techies in general need to stop being such jerks. We need to stop enabling the jerks by applauding when they act in deliberately offensive ways on HN, on Twitter, in conference presentations, etc. We need to stop pretending that the combative style prevalent on HN or LKML is the best way to facilitate progress; there’s no empirical evidence that “culling the herd” or “honing one’s weapons” or other such bloody metaphors really apply. We need to stop encouraging young techies to expend their energy emulating those styles instead of developing real people skills. Social skills really do serve a useful purpose, and anyone can improve them. That doesn’t make you less technical; it makes you more adult.

Running an Open Source Project

I think a lot of medium-to-senior programmers think it would be cool to run their own open-source project, particularly if they have chafed under the leadership of someone else on another project. I’m not going to say it’s not fun, but after having done this for a while I think a word of warning is in order. You’re going to spend a lot of your time, for long stretches practically all of it, doing things besides the design/code/test cycle that individual contributors get to focus on. Here’s an assuredly partial list.

  • Be your own IT staff. Set up a source code repository, mailing list, bug tracker, website, etc. Sure, there are distros and forges that will be glad to help with some of this, and it’s definitely better than having to set up every bit of software and keep up with every security issue on your own leased machines yourself, but even then you get to deal with someone else’s infrastructure team and release schedule.
  • Be your own release engineer. The second biggest time-suck in your life is going to be managing branches and shepherding patches. As much as you might hate rigid coding standards and checkin policies, you’re almost certainly going to be the one defining and enforcing those for any project with more than (in my experience) two developers, because otherwise things very quickly get out of control. On top of that, you get to deal with packaging as well, and packaging issues can often take more time to resolve than almost any technical issue.
  • Be your own HR department. Whether you’re actually hiring or just attracting developers for your project, you’re going to have to spend some time recruiting others’ participation. It’s actually harder for pure open source, because you can neither offer money nor do a proper interview. Some people who work in easy/popular specialties act as though interviews are passé anyway, but that’s total BS. There are fields where most of the advanced work is still being done behind closed doors. If you relied on GitHub profiles to hire people for work on distributed replication, SSD tiering, or especially de-duplication, you’d practically guarantee that you’re getting the wannabes instead of the true experts (not that the “GitHub or git out” crowd is qualified to tell the difference). You’re going to end up interviewing and evaluating people either before they go through the door or after. Since it’s really hard to get rid of an incompetent or toxic person after, you owe it to yourself and the other members of your team to do the filtering before.
  • Be your own marketing department. If you work at an actual company, whether startup or established, you might have full-time marketing folks to attract business interest, but they’ll be all but totally useless (and uninterested) when it comes to attracting technical interest, so this is going to be the biggest expenditure of your time. Blog, tweet, attend conferences, respond to inquiries in email and IRC. Lather, rinse, repeat. If you’re lucky, you’ll get to spend time presenting to customers/partners as well, which means even more time away from the code but is obviously worth it. “Evangelism” is necessary partly as part of your recruiting strategy – see above – but also so that when you do talk to people about deploying your code you don’t find yourself taking a torpedo in the side from some techie who’s either unfamiliar with your project or already inclined to use something else.
  • Write your own documentation. Developers hate writing documentation. They even hate reviewing documentation. However, if you’re running your own project you probably won’t have any trained and dedicated doc writers until very late in the game, if ever, so you get to provide the documentation yourself. No, a wiki doesn’t cut it, and neither does a FAQ on your website. If you’re really serious about letting your users figure things out for themselves instead of bugging you on mailing lists or IRC (which is even more time away from coding), then you’ll need to write not just technical documentation but end-user documentation as well. Writing man pages in nroff might seem like chipping flakes off a piece of flint, but you’ll still have to do it.
  • Deal with legal issues. If you’re lucky you can avoid patents, but – as I’ve found out – you can’t avoid trademarks. You might also deal with contributor agreements and such where your project relates to others, even if you eschew them on your own.

So now you’re spending 30% of your time on recruiting and evangelism, 25% of your time playing release engineer, 20% of your time doing all these other things. What do you do with the other 25% of your time that you actually get to spend on code? Probably more than half of that time will be spent on “peripheral” pieces of the code – selecting libraries and debugging their problems, writing config/argument parsers, running static code analysis (including memory-leak analysis) tools, or generally filling in wherever there are gaps. Maybe you’ll get to spend 10% of your time doing the things that you started the project to do. Maybe not.

I don’t mean to seem like a total downer. Even spending 10% of your time on the parts you really enjoy can be worth it if that allows you to prove a point or make the world a better place. It’s better than working on someone else’s dream (or merely lining their pockets) during the day, and getting to spend only tiny scraps of your “spare” time pursuing your own dream. I’m just trying to sound a note of caution here. Be aware that when you turn your tinkering into an actual project you lose a lot of control over both its direction and your own involvement. Personally I don’t subscribe to the open-source mantra that you should start inviting others to participate as soon as you have an idea. That’s great for the kibitzers, but it’s not so great for the players who might actually be better off keeping an idea to themselves for a while before giving up the chance to work on it the way they want to.

(Yes, I’m playing devil’s advocate a bit here. There’s enough “big happy family” rah-rah out there already. Average that with the deliberately negative view I present here, and you might end up somewhere near reality.)

SeaMicro’s New Machines

I read about this a few days ago on the Green Data Center Blog, and tried to comment there, but apparently Dave Ohara either isn’t checking his moderation queue or doesn’t like skeptical comments about companies whose press releases he’s re-publishing. I’ll try to reconstruct the gist of that comment here.

Since working at SiCortex, I’ve kept a bit of an eye on other companies trying to produce high-density, high-efficiency systems, like SeaMicro and Calxeda. Frankly, I find SeaMicro’s “throughput computing” spin very unconvincing. A little over a year ago I took a look at their architecture and reached the conclusion that all those processors were going to starve – not enough memory to let data rest, not enough network throughput to keep it moving. Now they have somewhat more memory and a lot more CPU cycles, but as far as I can tell the exact same network and storage capabilities, so the starvation problem is going to be even more severe. Even worse, the job posting Dave forwards for them (if he doesn’t get a fee for that he should) seems to indicate that they’re still pursuing an Ethernet-centric interconnect strategy. That will keep them well behind where SiCortex, Cray, IBM and others were years ago in terms of internal bandwidth for similar systems.

On the other hand, even if Calxeda’s more radical departure from “me too” computing seems more likely to yield something useful, it would be unfair to contrast their still-hoped-for systems to SeaMicro’s actually-shipping ones. Come on, Calxeda, ship something so we can actually make that comparison.

Efficiency, Performance, and Locality

While it might have been overshadowed by events on my other blog, my previous post on Solid State Silliness did lead to some interesting conversations. I’ve been meaning to clarify some of the reasoning behind my position that one should use SSDs for some data instead of all data, and that reasoning applies to much more than just the SSD debate, so here goes.

The first thing I’d like to get out of the way is the recent statement by everyone’s favorite SSD salesman that “performant systems are efficient systems”. What crap. There are a great many things that people do to get more performance (specifically in terms of latency) at the expense of wasting resources. Start with every busy-wait loop in the world. Another good example is speculative execution. There, the waste is certain – you know you’re not going to execute both sides of a branch – but it’s often done anyway because it lowers latency. It’s not efficient in terms of silicon area, it’s not efficient in terms of power, it’s not efficient in terms of dollars, but it’s done anyway. (This is also, BTW, why a system full of relatively weak low-power CPUs really can do some work more efficiently than one based on Xeon power hogs, no matter how many cores you put on each hog.) Other examples of increased performance without increased efficiency include most kinds of pre-fetching, caching, or replication. Used well, these techniques actually can improve efficiency as requests need to “penetrate” fewer layers of the system to get data, but used poorly they can be pure waste.

If you’re thinking about performance in terms of throughput rather than latency, then the equation of performance with efficiency isn’t quite so laughable, but it’s still rather simplistic. Every application has a certain ideal balance of CPU/memory/network/storage performance. It might well be the case that thinner “less performant” systems with those performance ratios are more efficient – per watt, per dollar, whatever – than their fatter “more performant” cousins. Then the question becomes how well the application scales up to the higher node counts, and that’s extremely application-specific. Many applications don’t scale all that well, so the “more performant” systems really would be more efficient. (I guess we can conclude that those pushing the “performance = efficiency” meme are used to dealing with systems that scale poorly. Hmm.) On the other hand, some applications really do scale pretty well to the required node-count ranges, and then the “less performant” systems would be more efficient. It’s a subject for analysis, not dogmatic adherence to one answer.

The more important point I want to make isn’t about efficiency. It’s about locality instead. As I mentioned above, prefetch and caching/replication can be great or they can be disastrous. Locality is what makes the difference, because these techniques are all based on exploiting locality of reference. If you have good locality, fetching the same data many times in rapid succession, then these techniques can seem like magic. If you have poor locality, then all that effort will be like saving leftovers in the refrigerator to save cooking time . . . only to throw them away before they’re used. One way to look at this is to visualize data references on a plot, with time on the X axis and location on the Y axis, using the Z axis, color, or dot size to represent density of accesses . . . like this.


[Figure: time/location plot of data accesses]

It’s easy to see patterns this way. Vertical lines represent accesses to a lot of data in a short amount of time, often in a sequential scan. If the total amount of data is greater than your cache size, your cache probably isn’t helping you much (and might be hurting you) because data accessed once is likely to get evicted before it’s accessed again. This is why many systems bypass caches for recognizably sequential access patterns. Horizontal lines represent constant requests to small amounts of data. This is a case where caches are great. It’s what they’re designed for. In a multi-user and/or multi-dataset environment, you probably won’t see many thick edge-to-edge lines either way. You’ll practically never see the completely flat field that would result from completely random access either. What you’ll see the most of are partial or faint lines, or (if your locations are grouped/sorted the right way) rectangles and blobs representing concentrated access to certain data at certain times.
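
Before moving on to the blobs, here’s roughly what that sequential-bypass heuristic might look like. This is a minimal sketch with made-up names and an arbitrary threshold, not code from any particular cache implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct stream {
    uint64_t next_expected;   /* offset we'd see if the scan continues */
    unsigned run_length;      /* consecutive sequential accesses so far */
};

/* Decide whether this access looks like part of a long sequential scan
 * that would only evict hotter data if we cached it. */
static bool should_bypass_cache(struct stream *s, uint64_t offset, uint32_t len)
{
    if (offset == s->next_expected)
        s->run_length++;
    else
        s->run_length = 0;        /* pattern broken, start over */
    s->next_expected = offset + len;
    return s->run_length >= 8;    /* arbitrary threshold for "it's a scan" */
}

int main(void)
{
    struct stream s = { 0, 0 };

    for (uint64_t off = 0; off < 16 * 4096; off += 4096)
        printf("offset %6llu -> %s\n", (unsigned long long)off,
               should_bypass_cache(&s, off, 4096) ? "bypass cache" : "cache");
    return 0;
}

The real logic in a production cache is messier – per-file state, readahead, competing streams – but the principle is just recognizing a vertical line in the plot and refusing to let it flush everything else.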

Exploiting these blobs is the real fun part of managing data-access performance. Like many things, they tend to follow a power-law distribution – 50% of the accesses are to 10% of the data, 25% of the accesses are to the next 10%, and so on. This means that you very rapidly reach the point of diminishing returns, and adding more fast storage – be it more memory or more flash – is no longer worth it. When you consider time, this effect becomes even more pronounced. Locality over short intervals is likely to be significantly greater than that over long intervals. If you’re doing e-commerce, certain products are likely to be more popular at certain times and you’re almost certain to have sessions open for a small subset of your customers at any time. If you can predict such a transient spike, you can migrate the data you know you’ll need to your highest-performance storage before the spike even begins. Failing that, you might still be able to detect the spike early enough to do some good. What’s important is that the spike is finite in scope. Only a fool, given such information, would treat their hottest data exactly the same as their coldest. Only a bigger fool would fail to gather that information in the first place.
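
To put rough numbers on how quickly those returns diminish, here’s a trivial back-of-the-envelope sketch that just iterates the idealized “50% of accesses to 10% of the data” pattern described above. Real distributions won’t be this tidy, but the shape is the point.

#include <stdio.h>

int main(void)
{
    double served = 0.0;   /* cumulative fraction of accesses hitting the fast tier */
    double share  = 0.5;   /* hottest 10% of the data gets 50% of the accesses */

    for (int slice = 1; slice <= 10; slice++) {
        served += share;
        printf("fast tier holds %3d%% of data -> serves %7.3f%% of accesses\n",
               slice * 10, served * 100.0);
        share /= 2.0;      /* each further 10% of data gets half the traffic */
    }
    return 0;
}

By the fourth or fifth slice you’re paying for fast storage that captures only single-digit percentages of the remaining traffic.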

Since this all started with Artur Bergman’s all-SSD systems, let’s look at how these ideas might play out at a place like Wikia. Wikia runs a whole lot of special-interest wikis. Their top-level categories are entertainment, gaming, and lifestyle, though I’m sure they host wikis on other kinds of subjects as well. One interesting property of these wikis is that each one is separate, which seems ideal for all kinds of partitioning and differential treatment of data. At the very grossest level, it seems like it should be trivial to keep some of the hottest wikis’ data on SSDs and relegate others to spinning disks. Then there’s the temporal-locality thing. The access pattern for a TV-show wiki must be extremely predictable, at least while the show’s running. Even someone as media-ignorant as me can guess that there will be a spike starting when an episode airs (or perhaps even a bit before), tailing off pretty quickly after the next day or two. Why on Earth would someone recommend the same storage for content related to a highly rated and currently running show as for a show that was canceled due to low ratings a year ago? I don’t know.

Let’s take this a bit further. Using Artur’s example of 80TB and a power-law locality pattern, let’s see what happens. What if we have a single 48GB machine, with say 40GB available for caching? Using the “50% of accesses to 10% of the data” pattern, that means only 3.125% of accesses even have to go beyond memory. No matter what the latency difference between flash and spinning disks might be, it’s only going to affect that 3.125% of accesses, so it’s not going to affect your average latency that much. Even if you look at 99th-percentile latency, it’s fairly easy to see that adding SSD up to only a few times memory size will reduce the level of spinning-disk accesses to noise. Factor in temporal locality and domain-specific knowledge about locality, and the all-SSD case gets even weaker. Add more nodes – therefore more memory – and it gets weaker still. Sure, you can assume a flatter access distribution, but in light of all these other considerations you’d have to take that to a pretty unrealistic level before the all-SSD prescription starts to look like anything but quackery.

Now, maybe Artur will come along to tell me about how my analysis is all wrong, how Wikia really is such a unique special flower that principles applicable to a hundred other systems I’ve seen don’t apply there. The fact is, though, that those other hundred systems are not well served by using SSDs profligately. They’ll be wasting their owners’ money. Far more often, if you want to maximize IOPS per dollar, you’d be better off using a real analysis of your system’s locality characteristics to invest in all levels of your memory/storage hierarchy appropriately.

Solid State Silliness

Apparently Artur Bergman did a very popular talk about SSDs recently. It’s all over my Twitter feed, and led to a pretty interesting discussion at High Scalability. I’m going to expand a little on what I said there.

I was posting to comp.arch.storage when Artur was still a little wokling, so I’ve had ample opportunity to see how a new technology gets from “exotic” to mainstream. Along the way there will always be some people who promote it as a panacea and some who condemn it as useless. Neither position requires much thought, and progress always comes from those who actually think about how to use the Hot New Thing to complement other approaches instead of expecting one to supplant the other completely. So it is with SSDs, which are a great addition to the data-storage arsenal but cannot reasonably be used as a direct substitute either for RAM at one end of the spectrum or for spinning disks at the other. Instead of putting all data on SSDs, we should be thinking about how to put the right data on them. As it turns out, there are several levels at which this can be done.

  • For many years, operating systems have implemented all sorts of ways to do prefetching to get data into RAM when it’s likely to be accessed soon, and bypass mechanisms to keep data out of RAM when it’s not (e.g. for sequential I/O). Processor designers have been doing similar things going from RAM to cache, and HSM folks have been doing similar things going from tape to disk. These basic approaches are also applicable when the fast tier is flash and the slow tier is spinning rust.
  • At the next level up, filesystems can evolve to take better advantage of flash. For example, consider a filesystem designed to keep not just journals but actual metadata on flash, with the actual data on disks. In addition to the performance benefits, this would allow the two resources to be scaled independently of one another. Databases and other software at a similar level can make similar improvements.
  • Above that level, applications themselves can make useful distinctions between warm and cool data, keeping the former on flash and relegating the latter to disk. It even seems that the kind of data being served up by Wikia is particularly well suited to this, if only they decided to think and write code instead of throwing investor money at their I/O problems.

Basically what it all comes down to is that you might not need all those IOPS for all of your data. Don’t give me that “if you don’t use your data” false-dichotomy sound bite either. Access frequency falls into many buckets, not just two, and a simplistic used/not-used distinction is fit only for a one-bit brain. If you need a lot of machines for their CPU/memory/network performance anyway, and thus don’t need half a million IOPS per machine, then spending more money to get them is just a wasteful ego trip. By putting just a little thought into using flash and disk to complement one another, just about anyone should be able to meet their IOPS goals for lower cost and use the money saved on real operational improvements.
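
For the record, here’s a toy sketch of what “many buckets” might look like. The names, access counts, and tier thresholds are all invented for illustration, not a description of anyone’s actual system.

#include <stdio.h>

struct object {
    const char *name;
    unsigned long accesses_per_day;
};

/* Map access frequency onto a storage tier -- more than one bit of input,
 * but hardly rocket science. */
static const char *tier_for(unsigned long accesses_per_day)
{
    if (accesses_per_day >= 100000) return "RAM";
    if (accesses_per_day >= 1000)   return "SSD";
    if (accesses_per_day >= 10)     return "spinning disk";
    return "archive";
}

int main(void)
{
    /* Hypothetical objects with made-up access counts. */
    struct object objects[] = {
        { "current-show-wiki/home",       250000 },
        { "current-show-wiki/episode-42",   4200 },
        { "cancelled-show-wiki/home",         35 },
        { "cancelled-show-wiki/episode-1",     2 },
    };

    for (size_t i = 0; i < sizeof(objects) / sizeof(objects[0]); i++)
        printf("%-32s -> %s\n", objects[i].name,
               tier_for(objects[i].accesses_per_day));
    return 0;
}

The point isn’t the particular thresholds; it’s that the mapping from access frequency to storage tier has more than one bit of input.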

Fun With Dynamic Loading

My current project is based on GlusterFS, which relies heavily on dlopen/dlsym to load its “translators” and other modules. Mostly this works great, and this modularity is the main reason I’m using GlusterFS in the first place, but yesterday I had to debug an interesting glitch that might be of interest to other people using similar plugin-based architectures.

The immediate problem had to do with GlusterFS’s quota feature. It turns out that the functionality is split across two translators, with most of it in the quota translator but some parts in the marker translator instead. I’m not sure why this is the case and suspect it shouldn’t be, but that pales in comparison to the fact that quota is run on clients. Huh? That makes the system trivially prone to quota cheating, which is simply going to be unacceptable in many situations – especially the “cloud billing” situation that’s promoted as a primary use of this feature. One of the nice things about the translator model is that you can move translators up or down in the “stack” or even move them across the server/client divide with relative ease, so I decided to try running quota on the server. I was a bit surprised when it blew up immediately, before the first client even finished mounting, but this kind of surprise is exactly why we do testing so I started to debug the problem.

The crash was a segfault in quota_lookup_cbk, which is called on the way out of the first lookup on the volume root. It looked like we were trying to free the “local” structure associated with this call, as we should be, but one of the component structures contained a bogus pointer – 0x85, which has never been a valid pointer on any operating system I’ve used and isn’t a common ASCII character either. Weird. Since GlusterFS is in user space, I have the rare (to me) luxury of using a debugger to step through the first part of the lookup code, but that only showed that everything seemed to be initialized properly. Then I started reading the quota code to see where the structure involved might get set to anything but zero. There didn’t seem to be any such place. Breakpoints on the functions that would have been involved in such a thing were never hit. I went back and stepped through the initialization again to see if there was anything I’d missed, and that’s when I realized that dynamic loading was involved.

What I noticed, as I stepped through the code, was that at one moment I was in quota_lookup at quota.c:688, about to call quota_local_new. When I stepped into that function, though, I found myself at marker-quota-helper.c:322 – part of a whole different translator. It didn’t take long from there to see that marker has its very own function named quota_local_new, so the problem clearly related to the duplicate symbol names. These duplicate symbols wouldn’t occur unless the two translators were both loaded in the same process, so now I knew why nobody at Gluster had seen the problem, but how exactly do the duplicate symbols cause it and what could I do to fix it? After more investigation, I saw that the dlopen(3) call that GlusterFS uses to load translators specifies the RTLD_GLOBAL flag. What does this mean? Here’s the man page:

RTLD_GLOBAL
The symbols defined by this library will be made available for
symbol resolution of subsequently loaded libraries.

Oh. A quick check verified that quota_local_new became valid when marker was loaded first, and remained at the same value even when quota was loaded subsequently, so quota_lookup was using the wrong version of quota_local_new. The marker version of this function does the wrong kind of initialization on the wrong kind of structure as far as quota is concerned, but this is all way after the compiler does all of its type checking so even if that checking were stronger it wouldn’t catch this. We get back a pointer to the wrong kind of structure, initialized the wrong way, and the only surprise is that we don’t blow up before we try to free it in quota_lookup_cbk.
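
For anyone who wants to reproduce the effect in isolation, here’s a stripped-down sketch: three tiny files with deliberately colliding symbol names, not actual GlusterFS code.

/* marker.c -- build with: gcc -shared -fPIC -o marker.so marker.c */
#include <stdio.h>
#include <stdlib.h>

void *quota_local_new(void)
{
    printf("marker's quota_local_new called\n");
    return malloc(16);            /* the "wrong kind of structure" */
}

/* quota.c -- build with: gcc -shared -fPIC -o quota.so quota.c */
#include <stdio.h>
#include <stdlib.h>

void *quota_local_new(void)
{
    printf("quota's quota_local_new called\n");
    return calloc(1, 64);
}

void quota_lookup(void)
{
    /* This call is bound at run time.  With RTLD_GLOBAL, a library that
     * was loaded earlier and exports the same name wins, so we silently
     * get marker's version instead of our own. */
    void *local = quota_local_new();
    free(local);
}

/* main.c -- build with: gcc -o demo main.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *marker = dlopen("./marker.so", RTLD_NOW | RTLD_GLOBAL);
    void *quota  = dlopen("./quota.so",  RTLD_NOW | RTLD_GLOBAL);
    if (!marker || !quota) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    void (*lookup)(void) = (void (*)(void))dlsym(quota, "quota_lookup");
    if (lookup)
        lookup();    /* prints "marker's quota_local_new called" */
    return 0;
}

With marker.so loaded first, quota_lookup happily ends up in marker’s version of the function; reverse the load order and the problem reverses too.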

So much for diagnosis. How about a fix? Most translator functions and dispatch tables are explicitly looked up using dlsym, and loading multiple translators wouldn’t work at all if RTLD_GLOBAL caused the wrong symbol to be returned in that case. I can’t think of any cases where code in one translator intentionally depends on a symbol exported from another instead of using the dispatch tables and such provided by the translator framework, so maybe using RTLD_LOCAL instead of RTLD_GLOBAL would help. Rather surprisingly, it doesn’t; quota_local_new still retains its marker value even after quota is loaded. That seems like a bug, but I can’t be bothered debugging dlopen. Another flag-based solution that initially seemed promising was RTLD_DEEPBIND. Here’s the man page again.

RTLD_DEEPBIND (since glibc 2.3.4)
Place the lookup scope of the symbols in this library ahead of
the global scope. This means that a self-contained library
will use its own symbols in preference to global symbols with
the same name contained in libraries that have already been
loaded. This flag is not specified in POSIX.1-2001.

Whether it works or not seems a bit irrelevant, though, since it’s non-portable and introduces a few problems of its own. Even the guy who added it seems to think it’s a bad idea. In the end, I adopted what many probably thought was the obvious solution: rename the conflicting symbols. I actually found four of them using “nm” and renamed the versions in marker, so now everything works.

The moral of the story, and the reason I’m writing about this instead of filing it away as just another among hundreds of other debugging stories that nobody else will ever care about, is that plugins and dynamic loading can be trickier than you think. This one would have been easy to miss if I had just stepped over quota_local_new instead of stepping into it, or if I hadn’t happened to notice the line numbers, or if I hadn’t already tangled with linkers and loaders enough to know that the way symbols get resolved can lead to some pretty “spooky” results. Maybe somebody reading this, or searching for terms like dlopen or RTLD_GLOBAL, will find this and be saved some tedious debugging.

Fighting FUD Again

Tom Trainer wrote what was supposed to be a thoughtful examination of what “cloud storage” should mean, but it came across as a rather nasty anti-Isilon hit piece. I tried to reply there, but apparently my comment won’t go through until I register with “UBM TechWeb” so they can sell me some crap, so I’m posting my response here. Besides being a defense of an unfairly maligned competitor – mine as well as Tom’s unnamed employer’s – it might help clarify some of the issues around what is or is not “real” cloud storage.

As the project lead for CloudFS, which addresses exactly the kinds of multi-tenancy and encryption you mention, I agree with many of your main points about what features are necessary for cloud storage. Where I disagree is with your (mis)characterization of Isilon to make those points.

* First, their architecture is far from monolithic. Yes, OneFS is proprietary, but that’s a *completely* different thing.

* Second, scaling to 144 servers is actually pretty good. When you look closely at what many vendors/projects claim, you find out that they’re actually talking about clients . . . and any idiot can put together thousands of clients. Conflating node counts with server counts was a dishonest trick when I caught iBrix doing it years ago, and it’s a dishonest trick now. Even the gigantic “Spider” system at ORNL only has 192 servers, and damn few installations need even half of that. It’s probably a support limit rather than an architectural limit. No storage vendor supports configurations bigger than they’ve tested, and testing even 144 servers can get pretty expensive – at least if you do it right. I’m pretty sure that Isilon would raise that limit if somebody asked them for a bigger system and let them use that configuration for testing.

* Third, Isilon does have a “global” namespace as that term is usually used – i.e. at a logical level, to mean that the same name means the same thing across multiple servers, just like a “global variable” represents the same thing across multiple modules or processes. Do you expect global variables to be global in a physical sense too? In common usage, people use terms like “WAN” or “multi-DC” or “geo” to mean distribution across physical locations, and critiquing a vendor for common usage of a term makes your article seem like even more of a paid-for attack piece.

Disclaimer: I briefly evaluated and helped deploy some Isilon gear at my last job (SiCortex). I respect the product and I like the people, but I have no other association with either.

How To Ruin a Project

This is a bit of a riff on Jerry Weinberg’s ten commandments of egoless programming (via the also-excellent Jeff Atwood). I’ve found that many engineers, perhaps even a majority, respond more to aversion than to encouragement, so a “how not to” that can be inverted in one’s head sometimes works better than a “how to” taken straight. So here are the ways to turn a promising and fun project into a soul-sucking wreck. Take it from an expert.

  • Miss the point. Never try to figure out what the project’s really about, or what it will be used for. Add features that have nothing to do with any valid use, removing or breaking other essential features in the process. Develop and test on a totally inappropriate platform. Confuse scalability with performance, or test the wrong kind of performance and then complain about the results.
  • Claim authority you haven’t earned. Assume that your reputation or barely-relevant experience entitles you to sit at the head of the table before you’ve made any significant contribution. Push aside those who started the project, who have contributed/invested more, or are more affected by its outcome.
  • Focus on the trivial. Spend all of your time – and everyone else’s – on issues like coding style, source-control processes, and mailing-list etiquette. Carefully avoid any task that involves a more than superficial knowledge of the code. Make others do the heavy lifting, then take half credit because you added some final polish.
  • Be self-important. Make sure everyone knows this is the least important of your many projects. Insist that every work flow and library choice conform with your habits on those other projects so that your life will be easier, even (especially) if it’s to the detriment of those doing more work.
  • Be dismissive. Clearly, your specialty is the one requiring the greatest technical prowess. If you’re a kernel/C programmer, look down your nose at web/Ruby punks. If you’re a web/Ruby programmer, look down your nose at enterprise/Java drones. If you’re an enterprise/Java programmer, look down your nose at kernel/C neanderthals. If you’re the world’s greatest maintenance programmer, specializing in minor tweaks to twenty-year-old programs representing many person-years of prior work, criticize the “immaturity” of new code that does new things. Above all, treat the specialty most crucial to your project as the exclusive domain of novices and children, with you as the “adult supervision” to bring it up to snuff in the areas that experts really care about.
  • Argue from authority, not facts. If anybody provides empirical evidence to support their choice of a technique, algorithm, or style choice, ignore or reject it. Never provide any such evidence to support your own choices. Make sure everyone knows that your own personal experience trumps any such evidence, even if that experience is less (or less relevant) than others’.
  • Lecture. Use every interaction as an opportunity to “educate” others about the reasons (rationalizations) for your personal preferences. Bonus points if you deliver a lecture on a subject that the recipient actually knows better than you, without ever pausing to determine whether that’s the case.
  • Be persistent. If any decision ever goes against you, resist it forever. Treat every bug from then on, no matter how unrelated or trivial, as an excuse to beat the dead horse some more. If you’re at a company, use company time to pursue your approach instead of the approved one, and drag colleagues with you. If you’re working on open source, threaten to leave/fork.
  • Be a hypocrite. Take some valid principle – portability, scalability, loose coupling – and use it to demand invasive change from others, then make your own changes that violate the same principle. Demand code reviews for every checkin, aggressively review others’ patches, then check in your own changes unilaterally. Bonus points if your unilateral changes were clearly never tested and cause grievous breakage.
  • Be the martyr. After doing all of these other things, your colleagues might not be keen to work with you again. Make sure everyone knows you were just trying to help, that you tried ever so hard to make them better engineers, and that the lack of gratitude reflects on them instead of you.

Many thanks to the people I’ve worked with who have inspired this. Without you, the list would be much less comprehensive. I hope its inverse will help others participate in projects that are more successful and fulfilling for all involved as a result.

Amazon’s Own Post Mortem

Amazon has posted their own explanation of the recent EBS failure. Since I had offered some theories earlier, I think it’s worthwhile to close this out by comparing my theories with Amazon’s explanation. Specifically, I had suggested two things.

  • EBS got into a state where it didn’t know what had been replicated, and fell back to re-replicating everything.
  • There was inadequate flow control on the re-replication/re-mirroring traffic, causing further network overload.

It turns out that both theories were slightly correct but mostly incorrect. Here’s the most relevant part of Amazon’s account.

When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.

the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly

The first part refers to the sort of full re-mirroring that I had mentioned, although it was re-mirroring to a new replica instead of an old one. The last part is a classic congestion-collapse pattern: transient failure, followed by too-aggressive retries that turn the transient failure into a persistent one. I had thought this would apply to the data traffic, but according to Amazon it affected the “control plane” instead. This is also what caused it to affect multiple availability zones, since the control plane – unlike the data plane – spans availability zones within a region.
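
The textbook mitigation for that kind of retry storm is capped exponential backoff with jitter. Here’s a minimal sketch of the idea – my own illustration, obviously not Amazon’s code, with find_free_space as a made-up stand-in for “search the cluster for a re-mirroring target”.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Made-up stand-in for "search the cluster for somewhere to re-mirror";
 * here it just fails a few times before succeeding. */
static int find_free_space(void)
{
    static int attempts;
    return (++attempts < 5) ? -1 : 0;
}

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000, .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

int main(void)
{
    long window_ms = 100;              /* initial retry window */
    const long max_window_ms = 60000;  /* cap the backoff */

    srand((unsigned)time(NULL));
    while (find_free_space() != 0) {
        /* Full jitter: pick a random delay within the window so that
         * thousands of nodes don't all retry in lock-step. */
        long delay = rand() % window_ms;
        printf("no space found, retrying in %ld ms\n", delay);
        sleep_ms(delay);
        if (window_ms < max_window_ms)
            window_ms *= 2;            /* widen the window on each failure */
    }
    printf("found a re-mirroring target\n");
    return 0;
}

The jitter matters as much as the backoff itself: without it, every node’s retry clock stays synchronized to the original failure.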

The most interesting parts, to me, are the mentions of actual bugs – one in EBS and one in RDS. Here are the descriptions.

There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror.

Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required.

These bugs represent an important lesson for distributed-system designers: bugs strike without regard for location. Careful sharding and replication across machines and even sites won’t protect you against a bug that exists in every instance of the code. A while back, when I was attending the UCB retreats because of OceanStore, the “Recovery Oriented Computing” folks were doing some very interesting work on correlated failures. I remember some great discussions about distributing a system not just across locations but across software types and versions as well. This lesson has stuck with me ever since. For example, in iwhd the extended replication-policy syntax was developed with a specific goal of allowing replication across different back-end types (e.g. S3, OpenStack) or operating systems as well as different locations. Maybe distributing across different software versions wouldn’t have helped in Amazon’s specific case if the bugs involved had been in there long enough, but it’s very easy to imagine a related scenario in which having different versions with different mirror-retry strategies in play (same theory behind multiple hashes in Stochastic Fair Blue BTW) might at least have avoided one factor contributing to the meltdown.