Platypus Reloaded, 09 Jan 2015 09:41:00 -0500Notes on File System Semantics<p>Just some random thoughts from an email I sent recently, plus a bonus SCSI war story.</p> <blockquote> <p>As the PVFS folks said long before I came along, some POSIX requirements are inappropriate for a distributed file system. I agree with that, but not with the object-store folks who claim that the <em>entire</em> hierarchical byte-addressable file system model is obsolete. I think most of that model is still valuable for compatibility with the thousands of applications that are out there. Only a few over-specified behaviors which few will miss (often few even know about them) and which are obviously problematic in a distributed system need to be retired.</p> </blockquote> <p>...and...</p> <blockquote> <p>users can't reason well about consistency guarantees that are conditional on the availability of specific servers. "After a write, readers will see X" is easy to reason about. Adding "...or reads will fail if certain system-wide conditions are met" doesn't make it much worse. Adding "...or they might see Y if some otherwise-invisible event intervenes" kind of leaves them hanging. If writes can disappear, other than in the event of a system-wide failure, then I'd say you effectively have no guarantees at all <em>and that's OK</em>. One of the hard-won lessons from working in this field for a long time is that it's better to make few and simple promises (which you can be sure of keeping) than get dragged into long discussions of what was or was not promised under what conditions. That's not a good place to be in when users' data is at stake.</p> </blockquote> <p>The last point is the most important IMO. I first ran into this back in '94, when I was working on one of the earlier multi-pathing SCSI drivers (REACT for the IBM 7135). My code would try <em>really hard</em> to maintain or re-establish contact with a volume, despite any combination of failures. 
While I was working in England with the people who actually built the hardware, we discovered one case where this persistence meant we'd flap around for five minutes or so, repeatedly switching between controllers before we had finally observed and cleared enough error conditions to continue normally. I thought it was awesome that we were able to recover. One of the older engineers was unimpressed. To him, those five minutes of unpredictable behavior negated any subsequent success. He argued that it would be better to try both controllers, then simply fail. His view prevailed, and in retrospect I think rightly so. Sometimes, "weak promises strongly kept" is better than the alternative, especially when there's a higher layer that can build on that to provide its own guarantees.</p> <p>BTW, the test involved here was the infamous "pen in a fan" which was amusing in its own way. The board had three signal lines to report faults, but more than three faults to report. Therefore, the lines were multiplexed. Sticking a pen in a fan would cause the board to signal fault 0x7 (all three lines asserted). However, the person who wrote the board firmware didn't read the hardware spec properly, and out in SCSI-land this would be reported as three separate faults - 0x4, 0x2, and 0x1. This is what caused us to keep going back and forth so much, clearing one pseudo-fault each time instead of all at once. Now that the wounds have healed, I can look back and laugh. At the time I was not so amused.</p>Fri, 09 Jan 2015 09:41:00,2015-01-09:2015-01-fs-semantics.htmlstorageTechnical Debt vs. Technical Risk<p>One of the most useful metaphors in software engineering is Ward Cunningham's <a href="">technical debt</a>. Definitions and interpretations vary, but technical debt is basically all the stuff you're going to fix later because you were in too much of a hurry to do it right the first time. 
We all know what it's like to be up against a release deadline, or in the middle of a bug firefight, and we find something that works and we allow it to pass even though we know it's not really right. Some common types of technical debt might include:</p> <ul> <li> <p>Using a "private" member of a data structure instead of adding the proper API.</p> </li> <li> <p>Layering violations and circular dependencies.</p> </li> <li> <p>Copying and pasting code instead of creating a more general common function.</p> </li> <li> <p>Adding "garbage" function arguments that change behavior to suit a new use (often to avoid the previous error but really just as bad).</p> </li> <li> <p>Checking the same condition ten different places instead of refactoring to use subclasses, dispatch tables, or any of several other cleaner techniques.</p> </li> </ul> <p>The important thing about the debt metaphor is the idea that it's OK to have some debt as long as it's tracked and kept under control. Sure, it's different when developers add technical debt just because they're lazy twits, but that fits in the metaphor too - akin to people who can't stop abusing their credit cards to buy junk they don't need and can't really afford. The debt metaphor really helps both developers and business types think about the future consequences of their short-term decisions.</p> <p>But ... is it really debt?</p> <p>The thing about debt, in the real world, is that it's a known quantity. You know how much debt you have, you know the rate at which it's increasing, and you know what it will take to pay it back. It might be more than you can bear, if you've been careless, but you know. Technical debt usually isn't like that. The problem with all of these shortcuts is not usually a steady and predictable drag on your resources. A project laden with technical debt might get away with that for a very long time, but that becomes less likely as the cruft accumulates. 
Every ugly shortcut increases the chance that the codebase will run into a <em>sudden</em> and <em>catastrophic</em> failure - a severe and hard-to-fix bug, a missed deadline, or even a simple failure to remain competitive because changing the code safely has become as difficult as crossing a minefield.</p> <p>That's not debt. That's <strong>risk</strong>. As in finance, technical risk can be measured and reasoned about, leading to sensible tradeoffs even though there's uncertainty involved. As in finance, avoiding risk altogether is impossible and trying too hard will mean missed opportunities. Some people will manage risk well, and reap rewards from doing so. Others will manage risk poorly, and will fail - not slowly or quietly as with debt, but often quite suddenly and spectacularly.</p> <p>I'm not saying that the risk metaphor should displace the debt metaphor. They both have their value. However, in my experience, what gets called technical debt is really technical risk more often than not. The important lesson here is to keep them separate. The next time you hear someone talk about technical debt, or are tempted to make a point that way yourself, it might be helpful to think about whether the conversation should really be about technical risk. In particular, the next time somebody says that refactoring some overburdened but critical piece of code is too risky, it might be worth pointing out that <em>failure</em> to refactor carries its own risk. Some people will always pick debt over risk. Framing the discussion as risk vs. risk might be more effective than letting it seem like risk vs. debt.</p>Mon, 05 Jan 2015 17:02:00,2015-01-05:2015-01-technical-risk.htmldesignworkingWhy "DSO" is an Awful Term<p>A recent discussion on the GlusterFS development mailing list got a bit hung up on the issue of what is or is not a "DSO" (Dynamically Shared Object). 
This is one of many issues with dynamic linking and dynamic loading that I've seen cause problems before, in large part because they're <strong>two different things</strong> that people often mix up. I'll try to explain how this fact leads to confusion, and suggest how to avoid that confusion.</p> <p>For the sake of this discussion, let's separate the two kinds of things that "DSO" might refer to. We'll use "library" to mean something that is specified when linking an executable, and is therefore reflected in that executable's on-disk contents. By contrast, a "module" is not specified when linking and not reflected in the on-disk executable; one must use <em>dlopen</em> from within the program to get at it. Despite their differences, both of these are dynamically <strong>linked</strong>. In both cases, the executable lacks a complete symbol table for the shared object (in the module case it lacks any symbol table at all). The library or module's symbols will be resolved when it is loaded. In fact, this late resolution is essential to make any kind of shared object work, on any platform, so the "D" in "DSO" is kind of redundant.</p> <p>The difference between libraries and modules is that modules are also dynamically <strong>loaded</strong> whereas libraries are not. Libraries are implicitly loaded into a process's memory space before the process starts (i.e. before <em>main</em> is called). Modules are explicitly loaded only when <em>dlopen</em> is called. Either way, loading includes mapping library/module contents into a process's memory. In the dynamic-linking case it also includes resolving symbols, but it is actually possible to do dynamic loading without dynamic linking (see my <a href="quora">Quora answer</a> on this topic for more details) so this is not essential.</p> <p>Where did all of this go wrong? Apparently it's Apple's fault. 
In their infinite arrogance, and contrary to every other UNIX platform, they decided that the same shared object could not be used as both module and library. It had to be one or the other. While precluding dual use without reason is generally a bad decision technically, Apple then made it worse by using "DSO" to mean only modules and not libraries. Is the "D" what really distinguishes an Apple DSO from an Apple non-DSO? Nope. That didn't stop them, and it didn't stop the libtool folks either. They never saw a stupid idea they didn't like, so they mindlessly copied Apple's bad terminology (including the "module" flag). This has led to much confusion since, including that which inspired this post.</p> <p>So, if "DSO" doesn't work, what would? Surprisingly, it's not the "D" but the "S" that must go. Everything I've said so far about dynamic linking and loading would apply even if the objects in question are not shared. What we're really talking about here is two kinds of dynamically linked objects. On every platform but Apple's, the loading issue doesn't matter so "DLO" would be sufficient to distinguish these from statically linked libraries. However, we've seen that Apple's choices and terminology do infect others. Where the loading distinction does matter, it's between implicit (or immediate) loading vs. explicit loading. That would lead us to the rather unwieldy IDLO and EDLO. Alternatively, we could embrace the "library" vs. "module" distinction, resulting in DLL and DLM. Yes, <a href="dll">DLL</a>. Microsoft pretty much got this one right, folks. It's a technically accurate term, which would also be common across the Windows and UNIX/Linux platforms, so how is that a bad thing?</p> <p><em>Sigh</em>. But we programmers aren't so rational, as a group. Apple's not going to change. Libtool won't either. They'll both continue to use "DSO" inaccurately and misleadingly. At least now maybe the term will raise a red flag, and people will know to ask for clarification. 
When someone says "DSO" ask them whether they mean all things that are dynamic and shared and objects, or just some arbitrary Apple-defined subset.</p>Fri, 12 Dec 2014 10:39:00,2014-12-12:2014-12-dso-terminology.htmloperating-systems"Scale Out" Applies to Interfaces, Too<p>Because of what I do for $dayjob, I hear a lot about "scale out" vs. "scale up" in various contexts. Also because of what I do for $dayjob, I get to read a lot of code. Some of it's new and clean. Some of it's . . . not. That's only partly a reflection on the skill of the programmers involved. Part of it is just the fact that all code tends to accumulate technical debt over time. Layering violations, "privacy" violations, and mutual dependencies all chip away at modularity. Short parameter lists turn into long ones, reflecting every new feature added since the code was properly refactored. (Really, when was the last time you saw a parameter list get <em>shorter</em>?) Types, fields, and flags proliferate. Cats and dogs start living together. It's chaos, I tell you!</p> <p>Something similar also tends to happen with public APIs. They start simply enough, then they grow and grow and grow. Something like this, if I may mix my movie metaphors.</p> <div style="text-align: center"> <img src="" alt="Audrey" /> </div> <p>As it turns out, there are two ways that an interface can increase in complexity. Yep, you guessed it: scale up or scale out. A "scale up" interface is one that gets <strong>monolithically</strong> bigger - you can't use any part of it without having to deal with significant complexity. Doing even the simplest thing requires several calls. OpenSSL provides a great example: set up a method table, create three types of objects, tie two of those together, set up cipher lists and certificate chains, and more, all before you can even start to do regular socket stuff (which is non-trivial already). 
It's tedious, it's error-prone, and just about everybody who has to use OpenSSL ends up wrapping all of that crap into their own function or object with a much simpler interface. (BTW, the code that inspired this post had nothing to do with OpenSSL.)</p> <p>By contrast, a "scale out" interface is one that gets bigger in a <em>modular</em> way. Maybe it just has a lot of functions, but using any one of those is simple and straightforward. In some cases, those functions might be grouped according to the objects they operate upon or the functionality they provide, but if you don't use a particular subset then you don't have to set up for it. <em>Defaults</em> are applied intelligently, so that simple calls yield obvious results but more sophisticated usage is also possible. Secondary objects are <em>automatically created</em> using defaults, so the user has to go through fewer steps. <em>Hooks</em> and <em>callbacks</em> are provided to customize behavior further, but remain entirely optional. In all of these cases, the goal is either to reduce the knowledge needed by basic users, or reduce the number of users who need non-basic knowledge. In other words, you want to minimize the area under this curve.</p> <div style="text-align: center"> <img src="" alt="usage type vs. knowledge" /> </div> <p>A "scale out" interface can be just as complex as a "scale up" interface. It can have just as many calls, require just as much code and tests and documentation. However, it <strong>grows more gracefully</strong>. Exposing your guts to every caller, whether or not they really want to see those guts, is what creates all of that bad coupling and technical debt. If a caller never had to know about a particular interface element (e.g. a function) to get their job done, neither you nor they will have to worry about compatibility when it changes. That reduces complexity and breakage on both sides. 
There's also less need (or temptation) to "reach in" and muck with stuff that is (or should be) internal, so the level of debt-inducing inflexibility is further reduced. Defining a scale-out interface might be a bit more difficult, but it pays off in the long run.</p>Wed, 03 Dec 2014 16:40:00,2014-12-03:2014-12-scale-out-interfaces.htmldesignThoughts on Running<p>(...and now for something completely different.)</p> <p>Back in July, I started running. That would not be a particularly notable statement for many people, but most people haven't detested running all their lives and avoided it for thirty years. Instead, I've used stairclimbers and ellipticals for many years, but I've grown to hate my elliptical even more than I hated running. (It's a Livestrong 10.0E which always required constant tweaking to keep it from clanking intolerably, started showing rust after only six months, and is now approaching its second flywheel replacement. Never <em>ever</em> buy anything made by Johnson, regardless of which brand it says it is.) I didn't feel like using that, I didn't feel like driving to work or a club multiple times a week, but I needed to do something. More as an experiment than anything else, I forced myself to try running again.</p> <p><img alt="image" src="" /></p> <p>(image from</p> <p>It turns out that the reason I hated running is that I was doing it wrong. Yeah, I know that sounds crazy. How can a supposedly-smart person fail at something so basic as running? Well, the problem is that a "traditional" heavy-heel-strike running style just doesn't suit me. Maybe it works for a lot of other people - I still see most runners on the road using that style - but it always makes me feel like I'm knocking the breath out of myself with every step. When I first started out on that July day, running was just as unpleasant as I had remembered. 
However, I had read a lot about barefoot running and landing one's weight more on the front or middle of the foot, so I decided to give that a try. That was just <strong>so</strong> much better - not exactly fun, I guess, but not particularly unpleasant either. So I stuck with it.</p> <p>When I first started, I could run 2.5 miles in about 28 minutes - probably about ten minutes per mile while actually running, plus plenty of walking breaks. Three months later, I'm at about 21:30 (8:36 per mile). Better still, I can maintain that exact same pace even at five miles. My time at the midpoint of that route - the top of a 5% grade - is exactly the same as my time for the shorter route. No, I don't understand that either*. My goal is to do at least one unofficial 10K before Thanksgiving. My stretch goal is to do it in under 50 minutes. It's good to have goals.</p> <p>So, am I "one of those runners"? According to some definitions, which separate running from jogging at 10:00 per mile, yes. I certainly don't feel like I'm jogging. If I had to stop suddenly, my momentum - except on steeper climbs - would carry me forward more than one step. That seems like an interesting cutoff. I've run six days out of seven, and twelve out of fifteen. I've run in the rain, and I plan to run in the snow at least some of the time this winter (probably on the Minuteman bike trail because road running in winter seems pretty scary). I've also lost five pounds and my resting heart rate has gone from 60 to 54. I think about running, talk about running, and now I blog about it too. So yeah, I guess I'm a runner.</p> <p>Being a competitive guy, I also wonder whether I'm a <em>good</em> runner. I certainly don't feel like I am yet. Five miles in 43:00 (my best result so far) doesn't seem all that impressive, even if there is an annoyingly steep climb in the middle and slight uphill for the entire last mile. 
I probably do fall below the "jogging threshold" sometimes, and my training focus right now is on maintaining good pace along the entire course. On the other hand, I've checked last year's results from races in Lexington and Andover. According to those, I'd consistently place about a third of the way down. Of the people I see <em>on the road</em>, most of whom do not enter races, I'd say only half that many seem to be going faster than I am (I don't seem to run the same routes/times as others enough to make a more direct comparison). That doesn't make me feel like any kind of a champion, but - more importantly - it keeps me from feeling like so much of a slouch that I get discouraged. I feel competent, and I know I can get better, so that keeps me going.</p> <p><img alt="image" src="" /></p> <p>The other thing that keeps me going is the people who have encouraged me and given me advice. Hank, Mike, Patrick, Allison, Shari, Nick, David - you all rock. You guys at Greater Boston Running Company, who helped me find the right shoes when I had ankle problems, rock too. I feel fitter now than I have in a long time, perhaps ever, and I couldn't have done it alone. Who would have thought that something so "obviously" solitary as running could be so social?</p> <p>* UPDATE: ...and it's actually not true any more. I actually wrote this a couple of days ago. Today, with the extra incentive of getting in before a break in the rain ended, I managed 21:06. 
I guess the difference in my pace from week to week outweighs the difference from course to course.</p>Tue, 21 Oct 2014 12:16:00,2014-10-21:2014-10-running-thoughts.htmlrunningDistributed Systems Prayer<p>Forgive me, Lord, for I have sinned.</p> <ul> <li> <p>I have written distributed systems in languages prone to race conditions and memory leaks.</p> </li> <li> <p>I have failed to use model checking when I should have.</p> </li> <li> <p>I have failed to use static analysis when I should have.</p> </li> <li> <p>I have failed to write tests that simulate failures properly.</p> </li> <li> <p>I have tested on too few nodes or threads to get meaningful results.</p> </li> <li> <p>I have tweaked timeout values to make the tests pass.</p> </li> <li> <p>I have implemented a thread-per-connection model.</p> </li> <li> <p>I have sacrificed consistency to get better benchmark numbers.</p> </li> <li> <p>I have failed to measure 99th percentile latency.</p> </li> <li> <p>I have failed to monitor or profile my code to find out where the real bottlenecks are.</p> </li> </ul> <p>I know I am not alone in doing these things, but I alone can repent and I alone can try to do better. I pray for the guidance of Saint Leslie, Saint Nancy, and Saint Eric. 
Please, give me the strength to sin no more.</p> <p>Amen.</p>Tue, 21 Oct 2014 10:35:00,2014-10-21:2014-10-dist-sys-prayer.htmldistributedhumorTen Stages of Technology Familiarity<p>Without further ado...</p> <ol> <li> <p>Never heard of it.</p> </li> <li> <p>Yeah, I hear all the hipsters yammering about it.</p> </li> <li> <p>I checked out the docs and examples once.</p> </li> <li> <p>I used it for a side project.</p> </li> <li> <p>We're using it for some new projects at work.</p> </li> <li> <p>We're using it in production.</p> </li> <li> <p>We're using it in production, but with a bunch of other stuff wrapped around it to address its deficiencies.</p> </li> <li> <p>We forked the project and our version's way better.</p> </li> <li> <p>Yeah, we used to use it.</p> </li> <li> <p>Never heard of it.</p> </li> </ol>Wed, 10 Sep 2014 14:45:00,2014-09-10:2014-09-tech-familiarity.htmlhumorStorage Benchmarking Sins<p>I've written and talked many times about storage benchmarking. Mostly, I've focused on how to run tests and analyze results. This time, I'd like to focus on the parts that come before that - how you set up the system so that you have at least some chance of getting a fair or informative result later. To start, I'm going to separate the setup into layers.</p> <ul> <li> <p>The physical configuration of the test equipment.</p> </li> <li> <p>Base-level software configuration.</p> </li> <li> <p>Tuning and workload selection.</p> </li> </ul> <h2>Physical Configuration</h2> <p>The first point about physical configuration is that there's almost never any excuse for testing two kinds of software on different physical configurations. Sure, if you're testing the hardware that makes some sense, but even then the only comparisons that make sense are the ones that exhibit equality at some level such as number of machines or system cost (including licenses). 
Testing on different hardware is the most egregious kind of dishonest benchmarking, but it's only the first of many.</p> <p>The second point about physical configuration is that just testing on the same hardware doesn't necessarily make things fair. What if one system can transparently take advantage of RDMA or other kinds of network offload but the other can't? Is it really fair to compare on a configuration with those features, and not even mention the disparity? What if one system can use the brand-new and vendor-specific SSE9 instructions to accelerate certain operations, but the other can't? The answer's less clear, I'll admit, but a respectable benchmark report would at least note these differences instead of trying to bury them. A good rule of thumb is that it's hardware <strong>used</strong> that counts, not merely hardware <strong>present</strong>. If the two systems aren't actually using the same hardware, the benchmark's probably skewed.</p> <p>The third and last point about hardware is it's still possible to skew benchmark results even if two systems are using the same hardware. How's that? Not all programs benefit equally from the same system performance profile. What if one system made a design decision that saves memory at the expense of using more CPU cycles, and the other system made a different design decision with the opposite effect? Is it fair to test on machines that are CPU-rich but memory starved, or vice versa? Of course not. A fair comparison would be on balanced hardware, though it's obviously difficult to determine what "balance" means. This is why it's so important for people who do benchmarks to disclose and even highlight potential confounding factors. Another common trick in storage is "short stroking" by using lots of disks and testing only across a small piece of those to reduce seek times. The flash equivalent might be to test one system on clean drives and the other after those same drives have become heavily fragmented. 
These differences can be harder to identify than the other two kinds, but they can have a similar effect on the validity of results.</p> <h2>Base Software Configuration</h2> <p>For the purposes of this section, "base" effectively means anything but the software under test - notably operating-system stuff. Storage configuration is particularly important. Is it fair to compare performance of one system using RAID-6 vs. another using JBOD? Probably not. (The RAID-6 might actually be faster if it's through a cached RAID controller, but that takes us back to differences in physical configuration so it's not what we're talking about right now.) Snapshots enabled vs. snapshots disabled is another dirty trick, since there's usually some overhead involved. Many years ago, when I worked on networking rather than storage, I even saw people turning compression on and off for similar reasons.</p> <p>Other aspects of base configuration can be used to cheat as well. Tweaking virtual-memory settings can have a profound effect on performance, which will disproportionately hurt some systems. Timer frequency is another frequent target, as are block and process schedulers. In the Java world, I've seen benchmarks that do truly heinous things with GC parameters to give one system an advantage over another. As with physical configuration, base software configuration can be easily done so that it's equal but far from fair. The rule of thumb here is whether the systems have been set up in a way that an experienced system administrator might have done, either with or without having read each product's system tuning guides. If the configuration seems "exotic" or is undisclosed, somebody's probably trying to pull a fast one.</p> <h2>Tuning</h2> <p>Most of the controversy in benchmarking has to do with tuning of the actual software under test. When I and others have tested GlusterFS vs. Ceph, there have always been complaints that we didn't tune Ceph properly. 
Those complaints are not entirely without merit, even though I don't feel the results were actually unfair. The core issue is that there are two ways to approach tuning for a competitive benchmark.</p> <ul> <li> <p>Measure "out of the box" (OOTB) performance, with no tuning at all. If one system has bad defaults, too bad for them.</p> </li> <li> <p>Measure "tuned to the max" performance, consulting experts on each side on how best to tweak every single parameter.</p> </li> </ul> <p>The problem is that the second approach is almost impossible to pull off in practice. Most competitive benchmarks are paid for by one side, and the other is going to be distinctly uninterested in contributing. Even in cases where the people doing the testing are independent, it's just very rare that competitors' interest and resource levels will align that closely. Therefore, I strongly favor the OOTB approach. Maybe it doesn't fully explore the <em>capabilities</em> of each system, but it's more likely to be fair and representative of what actual users would see.</p> <p>However, even pure OOTB doesn't quite cut it. What if the systems come out of the box with different default replication levels? It's clearly not fair to compare replicated vs. non-replicated, or even two-way vs. three-way, so I'd say tuning there is a good thing. On the other hand, I'd go the other way for striping. While different replication levels effectively result in using different hardware (different usable capacity), the same is not true of striping which merely uses the same hardware a little differently. That falls into the "too bad for them" category of each project being responsible for putting its own best foot forward.</p> <p>Another area where I think it's valid to depart from pure OOTB is durability. It's simply not valid or useful to compare a system which actually gets data on disk when it's supposed to vs. 
one that leaves it buffered in memory, as at least two of GlusterFS's competitors (MooseFS and HDFS) have been guilty of. You have to compare apples to apples, not apples to promises of apples maybe some day. Any deviations from pure OOTB should be looked at in terms of whether they correct confounding differences between systems or introduce/magnify those differences.</p> <h2>Conclusion</h2> <p>Benchmarking software is difficult. Benchmarking storage software is particularly difficult. Very few people get it right. Many get it wrong just because they're not aware of how their decisions affect the results. Even with no intent to deceive, it's easy to run a benchmark and only find out after the fact that what seemed like an innocent configuration choice caused the result to be unrepresentative of any real-world scenario. On the other hand, there often <strong>is</strong> intent to deceive. With some shame, I have to admit that my colleagues in storage often play a particularly dirty game. Many of them, especially at the big companies, have been deliberately learning and applying all of these dirty tricks for decades, since EMC vs. Clariion (both sides) and NetApp vs. Auspex (ditto). None of them will be the first to stop, for obvious game-theoretic reasons.</p> <p>I've tried to make this article as generic and neutral as I could, because I know that every accusation will be met with a counter-accusation. That's also part of how the dirty game is played. 
However, I do invite anyone who has read this far to apply what they've learned as they evaluate <a href="">recent benchmarks</a>, and reach <strong>their own</strong> conclusions about whether those benchmarks reveal anything more than the venality of those who ran them.</p>Mon, 23 Jun 2014 09:34:00,2014-06-23:2014-06-benchmark-sins.htmldistributedstorageglusterfsvmwarebenchmarksWannabe of the Month: Skylable<p>Every month or two, someone comes along and claims to be the new Best Thing Ever in distributed file storage. More often than not, it's just another programmer who recently discovered things like consistent hashing and replication, then slapped together another HTTP object store because that's what people nowadays do instead of writing their own LISP/Forth interpreter or "make" replacement. There's nothing wrong with the exercise itself, of course. It's a great learning experience, and it's how real projects get started. For example, <a href="">LeoFS</a> might not really be "the leading DFS" as they claim, but it's certainly a serious effort that I'm watching with interest. What gets my goat is always the grandiose claims, often made in the form of comparisons between real production-level file systems like GlusterFS and things that are neither production-level nor file systems.</p> <p>This month's example is <a href="">Skylable</a>, which tried to take advantage of the publicity around yesterday's big announcement to pimp their own spare-time project. At first they just tried to position themselves as a competitor to GlusterFS and Ceph when they're clearly not. I tried, as neutrally as I could, to point out that it's not a valid comparison. They didn't take the hint. Instead, @tkojm decided to double down.</p> <blockquote> <p>Skylable SX beats both Ceph &amp; Gluster in terms of security, code quality, ease of use and robustness. Cheers.</p> <p></p> </blockquote> <p>OK, game on. 
Such claims really piss me off, not because they're made against my own project but because they're disrespectful to every project in the same space. For example, <a href="">Tahoe-LAFS</a> plays in exactly this space, and they actually know what they're doing when it comes to security. Making competitive claims that are not only unaccompanied by one shred of evidence but <em>clearly false</em> to anyone with even the most cursory knowledge of the competitive landscape is outright dishonest. The Skylable folks have practically invited more serious comparisons, so I'm going to give them what they asked for and they're not going to like it. Maybe that will keep the next tyro from making the same mistake.</p> <p>Before I go on, I should mention that this post has nothing to do with Red Hat or the Gluster community. No time or equipment from either was used to test the Skylable code or write up the results. This is not big bad Red Hat picking on a smaller competitor. This is one guy (me), on his own time, trying to find the truth behind some very ambitious claims.</p> <p>Let's start with ease of use. Here are the steps to install GlusterFS, set up a two-way replicated volume, and mount it on a client.</p> <ul> <li> <p>yum install glusterfs... (or equivalent for other distros)</p> </li> <li> <p>/etc/init.d/glusterd start</p> </li> <li> <p>gluster peer probe server2 (from server1)</p> </li> <li> <p>gluster volume create myvol replica 2 server1:/path server2:/path</p> </li> <li> <p>gluster volume start myvol</p> </li> <li> <p>mount -t glusterfs server1:/myvol /wherever (from client)</p> </li> </ul> <p>What's the equivalent for Skylable? Well, you start by downloading, configuring, and building from source. Really. I don't expect such a young project to have stuff in major-distro repos yet. I wouldn't even ding them for not having their own specfiles or whatever, but they brought up ease of use and <em>requiring users to build from source is not good for ease of use</em>. 
It's even worse if you trip over their unnecessary dependency on libcurl being built with special OpenSSL support, which is not the case on RHEL/Fedora platforms. So much for the "tested on all major UNIX platform" claim.</p> <p>Once you've done who-knows-what to your system by running "make install" you're ready to begin configuring. Oh, what fun. To do this, you run "sxsetup" which will prompt you for several things and spit out some user-incomprehensible things like an administrator key. Then you have to <em>log in to another node</em> to repeat the process, manually copying and pasting an admin key from one window to another. Then you have to repeat the process again to set things up for the special-purpose programs you only need because it's not a real mountable file system, only this time they call it a "user key" instead. Between the installation mess and the extra steps and the lack of real documentation, I think we can pretty clearly say...</p> <p><strong>Ease of Use: LIE</strong></p> <p>OK, so how about security, code quality, and robustness? With regard to security, they make a big deal of having both on-network and on-disk encryption, the latter using client keys. GlusterFS also has both of those, and much of the code has been vetted by Red Hat's renowned security team. Skylable's has been vetted by approximately nobody. A quick perusal of the code shows that it's all home-grown and littered with rookie mistakes. My favorite was this:</p> <div class="highlight"><pre> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">skysalt</span> <span class="o">=</span> <span class="s">&quot;sky14bl3&quot;</span><span class="p">;</span> <span class="cm">/* salt should be 8 bytes */</span> </pre></div> <p>Yep, that's a constant salt embedded in the code, apparently used in lieu of a real KDF to generate the user's key from their password. 
Here's a hint: this process <strong>adds no entropy</strong> to the original text password, no matter how many times you apply EVP_BytesToKey. I actually pointed this one out to them on HN, as did somebody else, and (without admitting error) they claim they'll do better next time, but it does raise an important question. How likely is it that somebody who made such an inexcusable mess of generating a user key then managed to get every other little detail right in their home-grown storage encryption? The odds are nil. I could play free security consultant here and find the next terrible flaw for them, and the next one and the next one, but I shouldn't need to and neither should anyone else. This is just not serious crypto code, so...</p> <p><strong>Security: LIE</strong></p> <p>That also tells us where we're headed for code quality. Obviously I can't just rely on "gut feel" and familiarity because I have years of experience with the GlusterFS code and none with this, so I'll try to look at objective measures. I picked several files at random to look at. I deliberately excluded those marked as third-party, but still found a lot of code copied from other codebases - libtool, the ISAAC hash, SQLite. This is not only a terrible code smell but in many cases might be a license violation as well. The data structures seem to be better documented in the Skylable code than in GlusterFS (using the Doxygen style), but otherwise there seemed to be little evidence of this vaunted code quality - even though new code written by a small tight-knit team generally <em>should</em> have a higher median quality than older code written by a much larger team.</p> <p>Error checking doesn't seem notably more consistent, and cleanup after an error often seems to have involved copying free/close/etc. functions from one code block to another instead of using any of several more robust idioms. 
I specifically looked at error checking around read(2) and write(2) to see if it handled partial success as well as outright failure. Generally, no. The code uses lt__malloc for no apparent reason, but doesn't get any extra memory safety for the extra effort. Logging/tracing doesn't seem particularly strong. Skylable's own code (as opposed to what they copied) seems to use sprintf more than snprintf. I know these are all incredibly superficial observations, but code quality is an enormously complex topic. These are just the things that are easy to put into words, and they're already more than @tkojm has offered in support of his claim. They're enough to say...</p> <p><strong>Better Code Quality: QUESTIONABLE</strong></p> <p>It's pretty much the same for robustness. There's no evidence from the real world about the robustness of this code, of course. I also see even fewer tests than there are for GlusterFS, so I'd have to say that claim's <strong>QUESTIONABLE</strong> as well. Of the four claims @tkojm made, therefore, two are questionable and two are outright false, so the whole evaluates to false. Let's move on to the part he didn't even want to talk about: performance.</p> <p>To test performance, I used two SSD-equipped 16GB instances at Digital Ocean. It took me an hour or so to work through all of the dependency crap and get things set up before I was able to run any tests. Then the very first test I ran was a very simple 4KB file-create loop using sxcp. What was the result?</p> <div class="highlight"><pre><span class="mf">0.67</span> <span class="n">files</span> <span class="n">per</span> <span class="n">second</span> </pre></div> <p>I'm not joking, but apparently they are. GlusterFS is often criticized for its performance on exactly this kind of workload, and I'd be the first to say rightly so, but it's still <strong>orders of magnitude</strong> better than that. Those are modem speeds, on machines that can and do perform quite well using any other software. 
There's just no excuse. Why even bother looking further?</p> <p>I could keep going. I could go into detail about how the CLI lacks built-in help, how it doesn't seem to include anything to report on node or cluster status, how there don't seem to be any provisions for basics such as rebalancing after a server is added or permanently removing one from the config after the hardware blew up. I could talk about how storing sensitive data unencrypted in a plethora of separate SQLite files is bad for security, performance, and maintainability all at once. But really . . . <strong>enough</strong>. More than enough. Not even the most fervent advocate of "Minimum Viable Product" could consider this to be past the prototype phase.</p> <p>Let's try to give the Skylable folks as much benefit of the doubt as we can here. Maybe a few people decided that they'd had enough of some other technical area, and settled on distributed object storage as their next challenge. So they started tinkering around with Skylable SX as a platform for learning and experimentation. I think that's awesome. I want to encourage that kind of thing. Sure, I might have suggested a little more reading and studying how existing systems do the same sorts of things, before diving into code (especially that awful home-grown crypto), but I'd still try to be supportive. The problems only start when someone decides to start chasing money instead of technology. After all, this is a hot area, but new enough that many users don't know how to tell the serious players from the charlatans, so why not try to cash in? So he starts mouthing off about how this is <strong>already</strong> a serious contender, even though the people actually writing the code know it's still years from that. That's no longer OK. 
That's encouraging people to use code that will not store their data safely or securely, and I have a zero tolerance policy for that sort of thing.</p> <p>My message here is really pretty simple: keep coding, keep experimenting, if I have offended anyone's <strong>technical</strong> sensibilities I sincerely apologize, but for heaven's sake somebody get @tkojm to STFU until it's done. Fix the crypto, fix the performance, fix the packaging and UI. Then we'll have something real to talk about. Maybe, if I find more time than has already been wasted, I'll even dig in and submit a patch or two, fix some of that egregiously bad performance. But not if I keep hearing how it's already better than anything else.</p>Thu, 01 May 2014 12:08:00,2014-05-01:2014-05-skylable.htmldistributedstorageglusterfscephskylableInktank Acquisition<p>I know a lot of people are going to be asking me about Red Hat's acquisition of Inktank, so I've decided to collect some thoughts on the subject. The very very simple version is that <strong>I'm delighted</strong>. Occasional sniping back and forth notwithstanding, I've always been a huge fan of Ceph and the people working on it. This is great news. More details in a bit, but first I have to take care of some administrivia.</p> <p><em>Unlike everything else I have ever written here, this post has been submitted to my employer for approval prior to publication. I swear to you that it's still my own sincere thoughts, but I believe it's an ethical requirement for independent bloggers such as myself to be up front about any such entanglement no matter how slight the effect might have been. Now, on with the real content.</em></p> <p>As readers and conference-goers beyond number can attest, I've always said that Ceph and GlusterFS are allies in a common fight against common rivals. First, we've both stood against proprietary storage appliances, including both traditional vendors and the latest crop of startups. 
A little less obviously, we've also both stood for Real File Systems. Both projects have continued to implement and promote the classic file system API even as other projects (some even with the gall to put "FS" in their names) implement various stripped-down APIs that don't preserve the property of working with every script and library and application of the last thirty years. Not having to rewrite applications, or import/export data between various special-purpose data stores, is a <strong>huge</strong> benefit to users.</p> <p>Naturally, these two projects have a lot of similarities. In addition to the file system API, both have tried to address object and block APIs as well. Because of their slightly different architectures and user bases, however, they've approached those interfaces in slightly different ways. For example, GlusterFS is "files all the way down" whereas Ceph has separate bulk-data and metadata layers. GlusterFS distributes cluster management among all servers, while Ceph limits some of that to a dedicated "monitor" subset. Whether it's because of these technical differences or because of relationships or pure happenstance, the two projects have experienced different levels of traction in each of these markets. This has led to different lessons, and different ideas embedded in each project's code.</p> <p>One of the nice things about joining forces is that we each gain even more freedom than before to borrow each other's ideas. Yes, they were both open source, so we could always do some of that, but it's not like we could have used one project's management console on top of the other's data path. GlusterFS using RADOS would have been unthinkable, as would Ceph using GFAPI. Now, all things are possible. In each area, we have the chance to take two sets of ideas and either converge on the better one or merge the two to come up with something even better than either was before. 
I don't know what the outcomes will be, or even what all of the pieces are that we'll be looking at, but I do know that there are some very smart people joining the team I'm on. Whenever that happens, all sorts of unpredictable good things tend to happen.</p> <p>So, welcome to my new neighbors from the Ceph community. Come on in, make yourself comfortable by the fire, and let's have a good long chat.</p>Mon, 28 Apr 2014 13:23:00,2014-04-28:2014-04-inktank-acquisition.htmldistributedstorageglusterfscephNew Style Replication<p>This afternoon, I'll be giving a talk about (among other things) my current project at work - New Style Replication. For those who don't happen to be at Red Hat Summit, here's some information about why, what, how, and so on.</p> <p>First, why. I'm all out of tact and diplomacy right now, so I'm just going to come out and say what I really think. The replication that GlusterFS uses now (AFR) is unacceptably prone to "split brain" and always will be. That's fundamental to the "fan out from the client" approach. Quorum enforcement helps, but the quorum enforcement we currently have sacrifices availability unnecessarily and still isn't turned on by default. Even worse, once split brain has occurred we give the user very little help resolving it themselves. It's almost like we actively get in their way, and I believe that's unforgivable. I've <a href="">submitted</a> <a href="">patches</a> to overcome both of these shortcomings, but for various reasons those have been almost completely ignored. Many of the arguments about NSR vs. AFR have been about performance, which I'll get into later, but that's really not the point. In priority order, my goals are:</p> <ul> <li> <p>More correct behavior, particularly with respect to split brain.</p> </li> <li> <p>More flexibility regarding tradeoffs between performance, consistency, and availability. 
At the extremes, I hope that NSR can be used for a whole continuum from fully synchronous to fully asynchronous replication.</p> </li> <li> <p>Better performance in the most common scenarios (though our unicorn-free reality dictates that in return it might be worse in others).</p> </li> </ul> <p>To show the most fundamental difference between NSR and AFR, I'll borrow one of my slides.</p> <p><img alt="image" src="" /></p> <p>The "fan out" flow is AFR. The client sends data directly to both servers, and waits for both to respond. The "chain" flow is NSR. The client sends data to one server (the temporary master), which then sends it to the others, then the replies have to propagate back through that first server to the client. (There is actually a fan-out on the server side for replica counts greater than two, so it's technically more splay than chain replication, but bear with me.) The master is elected and re-elected via etcd, in case people were wondering why I'd been hacking on that.</p> <p>Using a master this way gives us two advantages. First, the master is key to how "reconciliation" (data repair after a node has left and returned) works. NSR recovery is log-based and precise, unlike AFR which marks files as needing repair and then has to scan the file contents to find parts that differ between replicas. Masters serve for terms. The order of requests between terms is recorded as part of the leader-election process, and the order within a term is implicit in the log for that term. Thus, we have all of the information we need to do reconciliation across any set of operations without having to throw up our hands and say we don't know what the correct final state should be.</p> <p>There's a lot more about the "what" and the "how" that I'll leave for a later post, but that should do as a teaser while we move on to the flexibility and performance parts. 
In its most conservative default mode, the master forwards writes to all other replicas before performing them locally and doesn't report success to the client until all writes are done. Either of those "all" parts can be relaxed to achieve better performance and/or asynchronous replication at some small cost in consistency.</p> <ul> <li> <p>First we have an "issue count" which might be from zero to N-1 (for N replicas). This is the number of non-leader replicas to which a write must be <strong>issued</strong> before the master issues it locally.</p> </li> <li> <p>Second we have a "completion count" which might be from one to N. This is the number of writes that must be <strong>complete</strong> (including on the master) before success is reported to the client.</p> </li> </ul> <p>The defaults are Issue=N-1 and Completion=N for maximum consistency. At the other extreme, Issue=0 means that the master can issue its local write immediately and Completion=1 means it can report success as soon as one write - almost certainly that local one - completes. Any other copies are written asynchronously but in order. Thus, we have both sync and async replication under one framework, merely tweaking parameters that affect small parts of the implementation instead of having to use two completely different approaches. This is what "unified replication" in the talk is about.</p> <p>OK, on to performance. The main difference here is that the client-fan-out model splits the client's outbound bandwidth. If you have N replicas, a client with bandwidth BW can never achieve more than BW/N write throughput. In the chain/splay model, the client can use its full bandwidth and the server can use its own BW/(N-1) simultaneously. This means increased throughput in most cases, and that's not just theoretical: I've observed and commented on exactly that phenomenon in head-to-head comparisons with more than one alternative to GlusterFS. 
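</p> <p>To make the arithmetic above concrete, here is a minimal sketch, not GlusterFS code. The function names are hypothetical, and it assumes full-duplex links, so the master's outbound forwarding doesn't compete with the client's inbound stream, and it ignores disks and protocol overhead entirely.</p>

```python
def check_counts(replicas, issue, completion):
    """Return True when the counts fall in the ranges described above:
    issue count in 0..N-1 and completion count in 1..N (N = replicas)."""
    return issue in range(0, replicas) and completion in range(1, replicas + 1)


def fan_out_write_throughput(client_bw, replicas):
    """Client-fan-out (AFR-style): the client sends one copy per replica,
    so its outbound bandwidth is divided among them (BW/N)."""
    return client_bw / replicas


def chain_write_throughput(client_bw, server_bw, replicas):
    """Chain/splay (NSR-style): the client sends a single copy to the master,
    which forwards N-1 copies. The write rate is capped by the slower of the
    client link and the master's per-copy share of its own outbound link."""
    forward_copies = replicas - 1
    if forward_copies == 0:
        # A single replica means nothing to forward.
        return client_bw
    return min(client_bw, server_bw / forward_copies)
```

<p>With two replicas and equal 10 Gbit/s links, the fan-out model tops out at 5 Gbit/s while the chain model can use the full 10. With three replicas the figures are roughly 3.3 versus 5, since the master now has to forward two copies.</p> <p>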
Yes, <strong>if</strong> enough clients gang up on a server then that server's networking can become more of a bottleneck than with the client-fan-out model, and <strong>if</strong> the server is provisioned similarly to the clients, and <strong>if</strong> we're not disk-bound anyway, then that can be a problem. Likewise, the two-hop latency with this approach can be a problem for latency-sensitive and insufficiently parallel applications (remember that this is all within one replica set among many active simultaneously within a volume). However, these negative cases are much - <strong>much</strong> - less common in practice than the positive cases. We did have to sacrifice some unicorns, but the workhorses are doing fine.</p> <p>That's the plan to (almost completely) eliminate the split-brain problems that have been the bane of our users' existence, while also adding flexibility and improving performance in most cases. If you want to find out more, come to one of my many talks or find me online, and I'll be glad to talk your ear off about the details.</p>Wed, 16 Apr 2014 10:24:00,2014-04-16:2014-04-new-style-replication.htmldistributedstorageglusterfsChange the Axis<p>The other day, I was talking to a colleague about the debate within OpenStack about whether to chase Amazon's AWS (what another colleague called the "failed Eucalyptus strategy") or forge its own path. It reminded me of an idea that was given to me years ago. I can't take credit for the idea, but I can't remember who I got it from so I'll do my best to represent it visually myself. Consider the following image.</p> <p><img alt="image" src="" /></p> <p>Let's say your competitor is driving toward a Seattle feature set - coffee, grunge, rain. You have a slightly different vision, or perhaps just a different execution, that leads toward more of an LA feature set - fewer evergreens, more palm trees. 
If you measure yourself by progress toward your opponent's goals (the dotted line), <em>you're going to lose</em>. That's true even if you actually make better progress toward your goals. You're just playing one game and expecting to win another. That might seem like an obviously stupid thing to do, but an amazing number of companies and projects end up doing just that. I'll let someone with more of a stake in the OpenStack debate decide whether that applies to them. Now, consider a slightly different picture.</p> <p><img alt="image" src="" /></p> <p>Here, we've drawn a second line to compare <em>our competitor's progress</em> against <em>our</em> yardstick. Quite predictably, now they're the ones who are behind. Isn't that so much better? If you're building a different product, you need to communicate why you're aiming at the right target and shift the debate to who's hitting it. In other words, <em>change the axis</em>.</p> <p>I don't mean to say that copying someone else's feature set is always a mistake. If you think you can execute on their vision better than they can, that's great. Bigger companies do this to smaller companies all the time. At Revivio, we weren't afraid of other small competitors. We were afraid of some big company like EMC getting serious about what we were doing, then beating us with sheer weight of numbers and marketing muscle. Occasionally things even go the other way, when a smaller team takes advantage of agility and new technology to out-execute a larger legacy-bound competitor. The real point is not that one strategy's better, but that you can't mix them. You can't send execution in one direction and messaging in another. You have to pick one, and stick to it, or else you'll always be perceived as falling short.</p>Thu, 13 Feb 2014 09:04:00,2014-02-13:2014-02-change-the-axis.htmlopenstackstrategyData Gravity<p>In the last few days, I had an interesting exchange <a href="">on Twitter</a> about the concept of data gravity. 
For convenience, I'll include the relevant parts here.</p> <ul> <li> <p><a href="">Mat Ellis</a>: Interesting piece by @mjasay <a href="">link</a> … @randybias is right on the money, data gravity is already a big deal on the cloud</p> </li> <li> <p><a href="">me</a>: Data gravity will continue to be a big deal, no matter how fast the network. Can't beat the speed of light.</p> </li> <li> <p><a href="">Randy Bias</a>: Data gravity and speed of light are entirely unrelated.</p> </li> <li> <p>me: No matter how much bandwidth you have, latency-bound sync and coordination limit total data velocity.</p> </li> </ul> <p>I think this is an important point, and Randy is hardly the first to get it wrong, but the explanation is a little longer than Twitter's 140-character limit. If you have data that you want to access from multiple places, you have two choices.</p> <ul> <li> <p>Keep a copy in one location, access it remotely from elsewhere. Besides being <strong>extremely</strong> latency-bound, this does nothing for availability.</p> </li> <li> <p>Keep multiple copies, and keep them in sync. The sync process/protocol still tends to be quite latency-bound, and as the number of replicas increases you get increasingly poor storage utilization. Even Google doesn't have an infinite budget for disks.</p> </li> </ul> <p>Either way, no matter how much bandwidth you have, latency - bound by speed of light - is an issue. This is exactly the point I made in my <a href="">Dude, Where's My Data</a> talk at LISA'12: making that initial copy is easy, but keeping it up to date is hard. Sooner or later you're back to this.</p> <p><img alt="image" src="" /></p> <p>That's data gravity, despite high bandwidth. Computing is full of "if you just do/have X" pipe dreams, of which "throw hardware at it" is just a subcategory. People who've actually tried X have usually found that there are tons of secondary issues that have to be solved, and even then X isn't the panacea it was imagined to be. 
This is such a case. Having tons of bandwidth is nice, it does allow Google to do things that others can't, but it simply doesn't make data gravity disappear.</p>Mon, 10 Feb 2014 08:27:00,2014-02-10:2014-02-data-gravity.htmldistributedstorageTiers Without Tears<p>A lot of people have asked when GlusterFS is going to have support for tiering or Hierarchical Storage Management, particularly to stage data between SSDs and spinning disks. This is a pretty hot topic for these kinds of systems, and many - e.g. Ceph, HDFS, Swift - have announced upcoming support for some form or other. However, tiering is just one part of a larger story. What do the following all have in common?</p> <ul> <li>Migrating data between SSDs and spinning disks.</li> <li>Migrating data between replicated storage and deduplicated, compressed, erasure-coded storage.</li> <li>Placing certain types of data in a certain rack to increase locality relative to the machines that will be using it.</li> <li>Segregating data in a multi-tenant environment, including between tenants at different service levels requiring different back-end configurations.</li> </ul> <p>While these might seem like different things, they're all mostly the same except for one part that decides where to place a file/object. It doesn't really matter whether the criteria include file activity, type, owner, or physical location of servers. The mechanics of actually placing it there, finding it later, operating on it, or moving it somewhere else are all pretty much the same. We already have all those parts in GlusterFS, in the form of the DHT (consistent hashing) translator. We've even added tweaks to it before, such as the ill-named NUFA. Therefore, it makes perfect sense to use that as the basis for our own tiering strategy, but I call it "data classification" because the same enhancements will allow it to do far more than tiering alone.</p> <p>The key idea behind data classification is reflected in its earlier name - DHT over DHT. 
Our "translator" abstraction allows us to have multiple instances of the same code active at once, differing only in their parameters and relationship to one another. It's just one of many ways that GlusterFS is more modular than its closest competitors, even though those are implemented in more object-oriented languages. To see how this kind of setup works, let's start with an example <em>without</em> it, capable of implementing only the simplest form of tiering.</p> <p><img alt="image" src="" /></p> <p>In this example, we have four bricks each consisting of a smaller SSD component (red) and a larger spinning-disk component (blue). This can easily be done using something like <a href="">dm-cache</a>, <a href="">Bcache</a>, <a href="">FlashCache</a>, or various hardware solutions. Those hybrid bricks are then combined, first into replica pairs and finally into a volume using the DHT (a.k.a. "distribute") translator. This approach actually works pretty well and is easy to implement, but it's less than ideal. If your working set is concentrated on anything less than the entire set of bricks, then you could fill up their SSD parts and either become network-bound or have accesses spill over to the spinning-disk components even though potentially usable resources on other bricks remain idle. This approach doesn't deal well with adding more resources in anything but a totally symmetric fashion across all bricks, and in particular precludes concentrating those SSDs on a separate set of beefier servers with extra-fast networking. Lastly, it doesn't support tiering across different encoding methods or replication levels, let alone the other non-tiering functions mentioned above. Now, consider this different kind of setup.</p> <p><img alt="image" src="" /></p> <p>Here, the left half is our fast working-storage tier and the right half is our archival tier optimized for storage efficiency and survivability instead of performance. 
Note that this is a logical/functional view, not a physical one. A1 and A2 might still be on the same server, but now their logical relationship has changed and so they could also be moved separately.</p> <p>Our performance tier looks much like the whole system did before, with bricks arranged into replica sets and then DHT (as it is today). However, we've split off the spinning disks into a whole separate pool, and put a new "tiering" translator (a modified version of DHT) on top. Here's the cool part: that "replicate 3" layer might actually be erasure coding instead of normal replication. That would suck for performance, but since this is only used for our slow tier that's OK. 90% of the accesses to the fast tier + 90% of data in the storage-efficient tier = goodness. We could also toss in deduplication, compression, or bit-rot detection <em>on that side only</em> for extra fun. Note that we couldn't do this in the other model, because you can't put non-distributed tiering on top of distributed erasure coding. Most other tiering proposals I've seen do the tiering at too low a level, and are far less useful as a result.</p> <p>Finally, let's consider those other functions that aren't tiering. In the second diagram above, it would be trivial to replace the "distribute" component above with one that's making decisions based on rack location instead of random hashing. Similarly, it would be trivial to replace the top-level "tier" component with one that makes decisions based on tenant identity or service level instead of file activity. It's almost as easy to add even more layers, doing all of these things at once in a fully compatible way. No matter what, migrating data based on new policies or conditions can still use the same machinery we've worked so hard to debug and optimize for DHT rebalancing.</p> <p>Over the last few years I've come up with a lot of ways to improve GlusterFS, or distributed filesystems in general, but this is one of my favorites. 
It can add so much functionality in return for so little deep-down "heavy lifting" and that's pretty exciting.</p>Fri, 31 Jan 2014 13:07:00,2014-01-31:2014-01-data-classification.htmldistributedstorageglusterfsThe World Is Not Flat<p>Way back when I was a young pup, either in college or after that but before I started my career, I got to use an operating system called MTS. That stands for Michigan Terminal System. It was created to run on IBM (and later Amdahl) mainframes, when U of M got tired of waiting for IBM to deliver a multi-user operating system. Like most code that old, it was an interesting combination of ideas that have since been abandoned because they were stupid, ideas that were ahead of their time, and ideas that were somewhere in between. Here are some of the more interesting ideas.</p> <ul> <li> <p>The filesystem had a feature to include the entire contents of one file at a specific point within another. Who needs symbolic links when you can just create a file containing a single %include directive? Why would programming languages have to synthesize this behavior in a bazillion subtly different ways if the basic functionality existed natively in the OS? Yeah, I know, record-oriented filesystems (basically a prerequisite for this) lost out to simple byte-streams for many good reasons, but every victory comes at a cost.</p> </li> <li> <p>MTS had a very robust ACL system, which allowed you to control access by user, group, or "pkey" (i.e. what program was running). Much better than set-uid in my opinion.</p> </li> </ul> <p>While I was still using MTS, they added a macro system - what we would now think of as a shell scripting language. One of the very first uses of this macro system was to synthesize a hierarchical directory structure on top of the flat one native to MTS. I really wish I could remember the name of the author, to give credit. He was a Computing Center consultant, and this would have been in 1985 or so, if anybody wants to help me out. 
It was a pretty slick combination of naming conventions and macros, and I think it made many users' lives easier.</p> <p>The reason I started thinking about MTS is that I see people doing the exact same things now - nearly thirty years later - to simulate a hierarchical namespace on top of the flat one provided by most object stores. Let me repeat something I've said many times before, in many ways: flat namespaces weren't just crap in DOS, they were crap in MTS even before that and they're still crap today. <strong>Crap</strong>, I say. Anybody who implements a supposedly modern file/object store with a flat namespace is simply screwing their users to suit their own convenience. The scalability arguments don't hold water, because the scalability issues have more to do with the operations that you have to support (e.g. atomic rename) than with whether or not you have nested directories. This is something that has to be built into the data store, with the necessary recursive name resolution done in one place, one time, by people who understand that data store, instead of being done ten incompatible ways by ten different outsiders. Even quite smart people can trip when they try to bolt on a hierarchical structure <a href="">after the fact</a>.</p> <p>Users have shown over and over again that they want flexibility to organize and reorganize their data, in ways richer than a flat or even single-level hierarchy will allow. Maybe there's an even better way, but so far none of the attempts to replace nested directories with tags or links or database-like queries seem to have gained much traction. Until someone comes up with something better, the nested-directory structure should be considered a minimum standard for anything that's supposed to replace the traditional filesystem.</p>Sun, 29 Dec 2013 20:00:00,2013-12-29:2013-12-world-is-not-flat.htmlstoragedistributedData Extortion<p>This is a story about the dark side of moving your stuff into the cloud. 
It does have a (reasonably) happy ending, but along the way there are some important lessons to be learned about the relationship between cloud users and cloud providers, and how it's possible for people on either side to get burned. There are some bits about contract (and other) law, and customer service, and other things as well, but let's begin with what happened this morning.</p> <p>Between my reduced hours and the Christmas shutdown, I figure I owe Red Hat about 4.8 hours of work this week. I didn't do it on Monday, so I figured I'd do it this morning. I decided to debug some performance-testing scripts, but since my machines at work are all powered off I figured I'd do it on my cloud machine at Host Virtual. For debugging, I only needed to do short runs - no more than forty seconds or one gigabyte per run, as it turns out. I'd done about a dozen of these when my machine became unresponsive. What gives? I looked around all the usual ways, then logged in to my Host Virtual console to see that my VM was locked with the following message.</p> <blockquote> <p>i/o abuse from your vm - we are investigating</p> </blockquote> <p>There are two things wrong here. The less important problem is the premature "abuse" accusation. "Anomalous" would have been fine, "excessive" might even have been OK, but "abuse" is insulting a customer for no good reason. More importantly, locking the VM was a complete overreaction. The tools exist to throttle the I/O from a particular VM instead of shutting it down entirely. I've seen such throttling kick in when testing on other providers many times (more about that in a minute). Even when a shutdown is considered necessary, it's <strong>never</strong> appropriate to do it without notification. 
By their own admission they were still investigating, but from my perspective they had already gone beyond investigation to accusation, conviction, and execution.</p> <p>At this point, I submitted a ticket explaining what I had been doing, and suggesting that their reaction had been premature. If they had just admitted as much, things would have been fine. If they had asked me to reduce my I/O load, I would have. Instead, Customer Disservice Representative par excellence "Mark M" replied saying that I had been affecting other customers and violating their Acceptable Use Policy. Unfortunately, there's nothing to back up that claim. There is no I/O limit specified in their AUP. None. The closest they get is this.</p> <blockquote> <p>We have determined what constitutes reasonable use of our network for the particular services and products you purchase from us. These standards are based on typical customer use of our network, for similar services and products. It is your obligation to monitor the use of your services and/or server(s) to ensure that there are not unusual spikes and peaks in your bandwidth or disk usage.</p> </blockquote> <p>In other words, they claim to have some numbers in mind, but won't commit to them in their own AUP. Because those limits aren't specified, even by reference to another document or method of notification, they're legally nonexistent. Even now, nobody at HV has identified a limit that I exceeded, by how much, or for how long. They can't claim any AUP violation without such specifics, and thus they can't claim any right to modify our existing relationship in any way. So, their AUP claim is complete bullshit. What about the "affecting other users" claim?</p> <p>Well, sorry, but tough cookies for them. Do you know who's responsible for meeting their obligations to other customers? <strong>Them</strong>. As it happens, I know quite a bit about the problems and technologies involved in providing these kinds of services. 
In the course of becoming an expert on cloud I/O performance, I've done this same sort of testing on about twenty providers. I've seen the "noisy neighbor" problem from both sides. I've seen my own I/O performance go all over the map because of other users, and I've seen it throttled into the ground supposedly to protect other users from me. I don't love being throttled, but it's an entirely valid response so long as its depth and duration are protective rather than punitive. More importantly, it proves that <strong>the technology exists</strong>. If HV chooses not to apply it, that's their fault. They can't simultaneously preach about meeting commitments to users while spitting on their commitment to me.</p> <p>The funny thing is that until now I've been one of HV's biggest boosters. They seemed to be one of the few providers whose I/O performance was marred by neither massive variability nor punitive throttling. Little did I know that their "secret" was to kill VMs arbitrarily when they got in the way. In any case, I expressed my skepticism about their AUP claim, and my dissatisfaction with the lack of notification. That's when "Mark M" really stuck his head up his ass.</p> <blockquote> <p>if the abuse is ongoing and continued your account will simply be terminated and your server deleted.</p> </blockquote> <p>What we now have is someone threatening a customer with <strong>deletion of data</strong> in response to a "violation" that has already been called into question. That's extortion. There is absolutely no situation where it would be appropriate to delete a server while such disagreements are still outstanding, and the fact that Marky Mark regards it as a "simple" matter is appalling.</p> <p>So I've already moved all my data, and I'll be warning everyone away from this decade's version of Feature Price (widely regarded as the worst web host ever, especially since they also tried to take users' data hostage). No big deal, actually. 
What's far more important is that this could happen <strong>to any user, at any cloud provider</strong>. Go take a good look at your own AUP, TOS, or whatever you think spells out the obligations back and forth between you and your cloud provider. How many MB/s may you write, for how long, before they decide you're being "abusive"? Bear in mind that anything left vague might be subject to mind-bending reinterpretation, and anything left out (like HV's I/O limits) might be subject to outright fabrication. What recourse do you have if the provider inappropriately terminates service? Do they admit to any obligation regarding preservation of data while there is an ongoing dispute? I've seen a whole lot of these documents, and all of these things are typically missing. Maybe it's time for someone - users, providers, please not the government - to define minimum standards that cloud providers should meet regarding these sorts of issues. The better providers won't have any problem signing up. The worse ones? Well, I suppose they'll keep on threatening and extorting - and losing - their customers.</p> <p>By the way, welcome to the new site.</p>Fri, 27 Dec 2013 21:50:00,2013-12-27:2013-12-data-extortion.htmlcloudlegalRoll Back or Rock On?<p>For a while now, Kyle Kingsbury has been doing some <a href="">excellent work</a> evaluating the consistency and other properties of various distributed databases. His <a href="">latest target</a> is Redis. Mostly I agree with the points he makes, and that Redis Cluster is subject to inexcusable data loss, but there is one point on which my own position is closer to the opposition.</p> <blockquote> <p>we have to be able to roll back operations which should not have happened in the first place. If those failed operations can make it into our consistent timeline in an unsafe way, perhaps corrupting our successful operations, we can lose data.</p> </blockquote> <p>Those are strong words, but their strength is not matched by their precision. 
What does "unsafe" really mean here? Or "corrupting"? I'm the last person to take data corruption or loss lightly, but that's precisely why I think it's important to be crystal clear on what they mean. How is it "corruption" to perform a write that the user asked you to perform? The answer depends very much on what rules we're actually supposed to follow. Let's start with some of the most basic requirements for any distributed storage system.</p> <ul> <li> <p>Internal consistency: all nodes will eventually agree on whether each write happened or not. (Note: this is more CAP consistency than ACID consistency).</p> </li> <li> <p>Durability: once a write has completed, it will be reflected in all subsequent reads despite transient loss of all nodes and/or permanent loss of some number (system-specific but always less than quorum).</p> </li> </ul> <p>We're not done yet, because we've only defined an internal kind of consistency. As many have pointed out, a distributed system includes its clients. A system that simply throws away all writes could satisfy our requirements, so let's add a more externally oriented consistency requirement.</p> <ul> <li>External consistency: any write that has been acknowledged to the user as successfully completed must be complete according to the durability definition.</li> </ul> <p>That's really about it. The last acknowledged write to a location will eventually become available everywhere, and remain available unless the failure threshold is exceeded (or a user deliberately overwrites it but that's a different matter). There are certainly many more requirements we could add, as we'll see, but these few are sufficient for a usable system.</p> <p>One thing that's noticeably missing from our external-consistency rule is anything to do with <strong>un</strong>acknowledged writes. Unless we add more rules, the system is free to choose whether they should be completed or rolled back (so long as our other rules are followed). 
Here's a rule that would force the system to decide a certain way.</p> <ul> <li>Any write that has <strong>not</strong> been acknowledged to the user must <strong>not</strong> be reflected on subsequent reads.</li> </ul> <p>That should be pretty familiar to database folks as isolation (plus a bit of atomicity), and it's no surprise that database folks would assume it . . . but you know what they say about assumptions. Other kinds of systems, such as filesystems, do not have such a requirement. Instead of appeal to (conflicting) authority or tradition, let's try taking a look at what's actually right for users.</p> <p>Unacknowledged writes fall into two categories: still in progress or definitively failed. For in-progress writes, isolation can be enforced by storing them "off to the side" in one way or another. This doesn't work for definitively failed writes, because "off to the side" is finite. Those writes have to be actually removed from the system - i.e. roll-back. The problem is that roll-back is subject to the same coordination problems as the original write and carries its own potential for data loss. In fact, for a write that overlaps with a previous one and succeeded at some nodes but not others, data loss absolutely <strong>will</strong> occur either way. The difference is only which data - old or new - will be lost.</p> <p>So, back to what's right for users. Why is it better to lose the data that the user explicitly intended to overwrite than to lose the data that they explicitly intended to put in its place? Trick question: it's not. The careless user who didn't bother checking return values would obviously be better served by moving forward than by rolling back . . . but who cares about them? More importantly, even a diligent user who does check error codes should be aware that lack of acknowledgement does not mean lack of effect. By now "everybody knows" that if you send a network message and don't get a reply you can't assume it had no effect. 
The same "lost ack" problem exists for storage I/O as well, and has forever. In both worlds, the "must have had no effect" assumption is just as dumb as the careless programmer's "must have worked" assumption.</p> <p>If we exclude the careless and truly diligent programmers, the only people left who would care about not having rollback would be those who know to check for errors but don't know or don't care enough to handle them properly. They must also be comfortable with the performance impact of roll-back support, most often from double writing. I'm not saying these people don't exist or their concerns aren't valid, but clearly roll-back is not the best or only system-design choice for everyone. Building a system that tries to keep as many writes as possible instead of throwing away as many as possible is an entirely valid option.</p> <p>If Redis threw away 45% of acknowledged writes in Kyle's testing, that's a serious problem. That violates our consistency rule, or any reasonable alternative, and I have no problem saying that such a system is broken. When Kyle adds that Redis "adds insult to injury" by completing all of the unacknowledged writes instead, he's also correct - but it's only an insult, not a new injury. A new injury would be further loss of data, and whether those successful writes represent loss of data is very open to interpretation. If I accidentally knock some money out of your hands, then bend down and pick up only the pennies for you, it's not the pennies that are the problem. It's the money - or data - that got dropped on the floor and left there.</p> <p>(NOTE: it has been pointed out to me that what Kyle tested was not Redis Cluster but a proposed WAIT enhancement to Redis replication. Or something like that. Fortunately, those distinctions aren't particularly relevant to the point I'm trying to make here, which is about the supposed necessity of roll-back support in this type of system. 
Nor does it change the fact that the system under test - whatever it was - failed miserably. Still, I used the wrong term so I've added this paragraph to correct it.)</p>Wed, 11 Dec 2013 17:32:00,2013-12-11:2013-12-roll-back-or-rock-on.htmldistributedGiving Thanks<p>This was inspired both by a <a href="">blog post elsewhere</a> and by a nice email I got this morning thanking me for this blog (thanks Tristan). It seems like we all fail to give thanks, and nowhere more so than in the "gift economy" of open source. I'll start with all of the <strong>code</strong> for which I'm thankful, and then move outward from there.</p> <p>I'm grateful for the operating systems, compilers/interpreters, and text editors I use. For the web browsers, email clients, and servers of all kinds. For the hardware, from chips up to systems, that runs all of this code. (We software folks are <strong>really</strong> bad about recognizing all of the efforts that are made before we even start.) I'm grateful for the internet in all of its physical, technical, and financial manifestations. It is truly a wonder that I can carry a device anywhere that lets me sit down wherever, connect wirelessly to the rest of the world, and work or play. Lastly, I'm grateful to all of the computer scientists and mathematicians and physicists and all sorts of real engineers who toiled away, often in obscurity, to lay the foundations for all of this.</p> <p>OK, so much for the purely technical. I'd also like to thank all of my colleagues, past and present, for helping me achieve whatever it is that I've achieved, for providing intellectual challenges, and (sometimes) for pure camaraderie. I'd like to thank my current bosses at Red Hat, for letting me take time off and reduce my hours so that I can stay sane. Very few of my past bosses would have done so much. 
Thanks to all those who have to work today, and work every day, to create the environment that allows me such freedom and opportunity - soldiers, police and other emergency workers, doctors, nurses, the people who maintain our power and communications grids, inspectors, regulators, and so on. Yes, even legislators, judges, mayors, governors, and presidents.</p> <p>There are even more people to thank, but I have to cut myself short so I can thank the most important group: my family. Yes, I know it's trite, but it's also true. Without them I wouldn't be able to do the other things you all get to see, and I wouldn't have any reason to, and I wouldn't have anything else to go back to when I'm done. Family, whether inherited or chosen as friends, is really the basis of everything else. Let's all try not to forget that.</p> <p>Now, off to lunch.</p>Thu, 28 Nov 2013 11:43:00,2013-11-28:2013-11-giving-thanks.htmlShared Libraries are Obsolete<p>I was around when shared libraries were still a new thing in the UNIX world. At the time, they seemed like a great idea. On multi-user systems like those I worked on at Encore, static linking meant not only having a separate copy of the same code in every program, but having a separate copy even for every user running the same program. The waste of both disk space and memory was a serious concern. Making shared libraries work required a lot of effort from both compiler and operating-system people, but it was well worth it.</p> <p>Fast forward to the present day. Not only are disk and memory cheaper, but there aren't as many users running copies of the same program on the same machine either. Those savings are neither as big nor as important as they used to be. At the same time, shared libraries have created a whole new world of software maintenance problems. In the Windows world this is called "DLL Hell" but I think the problem is even worse in the Linux world. 
When every application depends on dozens of libraries, and every one of those libraries is shared, that means dozens of possibilities for an upgraded library to cause a new crash or security failure in your application. Yes, sometimes bugs can be fixed without needing to rebuild applications, but I challenge anyone to show empirical evidence that the fixes are more common than the breakage.</p> <p>People actually do test their applications against specific combinations of the libraries they depend on. If there are bugs in that combination, they get found and fixed or worked around. Every behind-the-back library upgrade creates a new <strong>untested</strong> configuration that might be better but is more likely worse. In what other context do we assume that an untested change will "just work"? In what other context should we? Damn few. Applications should run with the libraries they were tested with.</p> <p>At this point, someone's likely to suggest that strict library versioning solves the problem. It sort of does, so long as the library version includes information about how it was built as well as which version of code, because the same code built differently is still likely to behave differently sometimes. Unfortunately, it just trades one problem for another - dependency gridlock. If every application specifies strict library dependencies, then what do you do when a library changes? If you blindly mass-rebuild applications to update their dependencies, then you haven't solved the "untested combination" problem. If you keep old library versions on the system, then you've thrown away the advantage of having shared libraries in the first place. Either way, you've created a package-management nightmare both for yourself and for distribution maintainers.</p> <p>Shared libraries still make sense for the very low-level libraries that practically every application uses and that users are already wary of updating, like glibc. 
If the library maintainer's testing and API-preservation bar is higher than most app developers', that's OK. In almost every other case, you're probably better off with statically linked and tested combinations of apps and libraries. If you want to save some memory, make sure the load addresses are page aligned and do memory deduplication. Otherwise, you're probably just saving less memory than you think at the expense of much more important stability and security.</p> <p>Update: This post sparked a fairly lively <a href="">Twitter conversation</a>.</p>Tue, 26 Nov 2013 17:23:00,2013-11-26:2013-11-shared-libraries.htmloperating-systemsFixing Fsync<p>When I wrote about how <a href="">local filesystems suck</a> a while ago, it sparked a bit of debate. Mostly it was just local-filesystem developers being defensive, but Dave Chinner did make the quite reasonable suggestion that I could help by proposing a better alternative to the fsync problem. I've owed him an answer since then; here it is.</p> <p>To recap, the main problem with fsync is that it conflates <em>ordering</em> with <em>synchrony</em>. There's no way to ensure that two writes happen in order except by waiting for the first to complete before issuing the second. This sacrifices any possibility of pipelining requests, which is essential for performance. What's funny is that local filesystems themselves take advantage of a model that does allow such ordered pipelines - tagged command queuing, which I first encountered twenty years ago when I worked with parallel SCSI. The basic idea is that a device has multiple queues. Each request specifies which queue it should go on, plus some bits to specify how it should be queued and dequeued.</p> <ul> <li> <p>SIMPLE means that there are no particular queuing or ordering restrictions. 
A series of SIMPLE requests can be reordered and/or issued in parallel with respect to each other, but not with respect to non-SIMPLE requests.</p> </li> <li> <p>ORDERED means that the request must wait for all earlier requests on the same queue, and all later requests on the same queue must wait for it. This allows pipelining, but not parallelism or reordering.</p> </li> <li> <p>HEAD is the same as ORDERED, except that the new request is inserted at the head of the queue instead of the tail. This is generally a very bad idea, but it's necessary in certain situations. For example, the drivers I was writing used it to issue the commands for controller failover while leaving the rest of the queue intact.</p> </li> </ul> <p>The funny thing is that this model has been around so long that it has bubbled up to the OS block layer, where local filesystems can take advantage of it to ensure correct ordering while maintaining performance, but then those same local filesystems don't expose it to anyone else. Seems rather selfish to me.</p> <p>The obvious solution is simply to add queue/type parameters to writev (and possibly other calls as well). Current behavior is equivalent to SIMPLE queuing. Fsync is equivalent to an ORDERED no-op issued synchronously. That's all very well, but the model provides pretty obvious ways to do even more interesting things.</p> <ul> <li> <p>An asynchronous fsync becomes possible simply by issuing an ORDERED no-op (zero-length write?) using AIO. You don't have to <strong>wait</strong> for it, but you can be assured that it's in the pipeline and order will be maintained.</p> </li> <li> <p>If you only need ordering between <strong>some</strong> requests, you can use ORDERED on multiple queues.</p> </li> </ul> <p>This is a clean, powerful, and well proven model. Unfortunately, local-FS developers will probably argue that it's too hard to implement (even though they've already implemented it for their own uses e.g. ordering data vs. inode writes). 
However, most of this complexity has to do with multiple queues. Implementing just SIMPLE/ORDERED without multiple queues would be much easier, and still much better than what we have now.</p> <p>The other problem with fsync is that it only flushes the pipeline for a single file descriptor (actually in practice it's more likely to be the inode). If you want to flush the pipelines for a bunch of file descriptors, you have to issue fsync for each one separately. This is not just an inconvenience; it means that you either need to wait for N fsyncs in sequence, or have N threads handy to wait for them in parallel. The other alternative is to issue syncfs instead - possibly having to wait for I/O from other applications and other users as well as your own. All of these options are awful. A better option would be a way to group file descriptors together through a single "special" one, and then issue more powerful combined operations on that. In fact, such an interface already exists - <em>epoll</em>. Some of that same code could probably be reused to implement a way of flushing multiple files instead of waiting for them. At the very least, this would make flushing lots of files at once simpler and less syscall-intensive. Even better, a decent implementation might allow filesystems to reason about a whole bunch of fsyncs <strong>as a group</strong> and optimize how all of the relevant block-level I/O gets done. I don't expect that to happen soon, but at least the right API makes it possible.</p> <p>Of course, it's always easy to make suggestions for other people to implement. I try not to tell other people their business, because I have quite enough of them telling me mine. Nonetheless, the need is there and I was asked to propose a solution, so I have. 
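</p> <p>To make the cost of the current interface concrete, here is a minimal Python sketch of the "N threads" workaround described above. <code>flush_all</code> is a hypothetical helper name, not an existing API; it relies only on <code>os.fsync</code> and the standard thread pool. A grouped flush of the kind proposed here would replace the whole pool with a single call, which this sketch does not attempt to model.</p>

```python
import os
from concurrent.futures import ThreadPoolExecutor


def flush_all(fds, max_workers=8):
    """Fsync every descriptor in fds, waiting for all of them in parallel.

    Hypothetical helper illustrating today's workaround: one blocking
    os.fsync() call per descriptor, spread across a thread pool so the
    waits can overlap instead of running back to back.
    """
    fds = list(fds)
    if not fds:
        return 0
    workers = min(max_workers, len(fds))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() consumes every result, so any OSError is re-raised here.
        list(pool.map(os.fsync, fds))
    return len(fds)
```

<p>Even in this small form, the caller pays for one syscall and one waiting thread per descriptor, and the filesystem never sees the flushes as a group.</p> <p>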
Maybe if the people whose job it is don't want to do it themselves then I'd even be willing to help.</p>Fri, 22 Nov 2013 12:30:00,2013-11-22:2013-11-fixing-fsync.htmlstoragelinuxThe "IOPS Myth" Myth<p>It's nice to see more people becoming aware that IOPS are not the be-all and end-all of storage performance. Unfortunately, as with all consciousness raising, the newest converts are often the ones that take things too far. Thus, we get extreme claims like <a href="">IOPS Are A Scam</a> or <a href="">Storage Myths: IOPS Matter</a>. What a surprise, that somebody who works for Violin would claim that it's all about latency, all the time. Let's get away from the extremists on both sides, and try to find the truth somewhere in between.</p> <p>As people who have actually worked on storage - and particularly storage performance - for a while know, different workloads present different needs and challenges. The grain of truth in the extremists' claims is that latency really is king for some applications. Those applications are the ones where I/O is both serialized and synchronous. Quick, how many of those do you run? Probably very few, quite possibly even fewer than you think, for a few different reasons.</p> <ul> <li> <p>Most modern applications have some internal parallelism. Therefore, their performance often is bound more by IOPS than by pure latency.</p> </li> <li> <p>If applications do asynchronous writes, then the I/O system responsible for ensuring their (eventual) completion can take advantage of parallelism even if the writes were issued sequentially. This is what's likely to happen every time you do a series of writes followed by fsync. It can be done all the way from the filesystem down to an individual disk sitting in another cabinet.</p> </li> <li> <p>Even applications that do serialized and synchronous writes often only do so some of the time - e.g. for logs/journals but not for main data. 
These applications are often latency-bound, but that doesn't mean low latency is necessary for every bit they store.</p> </li> </ul> <p>I've made the point myself, <a href="">quite recently</a>, that it's important to look at all aspects of storage performance, including predictability and behavior over time. That still doesn't mean that you should just pick one set of characteristics as "best" and leave it at that. You're going to be using many kinds of storage. Get used to it. For example, you might need low latency for 1% of your data that's written serially, high IOPS for the next 9% that's still warm but read/written in parallel, and neither (at lowest possible cost) for the cold remainder. In that middle part, low-latency storage would be overkill. What matters is how many IOPS you can get within a single system, to avoid the management and resource provisioning/migration headaches of having several. Thus, a high-IOPS system still has value even if it doesn't also offer low latency. If that weren't true, nobody would even consider using S3 or Swift let alone Glacier, since those all have <strong>terrible</strong> latency characteristics.</p> <p>In short, "latency is king" is the new "scale up" motto, but we mostly live in a "scale out" world. Yes, sure, there are situations where you just need a single super-fast widget, but <strong>much</strong> more often you need a whole bunch of more conventional widgets providing high aggregate throughput within a single system. Low latency and high IOPS are entirely complementary goals. Just as there have been valid uses for both mainframes and supercomputers since they started to diverge in the 70s, there are valid uses for both types of storage systems. 
Those designing or selling one should not lightly dismiss the other, lest that lead to a discussion of who's merely picking components and who's solving hard algorithmic problems.</p>Thu, 21 Nov 2013 09:57:00,2013-11-21:2013-11-iops-myth.htmlstorageperformanceSecure Email<p>Ever since one of the talks at LISA, I've been thinking about secure email. My thoughts are nowhere near complete, but I need to get them out of my head and I do that by writing about them. Apologies in advance.</p> <p>I've actually been thinking for many years about how email should be overhauled. For at least twenty years the idea that the same message contents get stored over and over again for multiple users, even on the same system, has bugged me. Sure, nowadays we have deduplication, but that's a hack. At the time an email is sent, we know for almost zero cost and with absolute certainty that the body is the same for every recipient. Why rely on expensive and approximate deduplication to make up for the fact that we were too stupid to take advantage of that information within the email system itself? For those same twenty-plus years, I've been thinking about how to implement email by separating storage and notification. The message contents get stored <em>once</em> in a data store that's accessible to the sender and recipients, then pointers to those contents are sent separately. In fact, I would be surprised if large email services such as those run by Google or Yahoo don't work that way for messages sent among their own subscribers.</p> <p>Unfortunately, this approach is incompatible with the current email protocols such as IMAP and SMTP. They don't separate storage and notification that way. Sure, you can do it all in the servers, but then you have the same problem as with most cloud-storage services that do something similar: if the server has your ciphertext and your keys, they might as well have your cleartext. 
They can talk all they like about how carefully they manage those keys, but it's all bullshit. Some of us were talking about this years ago, and built systems like HekaFS to address it, but were largely ignored. If there's one good thing that has come out of the recent NSA/Google revelations, it's that people finally realize <strong>keys have to stay on the client side</strong>. Thank you, Edward Snowden.</p> <p>The way around this is to use a local proxy on the user's machine. On one side it speaks IMAP and/or SMTP. On the other, it speaks the protocols necessary to interact with our secure data store and notification system. This requires only a very tiny bit of extra configuration by the user to point their email program at the proxy instead of a regular server, but then it opens up a whole new world of possibilities that don't exist when trying to preserve legacy protocols throughout the system. Let's look at how this would work in the context of email between users of the same provider.</p> <ol> <li> <p>The sender's email client talks to their proxy, using local SMTP, to send a message.</p> </li> <li> <p>The sender's proxy generates a new symmetric encryption key and initialization vector (IV) and encrypts the message - including both the contents and the "envelope" metadata. It also generates an HMAC to protect against both corruption and tampering.</p> </li> <li> <p>The encrypted message, IV, and HMAC are stored in the provider's message store, yielding an ID. The message store can be pretty plain, or it can have all sorts of features to improve security. For example, if traffic analysis to match senders with receivers is a concern (and it should be) then the provider can implement techniques known from Freenet/Tahoe-LAFS to foil such attempts.</p> </li> <li> <p>Anybody who has the ID from the previous step can now retrieve the message, but it's still encrypted using a unique key. 
This key is <strong>not stored anywhere</strong> (except maybe on the sender's machine, but ideally not even then). What we do instead is construct a separate notifier for each recipient, encrypting the message ID and key using that particular recipient's public key.</p> </li> <li> <p>At this point, the recipient could be notified synchronously, connecting to them via SSL or similar. This provides the best forward secrecy, but also requires that the recipient be online to receive the notification. More often, the notifiers will need to be stored somewhere for later retrieval. In this case, we could use a second kind of distributed data store, much like the message store and with the same potential for additional code to foil traffic analysis etc. Each user is represented by an existing file or object, and sending a message is just a matter of appending a new notifier.</p> </li> <li> <p>Some time later, a recipient's email client talks to their proxy, this time using local IMAP, to check for messages.</p> </li> <li> <p>The recipient's proxy fetches their file/object from the notification store, and possibly truncates it back down to zero.</p> </li> <li> <p>For each notifier received, the proxy extracts the message ID and key, then uses them to fetch the corresponding message from the message store.</p> </li> <li> <p>Messages are decrypted and translated into IMAP responses to the recipient's email client, as needed.</p> </li> </ol> <p>This scheme seems as secure as anything I've heard described elsewhere, and neither hard to implement nor hard to use. The biggest problem with it that I can think of is garbage collection. To do that properly, objects in the message store would need to have reference counts, with an authenticated decrement protocol or some such. To start with, I'd probably just avoid that by saying that messages have expiration dates. The provider's guarantee of security matches their guarantee of persistence. 
If you don't fetch your messages before they expire, too bad. If you want to keep copies longer, then you have to fetch and store them separately, assuming responsibility for securing the copy (or perhaps that's a separate service offered by the same provider).</p> <p>That's all great within a single provider. How well does it extend out to many providers like we have in the real world? Not that well, unfortunately, but I think that's OK. Just having truly secure email within one provider would be useful. It doesn't seem all that hard to come up with new protocols between providers, allowing them properly controlled access to each other's message and notification stores. Thus, providers that use such protocols could create a whole secure-email ecosystem. Perhaps this is what Lavabit and Silent Circle are already doing within the <a href="">Dark Mail Alliance</a>, but they're being awfully quiet about the details. The key is that secure email practically has to be a <em>separate</em> ecosystem from the email we already have. A lot of the user-facing parts can still be used without too much trouble, but the entire transmission and storage infrastructure will have to change. While I'm sure people can poke all sorts of holes in what I've outlined above, perhaps something in it will provoke some productive thought. The time for keeping ideas in this area to ourselves is over.</p>Thu, 14 Nov 2013 12:26:00,2013-11-14:2013-11-secure-email.htmlsecuritycommunicationsMoot Comments<p>I have a couple of posts coming up where I'll be soliciting feedback, so it's time to implement blog comments again. After looking at the alternatives, I eventually decided that <a href="">Moot</a> had the best combination of features for me (as the guy who has to integrate them) and my users. As it turns out, integrating Moot for all posts including those in the past was laughably easy. Kudos to them for a job well done.</p> <p>I have no idea how long this will last. 
Most of the comments I've received in the past have seemed more "in the moment" than "part of the permanent record" anyway, so I hope it's OK to be explicit that <strong>comments here are ephemeral</strong>. I make no promise to preserve them, either individually or entirely, but maybe they're more convenient (and less spam-prone) than email. Let me know . . . in the comments. ;)</p>Thu, 14 Nov 2013 10:29:00,2013-11-14:2013-11-moot-comments.htmlmootcommentsComedic Open Storage<p>I've written before about some people's <a href="">mania</a> for object storage as an alternative to blocks and files. It's a valid model, but I do think its benefits are being pretty drastically oversold. Often there's a lot of FUD about distributed filesystems in particular, from people who clearly don't know the details about what features they have or how they work. As a result, even though some people seem pretty excited about Seagate's new <a href="">Kinetic Open Storage</a> initiative, I approached it with a bit more skepticism. Here's the short version.</p> <ul> <li> <p>It's great that somebody's implementing object storage at this level.</p> </li> <li> <p>This particular implementation is a joke.</p> </li> </ul> <p>I'm not just being nasty for no reason. There's a very real danger, with a technology like this, of early implementations over-promising and under-delivering so badly that by the time a good implementation comes along nobody can get over the bad taste in their mouth from the last version. That's what happened in distributed filesystems twenty years ago. Even though things have improved since then, there are still plenty of people who've never moved past "those things don't work" and don't even do the most basic research into the current state of the art before they go off and implement their own crappy incompatible almost-filesystem storage layers. I <strong>don't</strong> want object storage to be abandoned like that. 
I want it to succeed, but to do that it has to offer a better value proposition.</p> <p>Before I start talking about the ways KOS falls short, I have to start by saying that I'm talking about details and the documentation so far almost seems intended to obscure those details. The wiki is long on rhetoric, short on information. For example, I had to dig a bit to find the maximum size of a key (a potentially wasteful 4KB), and I still haven't found the maximum size of an object. So I cloned the preview <a href="">repository</a> and found a big steaming pile of javadoc. It's not even the good kind of javadoc; it's a lot more of the "bytearray: an array of bytes" boiler-plate kind. So I might actually be wrong about some of the details. If so, I'll update appropriately.</p> <p>My first objection has to do with NIH syndrome. After all, these ideas first reached prominence with Garth Gibson's <a href="">NASD</a> back in 1999, and later influenced the ANSI T10 object-storage standard. Back when it was still a PhD thesis, Ceph used a similar model called EBOFS (since abandoned in favor of btrfs), and there are others as well. Instead of building on - or even acknowledging - these predecessors, Seagate went off and developed Yet Another Object Storage API. Then, instead of documenting wire formats and noting differences vs. things people might already know, they just threw a Java library over the wall. Nice.</p> <p>The second objection is security. There's a reason the S in NASD stands for Secure. If you want to gang a bunch of these devices together as the basis for a multi-user or multi-tenant distributed system, you'd better think hard about how to handle security. Apparently KOS didn't. There's some fluff about on-disk encryption, but nothing about key management, connection security, the actual semantics of their ACLs, etc. 
This information is not just "nice to have"; it's absolutely essential before developers can even begin to reason about the system they'll be coding for.</p> <p>My third and most serious objection has to do with supporting only whole-object GET and PUT operations. That's fine for a key/value store or a deep archival store (the very opposite of "kinetic" BTW) but for anything else it's awful. If the objects can be very large, then updating any part of one involves a horrendous read/modify/write cycle. If they're kept small, then a higher level has to deal with the mapping from larger user-visible objects to smaller Kinetic objects. If there are multiple clients - and when are there not? - then there are some pretty serious coordination problems involved, and apparently not even a "conditional put" to help deal with the obvious race conditions. Instead of abstracting away the details and difficulties of modifying a single byte within an object (the original NASD vision), KOS requires the involvement of a robust coordination layer for even the simplest operations. Building cluster filesystems on top of shared block devices didn't work too well when the blocks were fixed size. Variable-sized blocks with 4KB keys don't change the equation much.</p> <p>As far as I can tell, this project does very little to help distributed-storage users and developers to meet their needs. Instead it creates false differentiation, disrupting for the sake of disruption or perhaps trying to justify higher margins in a cut-throat industry. It's like a double agent in the object-storage camp, potentially sabotaging others' efforts to have that vision accepted in the broader market.</p>Thu, 24 Oct 2013 11:32:00,2013-10-24:2013-10-comedic-open-storage.htmlstorageWhy You Don't Need STONITH<p>(This started as a <a href="">Hacker News discussion</a> about an <a href="">article on Advogato</a>. 
The article's title/premise is "Why You Need STONITH" where "STONITH" means "Shoot The Other Node In The Head" and is an important concept in old-school HA. I might even have been present when the acronym was coined, after having used a similar one at CLaM.)</p> <p>I was working on HA software in 1992. Specifically, I was working on the software from which Linux-HA copied all of its terminology and basic architecture. We ourselves were not the first, and often found ourselves copying things done even earlier at DEC, so I'm not complaining, but I want to make the point that this article from 2010 is actually a rehash of a much older conversation. As cute as the metaphor is, it gets two things seriously wrong.</p> <p>(1) Fencing and STONITH are not the same thing. Fencing is shutting off access to a shared resource (e.g. a LUN on a disk array) from another possibly contending node. STONITH is shutting down the possibly contending node itself. They're quite different in both implementation and operational significance. Using the two terms as though they're interchangeable only sows confusion.</p> <p>(2) You only need STONITH if you have the aforementioned possibly contending nodes - in other words, only if the same resource can be provided by/through either node. If the resources provided by each node are known to be different, as e.g. in any of the systems derived from Dynamo, then STONITH is not necessary.</p> <p>To elaborate on that second point, the problem STONITH addresses is one of mutual exclusion. It might not be safe for the resource to be available through two nodes, because it could lead to inconsistency or because they can't both do a proper job of it simultaneously. As in other contexts, mutual exclusion is a useful primitive but often not the optimal one to use. In general it's better to avoid it by avoiding the kinds of resource sharing that make it necessary. 
That's why "shared nothing" is the most common model for such systems designed in the last decade or more, and they don't need STONITH unless they've screwed up by not fully distributing some component (such as a metadata server for a distributed filesystem).</p>Tue, 22 Oct 2013 09:26:00,2013-10-22:2013-10-stonith.htmlarchitectureLeaning Out<p>In April of '89 I left my family and friends to move from Michigan to Massachusetts for a programming job. The new job paid twice as much as my first programming job had, which means three times as much as I was making since that company laid me off, so it seemed like a pretty big step in my then-new career. I hit the road in a used Ford Escort that I'd just bought for $900, and which barely survived the 800-mile trip. Stupid piece of junk.</p> <p>Since then I've worked at a bunch of companies. Red Hat is the only one of those that I joined when it was already large. Of the ten startups (including the one in Michigan) none went to an IPO. One was the subject of a moderately successful acquisition (Conley by EMC). Four more were moderately successful for a while in some niche, and one of those is still going. The rest were either acquired at fire-sale prices or just sank without a trace (but only one while I was still on board). In other words, I did better than average. As much as we all like to dream about billion-dollar exits, the grim reality is that most startups fail suddenly and completely, leaving employees in the lurch.</p> <p>It's a good thing there are plenty of other reasons to work at startups. You get to learn a lot that way, both technically and otherwise, in a short time. You'll almost certainly be given more responsibility and more freedom than you would at a larger company. You're more likely to work on cutting-edge technology (though not every large company is as far behind as some would have you believe). 
Startups provide a lot of opportunity in a very energizing environment.</p> <p>Unfortunately, anything that's energizing in the short term is likely to become tiring in the long term, and that's what I'm here to write about today. While I'm not at a startup any more, that's how I got to where I am today. It's how I got to where I am in terms of being hired at Red Hat to start one project and then become an architect on another. It's also how I got to be kind of burned out, not just on my current job but on programming in general. 24 years is a long time - six bachelor's degrees, two of them entirely at specific companies. To explain how that feels, I'll start with a quote from The Hobbit.</p> <blockquote> <p>Now it is a strange thing, but things that are good to have and days that are good to spend are soon told about, and not much to listen to; while things that are uncomfortable, palpitating, and even gruesome, may make a good tale, and take a good deal of telling anyway.</p> </blockquote> <p>The funny thing about working on any project is that you forget all the good parts. Everything that was initially cool about what it did - and how - becomes so familiar that it's forgotten or taken for granted. Meanwhile, every architectural or design flaw you ever noticed still seems to be there. Bugs get fixed, but every troublesome module or interface is still troublesome. Every critical feature that was missing years ago is still missing, after being put off over and over again while other people's <em>stupid</em> ideas always jump to the head of the queue. After a while, you see nothing good and everything bad. It's an unfortunate quirk of human nature.</p> <p>I must emphasize that this phenomenon isn't because of the code. It's a change in a person's relationship to code. I've worked on enough projects to know it's not just one, and I've talked to enough other developers to know it's not just me. 
Familiarity truly does breed contempt when it comes to code, and outsiders could be forgiven for thinking that "this code sucks" is the official motto of our profession. For what it's worth, I've also changed jobs enough times to know that's not necessarily the solution. Remember, I've done that ten times already. The grass isn't really greener, and the same problems tend to reappear everywhere. Eventually, disenchantment with one particular project becomes disenchantment with programming in general, or at least with programming within a particular domain. That's when the feeling of being trapped really sets in.</p> <p>This steady deterioration doesn't just apply to code, either. Similar processes occur with respect to the social and organizational aspects of programming as well. I have a lot more to say about burnout in general (I even have a long post half written about it) but I'll leave that for another day. Today's post is about how I'm trying to fix it. Basically, I have three things I need to do.</p> <ul> <li> <p>Reduce my pace. When I was young, I thought I could sprint forever, but this is a marathon and nobody can sprint at that distance. Nobody. Also, the hills get larger.</p> </li> <li> <p>Catch up on my personal life. Re-start my exercise program, de-clutter and fix stuff around the house, have lunch with friends, play with my daughter. All the stuff I've been too busy, or too tired, or too grumpy to do enough of.</p> </li> <li> <p>Get some perspective. I need to re-familiarize myself with what else is out there in the computing world, not just as a matter of education and keeping up but also to remember what's good and cool about the project I'm on.</p> </li> </ul> <p>This leads to two concrete actions. First, I'm taking a break. I have already asked for, and been given permission for, a month off. That starts this next Monday, and ends in mid-November (right after <a href="">LISA'13</a>). 
Technically, I'm not even supposed to check email during that time. If I do anything technical at all before LISA, it will probably be on things very far removed from my usual work. Maybe I'll learn Go or JavaScript, learn how to write a modern web page or perhaps even a game. I need to do something that <em>doesn't seem like work</em>.</p> <p>After the break, I'm not coming back full time. Like a recuperating patient, I'm not just going to dive right back in trying to do everything I did before. I'm going to take things a bit slower, at least for a while. Then again, it could be permanent. We'll see. The important thing is that I get to try, and I can't say that without acknowledging the role my bosses at Red Hat have played. They haven't just grudgingly or passively allowed me to do this. They have encouraged me, offered valuable suggestions, and agreed to terms far more favorable than I could have hoped for. Kudos to them, and to the company.</p> <p>I don't know how many people can consider the sort of actions I'm taking, but I will say this: beware of burnout. It <em>will</em> creep up on you, no matter how immune you think you are when you're still early in your career. You can't ignore it. You can take positive steps to avoid it, or you can fall prey to it. Don't be one of those people who lose their family or their sanity first.</p>Thu, 10 Oct 2013 20:01:00,2013-10-10:2013-10-leaning-out.htmlworkingModel Checking<p>Model checking is one of the most effective tools available for reducing the prevalence of bugs in highly concurrent code. Nonetheless, a surprising number of even very smart and very senior software developers and architects seem not to know about it. Of the many such people I've worked with over the years, maybe one in ten have even heard of it, and I can count on one hand the number who've appreciated its value. Seems like a good subject for a blog post, then. 
;) Let's start with what the heck it is.</p> <blockquote> <p>Model checking is a technique for verifying finite state concurrent systems such as sequential circuit designs and communication protocols.</p> </blockquote> <p>That's from the blurb for <a href="">Model Checking</a> - the seminal book on the subject by Clarke, Grumberg, and Peled. The way model checking works is by generating states within a system according to rules you specify (the model), and checking them against conditions that you also specify to ensure that invalid states never occur. Some model checkers also check for deadlock and livelock that might preclude reaching a valid final state, but that's not essential. It should be pretty obvious that the number of states even in a fairly simple system can be quite large, so many of the tools also do things like symmetry reduction or Monte Carlo sampling as well.</p> <p>My favorite set of tools in this space is the <a href="">Murphi</a> family, of which <a href="">CMurphi</a> is the one that has been most usable for me recently. Like many such tools, Murphi requires that you specify your model in a language that they describe as Pascal-like but which to my eye looks even more Ada-like. That's really not as awful as it sounds. I've actually found writing Murphi code quite enjoyable every time I've done it. The fact that the model is not written in the same language as the implementation is a known problem in the field. 
On the one hand, it creates a very strong possibility that the model will be correct but the separate implementation will not, reducing the value of the entire effort. On the other hand, traditional languages struggle to express the kinds of things that a model checker needs (and furthermore to work efficiently). I tried to write a real-code model checker <a href="">once</a>, and didn't get very far.</p> <p>To give you some idea of why it's so hard for model checkers to do what they do, I'll use an example from my own recent experience. I'm developing a new kind of replication for GlusterFS. To make sure the protocol behaved correctly even across multiple failures, I developed a Murphi model for it. This model - consisting of state definitions, rules for transitions between states, and invariant conditions to be checked - comes to 550 lines (72 blank or comments). Running this simple model generates the following figures.</p> <div class="highlight"><pre><span class="mi">172838</span> <span class="n">states</span>
<span class="mi">468981</span> <span class="n">rules</span>
<span class="mf">10.60</span> <span class="n">seconds</span>
</pre></div> <p>That's for a simple protocol, with a small problem size - three nodes, three writes, two failures. The model was also relentlessly optimized, e.g. eliminating states that Murphi would see as different only because of fields that would never be used again. Still, that's a lot of states. When I introduced a fourth write, the run time tripled. When I introduced a fourth node, I let it run for five minutes (3M states and 10M transitions) but it still showed no signs of starting to converge so I killed it. BTW, I forgot to mention that the model contains five known shortcuts to make it checkable, plus probably at least as many more shortcuts I didn't even realize I was taking.</p> <p>If it's so hard and you have to take so many shortcuts, is it still worth it? Most definitely. Look at those numbers again. 
How many people do you think can reason about so many states and transitions, many of them representing dark unexpected corners of the state space because of forgotten possibilities, in their heads? I'm guessing <strong>none</strong>. Even people who are very good at this will find errors in their protocols, as has happened to me every time I've done the exercise. I actually thought I'd done pretty well this time, with nothing that I could characterize as an out-and-out <strong>bug</strong> in the protocol. Sure, there were things that turned out to be missing, so that out of five allowable implementations only one would actually be bug free, so I still thought the exercise was worth it. Then I added a third failure. I didn't expect a three-node system to continue working if more than one of those were concurrent (the model allows the failures to be any mix of sequential and concurrent), but I expected it to fail cleanly without reaching an invalid state. Surprise! It managed to produce a case where a reader can observe values that go back in time. This might not make much sense without knowing the protocol involved, but it might give some idea of the crazy conditions a model checker will find that you couldn't possibly have considered.</p> <div class="highlight"><pre>write #1 happens while node A is the leader
B fails immediately
C completes the write
read #1 happens while A isn&#39;t finished yet (but reads newer value)
A fails
B comes back up, becomes leader
C fails while B is still figuring out what went on
A comes back up
read #2 happens, gets older value from B
</pre></div> <p>So now I have a bug to fix, and that's a good thing. Clearly, it involves a very specific set of ill-timed reads, writes, and failures. Could I have found it by inspection or ad-hoc analysis? Hell, no. Could I have found it by testing on live systems? Maybe, eventually, but it probably would have taken months for this particular combination to occur on its own. 
Forcing it to occur would require a lot of extra code, plus an exerciser that would amount to a model checker running 100x slower across machines than Murphi does. With enough real deployments over enough time it would have happened, but the only feasible way to prevent that was with model checking. Try it.</p> <p>P.S. I fixed the bug.</p>Fri, 27 Sep 2013 15:24:00,2013-09-27:2013-09-model-checking.htmlprocessSAN Stalwarts and Wistful Thinking<p>I've often said that open-source distributed storage solutions such as GlusterFS and Ceph are on the same side in a war against more centralized proprietary solutions, and that we have to finish that war before we start fighting over the spoils. Most recently I said that on Hacker News, in <a href="">response</a> to what I saw as a very misleading evaluation of GlusterFS as it relates to OpenStack. In some of the ensuing Twitter discussion, <a href="">Ian Colle</a> alerted me to an article by Randy Bias entitled <a href="">Converged Storage, Wishful Thinking &amp; Reality</a>. Ian is a Ceph/Inktank guy, so he's an ally in that first war. Randy presents himself as being on that side too, but when you really look at what he's saying it's pretty clear he's on the other team. To see why, let's look at the skeleton of his argument.</p> <ul> <li> <p>"Elastic block storage" is a good replacement for traditional SAN/NAS.</p> </li> <li> <p>"Distributed storage" promises to replace <strong>everything</strong> but can't.</p> </li> <li> <p>The CAP theorem is real, failures are common, and distributed storage doesn't account for that.</p> </li> </ul> <p>The first two points are hopelessly muddled by his choice of terms. When people in this space hear "elastic block storage" they're likely to think it means Amazon's EBS. However, Amazon's EBS <strong>is</strong> distributed storage. 
Try to read the following as though Randy means Amazon EBS.</p> <blockquote> <p>Elastic Block Storage (EBS) is simply an approach to abstracting away SAN/NAS storage {from page 4}</p> <p>Elastic block storage is neither magic not special. It’s SAN resource pooling. {from <a href="">Twitter</a>}</p> </blockquote> <p>That conflicts with everything else I've heard about Amazon EBS. I even interviewed for that team once, and they sure seemed to be asking a lot of questions that they wouldn't have bothered with if EBS weren't distributed storage. Amazon's own official <a href="">description of EBS</a> bears this out.</p> <blockquote> <p>Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component.</p> </blockquote> <p>Servers, eh? Not arrays. That sounds a lot like distributed storage, and very unlike the "SAN resource pooling" Randy talks about. Clearly he's not talking about Amazon EBS vs. distributed storage because one is a subset of the other. What he's really talking about is SAN-based vs. distributed <strong>block</strong> storage. In other words, his first point is that SAN hardware repackaged as "elastic block storage" can displace SAN hardware sold as itself. Yeah, when you cut through all of the terminological insanity it ends up sounding rather silly to me too.</p> <p>Randy's second point is that users need multiple tiers of storage, and a distributed storage system that satisfies the "tier 1" (lowest latency) role would be a poor fit for the others. Well, duh. The same is true of his alternative. The fundamental problem here seems to be an assumption that each storage technology can only be deployed one way. That's kind of true for proprietary storage systems, where configurations are tightly restricted, but open storage systems are far more malleable. You can buy different hardware and configure it different ways to satisfy different needs. 
If you want low latency you do one thing and if you want low cost you do another, but it's all the same technology and the same operations toolset either way. That's much better than deploying two or more fundamentally different storage-software stacks just because they're tied to different hardware.</p> <p>The homogeneity assumption is especially apparent in Randy's discussion of moving data between tiers (HSM) as though it's something distributed storage can't do. In fact, there's nothing precluding it at all, and it even seems like an obvious evolution of mechanisms we already have. With features like GlusterFS's upcoming <a href="">data classification</a> you'll be able to combine disparate types of storage and migrate automatically between them, <strong>if you want to</strong> and according to policies you specify. Again, this can be done better in a single framework than by mashing together disparate systems and slathering another management layer on top.</p> <p>Lastly, let's talk about CAP. Randy makes a big deal of the CAP theorem and massive failure domains, leading to this turd of a conclusion:</p> <blockquote> <p>Distributed storage systems solve the scale-out problem, but they don’t solve the failure domain problem. Instead, they make the failure domain much larger</p> </blockquote> <p>Where the heck does that idea come from? I mean, seriously, WTF? I think I'm on pretty safe ground when I say that I know a bit about CAP and failure domains. So do many of my colleagues on my own or similar projects. The fact is that distributed storage systems are <strong>very</strong> CAP-aware. One of the main tenets of CAPology is that no network is immune to partitions, and that includes the networks inside Randy's spastic block storage. Does he seriously believe a traditional SAN or NAS box will keep serving requests when their internal communications fail? Of course not, and the reason is very simple: they're distributed storage too, just wrapped in tin. 
We all talk to each other at the same conferences. We're all bringing the same awareness and the same algorithms to bear on the same problems. Contrary to Randy's claim, the failure domains are exactly the same size relative to TB or IOPS served. The difference is in the quality of the network implementation and the system software that responds to network events, not in the basic storage model. Open-source distributed storage lets you build essentially the same network and run essentially the same algorithms on it, without paying twice as much for some sheet metal and a nameplate.</p> <p>In conclusion, then, Randy's argument about storage diversity and tiering is bollocks. His argument about CAP and failure domains is something even more fragrant. People who continue to tout SANs as a necessary component of a complete storage system are only serving SAN vendors' interests - not users'.</p>Wed, 11 Sep 2013 11:38:00,2013-09-11:2013-09-wistful-thinking.htmlstorageglusterfsmarketingStanding Desks<p>A while ago, I got an <a href="">Ergotron WorkFit-S</a> sit/stand monitor mount. I love it, and have talked about it to plenty of people. Yesterday I joined a <a href="">Hacker News discussion</a> about standing desks, and it left me with some thoughts that I'd rather share here than there, so here goes.</p> <p>Some people have suggested a drafting table plus a high chair as a cheaper alternative. I did consider that approach, but I don't think it's so much a direct alternative as a fundamentally different thing. I don't just want one big flat surface. I have a desk already, which wraps around slightly and has a pedestal on one side (nominally for a printer but also served quite well as my original standing-desk solution). It even has things like doors, drawers, and shelves. Likewise, I don't want to sit on a high chair. When I do sit, I like a full-height back and feet on the floor; a wrap-around bar doesn't suffice. 
I also like to swivel a bit, and that can be downright dangerous on one of those. Buying a new desk and a new chair and a new separate drawer unit would almost certainly cost more than my current setup and still be less comfortable. When the cost of the things I already have is subtracted, the cost difference is even larger.</p> <p>Speaking of cost, the HN thread also included the absurd claim that a motorized desk would have been cheaper. Maybe that's true in a sense, but here's a basic fact: adding a motor to something never made it cheaper. If motorized desk X is cheaper than manual-adjust desk Y, it's because X is <strong>inherently</strong> cheaper than Y by enough to offset the additional cost of the motor. That's reflected in cheaper materials, cheaper construction, missing features, and so on. The WorkFit-S consists of machined metal, welds/coatings where appropriate, and high-density wood composite for the hinged keyboard tray. Sure, you can spend less for something made of thin untreated metal and plastic held together with screws, but that's apples to cherries. Everything I looked at that was <strong>at all</strong> comparable in terms of build quality cost 2-4x as much. They were also all in the drafting-table category, with all of the other drawbacks noted above. Really, this option only makes sense for the disabled. Anyone who's just too lazy to raise and lower their monitor manually should stop making stuff up to rationalize their choice.</p> <p>If you're building a home office from scratch, and you <strong>prefer</strong> a different kind of setup than I have, that's great. Comfort is important, so even if it costs a little more you should go for it. On the other hand, if you already have a desk/chair that work for you and don't have a clear preference (based on experience) for the drafting-table approach, I really would suggest ignoring those who just want you to follow the fashion. 
I <strong>definitely</strong> don't recommend going straight from a sitting setup to a standing-only setup. That was the problem with my original ad-hoc approach, and adjustability makes even more of a difference than I expected when it led me to the WorkFit in the first place. Two modes are better than one.</p>Wed, 28 Aug 2013 09:16:00,2013-08-28:2013-08-standing-desks.htmlworkingLocal Filesystems Suck<p>Distributed filesystems represent an important use case for local filesystems. Local-filesystem developers can't seem to deal with that. That, in a nutshell, is one of the most annoying things about working on distributed filesystems. Sure, there are lots of fundamental algorithmic problems. Sure, networking stuff can be difficult too. However, those problems are "natural" and not caused by the active reality-rejection that afflicts a related community. Even when I was in the same group as the world's largest collection of local-filesystem developers, with the same boss, it was often hard to get past the belief that software dealing with storage in user space was just A Thing That Should Not Exist and therefore its needs could be ignored. That's an anti-progress attitude.</p> <p>So, what are the problems with local filesystems? Many of the problems I'm going to talk about aren't actually in the local filesystems themselves - they're in the generic VFS layer, or in POSIX itself - but evolution of those things is almost entirely driven by local-filesystem needs so that's a distinction without a difference. Let's look at some examples from my own recent experience.</p> <ul> <li> <p>Both the interfaces and underlying semantics for extended attributes still vary across filesystems and operating systems, despite their usefulness and the obvious benefits of converging on a single set of answers. 
This is true even for the most basic operations; if you want to do something "exotic" like set multiple xattrs at once, you have to use truly FS-specific calls.</p> </li> <li> <p>Mechanisms to deallocate/discard/trim/hole-punch unused space still haven't converged, after $toomany years of being practically essential to deal with SSDs and thin provisioning.</p> </li> <li> <p>Ditto for overlay/union mounts, which have been worked on for years to no useful result. There's a pattern here.</p> </li> <li> <p>The readdir interface is just totally bogus. Besides being barely usable and inefficient, besides having the worst possible consistency model for concurrent reads and writes, it poses a particular problem for distributed filesystems layered on top of their local cousins. It requires the user to remember and return N bits with every call, instead of using a real cursor abstraction. Then the local filesystem at the far end gets to use that N bits however it wants. This leaves a distributed filesystem in between, constrained by its interfaces to that same N bits, with zero left for itself. That means distributed filesystems have to do all sorts of gymnastics to do the same things that local filesystems can do trivially.</p> </li> <li> <p>Too often, local filesystems implement enhancements (such as aggressive preallocation and request batching) that look great in benchmarks but are actually harmful for real workloads and especially for distributed filesystems. There's another big pile of unnecessary work shoved onto other people.</p> </li> <li> <p>It's ridiculously hard to make even such a simple and common operation as renaming a file atomic. Here's the <a href="">magic formula</a> that almost nobody knows.</p> </li> </ul> <p>The last point above relates to the really problematic issue: very poor support for specifying things like ordering and durability of requests without taking out the Big Hammer of forcing synchronous operations. 
By the time we get a request, it has already been cached and buffered and coalesced and so on all to hell and back by the client. Those games have already been played, so our responsibility is to provide <strong>immediate</strong> durability, while respecting operation order, with minimal performance impact. It's a tall order at the best of times, but the paucity of support from local filesystems makes it far worse.</p> <p>In a previous life, I worked on some SCSI drivers. There, we had tagged command queuing, which was a bit of a pain sometimes but offered excellent control over which requests overlapped or followed which others. With careful management of your tags and queues, you could enforce the strictest order or provide maximum parallelism or make any tradeoff in between. So what does the "higher level" filesystem interface provide? We get fsync, sync, O_SYNC, O_DIRECT and AIO. That might be enough, except...</p> <ul> <li> <p>Fsync is pretty broken in most local filesystems. The "stop the world" entanglement problems in ext4 are pretty <a href="">well known</a>. What's less well known is that XFS (motto: "at least it's not ext4") has essentially the same problem. An fsync forces everything queued <strong>internally</strong> before to complete, but that's completely useless to an application which still gets no useful information about which other file descriptors no longer need their own fsync. The pattern continues even when you look further afield.</p> </li> <li> <p>O_SYNC has essentially the same problems as fsync, and sync is <strong>defined</strong> to require "stop the world" behavior.</p> </li> <li> <p>O_DIRECT throws away <strong>too much</strong> functionality. 
Sure, we don't want write-back, but a write-through cache would still be nice for subsequent reads and O_DIRECT eliminates even that.</p> </li> <li> <p>AIO still uses a thread pool behind the scenes on Linux, unless you use a low-level interface that even its developers admit isn't ready for prime time, so it fails the efficiency requirement.</p> </li> </ul> <p>Implementing a correct and efficient server is way harder than it needs to be when all you have to work with is broken fsync, broken O_DIRECT, and broken AIO. Apparently btrfs tries to get some of this stuff right, thanks to the Ceph folks, but even they balked at trying to make those changes more generic, so unless you want to use btrfs you're still out of luck. That's why I return the local-filesystem developers' contempt, plus interest. Virtualization systems, databases, and other software all have many of the same needs as distributed filesystems, for many of the same reasons, and are also ignored by the grognards who continue to optimize for synthetic workloads that weren't even realistic twenty years ago. While I still believe that the POSIX abstraction is far from being obsolete, pretty soon it might not be possible to say the same about the people most involved with implementing or improving it.</p>Mon, 26 Aug 2013 12:42:00,2013-08-26:2013-08-local-filesystems-suck.htmlstoragelinuxTechnical Credit<p>To a first approximation, "software engineering" refers to all of the things you need to know when you take "programming" and try to scale it up - more code, more people, more time. You don't need a civil engineer to dig a latrine, but you'd better have one to design a sewer system for an entire city. Likewise, you don't need a software engineer to write a small one-off script, but you will if you're designing a distributed filesystem. Many of the discussions about "everybody should learn to program" seem to go in circles because the participants don't acknowledge this distinction. 
Applying even a common skill often requires a new level of professional rigor when done at a larger scale.</p> <p>One of the things you learn about as a software engineer, and not "just" a programmer, is <a href="">technical debt</a>. To boil a complex idea down a bit, this is the idea that, as code evolves, it accumulates flaws. I don't mean flaws in the sense of things that behave incorrectly, but flaws that make the code harder to work on - messy (often duplicated) code paths and data structures, "impedance mismatches" that force one piece of code to compensate for another's strangeness (with the effects usually rippling ever outward), and so on. Over time, you end up with code that does what it does pretty well, but can barely be persuaded or coerced to do anything else and ultimately becomes obsolete.</p> <p>If technical debt is stuff that slows you down, technical credit is stuff that speeds you up. Just as technical debt often consists of things forced into a release for the sake of expediency (with an intent to go back and "do it right" later), technical credit consists of things that are left out of a release for the sake of caution (often it's the "do it right" version that wasn't ready in time). Specifically, it's a library of solved problems in the form of prototypes, code snippets, or designs in various stages of completion. As a project progresses, these previously isolated bits of hard-won knowledge can be pressed into immediate service, instead of having to wait for solutions to be developed from scratch. Technical credit is thus primarily a risk reduction or mitigation strategy. If you have ten solutions for various hard problems "on the shelf" and you only need five, the benefit in development velocity might far outweigh the cost of having developed another five solutions you don't need (yet). Developing technical credit can also be considered a career-development strategy, or even a perk. 
(This post was inspired by a discussion of Google's "20% time" which fits this description.) Increasing technical credit is <strong>fun</strong>. It usually means solving technical problems that are much more interesting than those in the current version, in relative isolation and without the time pressure of having to fix the bug <em>du jour</em>.</p> <p>The reason I'm writing this is because it seems like a lot of people never think about technical credit - let alone have a structured way to think about it. Both traditional and new-age development processes focus entirely on a direct line from writing code to releasing it. The idea of writing code and then letting it sit for a while has no place in either model. (This same problem shows up in synchronizing separate "upstream" and "downstream" projects with open source.) As a result, many projects go to battle each release cycle with an empty arsenal. Organizations that recognize the existence of technical credit and assign it a proper value will - over time - outperform those that don't.</p>Fri, 16 Aug 2013 12:06:00,2013-08-16:2013-08-technical-credit.htmlprocessGlusterFS 3.5 Features<p>It's time to let some cats out of some bags. As my loyal readers (yeah right) have surely noticed, things have been quiet around here. Part of that has been the result of vacations and such, but also there's a lot of stuff I just haven't felt ready to write about. Now that I've finished writing my <a href="">feature proposals for GlusterFS 3.5</a>, I'm ready to write here as well. But first, a bit of philosophy.</p> <p>On any non-trivial software project, there's likely to be a certain tension between "loyalists" who want to keep things largely the same and "rebels" who keep pushing for radical change. I was tempted to use "liberal" and "conservative" but <a href="">Steve Yegge</a> beat me to it with a different set of definitions. (Good reading, BTW, though he's a bit loquacious so make sure you have plenty of time.) 
The distinction I'm trying to make is as follows:</p> <ul> <li> <p>A loyalist believes that "one more tweak" or a succession of them will bring the project to greatness. There's no need for radical change, and too much risk associated with it.</p> </li> <li> <p>A rebel believes that you can't cross a chasm with small steps. You have to take bold steps, even if those steps are going to cause regret later.</p> </li> </ul> <p>Most people don't stay at one position on that spectrum, and certainly not at the ends. Often one's position is determined by circumstance. For example, despite my general tendency toward the rebel view, I've probably spent more of my career (especially at Revivio) fighting on the loyalist side. When it comes to GlusterFS, I'm not just <strong>a</strong> rebel but probably <strong>the</strong> rebel. There are others more fit and more inclined to take the loyalist position. I see it as my responsibility to take an under-represented rebel position by proposing new ideas and accumulating "technical credit" to remove or balance out the significant technical debt that has accumulated under the loyalist regime. I don't mind at all if my ideas stay "on the shelf" for years, as has happened with most of HekaFS. It's good to have extra ammo in the locker. I mind a little bit more when the loyalists reject ideas with pure FUD or passivity instead of sound technical argument, or when they can't seem to change their mind about an old idea without claiming it was their own, but that's a subject for a different post. The point here is that what I'm proposing is deliberately ambitious, because our competitors aren't exactly standing still. They're often acting more boldly than we are, and we'll never gain or maintain a lead over them (I won't get into which we're doing) with baby steps. With that in mind, here's what I've proposed.</p> <ul> <li> <p>Better SSL support. Surprise! This is actually a very loyalist feature. 
Basically the core code has been there for years, but there are several pieces that still need to be done - mainly on the usability front - before this can really be something we're proud of.</p> </li> <li> <p>New Style Replication. For years, I've been advocating for fundamental change in this area. I even wrote an infamous "Why AFR Must Die" email explaining why I don't believe the current design or implementation will be sufficient going forward (using real user complaints and code examples respectively). There finally seems to be some acceptance of the idea that we should use a log/journal instead of scattered xattrs or hard-link trickery to keep track of incomplete changes, but NSR goes even further than that. For one thing, it's server-based to take better advantage of how NICs and switches work. For another, it has an almost Dynamo-like consistency model which offers users better tradeoffs of consistency vs. availability and/or performance. There's a lot more, which I'll discuss in detail some time soon. For now I'll just say that performance will be better, recovery will be faster, and (best of all IMO) "split brain" will be easier to avoid.</p> </li> <li> <p>Thousand-node scalability. The biggest limitation on our scalability right now is not in the I/O path, which scales just fine, but in our management path. The changes I'm proposing here - based on some of the latest advances that some of my distributed-system friends will surely recognize - represent important steps on the way to an exabyte-scale system. We're already at petabyte scale, TYVM, so I don't think that's an exaggeration.</p> </li> <li> <p>Data tiering/classification. While NSR might be the one that I've spent the most time thinking about, this is the one that might have the most immediate impact and appeal for users. We get queries <strong>all the time</strong> about how we're going to deal with SSDs. 
Hybrid drives and "smart" HBAs aren't really very good solutions, because they only have local and limited knowledge. (This is the same issue as global vs. local deduplication BTW.) Being able to combine SSD-based and disk-based bricks <strong>in a single volume</strong> with smart placement and migration across all of them is a leap far beyond such hacks. Perhaps even more importantly, we also get asked a lot about various features that would be beneficial in some way - e.g. deduplication, bitrot detection, erasure coding - but carry a high performance cost. We need to tier between these in much the same way as between different hardware types. As it turns out, the exact same infrastructure also allows us to implement locality awareness, security-level awareness, and all sorts of other features with relatively little effort. Our modular structure and our existing data-distribution code already do 95% of what's needed, and just need a little nudge.</p> </li> </ul> <p>Those are just my own proposals. Others have made proposals too, so please go check them out. Personally, I've been eagerly awaiting Xavi's erasure-coding "disperse" translator for ages, and can't wait to see it become a full part of the project. While there's practically no chance that all of this will get into GlusterFS 3.5, a lot of it will and what's left will become a formidable arsenal of opportunities for years to come.</p>Tue, 13 Aug 2013 11:15:00,2013-08-13:2013-08-glusterfs-35-features.htmlstorageglusterfsAvoiding Jet Lag<p>And now for something completely different...</p> <p>As part of my job - educating and evangelizing and whatever else you call it - I travel a fair amount. I know there are other people who travel ten times as much as I do, but then there are many more who travel less than a tenth as much. As everyone who does travel frequently knows, jet lag is a very real problem. Most of us travel across the country or across the world because something is important. 
It's very depressing when you're at your destination, trying to do something important, and your brain is so fogged by jet lag that you can barely put together a coherent sentence. The really funny thing, as I just noted to my seat-mate on BA203 from LHR to BOS with regard to typing, is that a lot of people who absolutely live to optimize the hell out of every little thing they do never try to optimize around jet lag. Therefore, I'll share something that has worked for me and might work for you.</p> <p>I can't in any way take credit for this idea as my own invention. I read about it in a magazine a while ago, probably The Atlantic but I'm not sure. In that article, they cited this technique as being in use by the US military, Olympic teams, and so on. They probably also cite the scientific sources better than I'm going to. In any case, the very basic observation is this:</p> <blockquote> <p>Your body's clock is determined at least as much by when you <strong>eat</strong> as by when you <strong>sleep</strong>.</p> </blockquote> <p>Thus, even though jet lag is a problem of sleep and wakefulness, the way to address it is through your stomach. Specifically, if you fast for a while your body's clock goes into "free wheeling" mode (much like pushing in the clutch on a manual-transmission car). Then, the next big meal is interpreted as dinner (getting back in gear) with sleep soon to follow. Therefore, the smart thing to do is <strong>don't eat on the plane</strong>. Instead: fast until an appropriate dinner time for your destination, then eat a full dinner, then sleep.</p> <p>My interpretation has been to start fasting a full day before my anticipated arrival time. It's not a total fast, because if I didn't eat at all then my stomach noises would both be uncomfortable and annoy people around me. Similarly, I don't make heroic efforts to avoid sleep. When I get to that point where I can just barely keep my eyes open or focus on what somebody's saying, then I'll take a short nap. 
Fortunately, I've always been good at cat-naps. If I don't want to sleep more than an hour maximum, then it's very unlikely that I'll do so - even without an alarm. (Heck, I had to get up at 3am local time for my flight today, and I woke up "naturally" at 2:55. It's a handy feature.) The key is to eat and sleep little enough to avoid sending that "now we know when to sleep" signal. That way, when you send the <strong>real</strong> signal with a big dinner, your body responds to it. It's definitely hard. I've been bumped up to one of the better-food sections on this flight for the first time in forever, and it's hard to say no when everybody around me is eating. Still, a little bit of discomfort in the air helps to avoid much more discomfort on the ground later.</p> <p>How well does this work? I've done it on my last two Bangalore trips, plus my last coast-to-coast trip. Actually I'm not quite done with the second Bangalore trip; I'm somewhere slightly south of Greenland as I write this. Still, I feel very encouraged that I haven't had any jet lag at all during any of those trips. I find myself going to bed at a normal time for where I am, getting up at a normal time, and not feeling particularly tired during the day. Meanwhile, co-workers on the same trips have been literally falling down because of jet lag. Sure, I've had periods of being tired during the day, but I don't think that has anything to do with jet lag. If you put me in a windowless room with twenty other people to talk about something boring right after lunch, I'm going to nod off a bit regardless of what time zones are involved.</p> <p>So, there it is. It's a very simple idea, it almost seems obvious, and I'm sure there are plenty of people who've already heard of it. 
Still, it seems like a lot of people either haven't heard of it or haven't tried it, so here's a data point for you.</p>Thu, 25 Jul 2013 18:50:00,2013-07-25:2013-07-avoiding-jet-lag.htmltravelStartups and Patents<p>This should be a <a href="">pretty familiar story</a> to anyone in high tech by now. Startup makes something cool, becomes a target for patent litigation from what we used to call an NPE (Non Practicing Entity). Apparently the new term is PAE (Patent Assertion Entity) but I prefer an even more concise term: <strong>troll</strong>. There is much predictable moaning and gnashing of teeth <a href="">on Hacker News</a>, of course, but nobody wants to think about a very simple question.</p> <blockquote> <p>Where does all this ammunition come from?</p> </blockquote> <p>(BTW, I know nobody on HN wants to think about this because I've raised the issue before and I got slammed hard for the effort. That's why I'm posting here this time. Can't censor this, you fucking cowards.)</p> <p>Having worked at a dozen or so startups myself, I know exactly where the ammunition comes from: the looted carcasses of earlier startups. 
While I've never had one of my own patents abused this way, I have half a dozen friends who have been through that and it's always the same story.</p> <ol> <li> <p>Friend works at a startup.</p> </li> <li> <p>Investors apply pressure to file for patents, either as a bargaining chip in any subsequent acquisition or possibly as a hedge against failure.</p> </li> <li> <p>Friend gets named on a patent or ten.</p> </li> <li> <p>The startup does in fact fail or get acquired.</p> </li> <li> <p>Patents get sold, and sold again, until eventually they end up in the hands of a troll.</p> </li> <li> <p>Troll asserts patent.</p> </li> <li> <p>Friend is livid about how their creative work, their contribution to the state of the art, is being abused.</p> </li> <li> <p>Friend's feelings have no effect whatsoever on the litigation.</p> </li> </ol> <p>In other words, if you work at a startup that files patents, and you're not taking steps to put them firmly out of reach of the trolls, then <strong>you're part of the problem</strong>. If you're an investor and you're allowing portfolio companies to file patents without such protection, then you're part of the problem too. Yeah, I know, some VCs claim they discourage such things, but somehow I'm less than convinced when I can go to USPTO or Google Patents or PatentStorm and immediately pull up a list of patents filed by companies that somehow overcame that discouragement without any ill consequence. I'm shocked - shocked! - to hear that patents are going on here. Your winnings, sir.</p> <p>Until we get serious patent reform, which is going to take a while, patents are still necessary to establish precedence. That keeps trolls from acquiring patents to the same idea and then pursuing others - possibly including the people who truly had the idea first and based products on it. Don't let your oh-so-principled distaste for patents overcome common sense and keep you from doing your part to protect everyone including yourself. 
The key is to ensure that the patents are <strong>only</strong> usable in a defensive fashion. One approach is to turn them over to something like the <a href="">Open Invention Network</a>. Another approach is the <a href="">Innovator's Patent Agreement</a> from Twitter. That at least avoids the scenario I lay out above, though a down-on-their-luck former developer might not offer much restraint when push comes to shove. There are other approaches as well, but the sad fact is that most people who file patents - including those who complain about others' patents - are doing absolutely nothing to ensure that their own work won't be turned against them and their community. That's a disgrace. Developers and founders, disagree with me all you want about what it is we should do, but do <strong>something</strong> besides complain.</p>Fri, 19 Jul 2013 16:19:00,2013-07-19:2013-07-startups-and-patents.htmllegalSmall Synchronous Writes<p>Sometimes people ask me why I always use small synchronous writes for my performance comparisons. Surely (they say), there are other kinds of operations that are more common or more important. Yes there are (I say), and don't call me Shirley. But seriously, folks, there are definitely other kinds of performance that matter. The problem is that they just don't tell you much about what makes two distributed filesystems different. I'll try to explain why.</p> <p>Let's start with read-dominated workloads. It's well known that OS (and app) caches can absorb most of the reads in a system. This was the fundamental observation behind Seltzer et al's work on log-structured filesystems all those years ago. Reads often take care of themselves, so <strong>at the filesystem level</strong> focus on writes. The significance of caching is hardly less in distributed filesystems with greater latency. 
The primary exception to this rule is large sequential reads, but those tend to become bandwidth-bound very quickly and just about every distributed filesystem I've ever seen can saturate whatever network connections you have <strong>easily</strong> for such workloads. Boring. Between these two effects, it just turns out that read-dominated workloads aren't all that interesting.</p> <p>Why not different kinds of writes? Mostly because large and/or asynchronous writes tend to follow the same patterns as large reads. Once you have the opportunity to batch and/or coalesce writes, effectively eliminating the effect that network latency might have on most of them, it becomes pretty easy to fill the pipe with huge packets. Boring again. It's important to measure how well the servers handle <strong>parallelism</strong> among many requests that are still kept separate, but that's a whole different thing. If both reads and large/async writes are uninteresting, what does that leave? Small sync writes, of course.</p> <p>While I'm here, I might as well address a couple of other issues. One is the question about scale. Does a test of a single client and a single server (if replicating) really tell us anything useful for filesystems that are designed to have many servers? I think it does, for a certain class of such filesystems. In a system that uses algorithmic placement, such as GlusterFS or Ceph, an individual request really will hit only those servers and really will scale pretty linearly until you start hitting the <strong>network's</strong> scaling limits. It absolutely makes sense to test the network in the context of an actual deployment, but in the context of evaluating technologies the performance of a single server (or replica pair) does work as a proxy for the performance of N. That doesn't mean you should obsess over micro-optimizations or implementation concerns that don't have much measurable effect (e.g. kernel vs. 
FUSE clients), but it's really the data flow and algorithmic efficiency that matter most. This argument doesn't work nearly as well for more outdated architectures that use directory-based placement, such as HDFS or Lustre. In those cases, the need to go through the MDS or NameNode or whatever really does create a bottleneck that impacts system-wide scaling. That's something to consider when you're looking at such systems.</p> <p>Lastly, what about metadata operations? File creation and directory listings are even worse than writes, aren't they? Yes, absolutely, they are. Testing only data operations is kind of a bad habit among filesystem folks, and I'm guilty too. I really should test and report on those things too, even though it probably means developing even more tools myself because the existing tools are even worse for that than they are for testing plain old reads and writes.</p> <p>To make a long story . . . no longer, if not actually short, I've found that testing small synchronous writes is simply the best place to start. It's the first result to look at, but absolutely not the only one. If I were actually looking to deploy a system myself I'd try all sorts of workloads at the same scale as the deployment itself, or as close as I could get, and I'd show everyone a detailed report. On the other hand, when I'm doing the tests on my own time and at my own expense (in a public cloud) for a blog post or presentation, that's quite a different story.</p>Thu, 18 Jul 2013 10:33:00,2013-07-18:2013-07-why-sync-writes.htmlstorageperformancePerformance Measurement Pitfalls<p>One of the problems with measuring and comparing performance of scalable systems is that any workload capable of producing meaningful results is going to be highly multi-threaded, and most developers don't know much about how to collect or interpret the results. After all, they hardly ever get any training in that area, and many of the tools don't exactly make it easy (as we'll see in a moment). 
Considering all the effort spent on complex ways to define the input workload - some tools have entire domain-specific languages for this - you'd think that some effort might have been spent on making the output more meaningful. You'd be wrong.</p> <p>To see how easy it is to be misled, and how badly, let's consider a simple example. You have a storage system capable of sustaining 1000 IOPS. A single I/O thread can generate a load of 1000 IOPS. What happens when you run four of those?</p> <ul> <li> <p>Scenario 1: the storage system effectively delivers 250 IOPS per thread, continuously. Therefore they each report 250 IOPS, you add those up, and you get a correct sum of 1000 IOPS.</p> </li> <li> <p>Scenario 2: the storage system effectively serializes the four threads. Thread A completes in one second, reporting 1000 IOPS. Thread B completes in two seconds - the first second sitting idle - and reports 500 IOPS. Threads C and D complete in three and four seconds respectively, reporting 333 and finally 250 IOPS. Add them all up and you get the wildly wrong sum of 2083 IOPS.</p> </li> </ul> <p>The mistake in the second scenario seems obvious when described this way, but I've seen smart people make it again and again and again over the years. One way to avoid it is not to trust reports from individual threads, but to measure the start and end times <strong>for the whole group</strong>. Unfortunately, you can miss a lot of useful information that way. Most importantly, a single slow worker can drag the entire average down and you won't even notice that the actual I/O rate for most of the threads and most of the time was actually far higher unless you're paying pretty close attention. Dean and Barroso call this the <a href="">latency tail</a> and it's significant in operations as well as measurement.</p> <p>Another way to avoid the original over-counting problem is "stonewalling" - a term and technique popularized by <a href="">iozone</a>. 
This means stopping all threads when the first one finishes - i.e. "first to reach the stone wall" - and collecting the results even from threads that were stopped prematurely. This does avoid over-counting, but it can distort results in even worse ways than the previous method. It fundamentally means that your workers didn't do all of the I/O that you meant them to, and that they would have if they had all proceeded at the same pace. If you meant to do more I/O than will fit in cache, or on disks' inner tracks, too bad. If you wanted to see the effects of filesystem or memory fragmentation over a long run, too bad again. The slightest asymmetry in your workers' I/O rates will blow all of that away, and what storage system doesn't present any such asymmetry? None that I've ever seen. Worst of all, as <a href="">Brian Behlendorf</a> mentions, this approach doesn't even solve the single-slow-worker problem.</p> <blockquote> <p>The use of stonewalling tends to hide the stragglers effect rather than explain or address it</p> </blockquote> <p>In other words, iozone's stonewalling is worse than the problem it supposedly solves. Turn it off. If you want to see what's <strong>really</strong> happening to your I/O performance, the solution is neither of the above. Measuring just a start and end time, per worker or per run, is insufficient. To see how much work your system is doing per second, you have to look each second. Such <strong>periodic</strong> aggregation can not only give you accurate overall numbers and highlight stragglers, but it can also show you information like:</p> <ul> <li> <p>Performance percentiles (per thread or overall)</p> </li> <li> <p>Global pauses, possibly indicating outside interference</p> </li> <li> <p>Per-thread pauses e.g. 
due to contention/starvation</p> </li> <li> <p>Mode switches as caches/tiers are exhausted or internal optimizations kick in</p> </li> <li> <p>Cyclic behavior as timers fire or resources are acquired/exhausted</p> </li> </ul> <p>This is all <strong>really</strong> useful information. Do any existing tools provide it? None that I know of. I used to have such a tool at SiCortex, but it was part of their intellectual property and thus effectively died with them. Besides, it depended on MPI. Plain old sockets would be a better choice for general use. Reporting from workers to the controller process could be push or pull, truly periodic or all sent at the end (if you're more concerned about generating network traffic during the run than about clock synchronization). However it's implemented, the data from such a tool would be much more useful than the over-simplified crap that comes out of the current common programs. Maybe when I have some spare time - more about that in a future post - I'll even work on it myself.</p>Tue, 09 Jul 2013 20:03:00,2013-07-09:2013-07-perf-pitfalls.htmlstorageperformanceTwo Weeks is Not a Sprint<p>We're moving to an "agile" development process at work. Yes, we're becoming scrumbags. ;) One of the terms that really bothers me is "sprint" because I think of a sprint as a flat-out effort. That means minimal eating, sleeping, or time with family. Even hard-core hackers rarely do that for two weeks at a time. I think a better metaphor for what's a sprint and what's not is running: 100m equals one day of coding. So...</p> <ul> <li> <p>100m: the classic. Not much more to say about this one.</p> </li> <li> <p>200m = two days. A short hackathon. The focus shifts a bit from acceleration to maximum speed (productivity) and the overall pace is actually a bit higher because that startup time is amortized.</p> </li> <li> <p>400m = four days. A long hackathon, or close enough to a full week. 
Still a sprint, but at the upper end of the range.</p> </li> <li> <p>1500m/mile = two weeks (approximately). Another marquee distance. No longer a true sprint, but still fast. Most sensitive to pace, because it's long enough to burn out but not long enough to make many adjustments.</p> </li> <li> <p>5k/10k = a few months. Not much to say here either.</p> </li> <li> <p>40k/marathon = just over a year. The longest distance/duration most people plan for, though ultra-marathons do exist.</p> </li> </ul> <p>The mile seems like the closest equivalent to how "sprint" is used in agile terminology, so why don't we use good old-fashioned "milestone" instead?</p>Tue, 25 Jun 2013 08:10:00,2013-06-25:2013-06-two-weeks.htmlprocessLies, Damn Lies, and Parallels<p>This apparently happened a while ago, but it recently came to my attention via <a href="">LWN</a> that James Bottomley has made the claim that "Gluster sucks" (not a paraphrase, those seem to be his exact words). Well, I couldn't just let that go by, could I? Why would he say such a thing? The only visible thing is a recent <a href="">presentation</a> at the Parallels Summit, which is - to put it bluntly - just <strong>full of lies</strong>. Let's take a look at just how bad it is.</p> <p>Our starting point is a performance graph on slide 3, purportedly showing how Parallels Cloud Storage is way ahead of everyone else in terms of aggregate Gbps . . . but wait. How many clients are we talking about? How many servers? He doesn't say. What kind of hardware? He doesn't say. What kind of configuration? He doesn't say. What kind of workload? He doesn't say. What does it even mean to put up numbers for both distributed storage systems (running on what kind of network?) and "DAS - 15,000 RPM"? Is he comparing apples to oranges, or apples to whole crates full of oranges? That graph is the absolute worst kind of fact-free marketing. It's utterly useless for drawing any engineering conclusions about anything. Onward to slide 5. 
What does this mean?</p> <blockquote> <p>File based Storage</p> <p>...</p> <p>suffers from metadata issues on the server</p> </blockquote> <p>"The" server, eh? Where have I heard that before? Oh yeah, <a href="">right here</a>. He's making the same mistake that James Hughes did, of thinking that because he can't think of a better way to handle metadata then nobody can. To quote Schopenhauer, "Everyone takes the limits of his own vision for the limits of the world." Onward to slide 7.</p> <blockquote> <p>Using a fixed size object incurs no metadata overhead whatsoever</p> </blockquote> <p>Here he has inadvertently identified a deficiency not in real cloud filesystems but in the Parallels alternative. Fixed-size objects are just not a reasonable limitation in many use cases. Any system designed around such a limitation is hopelessly weak compared to one that handles the more general case. As I explained the <a href="">last time</a> Parallels was slinging this kind of FUD, the same can be said about systems that don't allow real sharing of data - including both object and block stores. People wouldn't still be making billions of dollars per year selling NAS if users didn't want those more general semantics. Onward to slide 8.</p> <blockquote> <p>Fuse is the Linux Userspace Filesystem</p> <p>Main problem is it’s incredibly SLOW</p> </blockquote> <p>So why has FUSE historically been slow? Because the kernel hackers whose sign-off was needed to make it less slow were extremely resistant to any change that would have that effect. People like James Bottomley himself. When you're wrong for so long it's disingenuous to take so much credit for finally ceasing your own resistance to change. Onward to slide 9.</p> <blockquote> <p>Eventual Consistency is the usual norm</p> <p>...</p> <p>Gluster (does have a much slower strong consistency quorum enforcement mode)</p> </blockquote> <p>The first part is highly misleading.
Eventual consistency is <strong>not</strong> the norm in GlusterFS. In normal operation, updates are fully synchronous and there will be no inconsistency beyond that which exists in any distributed system while an update is still in progress. The only time there's any observable inconsistency is in the presence of failures, and not just any failure but the kind or number that can lead to split-brain. Also, quorum enforcement does <strong>not</strong> make anything slower. It has zero performance impact; that's just more FUD.</p> <p>Basically, what Bottomley has provided is just one big hatchet job based on misleading or outright false statements. The <strong>fact</strong> is that GlusterFS can do many things that Parallels Cloud Storage can't. It provides full filesystem semantics, truly shared data, geo-replication (still a hand-wave for PCS), Hadoop and Swift integration, and many other features. Yes, it might be true that PCS can outperform GlusterFS for the only use case that PCS can handle, on an unspecified configuration with an unspecified workload. Or maybe not, since those details are missing and the software itself isn't open so that others can make their own comparison.</p> <p>In my experience, people only make such totally <strong>bullshit</strong> comparisons when legitimate ones don't paint the picture they want. It's not science. It's not engineering. It's not even marketing done right. It's just lying.</p>Mon, 24 Jun 2013 13:02:00,2013-06-24:2013-06-lies-damn-lies.htmlstorageglusterfsmarketingPackage Managers<p>There are many things that differentiate a true software engineer from a mere programmer. Most of them are unpleasant - planning releases, reviewing designs or code, testing, release engineering, and so on. One of the most odious tasks is packaging software. I'll admit that it's an area where my self-discipline sometimes breaks down and I dump the task on somebody else as quickly as I can. 
Nonetheless, I recognize that the task itself, as well as the tools and people who do it, have value. I recognize that the rules those people have developed generally exist for a good reason. Apparently <a href="">some people don't</a>.</p> <p>The post actually makes some pretty decent points, especially about packagers breaking up packages unnecessarily. Mixed in are some really <strong>bad</strong> points, of which I'll focus on just three.</p> <blockquote> <p>Dynamic linking lets 2 programs indicate they want to use library X at runtime, and possibly even share a copy of X loaded into RAM. This is great if it is 1987 and you have 12mb of ram and want to run more than 3 xterms, but we don’t live in that world anymore.</p> </blockquote> <p>That demonstrates some pretty serious ignorance about the real issues, including performance. Sure, people have lots of RAM, but they want to use it for something besides redundant copies of the same (or almost the same) code. More applications, more VMs, more heap space for whichever program fills the machine's main role, etc. A dozen copies of the same library means a dozen times as much RAM <strong>and cache</strong>, and making those footprints larger does indeed have an impact on performance.</p> <blockquote> <p>One often touted benefit of dynamic linking is security, you can upgrade library X to fix some security hole and all the applications that use it will automatically gain the security fix the next time they’re run (assuming they still can run). I admit this benefit, but I think that package managers could work around this if they used static linking (Y depends on X, which has a security update, rebuild X and then rebuild Y and ship an updated package).</p> </blockquote> <p>That doesn't really work. You might be able to <strong>build</strong> against the new version of X, but that doesn't mean the result will be free of subtle bugs due to the difference.
The author even seems aware of this when he talks about the "carefully curated" (how pretentious) libraries that are shipped with Riak, but sort of tries to walk both sides of the street by ignoring the issue here.</p> <p>The situation gets even worse when transitive dependencies are considered. Let's say that X depends on a specific version of Y, and it enforces that dependency either via the package definition or via bundling. Either way, if Y depends on Z then an update to Z can also break X. This possibility remains unless X includes all of its dependencies <strong>all the way down</strong> to the OS. I know plenty of people who do exactly this in the form of virtual appliances and such, and it's a valid approach when pursued to its logical conclusion, but capturing only one level of dependencies solves <strong>nothing</strong> in return for the problems it causes.</p> <p>The last issue has to do with bundling <strong>modified</strong> versions of dependencies.</p> <blockquote> <p>Leveldb is a key/value database originally developed by Google for implementing things like HTML5’s indexeddb feature in Google Chrome. Basho has invested some serious engineering effort in adapting it as one of the backends that Riak can be configured to use to store data on disk. Problem is, our usecase diverges significantly from what Google wants to use it for, so we’ve effectively forked it</p> </blockquote> <p>This approach is problematic for reasons that go well beyond packaging. There's also a serious "doing open-source wrong" aspect to it, though there may be room for debate about which side is guilty in this case. Nonetheless, these things do happen. I myself violated the no-bundling rule for HekaFS on Fedora at one point . . . and you know what? It ended up being broken, for exactly the reasons we're talking about. If you do have to bundle a modified version of someone else's code, there's a right way to do it and a wrong way.
The right way is to <strong>engage</strong> with the distro packagers, instead of calling them "OCD" or accusing them of adhering blindly to "1992" standards that have become outdated, and collaborate with them on a sustainable solution. That solution is very likely to include more tightly specified dependencies, and a more active role keeping your own package up to date as the underlying original dependency gets updated. It's a huge pain for everyone involved, which is why it should only be done as a last resort. If you do decide to go down that path, then at least - as I put it in the <a href="">Hacker News thread</a> - pull up your big-girl panties and deal with it. Asking someone else to do part of your job and then complaining about how they do it is a loser move.</p>Sat, 22 Jun 2013 12:42:00,2013-06-22:2013-06-package-managers.htmlprocessMetadata Servers<p>I was sad that I had to miss RICON East, because I knew they had a lot of great speakers lined up. I really liked <a href="">James Hughes's presentation</a>, but must take issue with slide 15.</p> <blockquote> <p>Metadata Servers</p> <p>Required by traditional filesystems (POSIX) to translate names to sectors</p> <p>Hard to scale, heavy HA requirements</p> </blockquote> <p>The minor quibble is that no metadata servers I know of translate names to sectors. They all translate names in the global distributed namespace into a tuple of a node ID and a file/object ID on that node, and then other layers are responsible for translating to sectors (which might themselves be virtualized eight ways to Sunday). That's just a minor objection, though. My major objection is to the idea that there must be a metadata-server role separate from the data-server role. GlusterFS has proven that assumption false. 
Even Ceph, which does have a separate metadata-server role, distributes that role in very much the same way as the proposed alternative object stores, so it's not subject to "heavy HA requirements" any more than they are.</p> <p>There are some valid points to be made about the ordering and atomicity requirements of a full POSIX filesystem vs. an object store with simpler (saner?) semantics, but the "heavy HA requirements" of metadata servers are avoidable. There is one well known case of a distributed not-quite-filesystem that made that mistake (HDFS), but an argument based on one bad example won't get very far.</p>Fri, 21 Jun 2013 15:47:00,2013-06-21:2013-06-metadata-servers.htmlstorageStarting Over<p>You might have noticed that things look a bit different around here. OK, if you're reading this in an RSS reader then maybe not, but otherwise it's kind of obvious. I've switched platforms yet again, because I was feeling a bit blocked. Publishing new stuff using my static-wordpress technique was a bit cumbersome, but I didn't want to go back to the bloat and security nightmare that is regular WordPress either, so I'm moving to a system that's designed to generate static pages - Pelican. All of the old content will remain available at the same locations (don't want to lose all that Google juice), but the front page and feeds will be all about the new stuff. I actually have a bunch of ideas queued up in my head. Now that I've made the leap, I'll be letting them out into the world shortly. Let's see how it works out.</p>Thu, 20 Jun 2013 17:38:00,2013-06-20:2013-06-starting-over.htmlpelican