<h1>Collecting my thoughts about Torus</h1> <p><em>Platypus Reloaded, 06 Jun 2016</em></p> <p>The other day, CoreOS announced a new distributed storage system called <a href="">Torus</a>. Not too surprisingly, a lot of people have asked for my opinion about it, so I might as well collect some of my thoughts here.</p> <p>First off, let me say that I like the CoreOS team and I welcome new projects in this space - especially when they're open source. When I wrote the first C bindings for etcd, it gave me occasion to interact a bit with Brandon Phillips. He seems like an awesome fellow, and as far as I can tell others on that team are good as well. I think it's great that they're turning their attention to storage. I don't want them to go away, or fail. I want to see them succeed, and teach us all something new.</p> <p>If I seem negative, it's not toward or because of the developers. Like many engineers, I have a strong distaste for excessive marketing, and that's what I find objectionable about the announcement. The claims are not only far beyond anything that has actually been achieved, which is fine for a new project, but also far in excess of anything that experience tells us is <em>likely</em> to be achieved within any relevant period of time. Willingness to tackle unknown problems is great, but these are for the most part not unknown problems. The difficulties are <em>quite</em> well known, and represent hard distributed-system problems. If you want to claim that solutions are imminent, it really helps to demonstrate a thorough understanding of those problems. Instead, we're presented with claims that are vague or misleading, claims that illustrate significant gaps in knowledge, and at least one claim that's blatantly false.
Quoting from the announcement:</p> <blockquote> <p>These distributed storage systems were mostly designed for a regime of small clusters of large machines, rather than the GIFEE approach that focuses on large clusters of inexpensive, “small” machines.</p> </blockquote> <p>It's not true for Gluster. It's not true for Ceph. It's not true for Lustre, OrangeFS, and so on. It's not even true for Sheepdog, which Torus <em>very</em> strongly resembles. None of these systems were <em>designed</em> for small clusters. It's true that some of them might have more trouble than they should scaling up to hundreds of machines, but those are implementation issues and the work that remains to be done is still less than building a whole new system from scratch.</p> <p>The same paragraph then continues by talking about the specific problems with high-margin <em>proprietary</em> systems, implying that they're the most relevant alternative. They're not. Already, I've seen many people comparing Torus to open-source solutions, and nobody comparing it to proprietary ones. The omission of other open-source projects from their portrayal stands out as deliberate avoidance of hard questions. So does the lack of any explanation of what makes Torus any better than anything else for containers. Being written in Go doesn't make something container-specific. Neither does using etcd. There's nothing in the announcement about any <em>actual</em> container-oriented features, like multi-tenancy or built-in support for efficient overlays. It's just a vanilla block store using basic algorithms, <em>marketed</em> as good for containers. There's nothing wrong with that, in fact it's quite useful, but it's hardly ground-breaking.
Anyone who attended my FAST tutorials on this subject during the three years I gave them could have built something similar in the same six months.</p> <p>The other part of the announcement that bothers me is this.</p> <blockquote> <p>Torus includes support for consistent hashing, replication, garbage collection, and pool rebalancing through the internal peer-to-peer API. The design includes the ability to support both encryption and efficient Reed-Solomon error correction in the near future, providing greater assurance of data validity and confidentiality throughout the system.</p> </blockquote> <p>"Includes support" via an API? Does that mean it's already there, or planned, or just hypothetically possible? The first two seem to be there already. I wouldn't be so sure about any sufficiently transparent and non-disruptive form of rebalancing. Encryption and Reed-Solomon are supposedly in the "near future" but I doubt that future is really so near. The implication is that these will be easy to add, but I think the people who have worked on these for Gluster or Ceph or HDFS or Swift would all disagree. Similarly, there's this from <a href="">Hacker News</a>:</p> <blockquote> <p>early versions had POSIX access, though it was terribly messy. We know the architecture can support it, it's just a matter of learning from the mistakes and building something worth supporting.</p> </blockquote> <p>"Just a matter" eh? It was "just a matter" for CephFS to be implemented on top of RADOS too, but it took multiple genius-level people multiple years to get that where it is today. Saying this is "just" anything sets an unrealistic expectation. I'd expect anyone who actually understands the problem domain to warn people that getting from block-storage simplicity to filesystem complexity is a <em>big</em> step. Such a transition might take a while, or not happen at all. 
Then there's <a href="">this</a>.</p> <blockquote> <p>Some good benchmarks to run:</p> <p>Linear write speed</p> <p><code>dd if=/dev/zero of=/mnt/torus/testfile bs=1K count=4000000</code></p> <p>Traditional benchmark</p> <p><code>bonnie++ -d /mnt/torus/test -s 8G -u core</code></p> </blockquote> <p>Single-threaded sequential 1KB writes for a total of 4GB, without even <code>oflag=sync</code>? Bonnie++? Sorry, but these are not "good benchmarks to run" at all. They're garbage. People who know storage would never suggest these. We're all sick of complaints that these are slow, or slower on distributed systems than on local disks, as though that's avoidable somehow. Anybody who would suggest these is not a storage professional, and should not be making <em>any</em> claims about how long it might take to implement filesystem semantics on top of what Torus already has.</p> <p>So, again, this is not about the <em>project</em> itself but the <em>messaging</em> around it. For the project itself and the engineers working on it: welcome. Best of luck to you. Feel free to ping me if you want to brainstorm or compare notes. BTW, I'll be in San Francisco at the end of this month. For the <em>marketing</em> folks: get real. You're setting your own engineers up for failure, disappointment, and recriminations. I know you want to paint the best picture you can, but that's no excuse for presenting fiction as fact. That's a perfectly good horse you have there. Maybe that horse will even be good enough to win a race or two some day. Stop trying to tell people it's a unicorn just because you have some ideas about how to graft a horn onto its head.</p> <p><em>Tags: storage, marketing.</em></p> <h1>Real Freedom vs. Illusions of Freedom</h1> <p>Randy Bias has taken it upon himself to explain that <a href="">vendor lock-in is unavoidable</a>.
How surprising that someone who works for one of the most aggressively proprietary vendors in the storage space would say such a thing, with a healthy dose of sneering about unicorns and Santa Claus thrown in. (Really, Randy, what part does "making you dumber" play in your claimed mission of helping people?) Since much of his argument consists of casting aspersions on open source, a response is called for, but first, let's summarize what he says.</p> <ol> <li> <p>Switching costs are still non-negligible even with open source, for a whole list of reasons.</p> </li> <li> <p>We don't have this problem with networking, even though the enterprise-level products there are proprietary, because they're still bound by standards.</p> </li> <li> <p>Because open source doesn't matter, standards and migration tools are sufficient to avoid the lock-in he says is unavoidable. (Wait, what?)</p> </li> </ol> <p>The first point is actually pretty correct, so let's move on to the second. The problem is that the analogy between storage and networking is invalid because switches and routers don't store your data. As Randy himself has written, data has a gravity or inertia that network flows do not. Migrating from one network vendor to another doesn't involve a weeks-long process of moving years-old data. In the ways that matter - different tooling, different skillsets, everything Randy mentioned in the previous point - the switching costs are there for networking just as much as for storage. Anyone who has ever actually been involved in, say, replacing Cisco with Juniper would be acutely aware of that. The difference between storage and networking transition costs has nothing to do with proprietary vs. open source and everything to do with the essential nature of the two functions. It tells us <em>nothing</em> about whether open source matters.</p> <p>The only reason that broken example is even worth addressing is because it's used as a springboard for the giant leap Randy takes next.
Open source does matter precisely because of the ways that storage is not like networking. If a piece of networking gear blows up, you can replace it with another and get back to exactly where you were. The effort and expense might not be trivial, but you won't have suffered a permanent loss of the data that is your company's lifeblood. Sure, there are backups, but the very fact that backups exist is evidence of how different storage is from networking. Restoring from backup is an expensive and imperfect process that has no parallel in networking. Also, storage can't just drop writes the way networks can drop packets. Storage has to prepare for these events <em>ahead of time</em> with various forms of redundancy and careful scheduling of operations. How all this is done, and how to use these artifacts to get at the actual user's data despite a failure, can be very complex. That's why you can't just take some disks out of a high-end storage system and expect to make sense of their contents on your own. You might be able to piece together most of what's on those disks, but to get <em>all</em> of it (especially the most recently written parts) you need a thorough understanding of the formats and algorithms that are being used to write it.</p> <p>That's where open source comes in. I've worked for open-source storage companies and I've worked for proprietary storage companies (including EMC). Nobody - and I do mean nobody - documents enough of their back-end formats and internal operations well enough to put absolutely everything back together under all kinds of failure conditions. The only sufficiently thorough documentation is the source code for the software that actually writes the data. With open source, you at least have a decent chance of getting answers about those kinds of details, or hiring someone who knows and can share their knowledge of those internals, or worst case of figuring out how the software works and developing your own in-house expertise. 
With proprietary software you have none of that. It works the way it works, and it fails the way it fails, and there's not a whole lot you can do about it except trust that the vendor will be responsive. What was that about Santa Claus, again?</p> <p>Lastly, what about migration tools? The fact is that any door that <strong>can</strong> be locked can't be trusted to remain open. Open-source doors can't be locked. Nothing will ever stop you from walking through. Vendor-provided migration or recovery tools, which are inevitably designed to have a limited set of functionality and are tested against a finite set of conditions, might not address the need or condition that you have right now. Maybe they do something close, and you could tweak them to do what you want . . . if you only had the source . . . which you don't, so too bad. Back to the mercy of the vendor. Who knows if or when your enhancement request will be acted upon? Likely never, when - no, I don't mean if - they change their minds and decide that supporting such tools isn't in their strategic interest.</p> <p>There's no panacea for vendor lock-in, not even open source. But open source alone gets you further than any number of standards that don't cover what really matters or vendor-provided tools that might go away at any moment. It's the first and best tool for dealing with lock-in, even if it's not perfect. Or, to paraphrase the <a href="">Man in Black</a>:</p> <blockquote> <p>Life is pain, Highness. Anyone who says differently is selling something.</p> </blockquote> <p>Vendors won't take initiative to give you greater control, as Randy suggests. They might take initiative to give you an <em>illusion</em> of control, but illusions are not reality. Keeping that control in your hands requires continual ongoing effort. There doesn't need to be any malice or deceptive intent involved in its subsequent disappearance. Simple laziness or inattention is sufficient.
Every company goes through shifts in personnel or strategy. Every such shift is likely to leave something behind. Open source was invented in large part to ensure that the one real collection of information about how things actually work <em>couldn't</em> be left behind even when these inevitable changes occur. Its importance cannot be dismissed in favor of some snake oil about magical migration tools that will always be current, always be free, and always cover every need.</p> <p><em>Posted 10 May 2016. Tags: storage, marketing.</em></p> <h1>Updating POSIX</h1> <p>"POSIX is obsolete." If you're a filesystem developer, you've probably heard that many times. I certainly have. It doesn't tell me anything I didn't already know about POSIX, but it does tell me two things about whoever says it.</p> <ul> <li> <p>They don't know what POSIX is.</p> </li> <li> <p>They're lazy.</p> </li> </ul> <p>To the first point, many people seem unaware that POSIX is an actual set of standards - IEEE 1003.1 in several variations, plus descendants. These standards cover a lot more than just operations on files, and technically "POSIX" only refers to systems that have passed a set of conformance tests covering all of those. Nonetheless, people often use "POSIX" to mean only the section dealing with file operations, and only in a loose sense of things that implement something like the standard without having been tested against it. Many systems, notably including Linux, pretty explicitly do not claim to comply with the actual standard.</p> <p>That brings me to the second point. The "POSIX is obsolete" claim often comes from people who can't tell babies from bathwater. They take a few reasonable concerns about the POSIX standard (which I'll get to in a moment) and use those as an excuse to throw out everything to do with POSIX as it exists in the real world. That's the lazy way out.
To pick just one example, if nested directories weren't useful, there wouldn't be about twenty implementations of them on top of various object stores, all of them incompatible with each other and each one subject to numerous race conditions or other bugs that could cost users their data. There's value in the POSIX(ish) feature set. There's even more value in the fact that every programming language and application already knows how to speak that language, without requiring API-specific adapters or shims.</p> <p>I'm not going to defend the official POSIX standard. It's not obsolete, but it is outdated. Yes, there's a difference. Something that's obsolete can never recover its former usefulness. Something that's out of date can. A standard that described the actual and remarkably uniform behavior of the filesystems that are actually out there today would satisfy the POSIX goal of supporting application portability. That would be a good thing ... but it wouldn't be the best thing, because "POSIX in practice" is almost as outdated as "POSIX as written". Both are based on a computing model that I'd place in the late 80s - before SMP and NUMA and complicated cache/memory hierarchies, but even more importantly before distributed systems became the norm. Things that might have made sense or seemed feasible in the original POSIX context nearly thirty years ago often make absolutely no sense today, or in some cases we've learned since then that they were bad ideas all along. These anachronisms make it very difficult to achieve correctness and/or acceptable performance. They create the breathing room exploited by the peddlers of deficient almost-filesystems like HDFS and object stores like S3, which end up being even worse fits for application developers' needs than a filesystem would have been.</p> <p>An up-to-date filesystem interface would avoid these ills.
My goal in this article is to cast light on some of the problems with the current interface, and in some cases propose solutions. Each section will refer to a specific system call, but those system calls are merely exemplars or archetypes representing more general problems that actually affect multiple calls.</p> <h2>Rename</h2> <p>One example of the "stuck in the 80s" syndrome is <em>rename</em>. In the worst case, this might affect four objects:</p> <ul> <li> <p>The source directory.</p> </li> <li> <p>A separate destination directory.</p> </li> <li> <p>The object being renamed (e.g. to update ".." if it's also a directory).</p> </li> <li> <p>Another object at the destination, which is effectively being unlinked.</p> </li> </ul> <p>Thus, a <em>rename</em> might involve several operations. A failure of any one might require a rollback, which might itself fail, etc. In a local-filesystem context, where there's a single journal and you have options like throwing a lock around everything, it might not seem like an intractable problem. However, in a distributed filesystem where you don't have such luxuries (and the probability of that double failure is much higher), it's a total nightmare. It's a nightmare that we in distributed systems have learned to avoid, but neither the <em>de jure</em> nor <em>de facto</em> standards have kept up because everyone who has any influence there has remained stuck in the 80s.</p> <p>The simple solution to this problem is to have each operation affect as few objects as possible. A single-directory rename where the target doesn't already exist is an easy case affecting only one object. It's reasonable to expect that filesystems - even if they're distributed - will fully support this case. It's also reasonable for the filesystem to return EEXIST if the destination already exists, or EXDEV if the rename is across directories. 
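</p>

<p>What handling EXDEV looks like from the application side is straightforward. Here's a minimal sketch (Python for brevity; the helper name and the copy-then-rename fallback are illustrative choices of mine, not from any standard):</p>

```python
import errno
import os
import shutil

def portable_rename(src, dst):
    """Rename src to dst, falling back to copy-and-unlink on EXDEV.

    The fallback path is not atomic the way a same-directory rename
    is; that tradeoff moves above the filesystem, which is exactly
    where it belongs.
    """
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        # Copy to a temporary name next to the destination, then do
        # the easy single-directory rename into place.
        tmp = dst + ".tmp"
        shutil.copy2(src, tmp)
        os.rename(tmp, dst)
        os.unlink(src)
```

<p>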
Any application that can't handle EXDEV for a cross-directory rename is broken already, because it's always possible that the source and destination are on completely different filesystems. If an application wants to ensure that renames are always atomic, it already needs to deal with that above the filesystem anyway, so why impose burdensome requirements on the filesystem as well?</p> <h2>Fsync</h2> <p>Everyone seems to know about one problem with fsync - that it can create huge latency bubbles. However, I see that problem as only the visible tip of the crapberg that is fsync, O_SYNC and all of their friends. In a way, it's really a side effect of the most fundamental problem with POSIX - that it's clueless about the relationship between consistency, durability, and ordering.</p> <p>Let's start with consistency and durability. POSIX is <em>very</em> strict about consistency, requiring full and immediate visibility of any write to any subsequent reader. That probably didn't sound too bad for a single-processor system in the 80s. For a modern distributed filesystem - or even a local filesystem running on a big NUMA machine - it can be quite burdensome. (Yes, this is closely related to the rename-atomicity issue above.) By contrast, POSIX is very <em>loose</em> about durability. Most programmers know that a literal <em>write</em> isn't guaranteed to hit storage unless O_SYNC is set or fsync is issued. What many don't know is that other modifying operations have similar behavior. For directory operations, an fsync is required on the directory being modified, usually requiring that the directory be opened for the sole purpose of issuing that fsync. That's neither convenient nor efficient for anyone, really. The problem of what to fsync for a cross-directory rename is left as an exercise for the reader. ;)</p> <p>What's being missed here is that consistency and durability need to be tunable separately.
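</p>

<p>To see how awkward the current rules are in practice, here's the dance a careful application has to perform today just to make a single-directory rename durable (a sketch; it assumes Linux-style behavior, where opening a directory read-only purely to fsync it is the accepted idiom):</p>

```python
import os

def durable_rename(src, dst):
    """Rename within one directory, then force the result to disk.

    The rename modifies the directory, not the file, so durability
    requires an fsync on the directory itself - which means opening
    the directory for no purpose other than issuing that fsync.
    """
    os.rename(src, dst)
    dirfd = os.open(os.path.dirname(dst) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)  # flush the directory entry, not the file data
    finally:
        os.close(dirfd)
```

<p>The cross-directory version, with two directories to fsync and no defined ordering between them, is the exercise left to the reader above.</p>

<p>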
POSIX requires strong consistency and weak durability, but many applications need the exact opposite. As with the various kinds of barriers and flushes at the CPU level, there should be <em>separate</em> calls to ensure previously-deferred consistency and to ensure previously-deferred durability. Forcing every application toward one corner of the consistency/durability space is a huge part of the reason distributed databases - which allow more flexibility regarding these tradeoffs - have come to be used in so many cases where a distributed filesystem would have made more sense.</p> <p>Now, what about ordering? The problem here is that there's no way to ensure that the system will respect the ordering of two operations unless you wait for the first to complete before even issuing the second. In yet another echo of the 80s, this might make sense when that just means returning from one syscall and issuing another. However, if the filesystem happens to be distributed, we're talking about putting a network round-trip delay between two things that could have been pipelined. Filesystem developers well understand the importance of pipelining instead of playing ping-pong, because they rely on exactly that model from the block layer below them. Application developers should have access to the same thing, as should distributed filesystems layered on top of local ones. It should be possible <em>at the very least</em> to indicate which writes on a file descriptor are part of a reorderable group and which must retain their order relative to groups before or after. As with the durability/consistency options mentioned above, this gives application developers a powerful and yet portable way to manage tradeoffs that are important to them.</p> <p>At this point, we can go back to that above-the-waterline issue of latency bubbles. It's bad that unconstrained buffering can mean that whoever calls fsync might have to wait for gigabytes of pending data to get flushed out. 
It's far worse that entanglement might mean that they have to wait for gigabytes of <em>completely unrelated</em> data from other users to get flushed out. Worst of all, if fsync <em>can't</em> finish any writes it has no way to say so. If it fails, you have no idea what data didn't actually make it to disk. Unfortunately, I don't think there's a reasonable way for a standard to address that. Maybe adding some sort of control over per-file-descriptor buffer limits would be feasible. Beyond that, you start getting into multi-tenant issues that tend to exist only in proprietary form bound about with patents, and that's a poor basis for a standard. On the other hand, I think people often only use fsync because it's the only hammer they have. Who really <em>wants</em> an interface that often destroys performance while making no guarantees of correctness? If they had finer-grain control over consistency and durability and ordering, maybe they wouldn't even need to call fsync.</p> <h2>Readdir</h2> <p>This is actually one of the areas where the problem lies not with POSIX the standard but with "POSIX" as it exists in the world. The high cost of readdirp ("readdir plus") comes from NFS. The utterly insane <em>d_off</em> behavior that we Gluster developers and others have had to put up with is actually specific to Linux. These are real pain points, but this is getting long enough already so I'll just leave them alone for now. The problem with readdir, as defined in the standard, is that it's just too limited. Often, users or applications are actually only looking for files that meet certain criteria - most especially a name matching a certain pattern. That's why "find" and other utilities exist. Unfortunately, POSIX offers no option other than listing every single file in a directory (in effectively random order) and filtering out the ones you don't want. 
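</p>

<p>Today that filtering can only happen on the wrong side of the interface. Here's a sketch of what every "find"-like tool is forced to do (Python; glob-style matching as specified for <em>fnmatch</em>, just applied at the client where it helps least):</p>

```python
import fnmatch
import os

def matching_entries(path, pattern):
    """Return directory entries whose names match a glob pattern.

    Every entry still crosses the readdir interface - and, on a
    distributed filesystem, the network - before being discarded
    here. A server-side filter would avoid that waste entirely.
    """
    return sorted(name for name in os.listdir(path)
                  if fnmatch.fnmatch(name, pattern))
```

<p>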
In yet another repetition of what should by now be a familiar pattern, this is particularly deadly for a distributed filesystem where all of those entries have to be passed over the network. We in Gluster-land would be glad to do some filtering or pattern matching ourselves, if users had some sort of standards-based way to tell us what they want. POSIX even defines the syntax for name-based matching, in the definition for <em>fnmatch</em> and elsewhere. It's just defined and implemented at the wrong level. This is such a common and severe problem that it seems like about time to combine these already-standardized pieces into something that serves users better.</p> <h2>Chmod</h2> <p>Access control is another area where people often can't distinguish between the POSIX standard and "POSIX" implementations in the real world. The standard only defines the simple user/group/other permissions we're all familiar with. It's a very useful model, and I think failing to support it is one of the most egregious examples of laziness on the part of the object-store folks. However, it's clearly not sufficient for all needs, so there was a later attempt to add more complex access control lists (see "man chacl" if you're not familiar with them). And yet, despite the fact that some popular platforms did implement the ACL semantics defined in POSIX.1e draft 17 (really), it never actually became a standard. Maybe that's for the best, because these ACLs still rely on the concept of a group, and that has (at least) the following problems in a distributed world:</p> <ul> <li> <p>There has to be at least some agreement between clients and servers about what groups mean, or else comparison between the group(s) being presented and the group(s) allowed to perform an action just makes no sense.
I had to deal with exploits based on this when I was at Encore in 1990, and it doesn't seem like things have gotten a lot better since.</p> </li> <li> <p>Attaching long lists of groups to every request, because you never know which one(s) might confer the needed access for that operation, is inefficient. Also, arbitrary-length lists are a pain from a protocol-definition standpoint, and no finite number ever seems to be enough. From Gluster I know of installations where users literally belong to hundreds of groups.</p> </li> <li> <p>There are still use cases that groups don't satisfy, such as access via a specific program (MTS had PKEY access for this before UNIX ever existed) or for a limited time.</p> </li> </ul> <p>What would be better? Capabilities. No, not the horrid mess of meaningless flags born of POSIX.1e and adopted by Linux. In the real CS literature, which those people apparently never read, a capability is an unforgeable token that can be communicated to others and which confers access to an object. Modern capabilities use end-to-end cryptography instead of relying on operating systems or other intermediaries to maintain a "chain of custody" between the granter, user, and target of the capability. This means anyone can make one up on the fly, attach it to an object, and then send it to whatever <em>ad hoc</em> collection of entities should have access. This collection can include both users and programs, with no requirement for either to be registered as a member of a group. You can do anything with this model that you can do with user/group permissions or POSIX.1e ACLs, plus a whole lot more, with better security and without the implementation problems mentioned above.</p> <h2>Conclusions</h2> <p>I'm sure there are many more parts of POSIX (both in the standard and in practice) that I could pick at, but hopefully these are enough to get started.
The point is not in the specifics but in the fact that (a) there are <em>serious</em> problems with the current standard and (b) solutions to those problems are mostly well known. It's a crying shame that neither the official standard nor the dictators of the unofficial standard (i.e. what popular OSes actually implement) reflect the hard work and ingenuity of so many computer scientists and our fellow practitioners over in database-land. The people who say "POSIX is obsolete" are incorrect today, but if we filesystem developers keep screwing up so badly they might eventually be right.</p> <p><em>Posted 26 Apr 2016. Tags: storage.</em></p> <h1>Stone Age Programming</h1> <p>As a systems programmer, I get to work with a lot of old-fashioned code and tools. The code base I work on every day is in C, complete with manual memory management and constant checking of return values instead of exceptions. Heck, the Gluster coding style even involves using "goto" for most error handling within a function. One could argue that this is all as it should be, and that features such as GC or exceptions don't belong in systems code. Within some systems-programming subdomains that's even true, but certainly not in all. Most often, doing everything the hard way is a vestige of code having been written in a more primitive era and not rewritten since. Nine times out of ten, writing <strong>new</strong> code in that same style would be somewhere between unwise and idiotic.</p> <p>Fortunately, as one moves further from the kernel/embedded space, the code rapidly becomes more modern. A lot of newer infrastructure code, from distributed object stores and databases to configuration management and container provisioning, is written in more modern languages. For a long time that meant Java, more recently it's Go, with Python sort of running second all along. You can even add Clojure or Erlang to the list if you want.
Clearly, these people understand the value of having more modern features in a programming language. That's why it amazes me to see some of those very same people clinging to an archaic style when it comes to dealing with asynchronous programming. I refer, of course, to the "callback hell" style best known from Javascript and node.js.</p> <div class="highlight"><pre> <span class="nx">downloadFile</span><span class="p">(</span><span class="s1">&#39;;</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">err</span><span class="p">,</span> <span class="nx">data</span><span class="p">)</span> <span class="p">{</span> <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s1">&#39;Got weather data:&#39;</span><span class="p">,</span> <span class="nx">data</span><span class="p">);</span> <span class="p">});</span> </pre></div> <p>(from <a href="">Stack Abuse</a>)</p> <p>What's happening here is that we're creating an anonymous function, then calling downloadFile and explicitly telling it to call that anonymous function when it's done. This is manual stack management, like you have to do in assembly language. It's what people did back before procedures and functions became commonplace <strong>in the 60s</strong>. The equivalence becomes even more apparent when you consider a case where we need to pass some of our own data through downloadFile and expect to get it back in our anonymous function. If you're lucky, your language has lambdas to do that. 
Failing that, you'd better hope that there's a version of downloadFile with an extra "user data" argument for that purpose, because if you don't even have that things get <strong>really</strong> ugly.</p> <p>People who advocate this approach to concurrency will try to make it sound all computer-science-y by talking about closures and continuations, but it's still fundamentally a stone-age technique - doing something manually that should be taken care of automatically, like manual memory management vs. automatic GC. They'll also make wild claims about how threads are so inefficient, but never back those claims up. Here's a suggestion: go write a program to see how many thread switches per second you can get on a modern processor. It's in the millions, which is more than enough for most situations. And that's for OS threads, which have to go through the kernel's scheduler. The numbers for user-level ("green") threads are even higher. What's happening is that the callback-hell advocates are conflating thread switches with process switches, which really are expensive because they have to do a lot more. A <strong>page fault</strong> is often more expensive than a thread switch, but ask one of these "threads are slow" types to explain how they're dealing with page-fault overhead and I guarantee you'll get nothing but a blank stare.</p> <p>I'm not saying that threads are The Answer to concurrent/asynchronous programming. Far from it. They have their own problems, though a lot of those really have more to do with the way people do locking than with threads themselves. I've written about the actor model before, and more recently I've been moving more toward the promise/future camp. There are several approaches, all subject to the usual kinds of tradeoffs and preferences. All I'm saying is, let's please stop portraying the callback-heavy style as an <strong>advance</strong> when it's really a huge step <strong>back</strong>. It's old and rusty, not new and shiny.
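</p>

<p>For contrast, here's the weather-download example from above rewritten against futures (a sketch in Python; <code>download_file</code> is a stand-in for any blocking fetch, not a real API):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def download_file(url):
    # Stand-in for a real blocking network fetch.
    return "weather data for " + url

pool = ThreadPoolExecutor(max_workers=4)

# The call site reads top to bottom. No anonymous continuation is
# manually threaded through the API, and any "user data" needed
# afterward is just a local variable still in scope.
future = pool.submit(download_file, "http://example.com/weather")
print("Got weather data:", future.result())
```

<p>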
In fact, if you look under the covers of how any of these callback-oriented systems are actually implemented, you're likely to find a core/engine/reactor that's implemented in a far different and fundamentally better paradigm. Use that directly, instead of layering an archaic style on top of a modern framework.</p>Thu, 28 Jan 2016 10:06:00,2016-01-28:2016-01-stone-age-programming.htmldesignworkingFirst Race<p>Today I ran for the 200th time this year, fulfilling a promise I'd made to myself ten months ago. That makes me very happy. To make it extra special, I deliberately scheduled my runs (starting a couple of weeks ago) so that #200 would be during the Genesis Battlegreen 5K right here in Lexington. Here's a picture taken for me by my lovely wife close to the 4km (80%) mark.</p> <p><img alt="" src="" /></p> <p>No, it's not the best picture of my face and I sure don't look like I'm having fun, but my form looks pretty good IMO. Anyway, here are some details.</p> <h2>Prelude</h2> <p>I've run courses very close to this one many <em>many</em> times. More than 80% of it overlaps with my most common routes, which I know literally better than the back of my hand by now. I've run this exact course with a different start/end point maybe half a dozen times. On Friday I ran the <em>exact</em> same course as a kind of dress rehearsal, mostly to gauge the best pace for each section. I got a good time, and felt great initially, but that night the upper-outside portion of my right calf felt a bit tender. Running as much as I do, that wouldn't necessarily keep me off the road, but on Saturday morning I was seriously thinking about whether I wanted to risk it with the extra adrenaline of an actual race. I did everything I could to help it heal, and this morning I was relieved to feel that it was good to go.</p> <h2>Main Event</h2> <p>Because of the potential for injury, I changed my strategy a bit. I certainly wasn't going to risk injury going for a new personal record.
This particular type of injury is characteristic of overstriding, which is usually something I don't even need to think about, but at my very highest pace my form does break down a bit. My stride lengthens, I shift my weight back a bit, and I'm sure that's exactly what almost got me on Friday's home stretch. So, no top speed. Instead, I resolved to keep my pace <em>down</em> on the flats, and focus on maintaining speed on the uphills - two short steeper ones, and a longer but very shallow one over almost the entire third kilometer. This is where I was glad that I know how every inch of that course feels under my feet.</p> <p>Of course, I didn't quite follow the plan. I did a pretty good job on the first kilometer or so, consciously slowing down and letting others pass me on a stretch where I'd usually be trying to shave seconds off my time. After that I definitely ran faster than I'd meant to. My leg was feeling fine, though, so I just went with it. While I didn't pass too many people on the steeper uphills, on the longer (and later) one I had a pretty steady rhythm of passing someone every ten seconds or so. The biggest challenge to my self-imposed discipline was at the end. Besides my normal inclination to blast through that part, I knew I had three people right behind me - one oldster like me, and two middle-school kids - trying to squeak past. I felt myself starting to speed up in response, but I kept telling myself that's <em>exactly</em> the kind of running I needed to avoid. If I had held them off but injured myself doing it, I'd be beating myself up over it for months.
Most importantly, I reached run #200 and my legs feel fine. I'll probably even run again tomorrow.</p>Sun, 01 Nov 2015 14:46:00,2015-11-01:2015-11-first-race.htmlrunningSoftware Will Nibble On Storage<p>There's a theme I keep coming back to lately, of the relationship between hardware and software and how that relates to recent industry developments such as the acquisition of EMC by Dell. It has shown up for me on <a href="">The Register</a>, on <a href="">StorageMojo</a>, and last night on <a href="">Twitter</a>, so it seems like a good time to get these ideas out of my head by writing a blog post. Yeah, you can call that garbage collection if you want. ;)</p> <p>When you need to set up some storage, you have to figure out where the software's coming from.</p> <ol> <li> <p>Bundled with hardware.</p> </li> <li> <p>Bundled with a service.</p> </li> <li> <p>Standalone software.</p> </li> <li> <p>Roll your own.</p> </li> </ol> <p>For the last two, you then need to decide where the software's going to run - your own hardware vs. someone else's (in the cloud). That's really six choices, or eight if you consider proprietary vs. open-source software to be separate categories.</p> <p>Obviously, some of these choices make more sense than others. Historically, enterprise types have considered only #1 to have sufficient functionality and/or performance, while #4 wasn't even feasible. That's where EMC and NetApp and the rest of the storage rogues' gallery made their fortunes. However, all of that is changing. Standalone software, such as I work on, has become far more competitive. I won't say it's all the way to where it needs to be. Since I'm so involved in trying to get us there, very few indeed are as aware as I am of the distance still to be traversed. Still, we've come a long way and we have the momentum. The service offerings, often leveraging the open-source ones either directly or indirectly, have also become more competitive. 
I'm going to leave "roll your own" for later, but the point is that the "storage monolith" approach is under siege from multiple directions. Those markets are no longer captive. Those who are willing to do without a particular feature, or who can fill the gap with some niche software product, will leap at the chance. Both volume and margins for the big boxes will keep crashing down.</p> <p>That's where the title of this post comes in. "Software will eat the world" is almost as popular as "disruption disruption disruption" was until companies like Uber and Theranos gave it a bad name. The thing is, storage hardware as a business is not going away. For one thing, someone obviously has to make all of the components. Seagate and WD/HGST (plus SanDisk now) are still going to make stuff and make money. For another thing, even storage <em>systems</em> won't be going away as a business. Storage workloads and power/density requirements create different design points than are well served by "standard" hardware. There's a different balance between CPU, cache, memory, internal and external interconnects, and other components (e.g. various NVRAM possibilities) to go with the masses of disk or flash. There will continue to be a market for hardware that's specially designed, chosen, assembled and/or tuned to serve those purposes. However...</p> <ul> <li> <p>It will be a smaller market than at present.</p> </li> <li> <p>It can be just as well served by current compute-hardware vendors as by current storage-hardware vendors.</p> </li> <li> <p>It will increasingly <strong>not</strong> run the hardware vendor's own software.</p> </li> </ul> <p>Yes, folks, the EMCs of the world might have to play nice with the Red Hats of the world to make their stuff run well together. I'm sure that will be lots of fun in lots of ways. Most relevantly, a smaller market plus a smaller share of that market plus sharing more of the wealth with partners isn't going to be good for their bottom line. 
Software will not eat the world, but it will definitely eat some large land masses.</p> <p>OK, now here's the part I'm less happy about. The hardware vendors aren't the only ones under siege. We software vendors are too. As I pointed out to Chris Mellor at El Reg, the service providers are both an opportunity and a threat. Some people license our software to run in the cloud. Others use various cloud facilities <em>instead</em> of our software. (No, I don't care about the persistent rumors that e.g. Amazon's code is actually someone else's open-source code. Even if I gave those rumors any credence, it wouldn't matter much in the grand scheme of things.) Over time, though, I think the "threat" part will outweigh the "opportunity" part. Besides its convenience, the cloud pricing model makes that option <em>appear</em> cheaper than a software license, exerting downward pressure on prices.</p> <p>Then there's the worst threat to us all, which is roll-your-own. Told you I'd get back to that. Rolling your own storage software used to be almost unthinkable. Now, not so much. Google, Facebook, Twitter, LinkedIn, and others have each rolled their own <em>multiple times</em> to address different use cases. They invented some of the techniques that are now well known among storage developers. I even like to think that I played some part in making them well known, through my articles and talks. As I said to Robin at StorageMojo, what's important is that <em>the knowledge is out there</em>. So are the components. Over time, it won't just be the big ubergeek companies rolling their own. More companies will start doing it, probably not from scratch but combining various open-source pieces with some of their own. "Mass customization" is not an oxymoron; it's the natural child of commoditization (for parts) and automation (to create systems).</p> <blockquote> <p>Good morning, madam. 
What kind of storage system would you like me to build for you today?</p> </blockquote> <p>Scary thought. That means that selling storage <em>products</em> is going to be hard for all of us. We'll be selling components, both hardware and software, or we'll be selling integration and support services. Somebody will always pay to have somebody else assemble the parts, maybe add some light customization, and support the result. There's a nice living to be made there . . . but no empires. I think that's what's behind some of the M&amp;A activity. People can see the empires crumbling. Even if they haven't thought it through as consciously as I've tried to lay it out above, something in their gut tells them they'd better get what they can <em>when</em> they can. Everyone's rushing to build and defend their own little fiefdom before it all falls apart. Think about that the next time you see another storage company disappear. It won't be long.</p>Tue, 27 Oct 2015 18:52:00,2015-10-27:2015-10-nibble-on-storage.htmlstorageWinter Running in New England<p>I still consider myself a bit of a running n00b. Several months ago, I was even more of one - so much so that I kept running through one of the worst winters anyone here seems able to remember. Paradoxically, that n00b decision seems to have left me in the position of knowing more than most about how to run safely in those conditions. Since a couple of friends have expressed curiosity about that exact topic recently, I might as well collect those thoughts here.</p> <p>First, the good news. It's entirely feasible to keep running all through a harsh New England winter. There are certainly some challenges, which I'll try to address. There are rewards too. However, a little bit of context might help. I'm not <em>that</em> hard core. I'm talking about running in the suburbs, not the city or the country. I'm sure those present their own different challenges, of which I am certainly still ignorant.
I'm not talking about extreme conditions, either. Even in the depths of winter I was still running on dry streets, not snow, and only down to about 15ºF. I'm crazy, but not that crazy.</p> <p>The most important thing about winter running is <strong>situational awareness</strong>. Sidewalks are likely to be useless, so you'll be out in the road with the cars. Both visibility and mobility are going to be restricted by piled-up snow and other obstacles. This is a dangerous situation, so the first thing you want to do is improve your odds as much as possible. Always know where you'll go if a car comes along, and for heaven's sake don't impair your ability to hear them. Learn when and where the school/work rush hours are going to pose a problem. Learn how long the snowplows remain out after a snowfall (so you can avoid them) and where they dump the big piles (ditto). Learn which roads have too many turns or driveways with poor visibility, and avoid them. Ditto for steep downhills (uphills are actually OK) and places where puddles are likely to form. Some of my favorite summer routes are unusable in winter for one or more of these reasons, but that's life. Knowing a variety of routes in your neighborhood is always good, but these limitations make it even more important in winter.</p> <p>Another big thing for winter running is knowing the weather. I find that <a href="">Weather Underground</a> is very accurate ahead of time, but just before I go out I double-check on <a href="">AccuWeather</a>; their "MinuteCast" is often eerily accurate. While knowing the temperature and likelihood of precipitation might determine <em>when</em> I run, wind speed and direction might determine <em>where</em>. Again, knowing a lot of routes comes in handy. There's nothing quite like coming over a hill or around a bend and getting blasted with a freezing wind. Lack of leaves on the trees might be good for visibility, but it can also make you more exposed.</p> <p>OK, so let's talk gear. 
The most important thing is not so much specific items or brands but <strong>flexibility</strong>. I'll wear different gear if it's 32ºF than if it's 24ºF, and different gear again if it's 16ºF. Wind and humidity are also factors. It's also important to remember how much you warm up while you're running. I warm up <em>a lot</em>, so I dress to be slightly cool at the outset and I still usually end up taking off my cap and gloves before I'm done. Lastly, don't wear cotton. Sweat + cold = death, and cotton just absorbs too much. I'm an all-synthetic guy myself, but others swear by wool and/or silk.</p> <p>With all that said, and purely by way of example, here are some of the items I have in my own winter-running closet. YMMV.</p> <ul> <li> <p>Head: lightweight "beanie" style hat. We're talking no more than a couple of layers of thin poly here. I'm not a big fan of ear warmers, but I do make sure my cap covers most of my ears. You can find any number of these at any running store.</p> </li> <li> <p>Face: I have a <a href="">convertible hat/mask</a> that I really like, but mostly for snowboarding. I only used it for running on the coldest days; otherwise it was too warm.</p> </li> <li> <p>Trunk/arms: Mostly I'd run in my usual T-shirts plus a <a href="">lightweight jacket</a> which I just love (mine's red BTW). Breathes well, nice little thumb holes to keep the sleeves from riding up, reflective material, etc. For really cold weather I have a couple of thermal long-sleeved shirts, but even the lighter one would be too warm above 20ºF or so.</p> </li> <li> <p>Hands: Like beanies, lightweight gloves are easy to find. I have two pairs, one <a href="">Under Armour</a> and one Saucony (I think). The UA ones are <em>very</em> slightly warmer, which I mention because even tiny gradations become very noticeable when you're out there. It's worth it to have multiple hat and glove options.</p> </li> <li> <p>Underwear: I have some New Balance, some Puma, some Champion. 
I can barely tell the difference. The important thing is that none of them are cotton.</p> </li> <li> <p>Legs: probably my favorite find (narrowly beating out the jacket) was these <a href="">leggings/tights</a>. They're honestly a bit of a pain to get on and off, but they're absolutely perfect for keeping the wind and splatters off. I'd wear these with shorts over them for a little extra warmth and to look (just slightly) less silly, anywhere from 40ºF on down, and my legs never felt too warm or too cold. Modern technology is awesome.</p> </li> <li> <p>Socks: the biggest decision point for each run. At the warmer end, I could actually get away with the same Balega or Fitsox I wear all year. At the colder end I'd wear <a href="">insulated socks</a>. Most of the time, in between, I'd wear <a href="">calf length</a> <a href="">compression socks</a> or (if I ran out of those) light ski socks.</p> </li> <li> <p>Shoes: there are special winter running shoes, and "micro spikes" for better traction, but to be honest I never had much use for either. I just ran in the same Asics GT-2000 shoes I'd been using already, and tried to avoid puddles. My feet never felt cold, and I never felt that I didn't have enough grip.</p> </li> <li> <p>Other: certain kinds of chafing are more of a problem in winter. I'll just mention <a href="">Transpore tape</a> and <a href="">Friction Defense</a> as potential solutions. If you have that even more awkward kind of chafing, I can recommend <a href="">Chamois Butt'r</a>. If you think it's gross to talk about these things I'm sorry, but if that's what it takes to save someone else some discomfort then it's worthwhile. I wish somebody had clued me in before I had to figure this stuff out on my own.</p> </li> </ul> <p>With all of that gear and preparation and good habits, you should be able to run safely even in that New England winter. It can even be fun. There's a special kind of quiet after a storm, and a special kind of light all the time. 
There are no cyclists. Places that are hidden behind greenery in summer become visible through bare trees. There's no danger of overheating. This spring, I was worried that I wouldn't even be able to run in anything over 50ºF because I'd gotten so used to it being cooler. I did adjust after all, but I think I still prefer running when it's cooler. You might find that you enjoy it too, no matter how crazy it seems.</p> <p>UPDATE (October 19). While I was putting all of this into practice today, I came up with a few more things I should have mentioned.</p> <ul> <li> <p>First and foremost, <strong>take it easy</strong>, especially at the beginning of the cold season. Your muscles will stay stiffer longer, increasing the risk of over-extension if you push too hard. Your shoes will have less flex too, putting even more stress on your muscles to absorb impact. It's not just your legs, either. Your core will also stiffen up a bit. Your body will be less willing to take big gulps of delicious air when that air's cold. Everything's going to be just a bit harder, especially at the top end of your range. Don't expect to maintain the same pace as in summer. If you do, that's great, but be prepared for a bit of a slowdown. It'll all come back to you in spring.</p> </li> <li> <p>At any time of year, I recommend bringing some ID plus a credit card and/or a small amount of cash, just in case something happens. I use a magnetic pocket that clips onto my waistband. Others prefer wrist or arm bands. Some people always bring a phone, though personally I can do without having something so dense weighing a pocket down.</p> </li> <li> <p>Don't put your all-synthetic socks or underwear in the dryer. It won't necessarily kill them right away, but your stuff will sure last a <em>lot</em> longer if you hang it out to dry.</p> </li> </ul>Fri, 09 Oct 2015 14:31:00,2015-10-09:2015-10-winter-running.htmlrunningA Year Of Running<p>About a year ago, I started running.
I say "about" a year because I don't know the exact date. I know it was early July, so it's not quite a year, but I feel like writing about it now so here goes.</p> <p>A year ago, I knew nothing. I didn't know about pronation and supination. I would have guessed that "gastrocnemius" was something to do with food, and "fartlek" was just a word that sounded funny. I didn't know the dangers of treadmills (or how to avoid them) and had no idea how winter running could possibly work. I had never heard of chamois creme, but perhaps the less said about that the better. Now I'm still far from an expert, but I can understand what more serious runners are saying and sometimes offer advice to runners even newer than me. But that's all just talk. How's my actual <em>running</em>? Let's take a look.</p> <ul> <li> <p>When I started, I needed about ten walking breaks during a 2.5-mile run, and it took me about 28 minutes.</p> </li> <li> <p>On August 11 I was able to do that same loop continuously, without walking.</p> </li> <li> <p>On September 13 I did my first (unofficial) 5K, in 28:09.</p> </li> <li> <p>On October 27 I did my first (also unofficial) 10K, in 57:25.</p> </li> <li> <p>I ran all through the winter, even in that awful February. I learned how to gauge the weather and bundle up enough but not too much. I learned how to recognize black ice. I became intimately familiar with every break between the snowbanks on every route I ran, in case I needed one to avoid a car, and I learned which streets to avoid entirely.</p> </li> <li> <p>I've worn out my first <em>two</em> pairs of shoes. Currently I alternate between two other pairs, because now I want different shoes for long runs than I do for steep/fast runs.</p> </li> <li> <p>I have quite a collection of other gear, from all-synthetic underwear and tights to belts and (my favorite) a clip-on pocket so I never have to leave the house without a key and a credit card.</p> </li> <li> <p>I've run on three continents. 
In fact, I did that within a two-week period - Bangalore, Boston, Barcelona.</p> </li> <li> <p>My weight is down to 180. My resting pulse is down to 54.</p> </li> <li> <p>I feel deprived when I <em>don't</em> run. I have to stop myself when I'm sore, or when it's too hot, and sometimes that's a challenge.</p> </li> </ul> <p>Currently my best 5K is 23:28 and my best 10K (this morning!) is 51:42. In other words, I can go either 2.5x as far or 1.5x as fast as when I started. If I still qualified as a "Clydesdale" I might be a contender in some of the local races, but I've lost too much weight. My new "M5059" division is much more competitive, but it looks like I'd still be able to place above half-way more often than not. I'm currently on track (heh) to achieve my New Year's goal of running 200 times this year, which is likely to put me at more than 1000 kilometers but less than 1000 miles. My other goal is to "run under my age" in a 10K. I don't think beating my current best by 1:29 tomorrow is likely, but beating it by 0:56 on December 31 might be. Otherwise, I'll have something to shoot for next year.</p> <p>So yes, I think I've made good progress and I'm proud of that, but before I get too full of myself I have to give a hat tip to my friend Hank. There are many people who have supported and encouraged me, not least Cindy who has to put up with my too-detailed reports every time I come back in. None of them are as inspirational as Hank. The dude ran the Grand Canyon, rim to rim <em>to rim</em>. Just a couple of weeks ago he ran the Mount Washington Road Race - 7.6 miles with an <em>average</em> grade higher than Loring Hill or the other "steep" sections I do around here. He looked good doing it, too. That shows me how far I still have to go, how much more I have to shoot for. That's pure gold right there.</p> <p>I don't know how far I'll go with this thing. I might do a half-marathon some day, but I'll probably never do a full one. 
From everything I've read and heard, it just doesn't appeal. I'm more likely to follow in Hank's footsteps (heh) and try some of the steeper/wilder stuff. Years of hiking and stair-climbing have already made me stronger on hills than elsewhere. Or maybe I'll just focus on doing normal 5K and 10K runs faster. One way or another, I'm pretty sure I'll keep running for a while yet. In one year running went from something I hated to something I did out of desperation to something I now do out of habit and desire. It still feels weird to say I'm a runner, but I guess after a year there's no way to say I'm not.</p>Sun, 05 Jul 2015 13:13:00,2015-07-05:2015-07-year-of-running.htmlrunningObject Store File Systems<p>Several years ago, Amazon created something called S3 - Simple Storage Service. The "simple" part was based on the premise that distributed file systems are too complex, inhibiting scalability while providing too little marginal value to users. According to that theory, a system with a simpler API and semantics (e.g. weak consistency) should be preferable. It's an appealing story, which has led to many imitators - most notably OpenStack's Swift.</p> <p>Personally, I've always viewed the "file systems are too complex" claim with skepticism bordering on contempt, but that's actually neither here nor there. Even if we accept that claim for the sake of argument, there's another claim I've been seeing that still remains beyond the pale - i.e. that somehow implementing a distributed file system <em>on top of</em> an S3-like object store can make the world better. No, it can't. However complicated a distributed file system might be to begin with, it sure doesn't get any less complicated when you stick an alien (usually HTTP-based) API in the middle. Overcoming the impedance mismatch between that API and its associated consistency/durability semantics vs. what a file system <em>requires</em> will always involve extra work and extra potential for failure. 
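To make that impedance mismatch concrete, here's a minimal sketch of what a byte-addressable write turns into over a whole-object GET/PUT store. getObject and putObject are hypothetical in-memory stand-ins, not any real S3 API:

```javascript
// In-memory stand-in for a whole-object store: you can only GET or PUT
// complete objects, never a byte range within one.
const store = new Map();

function getObject(key) {
  return store.has(key) ? store.get(key) : Buffer.alloc(0);
}

function putObject(key, buf) {
  store.set(key, Buffer.from(buf));
}

// A POSIX-style positional write, emulated: fetch the whole object,
// patch a few bytes, re-upload the whole thing. Nothing here is atomic
// with respect to a concurrent writer doing the same dance.
function pwrite(key, data, offset) {
  const old = getObject(key);
  const merged = Buffer.alloc(Math.max(old.length, offset + data.length));
  old.copy(merged);
  data.copy(merged, offset);
  putObject(key, merged);
}
```

A five-byte write becomes a full-object read plus a full-object upload, and every consistency or durability guarantee the file system layer promises still has to be rebuilt on top of that.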
Implementing a file system on top of a richer kind of object API such as Ceph's RADOS can make sense, but whatever kind of file system you could implement on top of S3/Swift objects could be implemented <em>better</em> on top of a more compatible abstraction. There are in fact plenty of people who have already been doing that for years, and they've gotten pretty good at it. You're not going to improve on that with a design that people in that community already tried and long since improved upon.</p> <p>It's logically inconsistent to say that file systems in their native form are too complex, but file systems implemented on top of simple object stores wouldn't be. It's not just incorrect; it's impossible. Anyone making such a claim either doesn't know the truth or doesn't care about it. Either way, would you trust your data to someone who's trying to sell you the data-storage equivalent of homeopathy or perpetual-motion machines?</p>Tue, 23 Jun 2015 08:50:00,2015-06-23:2015-06-object-store-fs.htmlcephcloudopenstackstorageHTTPS Everywhere and Civil Debate<p>I really don't want to get in the middle of the "HTTPS Everywhere" debate, but a <a href="">recent message</a> on the topic by Roy Fielding (of REST fame) really bothered me, so I'll add my voice to the chorus anyway. Let's start with the non-technical problem with that email, just to get it behind us sooner. Here's what Fielding has to say near the end of the message.</p> <blockquote> <p>TLS everywhere is great for large companies with a financial stake in Internet centralization. It is even better for those providing identity services and TLS-outsourcing via CDNs. It's a shame that the IETF has been abused</p> </blockquote> <p>Yes indeed, abuse is a problem, and that passage is abusive. It implicitly accuses those on the opposite side of the debate of acting in bad faith, without even making an exact accusation that can be answered. 
Such "well poisoning" has no place in a supposedly technical debate, and I don't think it's a coincidence that the IETF chair sent out a <a href="">reminder</a> about discussion style and respect. Even if there are valid concerns about conflicts of interest, it's better to bring them up in a different forum and manner.</p> <p>Also, if Fielding wants to talk about conflicts of interest, his <em>first</em> responsibility (as recommended e.g. by the ACM <a href="">code of ethics</a>) should be to disclose his own conflicts. How does his public stance on this issue relate to what his employer (Adobe) wants? How does it relate to his controversial decision to have the Apache web server ignore "do not track" requests? He has said that his actions were motivated by a belief that Microsoft was trying to sabotage DNT by making it a default - some pretty twisted logic there. Isn't it at least as likely that he's familiar with such attempts at sabotage because he's involved in one? He's remarkably effective at it, too. Actions speak louder than words, and his only <em>action</em> on this issue so far has been distinctly anti-privacy.</p> <p>Maybe, instead of opening such a can of worms by accusing others, Fielding should stick to the technical issues. Unfortunately, he's wrong there too.</p> <blockquote> <p>TLS does not provide privacy. What it does is disable anonymous access to ensure authority.</p> </blockquote> <p>TLS provides two kinds of functionality: authentication and encryption. Since it prevents passive collection of data in flight (e.g. at routers), encryption is clearly <em>good</em> for privacy. Therefore, for Fielding's claim to be true, authentication must be at least equally <em>bad</em> for privacy. Is it?</p> <p>As it turns out, TLS provides two kinds of authentication - servers to clients, and vice versa. The only way TLS makes clients less anonymous is if client certificates are used, which they rarely are.
I happen to think that's a shame, because there are many situations where they'd be better than common alternatives, but that's the way things are nonetheless. Except in a few rare cases, TLS does nothing to make clients less anonymous than they were before. Fielding makes a big deal about how the sum of a client's interactions with many servers can still be used to reveal their identity and activities, but every bit of that information available with TLS is still available without TLS. TLS didn't make that part worse.</p> <p>The only semi-lucid nugget of truth in Fielding's rant is that ubiquity of TLS makes it more likely that website or application designers will embed other kinds of credentials in the HTTP/S stream(s). The (clumsily unstated) assumption is that this information can be harvested <em>at the endpoints</em> and used to facilitate the kind of traffic analysis mentioned previously. Well, perhaps, but these "equalization of risk" arguments tend to cut both ways. TLS itself makes it safer to send credentials over the wire, and application designers might well respond by doing so more frequently . . . but why assume they'll stop there? Might they not equalize again, by adding more safeguards at the endpoints to prevent misuse of that data? Should we <em>assume</em> that they won't? The slope's not slippery one moment and sticky the next. That's even worse than assuming it's slippery throughout.
From what has been presented so far, the argument in favor of HTTPSE looks a lot stronger than the argument against.</p>Fri, 05 Jun 2015 15:16:00,2015-06-05:2015-06-https-everywhere-and-civil-debate.htmlinternetpoliticsDistributed File System SBFAQ<p>There's a lot of hype around distributed file systems and their relatives, such as object stores. Every week, it seems, there's a new project claiming to be the fastest, most scalable, most robust, most space-efficient distributed file system ever, sweeping all precursors before it. Nine times out of ten, those claims are simply ridiculous. A distributed file system is a complex thing. Design choices and tradeoffs must be made. Anybody who claims to be the best in all of these categories is simply lying, and most new projects break new ground in only one direction - all too often, in none. Instead of trying to identify the ways in which each new braggart is lying to you, I've compiled this list of questions so you can figure it out for yourself.</p> <ul> <li> <p>Is it, in fact, a file system at all? A real file system must be mountable and usable in the same way as a local file system, e.g. to store your source code and run your applications. That means it must have indefinitely hierarchical directories, byte addressability within files, UNIX-style permissions, and so on. It should also meet all or at least most POSIX requirements around issues such as consistency and durability. For example, it is <strong>not</strong> OK for a file system to support only whole-file or appending writes, or to ignore <em>fsync</em>. Not everybody needs a real file system - many people are quite happy with object stores, and good for them - but if your project doesn't meet the basic definition of a file system then don't call it one.</p> </li> <li> <p>How is metadata distributed? This is probably the biggest distinction among distributed file systems.
Serious practitioners have known for a decade or more that single-metadata-server designs are bad for both reliability and scalability. Old-school active/passive failover doesn't even fully address reliability - the <em>entire</em> file system goes down during the failure-detection interval - and fails to address scalability at all. Having to provision two special ultra-powerful machines as your active and standby metadata servers should not be acceptable any more, but to this day new projects based on this approach continue to appear and still claim to be the best. Systems that are designed to have <em>as many metadata servers as you want</em> are common enough that you shouldn't have to settle for anything less.</p> </li> <li> <p>How is <em>data</em> distributed and/or replicated? Does each file live on only one node, or can files be split/striped across multiple nodes? Is replication (or erasure coding) built in, or will you have to rely on some other piece of software to protect against disk or node failures? Can data be replicated across data centers? If so, with what kinds of consistency/ordering guarantees?</p> </li> <li> <p>What access methods are supported? Even when accessing a distributed file system <em>as</em> a distributed file system, there are differences among native protocols (implemented either in the kernel or user space), NFS (multiple versions), and SMB (ditto). It might also be useful to have an S3/Swift object-store interface, or a block-device interface.</p> </li> <li> <p>What network configurations are supported? Does the system support InfiniBand or other kinds of RDMA? Does it use one network for both client-to-server and server-to-server communication, or can/must those be segregated? Does it support IPv6?</p> </li> <li> <p>How easy is it to set up and use? This is another major differentiator. Distributed file systems have traditionally been one of the most difficult kinds of software to work with. 
With only a couple of exceptions, setting one up will require an inordinate amount of editing and copying files around manually, on each node and usually for each layer. Does adding or removing a node require more manual reconfiguration? Can it even be done online, or does it require a cold restart? How about a software upgrade?</p> </li> <li> <p>What security features does it have? Does it support ACLs? Which flavor? Does it support SELinux? Can data be encrypted on the network? On disk? Where are the keys, in either case? How are identities managed for authorization? Can it use Kerberos, or LDAP/AD?</p> </li> <li> <p>How efficiently is data stored? Replication can (obviously) require N times as much disk space, but often offers the best performance. Erasure coding is more storage-efficient, but typically slower. Are compression and/or deduplication supported? Are there block-size or other concerns that might also lead to wasted space?</p> </li> <li> <p>Are snapshots supported? Multiple snapshots? Snapshots of snapshots? Writable snapshots (clones)? Snapshots/clones of clones? How space-efficient are they? What dependencies (e.g. LVM or ZFS) do they introduce?</p> </li> <li> <p>How does the system detect failures, and subsequently repair files for which one or more copies/fragments were lost to those failures? Is the process fully automatic, or does it require some sort of manual intervention? How does it affect ongoing performance? How is "split brain" handled? Is some sort of quorum enforced to prevent it? How is it reported? Can it be repaired automatically? Can it be repaired manually?</p> </li> <li> <p>How does the system migrate data when nodes are added or removed? Again, is this automatic or does it require manual intervention? How does it affect ongoing performance? How well do the rebalancing algorithms work to ensure that a minimal amount of data is moved?
How flexible are those algorithms, or are there multiple algorithms that the user can choose from?</p> </li> </ul> <p>That's a lot, isn't it? And we haven't even gotten to performance yet. Everybody wants to skip ahead to performance. There's a saying in the military that amateurs talk about strategy (without actually understanding even that) while professionals worry about logistics. In storage, amateurs talk about performance (without actually understanding even that) while professionals worry about things like robustness and security and operational simplicity - all of the stuff above. Still, performance does matter, so there's a whole separate set of questions to ask about any performance claims people make.</p> <ul> <li> <p>Is <em>anything</em> about the configuration, workload, or testing methodology described? Claims like "24x better response time" or "7x better IOPS" are <strong>utterly worthless</strong> without any of that information. If I try hard enough, I can find situations where GlusterFS is 7x faster than HDFS, others where HDFS is 7x faster than Ceph, and still others where Ceph is 7x faster than GlusterFS. If you see nothing but one or two headline numbers, you might as well ignore those.</p> </li> <li> <p>How many servers were involved? How many clients? What kind of network between them? I'm not against micro-benchmarks involving small numbers of clients and servers - I've run a few and posted about the results myself - but it's important to understand that they can only measure one aspect of the system. Even when it's a very important aspect, and very likely to be reflected in a broader kind of test, it's still a bit like looking at a few cells under a microscope vs. looking at the whole animal.</p> </li> <li> <p>How many <em>disks</em> were involved? What kind? What kinds of RAID controllers? What partitioning scheme or local file systems were involved?
Many distributed file systems are highly sensitive to the speed of these underlying components, either generally or in specific roles (e.g. as log/journal or metadata targets). It's very easy to configure a system in a way that's optimal for one competitor and awful for another. Look up "short stroking" and "head thrashing" to get some idea of the tricks that commercial storage vendors use both among themselves and as a bulwark against true software-defined storage.</p> </li> <li> <p>What did the workload look like? Reads or writes? Large requests or small? Sequential or random? If random, was it all blocks exactly once but in random order, or true random, or weighted somehow? How many concurrent I/O streams per process, per client, or overall? What queue depth or <em>fsync</em> interval was each thread using? The ideal here is to show the actual <em>iozone</em> or <em>fio</em> command line and/or job files, to remove any ambiguity, but any information at all is better than none.</p> </li> <li> <p>How large were the datasets, and how long were the runs? Did requests mostly go only to client caches/buffers, to the same on the server, or to actual disk? What parts of the system were really exercised? Note that the answers might be different based on different file systems' caching and other strategies. Exploiting these differences to generate misleading numbers is another favorite big-dollar-storage vendor trick.</p> </li> <li> <p>How did the performance scale along various axes? Number of clients, servers, disks? Speed of network or disks? Number of worker threads? Replication level? This is where you would normally expect to see some back-and-forth between alternatives, as their respective tradeoff spaces are explored.</p> </li> <li> <p>How did the performance vary over time? Was it steady, or glitchy? Did some clients/threads race along while others starved? Don't just look at averages over an entire run.
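For example (with entirely made-up samples), two clients can post nearly identical averages while one of them hides an ugly tail:

```python
import math

# Hypothetical per-interval latency samples (ms) for two clients.
samples = {
    "client1": [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 0.9, 1.0],
    "client2": [0.2, 0.2, 0.3, 0.2, 6.5, 0.2, 0.3, 0.2],
}

def percentile(data, pct):
    """Nearest-rank percentile: the smallest sample covering pct% of the set."""
    ordered = sorted(data)
    rank = max(1, math.ceil(len(ordered) * pct / 100.0))
    return ordered[rank - 1]

for name, lat in samples.items():
    avg = sum(lat) / len(lat)
    print(f"{name}: avg={avg:.2f}ms p99={percentile(lat, 99):.2f}ms")
```

Both clients average within a few percent of 1 ms, but client2's 99th percentile is 6.5 ms - more than five times client1's. An application that waits on the slowest request sees the tail, not the average.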
Look at IOPS and 99th-percentile latencies for each client and for each interval within a run. Your application's performance might be bound by the worst sample in that entire set, so make sure you know what that worst case is and how often it's approached.</p> </li> </ul> <p>Phew. There's a lot more, of course, but if you can get the answers to these questions you should have a pretty good handle on whether the thing in front of you is a serious distributed file system or just someone's research project that they expect you to underwrite. If you can compare the answers for two or more distributed file systems, you should have a good idea which one will really suit your needs. I'm sure I forgot something. Please let me know if you find out what it was.</p>Tue, 26 May 2015 13:42:00,2015-05-26:2015-05-dfs-sbfaq.htmlglusterfsstorageperformanceStop Calling It Neutrality<p>Usually, when anyone in government tries to do anything about issues of equality or fairness, the techie-libertarian reaction is to complain about "legislating equal outcomes" and invoke the spectre of <a href="">Harrison Bergeron</a> as proof. (Hint: it's fiction!) For some reason, "neutrality" doesn't get the same reaction even though it's a strongly related concept. Thus, when the FCC announced new network-neutrality rules, the reaction was mostly positive. I won't say that self interest and/or pure hatred of the last-mile oligopolists have caused principle to be abandoned, but they certainly have caused that principle to be modified or attenuated.</p> <p>Let's get one thing straight right at (or at least near) the start: I'm not opposed to neutrality in the common etymologically and historically based meaning of that word. What I'm opposed to is "neutrality" as a label for policies that don't result in more actual neutrality, and often don't even seem intended to have that effect. 
That's a bit too much Orwellian doublespeak for me.</p> <p>Past network-neutrality rules or proposals have often seemed broken to me because of the specific ways that they would have affected network operators. Those rules, based in technical ignorance (and some exploitation of that ignorance by interested parties), would have ruled out legitimate network-management practices and either broken things or pushed them in a direction contrary to actual neutrality. This time around, the problem is not so much that the rules are broken as that they're incomplete. <a href="">Kieren McCarthy</a> puts it pretty well.</p> <blockquote> <p>There's been no Damascene conversion; the FCC hasn't suddenly discovered it must fight for the people's rights: it's simply realized that it's time to serve new masters.</p> </blockquote> <p>You are not the intended beneficiary of these rules. If you benefit at all, that's an accident. What you are is (still) a commodity. Comcast, Time Warner, and Verizon rather predictably tried to structure things so that your internet addiction benefited themselves most of all. They went a little too far. The Facebook, Google, and other rival families decided they deserved a bigger cut of that action, and they used the mechanisms of regulatory capture to get it. What a big win for the rest of us.</p> <p>What happens when we stop making this "neutrality" concept so strangely specific in who it affects and how? Will Google still be our champion when someone decides that search results (and accompanying ads) should be governed by "neutrality" rules instead of Google's own? I don't like the SEO crowd any more than you do, but if you want to be consistent about this "neutrality" idea that's a logically necessary outcome. What happens when Google and Amazon have to be "neutral" about how they handle different phones and tablets, instead of limiting features however they want? 
I recently found out that Amazon Instant Video won't play on my tablet, even though it will on Amazon's own Fire tablets. It's not a technical issue; it's a clear violation of "neutrality" for their own benefit. Regulatory burdens always seem lightest when they fall on others (especially those we dislike), don't they?</p> <p>I could come up with a dozen more examples easily. Maybe such a broad application of the "neutrality" principle would actually be a good thing. I actually think that might be the case, but that's a topic for a more philosophical kind of post. The only point I'm trying to make here is that we shouldn't let Google's and others' selective and self-serving definition of "neutrality" cloud our thinking about these issues. <strong>This isn't really neutrality we're talking about</strong>. It's regulation of one group, in one way, that might or might not bring us closer to true neutrality.</p> <p>We've taken one step, which I believe to be in the right direction. Just don't assume that those pretty-sounding words mean the next step will also be in the right direction. Having a common enemy is not the same as having a friend. The "neutrality" propaganda is still propaganda, and a habit of accepting such manipulation (even in a good cause) has its own ill effects. Don't let the PR teams, who have surely slaved away night and day to build an effective <em>Neutrality&trade;</em> brand, get away with it.</p>Fri, 13 Mar 2015 08:48:00,2015-03-13:2015-03-stop-calling-it-net-neutrality.htmlinternetpoliticsContent From hekafs.org<p>Executive summary: all of that stuff's <a href="">over here</a> now. If you have links to it, just change "" to "" and almost everything should work.</p> <p>A while ago, I got a notice that the domain was about to expire. Even though domains don't cost much, I didn't feel particularly thrilled about continuing to bear that cost in perpetuity for a site that has been inactive for years and seems likely to remain so.
Knowing that there's some content people might still find useful, I tried to find out if anyone at Red Hat could take it over. After all, we sponsor hundreds of similar projects, the same issue must have come up for some of them, surely there must be a well-established procedure for this. Right? Well, if there is, I couldn't find it. Everyone seemed to think this was Somebody Else's Problem. Sometimes I forget that, despite the many ways it's unique, Red Hat is still a large company with many of the usual large-company dysfunctions. <em>[sigh]</em> I let the domain expire.</p> <p>I guess I didn't realize just how many people, from developers at Red Hat to complete strangers, rely on that content. I get email about the broken links, especially the "Translator 101" series, multiple times per week. For every person who sends email, there are probably two more who didn't bother. Unfortunately, I couldn't renew the domain if I wanted to. First it was in some sort of "timeout" period when it couldn't be re-registered (even by me as its prior owner) and then some domain-parking doofus snapped it up to serve ads.</p> <p>For a while now, all of the content has actually been available under (see the executive summary). Today I went through and fixed up all of the hyperlinks, image links, and everything else I could think of so that the site almost works normally. The only thing I know of that doesn't work is the search box, because that relies on PHP and all of the content is actually static now. However, having links both in this post and on my archives page should allow Google to see all that stuff again, so in a while people will be able to search that way (as most of them probably do already).</p>Fri, 06 Mar 2015 10:26:00,2015-03-06:2015-03-content-from-hekafs-org.htmlglusterfsLife on the Server Side<p>Of all the projects I've proposed or worked on for GlusterFS, New Style Replication (NSR) is one of the most ambitious.
It has two major goals:</p> <ul> <li> <p>Improved handling of network partitions</p> </li> <li> <p>Improved performance, both normally and during repair</p> </li> </ul> <p>Personally, I consider the improved partition handling to be the more important problem. NSR's predecessor AFR continues to be plagued by split-brain problems, with new reports on the mailing list almost weekly, despite many claims over the years that this tweak or that tweak will make those problems go away forever. Most other people seem to care more about performance. Fortunately, the two goals do not conflict. Often, as we shall see in a moment, the same techniques that are good for one are also good for the other.</p> <p>NSR is deliberately designed to be like many other (especially database) replication systems that are known to work pretty well, which means it's very <em>unlike</em> AFR in two particular ways.</p> <ul> <li> <p>Replication is driven by a leader (server), which in our case is temporarily elected to the role, instead of directly from clients.</p> </li> <li> <p>Change detection and replica repair are done using a log, instead of per-file markings augmented by an index of recently changed files.</p> </li> </ul> <p>To illustrate the first difference, here are some slides from my <a href="">2014 Red Hat Summit talk</a>.</p> <p><img alt="" src="" /> <img alt="" src="" /></p> <p>This is where we start to see how our two goals are compatible. Even though the main reason to use chain/leader based replication is to gain better control over what happens during a network partition, it also turns out to be good for performance. For one thing, it reduces coordination overhead to that needed for leader election and failure detection. That's tiny, compared to the coordination that has to happen <em>for every write</em> in the fan-out approach. 
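To see the scale of that difference, here's a toy calculation (the costs are invented, purely for illustration):

```python
# Invented coordination costs, in network round trips - illustrative only.
COORD_RTTS_PER_WRITE = 2      # fan-out: clients coordinate on every write
ELECTION_RTTS = 10            # chain: one-time cost to elect a leader
WRITES_PER_TERM = 1_000_000   # writes served during one leadership term

fan_out_cost = COORD_RTTS_PER_WRITE           # paid on every single write
chain_cost = ELECTION_RTTS / WRITES_PER_TERM  # amortized across the term

print(f"fan-out: {fan_out_cost} coordination RTTs per write")
print(f"chain:   {chain_cost:.5f} coordination RTTs per write")
```

The specific numbers don't matter; as long as a leader serves many writes per term, the per-write share of the election cost approaches zero.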
Basically, we amortize that overhead over a gigantic number of writes, so the per-write cost is next to nothing.</p> <p>Also, with fan-out replication, each client write has to be transmitted directly from the client to two (or more) replicas. If the client only has a single interface, as is typically the case, its effective outbound bandwidth is thus divided by two (or more). With chain replication, the client and the leader can each use their full outbound bandwidth simultaneously. That's good even if they're on the same network, better if they're on separate networks, and best of all if the back-end server network is faster than the front-end client network. To put it another way, for real networks configured the way real people do it, NSR's chain replication makes much better use of the available resources.</p> <p>Of course, there is a downside. Or maybe there isn't. As you can see from the slide on the right, the chain method involves two network hops before a write can be acknowledged - client to leader, then leader to follower. This incurs some extra latency, but we'll see shortly that it might not actually be a problem. The only real case where chain replication "loses" is when many clients gang up on a single leader, and the leader's outbound bandwidth becomes more of a bottleneck than the clients' aggregate outbound bandwidth would have been. It can definitely happen. On the other hand, even in setups where clients and servers are similarly equipped, ganging up isn't as easy as you'd think. Some HPC workloads would be susceptible to this effect, but other than that it's far more likely - especially across a volume with many leaders for different replica sets - that only a few clients will be banging on a single leader at any given moment. Again, the way people really use these systems trumps the theoretical possibility of an opposite result.</p> <p>So, where am I going with all of this? 
I've always thought of these two architectural directions - chain replication and logging - as part of one thing, but they're actually quite separable. I've also encountered massive "not invented here" resistance to NSR throughout its lifetime. Recently, I started to worry about how this resistance would affect my ability to see NSR through an adequate testing phase. Sure, I can write it, but if most of the resources I need for testing are continually diverted to AFR then I might never be able to test it properly. Then I hit on the idea of taking one of NSR's core ideas (and the associated code) and combining it with AFR's change-recording and repair mechanisms instead. That way, we can get much more mileage on some parts while we finish the others. Thus, server-side AFR was born.</p> <p>Long-time GlusterFS users are probably thinking that server-side AFR is an old idea, and they're right. Back in the 2.x days, this was a very common way to deploy GlusterFS. Then along came 3.0, and the acquisition, and server-side AFR became deprecated. Well, it's back. During the short time that we had multiple people working on NSR, all of the infrastructure was developed to elect a leader, have clients use it, fail over when necessary, and manage the resulting I/O flow. That infrastructure is just as applicable to AFR. All we need to do is load AFR on the server side, talking to one local replica through a normal server-side set of translators and to the other as a sort of network proxy from the real client. How well does it work? Let's see.</p> <p>Contrary to the usual habit among most of my colleagues, I like to run a lot of my tests in the cloud. For one thing, it's easier for me to find SSD-equipped instances that way, and I really don't want to have massive disk latencies in the way when I'm trying to measure the effect of a <em>network</em> data-flow change.
More importantly, this makes my results more reproducible than if I used some internal setup specifically purchased and configured to run this code. So I hopped on Rackspace, and on Digital Ocean, and I hacked some volfiles, and I started running some tests. I used 32GB 20- or 24-core machines for each of two servers, and up to four clients, each about a quarter that size. Clients were each doing 32 threads' worth of I/O, using fio.</p> <p>Networking turned out to be an interesting issue. On Rackspace, each instance type is defined to have a certain amount of network bandwidth, and that does seem to be enforced. Oddly, as far as I can tell, throttling is enforced at the flow (rather than host) level. Unfortunately, that sort of negates the very difference in network-interface use that I was trying to explore. Instead of having to split its bandwidth, a traditional AFR client just gets twice as much. I still saw some differences, but I think the Digital Ocean results were more informative because DO doesn't throttle traffic the same way RS does. That's more like a real network would behave, so it increases applicability there as well.</p> <p>I could go on and on about the hardware and software configuration, but instead let's just go straight to the juicy bits. Here's a graph.</p> <p><img alt="" src="" /></p> <p>Pretty dramatic, huh? While client-side fan-out (traditional) AFR topped out and even started to decline rather early, server-side chain-replicating AFR continued to climb. At a mere four active clients, server-side was ahead by more than 50%. It's also worth noting that the difference would be greater if I took the time to eliminate various locking and other stuff that's no longer necessary when writes go through a leader. That leader only needs to coordinate regular writes with those from self-heal (repair), and only when it knows self-heal is running. It never has to coordinate with writes from another simultaneous leader.
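The single-leader point can be modeled in a few lines (a toy sketch, nothing to do with the actual AFR code):

```python
import queue
import threading

# Toy model of leader-serialized writes: many clients enqueue, one leader
# thread applies operations in arrival order, then would forward them to
# followers. No write-vs-write locking between clients is ever needed.
applied = []
inbox = queue.Queue()

def leader():
    while True:
        op = inbox.get()
        if op is None:      # shutdown sentinel
            break
        applied.append(op)  # apply locally; forwarding to followers goes here

t = threading.Thread(target=leader)
t.start()
for i in range(5):          # five client writes arrive
    inbox.put(("write", i))
inbox.put(None)
t.join()
print(applied)
```

Because every write funnels through one queue, the ordering that fan-out clients must negotiate with locks falls out for free.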
That's actually a <em>lot</em> of overhead and code complexity we'll be able to avoid some day.</p> <p>We already knew that chain replication was likely to win the bandwidth contest. We also expected it to lose the latency contest. Did it? Well, sort of.</p> <p><img alt="" src="" /> <img alt="" src="" /></p> <p>Median latency was only 0-8% worse. 99th-percentile latency was 7-28% worse. So yes, latency was negatively affected. How much would that actually matter? Some people care about raw latency but more people care about IOPS, which brings us to the most interesting graph of the bunch.</p> <p><img alt="" src="" /></p> <p>Slightly worse, then pretty close to even, then 85% better. I probably should have added another client or two to see if the trend continued, but it was getting late and I was spending my own money. Also, I considered the point adequately made already. Chain replication already won the bandwidth contest. Given the latency characteristics of the two approaches, it would have been neither surprising nor fatal for it to lose the IOPS contest by a modest amount. Pulling ahead even on such a limited test seems sufficient to show that the <em>overall</em> tradeoff is still positive.</p> <p>As it turns out, the two approaches are not mutually exclusive. It should be easy enough to switch between them as a volume-level option. (Even doing it at a replica-set level is possible, but there we have a UI problem because there's currently no way to address a single replica set within a volume.) If a configuration or workload works better with the fan-out approach, users would still be able to do that. For most, switching to server-side chain replication is likely to yield a much-needed boost in both robustness and performance.</p> <p>The net result here is that users will be able to try a very different approach to replication well before the rest of GlusterFS 4.0 is ready.
They should see some benefits, and we (the developers) should be able to learn from their experience. Maybe we'll even be able to free up some people from fixing AFR bugs, and get them to work on the remaining parts of NSR - or of 4.0 more generally - instead. Everybody wins.</p>Wed, 04 Mar 2015 18:28:00,2015-03-04:2015-03-life-on-the-server-side.htmlglusterfsstorageperformanceHow Erasure Coding is Not Like Replication<p>Many people think of erasure coding as equivalent to replication, but with better storage utilization. Want to store 100TB of data with two-failure survivability? With replication you'll need 300TB of physical storage; with 8+2 erasure coding you'll need only 125TB. Yeah, sure, there's a performance downside, but from a user's perspective that's the only other difference.</p> <p>From a developer's perspective, there's another subtle but very important difference related to the atomicity of writes. In a replicated system, after a write every replica will contain valid data where the write occurred (assuming a reasonable implementation). Ideally, what's there will be the just-written data, but in failure scenarios that might not immediately be the case on some replica(s). Instead you might have stale data, but at least it was something the user wrote at some time. Even if you're down to one replica, at least you can retrieve data that was valid at some time. If the most recent operations can be replayed from somewhere else, you still have a proper base from which to do that.</p> <p>Erasure-coded systems aren't necessarily like that. Systematic erasure codes are, but non-systematic codes aren't. If you're doing the aforementioned 8+2 with a non-systematic code and exactly half of your nodes manage to complete the write before lightning strikes the data center, what do you have? <strong>Garbage.</strong> You have insufficient state to reconstruct either the new or old versions of that data. 
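A little counting makes the failure concrete. This sketch is not a real codec; the key property it models is that with a non-systematic MDS code, fragments from different versions can't be mixed - you need k same-version fragments:

```python
def recoverable(fragment_count, k=8):
    """With an MDS erasure code, any k same-version fragments suffice."""
    return fragment_count >= k

n, k = 10, 8
new_fragments = 5                   # exactly half the nodes took the write
old_fragments = n - new_fragments   # the rest still hold old-version fragments

print("new data recoverable:", recoverable(new_fragments, k))  # False
print("old data recoverable:", recoverable(old_fragments, k))  # False
```

Five fragments of the new version plus five of the old add up to ten pieces of storage and zero recoverable versions.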
While it's OK to say that the write wouldn't have been reported as successful in that case, it's not OK for it to have destroyed what was there previously in the course of what should have ended up as a no-op.</p> <p>Because of this, an erasure-coded system must take extra steps that wouldn't be necessary in a replicated system. With replication, each replica can accept the write and then <em>by itself</em> ensure that old data is still available until the new data is fully written. With erasure coding, this atomicity guarantee must be maintained globally - <em>no</em> node may overwrite any old data until <em>all</em> nodes have the new data. You can meet this coordination need with 2PC, with Paxos or Raft, or with MVCC, but you can't just punt. Whatever approach you choose, it adds a significant piece of complexity that replicated systems can omit (and usually do for the sake of performance).</p> <p>I'm not saying erasure codes - or even non-systematic erasure codes - are bad. The space advantages are still there. The very "need <em>k</em> out of <em>n</em> fragments" property that makes this extra complexity necessary can be combined with a dash of cryptography to create storage with some very desirable security features. (<a href="">AONT-RS</a> is a very good starting point for understanding how.) I love erasure codes. This is just a tip for implementors of erasure-coded systems - or perhaps even more for <em>would-be</em> implementors - so that they can plan and prepare appropriately.</p>Fri, 13 Feb 2015 14:50:00,2015-02-13:2015-02-ec-vs-replication.htmlstorageNotes on File System Semantics<p>Just some random thoughts from an email I sent recently, plus a bonus SCSI war story.</p> <blockquote> <p>As the PVFS folks said long before I came along, some POSIX requirements are inappropriate for a distributed file system.
I agree with that, but not with the object-store folks who claim that the <em>entire</em> hierarchical byte-addressable file system model is obsolete. I think most of that model is still valuable for compatibility with the thousands of applications that are out there. Only a few over-specified behaviors which few will miss (often few even know about them) and which are obviously problematic in a distributed system need to be retired.</p> </blockquote> <p>...and...</p> <blockquote> <p>users can't reason well about consistency guarantees that are conditional on the availability of specific servers. "After a write, readers will see X" is easy to reason about. Adding "...or reads will fail if certain system-wide conditions are met" doesn't make it much worse. Adding "...or they might see Y if some otherwise-invisible event intervenes" kind of leaves them hanging. If writes can disappear, other than in the event of a system-wide failure, then I'd say you effectively have no guarantees at all <em>and that's OK</em>. One of the hard-won lessons from working in this field for a long time is that it's better to make few and simple promises (which you can be sure of keeping) than get dragged into long discussions of what was or was not promised under what conditions. That's not a good place to be in when users' data is at stake.</p> </blockquote> <p>The last point is the most important IMO. I first ran into this back in '94, when I was working on one of the earlier multi-pathing SCSI drivers (REACT for the IBM 7135). My code would try <em>really hard</em> to maintain or re-establish contact with a volume, despite any combination of failures. While I was working in England with the people who actually built the hardware, we discovered one case where this persistence meant we'd flap around for five minutes or so, repeatedly switching between controllers before we had finally observed and cleared enough error conditions to continue normally. 
I thought it was awesome that we were able to recover. One of the older engineers was unimpressed. To him, those five minutes of unpredictable behavior negated any subsequent success. He argued that it would be better to try both controllers, then simply fail. His view prevailed, and in retrospect I think rightly so. Sometimes, "weak promises strongly kept" is better than the alternative, especially when there's a higher layer that can build on that to provide its own guarantees.</p> <p>BTW, the test involved here was the infamous "pen in a fan" which was amusing in its own way. The board had three signal lines to report faults, but more than three faults to report. Therefore, the lines were multiplexed. Sticking a pen in a fan would cause the board to signal fault 0x7 (all three lines asserted). However, the person who wrote the board firmware didn't read the hardware spec properly, and out in SCSI-land this would be reported as three separate faults - 0x4, 0x2, and 0x1. This is what caused us to keep going back and forth so much, clearing one pseudo-fault each time instead of all at once. Now that the wounds have healed, I can look back and laugh. At the time I was not so amused.</p>Fri, 09 Jan 2015 09:41:00,2015-01-09:2015-01-fs-semantics.htmlstorageTechnical Debt vs. Technical Risk<p>One of the most useful metaphors in software engineering is Ward Cunningham's <a href="">technical debt</a>. Definitions and interpretations vary, but technical debt is basically all the stuff you're going to fix later because you were in too much of a hurry to do it right the first time. We all know what it's like to be up against a release deadline, or in the middle of a bug firefight, and we find something that works and we allow it to pass even though we know it's not really right. 
Some common types of technical debt might include:</p> <ul> <li> <p>Using a "private" member of a data structure instead of adding the proper API.</p> </li> <li> <p>Layering violations and circular dependencies.</p> </li> <li> <p>Copying and pasting code instead of creating a more general common function.</p> </li> <li> <p>Adding "garbage" function arguments that change behavior to suit a new use (often to avoid the previous error but really just as bad).</p> </li> <li> <p>Checking the same condition ten different places instead of refactoring to use subclasses, dispatch tables, or any of several other cleaner techniques.</p> </li> </ul> <p>The important thing about the debt metaphor is the idea that it's OK to have some debt as long as it's tracked and kept under control. Sure, it's different when developers add technical debt just because they're lazy twits, but that fits in the metaphor too - akin to people who can't stop abusing their credit cards to buy junk they don't need and can't really afford. The debt metaphor really helps both developers and business types think about the future consequences of their short-term decisions.</p> <p>But ... is it really debt?</p> <p>The thing about debt, in the real world, is that it's a known quantity. You know how much debt you have, you know the rate at which it's increasing, and you know what it will take to pay it back. It might be more than you can bear, if you've been careless, but you know. Technical debt usually isn't like that. The problem with all of these shortcuts is not usually a steady and predictable drag on your resources. A project laden with technical debt might get away with that for a very long time, but that becomes less likely as the cruft accumulates. 
Every ugly shortcut increases the chance that the codebase will run into a <em>sudden</em> and <em>catastrophic</em> failure - a severe and hard-to-fix bug, a missed deadline, or even a simple failure to remain competitive because changing the code safely has become as difficult as crossing a minefield.</p> <p>That's not debt. That's <strong>risk</strong>. As in finance, technical risk can be measured and reasoned about, leading to sensible tradeoffs even though there's uncertainty involved. As in finance, avoiding risk altogether is impossible and trying too hard will mean missed opportunities. Some people will manage risk well, and reap rewards from doing so. Others will manage risk poorly, and will fail - not slowly or quietly as with debt, but often quite suddenly and spectacularly.</p> <p>I'm not saying that the risk metaphor should displace the debt metaphor. They both have their value. However, in my experience, what gets called technical debt is really technical risk more often than not. The important lesson here is to keep them separate. The next time you hear someone talk about technical debt, or are tempted to make a point that way yourself, it might be helpful to think about whether the conversation should really be about technical risk. In particular, the next time somebody says that refactoring some overburdened but critical piece of code is too risky, it might be worth pointing out that <em>failure</em> to refactor carries its own risk. Some people will always pick debt over risk. Framing the discussion as risk vs. risk might be more effective than letting it seem like risk vs. debt.</p>Mon, 05 Jan 2015 17:02:00,2015-01-05:2015-01-technical-risk.htmldesignworkingWhy "DSO" is an Awful Term<p>A recent discussion on the GlusterFS development mailing list got a bit hung up on the issue of what is or is not a "DSO" (Dynamically Shared Object). 
This is one of many issues with dynamic linking and dynamic loading that I've seen cause problems before, in large part because they're <strong>two different things</strong> that people often mix up. I'll try to explain how this fact leads to confusion, and suggest how to avoid that confusion.</p> <p>For the sake of this discussion, let's separate the two kinds of things that "DSO" might refer to. We'll use "library" to mean something that is specified when linking an executable, and is therefore reflected in that executable's on-disk contents. By contrast, a "module" is not specified when linking and not reflected in the on-disk executable; one must use <em>dlopen</em> from within the program to get at it. Despite their differences, both of these are dynamically <strong>linked</strong>. In both cases, the executable lacks a complete symbol table for the shared object (in the module case it lacks any symbol table at all). The library or module's symbols will be resolved when it is loaded. In fact, this late resolution is essential to make any kind of shared object work, on any platform, so the "D" in "DSO" is kind of redundant.</p> <p>The difference between libraries and modules is that modules are also dynamically <strong>loaded</strong> whereas libraries are not. Libraries are implicitly loaded into a process's memory space before the process starts (i.e. before <em>main</em> is called). Modules are explicitly loaded only when <em>dlopen</em> is called. Either way, loading includes mapping library/module contents into a process's memory. In the dynamic-linking case it also includes resolving symbols, but it is actually possible to do dynamic loading without dynamic linking (see my <a href="quora">Quora answer</a> on this topic for more details) so this is not essential.</p> <p>Where did all of this go wrong? Apparently it's Apple's fault.
In their infinite arrogance, and contrary to every other UNIX platform, they decided that the same shared object could not be used as both module and library. It had to be one or the other. While precluding dual use without reason is generally a bad decision technically, Apple then made it worse by using "DSO" to mean only modules and not libraries. Is the "D" what really distinguishes an Apple DSO from an Apple non-DSO? Nope. That didn't stop them, and it didn't stop the libtool folks either. They never saw a stupid idea they didn't like, so they mindlessly copied Apple's bad terminology (including the "module" flag). This has led to much confusion since, including that which inspired this post.</p> <p>So, if "DSO" doesn't work, what would? Surprisingly, it's not the "D" but the "S" that must go. Everything I've said so far about dynamic linking and loading would apply even if the objects in question are not shared. What we're really talking about here is two kinds of dynamically linked objects. On every platform but Apple's, the loading issue doesn't matter, so "DLO" would be sufficient to distinguish these from statically linked libraries. However, we've seen that Apple's choices and terminology do infect others. Where the loading distinction does matter, it's between implicit (or immediate) loading vs. explicit loading. That would lead us to the rather unwieldy IDLO and EDLO. Alternatively, we could embrace the "library" vs. "module" distinction, resulting in DLL and DLM. Yes, <a href="dll">DLL</a>. Microsoft pretty much got this one right, folks. It's a technically accurate term, which would also be common across the Windows and UNIX/Linux platforms, so how is that a bad thing?</p> <p><em>Sigh</em>. But we programmers aren't so rational, as a group. Apple's not going to change. Libtool won't either. They'll both continue to use "DSO" inaccurately and misleadingly. At least now maybe the term will raise a red flag, and people will know to ask for clarification.
When someone says "DSO" ask them whether they mean all things that are dynamic and shared and objects, or just some arbitrary Apple-defined subset.</p>Fri, 12 Dec 2014 10:39:00,2014-12-12:2014-12-dso-terminology.htmloperating-systems"Scale Out" Applies to Interfaces, Too<p>Because of what I do for $dayjob, I hear a lot about "scale out" vs. "scale up" in various contexts. Also because of what I do for $dayjob, I get to read a lot of code. Some of it's new and clean. Some of it's . . . not. That's only partly a reflection on the skill of the programmers involved. Part of it is just the fact that all code tends to accumulate technical debt over time. Layering violations, "privacy" violations, and mutual dependencies all chip away at modularity. Short parameter lists turn into long ones, reflecting every new feature added since the code was properly refactored. (Really, when was the last time you saw a parameter list get <em>shorter</em>?) Types, fields, and flags proliferate. Cats and dogs start living together. It's chaos, I tell you!</p> <p>Something similar also tends to happen with public APIs. They start simply enough, then they grow and grow and grow. Something like this, if I may mix my movie metaphors.</p> <div style="text-align: center"> <img src="" alt="Audrey" /> </div> <p>As it turns out, there are two ways that an interface can increase in complexity. Yep, you guessed it: scale up or scale out. A "scale up" interface is one that gets <strong>monolithically</strong> bigger - you can't use any part of it without having to deal with significant complexity. Doing even the simplest thing requires several calls. OpenSSL provides a great example: set up a method table, create three types of objects, tie two of those together, set up cipher lists and certificate chains, and more, all before you can even start to do regular socket stuff (which is non-trivial already). 
It's tedious, it's error-prone, and just about everybody who has to use OpenSSL ends up wrapping all of that crap into their own function or object with a much simpler interface. (BTW, the code that inspired this post had nothing to do with OpenSSL.)</p> <p>By contrast, a "scale out" interface is one that gets bigger in a <em>modular</em> way. Maybe it just has a lot of functions, but using any one of those is simple and straightforward. In some cases, those functions might be grouped according to the objects they operate upon or the functionality they provide, but if you don't use a particular subset then you don't have to set up for it. <em>Defaults</em> are applied intelligently, so that simple calls yield obvious results but more sophisticated usage is also possible. Secondary objects are <em>automatically created</em> using defaults, so the user has to go through fewer steps. <em>Hooks</em> and <em>callbacks</em> are provided to customize behavior further, but remain entirely optional. In all of these cases, the goal is either to reduce the knowledge needed by basic users, or reduce the number of users who need non-basic knowledge. In other words, you want to minimize the area under this curve.</p> <div style="text-align: center"> <img src="" alt="usage type vs. knowledge" /> </div> <p>A "scale out" interface can be just as complex as a "scale up" interface. It can have just as many calls, require just as much code and tests and documentation. However, it <strong>grows more gracefully</strong>. Exposing your guts to every caller, whether or not they really want to see those guts, is what creates all of that bad coupling and technical debt. If a caller never had to know about a particular interface element (e.g. a function) to get their job done, neither you nor they will have to worry about compatibility when it changes. That reduces complexity and breakage on both sides. 
There's also less need (or temptation) to "reach in" and muck with stuff that is (or should be) internal, so the level of debt-inducing inflexibility is further reduced. Defining a scale-out interface might be a bit more difficult, but it pays off in the long run.</p>Wed, 03 Dec 2014 16:40:00,2014-12-03:2014-12-scale-out-interfaces.htmldesignThoughts on Running<p>(...and now for something completely different.)</p> <p>Back in July, I started running. That would not be a particularly notable statement for many people, but most people haven't detested running all their lives and avoided it for thirty years. Instead, I've used stairclimbers and ellipticals for many years, but I've grown to hate my elliptical even more than I hated running. (It's a Livestrong 10.0E which always required constant tweaking to keep it from clanking intolerably, started showing rust after only six months, and is now approaching its second flywheel replacement. Never <em>ever</em> buy anything made by Johnson, regardless of which brand it says it is.) I didn't feel like using that, I didn't feel like driving to work or a club multiple times a week, but I needed to do something. More as an experiment than anything else, I forced myself to try running again.</p> <p><img alt="image" src="" /></p> <p>(image from</p> <p>It turns out that the reason I hated running is that I was doing it wrong. Yeah, I know that sounds crazy. How can a supposedly-smart person fail at something so basic as running? Well, the problem is that a "traditional" heavy-heel-strike running style just doesn't suit me. Maybe it works for a lot of other people - I still see most runners on the road using that style - but it always makes me feel like I'm knocking the breath out of myself with every step. When I first started out on that July day, running was just as unpleasant as I had remembered. 
However, I had read a lot about barefoot running and landing one's weight more on the front or middle of the foot, so I decided to give that a try. That was just <strong>so</strong> much better - not exactly fun, I guess, but not particularly unpleasant either. So I stuck with it.</p> <p>When I first started, I could run 2.5 miles in about 28 minutes - probably about ten minutes per mile while actually running, plus plenty of walking breaks. Three months later, I'm at about 21:30 (8:36 per mile). Better still, I can maintain that exact same pace even at five miles. My time at the midpoint of that route - the top of a 5% grade - is exactly the same as my time for the shorter route. No, I don't understand that either*. My goal is to do at least one unofficial 10K before Thanksgiving. My stretch goal is to do it in under 50 minutes. It's good to have goals.</p> <p>So, am I "one of those runners"? According to some definitions, which separate running from jogging at 10:00 per mile, yes. I certainly don't feel like I'm jogging. If I had to stop suddenly, my momentum - except on steeper climbs - would carry me forward more than one step. That seems like an interesting cutoff. I've run six days out of seven, and twelve out of fifteen. I've run in the rain, and I plan to run in the snow at least some of the time this winter (probably on the Minuteman bike trail because road running in winter seems pretty scary). I've also lost five pounds and my resting heart rate has gone from 60 to 54. I think about running, talk about running, and now I blog about it too. So yeah, I guess I'm a runner.</p> <p>Being a competitive guy, I also wonder whether I'm a <em>good</em> runner. I certainly don't feel like I am yet. Five miles in 43:00 (my best result so far) doesn't seem all that impressive, even if there is an annoyingly steep climb in the middle and a slight uphill for the entire last mile.
I probably do fall below the "jogging threshold" sometimes, and my training focus right now is on maintaining good pace along the entire course. On the other hand, I've checked last year's results from races in Lexington and Andover. According to those, I'd consistently place about a third of the way down. Of the people I see <em>on the road</em>, most of whom do not enter races, I'd say only half that many seem to be going faster than I am (I don't seem to run the same routes/times as others enough to make a more direct comparison). That doesn't make me feel like any kind of a champion, but - more importantly - it keeps me from feeling like so much of a slouch that I get discouraged. I feel competent, and I know I can get better, so that keeps me going.</p> <p><img alt="image" src="" /></p> <p>The other thing that keeps me going is the people who have encouraged me and given me advice. Hank, Mike, Patrick, Allison, Shari, Nick, David - you all rock. You guys at Greater Boston Running Company, who helped me find the right shoes when I had ankle problems, rock too. I feel fitter now than I have in a long time, perhaps ever, and I couldn't have done it alone. Who would have thought that something so "obviously" solitary as running could be so social?</p> <p>* UPDATE: ...and it's actually not true any more. I actually wrote this a couple of days ago. Today, with the extra incentive of getting in before a break in the rain ended, I managed 21:06. 
I guess the difference in my pace from week to week outweighs the difference from course to course.</p>Tue, 21 Oct 2014 12:16:00,2014-10-21:2014-10-running-thoughts.htmlrunningDistributed Systems Prayer<p>Forgive me, Lord, for I have sinned.</p> <ul> <li> <p>I have written distributed systems in languages prone to race conditions and memory leaks.</p> </li> <li> <p>I have failed to use model checking when I should have.</p> </li> <li> <p>I have failed to use static analysis when I should have.</p> </li> <li> <p>I have failed to write tests that simulate failures properly.</p> </li> <li> <p>I have tested on too few nodes or threads to get meaningful results.</p> </li> <li> <p>I have tweaked timeout values to make the tests pass.</p> </li> <li> <p>I have implemented a thread-per-connection model.</p> </li> <li> <p>I have sacrificed consistency to get better benchmark numbers.</p> </li> <li> <p>I have failed to measure 99th percentile latency.</p> </li> <li> <p>I have failed to monitor or profile my code to find out where the real bottlenecks are.</p> </li> </ul> <p>I know I am not alone in doing these things, but I alone can repent and I alone can try to do better. I pray for the guidance of Saint Leslie, Saint Nancy, and Saint Eric. 
Please, give me the strength to sin no more.</p> <p>Amen.</p>Tue, 21 Oct 2014 10:35:00,2014-10-21:2014-10-dist-sys-prayer.htmldistributedhumorTen Stages of Technology Familiarity<p>Without further ado...</p> <ol> <li> <p>Never heard of it.</p> </li> <li> <p>Yeah, I hear all the hipsters yammering about it.</p> </li> <li> <p>I checked out the docs and examples once.</p> </li> <li> <p>I used it for a side project.</p> </li> <li> <p>We're using it for some new projects at work.</p> </li> <li> <p>We're using it in production.</p> </li> <li> <p>We're using it in production, but with a bunch of other stuff wrapped around it to address its deficiencies.</p> </li> <li> <p>We forked the project and our version's way better.</p> </li> <li> <p>Yeah, we used to use it.</p> </li> <li> <p>Never heard of it.</p> </li> </ol>Wed, 10 Sep 2014 14:45:00,2014-09-10:2014-09-tech-familiarity.htmlhumorStorage Benchmarking Sins<p>I've written and talked many times about storage benchmarking. Mostly, I've focused on how to run tests and analyze results. This time, I'd like to focus on the parts that come before that - how you set up the system so that you have at least some chance of getting a fair or informative result later. To start, I'm going to separate the setup into layers.</p> <ul> <li> <p>The physical configuration of the test equipment.</p> </li> <li> <p>Base-level software configuration.</p> </li> <li> <p>Tuning and workload selection.</p> </li> </ul> <h2>Physical Configuration</h2> <p>The first point about physical configuration is that there's almost never any excuse for testing two kinds of software on different physical configurations. Sure, if you're testing the hardware that makes some sense, but even then the only comparisons that make sense are the ones that exhibit equality at some level such as number of machines or system cost (including licenses). 
Testing on different hardware is the most egregious kind of dishonest benchmarking, but it's only the first of many.</p> <p>The second point about physical configuration is that just testing on the same hardware doesn't necessarily make things fair. What if one system can transparently take advantage of RDMA or other kinds of network offload but the other can't? Is it really fair to compare on a configuration with those features, and not even mention the disparity? What if one system can use the brand-new and vendor-specific SSE9 instructions to accelerate certain operations, but the other can't? The answer's less clear, I'll admit, but a respectable benchmark report would at least note these differences instead of trying to bury them. A good rule of thumb is that it's hardware <strong>used</strong> that counts, not merely hardware <strong>present</strong>. If the two systems aren't actually using the same hardware, the benchmark's probably skewed.</p> <p>The third and last point about hardware is it's still possible to skew benchmark results even if two systems are using the same hardware. How's that? Not all programs benefit equally from the same system performance profile. What if one system made a design decision that saves memory at the expense of using more CPU cycles, and the other system made a different design decision with the opposite effect? Is it fair to test on machines that are CPU-rich but memory starved, or vice versa? Of course not. A fair comparison would be on balanced hardware, though it's obviously difficult to determine what "balance" means. This is why it's so important for people who do benchmarks to disclose and even highlight potential confounding factors. Another common trick in storage is "short stroking" by using lots of disks and testing only across a small piece of those to reduce seek times. The flash equivalent might be to test one system on clean drives and the other after those same drives have become heavily fragmented. 
These differences can be harder to identify than the other two kinds, but they can have a similar effect on the validity of results.</p> <h2>Base Software Configuration</h2> <p>For the purposes of this section, "base" effectively means anything but the software under test - notably operating-system stuff. Storage configuration is particularly important. Is it fair to compare performance of one system using RAID-6 vs. another using JBOD? Probably not. (The RAID-6 might actually be faster if it's through a cached RAID controller, but that takes us back to differences in physical configuration so it's not what we're talking about right now.) Snapshots enabled vs. snapshots disabled is another dirty trick, since there's usually some overhead involved. Many years ago, when I worked on networking rather than storage, I even saw people turning compression on and off for similar reasons.</p> <p>Other aspects of base configuration can be used to cheat as well. Tweaking virtual-memory settings can have a profound effect on performance, which will disproportionately hurt some systems. Timer frequency is another frequent target, as are block and process schedulers. In the Java world, I've seen benchmarks that do truly heinous things with GC parameters to give one system an advantage over another. As with physical configuration, base software configuration can be easily done so that it's equal but far from fair. The rule of thumb here is whether the systems have been set up in a way that an experienced system administrator might have done, either with or without having read each product's system tuning guides. If the configuration seems "exotic" or is undisclosed, somebody's probably trying to pull a fast one.</p> <h2>Tuning</h2> <p>Most of the controversy in benchmarking has to do with tuning of the actual software under test. When I and others have tested GlusterFS vs. Ceph, there have always been complaints that we didn't tune Ceph properly.
Those complaints are not entirely without merit, even though I don't feel the results were actually unfair. The core issue is that there are two ways to approach tuning for a competitive benchmark.</p> <ul> <li> <p>Measure "out of the box" (OOTB) performance, with no tuning at all. If one system has bad defaults, too bad for them.</p> </li> <li> <p>Measure "tuned to the max" performance, consulting experts on each side on how best to tweak every single parameter.</p> </li> </ul> <p>The problem is that the second approach is almost impossible to pull off in practice. Most competitive benchmarks are paid for by one side, and the other is going to be distinctly uninterested in contributing. Even in cases where the people doing the testing are independent, it's just very rare that competitors' interest and resource levels will align that closely. Therefore, I strongly favor the OOTB approach. Maybe it doesn't fully explore the <em>capabilities</em> of each system, but it's more likely to be fair and representative of what actual users would see.</p> <p>However, even pure OOTB doesn't quite cut it. What if the systems come out of the box with different default replication levels? It's clearly not fair to compare replicated vs. non-replicated, or even two-way vs. three-way, so I'd say tuning there is a good thing. On the other hand, I'd go the other way for striping. While different replication levels effectively result in using different hardware (different usable capacity), the same is not true of striping which merely uses the same hardware a little differently. That falls into the "too bad for them" category of each project being responsible for putting its own best foot forward.</p> <p>Another area where I think it's valid to depart from pure OOTB is durability. It's simply not valid or useful to compare a system which actually gets data on disk when it's supposed to vs. 
one that leaves it buffered in memory, as at least two of GlusterFS's competitors (MooseFS and HDFS) have been guilty of. You have to compare apples to apples, not apples to promises of apples maybe some day. Any deviations from pure OOTB should be looked at in terms of whether they correct confounding differences between systems or introduce/magnify those differences.</p> <h2>Conclusion</h2> <p>Benchmarking software is difficult. Benchmarking storage software is particularly difficult. Very few people get it right. Many get it wrong just because they're not aware of how their decisions affect the results. Even with no intent to deceive, it's easy to run a benchmark and only find out after the fact that what seemed like an innocent configuration choice caused the result to be unrepresentative of any real-world scenario. On the other hand, there often <strong>is</strong> intent to deceive. With some shame, I have to admit that my colleagues in storage often play a particularly dirty game. Many of them, especially at the big companies, have been deliberately learning and applying all of these dirty tricks for decades, since EMC vs. Clariion (both sides) and NetApp vs. Auspex (ditto). None of them will be the first to stop, for obvious game-theoretic reasons.</p> <p>I've tried to make this article as generic and neutral as I could, because I know that every accusation will be met with a counter-accusation. That's also part of how the dirty game is played.
However, I do invite anyone who has read this far to apply what they've learned as they evaluate <a href="">recent benchmarks</a>, and reach <strong>their own</strong> conclusions about whether those benchmarks reveal anything more than the venality of those who ran them.</p>Mon, 23 Jun 2014 09:34:00,2014-06-23:2014-06-benchmark-sins.htmldistributedstorageglusterfsvmwarebenchmarksWannabe of the Month: Skylable<p>Every month or two, someone comes along and claims to be the new Best Thing Ever in distributed file storage. More often than not, it's just another programmer who recently discovered things like consistent hashing and replication, then slapped together another HTTP object store because that's what people nowadays do instead of writing their own LISP/Forth interpreter or "make" replacement. There's nothing wrong with the exercise itself, of course. It's a great learning experience, and it's how real projects get started. For example, <a href="">LeoFS</a> might not really be "the leading DFS" as they claim, but it is certainly a serious effort that I'm watching with interest. What gets my goat is always the grandiose claims, often made in the form of comparisons between real production-level file systems like GlusterFS and things that are neither production-level nor file systems.</p> <p>This month's example is <a href="">Skylable</a>, which tried to take advantage of the publicity around yesterday's big announcement to pimp their own spare-time project. At first they just tried to position themselves as a competitor to GlusterFS and Ceph when they're clearly not. I tried, as neutrally as I could, to point out that it's not a valid comparison. They didn't take the hint. Instead, @tkojm decided to double down.
Such claims really piss me off, not because they're made against my own project but because they're disrespectful to every project in the same space. For example, <a href="">Tahoe-LAFS</a> plays in exactly this space, and they actually know what they're doing when it comes to security. Making competitive claims that are not only unaccompanied by one shred of evidence but <em>clearly false</em> to anyone with even the most cursory knowledge of the competitive landscape is outright dishonest. The Skylable folks have practically invited more serious comparisons, so I'm going to give them what they asked for and they're not going to like it. Maybe that will keep the next tyro from making the same mistake.</p> <p>Before I go on, I should mention that this post has nothing to do with Red Hat or the Gluster community. No time or equipment from either was used to test the Skylable code or write up the results. This is not big bad Red Hat picking on a smaller competitor. This is one guy (me), on his own time, trying to find the truth behind some very ambitious claims.</p> <p>Let's start with ease of use. Here are the steps to install GlusterFS, set up a two-way replicated volume, and mount it on a client.</p> <ul> <li> <p>yum install glusterfs... (or equivalent for other distros)</p> </li> <li> <p>/etc/init.d/glusterd start</p> </li> <li> <p>gluster peer probe server2 (from server1)</p> </li> <li> <p>gluster volume create myvol replica 2 server1:/path server2:/path</p> </li> <li> <p>gluster volume start myvol</p> </li> <li> <p>mount -t glusterfs server1:/myvol /wherever (from client)</p> </li> </ul> <p>What's the equivalent for Skylable? Well, you start by downloading, configuring, and building from source. Really. I don't expect such a young project to have stuff in major-distro repos yet. I wouldn't even ding them for not having their own specfiles or whatever, but they brought up ease of use and <em>requiring users to build from source is not good for ease of use</em>. 
It's even worse if you trip over their unnecessary dependency on libcurl being built with special OpenSSL support, which is not the case on RHEL/Fedora platforms. So much for the "tested on all major UNIX platform" claim.</p> <p>Once you've done who-knows-what to your system by running "make install" you're ready to begin configuring. Oh, what fun. To do this, you run "sxsetup" which will prompt you for several things and spit out user-incomprehensible output like an administrator key. Then you have to <em>log in to another node</em> to repeat the process, manually copying and pasting an admin key from one window to another. Then you have to repeat the process again to set things up for the special-purpose programs you only need because it's not a real mountable file system, only this time they call it a "user key" instead. Between the installation mess and the extra steps and the lack of real documentation, I think we can pretty clearly say...</p> <p><strong>Ease of Use: LIE</strong></p> <p>OK, so how about security, code quality, and robustness? With regard to security, they make a big deal of having both on-network and on-disk encryption, the latter using client keys. GlusterFS also has both of those, and much of the code has been vetted by Red Hat's renowned security team. Skylable's has been vetted by approximately nobody. A quick perusal of the code shows that it's all home-grown and littered with rookie mistakes. My favorite was this:</p> <div class="highlight"><pre> const char *skysalt = &quot;sky14bl3&quot;; /* salt should be 8 bytes */ </pre></div> <p>Yep, that's a constant salt embedded in the code, apparently used in lieu of a real KDF to generate the user's key from their password. Here's a hint: this process <strong>adds no entropy</strong> to the original text password, no matter how many times you apply EVP_BytesToKey.
I actually pointed this one out to them on HN, as did somebody else, and (without admitting error) they claim they'll do better next time, but it does raise an important question. How likely is it that somebody who made such an inexcusable mess of generating a user key then managed to get every other little detail right in their home-grown storage encryption? The odds are nil. I could play free security consultant here and find the next terrible flaw for them, and the next one and the next one, but I shouldn't need to and neither should anyone else. This is just not serious crypto code, so...</p> <p><strong>Security: LIE</strong></p> <p>That also tells us where we're headed for code quality. Obviously I can't just rely on "gut feel" and familiarity because I have years of experience with the GlusterFS code and none with this, so I'll try to look at objective measures. I picked several files at random to look at. I deliberately excluded those marked as third-party, but still found a lot of code copied from other codebases - libtool, the ISAAC hash, SQLite. This is not only a terrible code smell but in many cases might be a license violation as well. The data structures seem to be better documented in the Skylable code than in GlusterFS (using the Doxygen style), but otherwise there seemed to be little evidence of this vaunted code quality - even though new code written by a small tight-knit team generally <em>should</em> have a higher median quality than older code written by a much larger team.</p> <p>Error checking doesn't seem notably more consistent, and cleanup after an error often seems to have involved copying free/close/etc. functions from one code block to another instead of using any of several more robust idioms. I specifically looked at error checking around read(2) and write(2) to see if it handled partial success as well as outright failure. Generally, no. 
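<p>For reference, the kind of robust idiom I mean - a generic Python sketch, not a patch against their tree - treats a short count from write(2) as the normal case rather than ignoring it:</p>

```python
import os

def write_all(fd, data):
    # os.write, like write(2), may accept fewer bytes than requested;
    # a single unchecked call silently truncates the data.
    view = memoryview(data)
    while len(view) > 0:
        n = os.write(fd, view)
        view = view[n:]

# A pipe stands in for a real file or socket.
r, w = os.pipe()
write_all(w, b"x" * 4096)
os.close(w)

chunks = []
while True:
    buf = os.read(r, 1024)   # read(2) likewise returns partial results
    if not buf:
        break
    chunks.append(buf)
os.close(r)
assert len(b"".join(chunks)) == 4096
```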
The code uses lt__malloc for no apparent reason, but doesn't get any extra memory safety for the extra effort. Logging/tracing doesn't seem particularly strong. Skylable's own code (as opposed to what they copied) seems to use sprintf more than snprintf. I know these are all incredibly superficial observations, but code quality is an enormously complex topic. These are just the things that are easy to put into words, and they're already more than @tkojm has offered in support of his claim. They're enough to say...</p> <p><strong>Better Code Quality: QUESTIONABLE</strong></p> <p>It's pretty much the same for robustness. There's no evidence from the real world about the robustness of this code, of course. I also see even fewer tests than there are for GlusterFS, so I'd have to say that claim's <strong>QUESTIONABLE</strong> as well. Of the four claims @tkojm made, therefore, two are questionable and two are outright false, so the whole claim evaluates to false. Let's move on to the part he didn't even want to talk about: performance.</p> <p>To test performance, I used two SSD-equipped 16GB instances at Digital Ocean. It took me an hour or so to work through all of the dependency crap and get things set up before I was able to run any tests. Then the very first test I ran was a very simple 4KB file-create loop using sxcp. What was the result?</p> <div class="highlight"><pre>0.67 files per second </pre></div> <p>I'm not joking, but apparently they are. GlusterFS is often criticized for its performance on exactly this kind of workload, and I'd be the first to say rightly so, but it's still <strong>orders of magnitude</strong> better than that. Those are modem speeds, on machines that can and do perform quite well using any other software. There's just no excuse. Why even bother looking further?</p> <p>I could keep going. 
I could go into detail about how the CLI lacks built-in help, how it doesn't seem to include anything to report on node or cluster status, how there don't seem to be any provisions for basics such as rebalancing after a server is added or permanently removing one from the config after the hardware blew up. I could talk about how storing sensitive data unencrypted in a plethora of separate SQLite files is bad for security, performance, and maintainability all at once. But really . . . <strong>enough</strong>. More than enough. Not even the most fervent advocate of "Minimum Viable Product" could consider this to be past the prototype phase.</p> <p>Let's try to give the Skylable folks as much benefit of the doubt as we can here. Maybe a few people decided that they'd had enough of some other technical area, and settled on distributed object storage as their next challenge. So they started tinkering around with Skylable SX as a platform for learning and experimentation. I think that's awesome. I want to encourage that kind of thing. Sure, I might have suggested a little more reading and studying how existing systems do the same sorts of things, before diving into code (especially that awful home-grown crypto), but I'd still try to be supportive. The problems only start when someone decides to start chasing money instead of technology. After all, this is a hot area, but new enough that many users don't know how to tell the serious players from the charlatans, so why not try to cash in? So he starts mouthing off about how this is <strong>already</strong> a serious contender, even though the people actually writing the code know it's still years from that. That's no longer OK. 
That's encouraging people to use code that will not store their data safely or securely, and I have a zero tolerance policy for that sort of thing.</p> <p>My message here is really pretty simple: keep coding, keep experimenting, if I have offended anyone's <strong>technical</strong> sensibilities I sincerely apologize, but for heaven's sake somebody get @tkojm to STFU until it's done. Fix the crypto, fix the performance, fix the packaging and UI. Then we'll have something real to talk about. Maybe, if I find more time than has already been wasted, I'll even dig in and submit a patch or two, fix some of that egregiously bad performance. But not if I keep hearing how it's already better than anything else.</p>Thu, 01 May 2014 12:08:00,2014-05-01:2014-05-skylable.htmldistributedstorageglusterfscephskylableInktank Acquisition<p>I know a lot of people are going to be asking me about Red Hat's acquisition of Inktank, so I've decided to collect some thoughts on the subject. The very very simple version is that <strong>I'm delighted</strong>. Occasional sniping back and forth notwithstanding, I've always been a huge fan of Ceph and the people working on it. This is great news. More details in a bit, but first I have to take care of some administrivia.</p> <p><em>Unlike everything else I have ever written here, this post has been submitted to my employer for approval prior to publication. I swear to you that it's still my own sincere thoughts, but I believe it's an ethical requirement for independent bloggers such as myself to be up front about any such entanglement no matter how slight the effect might have been. Now, on with the real content.</em></p> <p>As readers and conference-goers beyond number can attest, I've always said that Ceph and GlusterFS are allies in a common fight against common rivals. First, we've both stood against proprietary storage appliances, including both traditional vendors and the latest crop of startups. 
A little less obviously, we've also both stood for Real File Systems. Both projects have continued to implement and promote the classic file system API even as other projects (some even with the gall to put "FS" in their names) implement various stripped-down APIs that don't preserve the property of working with every script and library and application of the last thirty years. Not having to rewrite applications, or import/export data between various special-purpose data stores, is a <strong>huge</strong> benefit to users.</p> <p>Naturally, these two projects have a lot of similarities. In addition to the file system API, both have tried to address object and block APIs as well. Because of their slightly different architectures and user bases, however, they've approached those interfaces in slightly different ways. For example, GlusterFS is "files all the way down" whereas Ceph has separate bulk-data and metadata layers. GlusterFS distributes cluster management among all servers, while Ceph limits some of that to a dedicated "monitor" subset. Whether it's because of these technical differences or because of relationships or pure happenstance, the two projects have experienced different levels of traction in each of these markets. This has led to different lessons, and different ideas embedded in each project's code.</p> <p>One of the nice things about joining forces is that we each gain even more freedom than before to borrow each other's ideas. Yes, they were both open source, so we could always do some of that, but it's not like we could have used one project's management console on top of the other's data path. GlusterFS using RADOS would have been unthinkable, as would Ceph using GFAPI. Now, all things are possible. In each area, we have the chance to take two sets of ideas and either converge on the better one or merge the two to come up with something even better than either was before. 
I don't know what the outcomes will be, or even what all of the pieces are that we'll be looking at, but I do know that there are some very smart people joining the team I'm on. Whenever that happens, all sorts of unpredictable good things tend to happen.</p> <p>So, welcome to my new neighbors from the Ceph community. Come on in, make yourself comfortable by the fire, and let's have a good long chat.</p>Mon, 28 Apr 2014 13:23:00,2014-04-28:2014-04-inktank-acquisition.htmldistributedstorageglusterfscephNew Style Replication<p>This afternoon, I'll be giving a talk about (among other things) my current project at work - New Style Replication. For those who don't happen to be at Red Hat Summit, here's some information about why, what, how, and so on.</p> <p>First, why. I'm all out of tact and diplomacy right now, so I'm just going to come out and say what I really think. The replication that GlusterFS uses now (AFR) is unacceptably prone to "split brain" and always will be. That's fundamental to the "fan out from the client" approach. Quorum enforcement helps, but the quorum enforcement we currently have sacrifices availability unnecessarily and still isn't turned on by default. Even worse, once split brain has occurred we give the user very little help resolving it themselves. It's almost like we actively get in their way, and I believe that's unforgivable. I've <a href="">submitted</a> <a href="">patches</a> to overcome both of these shortcomings, but for various reasons those have been almost completely ignored. Many of the arguments about NSR vs. AFR have been about performance, which I'll get into later, but that's really not the point. In priority order, my goals are:</p> <ul> <li> <p>More correct behavior, particularly with respect to split brain.</p> </li> <li> <p>More flexibility regarding tradeoffs between performance, consistency, and availability. 
At the extremes, I hope that NSR can be used for a whole continuum from fully synchronous to fully asynchronous replication.</p> </li> <li> <p>Better performance in the most common scenarios (though our unicorn-free reality dictates that in return it might be worse in others).</p> </li> </ul> <p>To show the most fundamental difference between NSR and AFR, I'll borrow one of my slides.</p> <p><img alt="image" src="" /></p> <p>The "fan out" flow is AFR. The client sends data directly to both servers, and waits for both to respond. The "chain" flow is NSR. The client sends data to one server (the temporary master), which then sends it to the others, then the replies have to propagate back through that first server to the client. (There is actually a fan-out on the server side for replica counts greater than two, so it's technically more splay than chain replication, but bear with me.) The master is elected and re-elected via etcd, in case people were wondering why I'd been hacking on that.</p> <p>Using a master this way gives us two advantages. First, the master is key to how "reconciliation" (data repair after a node has left and returned) works. NSR recovery is log-based and precise, unlike AFR which marks files as needing repair and then has to scan the file contents to find parts that differ between replicas. Masters serve for terms. The order of requests between terms is recorded as part of the leader-election process, and the order within a term is implicit in the log for that term. Thus, we have all of the information we need to do reconciliation across any set of operations without having to throw up our hands and say we don't know what the correct final state should be.</p> <p>There's a lot more about the "what" and the "how" that I'll leave for a later post, but that should do as a teaser while we move on to the flexibility and performance parts. 
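<p>Before moving on, here's a toy model of the precise, log-based reconciliation described above - structure only, with invented names and layout, not NSR's actual code. Each term has an ordered log, the order between terms comes from the election history, so a returning replica replays exactly what it missed instead of scanning file contents:</p>

```python
# Each leadership term records an ordered log of operations; the order
# *between* terms is fixed by the recorded election history.
terms = {
    1: ["create /a", "write /a@0"],
    2: ["write /a@4096", "create /b"],
    3: ["delete /a"],
}
election_history = [1, 2, 3]  # captured as part of leader election

def reconcile(applied):
    """Return exactly the operations a returning replica missed.

    `applied` maps term -> number of ops the replica already holds.
    No guessing and no content scan: the logs say precisely what to
    replay, and in what order.
    """
    missing = []
    for term in election_history:
        done = applied.get(term, 0)
        missing.extend(terms[term][done:])
    return missing

# A replica that died partway through term 2:
assert reconcile({1: 2, 2: 1}) == ["create /b", "delete /a"]
```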
In its most conservative default mode, the master forwards writes to all other replicas before performing them locally and doesn't report success to the client until all writes are done. Either of those "all" parts can be relaxed to achieve better performance and/or asynchronous replication at some small cost in consistency.</p> <ul> <li> <p>First we have an "issue count" which might be from zero to N-1 (for N replicas). This is the number of non-leader replicas to which a write must be <strong>issued</strong> before the master issues it locally.</p> </li> <li> <p>Second we have a "completion count" which might be from one to N. This is the number of writes that must be <strong>complete</strong> (including on the master) before success is reported to the client.</p> </li> </ul> <p>The defaults are Issue=N-1 and Completion=N for maximum consistency. At the other extreme, Issue=0 means that the master can issue its local write immediately and Completion=1 means it can report success as soon as one write - almost certainly that local one - completes. Any other copies are written asynchronously but in order. Thus, we have both sync and async replication under one framework, merely tweaking parameters that affect small parts of the implementation instead of having to use two completely different approaches. This is what "unified replication" in the talk is about.</p> <p>OK, on to performance. The main difference here is that the client-fan-out model splits the client's outbound bandwidth. If you have N replicas, a client with bandwidth BW can never achieve more than BW/N write throughput. In the chain/splay model, the client can use its full bandwidth and the server can use its own BW/(N-1) simultaneously. This means increased throughput in most cases, and that's not just theoretical: I've observed and commented on exactly that phenomenon in head-to-head comparisons with more than one alternative to GlusterFS. 
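<p>To put assumed numbers on that bandwidth argument (the figures are illustrative, not benchmarks):</p>

```python
def fan_out_throughput(client_bw, replicas):
    # Client-fan-out: the client's uplink is split across all N replicas.
    return client_bw / replicas

def chain_throughput(client_bw, server_bw, replicas):
    # Chain/splay: the client uses its full uplink to the master, while
    # the master's own uplink is split across the remaining N-1 copies.
    return min(client_bw, server_bw / (replicas - 1))

# 10 Gb/s client, 3-way replication:
assert fan_out_throughput(10, 3) == 10 / 3   # capped at a third of client BW
assert chain_throughput(10, 20, 3) == 10     # beefier server: full client BW
assert chain_throughput(10, 10, 3) == 5      # same-size server: bottleneck moves
```

<p>The last case is where the caveats start, as discussed next.</p>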
Yes, <strong>if</strong> enough clients gang up on a server then that server's networking can become more of a bottleneck than with the client-fan-out model, and <strong>if</strong> the server is provisioned similarly to the clients, and <strong>if</strong> we're not disk-bound anyway, then that can be a problem. Likewise, the two-hop latency with this approach can be a problem for latency-sensitive and insufficiently parallel applications (remember that this is all within one replica set among many active simultaneously within a volume). However, these negative cases are much - <strong>much</strong> - less common in practice than the positive cases. We did have to sacrifice some unicorns, but the workhorses are doing fine.</p> <p>That's the plan to (almost completely) eliminate the split-brain problems that have been the bane of our users' existence, while also adding flexibility and improving performance in most cases. If you want to find out more, come to one of my many talks or find me online, and I'll be glad to talk your ear off about the details.</p>Wed, 16 Apr 2014 10:24:00,2014-04-16:2014-04-new-style-replication.htmldistributedstorageglusterfsChange the Axis<p>The other day, I was talking to a colleague about the debate within OpenStack about whether to chase Amazon's AWS (what another colleague called the "failed Eucalyptus strategy") or forge its own path. It reminded me of an idea that was given to me years ago. I can't take credit for the idea, but I can't remember who I got it from so I'll do my best to represent it visually myself. Consider the following image.</p> <p><img alt="image" src="" /></p> <p>Let's say your competitor is driving toward a Seattle feature set - coffee, grunge, rain. You have a slightly different vision, or perhaps just a different execution, that leads toward more of an LA feature set - fewer evergreens, more palm trees. 
If you measure yourself by progress toward your opponent's goals (the dotted line), <em>you're going to lose</em>. That's true even if you actually make better progress toward your goals. You're just playing one game and expecting to win another. That might seem like an obviously stupid thing to do, but an amazing number of companies and projects end up doing just that. I'll let someone with more of a stake in the OpenStack debate decide whether that applies to them. Now, consider a slightly different picture.</p> <p><img alt="image" src="" /></p> <p>Here, we've drawn a second line to compare <em>our competitor's progress</em> against <em>our</em> yardstick. Quite predictably, now they're the ones who are behind. Isn't that so much better? If you're building a different product, you need to communicate why you're aiming at the right target and shift the debate to who's hitting it. In other words, <em>change the axis</em>.</p> <p>I don't mean to say that copying someone else's feature set is always a mistake. If you think you can execute on their vision better than they can, that's great. Bigger companies do this to smaller companies all the time. At Revivio, we weren't afraid of other small competitors. We were afraid of some big company like EMC getting serious about what we were doing, then beating us with sheer weight of numbers and marketing muscle. Occasionally things even go the other way, when a smaller team takes advantage of agility and new technology to out-execute a larger legacy-bound competitor. The real point is not that one strategy's better, but that you can't mix them. You can't send execution in one direction and messaging in another. You have to pick one, and stick to it, or else you'll always be perceived as falling short.</p>Thu, 13 Feb 2014 09:04:00,2014-02-13:2014-02-change-the-axis.htmlopenstackstrategyData Gravity<p>In the last few days, I had an interesting exchange <a href="">on Twitter</a> about the concept of data gravity. 
For convenience, I'll include the relevant parts here.</p> <ul> <li> <p><a href="">Mat Ellis</a>: Interesting piece by @mjasay <a href="">link</a> … @randybias is right on the money, data gravity is already a big deal on the cloud</p> </li> <li> <p><a href="">me</a>: Data gravity will continue to be a big deal, no matter how fast the network. Can't beat the speed of light.</p> </li> <li> <p><a href="">Randy Bias</a>: Data gravity and speed of light are entirely unrelated.</p> </li> <li> <p>me: No matter how much bandwidth you have, latency-bound sync and coordination limit total data velocity.</p> </li> </ul> <p>I think this is an important point, and Randy is hardly the first to get it wrong, but the explanation is a little longer than Twitter's 140-character limit. If you have data that you want to access from multiple places, you have two choices.</p> <ul> <li> <p>Keep a copy in one location, access it remotely from elsewhere. Besides being <strong>extremely</strong> latency-bound, this does nothing for availability.</p> </li> <li> <p>Keep multiple copies, and keep them in sync. The sync process/protocol still tends to be quite latency-bound, and as the number of replicas increases you get increasingly poor storage utilization. Even Google doesn't have an infinite budget for disks.</p> </li> </ul> <p>Either way, no matter how much bandwidth you have, latency - bound by speed of light - is an issue. This is exactly the point I made in my <a href="">Dude, Where's My Data</a> talk at LISA'12: making that initial copy is easy, but keeping it up to date is hard. Sooner or later you're back to this.</p> <p><img alt="image" src="" /></p> <p>That's data gravity, despite high bandwidth. Computing is full of "if you just do/have X" pipe dreams, of which "throw hardware at it" is just a subcategory. People who've actually tried X have usually found that there are tons of secondary issues that have to be solved, and even then X isn't the panacea it was imagined to be. 
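<p>To attach rough numbers to the latency point (all figures assumed): a sequence of dependent, synchronized updates pays a full round trip each, and no amount of bandwidth buys that back.</p>

```python
def sync_update_time(ops, rtt_s, bytes_per_op, bw_bytes_per_s):
    # Each synchronous, dependent update waits out a round trip;
    # the transfer time is usually noise by comparison.
    return ops * (rtt_s + bytes_per_op / bw_bytes_per_s)

rtt = 0.070    # ~70 ms coast-to-coast round trip
bw = 1.25e9    # a 10 Gb/s pipe, in bytes per second

t = sync_update_time(ops=10_000, rtt_s=rtt, bytes_per_op=4096, bw_bytes_per_s=bw)
assert t > 700                      # ~700 seconds, dominated by latency...
assert 10_000 * 4096 / bw < 0.04    # ...while the raw transfer is ~33 ms
```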
This is such a case. Having tons of bandwidth is nice, it does allow Google to do things that others can't, but it simply doesn't make data gravity disappear.</p>Mon, 10 Feb 2014 08:27:00,2014-02-10:2014-02-data-gravity.htmldistributedstorageTiers Without Tears<p>A lot of people have asked when GlusterFS is going to have support for tiering or Hierarchical Storage Management, particularly to stage data between SSDs and spinning disks. This is a pretty hot topic for these kinds of systems, and many - e.g. Ceph, HDFS, Swift - have announced upcoming support for some form or other. However, tiering is just one part of a larger story. What do the following all have in common?</p> <ul> <li>Migrating data between SSDs and spinning disks.</li> <li>Migrating data between replicated storage and deduplicated, compressed, erasure-coded storage.</li> <li>Placing certain types of data in a certain rack to increase locality relative to the machines that will be using it.</li> <li>Segregating data in a multi-tenant environment, including between tenants at different service levels requiring different back-end configurations.</li> </ul> <p>While these might seem like different things, they're all mostly the same except for one part that decides where to place a file/object. It doesn't really matter whether the criteria include file activity, type, owner, or physical location of servers. The mechanics of actually placing it there, finding it later, operating on it, or moving it somewhere else are all pretty much the same. We already have all those parts in GlusterFS, in the form of the DHT (consistent hashing) translator. We've even added tweaks to it before, such as the ill-named NUFA. Therefore, it makes perfect sense to use that as the basis for our own tiering strategy, but I call it "data classification" because the same enhancements will allow it to do far more than tiering alone.</p> <p>The key idea behind data classification is reflected in its earlier name - DHT over DHT. 
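<p>Here's a toy sketch of that DHT-over-DHT idea - none of this is GlusterFS code, and all the names are invented. The point is that the upper "tier" layer and the lower "distribute" layer are the same kind of decision, instantiated twice: each one just picks among its own subvolumes.</p>

```python
import hashlib

POOLS = {
    "fast": ["ssd-brick-1", "ssd-brick-2"],
    "archive": ["disk-brick-1", "disk-brick-2", "disk-brick-3"],
}

def tier(filename, hot):
    # Policy layer: activity here, but tenant, rack, or service level
    # would slot in identically.
    return "fast" if hot else "archive"

def distribute(filename, bricks):
    # Consistent-hash-style layer: pick a brick within the chosen pool.
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return bricks[h % len(bricks)]

def place(filename, hot):
    # Two instances of the same placement logic, stacked.
    return distribute(filename, POOLS[tier(filename, hot)])

assert place("report.txt", hot=True) in POOLS["fast"]
assert place("report.txt", hot=False) in POOLS["archive"]
```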
Our "translator" abstraction allows us to have multiple instances of the same code active at once, differing only in their parameters and relationship to one another. It's just one of many ways that GlusterFS is more modular than its closest competitors, even though those are implemented in more object-oriented languages. To see how this kind of setup works, let's start with an example <em>without</em> it, capable of implementing only the simplest form of tiering.</p> <p><img alt="image" src="" /></p> <p>In this example, we have four bricks each consisting of a smaller SSD component (red) and a larger spinning-disk component (blue). This can easily be done using something like <a href="">dm-cache</a>, <a href="">Bcache</a>, <a href="">FlashCache</a>, or various hardware solutions. Those hybrid bricks are then combined, first into replica pairs and finally into a volume using the DHT (a.k.a. "distribute") translator. This approach actually works pretty well and is easy to implement, but it's less than ideal. If your working set is concentrated on anything less than the entire set of bricks, then you could fill up their SSD parts and either become network-bound or have accesses spill over to the spinning-disk components even though potentially usable resources on other bricks remain idle. This approach doesn't deal well with adding more resources in anything but a totally symmetric fashion across all bricks, and in particular precludes concentrating those SSDs on a separate set of beefier servers with extra-fast networking. Lastly, it doesn't support tiering across different encoding methods or replication levels, let alone the other non-tiering functions mentioned above. Now, consider this different kind of setup.</p> <p><img alt="image" src="" /></p> <p>Here, the left half is our fast working-storage tier and the right half is our archival tier optimized for storage efficiency and survivability instead of performance. 
Note that this is a logical/functional view, not a physical one. A1 and A2 might still be on the same server, but now their logical relationship has changed and so they could also be moved separately.</p> <p>Our performance tier looks much like the whole system did before, with bricks arranged into replica sets and then DHT (as it is today). However, we've split off the spinning disks into a whole separate pool, and put a new "tiering" translator (a modified version of DHT) on top. Here's the cool part: that "replicate 3" layer might actually be erasure coding instead of normal replication. That would suck for performance, but since this is only used for our slow tier that's OK. 90% of the accesses to the fast tier + 90% of data in the storage-efficient tier = goodness. We could also toss in deduplication, compression, or bit-rot detection <em>on that side only</em> for extra fun. Note that we couldn't do this in the other model, because you can't put non-distributed tiering on top of distributed erasure coding. Most other tiering proposals I've seen do the tiering at too low a level, and are far less useful as a result.</p> <p>Finally, let's consider those other functions that aren't tiering. In the second diagram above, it would be trivial to replace the "distribute" component above with one that's making decisions based on rack location instead of random hashing. Similarly, it would be trivial to replace the top-level "tier" component with one that makes decisions based on tenant identity or service level instead of file activity. It's almost as easy to add even more layers, doing all of these things at once in a fully compatible way. No matter what, migrating data based on new policies or conditions can still use the same machinery we've worked so hard to debug and optimize for DHT rebalancing.</p> <p>Over the last few years I've come up with a lot of ways to improve GlusterFS, or distributed filesystems in general, but this is one of my favorites. 
It can add so much functionality in return for so little deep-down "heavy lifting" and that's pretty exciting.</p>Fri, 31 Jan 2014 13:07:00,2014-01-31:2014-01-data-classification.htmldistributedstorageglusterfsThe World Is Not Flat<p>Way back when I was a young pup, either in college or after that but before I started my career, I got to use an operating system called MTS. That stands for Michigan Terminal System. It was created to run on IBM (and later Amdahl) mainframes, when U of M got tired of waiting for IBM to deliver a multi-user operating system. Like most code that old, it was an interesting combination of ideas that have since been abandoned because they were stupid, ideas that were ahead of their time, and ideas that were somewhere in between. Here are some of the more interesting ideas.</p> <ul> <li> <p>The filesystem had a feature to include the entire contents of one file at a specific point within another. Who needs symbolic links when you can just create a file containing a single %include directive? Why would programming languages have to synthesize this behavior in a bazillion subtly different ways if the basic functionality existed natively in the OS? Yeah, I know, record-oriented filesystems (basically a prerequisite for this) lost out to simple byte-streams for many good reasons, but every victory comes at a cost.</p> </li> <li> <p>MTS had a very robust ACL system, which allowed you to control access by user, group, or "pkey" (i.e. what program was running). Much better than set-uid in my opinion.</p> </li> </ul> <p>While I was still using MTS, they added a macro system - what we would now think of as a shell scripting language. One of the very first uses of this macro system was to synthesize a hierarchical directory structure on top of the flat one native to MTS. I really wish I could remember the name of the author, to give credit. He was a Computing Center consultant, and this would have been in 1985 or so, if anybody wants to help me out. 
It was a pretty slick combination of naming conventions and macros, and I think it made many users' lives easier.</p> <p>The reason I started thinking about MTS is that I see people doing the exact same things now - nearly thirty years later - to simulate a hierarchical namespace on top of the flat one provided by most object stores. Let me repeat something I've said many times before, in many ways: flat namespaces weren't just crap in DOS, they were crap in MTS even before that and they're still crap today. <strong>Crap</strong>, I say. Anybody who implements a supposedly modern file/object store with a flat namespace is simply screwing their users to suit their own convenience. The scalability arguments don't hold water, because the scalability issues have more to do with the operations that you have to support (e.g. atomic rename) than with whether or not you have nested directories. This is something that has to be built into the data store, with the necessary recursive name resolution done in one place, one time, by people who understand that data store, instead of being done ten incompatible ways by ten different outsiders. Even quite smart people can trip when they try to bolt on a hierarchical structure <a href="">after the fact</a>.</p> <p>Users have shown over and over again that they want flexibility to organize and reorganize their data, in ways richer than a flat or even single-level hierarchy will allow. Maybe there's an even better way, but so far none of the attempts to replace nested directories with tags or links or database-like queries seem to have gained much traction. Until someone comes up with something better, the nested-directory structure should be considered a minimum standard for anything that's supposed to replace the traditional filesystem.</p>Sun, 29 Dec 2013 20:00:00,2013-12-29:2013-12-world-is-not-flat.htmlstoragedistributedData Extortion<p>This is a story about the dark side of moving your stuff into the cloud. 
It does have a (reasonably) happy ending, but along the way there are some important lessons to be learned about the relationship between cloud users and cloud providers, and how it's possible for people on either side to get burned. There are some bits about contract (and other) law, and customer service, and other things as well, but let's begin with what happened this morning.</p> <p>Between my reduced hours and the Christmas shutdown, I figure I owe Red Hat about 4.8 hours of work this week. I didn't do it on Monday, so I figured I'd do it this morning. I decided to debug some performance-testing scripts, but since my machines at work are all powered off I figured I'd do it on my cloud machine at Host Virtual. For debugging, I only needed to do short runs - no more than forty seconds or one gigabyte per run, as it turns out. I'd done about a dozen of these when my machine became unresponsive. What gives? I looked around all the usual ways, then logged in to my Host Virtual console to see that my VM was locked with the following message.</p> <blockquote> <p>i/o abuse from your vm - we are investigating</p> </blockquote> <p>There are two things wrong here. The less important problem is the premature "abuse" accusation. "Anomalous" would have been fine, "excessive" might even have been OK, but "abuse" is insulting a customer for no good reason. More importantly, locking the VM was a complete overreaction. The tools exist to throttle the I/O from a particular VM instead of shutting it down entirely. I've seen such throttling kick in when testing on other providers many times (more about that in a minute). Even when a shutdown is considered necessary, it's <strong>never</strong> appropriate to do it without notification. 
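<p>By "the tools exist" I mean mechanisms like a token bucket, sketched here in Python - illustrative only, not any provider's actual implementation. A VM that exceeds its allowance gets slowed down, not switched off, and nobody's data is ever at risk:</p>

```python
class TokenBucket:
    """Per-VM I/O allowance: refill at a steady rate, cap the burst."""

    def __init__(self, rate_iops, burst):
        self.rate = rate_iops   # tokens added per second
        self.tokens = burst     # current allowance
        self.burst = burst      # cap on accumulated allowance

    def tick(self, seconds):
        # Time passing earns back allowance, up to the burst cap.
        self.tokens = min(self.burst, self.tokens + self.rate * seconds)

    def allow(self, ios):
        # Over-limit requests are delayed, not punished: the VM keeps
        # running, it just goes slower until the bucket refills.
        if self.tokens >= ios:
            self.tokens -= ios
            return True
        return False

bucket = TokenBucket(rate_iops=100, burst=200)
assert bucket.allow(200)      # an initial burst is permitted...
assert not bucket.allow(1)    # ...then the VM is throttled
bucket.tick(1.0)              # one second later, allowance returns
assert bucket.allow(100)
```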
By their own admission they were still investigating, but from my perspective they had already gone beyond investigation to accusation, conviction, and execution.</p> <p>At this point, I submitted a ticket explaining what I had been doing, and suggesting that their reaction had been premature. If they had just admitted as much, things would have been fine. If they had asked me to reduce my I/O load, I would have. Instead, Customer Disservice Representative par excellence "Mark M" replied saying that I had been affecting other customers and violating their Acceptable Use Policy. Unfortunately, there's nothing to back up that claim. There is no I/O limit specified in their AUP. None. The closest they get is this.</p> <blockquote> <p>We have determined what constitutes reasonable use of our network for the particular services and products you purchase from us. These standards are based on typical customer use of our network, for similar services and products. It is your obligation to monitor the use of your services and/or server(s) to ensure that there are not unusual spikes and peaks in your bandwidth or disk usage.</p> </blockquote> <p>In other words, they claim to have some numbers in mind, but won't commit to them in their own AUP. Because those limits aren't specified, even by reference to another document or method of notification, they're legally nonexistent. Even now, nobody at HV has identified a limit that I exceeded, by how much, or for how long. They can't claim any AUP violation without such specifics, and thus they can't claim any right to modify our existing relationship in any way. So, their AUP claim is complete bullshit. What about the "affecting other users" claim?</p> <p>Well, sorry, but tough cookies for them. Do you know who's responsible for meeting their obligations to other customers? <strong>Them</strong>. As it happens, I know quite a bit about the problems and technologies involved in providing these kinds of services. 
In the course of becoming an expert on cloud I/O performance, I've done this same sort of testing on about twenty providers. I've seen the "noisy neighbor" problem from both sides. I've seen my own I/O performance go all over the map because of other users, and I've seen it throttled into the ground supposedly to protect other users from me. I don't love being throttled, but it's an entirely valid response so long as its depth and duration are protective rather than punitive. More importantly, it proves that <strong>the technology exists</strong>. If HV chooses not to apply it, that's their fault. They can't simultaneously preach about meeting commitments to users while spitting on their commitment to me.</p> <p>The funny thing is that until now I've been one of HV's biggest boosters. They seemed to be one of the few providers whose I/O performance was marred by neither massive variability nor punitive throttling. Little did I know that their "secret" was to kill VMs arbitrarily when they got in the way. In any case, I expressed my skepticism about their AUP claim, and my dissatisfaction with the lack of notification. That's when "Mark M" really stuck his head up his ass.</p> <blockquote> <p>if the abuse is ongoing and continued your account will simply be terminated and your server deleted.</p> </blockquote> <p>What we now have is someone threatening a customer with <strong>deletion of data</strong> in response to a "violation" that has already been called into question. That's extortion. There is absolutely no situation where it would be appropriate to delete a server while such disagreements are still outstanding, and the fact that Marky Mark regards it as a "simple" matter is appalling.</p> <p>So I've already moved all my data, and I'll be warning everyone away from this decade's version of Feature Price (widely regarded as the worst web host ever, especially since they also tried to take users' data hostage). No big deal, actually. 
What's far more important is that this could happen <strong>to any user, at any cloud provider</strong>. Go take a good look at your own AUP, TOS, or whatever you think spells out the obligations back and forth between you and your cloud provider. How many MB/s may you write, for how long, before they decide you're being "abusive"? Bear in mind that anything left vague might be subject to mind-bending reinterpretation, and anything left out (like HV's I/O limits) might be subject to outright fabrication. What recourse do you have if the provider inappropriately terminates service? Do they admit to any obligation regarding preservation of data while there is an ongoing dispute? I've seen a whole lot of these documents, and all of these things are typically missing. Maybe it's time for someone - users, providers, please not the government - to define minimum standards that cloud providers should meet regarding these sorts of issues. The better providers won't have any problem signing up. The worse ones? Well, I suppose they'll keep on threatening and extorting - and losing - their customers.</p> <p>By the way, welcome to the new site.</p>Fri, 27 Dec 2013 21:50:00,2013-12-27:2013-12-data-extortion.htmlcloudlegalRoll Back or Rock On?<p>For a while now, Kyle Kingsbury has been doing some <a href="">excellent work</a> evaluating the consistency and other properties of various distributed databases. His <a href="">latest target</a> is Redis. Mostly I agree with the points he makes, and that Redis Cluster is subject to inexcusable data loss, but there is one point on which my own position is closer to the opposition.</p> <blockquote> <p>we have to be able to roll back operations which should not have happened in the first place. If those failed operations can make it into our consistent timeline in an unsafe way, perhaps corrupting our successful operations, we can lose data.</p> </blockquote> <p>Those are strong words, but their strength is not matched by their precision. 
What does "unsafe" really mean here? Or "corrupting"? I'm the last person to take data corruption or loss lightly, but that's precisely why I think it's important to be crystal clear on what they mean. How is it "corruption" to perform a write that the user asked you to perform? The answer depends very much on what rules we're actually supposed to follow. Let's start with some of the most basic requirements for any distributed storage system.</p> <ul> <li> <p>Internal consistency: all nodes will eventually agree on whether each write happened or not. (Note: this is more CAP consistency than ACID consistency).</p> </li> <li> <p>Durability: once a write has completed, it will be reflected in all subsequent reads despite transient loss of all nodes and/or permanent loss of some number (system-specific but always less than quorum).</p> </li> </ul> <p>We're not done yet, because we've only defined an internal kind of consistency. As many have pointed out, a distributed system includes its clients. A system that simply throws away all writes could satisfy our requirements, so let's add a more externally oriented consistency requirement.</p> <ul> <li>External consistency: any write that has been acknowledged to the user as successfully completed must be complete according to the durability definition.</li> </ul> <p>That's really about it. The last acknowledged write to a location will eventually become available everywhere, and remain available unless the failure threshold is exceeded (or a user deliberately overwrites it but that's a different matter). There are certainly many more requirements we could add, as we'll see, but these few are sufficient for a usable system.</p> <p>One thing that's noticeably missing from our external-consistency rule is anything to do with <strong>un</strong>acknowledged writes. Unless we add more rules, the system is free to choose whether they should be completed or rolled back (so long as our other rules are followed). 
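As a sanity check, the rules above can be expressed against a toy majority-quorum store. This is purely my own illustrative sketch, not any particular system; a real implementation needs anti-entropy, leader leases, and much more:

```python
# Toy majority-quorum store illustrating the consistency rules above.
class ToyStore:
    def __init__(self, nodes=5):
        self.replicas = [dict() for _ in range(nodes)]
        self.quorum = nodes // 2 + 1

    def write(self, key, value, reachable):
        """Attempt a write; acknowledge only if a majority stored it
        (external consistency: ack implies durability)."""
        for i in reachable:
            self.replicas[i][key] = value
        return len(reachable) >= self.quorum   # True = acknowledged to client

    def read(self, key):
        """Read by majority vote: any acknowledged write must be visible."""
        votes = [r.get(key) for r in self.replicas]
        for v in votes:
            if v is not None and votes.count(v) >= self.quorum:
                return v
        return None
```

Note what the sketch deliberately leaves open: a write that reached only a minority (unacknowledged) is neither guaranteed visible nor guaranteed invisible, which is exactly the freedom discussed next.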
Here's a rule that would force the system to decide a certain way.</p> <ul> <li>Any write that has <strong>not</strong> been acknowledged to the user must <strong>not</strong> be reflected on subsequent reads.</li> </ul> <p>That should be pretty familiar to database folks as isolation (plus a bit of atomicity), and it's no surprise that database folks would assume it . . . but you know what they say about assumptions. Other kinds of systems, such as filesystems, do not have such a requirement. Instead of appealing to (conflicting) authority or tradition, let's try taking a look at what's actually right for users.</p> <p>Unacknowledged writes fall into two categories: still in progress or definitively failed. For in-progress writes, isolation can be enforced by storing them "off to the side" in one way or another. This doesn't work for definitively failed writes, because "off to the side" is finite. Those writes have to be actually removed from the system - i.e., rolled back. The problem is that roll-back is subject to the same coordination problems as the original write and carries its own potential for data loss. In fact, for a write that overlaps with a previous one and succeeded at some nodes but not others, data loss absolutely <strong>will</strong> occur either way. The difference is only which data - old or new - will be lost.</p> <p>So, back to what's right for users. Why is it better to lose the data that the user explicitly intended to overwrite than to lose the data that they explicitly intended to put in its place? Trick question: it's not. The careless user who didn't bother checking return values would obviously be better served by moving forward than by rolling back . . . but who cares about them? More importantly, even a diligent user who does check error codes should be aware that lack of acknowledgement does not mean lack of effect. By now "everybody knows" that if you send a network message and don't get a reply you can't assume it had no effect.
The same "lost ack" problem exists for storage I/O as well, and has forever. In both worlds, the "must have had no effect" assumption is just as dumb as the careless programmer's "must have worked" assumption.</p> <p>If we exclude the careless and truly diligent programmers, the only people left who would care about not having rollback would be those who know to check for errors but don't know or don't care enough to handle them properly. They must also be comfortable with the performance impact of roll-back support, most often from double writing. I'm not saying these people don't exist or their concerns aren't valid, but clearly roll-back is not the best or only system-design choice for everyone. Building a system that tries to keep as many writes as possible instead of throwing away as many as possible is an entirely valid option.</p> <p>If Redis threw away 45% of acknowledged writes in Kyle's testing, that's a serious problem. That violates our consistency rule, or any reasonable alternative, and I have no problem saying that such a system is broken. When Kyle adds that Redis "adds insult to injury" by completing all of the unacknowledged writes instead, he's also correct - but it's only an insult, not a new injury. A new injury would be further loss of data, and whether those successful writes represent loss of data is very open to interpretation. If I accidentally knock some money out of your hands, then bend down and pick up only the pennies for you, it's not the pennies that are the problem. It's the money - or data - that got dropped on the floor and left there.</p> <p>(NOTE: it has been pointed out to me that what Kyle tested was not Redis Cluster but a proposed WAIT enhancement to Redis replication. Or something like that. Fortunately, those distinctions aren't particularly relevant to the point I'm trying to make here, which is about the supposed necessity of roll-back support in this type of system. 
Nor does it change the fact that the system under test - whatever it was - failed miserably. Still, I used the wrong term so I've added this paragraph to correct it.)</p>Wed, 11 Dec 2013 17:32:00,2013-12-11:2013-12-roll-back-or-rock-on.htmldistributedGiving Thanks<p>This was inspired both by a <a href="">blog post elsewhere</a> and by a nice email I got this morning thanking me for this blog (thanks Tristan). It seems like we all fail to give thanks, and nowhere more so than in the "gift economy" of open source. I'll start with all of the <strong>code</strong> for which I'm thankful, and then move outward from there.</p> <p>I'm grateful for the operating systems, compilers/interpreters, and text editors I use. For the web browsers, email clients, and servers of all kinds. For the hardware, from chips up to systems, that runs all of this code. (We software folks are <strong>really</strong> bad about recognizing all of the efforts that are made before we even start.) I'm grateful for the internet in all of its physical, technical, and financial manifestations. It is truly a wonder that I can carry a device anywhere that lets me sit down wherever, connect wirelessly to the rest of the world, and work or play. Lastly, I'm grateful to all of the computer scientists and mathematicians and physicists and all sorts of real engineers who toiled away, often in obscurity, to lay the foundations for all of this.</p> <p>OK, so much for the purely technical. I'd also like to thank all of my colleagues, past and present, for helping me achieve whatever it is that I've achieved, for providing intellectual challenges, and (sometimes) for pure camaraderie. I'd like to thank my current bosses at Red Hat, for letting me take time off and reduce my hours so that I can stay sane. Very few of my past bosses would have done so much. 
Thanks to all those who have to work today, and work every day, to create the environment that allows me such freedom and opportunity - soldiers, police and other emergency workers, doctors, nurses, the people who maintain our power and communications grids, inspectors, regulators, and so on. Yes, even legislators, judges, mayors, governors, and presidents.</p> <p>There are even more people to thank, but I have to cut myself short so I can thank the most important group: my family. Yes, I know it's trite, but it's also true. Without them I wouldn't be able to do the other things you all get to see, and I wouldn't have any reason to, and I wouldn't have anything else to go back to when I'm done. Family, whether inherited or chosen as friends, is really the basis of everything else. Let's all try not to forget that.</p> <p>Now, off to lunch.</p>Thu, 28 Nov 2013 11:43:00,2013-11-28:2013-11-giving-thanks.htmlShared Libraries are Obsolete<p>I was around when shared libraries were still a new thing in the UNIX world. At the time, they seemed like a great idea. On multi-user systems like those I worked on at Encore, static linking meant not only having a separate copy of the same code in every program, but having a separate copy even for every user running the same program. The waste of both disk space and memory was a serious concern. Making shared libraries work required a lot of effort from both compiler and operating-system people, but it was well worth it.</p> <p>Fast forward to the present day. Not only are disk and memory cheaper, but there aren't as many users running copies of the same program on the same machine either. Those savings are neither as big nor as important as they used to be. At the same time, shared libraries have created a whole new world of software maintenance problems. In the Windows world this is called "DLL Hell" but I think the problem is even worse in the Linux world. 
When every application depends on dozens of libraries, and every one of those libraries is shared, that means dozens of possibilities for an upgraded library to cause a new crash or security failure in your application. Yes, sometimes bugs can be fixed without needing to rebuild applications, but I challenge anyone to show empirical evidence that the fixes are more common than the breakage.</p> <p>People actually do test their applications against specific combinations of the libraries they depend on. If there are bugs in that combination, they get found and fixed or worked around. Every behind-the-back library upgrade creates a new <strong>untested</strong> configuration that might be better but is more likely worse. In what other context do we assume that an untested change will "just work"? In what other context should we? Damn few. Applications should run with the libraries they were tested with.</p> <p>At this point, someone's likely to suggest that strict library versioning solves the problem. It sort of does, so long as the library version includes information about how it was built as well as which version of code, because the same code built differently is still likely to behave differently sometimes. Unfortunately, it just trades one problem for another - dependency gridlock. If every application specifies strict library dependencies, then what do you do when a library changes? If you blindly mass-rebuild applications to update their dependencies, then you haven't solved the "untested combination" problem. If you keep old library versions on the system, then you've thrown away the advantage of having shared libraries in the first place. Either way, you've created a package-management nightmare both for yourself and for distribution maintainers.</p> <p>Shared libraries still make sense for the very low-level libraries that practically every application uses and that users are already wary of updating, like glibc. 
If the library maintainer's testing and API-preservation bar is higher than most app developers', that's OK. In almost every other case, you're probably better off with statically linked and tested combinations of apps and libraries. If you want to save some memory, make sure the load addresses are page aligned and do memory deduplication. Otherwise, you're probably just saving less memory than you think at the expense of much more important stability and security.</p> <p>Update: This post sparked a fairly lively <a href="">Twitter conversation</a>.</p>Tue, 26 Nov 2013 17:23:00,2013-11-26:2013-11-shared-libraries.htmloperating-systemsFixing Fsync<p>When I wrote about how <a href="">local filesystems suck</a> a while ago, it sparked a bit of debate. Mostly it was just local-filesystem developers being defensive, but Dave Chinner did make the quite reasonable suggestion that I could help by proposing a better alternative to the fsync problem. I've owed him an answer since then; here it is.</p> <p>To recap, the main problem with fsync is that it conflates <em>ordering</em> with <em>synchrony</em>. There's no way to ensure that two writes happen in order except by waiting for the first to complete before issuing the second. This sacrifices any possibility of pipelining requests, which is essential for performance. What's funny is that local filesystems themselves take advantage of a model that does allow such ordered pipelines - tagged command queuing, which I first encountered twenty years ago when I worked with parallel SCSI. The basic idea is that a device has multiple queues. Each request specifies which queue it should go on, plus some bits to specify how it should be queued and dequeued.</p> <ul> <li> <p>SIMPLE means that there are no particular queuing or ordering restrictions. 
A series of SIMPLE requests can be reordered and/or issued in parallel with respect to each other, but not with respect to non-SIMPLE requests.</p> </li> <li> <p>ORDERED means that the request must wait for all earlier requests on the same queue, and all later requests on the same queue must wait for it. This allows pipelining, but not parallelism or reordering.</p> </li> <li> <p>HEAD is the same as ORDERED, except that the new request is inserted at the head of the queue instead of the tail. This is generally a very bad idea, but it's necessary in certain situations. For example, the drivers I was writing used it to issue the commands for controller failover while leaving the rest of the queue intact.</p> </li> </ul> <p>The funny thing is that this model has been around so long that it has bubbled up to the OS block layer, where local filesystems can take advantage of it to ensure correct ordering while maintaining performance, but then those same local filesystems don't expose it to anyone else. Seems rather selfish to me.</p> <p>The obvious solution is simply to add queue/type parameters to writev (and possibly other calls as well). Current behavior is equivalent to SIMPLE queuing. Fsync is equivalent to an ORDERED no-op issued synchronously. That's all very well, but the model provides pretty obvious ways to do even more interesting things.</p> <ul> <li> <p>An asynchronous fsync becomes possible simply by issuing an ORDERED no-op (zero-length write?) using AIO. You don't have to <strong>wait</strong> for it, but you can be assured that it's in the pipeline and order will be maintained.</p> </li> <li> <p>If you only need ordering between <strong>some</strong> requests, you can use ORDERED on multiple queues.</p> </li> </ul> <p>This is a clean, powerful, and well proven model. Unfortunately, local-FS developers will probably argue that it's too hard to implement (even though they've already implemented it for their own uses e.g. ordering data vs. inode writes). 
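The SIMPLE/ORDERED semantics above are easy to state precisely. Here's a toy dispatcher, my own sketch of the model rather than any actual SCSI or block-layer code: consecutive SIMPLE requests form a batch that may run in parallel or be reordered, while each ORDERED request acts as a barrier in both directions.

```python
# Toy dispatcher for the SIMPLE/ORDERED queuing model described above.
def dispatch(queue):
    """Group (kind, op) requests into batches. Operations within a batch
    may run in parallel or be reordered; batches must complete in order."""
    batches, current = [], []
    for kind, op in queue:
        if kind == "SIMPLE":
            current.append(op)
        elif kind == "ORDERED":
            if current:                 # close out the parallel batch
                batches.append(current)
                current = []
            batches.append([op])        # a barrier runs alone
        else:
            raise ValueError("unknown queue type: %r" % kind)
    if current:
        batches.append(current)
    return batches

# an fsync-like flush is just an ORDERED no-op in this model
queue = [("SIMPLE", "w1"), ("SIMPLE", "w2"), ("ORDERED", "flush"), ("SIMPLE", "w3")]
```

With this model, `dispatch(queue)` yields three batches: w1 and w2 together, then the flush barrier, then w3 - ordering is preserved without ever draining the pipeline.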
However, most of this complexity has to do with multiple queues. Implementing just SIMPLE/ORDERED without multiple queues would be much easier, and still much better than what we have now.</p> <p>The other problem with fsync is that it only flushes the pipeline for a single file descriptor (actually in practice it's more likely to be the inode). If you want to flush the pipelines for a bunch of file descriptors, you have to issue fsync for each one separately. This is not just an inconvenience; it means that you either need to wait for N fsyncs in sequence, or have N threads handy to wait for them in parallel. The other alternative is to issue syncfs instead - possibly having to wait for I/O from other applications and other users as well as your own. All of these options are awful. A better option would be a way to group file descriptors together through a single "special" one, and then issue more powerful combined operations on that. In fact, such an interface already exists - <em>epoll</em>. Some of that same code could probably be reused to implement a way of flushing multiple files instead of waiting for them. At the very least, this would make flushing lots of files at once simpler and less syscall-intensive. Even better, a decent implementation might allow filesystems to reason about a whole bunch of fsyncs <strong>as a group</strong> and optimize how all of the relevant block-level I/O gets done. I don't expect that to happen soon, but at least the right API makes it possible.</p> <p>Of course, it's always easy to make suggestions for other people to implement. I try not to tell other people their business, because I have quite enough of them telling me mine. Nonetheless, the need is there and I was asked to propose a solution, so I have. 
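As a footnote to the multi-file problem above: absent any grouped-flush interface, the workaround available to applications today really is one thread per fsync. A sketch of that status quo, using only standard calls:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def fsync_all(fds):
    """Flush many file descriptors concurrently. Without a grouped-flush
    API, overlapping the waits means burning a thread per fsync."""
    if not fds:
        return
    with ThreadPoolExecutor(max_workers=len(fds)) as pool:
        # list() forces completion and propagates any OSError
        list(pool.map(os.fsync, fds))
```

This works, but it's exactly the syscall- and thread-intensive pattern that an epoll-style grouping interface would make unnecessary - and it gives the filesystem no chance to optimize the flushes as a group.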
Maybe if the people whose job it is don't want to do it themselves then I'd even be willing to help.</p>Fri, 22 Nov 2013 12:30:00,2013-11-22:2013-11-fixing-fsync.htmlstoragelinuxThe "IOPS Myth" Myth<p>It's nice to see more people becoming aware that IOPS are not the be-all and end-all of storage performance. Unfortunately, as with all consciousness raising, the newest converts are often the ones that take things too far. Thus, we get extreme claims like <a href="">IOPS Are A Scam</a> or <a href="">Storage Myths: IOPS Matter</a>. What a surprise, that somebody who works for Violin would claim that it's all about latency, all the time. Let's get away from the extremists on both sides, and try to find the truth somewhere in between.</p> <p>As people who have actually worked on storage - and particularly storage performance - for a while know, different workloads present different needs and challenges. The grain of truth in the extremists' claims is that latency really is king for some applications. Those applications are the ones where I/O is both serialized and synchronous. Quick, how many of those do you run? Probably very few, quite possibly even fewer than you think, for a few different reasons.</p> <ul> <li> <p>Most modern applications have some internal parallelism. Therefore, their performance often is bound more by IOPS than by pure latency.</p> </li> <li> <p>If applications do asynchronous writes, then the I/O system responsible for ensuring their (eventual) completion can take advantage of parallelism even if the writes were issued sequentially. This is what's likely to happen every time you do a series of writes followed by fsync. It can be done all the way from the filesystem down to an individual disk sitting in another cabinet.</p> </li> <li> <p>Even applications that do serialized and synchronous writes often only do so some of the time - e.g. for logs/journals but not for main data. 
These applications are often latency-bound, but that doesn't mean low latency is necessary for every bit they store.</p> </li> </ul> <p>I've made the point myself, <a href="">quite recently</a>, that it's important to look at all aspects of storage performance, including predictability and behavior over time. That still doesn't mean that you should just pick one set of characteristics as "best" and leave it at that. You're going to be using many kinds of storage. Get used to it. For example, you might need low latency for 1% of your data that's written serially, high IOPS for the next 9% for data that's still warm but read/written in parallel, and neither (at lowest possible cost) for the cold remainder. In that middle part, low-latency storage would be overkill. What matters is how many IOPS you can get within a single system, to avoid the management and resource provisioning/migration headaches of having several. Thus, a high-IOPS system still has value even if it doesn't also offer low latency. If that weren't true, nobody would even consider using S3 or Swift let alone Glacier, since those all have <strong>terrible</strong> latency characteristics.</p> <p>In short, "latency is king" is the new "scale up" motto, but we mostly live in a "scale out" world. Yes, sure, there are situations where you just need a single super-fast widget, but <strong>much</strong> more often you need a whole bunch of more conventional widgets providing high aggregate throughput within a single system. Low latency and high IOPS are entirely complementary goals. Just as there have been valid uses for both mainframes and supercomputers since they started to diverge in the 70s, there are valid uses for both types of storage systems. 
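The arithmetic behind the parallelism point above is just Little's Law: sustained IOPS equals requests in flight divided by per-request latency, so parallelism can buy throughput that low latency alone never could. A quick illustration (the device numbers are invented for the example):

```python
def iops(concurrency, latency_s):
    """Little's Law: sustained throughput = requests in flight / latency."""
    return concurrency / latency_s

# A fully serialized app on a 100-microsecond flash device:
serial = iops(1, 100e-6)       # ~10,000 IOPS, strictly latency-bound
# A modestly parallel app on a 1-millisecond "slow" device:
parallel = iops(64, 1e-3)      # ~64,000 IOPS despite 10x worse latency
```

That's why a high-IOPS system has value even without the lowest latency: most workloads have enough concurrency to hide per-request delay.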
Those designing or selling one should not lightly dismiss the other, lest that lead to a discussion of who's merely picking components and who's solving hard algorithmic problems.</p>Thu, 21 Nov 2013 09:57:00,2013-11-21:2013-11-iops-myth.htmlstorageperformanceSecure Email<p>Ever since one of the talks at LISA, I've been thinking about secure email. My thoughts are nowhere near complete, but I need to get them out of my head and I do that by writing about them. Apologies in advance.</p> <p>I've actually been thinking for many years about how email should be overhauled. For at least twenty years the idea that the same message contents get stored over and over again for multiple users, even on the same system, has bugged me. Sure, nowadays we have deduplication, but that's a hack. At the time an email is sent, we know for almost zero cost and with absolute certainty that the body is the same for every recipient. Why rely on expensive and approximate deduplication to make up for the fact that we were too stupid to take advantage of that information within the email system itself? For those same twenty-plus years, I've been thinking about how to implement email by separating storage and notification. The message contents get stored <em>once</em> in a data store that's accessible to the sender and recipients, then pointers to those contents are sent separately. In fact, I would be surprised if large email services such as those run by Google or Yahoo don't work that way for messages sent among their own subscribers.</p> <p>Unfortunately, this approach is incompatible with the current email protocols such as IMAP and SMTP. They don't separate storage and notification that way. Sure, you can do it all in the servers, but then you have the same problem as with most cloud-storage services that do something similar: if the server has your ciphertext and your keys, they might as well have your cleartext. 
They can talk all they like about how carefully they manage those keys, but it's all bullshit. Some of us were talking about this years ago, and built systems like HekaFS to address it, but were largely ignored. If there's one good thing that has come out of the recent NSA/Google revelations, it's that people finally realize <strong>keys have to stay on the client side</strong>. Thank you, Edward Snowden.</p> <p>The way around this is to use a local proxy on the user's machine. On one side it speaks IMAP and/or SMTP. On the other, it speaks the protocols necessary to interact with our secure data store and notification system. This requires only a very tiny bit of extra configuration by the user to point their email program at the proxy instead of a regular server, but then it opens up a whole new world of possibilities that don't exist when trying to preserve legacy protocols throughout the system. Let's look at how this would work in the context of email between users of the same provider.</p> <ol> <li> <p>The sender's email client talks to their proxy, using local SMTP, to send a message.</p> </li> <li> <p>The sender's proxy generates a new symmetric encryption key and initialization vector (IV) and encrypts the message - including both the contents and the "envelope" metadata. It also generates an HMAC to protect against both corruption and tampering.</p> </li> <li> <p>The encrypted message, IV, and HMAC are stored in the provider's message store, yielding an ID. The message store can be pretty plain, or it can have all sorts of features to improve security. For example, if traffic analysis to match senders with receivers is a concern (and it should be) then the provider can implement techniques known from Freenet/Tahoe-LAFS to foil such attempts.</p> </li> <li> <p>Anybody who has the ID from the previous step can now retrieve the message, but it's still encrypted using a unique key. 
This key is <strong>not stored anywhere</strong> (except maybe on the sender's machine, but ideally not even then). What we do instead is construct a separate notifier for each recipient, encrypting the message ID and key using that particular recipient's public key.</p> </li> <li> <p>At this point, the recipient could be notified synchronously, connecting to them via SSL or similar. This provides the best forward secrecy, but also requires that the recipient be online to receive the notification. More often, the notifiers will need to be stored somewhere for later retrieval. In this case, we could use a second kind of distributed data store, much like the message store and with the same potential for additional code to foil traffic analysis etc. Each user is represented by an existing file or object, and sending a message is just a matter of appending a new notifier.</p> </li> <li> <p>Some time later, a recipient's email client talks to their proxy, this time using local IMAP, to check for messages.</p> </li> <li> <p>The recipient's proxy fetches their file/object from the notification store, and possibly truncates it back down to zero.</p> </li> <li> <p>For each notifier received, the proxy extracts the message ID and key, then uses them to fetch the corresponding message from the message store.</p> </li> <li> <p>Messages are decrypted and translated into IMAP responses to the recipient's email client, as needed.</p> </li> </ol> <p>This scheme seems as secure as anything I've heard described elsewhere, and neither hard to implement nor hard to use. The biggest problem with it that I can think of is garbage collection. To do that properly, objects in the message store would need to have reference counts, with an authenticated decrement protocol or some such. To start with, I'd probably just avoid that by saying that messages have expiration dates. The provider's guarantee of security matches their guarantee of persistence.
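The encrypt-store-notify flow described above (steps 2 through 4, plus the recipient side in steps 8 and 9) can be sketched in a few dozen lines. Everything here is illustrative only: the HMAC-based keystream is a toy stand-in for a real AEAD cipher, and `wrap_for`/`unwrap_for` are placeholders for real public-key encryption under the recipient's key. Do not mistake this sketch for actual cryptography.

```python
import hashlib, hmac, secrets

def keystream(key, iv, n):
    """Toy CTR-style keystream -- a placeholder for a real AEAD scheme."""
    out, counter = b"", 0
    while len(out) < n:
        out += hmac.new(key, iv + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:n]

def xor(data, ks):
    return bytes(a ^ b for a, b in zip(data, ks))

MESSAGE_STORE = {}   # the provider side: it sees only ciphertext

def wrap_for(recipient, payload):      # placeholder for public-key encryption
    return payload

def unwrap_for(recipient, notifier):   # placeholder for private-key decryption
    return notifier

def send(plaintext, recipients):
    key, iv = secrets.token_bytes(32), secrets.token_bytes(16)   # step 2
    ct = xor(plaintext, keystream(key, iv, len(plaintext)))
    mac = hmac.new(key, iv + ct, hashlib.sha256).digest()
    msg_id = hashlib.sha256(ct).hexdigest()
    MESSAGE_STORE[msg_id] = (iv, ct, mac)                        # step 3
    # step 4: one notifier per recipient, carrying the ID and key
    return [{"to": r, "notifier": wrap_for(r, (msg_id, key))} for r in recipients]

def receive(recipient, notifier):
    msg_id, key = unwrap_for(recipient, notifier)                # steps 8-9
    iv, ct, mac = MESSAGE_STORE[msg_id]
    assert hmac.compare_digest(mac, hmac.new(key, iv + ct, hashlib.sha256).digest())
    return xor(ct, keystream(key, iv, len(ct)))
```

Note the key property the sketch preserves: the message body is stored once, the store never sees a key, and each recipient gets only a small notifier.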
If you don't fetch your messages before they expire, too bad. If you want to keep copies longer, then you have to fetch and store them separately, assuming responsibility for securing the copy (or perhaps that's a separate service offered by the same provider).</p> <p>That's all great within a single provider. How well does it extend out to many providers like we have in the real world? Not that well, unfortunately, but I think that's OK. Just having truly secure email within one provider would be useful. It doesn't seem all that hard to come up with new protocols between providers, allowing them properly controlled access to each other's message and notification stores. Thus, providers that use such protocols could create a whole secure-email ecosystem. Perhaps this is what Lavabit and Silent Circle are already doing within the <a href="">Dark Mail Alliance</a>, but they're being awfully quiet about the details. The key is that secure email practically has to be a <em>separate</em> ecosystem from the email we already have. A lot of the user-facing parts can still be used without too much trouble, but the entire transmission and storage infrastructure will have to change. While I'm sure people can poke all sorts of holes in what I've outlined above, perhaps something in it will provoke some productive thought. The time for keeping ideas in this area to ourselves is over.</p>Thu, 14 Nov 2013 12:26:00,2013-11-14:2013-11-secure-email.htmlsecuritycommunicationsMoot Comments<p>I have a couple of posts coming up where I'll be soliciting feedback, so it's time to implement blog comments again. After looking at the alternatives, I eventually decided that <a href="">Moot</a> had the best combination of features for me (as the guy who has to integrate them) and my users. As it turns out, integrating Moot for all posts including those in the past was laughably easy. Kudos to them for a job well done.</p> <p>I have no idea how long this will last. 
Most of the comments I've received in the past have seemed more "in the moment" than "part of the permanent record" anyway, so I hope it's OK to be explicit that <strong>comments here are ephemeral</strong>. I make no promise to preserve them, either individually or entirely, but maybe they're more convenient (and less spam-prone) than email. Let me know . . . in the comments. ;)</p>Thu, 14 Nov 2013 10:29:00,2013-11-14:2013-11-moot-comments.htmlmootcommentsComedic Open Storage<p>I've written before about some people's <a href="">mania</a> for object storage as an alternative to blocks and files. It's a valid model, but I do think its benefits are being pretty drastically oversold. Often there's a lot of FUD about distributed filesystems in particular, from people who clearly don't know the details about what features they have or how they work. As a result, even though some people seem pretty excited about Seagate's new <a href="">Kinetic Open Storage</a> initiative, I approached it with a bit more skepticism. Here's the short version.</p> <ul> <li> <p>It's great that somebody's implementing object storage at this level.</p> </li> <li> <p>This particular implementation is a joke.</p> </li> </ul> <p>I'm not just being nasty for no reason. There's a very real danger, with a technology like this, of early implementations over-promising and under-delivering so badly that by the time a good implementation comes along nobody can get over the bad taste in their mouth from the last version. That's what happened in distributed filesystems twenty years ago. Even though things have improved since then, there are still plenty of people who've never moved past "those things don't work" and don't even do the most basic research into the current state of the art before they go off and implement their own crappy incompatible almost-filesystem storage layers. I <strong>don't</strong> want object storage to be abandoned like that. 
I want it to succeed, but to do that it has to offer a better value proposition.</p> <p>Before I start talking about the ways KOS falls short, I have to say up front that I'm talking about details, and the documentation so far almost seems intended to obscure those details. The wiki is long on rhetoric, short on information. For example, I had to dig a bit to find the maximum size of a key (a potentially wasteful 4KB), and I still haven't found the maximum size of an object. So I cloned the preview <a href="">repository</a> and found a big steaming pile of javadoc. It's not even the good kind of javadoc; it's a lot more of the "bytearray: an array of bytes" boiler-plate kind. So I might actually be wrong about some of the details. If so, I'll update appropriately.</p> <p>My first objection has to do with NIH syndrome. After all, these ideas first reached prominence with Garth Gibson's <a href="">NASD</a> back in 1999, and later influenced the ANSI T10 object-storage standard. Back when it was still a PhD thesis, Ceph used a similar model called EBOFS (since abandoned in favor of btrfs), and there are others as well. Instead of building on - or even acknowledging - these predecessors, Seagate went off and developed Yet Another Object Storage API. Then, instead of documenting wire formats and noting differences vs. things people might already know, they just threw a Java library over the wall. Nice.</p> <p>The second objection is security. There's a reason the S in NASD stands for Secure. If you want to gang a bunch of these devices together as the basis for a multi-user or multi-tenant distributed system, you'd better think hard about how to handle security. Apparently KOS didn't. There's some fluff about on-disk encryption, but nothing about key management, connection security, the actual semantics of their ACLs, etc.
This information is not just "nice to have"; it's absolutely essential before developers can even begin to reason about the system they'll be coding for.</p> <p>My third and most serious objection has to do with supporting only whole-object GET and PUT operations. That's fine for a key/value store or a deep archival store (the very opposite of "kinetic" BTW) but for anything else it's awful. If the objects can be very large, then updating any part of one involves a horrendous read/modify/write cycle. If they're kept small, then a higher level has to deal with the mapping from larger user-visible objects to smaller Kinetic objects. If there are multiple clients - and when are there not? - then there are some pretty serious coordination problems involved, and apparently not even a "conditional put" to help deal with the obvious race conditions. Instead of abstracting away the details and difficulties of modifying a single byte within an object (the original NASD vision), KOS requires the involvement of a robust coordination layer for even the simplest operations. Building cluster filesystems on top of shared block devices didn't work too well when the blocks were fixed size. Variable-sized blocks with 4KB keys don't change the equation much.</p> <p>As far as I can tell, this project does very little to help distributed-storage users and developers to meet their needs. Instead it creates false differentiation, disrupting for the sake of disruption or perhaps trying to justify higher margins in a cut-throat industry. It's like a double agent in the object-storage camp, potentially sabotaging others' efforts to have that vision accepted in the broader market.</p>Thu, 24 Oct 2013 11:32:00,2013-10-24:2013-10-comedic-open-storage.htmlstorageWhy You Don't Need STONITH<p>(This started as a <a href="">Hacker News discussion</a> about an <a href="">article on Advogato</a>. 
The article's title/premise is "Why You Need STONITH", where "STONITH" means "Shoot The Other Node In The Head" and is an important concept in old-school HA. I might even have been present when the acronym was coined, after having used a similar one at CLaM.)</p> <p>I was working on HA software in 1992. Specifically, I was working on the software from which Linux-HA copied all of its terminology and basic architecture. We ourselves were not the first, and often found ourselves copying things done even earlier at DEC, so I'm not complaining, but I want to make the point that this article from 2010 is actually a rehash of a much older conversation. As cute as the metaphor is, it gets two things seriously wrong.</p> <p>(1) Fencing and STONITH are not the same thing. Fencing is shutting off access to a shared resource (e.g. a LUN on a disk array) from another possibly contending node. STONITH is shutting down the possibly contending node itself. They're quite different in both implementation and operational significance. Using the two terms as though they're interchangeable only sows confusion.</p> <p>(2) You only need STONITH if you have the aforementioned possibly contending nodes - in other words, only if the same resource can be provided by/through either node. If the resources provided by each node are known to be different, as e.g. in any of the systems derived from Dynamo, then STONITH is not necessary.</p> <p>To elaborate on that second point, the problem STONITH addresses is one of mutual exclusion. It might not be safe for the resource to be available through two nodes, because it could lead to inconsistency or because they can't both do a proper job of it simultaneously. As in other contexts, mutual exclusion is a useful primitive but often not the optimal one to use. In general it's better to avoid it by avoiding the kinds of resource sharing that make it necessary.
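The distinction in (1) is easy to show in code. This is a toy sketch - every name in it is hypothetical, not from Linux-HA or any other real package - but it makes the point that the two operations act on entirely different objects:

```python
# Hypothetical cluster-manager hooks. Fencing targets the shared
# resource; STONITH targets the node.

class Lun:
    """A shared disk-array LUN that tracks which nodes may access it."""
    def __init__(self):
        self.allowed = set()
    def grant_access(self, node):
        self.allowed.add(node.name)
    def revoke_access(self, node):
        self.allowed.discard(node.name)

class Node:
    def __init__(self, name):
        self.name = name
        self.powered = True
    def power_off(self):
        self.powered = False

def fence(node, lun):
    # Fencing acts on the *resource*: the suspect node keeps running,
    # but can no longer touch the shared LUN (think SCSI reservations
    # or a switch zoning change).
    lun.revoke_access(node)

def stonith(node):
    # STONITH acts on the *node*: shut it down outright, typically via
    # an out-of-band power controller (IPMI, a smart PDU).
    node.power_off()
```

And point (2) falls out of the same sketch: in a shared-nothing design there is no `Lun` with an access set for two nodes to fight over, so neither call is needed in the first place.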
That's why "shared nothing" is the most common model for such systems designed in the last decade or more, and they don't need STONITH unless they've screwed up by not fully distributing some component (such as a metadata server for a distributed filesystem).</p>Tue, 22 Oct 2013 09:26:00,2013-10-22:2013-10-stonith.htmlarchitectureLeaning Out<p>In April of '89 I left my family and friends to move from Michigan to Massachusetts for a programming job. The new job paid twice as much as my first programming job had, which means three times as much as I was making since that company laid me off, so it seemed like a pretty big step in my then-new career. I hit the road in a used Ford Escort that I'd just bought for $900, and which barely survived the 800-mile trip. Stupid piece of junk.</p> <p>Since then I've worked at a bunch of companies. Red Hat is the only one of those that I joined when it was already large. Of the ten startups (including the one in Michigan) none went to an IPO. One was the subject of a moderately successful acquisition (Conley by EMC). Four more were moderately successful for a while in some niche, and one of those is still going. The rest were either acquired at fire-sale prices or just sank without a trace (but only one while I was still on board). In other words, I did better than average. As much as we all like to dream about billion-dollar exits, the grim reality is that most startups fail suddenly and completely, leaving employees in the lurch.</p> <p>It's a good thing there are plenty of other reasons to work at startups. You get to learn a lot that way, both technically and otherwise, in a short time. You'll almost certainly be given more responsibility and more freedom than you would at a larger company. You're more likely to work on cutting-edge technology (though not every large company is as far behind as some would have you believe). 
Startups provide a lot of opportunity in a very energizing environment.</p> <p>Unfortunately, anything that's energizing in the short term is likely to become tiring in the long term, and that's what I'm here to write about today. While I'm not at a startup any more, that's how I got to where I am today. It's how I got to where I am in terms of being hired at Red Hat to start one project and then become an architect on another. It's also how I got to be kind of burned out, not just on my current job but on programming in general. 24 years is a long time - six bachelor's degrees, two of them entirely at specific companies. To explain how that feels, I'll start with a quote from The Hobbit.</p> <blockquote> <p>Now it is a strange thing, but things that are good to have and days that are good to spend are soon told about, and not much to listen to; while things that are uncomfortable, palpitating, and even gruesome, may make a good tale, and take a good deal of telling anyway.</p> </blockquote> <p>The funny thing about working on any project is that you forget all the good parts. Everything that was initially cool about what it did - and how - becomes so familiar that it's forgotten or taken for granted. Meanwhile, every architectural or design flaw you ever noticed still seems to be there. Bugs get fixed, but every troublesome module or interface is still troublesome. Every critical feature that was missing years ago is still missing, after being put off over and over again while other people's <em>stupid</em> ideas always jump to the head of the queue. After a while, you see nothing good and everything bad. It's an unfortunate quirk of human nature.</p> <p>I must emphasize that this phenomenon isn't because of the code. It's a change in a person's relationship to code. I've worked on enough projects to know it's not just one, and I've talked to enough other developers to know it's not just me. 
Familiarity truly does breed contempt when it comes to code, and outsiders could be forgiven for thinking that "this code sucks" is the official motto of our profession. For what it's worth, I've also changed jobs enough times to know that's not necessarily the solution. Remember, I've done that ten times already. The grass isn't really greener, and the same problems tend to reappear everywhere. Eventually, disenchantment with one particular project becomes disenchantment with programming in general, or at least with programming within a particular domain. That's when the feeling of being trapped really sets in.</p> <p>This steady deterioration doesn't just apply to code, either. Similar processes occur with respect to the social and organizational aspects of programming as well. I have a lot more to say about burnout in general (I even have a long post half written about it) but I'll leave that for another day. Today's post is about how I'm trying to fix it. Basically, I have three things I need to do.</p> <ul> <li> <p>Reduce my pace. When I was young, I thought I could sprint forever, but this is a marathon and nobody can sprint at that distance. Nobody. Also, the hills get larger.</p> </li> <li> <p>Catch up on my personal life. Re-start my exercise program, de-clutter and fix stuff around the house, have lunch with friends, play with my daughter. All the stuff I've been too busy, or too tired, or too grumpy to do enough of.</p> </li> <li> <p>Get some perspective. I need to re-familiarize myself with what else is out there in the computing world, not just as a matter of education and keeping up but also to remember what's good and cool about the project I'm on.</p> </li> </ul> <p>This leads to two concrete actions. First, I'm taking a break. I have already asked for, and been given permission for, a month off. That starts this next Monday, and ends in mid-November (right after <a href="">LISA'13</a>).
Technically, I'm not even supposed to check email during that time. If I do anything technical at all before LISA, it will probably be on things very far removed from my usual work. Maybe I'll learn Go or Javascript, learn how to write a modern web page or perhaps even a game. I need to do something that <em>doesn't seem like work</em>.</p> <p>After the break, I'm not coming back full time. Like a recuperating patient, I'm not just going to dive right back in trying to do everything I did before. I'm going to take things a bit slower, at least for a while. Then again, it could be permanent. We'll see. The important thing is that I get to try, and I can't say that without acknowledging the role my bosses at Red Hat have played. They haven't just grudgingly or passively allowed me to do this. They have encouraged me, offered valuable suggestions, and agreed to terms far more favorable than I could have hoped for. Kudos to them, and to the company.</p> <p>I don't know how many people can consider the sort of actions I'm taking, but I will say this: beware of burnout. It <em>will</em> creep up on you, no matter how immune you think you are when you're still early in your career. You can't ignore it. You can take positive steps to avoid it, or you can fall prey to it. Don't be one of those people who lose their family or their sanity first.</p>Thu, 10 Oct 2013 20:01:00,2013-10-10:2013-10-leaning-out.htmlworkingModel Checking<p>Model checking is one of the most effective tools available for reducing the prevalence of bugs in highly concurrent code. Nonetheless, a surprising number of even very smart and very senior software developers and architects don't seem to know about it. Of the many such people I've worked with over the years, maybe one in ten have even heard of it, and I can count on one hand the number who've appreciated its value. Seems like a good subject for a blog post, then.
;) Let's start with what the heck it is.</p> <div class="highlight"><pre>Model checking is a technique for verifying finite state concurrent systems such as sequential circuit designs and communication protocols. </pre></div> <p>That's from the blurb for <a href="">Model Checking</a> - the seminal book on the subject by Clarke, Grumberg, and Peled. The way model checking works is by generating states within a system according to rules you specify (the model), and checking them against conditions that you also specify to ensure that invalid states never occur. Some model checkers also check for deadlock and livelock that might preclude reaching a valid final state, but that's not essential. It should be pretty obvious that the number of states even in a fairly simple system can be quite large, so many of the tools also do things like symmetry reduction or Monte Carlo sampling as well.</p> <p>My favorite set of tools in this space is the <a href="">Murphi</a> family, of which <a href="">CMurphi</a> is the one that has been most usable for me recently. Like many such tools, Murphi requires that you specify your model in a language that they describe as Pascal-like but which to my eye looks even more Ada-like. That's really not as awful as it sounds. I've actually found writing Murphi code quite enjoyable every time I've done it. The fact that the model is not written in the same language as the implementation is a known problem in the field. On the one hand, it creates a very strong possibility that the model will be correct but the separate implementation will not, reducing the value of the entire effort. On the other hand, traditional languages struggle to express the kinds of things that a model checker needs (and furthermore to work efficiently).
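To make the generate-and-check loop concrete, here's a toy explicit-state checker in Python, doing very naively what Murphi does at scale. The model is hypothetical - two clients acquiring a lock with a non-atomic read-then-set - but it shows the essential structure: a start state, rules that yield successor states, an invariant, and a breadth-first search over everything reachable.

```python
from collections import deque

# State = (per-client phases, lock value). The bug being modeled: a
# client reads the lock as free, then sets it later without noticing
# any intervening change.
START = (("idle", "idle"), 0)

def rules(state):
    # Yield every state reachable from `state` in one step.
    phases, lock = state
    for i in (0, 1):
        if phases[i] == "idle" and lock == 0:
            # The client observes the lock as free...
            yield (tuple("saw_free" if j == i else p
                         for j, p in enumerate(phases)), lock)
        if phases[i] == "saw_free":
            # ...and later sets it, non-atomically.
            yield (tuple("holding" if j == i else p
                         for j, p in enumerate(phases)), 1)

def invariant(state):
    # Mutual exclusion: at most one client may hold the lock.
    return state[0].count("holding") <= 1

def check(start):
    # Breadth-first search over all reachable states.
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state              # counterexample found
        for nxt in rules(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None                       # invariant holds everywhere

counterexample = check(START)
```

Running `check(START)` turns up the state where both clients hold the lock. Real tools differ from this sketch mainly in scale: hash-compacted state sets, counterexample traces, and the reductions mentioned above, because `seen` blows up long before the model gets interesting.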
I tried to write a real-code model checker <a href="">once</a>, and didn't get very far.</p> <p>To give you some idea of why it's so hard for model checkers to do what they do, I'll use an example from my own recent experience. I'm developing a new kind of replication for GlusterFS. To make sure the protocol behaved correctly even across multiple failures, I developed a Murphi model for it. This model - consisting of state definitions, rules for transitions between states, and invariant conditions to be checked - comes to 550 lines (72 blank or comments). Running this simple model generates the following figures.</p> <div class="highlight"><pre>172838 states
468981 rules
10.60 seconds
</pre></div> <p>That's for a simple protocol, with a small problem size - three nodes, three writes, two failures. The model was also relentlessly optimized, e.g. eliminating states that Murphi would see as different only because of fields that would never be used again. Still, that's a lot of states. When I introduced a fourth write, the run time tripled. When I introduced a fourth node, I let it run for five minutes (3M states and 10M transitions) but it still showed no signs of starting to converge so I killed it. BTW, I forgot to mention that the model contains five known shortcuts to make it checkable, plus probably at least as many more shortcuts I didn't even realize I wasn't taking.
I actually thought I'd done pretty well this time, with nothing that I could characterize as an out-and-out <strong>bug</strong> in the protocol. Sure, there were things that turned out to be missing, such that out of five allowable implementations only one would actually be bug free, but I still thought the exercise was worth it. Then I added a third failure. I didn't expect a three-node system to continue working if more than one of those were concurrent (the model allows the failures to be any mix of sequential and concurrent), but I expected it to fail cleanly without reaching an invalid state. Surprise! It managed to produce a case where a reader can observe values that go back in time. This might not make much sense without knowing the protocol involved, but it might give some idea of the crazy conditions a model checker will find that you couldn't possibly have considered.</p> <div class="highlight"><pre>write #1 happens while node A is the leader
B fails immediately
C completes the write
read #1 happens while A isn&#39;t finished yet (but reads newer value)
A fails
B comes back up, becomes leader
C fails while B is still figuring out what went on
A comes back up
read #2 happens, gets older value from B
</pre></div> <p>So now I have a bug to fix, and that's a good thing. Clearly, it involves a very specific set of ill-timed reads, writes, and failures. Could I have found it by inspection or ad-hoc analysis? Hell, no. Could I have found it by testing on live systems? Maybe, eventually, but it probably would have taken months for this particular combination to occur on its own. Forcing it to occur would require a lot of extra code, plus an exerciser that would amount to a model checker running 100x slower across machines than Murphi does. With enough real deployments over enough time it would have happened, but the only feasible way to prevent that was with model checking. Try it.</p> <p>P.S.
I fixed the bug.</p>Fri, 27 Sep 2013 15:24:00,2013-09-27:2013-09-model-checking.htmlprocessSAN Stalwarts and Wistful Thinking<p>I've often said that open-source distributed storage solutions such as GlusterFS and Ceph are on the same side in a war against more centralized proprietary solutions, and that we have to finish that war before we start fighting over the spoils. Most recently I said that on Hacker News, in <a href="">response</a> to what I saw as a very misleading evaluation of GlusterFS as it relates to OpenStack. In some of the ensuing Twitter discussion, <a href="">Ian Colle</a> alerted me to an article by Randy Bias entitled <a href="">Converged Storage, Wishful Thinking &amp; Reality</a>. Ian is a Ceph/Inktank guy, so he's an ally in that first war. Randy presents himself as being on that side too, but when you really look at what he's saying it's pretty clear he's on the other team. To see why, let's look at the skeleton of his argument.</p> <ul> <li> <p>"Elastic block storage" is a good replacement for traditional SAN/NAS.</p> </li> <li> <p>"Distributed storage" promises to replace <strong>everything</strong> but can't.</p> </li> <li> <p>The CAP theorem is real, failures are common, and distributed storage doesn't account for that.</p> </li> </ul> <p>The first two points are hopelessly muddled by his choice of terms. When people in this space hear "elastic block storage" they're likely to think it means Amazon's EBS. However, Amazon's EBS <strong>is</strong> distributed storage. Try to read the following as though Randy means Amazon EBS.</p> <blockquote> <p>Elastic Block Storage (EBS) is simply an approach to abstracting away SAN/NAS storage {from page 4}</p> <p>Elastic block storage is neither magic nor special. It’s SAN resource pooling. {from <a href="">Twitter</a>}</p> </blockquote> <p>That conflicts with everything else I've heard about Amazon EBS.
I even interviewed for that team once, and they sure seemed to be asking a lot of questions that they wouldn't have bothered with if EBS weren't distributed storage. Amazon's own official <a href="">description of EBS</a> bears this out.</p> <blockquote> <p>Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component.</p> </blockquote> <p>Servers, eh? Not arrays. That sounds a lot like distributed storage, and very unlike the "SAN resource pooling" Randy talks about. Clearly he's not talking about Amazon EBS vs. distributed storage because one is a subset of the other. What he's really talking about is SAN-based vs. distributed <strong>block</strong> storage. In other words, his first point is that SAN hardware repackaged as "elastic block storage" can displace SAN hardware sold as itself. Yeah, when you cut through all of the terminological insanity it ends up sounding rather silly to me too.</p> <p>Randy's second point is that users need multiple tiers of storage, and a distributed storage system that satisfies the "tier 1" (lowest latency) role would be a poor fit for the others. Well, duh. The same is true of his alternative. The fundamental problem here seems to be an assumption that each storage technology can only be deployed one way. That's kind of true for proprietary storage systems, where configurations are tightly restricted, but open storage systems are far more malleable. You can buy different hardware and configure it different ways to satisfy different needs. If you want low latency you do one thing and if you want low cost you do another, but it's all the same technology and the same operations toolset either way. 
That's much better than deploying two or more fundamentally different storage-software stacks just because they're tied to different hardware.</p> <p>The homogeneity assumption is especially apparent in Randy's discussion of moving data between tiers (HSM) as though it's something distributed storage can't do. In fact, there's nothing precluding it at all, and it even seems like an obvious evolution of mechanisms we already have. With features like GlusterFS's upcoming <a href="">data classification</a> you'll be able to combine disparate types of storage and migrate automatically between them, <strong>if you want to</strong> and according to policies you specify. Again, this can be done better in a single framework than by mashing together disparate systems and slathering another management layer on top.</p> <p>Lastly, let's talk about CAP. Randy makes a big deal of the CAP theorem and massive failure domains, leading to this turd of a conclusion:</p> <blockquote> <p>Distributed storage systems solve the scale-out problem, but they don’t solve the failure domain problem. Instead, they make the failure domain much larger</p> </blockquote> <p>Where the heck does that idea come from? I mean, seriously, WTF? I think I'm on pretty safe ground when I say that I know a bit about CAP and failure domains. So do many of my colleagues on my own or similar projects. The fact is that distributed storage systems are <strong>very</strong> CAP-aware. One of the main tenets of CAPology is that no network is immune to partitions, and that includes the networks inside Randy's spastic block storage. Does he seriously believe a traditional SAN or NAS box will keep serving requests when their internal communications fail? Of course not, and the reason is very simple: they're distributed storage too, just wrapped in tin. We all talk to each other at the same conferences. We're all bringing the same awareness and the same algorithms to bear on the same problems. 
Contrary to Randy's claim, the failure domains are exactly the same size relative to TB or IOPS served. The difference is in the quality of the network implementation and the system software that responds to network events, not in the basic storage model. Open-source distributed storage lets you build essentially the same network and run essentially the same algorithms on it, without paying twice as much for some sheet metal and a nameplate.</p> <p>In conclusion, then, Randy's argument about storage diversity and tiering is bollocks. His argument about CAP and failure domains is something even more fragrant. People who continue to tout SANs as a necessary component of a complete storage system are only serving SAN vendors' interests - not users'.</p>Wed, 11 Sep 2013 11:38:00,2013-09-11:2013-09-wistful-thinking.htmlstorageglusterfsmarketingStanding Desks<p>A while ago, I got an <a href="">Ergotron WorkFit-S</a> sit/stand monitor mount. I love it, and have talked about it to plenty of people. Yesterday I joined a <a href="">Hacker News discussion</a> about standing desks, and it left me with some thoughts that I'd rather share here than there, so here goes.</p> <p>Some people have suggested a drafting table plus a high chair as a cheaper alternative. I did consider that approach, but I don't think it's so much a direct alternative as a fundamentally different thing. I don't just want one big flat surface. I have a desk already, which wraps around slightly and has a pedestal on one side (nominally for a printer, but it also served quite well as my original standing-desk solution). It even has things like doors, drawers, and shelves. Likewise, I don't want to sit on a high chair. When I do sit, I like a full-height back and feet on the floor; a wrap-around bar doesn't suffice. I also like to swivel a bit, and that can be downright dangerous on one of those.
Buying a new desk and a new chair and a new separate drawer unit would almost certainly cost more than my current setup and still be less comfortable. When the cost of the things I already have is subtracted, the cost difference is even larger.</p> <p>Speaking of cost, the HN thread also included the absurd claim that a motorized desk would have been cheaper. Maybe that's true in a sense, but here's a basic fact: adding a motor to something never made it cheaper. If motorized desk X is cheaper than manual-adjust desk Y, it's because X is <strong>inherently</strong> cheaper than Y by enough to offset the additional cost of the motor. That's reflected in cheaper materials, cheaper construction, missing features, and so on. The WorkFit-S consists of machined metal, welds/coatings where appropriate, and high-density wood composite for the hinged keyboard tray. Sure, you can spend less for something made of thin untreated metal and plastic held together with screws, but that's apples to cherries. Everything I looked at that was <strong>at all</strong> comparable in terms of build quality cost 2-4x as much. They were also all in the drafting-table category, with all of the other drawbacks noted above. Really, this option only makes sense for the disabled. Anyone who's just too lazy to raise and lower their monitor manually should stop making stuff up to rationalize their choice.</p> <p>If you're building a home office from scratch, and you <strong>prefer</strong> a different kind of setup than I have, that's great. Comfort is important, so even if it costs a little more you should go for it. On the other hand, if you already have a desk/chair that work for you and don't have a clear preference (based on experience) for the drafting-table approach, I really would suggest ignoring those who just want you to follow the fashion. I <strong>definitely</strong> don't recommend going straight from a sitting setup to a standing-only setup.
That was the problem with my original ad-hoc approach, and the adjustability that led me to the WorkFit in the first place makes even more of a difference than I expected. Two modes are better than one.</p>Wed, 28 Aug 2013 09:16:00,2013-08-28:2013-08-standing-desks.htmlworkingLocal Filesystems Suck<p>Distributed filesystems represent an important use case for local filesystems. Local-filesystem developers can't seem to deal with that. That, in a nutshell, is one of the most annoying things about working on distributed filesystems. Sure, there are lots of fundamental algorithmic problems. Sure, networking stuff can be difficult too. However, those problems are "natural" and not caused by the active reality-rejection that afflicts a related community. Even when I was in the same group as the world's largest collection of local-filesystem developers, with the same boss, it was often hard to get past the belief that software dealing with storage in user space was just A Thing That Should Not Exist and therefore its needs could be ignored. That's an anti-progress attitude.</p> <p>So, what are the problems with local filesystems? Many of the problems I'm going to talk about aren't actually in the local filesystems themselves - they're in the generic VFS layer, or in POSIX itself - but evolution of those things is almost entirely driven by local-filesystem needs so that's a distinction without a difference. Let's look at some examples from my own recent experience.</p> <ul> <li> <p>Both the interfaces and underlying semantics for extended attributes still vary across filesystems and operating systems, despite their usefulness and the obvious benefits of converging on a single set of answers. 
This is true even for the most basic operations; if you want to do something "exotic" like set multiple xattrs at once, you have to use truly FS-specific calls.</p> </li> <li> <p>Mechanisms to deallocate/discard/trim/hole-punch unused space still haven't converged, after $toomany years of being practically essential to deal with SSDs and thin provisioning.</p> </li> <li> <p>Ditto for overlay/union mounts, which have been worked on for years to no useful result. There's a pattern here.</p> </li> <li> <p>The readdir interface is just totally bogus. Besides being barely usable and inefficient, besides having the worst possible consistency model for concurrent reads and writes, it poses a particular problem for distributed filesystems layered on top of their local cousins. It requires the user to remember and return N bits with every call, instead of using a real cursor abstraction. Then the local filesystem at the far end gets to use that N bits however it wants. This leaves a distributed filesystem in between, constrained by its interfaces to that same N bits, with zero left for itself. That means distributed filesystems have to do all sorts of gymnastics to do the same things that local filesystems can do trivially.</p> </li> <li> <p>Too often, local filesystems implement enhancements (such as aggressive preallocation and request batching) that look great in benchmarks but are actually harmful for real workloads and especially for distributed filesystems. There's another big pile of unnecessary work shoved onto other people.</p> </li> <li> <p>It's ridiculously hard to make even such a simple and common operation as renaming a file atomic. Here's the <a href="">magic formula</a> that almost nobody knows.</p> </li> </ul> <p>The last point above relates to the really problematic issue: very poor support for specifying things like ordering and durability of requests without taking out the Big Hammer of forcing synchronous operations. 
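For the record, the atomic-rename formula mentioned in the list above looks roughly like this. This is a hedged sketch (the helper name is mine, not from any particular codebase), and note that fsync on a directory file descriptor is Linux behavior:

```python
import os

def atomic_replace(path, data):
    """Sketch of the crash-safe rename formula: write a temp file in the
    same directory, fsync it, rename over the target, then fsync the
    directory so the rename itself reaches stable storage."""
    dirpath = os.path.dirname(path) or "."
    tmp = os.path.join(dirpath, "." + os.path.basename(path) + ".tmp")
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # step 1: data blocks on stable storage
    finally:
        os.close(fd)
    os.rename(tmp, path)          # step 2: atomic within one filesystem
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)             # step 3: make the new directory entry durable
    finally:
        os.close(dfd)
```

Skip step 1 and a crash can leave a zero-length file under the new name; skip step 3 and the rename itself can be lost. Notice that the only ordering tool available at each step is a full synchronous flush, which is exactly the Big Hammer problem.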
By the time we get a request, it has already been cached and buffered and coalesced and so on all to hell and back by the client. Those games have already been played, so our responsibility is to provide <strong>immediate</strong> durability, while respecting operation order, with minimal performance impact. It's a tall order at the best of times, but the paucity of support from local filesystems makes it far worse.</p> <p>In a previous life, I worked on some SCSI drivers. There, we had tagged command queuing, which was a bit of a pain sometimes but offered excellent control over which requests overlapped or followed which others. With careful management of your tags and queues, you could enforce the strictest order or provide maximum parallelism or make any tradeoff in between. So what does the "higher level" filesystem interface provide? We get fsync, sync, O_SYNC, O_DIRECT and AIO. That might be enough, except...</p> <ul> <li> <p>Fsync is pretty broken in most local filesystems. The "stop the world" entanglement problems in ext4 are pretty <a href="">well known</a>. What's less well known is that XFS (motto: "at least it's not ext4") has essentially the same problem. An fsync forces everything queued <strong>internally</strong> before it to complete, but that's completely useless to an application which still gets no useful information about which other file descriptors no longer need their own fsync. The pattern continues even when you look further afield.</p> </li> <li> <p>O_SYNC has essentially the same problems as fsync, and sync is <strong>defined</strong> to require "stop the world" behavior.</p> </li> <li> <p>O_DIRECT throws away <strong>too much</strong> functionality. 
Sure, we don't want write-back, but a write-through cache would still be nice for subsequent reads and O_DIRECT eliminates even that.</p> </li> <li> <p>AIO still uses a thread pool behind the scenes on Linux, unless you use a low-level interface that even its developers admit isn't ready for prime time, so it fails the efficiency requirement.</p> </li> </ul> <p>Implementing a correct and efficient server is way harder than it needs to be when all you have to work with is broken fsync, broken O_DIRECT, and broken AIO. Apparently btrfs tries to get some of this stuff right, thanks to the Ceph folks, but even they balked at trying to make those changes more generic, so unless you want to use btrfs you're still out of luck. That's why I return the local-filesystem developers' contempt, plus interest. Virtualization systems, databases, and other software all have many of the same needs as distributed filesystems, for many of the same reasons, and are also ignored by the grognards who continue to optimize for synthetic workloads that weren't even realistic twenty years ago. While I still believe that the POSIX abstraction is far from being obsolete, pretty soon it might not be possible to say the same about the people most involved with implementing or improving it.</p>Mon, 26 Aug 2013 12:42:00,2013-08-26:2013-08-local-filesystems-suck.htmlstoragelinuxTechnical Credit<p>To a first approximation, "software engineering" refers to all of the things you need to know when you take "programming" and try to scale it up - more code, more people, more time. You don't need a civil engineer to dig a latrine, but you'd better have one to design a sewer system for an entire city. Likewise, you don't need a software engineer to write a small one-off script, but you will if you're designing a distributed filesystem. Many of the discussions about "everybody should learn to program" seem to go in circles because the participants don't acknowledge this distinction. 
Applying even a common skill often requires a new level of professional rigor when done at a larger scale.</p> <p>One of the things you learn about as a software engineer, and not "just" a programmer, is <a href="">technical debt</a>. To boil a complex idea down a bit, this is the idea that, as code evolves, it accumulates flaws. I don't mean flaws in the sense of things that behave incorrectly, but flaws that make the code harder to work on - messy (often duplicated) code paths and data structures, "impedance mismatches" that force one piece of code to compensate for another's strangeness (with the effects usually rippling ever outward), and so on. Over time, you end up with code that does what it does pretty well, but can barely be persuaded or coerced to do anything else and ultimately becomes obsolete.</p> <p>If technical debt is stuff that slows you down, technical credit is stuff that speeds you up. Just as technical debt often consists of things forced into a release for the sake of expediency (with an intent to go back and "do it right" later), technical credit consists of things that are left out of a release for the sake of caution (often it's the "do it right" version that wasn't ready in time). Specifically, it's a library of solved problems in the form of prototypes, code snippets, or designs in various stages of completion. As a project progresses, these previously isolated bits of hard-won knowledge can be pressed into immediate service, instead of having to wait for solutions to be developed from scratch. Technical credit is thus primarily a risk reduction or mitigation strategy. If you have ten solutions for various hard problems "on the shelf" and you only need five, the benefit in development velocity might far outweigh the cost of having developed another five solutions you don't need (yet). Developing technical credit can also be considered a career-development strategy, or even a perk. 
(This post was inspired by a discussion of Google's "20% time" which fits this description.) Increasing technical credit is <strong>fun</strong>. It usually means solving technical problems that are much more interesting than those in the current version, in relative isolation and without the time pressure of having to fix the bug <em>du jour</em>.</p> <p>The reason I'm writing this is because it seems like a lot of people never think about technical credit - let alone have a structured way to think about it. Both traditional and new-age development processes focus entirely on a direct line from writing code to releasing it. The idea of writing code and then letting it sit for a while has no place in either model. (This same problem shows up in synchronizing separate "upstream" and "downstream" projects with open source.) As a result, many projects go to battle each release cycle with an empty arsenal. Organizations that recognize the existence of technical credit and assign it a proper value will - over time - outperform those that don't.</p>Fri, 16 Aug 2013 12:06:00,2013-08-16:2013-08-technical-credit.htmlprocessGlusterFS 3.5 Features<p>It's time to let some cats out of some bags. As my loyal readers (yeah right) have surely noticed, things have been quiet around here. Part of that has been the result of vacations and such, but also there's a lot of stuff I just haven't felt ready to write about. Now that I've finished writing my <a href="">feature proposals for GlusterFS 3.5</a>, I'm ready to write here as well. But first, a bit of philosophy.</p> <p>On any non-trivial software project, there's likely to be a certain tension between "loyalists" who want to keep things largely the same and "rebels" who keep pushing for radical change. I was tempted to use "liberal" and "conservative" but <a href="">Steve Yegge</a> beat me to it with a different set of definitions. (Good reading, BTW, though he's a bit loquacious so make sure you have plenty of time.) 
The distinction I'm trying to make is as follows:</p> <ul> <li> <p>A loyalist believes that "one more tweak" or a succession of them will bring the project to greatness. There's no need, and too much risk, associated with radical change.</p> </li> <li> <p>A rebel believes that you can't cross a chasm with small steps. You have to take bold steps, even if those steps are going to cause regret later.</p> </li> </ul> <p>Most people don't stay at one position on that spectrum, and certainly not at the ends. Often one's position is determined by circumstance. For example, despite my general tendency toward the rebel view, I've probably spent more of my career (especially at Revivio) fighting on the loyalist side. When it comes to GlusterFS, I'm not just <strong>a</strong> rebel but probably <strong>the</strong> rebel. There are others more fit and more inclined to take the loyalist position. I see it as my responsibility to take an under-represented rebel position by proposing new ideas and accumulating "technical credit" to remove or balance out the significant technical debt that has accumulated under the loyalist regime. I don't mind at all if my ideas stay "on the shelf" for years, as has happened with most of HekaFS. It's good to have extra ammo in the locker. I mind a little bit more when the loyalists reject ideas with pure FUD or passivity instead of sound technical argument, or when they can't seem to change their mind about an old idea without claiming it was their own, but that's a subject for a different post. The point here is that what I'm proposing is deliberately ambitious, because our competitors aren't exactly standing still. They're often acting more boldly than we are, and we'll never gain or maintain a lead over them (I won't get into which we're doing) with baby steps. With that in mind, here's what I've proposed.</p> <ul> <li> <p>Better SSL support. Surprise! This is actually a very loyalist feature. 
Basically the core code has been there for years, but there are several pieces that still need to be done - mainly on the usability front - before this can really be something we're proud of.</p> </li> <li> <p>New Style Replication. For years, I've been advocating for fundamental change in this area. I even wrote an infamous "Why AFR Must Die" email explaining why I don't believe the current design or implementation will be sufficient going forward (using real user complaints and code examples respectively). There finally seems to be some acceptance of the idea that we should use a log/journal instead of scattered xattrs or hard-link trickery to keep track of incomplete changes, but NSR goes even further than that. For one thing, it's server-based to take better advantage of how NICs and switches work. For another, it has an almost Dynamo-like consistency model which offers users better tradeoffs of consistency vs. availability and/or performance. There's a lot more, which I'll discuss in detail some time soon. For now I'll just say that performance will be better, recovery will be faster, and (best of all IMO) "split brain" will be easier to avoid.</p> </li> <li> <p>Thousand-node scalability. The biggest limitation on our scalability right now is not in the I/O path, which scales just fine, but in our management path. The changes I'm proposing here - based on some of the latest advances that some of my distributed-system friends will surely recognize - represent important steps on the way to an exabyte-scale system. We're already at petabyte scale, TYVM, so I don't think that's an exaggeration.</p> </li> <li> <p>Data tiering/classification. While NSR might be the one that I've spent the most time thinking about, this is the one that might have the most immediate impact and appeal for users. We get queries <strong>all the time</strong> about how we're going to deal with SSDs. 
Hybrid drives and "smart" HBAs aren't really very good solutions, because they only have local and limited knowledge. (This is the same issue as global vs. local deduplication BTW.) Being able to combine SSD-based and disk-based bricks <strong>in a single volume</strong> with smart placement and migration across all of them is a leap far beyond such hacks. Perhaps even more importantly, we also get asked a lot about various features that would be beneficial in some way - e.g. deduplication, bitrot detection, erasure coding - but carry a high performance cost. We need to tier between these in much the same way as between different hardware types. As it turns out, the exact same infrastructure also allows us to implement locality awareness, security-level awareness, and all sorts of other features with relatively little effort. Our modular structure and our existing data-distribution code already do 95% of what's needed, and just need a little nudge.</p> </li> </ul> <p>Those are just my own proposals. Others have made proposals too, so please go check them out. Personally, I've been eagerly awaiting Xavi's erasure-coding "disperse" translator for ages, and can't wait to see it become a full part of the project. While there's practically no chance that all of this will get into GlusterFS 3.5, a lot of it will and what's left will become a formidable arsenal of opportunities for years to come.</p>Tue, 13 Aug 2013 11:15:00,2013-08-13:2013-08-glusterfs-35-features.htmlstorageglusterfsAvoiding Jet Lag<p>And now for something completely different...</p> <p>As part of my job - educating and evangelizing and whatever else you call it - I travel a fair amount. I know there are other people who travel ten times as much as I do, but then there are many more who travel less than a tenth as much. As everyone who does travel frequently knows, jet lag is a very real problem. Most of us travel across the country or across the world because something is important. 
It's very depressing when you're at your destination, trying to do something important, and your brain is so fogged by jet lag that you can barely put together a coherent sentence. The really funny thing, as I just noted to my seat-mate on BA203 from LHR to BOS with regard to typing, is that a lot of people who absolutely live to optimize the hell out of every little thing they do never try to optimize around jet lag. Therefore, I'll share something that has worked for me and might work for you.</p> <p>I can't in any way take credit for this idea as my own invention. I read about it in a magazine a while ago, probably The Atlantic but I'm not sure. In that article, they cited this technique as being in use by the US military, Olympic teams, and so on. They probably also cite the scientific sources better than I'm going to. In any case, the very basic observation is this:</p> <blockquote> <p>Your body's clock is determined at least as much by when you <strong>eat</strong> as by when you <strong>sleep</strong>.</p> </blockquote> <p>Thus, even though jet lag is a problem of sleep and wakefulness, the way to address it is through your stomach. Specifically, if you fast for a while your body's clock goes into "free wheeling" mode (much like pushing in the clutch on a manual-transmission car). Then, the next big meal is interpreted as dinner (getting back in gear) with sleep soon to follow. Therefore, the smart thing to do is <strong>don't eat on the plane</strong>. Instead: fast until an appropriate dinner time for your destination, then eat a full dinner, then sleep.</p> <p>My interpretation has been to start fasting a full day before my anticipated arrival time. It's not a total fast, because if I didn't eat at all then my stomach noises would both be uncomfortable and annoy people around me. Similarly, I don't make heroic efforts to avoid sleep. When I get to that point where I can just barely keep my eyes open or focus on what somebody's saying, then I'll take a short nap. 
Fortunately, I've always been good at cat-naps. If I don't want to sleep more than an hour maximum, then it's very unlikely that I'll do so - even without an alarm. (Heck, I had to get up at 3am local time for my flight today, and I woke up "naturally" at 2:55. It's a handy feature.) The key is to eat and sleep little enough to avoid sending that "now we know when to sleep" signal. That way, when you send the <strong>real</strong> signal with a big dinner, your body responds to it. It's definitely hard. I've been bumped up to one of the better-food sections on this flight for the first time in forever, and it's hard to say no when everybody around me is eating. Still, a little bit of discomfort in the air helps to avoid much more discomfort on the ground later.</p> <p>How well does this work? I've done it on my last two Bangalore trips, plus my last coast-to-coast trip. Actually I'm not quite done with the second Bangalore trip; I'm somewhere slightly south of Greenland as I write this. Still, I feel very encouraged that I haven't had any jet lag at all during any of those trips. I find myself going to bed at a normal time for where I am, getting up at a normal time, and not feeling particularly tired during the day. Meanwhile, co-workers on the same trips have been literally falling down because of jet lag. Sure, I've had periods of being tired during the day, but I don't think that has anything to do with jet lag. If you put me in a windowless room with twenty other people to talk about something boring right after lunch, I'm going to nod off a bit regardless of what time zones are involved.</p> <p>So, there it is. It's a very simple idea, it almost seems obvious, and I'm sure there are plenty of people who've already heard of it. 
Still, it seems like a lot of people either haven't heard of it or haven't tried it, so here's a data point for you.</p>Thu, 25 Jul 2013 18:50:00,2013-07-25:2013-07-avoiding-jet-lag.htmltravelStartups and Patents<p>This should be a <a href="">pretty familiar story</a> to anyone in high tech by now. Startup makes something cool, becomes a target for patent litigation from what we used to call an NPE (Non Practicing Entity). Apparently the new term is PAE (Patent Assertion Entity) but I prefer an even more concise term: <strong>troll</strong>. There is much predictable moaning and gnashing of teeth <a href="">on Hacker News</a>, of course, but nobody wants to think about a very simple question.</p> <blockquote> <p>Where does all this ammunition come from?</p> </blockquote> <p>(BTW, I know nobody on HN wants to think about this because I've raised the issue before and I got slammed hard for the effort. That's why I'm posting here this time. Can't censor this, you fucking cowards.)</p> <p>Having worked at a dozen or so startups myself, I know exactly where the ammunition comes from: the looted carcasses of earlier startups. 
While I've never had one of my own patents abused this way, I have half a dozen friends who have been through that and it's always the same story.</p> <ol> <li> <p>Friend works at a startup.</p> </li> <li> <p>Investors apply pressure to file for patents, either as a bargaining chip in any subsequent acquisition or possibly as a hedge against failure.</p> </li> <li> <p>Friend gets named on a patent or ten.</p> </li> <li> <p>The startup does in fact fail or get acquired.</p> </li> <li> <p>Patents get sold, and sold again, until eventually they end up in the hands of a troll.</p> </li> <li> <p>Troll asserts patent.</p> </li> <li> <p>Friend is livid about how their creative work, their contribution to the state of the art, is being abused.</p> </li> <li> <p>Friend's feelings have no effect whatsoever on the litigation.</p> </li> </ol> <p>In other words, if you work at a startup that files patents, and you're not taking steps to put them firmly out of reach of the trolls, then <strong>you're part of the problem</strong>. If you're an investor and you're allowing portfolio companies to file patents without such protection, then you're part of the problem too. Yeah, I know, some VCs claim they discourage such things, but somehow I'm less than convinced when I can go to USPTO or Google Patents or PatentStorm and immediately pull up a list of patents filed by companies that somehow overcame that discouragement without any ill consequence. I'm shocked - shocked! - to hear that patents are going on here. Your winnings, sir.</p> <p>Until we get serious patent reform, which is going to take a while, patents are still necessary to establish precedence. That keeps trolls from acquiring patents to the same idea and then pursuing others - possibly including the people who truly had the idea first and based products on it. Don't let your oh-so-principled distaste for patents overcome common sense and keep you from doing your part to protect everyone, including yourself. 
The key is to ensure that the patents are <strong>only</strong> usable in a defensive fashion. One approach is to turn them over to something like the <a href="">Open Invention Network</a>. Another approach is the <a href="">Innovator's Patent Agreement</a> from Twitter. That at least avoids the scenario I lay out above, though a down-on-their-luck former developer might not offer much restraint when push comes to shove. There are other approaches as well, but the sad fact is that most people who file patents - including those who complain about others' patents - are doing absolutely nothing to ensure that their own work won't be turned against them and their community. That's a disgrace. Developers and founders, disagree with me all you want about what it is we should do, but do <strong>something</strong> besides complain.</p>Fri, 19 Jul 2013 16:19:00,2013-07-19:2013-07-startups-and-patents.htmllegalSmall Synchronous Writes<p>Sometimes people ask me why I always use small synchronous writes for my performance comparisons. Surely (they say), there are other kinds of operations that are more common or more important. Yes there are (I say), and don't call me Shirley. But seriously, folks, there are definitely other kinds of performance that matter. The problem is that they just don't tell you much about what makes two distributed filesystems different. I'll try to explain why.</p> <p>Let's start with read-dominated workloads. It's well known that OS (and app) caches can absorb most of the reads in a system. This was the fundamental observation behind Seltzer et al's work on log-structured filesystems all those years ago. Reads often take care of themselves, so <strong>at the filesystem level</strong> focus on writes. The significance of caching is hardly less in distributed filesystems with greater latency. 
The primary exception to this rule is large sequential reads, but those tend to become bandwidth-bound very quickly and just about every distributed filesystem I've ever seen can saturate whatever network connections you have <strong>easily</strong> for such workloads. Boring. Between these two effects, it just turns out that read-dominated workloads aren't all that interesting.</p> <p>Why not different kinds of writes? Mostly because large and/or asynchronous writes tend to follow the same patterns as large reads. Once you have the opportunity to batch and/or coalesce writes, effectively eliminating the effect that network latency might have on most of them, it becomes pretty easy to fill the pipe with huge packets. Boring again. It's important to measure how well the servers handle <strong>parallelism</strong> among many requests that are still kept separate, but that's a whole different thing. If both reads and large/async writes are uninteresting, what does that leave? Small sync writes, of course.</p> <p>While I'm here, I might as well address a couple of other issues. One is the question about scale. Does a test of a single client and a single server (if replicating) really tell us anything useful for filesystems that are designed to have many servers? I think it does, for a certain class of such filesystems. In a system that uses algorithmic placement, such as GlusterFS or Ceph, an individual request really will hit only those servers and really will scale pretty linearly until you start hitting the <strong>network's</strong> scaling limits. It absolutely makes sense to test the network in the context of an actual deployment, but in the context of evaluating technologies the performance of a single server (or replica pair) does work as a proxy for the performance of N. That doesn't mean you should obsess over micro-optimizations or implementation concerns that don't have much measurable effect (e.g. kernel vs. 
FUSE clients), but it's really the data flow and algorithmic efficiency that matter most. This argument doesn't work nearly as well for more outdated architectures that use directory-based placement, such as HDFS or Lustre. In those cases, the need to go through the MDS or NameNode or whatever really does create a bottleneck that impacts system-wide scaling. That's something to consider when you're looking at such systems.</p> <p>Lastly, what about metadata operations? File creation and directory listings are even worse than writes, aren't they? Yes, absolutely, they are. Testing only data operations is kind of a bad habit among filesystem folks, and I'm guilty too. I really should test and report on those things too, even though it probably means developing even more tools myself because the existing tools are even worse for that than they are for testing plain old reads and writes.</p> <p>To make a long story . . . no longer, if not actually short, I've found that testing small synchronous writes is simply the best place to start. It's the first result to look at, but absolutely not the only one. If I were actually looking to deploy a system myself I'd try all sorts of workloads at the same scale as the deployment itself, or as close as I could get, and I'd show everyone a detailed report. On the other hand, when I'm doing the tests on my own time and at my own expense (in a public cloud) for a blog post or presentation, that's quite a different story.</p>Thu, 18 Jul 2013 10:33:00,2013-07-18:2013-07-why-sync-writes.htmlstorageperformancePerformance Measurement Pitfalls<p>One of the problems with measuring and comparing performance of scalable systems is that any workload capable of producing meaningful results is going to be highly multi-threaded, and most developers don't know much about how to collect or interpret the results. After all, they hardly ever get any training in that area, and many of the tools don't exactly make it easy (as we'll see in a moment). 
Considering all the effort spent on complex ways to define the input workload - some tools have entire domain-specific languages for this - you'd think that some effort might have been spent on making the output more meaningful. You'd be wrong.</p> <p>To see how easy it is to be misled, and how badly, let's consider a simple example. You have a storage system capable of sustaining 1000 IOPS. A single I/O thread can generate a load of 1000 IOPS. What happens when you run four of those?</p> <ul> <li> <p>Scenario 1: the storage system effectively delivers 250 IOPS per thread, continuously. Therefore they each report 250 IOPS, you add those up, and you get a correct sum of 1000 IOPS.</p> </li> <li> <p>Scenario 2: the storage system effectively serializes the four threads. Thread A completes in one second, reporting 1000 IOPS. Thread B completes in two seconds - the first second sitting idle - and reports 500 IOPS. Threads C and D complete in three and four seconds respectively, reporting 333 and finally 250 IOPS. Add them all up and you get the wildly wrong sum of 2083 IOPS.</p> </li> </ul> <p>The mistake in the second scenario seems obvious when described this way, but I've seen smart people make it again and again and again over the years. One way to avoid it is not to trust reports from individual threads, but to measure the start and end times <strong>for the whole group</strong>. Unfortunately, you can miss a lot of useful information that way. Most importantly, a single slow worker can drag the entire average down and you won't even notice that the actual I/O rate for most of the threads and most of the time was actually far higher unless you're paying pretty close attention. Dean and Barroso call this the <a href="">latency tail</a> and it's significant in operations as well as measurement.</p> <p>Another way to avoid the original over-counting problem is "stonewalling" - a term and technique popularized by <a href="">iozone</a>. 
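The arithmetic behind the over-counting in those two scenarios is easy to check; this little sketch (mine, not from any real benchmark tool) just replays the numbers:

```python
# Four threads, each issuing 1000 I/Os, on a system that can
# sustain 1000 IOPS total. Numbers match the scenarios above.

def scenario_1():
    # Fair sharing: every thread runs for the full 4 seconds at 250 IOPS.
    per_thread = [1000 / 4.0] * 4
    return sum(per_thread)                  # correct total: 1000

def scenario_2():
    # Serialized: each thread reports its 1000 ops divided by its own
    # wall-clock time, idle waiting included.
    finish_times = [1.0, 2.0, 3.0, 4.0]     # seconds
    per_thread = [1000 / t for t in finish_times]   # 1000, 500, ~333, 250
    return sum(per_thread)                  # wildly wrong total: ~2083

print(scenario_1())
print(scenario_2())
```

Measuring start and end times for the whole group gives the right answer either way: 4000 ops over 4 seconds is 1000 IOPS no matter how the threads interleaved. Summing per-thread rates does not.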
Stonewalling means stopping all threads when the first one finishes - i.e. "first to reach the stone wall" - and collecting the results even from threads that were stopped prematurely. This does avoid over-counting, but it can distort results in even worse ways than the previous method. It fundamentally means that your workers didn't do all of the I/O that you meant them to, and that they would have if they had all proceeded at the same pace. If you meant to do more I/O than will fit in cache, or on disks' inner tracks, too bad. If you wanted to see the effects of filesystem or memory fragmentation over a long run, too bad again. The slightest asymmetry in your workers' I/O rates will blow all of that away, and what storage system doesn't present any such asymmetry? None that I've ever seen. Worst of all, as <a href="">Brian Behlendorf</a> mentions, this approach doesn't even solve the single-slow-worker problem.</p> <blockquote> <p>The use of stonewalling tends to hide the stragglers effect rather than explain or address it</p> </blockquote> <p>In other words, iozone's stonewalling is worse than the problem it supposedly solves. Turn it off. If you want to see what's <strong>really</strong> happening to your I/O performance, the solution is neither of the above. Measuring just a start and end time, per worker or per run, is insufficient. To see how much work your system is doing per second, you have to look at each second. Such <strong>periodic</strong> aggregation can not only give you accurate overall numbers and highlight stragglers, but it can also show you information like:</p> <ul> <li> <p>Performance percentiles (per thread or overall)</p> </li> <li> <p>Global pauses, possibly indicating outside interference</p> </li> <li> <p>Per-thread pauses e.g. 
due to contention/starvation</p> </li> <li> <p>Mode switches as caches/tiers are exhausted or internal optimizations kick in</p> </li> <li> <p>Cyclic behavior as timers fire or resources are acquired/exhausted</p> </li> </ul> <p>This is all <strong>really</strong> useful information. Do any existing tools provide it? None that I know of. I used to have such a tool at SiCortex, but it was part of their intellectual property and thus effectively died with them. Besides, it depended on MPI. Plain old sockets would be a better choice for general use. Reporting from workers to the controller process could be push or pull, truly periodic or all sent at the end (if you're more concerned about generating network traffic during the run than about clock synchronization). However it's implemented, the data from such a tool would be much more useful than the over-simplified crap that comes out of the current common programs. Maybe when I have some spare time - more about that in a future post - I'll even work on it myself.</p>Tue, 09 Jul 2013 20:03:00,2013-07-09:2013-07-perf-pitfalls.htmlstorageperformanceTwo Weeks is Not a Sprint<p>We're moving to an "agile" development process at work. Yes, we're becoming scrumbags. ;) One of the terms that really bothers me is "sprint" because I think of a sprint as a flat-out effort. That means minimal eating, sleeping, or time with family. Even hard-core hackers rarely do that for two weeks at a time. I think a better metaphor for what's a sprint and what's not is running: 100m equals one day of coding. So...</p> <ul> <li> <p>100m: the classic. Not much more to say about this one.</p> </li> <li> <p>200m = two days. A short hackathon. The focus shifts a bit from acceleration to maximum speed (productivity) and the overall pace is actually a bit higher because that startup time is amortized.</p> </li> <li> <p>400m = four days. A long hackathon, or close enough to a full week. 
Still a sprint, but at the upper end of the range.</p> </li> <li> <p>1500m/mile = two weeks (approximately). Another marquee distance. No longer a true sprint, but still fast. Most sensitive to pace, because it's long enough to burn out but not long enough to make many adjustments.</p> </li> <li> <p>5k/10k = a few months. Not much to say here either.</p> </li> <li> <p>42k/marathon = just over a year. The longest distance/duration most people plan for, though ultra-marathons do exist.</p> </li> </ul> <p>The mile seems like the closest equivalent to how "sprint" is used in agile terminology, so why don't we use good old-fashioned "milestone" instead?</p>Tue, 25 Jun 2013 08:10:00,2013-06-25:2013-06-two-weeks.htmlprocessLies, Damn Lies, and Parallels<p>This apparently happened a while ago, but it recently came to my attention via <a href="">LWN</a> that James Bottomley has made the claim that "Gluster sucks" (not a paraphrase, those seem to be his exact words). Well, I couldn't just let that go by, could I? Why would he say such a thing? The only visible basis for it is a recent <a href="">presentation</a> at the Parallels Summit, which is - to put it bluntly - just <strong>full of lies</strong>. Let's take a look at just how bad it is.</p> <p>Our starting point is a performance graph on slide 3, purportedly showing how Parallels Cloud Storage is way ahead of everyone else in terms of aggregate Gbps . . . but wait. How many clients are we talking about? How many servers? He doesn't say. What kind of hardware? He doesn't say. What kind of configuration? He doesn't say. What kind of workload? He doesn't say. What does it even mean to put up numbers for both distributed storage systems (running on what kind of network?) and "DAS - 15,000 RPM"? Is he comparing apples to oranges, or apples to whole crates full of oranges? That graph is the absolute worst kind of fact-free marketing. It's utterly useless for drawing any engineering conclusions about anything. Onward to slide 5.
What does this mean?</p> <blockquote> <p>File based Storage</p> <p>...</p> <p>suffers from metadata issues on the server</p> </blockquote> <p>"The" server eh? Where have I heard that before? Oh yeah, <a href="">right here</a>. He's making the same mistake that James Hughes did, of thinking that because he can't think of a better way to handle metadata, nobody can. To quote Schopenhauer, "Everyone takes the limits of his own vision for the limits of the world." Onward to slide 7.</p> <blockquote> <p>Using a fixed size object incurs no metadata overhead whatsoever</p> </blockquote> <p>Here he has inadvertently identified a deficiency not in real cloud filesystems but in the Parallels alternative. Fixed-size objects are just not a reasonable limitation in many use cases. Any system designed around such a limitation is hopelessly weak compared to one that handles the more general case. As I explained the <a href="">last time</a> Parallels was slinging this kind of FUD, the same can be said about systems that don't allow real sharing of data - including both object and block stores. People wouldn't still be making billions of dollars per year selling NAS if users didn't want those more general semantics. Onward to slide 8.</p> <blockquote> <p>Fuse is the Linux Userspace Filesystem</p> <p>Main problem is it’s incredibly SLOW</p> </blockquote> <p>So why has FUSE historically been slow? Because the kernel hackers whose sign-off was needed to make it less slow were extremely resistant to any change that would have that effect. People like James Bottomley himself. When you've been wrong for so long, it's disingenuous to take so much credit for finally ceasing your own resistance to change. Onward to slide 9.</p> <blockquote> <p>Eventual Consistency is the usual norm</p> <p>...</p> <p>Gluster (does have a much slower strong consistency quorum enforcement mode)</p> </blockquote> <p>The first part is highly misleading.
Eventual consistency is <strong>not</strong> the norm in GlusterFS. In normal operation, updates are fully synchronous and there will be no inconsistency beyond that which exists in any distributed system while an update is still in progress. The only time there's any observable inconsistency is in the presence of failures, and not just any failure but the kind or number that can lead to split-brain. Also, quorum enforcement does <strong>not</strong> make anything slower. It has zero performance impact; that's just more FUD.</p> <p>Basically, what Bottomley has provided is just one big hatchet job based on misleading or outright false statements. The <strong>fact</strong> is that GlusterFS can do many things that Parallels Cloud Storage can't. It provides full filesystem semantics, truly shared data, geo-replication (still a hand-wave for PCS), Hadoop and Swift integration, and many other features. Yes, it might be true that PCS can outperform GlusterFS for the only use case that PCS can handle, on an unspecified configuration with an unspecified workload. Or maybe not, since those details are missing and the software itself isn't open, so others can't make their own comparisons.</p> <p>In my experience, people only make such totally <strong>bullshit</strong> comparisons when legitimate ones don't paint the picture they want. It's not science. It's not engineering. It's not even marketing done right. It's just lying.</p>Mon, 24 Jun 2013 13:02:00,2013-06-24:2013-06-lies-damn-lies.htmlstorageglusterfsmarketingPackage Managers<p>There are many things that differentiate a true software engineer from a mere programmer. Most of them are unpleasant - planning releases, reviewing designs or code, testing, release engineering, and so on. One of the most odious tasks is packaging software. I'll admit that it's an area where my self-discipline sometimes breaks down and I dump the task on somebody else as quickly as I can.
Nonetheless, I recognize that the task itself as well as the tools and people who do it have value. I recognize that the rules those people have developed generally exist for a good reason. Apparently <a href="">some people don't</a>.</p> <p>The post actually makes some pretty decent points, especially about packagers breaking up packages unnecessarily. Mixed in are some really <strong>bad</strong> points, of which I'll focus on just three.</p> <blockquote> <p>Dynamic linking lets 2 programs indicate they want to use library X at runtime, and possibly even share a copy of X loaded into RAM. This is great if it is 1987 and you have 12mb of ram and want to run more than 3 xterms, but we don’t live in that world anymore.</p> </blockquote> <p>That demonstrates some pretty serious ignorance about the real issues, including performance. Sure, people have lots of RAM, but they want to use it for something besides redundant copies of the same (or almost the same) code. More applications, more VMs, more heap space for whichever program is the machine's main role, etc. A dozen copies of the same library means a dozen times as much RAM <strong>and cache</strong>, and making those footprints larger does indeed have an impact on performance.</p> <blockquote> <p>One often touted benefit of dynamic linking is security, you can upgrade library X to fix some security hole and all the applications that use it will automatically gain the security fix the next time they’re run (assuming they still can run). I admit this benefit, but I think that package managers could work around this if they used static linking (Y depends on X, which has a security update, rebuild X and then rebuild Y and ship an updated package).</p> </blockquote> <p>That doesn't really work. You might be able to <strong>build</strong> against the new version of X, but that doesn't mean the result will be free of subtle bugs due to the difference.
The author even seems aware of this when he talks about the "carefully curated" (how pretentious) libraries that are shipped with Riak, but sort of tries to walk both sides of the street by ignoring the issue here.</p> <p>The situation gets even worse when transitive dependencies are considered. Let's say that X depends on a specific version of Y, and it enforces that dependency either via the package definition or via bundling. Either way, if Y depends on Z then an update to Z can also break X. This possibility remains unless X includes all of its dependencies <strong>all the way down</strong> to the OS. I know plenty of people who do exactly this in the form of virtual appliances and such, and it's a valid approach when pursued to its logical conclusion, but capturing only one level of dependencies solves <strong>nothing</strong> in return for the problems it causes.</p> <p>The last issue has to do with bundling <strong>modified</strong> versions of dependencies.</p> <blockquote> <p>Leveldb is a key/value database originally developed by Google for implementing things like HTML5’s indexeddb feature in Google Chrome. Basho has invested some serious engineering effort in adapting it as one of the backends that Riak can be configured to use to store data on disk. Problem is, our usecase diverges significantly from what Google wants to use it for, so we’ve effectively forked it</p> </blockquote> <p>This approach is problematic for reasons that go well beyond packaging. There's also a serious "doing open-source wrong" aspect to it, though there may be room for debate about which side is guilty in this case. Nonetheless, these things do happen. I myself violated the no-bundling rule for HekaFS on Fedora at one point . . . and you know what? It ended up being broken, for exactly the reasons we're talking about. If you do have to bundle a modified version of someone else's code, there's a right way to do it and a wrong way.
The right way is to <strong>engage</strong> with the distro packagers, instead of calling them "OCD" or accusing them of adhering blindly to "1992" standards that have become outdated, and collaborate with them on a sustainable solution. That solution is very likely to include more tightly specified dependencies, and a more active role keeping your own package up to date as the underlying original dependency gets updated. It's a huge pain for everyone involved, which is why it should only be done as a last resort. If you do decide to go down that path, then at least - as I put it in the <a href="">Hacker News thread</a> - pull up your big-girl panties and deal with it. Asking someone else to do part of your job and then complaining about how they do it is a loser move.</p>Sat, 22 Jun 2013 12:42:00,2013-06-22:2013-06-package-managers.htmlprocessMetadata Servers<p>I was sad that I had to miss RICON East, because I knew they had a lot of great speakers lined up. I really liked <a href="">James Hughes's presentation</a>, but must take issue with slide 15.</p> <blockquote> <p>Metadata Servers</p> <p>Required by traditional filesystems (POSIX) to translate names to sectors</p> <p>Hard to scale, heavy HA requirements</p> </blockquote> <p>The minor quibble is that no metadata servers I know of translate names to sectors. They all translate names in the global distributed namespace into a tuple of a node ID and a file/object ID on that node, and then other layers are responsible for translating to sectors (which might themselves be virtualized eight ways to Sunday). That's just a minor objection, though. My major objection is to the idea that there must be a metadata-server role separate from the data-server role. GlusterFS has proven that assumption false. 
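To make that concrete, here's a toy sketch - my own illustration, not GlusterFS code - of the core idea: when every client hashes the name the same way, any client can locate a file with no metadata server to ask. The node names and hash scheme are invented for the example; real systems assign hash <em>ranges</em> per directory so that adding nodes doesn't remap everything.

```python
import hashlib

# Hypothetical storage nodes; in a real system this list would come
# from cluster configuration shared by all clients.
NODES = ["server-a", "server-b", "server-c", "server-d"]

def locate(path: str) -> str:
    """Map a path to the node responsible for it, using only a hash.

    No lookup, no central server: every client computes the same
    answer independently from the same node list.
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(NODES)
    return NODES[bucket]

# Any two clients with the same node list agree on placement:
print(locate("/exports/photos/cat.jpg"))
print(locate("/exports/photos/dog.jpg"))
```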
Even Ceph, which does have a separate metadata-server role, distributes that role in very much the same way as the proposed alternative object stores, so it's not subject to "heavy HA requirements" any more than they are.</p> <p>There are some valid points to be made about the ordering and atomicity requirements of a full POSIX filesystem vs. an object store with simpler (saner?) semantics, but the "heavy HA requirements" of metadata servers are avoidable. There is one well known case of a distributed not-quite-filesystem that made that mistake (HDFS), but an argument based on one bad example won't get very far.</p>Fri, 21 Jun 2013 15:47:00,2013-06-21:2013-06-metadata-servers.htmlstorageStarting Over<p>You might have noticed that things look a bit different around here. OK, if you're reading this in an RSS reader then maybe not, but otherwise it's kind of obvious. I've switched platforms yet again, because I was feeling a bit blocked. Publishing new stuff using my static-wordpress technique was a bit cumbersome, but I didn't want to go back to the bloat and security nightmare that is regular WordPress either, so I'm moving to a system that's designed to generate static pages - Pelican. All of the old content will remain available at the same locations (don't want to lose all that Google juice), but the front page and feeds will be all about the new stuff. I actually have a bunch of ideas queued up in my head. Now that I've made the leap, I'll be letting them out into the world shortly. Let's see how it works out.</p>Thu, 20 Jun 2013 17:38:00,2013-06-20:2013-06-starting-over.htmlpelican