Canned Platypus

Saving the world one byte at a time since 2000

Archive for the ‘systems’ Category

David Strauss tweeted an interesting comment about using filesystems (actually he said “block devices” but I think he really meant filesystems) for scale and high availability. I thought I was following him (I definitely am now) but in fact I saw the comment when it was retweeted by Jonathan Ellis. The conversation went on a while, but quickly reached a point where it became impossible to fit even a minimally useful response under 140 characters, so I volunteered to extract the conversation into blog form.

Before I start, I’d like to point out that I know both David and Jonathan. They’re both excellent engineers and excellent people. I also don’t know the context in which David originally made his statement. On the other hand, NoSQL/BigData folks pissing all over things they’re too lazy to understand has been a bit of a hot button for me lately (e.g. see Stop the Hate). So I’m perfectly willing to believe that David’s original statement was well intentioned, perhaps a bit hasty or taken out of context, but I also know that others with far less ability and integrity than he has are likely to take such comments even further out of context and use them in their ongoing “filesystems are irrelevant” marketing campaign. So here’s the conversation so far, rearranged to show the diverging threads of discussion and with some extra commentary from me.

DavidStrauss Block devices are the wrong place scale and do HA. It’s always expensive (NetApp), unreliable (SPOF), or administratively complex (Gluster).

Obdurodon Huh? GlusterFS is *less* administratively complex than e.g. Cassandra. *Far* less. Also, block dev != filesystem.

Obdurodon It might not be the right choice for any particular case, but for reasons other than administrative complexity.
What reasons, then? Wrong semantics, wrong performance profile, redundant wrt other layers of the system, etc. I think David and I probably agree that scale and HA should be implemented in the highest layer of any particular system, not duplicated across layers or pushed down into a lower layer to make it Somebody Else’s Problem (the mistake made by every project to make the HDFS NameNode highly available). However, not all systems have the same layers. If what you need is a filesystem, then the filesystem layer might very well be the right place to deal with these issues (at least as they pertain to data rather than computation). If what you need is a column-oriented database, that might be the right place. This is where I think the original very general statement fails, though it seems likely that David was making it in a context where layering two systems had been suggested.

DavidStrauss GlusterFS is good as it gets but can still get funny under split-brain given the file system approach: http://t.co/nRu1wNqI
I was rather amused by David quoting my own answer (to a question on the Gluster community site) back at me, but also a bit mystified by the apparent change of gears. Wasn’t this about administrative complexity a moment ago? Now it’s about consistency behavior?

Obdurodon I don’t think the new behavior (in my answer) is markedly weirder than alternatives, or related to being a filesystem.

DavidStrauss It’s related to it being a filesystem because the consistency model doesn’t include a natural, guaranteed split-brain resolution.

Obdurodon Those “guarantees” have been routinely violated by most other systems too. I’m not sure why you’d single out just one.
I’ll point out here that Cassandra’s handling of Hinted Handoff has only very recently reached the standard David seems to be advocating, and was pretty “funny” (to use his term) before that. The other Dynamo-derived projects have also done well in this regard, but other “filesystem alternatives” have behavior that’s too pathetic to be funny.

DavidStrauss I’m not singling out Gluster. I think elegant split-brain recovery eludes all distributed POSIX/block device systems.
Perhaps this is true of filesystems in practice, but it’s not inherent in the filesystem model. I think it has more to do with who’s working on filesystems, who’s working on databases, who’s working on distributed systems, and how people in all of those communities relate to one another. It just so happens that the convergence of database and distributed-systems work is a bit further along, but I personally intend to apply a lot of the same distributed-system techniques in a filesystem context and I see no special impediment to doing so.

DavidStrauss #Gluster has also come a long way in admin complexity, but high-latency (geo) replication still requires manual failover.

Obdurodon Yes, IMO geosync in its current form is tres lame. That’s why I still want to do *real* wide-area replication.

DavidStrauss Top-notch geo replication requires embracing split-brain as a normal operating mode and having guaranteed, predictable recovery.

Obdurodon Agreed wrt geo-replication, but that still doesn’t support your first general statement since not all systems need that.

DavidStrauss Agreed on need for geo-replication, but geo-repl. issues are just an amplified version of issues experienced in any cluster.
As I’ve pointed out before, I disagree. Even systems that do need this feature need not – and IMO should not – try to do both local/sync and remote/async replication within a single framework. They’re different beasts, most relevantly with respect to split brain being a normal operating mode. I’ve spent my share of time pointing out to Stonebraker and other NewSQL folks that partitions really do occur even within a single data center, but they’re far from being a normal case there and that does affect how one arranges the code to handle it.

Obdurodon I’m loving this conversation, but Twitter might not be the right forum. I’ll extract into a blog post.

DavidStrauss You mean complex, theoretical distributed systems issues aren’t best handled in 140 characters or less? :-)

I think that about covers it. As I said, I disagree with the original statement in its general form, but might find myself agreeing with it in a specific context. As I see it, aggregating local filesystems to provide a single storage pool with a filesystem interface and aggregating local filesystems to provide a single storage pool with another interface (such as a column-oriented database) aren’t even different enough to say that one is definitely preferable to the other. The same fundamental issues, and many of the same techniques, apply to both. Saying that filesystems are the wrong way to address scale is like saying that a magnetic #3 Phillips screwdriver is the wrong way to turn a screw. Sometimes it is exactly the right tool, and other times the “right” tool isn’t as different from the “wrong” tool as its makers would have you believe.

Oct
19
Stop The Hate

I’ve noticed a significant increase lately in the number of complaints people are making about the operating systems they use, particularly Linux and most especially the storage stack. No, I’m not thinking of a certain foul-mouthed SSD salesman, who has made such kvetching the centerpiece of his Twitter persona. I’m talking about several people I know in the NoSQL/BigData world, who I’ve come to respect as very smart and generally reasonable people, complaining about things like OS caches, virtual memory in general, CPU schedulers, I/O schedulers, and so on. Sometimes the complaints are just developers being developers, which (unfortunately) seems to mean being disrespectful of developers in other specialties. Sometimes the complaints take the form of an unexamined assumption that OS facilities just can’t be trusted, get in the way, and kill performance. The meme seems to be that the way to get better application performance is to get the OS out of the way as much as possible and reinvent half of what it does within your application. That’s wrong. No matter how the complaint is framed, it’s highly likely to reflect more negatively on the complainer rather than the thing they’re complaining about.

Look, folks, operating-system developers can’t read minds. They have to build very complex, very general systems. They set defaults that suit the most common use cases, and they provide knobs to tune for something different. Learn how to use those knobs to tune for your exotic workload, or STFU. Does your code perform well in every possible use on every possible configuration, without tuning? Not so much, huh? I’ve probably seen your developers deliver a very loud “RTFM” when users visit mailing lists or IRC channels looking for help with a “wrong” use or config. I’ve probably seen them say far worse, even. How can the same person do that, and then turn around to complain about an OS they haven’t learned properly, and not be a hypocrite? When you do find those tuning knobs, often after having been told about them because you had already condemned the things they control as broken, don’t try and pass it off as your personal victory over the lameness of operating systems and their developers. You just turned a knob, which was put there by someone else in the hopes that you’d be smart enough to use it before you complained. They did the hard work – not you.

I’m not going to say that all complaints about operating systems are invalid, of course. I still think it’s ridiculous that Linux requires swap space even when there’s plenty of memory, and behaves poorly when it can’t get any. I think the “OOM Killer” is one of the dumbest ideas ever, and the implementation is even worse than the idea. I won’t say that operating-system documentation is all that it should be, either. Still, if you haven’t even tried to find out what you can tune through /proc and /sys and fcntl/setsockopt/*advise, or gone looking in the Documentation subdirectory of your friendly neighborhood kernel tree, or accepted an offer of help from a kernel developer who came to you to help make things better, you’re just in no position to complain or criticize. It’s like complaining that your manual-transmission car stalled, when you never even learned to drive it. Not knowing something doesn’t make you a fool, but complaining instead of asking does. Maybe if you actually engaged with your peers instead of pissing on them, they could help you build better applications.

I have the best readers. One sent me email expressing a hope that I’d write about Martin Fowler’s LMAX Architecture. I’d be glad to. In fact I had already thought of doing so, but the fact that at least one reader has already expressed an interest makes it even more fun. The architecture seems to incorporate three basic ideas.

  • Event sourcing, or the idea of using a sequentially written log as the “system of record” with the written-in-place copy as a cache – an almost direct inversion of roles compared to standard journaling.
  • The “disruptor” data/control structure.
  • Fitting everything in memory.

I don’t really have all that much to say about fitting everything in memory. I’m a storage guy, which almost by definition means I don’t get interested until there’s more data than will fit in memory. Application programmers should IMO strive to use storage only as a system of record, not as an extension of memory or networking (“sending” data from one machine to another through a disk is a pet peeve). If they want to cache storage contents in memory that’s great, and if they can do that intelligently enough to keep their entire working set in memory that’s better still, but if their locality of reference doesn’t allow that then LMAX’s prescription just won’t work for them and that’s that. The main thing that’s interesting about the “fit in memory” part is that it’s a strict prerequisite for the disruptor part. LMAX’s “one writer many readers” rule makes sense because of how cache invalidation and so forth work, but disks don’t work that way so the disruptor’s advantage over queues is lost.

With regard to the disruptor structure, I’ll also keep my comments fairly brief. It seems pretty cool, not that dissimilar to structures I’ve used and seen used elsewhere; some of the interfaces to the SiCortex chip’s built-in interconnect hardware come to mind quickly. I think it’s a mistake to contrast it with Actors or SEDA, though. I see them as complementary, with Actors and SEDA as high-level paradigms and the disruptor as an implementation alternative to the queues they often use internally. The idea of running these other models on top of disruptors doesn’t seem strange at all, and the familiar nature of disruptors doesn’t even make the combination seem all that innovative to me. It’s rather disappointing to see useful ideas dismissed because of a false premise that they’re alternatives to another instead of being complementary.

The really interesting part for me, as a storage guy, is the event-sourcing part. Again, this has some pretty strong antecedents. This time it recalls Seltzer et al’s work on log-structured filesystems, which is even based on may of the same observations e.g. about locality of reference and relative costs of random vs. sequential access. That work’s twenty years old, by the way. Because event sourcing is so similar to log-structured file systems, it runs into some of the same problems. Chief among these is the potentially high cost of reads that aren’t absorbed by the cache, and the complexity involved with pruning no-longer-relevant logs. Having to scan logs to find the most recent copy of a block/record/whatever can be extremely expensive, and building indices carries its own expense in terms of both resources and complexity. It’s not a big issue if your system has very high locality of reference, which time-oriented systems such as LMAX or several others types of systems tend to, but it can be a serious problem in the general case. Similarly, the cleanup problem doesn’t matter if you can simply drop records from the tail or at least stage them to somewhere else, but it’s a big issue for files that need to stay online – with online rather than near-line performance characteristics – indefinitely.

In conclusion, then, the approach Fowler describes seems like a good one if your data characteristics are similar to LMAX’s, but probably not otherwise. Is it an innovative approach? Maybe in some ways. Two out of the three main features seem strongly reminiscent of technology that already existed, and combinations of existing ideas are so common in this business that this particular combination doesn’t seem all that special. On the other hand, there might be more innovation in the low-level details than one would even expect to find in Fowler’s high-level overview. It’s interesting, well worth reading, but I don’t think people who have dealt with high data volumes already will find much inspiration there.

I read about this a few days ago on the Green Data Center Blog, and tried to comment there, but apparently Dave Ohara either isn’t checking his moderation queue or doesn’t like skeptical comments about companies whose press releases he’s re-publishing. I’ll try to reconstruct the gist of that comment here.

Since working at SiCortex, I’ve kept a bit of an eye on other companies trying to produce high-density high-efficiency systems, like SeaMicro and Calxeda. Frankly, I find SeaMicro’s “throughput computing” spin very unconvincing. A little over a year ago I took a look at their architecture and reached the conclusion that all those processors were going to starve – not enough memory to let data rest, not enough network throughput to keep it moving. Now they have somewhat more memory and a lot more CPU cycles, but as far as I can tell the exact same network and storage capabilities, so the starvation problem is going to be even more severe. Even worse, the job posting Dave forwards for them (if he doesn’t get a fee for that he should) seems to indicate that they’s still pursuing an Ethernet-centric interconnect strategy. That will keep them well behind where SiCortex, Cray, IBM and others were years ago in terms of internal bandwidth for similar systems.

On the other hand, even if Calxeda’s more radical departure from “me too” computing seems more likely to yield something useful, it would be unfair to contrast their still-hoped-for systems to SeaMicro’s actually-shipping ones. Come on, Calxeda, ship something so we can actually make that comparison.

While it might have been overshadowed by events on my other blog, my previous post on Solid State Silliness did lead to some interesting conversations. I’ve been meaning to clarify some of the reasoning behind my position that one should use SSDs for some data instead of all data, and that reasoning applies to much more than just the SSD debate, so here goes.

The first thing I’d like to get out of the way is the recent statement by everyone’s favorite SSD salesman that “performant systems are efficient systems”. What crap. There are a great many things that people do to get more performance (specifically in terms of latency) at the expense of wasting resources. Start with every busy-wait loop in the world. Another good example is speculative execution. There, the waste is certain – you know you’re not going to execute both sides of a branch – but it’s often done anyway because it lowers latency. It’s not efficient in terms of silicon area, it’s not efficient in terms of power, it’s not efficient in terms of dollars, but it’s done anyway. (This is also, BTW, why a system full of relatively weak low-power CPUs really can do some work more efficiently than one based on Xeon power hogs, no matter how many cores you put on each hog.) Other examples of increased performance without increased efficiency include most kinds of pre-fetching, caching, or replication. Used well, these techniques actually can improve efficiency as requests need to “penetrate” fewer layers of the system to get data, but used poorly they can be pure waste.

If you’re thinking about performance in terms of throughput rather than latency, then the equation of performance with efficiency isn’t quite so laughable, but it’s still rather simplistic. Every application has a certain ideal balance of CPU/memory/network/storage performance. It might well be the case that thinner “less performant” systems with those performance ratios are more efficient – per watt, per dollar, whatever – than their fatter “more performant” cousins. Then the question becomes how well the application scales up to the higher node counts, and that’s extremely application-specific. Many applications don’t scale all that well, so the “more performant” systems really would be more efficient. (I guess we can conclude that those pushing the “performance = efficiency” meme are used to dealing with systems that scale poorly. Hmm.) On the other hand, some applications really do scale pretty well to the required node-count ranges, and then the “less performant” systems would be more efficient. It’s a subject for analysis, not dogmatic adherence to one answer.

The more important point I want to make isn’t about efficiency. It’s about locality instead. As I mentioned above, prefetch and caching/replication can be great or they can be disastrous. Locality is what makes the difference, because these techniques are all based on exploiting locality of reference. If you have good locality, fetching the same data many times in rapid succession, then these techniques can seem like magic. If you have poor locality, then all that effort will be like the effort you make to save leftovers in the refrigerator to save cooking time . . . only to throw those leftovers away before they’re used. One way to look at this is to visualize data references on a plot, using time on the X axis and location on the Y axis, using Z axis or color or dot size to represent density of accesses . . . like this.


time/location plot

It’s easy to see patterns this way. Vertical lines represent accesses to a lot of data in a short amount of time, often in a sequential scan. If the total amount of data is greater than your cache size, your cache probably isn’t helping you much (and might be hurting you) because data accessed once is likely to get evicted before it’s accessed again. This is why many systems bypass caches for recognizably sequential access patterns. Horizontal lines represent constant requests to small amounts of data. This is a case where caches are great. It’s what they’re designed for. In a multi-user and/or multi-dataset environment, you probably won’t see many thick edge-to-edge lines either way. You’ll practically never see the completely flat field that would result from completely random access either. What you’ll see the most of are partial or faint lines, or (if your locations are grouped/sorted the right way) rectangles and blobs representing concentrated access to certain data at certain times.

Exploiting these blobs is the real fun part of managing data-access performance. Like many things, they tend to follow a power-law distribution – 50% of the accesses are to 10% of the data, 25% of the accesses are to the next 10%, and so on. This means that you very rapidly reach the point of diminishing returns, and adding more fast storage – be it more memory or more flash – is no longer worth it. When you consider time, this effect becomes even more pronounced. Locality over short intervals is likely to be significantly greater than that over long intervals. If you’re doing e-commerce, certain products are likely to be more popular at certain times and you’re almost certain to have sessions open for a small subset of your customers at any time. If you can predict such a transient spike, you can migrate the data you know you’ll need to your highest-performance storage before the spike even begins. Failing that, you might still be able to detect the spike early enough to do some good. What’s important is that the spike is finite in scope. Only a fool, given such information, would treat their hottest data exactly the same as their coldest. Only a bigger fool would fail to gather that information in the first place.

Since this all started with Artur Bergman’s all-SSD systems, let’s look at how these ideas might play out at a place like Wikia. Wikia runs a whole lot of special-interest wikis. Their top-level categories are entertainment, gaming, and lifestyle, though I’m sure they host wikis on other kinds of subjects as well. One interesting property of these wikis is that each one is separate, which seems ideal for all kinds of partitioning and differential treatment of data. At the very grossest level, it seems like it should be trivial to keep some of the hottest wikis’ data on SSDs and relegate others to spinning disks. Then there’s the temporal-locality thing. The access pattern for a TV-show wiki must be extremely predictable, at least while the show’s running. Even someone as media-ignorant as me can guess that there will be a spike starting when an episode airs (or perhaps even a bit before), tailing off pretty quickly after the next day or two. Why on Earth would someone recommend the same storage for content related to a highly rated and currently running show as for a show that was canceled due to low ratings a year ago? I don’t know.

Let’s take this a bit further. Using Artur’s example of 80TB and a power-law locality pattern, let’s see what happens. What if we have a single 48GB machine, with say 40GB available for caching? Using the “50% of accesses to 10% of the data” pattern, that means 3.125% of accesses are even out of memory. No matter what the latency difference between flash and spinning disks might be, it’s only going to affect that 3.125% of accesses so it’s not going to affect your average latency that much. Even if you look at 99th-percentile latency, it’s fairly easy to see that adding SSD up to only a few times memory size will reduce the level of spinning-disk accesses to noise. Factor in temporal locality and domain-specific knowledge about locality, and the all-SSD case gets even weaker. Add more nodes – therefore more memory – and it gets weaker. Sure, you can assume a flatter access distribution, but in light of all these other considerations you’d have to take that to a pretty unrealistic level before the all-SSD prescription starts to look like anything but quackery.

Now, maybe Artur will come along to tell me about how my analysis is all wrong, how Wikia really is such a unique special flower that principles applicable to a hundred other systems I’ve seen don’t apply there. The fact is, though, that those other hundred systems are not well served by using SSDs profligately. They’ll be wasting their owners’ money. Far more often, if you want to maximize IOPS per dollar, you’d be better off using a real analysis of your system’s locality characteristics to invest in all levels of your memory/storage hierarchy appropriately.

Apparently Artur Bergman did a very popular talk about SSDs recently. It’s all over my Twitter feed, and led to a pretty interesting discussion at High Scalability. I’m going to expand a little on what I said there.

I was posting to comp.arch.storage when Artur was still a little wokling, so I’ve had ample opportunity to see how a new technology gets from “exotic” to mainstream. Along the way there will always be some people who promote it as a panacea and some who condemn it as useless. Neither position requires much thought, and progress always comes from those who actually think about how to use the Hot New Thing to complement other approaches instead of expecting one to supplant the other completely. So it is with SSDs, which are a great addition to the data-storage arsenal but cannot reasonably be used as a direct substitute either for RAM at one end of the spectrum or for spinning disks at the other. Instead of putting all data on SSDs, we should be thinking about how to put the right data on them. As it turns out, there are several levels at which this can be done.

  • For many years, operating systems have implemented all sorts of ways to do prefetching to get data into RAM when it’s likely to be accessed soon, and bypass mechanisms to keep data out of RAM when it’s not (e.g. for sequential I/O). Processor designers have been doing similar things going from RAM to cache, and HSM folks have been doing similar things going from tape to disk. These basic approaches are also applicable when the fast tier is flash and the slow tier is spinning rust.
  • At the next level up, filesystems can evolve to take better advantage of flash. For example, consider a filesystem designed to keep not just journals but actual metadata on flash, with the actual data on disks. In addition to the performance benefits, this would allow the two resources to be scaled independently of one another. Databases and other software at a similar level can make similar improvements.
  • Above that level, applications themselves can make useful distinctions between warm and cool data, keeping the former on flash and relegating the latter to disk It even seems that the kind of data being served up by Wikia is particularly well suited to this, if only they decided to think and write code instead of throwing investor money at their I/O problems.

Basically what it all comes down to is that you might not need all those IOPS for all of your data. Don’t give me that “if you don’t use your data” false-dichotomy sound bite either. Access frequency falls into many buckets, not just two, and a simplistic used/not-used distinction is fit only for a one-bit brain. If you need a lot of machines for their CPU/memory/network performance anyway, and thus don’t need half a million IOPS per machine, then spending more money to get them is just a wasteful ego trip. By putting just a little thought into using flash and disk to complement one another, just about anyone should be able to meet their IOPS goals for lower cost and use the money saved on real operational improvements.

Tom Trainer wrote what was supposed to be a thoughtful examination of what “cloud storage” should mean, but it came across as a rather nasty anti-Isilon hit piece. I tried to reply there, but apparently my comment won’t go through until I register with “UBM TechWeb” so they can sell me some crap, so I’m posting my response here. Besides being a defense of an unfairly maligned competitor – mine as well as Tom’s unnamed employer’s – it might help clarify some of the issues around what is or is not “real” cloud storage.

As the project lead for CloudFS, which addresses exactly the kinds of multi-tenancy and encryption you mention, I agree with many of your main points about what features are necessary for cloud storage. Where I disagree is with your (mis)characterization of Isilon to make those points.

* First, their architecture is far from monolithic. Yes, OneFS is proprietary, but that’s a *completely* different thing.

* Second, scaling to 144 servers is actually pretty good. When you look closely at what many vendors/projects claim, you find out that they’re actually talking about clients . . . and any idiot can put together thousands of clients. Conflating node counts with server counts was a dishonest trick when I caught iBrix doing it years ago, and it’s a dishonest trick now. Even the gigantic “Spider” system at ORNL only has 192 servers, and damn few installations need even half of that. It’s probably a support limit rather than an architectural limit. No storage vendor supports configurations bigger than they’ve tested, and testing even 144 servers can get pretty expensive – at least if you do it right. I’m pretty sure that Isilon would raise that limit if somebody asked them for a bigger system and let them use that configuration for testing.

Third, Isilon does have a “global” namespace as that term is usually used – i.e. at a logical level, to mean that the same name means the same thing across multiple servers, just like a “global variable” represents the same thing across multiple modules or processes. Do you expect global variables to be global in a physical sense too? In common usage, people use terms like “WAN” or “multi-DC” or “geo” to mean distribution across physical locations, and critiquing a vendor for common usage of a term makes your article seem like even more of a paid-for attack piece.

Disclaimer: I briefly evaluated and helped deploy some Isilon gear at my last job (SiCortex). I respect the product and I like the people, but I have no other association with either.

When Alex Popescu wrote about scalable storage solutions and I said that the omission of distributed filesystems made me cry, he suggested that I could write an introduction. OK. Here it is.

All filesystems – even local ones – have a similar data model and API model. The data model consists of files inside directories, where both have user-assigned names. In most modern filesystems directories can be nested, file contents are byte-addressable, and names are free-form character sequences. The API model, commonly referred to as POSIX after the standard of the same name, includes two broad categories of calls – those that operate on files, and those that operate on the “namespace” of files within directories. Examples of the first category include open, close, read, write, and fsync. Examples of the second category include opendir, readdir, link, unlink, and rename. People who actually develop filesystems, especially in a Linux context, often talk in terms of a different three-way distinction (file/inode/dirent operations) but that has more to do with filesystem internals than with the API users see. The other thing about filesystems is that they’re integrated into the operating system. Any program should be able to use any filesystem without using special libraries. That makes real filesystems a bit harder to implement, but it also makes them more generally useful than impostors that just have “FS” in the name to imply functionality they don’t have. There are many ways to categorize filesystems – according to how they’re accessed, how they’re implemented, what they’re good for, and so on. In the context of scalable storage solutions, though, the most important groupings are these.

  • A local filesystem is local to a single machine, in the sense that only a process on the same machine can make POSIX calls to it. That process might in fact be a server for some “higher level” kind of filesystem, and in fact local filesystems are an essential building block for most others, but for this to work the server must make a new local-filesystem call which is not quite the same as continuing the client’s call.
  • A network filesystem is one that can be shared, but where each client communicates with a single server. NFS (versions 2 and 3) and CIFS (a.k.a. SMB which is what gives Samba its name) are the best known examples of this type. Servers can of course be clustered and made highly available and so on, but this must be done transparently – almost behind the clients’ backs or under their noses. This approach fundamentally only allows vertical scaling, and the trickery necessary to scale horizontally within a single-server basic model can become quite burdensome.
  • A distributed filesystem is one in which clients actually know about and directly communicate with multiple servers (of one or more types). Lustre, PVFS2, GlusterFS, and Ceph all fit into this category despite their other differences. Unfortunately, the term “distributed filesystem” makes no distinction between filesystems distributed across a fast and lossless LAN and those distributed across a WAN with exactly opposite characteristics. I sometimes use “near-distributed” and “far-distributed” to make this distinction, but as far as I know there’s no concise and commonly accepted terms. AFS is the best known example of a far-distributed filesystem, and one of the longest-lived filesystems in any category (still in active large-scale use at several places I know of).
  • A parallel filesystem is a distributed filesystem in which a single file, or even a single I/O request, can be striped across multiple servers. This is primarily beneficial in terms of performance, but can also help to distribute capacity more evenly than if every file lives on exactly one server. I’ve often used the term to refer to near-distributed filesystems as distinct from their far-distributed cousins, because there’s a high degree of overlap, but it’s not technically correct. There are near-distributed filesystems that aren’t parallel filesystems (GlusterFS is usually configured this way) and parallel filesystems that are not near-distributed (Tahoe-LAFS and other crypto-oriented filesystems might fit this description).
  • A cluster or shared-storage filesystem is one in which clients are directly attached to shared block storage. GFS2 and OCFS2 are the best known examples of this category, which also includes MPFS. Once touted as a performance- or scale-oriented solution, these are now positioned mainly as availability solutions with a secondary emphasis on strong data consistency (compared to the weak consistency offered by many other network and distributed filesystems). Due to this focus and the general characteristics of shared block storage, the distribution in this case is always near.

This set of distinctions is certainly neither comprehensive nor ideal, as illustrated by pNFS which allows multiple “layout” types. With a file layout, pNFS would be a distributed filesystem by these definitions. With a block layout, it would be a cluster filesystem. With an object layout, a case could be made for either, and yet all three are really the same filesystem with (mostly) the same protocol and (definitely) the same warts.

One of the most important distinctions among network/distributed/cluster filesystems, from a scalability point of view, is whether it’s just data that’s being distributed or metadata as well. One of the issues I have with Lustre, for example, is that it relies on a single metadata server (MDS). The Lustre folks would surely argue that having a single metadata server is not a problem, and point out that Lustre is in fact used at some of the most I/O-intensive sites in the world without issue. I would point out that I have actually watched the MDS melt down many times when confronted with any but the most embarrassingly metadata-light workloads, and also ask why they’ve expended such enormous engineering effort – on at least two separate occasions – trying to make the MDS distributed if it’s OK for it not to be. Similarly, with pNFS you get distributed data but only some pieces of the protocol (and none of any non-proprietary implementation) to distribute metadata as well. Anybody who wants a filesystem that’s scalable in the same way that non-filesystem data stores such as Cassandra/Riak/Voldemort are scalable would and should be very skeptical of claims made by advocates of a distributed filesystem with non-distributed metadata.

A related issue here is of performance. While near-distributed parallel filesystems can often show amazing megabytes-per-second numbers on large-block large-file sequential workloads, as a group they’re notoriously poor for random or many-small-file workloads. To a certain extent this is the nature of the beast. If files live on dozens of servers, you might have to contact dozens of servers to list a large directory, or the coordination among those servers to maintain consistency (even if it’s just metadata consistency) can become overwhelming. It’s harder to do things this way than to blast bits through a simple pipe between one client and one server without any need for further coordination. Can Ma’s Pomegranate project deserves special mention here as an effort to overcome this habitual weakness of distributed filesystems, but in general it’s one of the reasons many have sought alternative solutions for this sort of data.

So, getting back to Alex’s original article and my response to it, when should one consider using a distributed filesystem instead of an oh-so-fashionable key/value or table/document store for one’s scalable data needs? First, when the data and API models fit. Filesystems are good at hierarchical naming and at manipulating data within large objects (beyond the whole-object GET and PUT of S3-like systems), but they’re not so good for small objects and don’t offer the indices or querying of databases (SQL or otherwise). Second, it’s necessary to consider the performance/cost curve of a particular workload on a distributed filesystem vs. that on some other type of system. If there’s a fit for data model and API and performance, though, I’d say a distributed filesystem should often be preferred to other options. The advantage of having something that’s accessible from every scripting language and command-line tool in the world, without needing special libraries, shouldn’t be taken lightly. Getting data in and out, or massaging it in any of half a million ways, is a real problem that isn’t well addressed by any storage system with a “unique” API (including REST-based ones) no matter how cool that storage system might be otherwise.

Yes, you read that right – Zettar, not Zetta. Robin Harris mentioned them, so I had a quick look. At first I was going to reply as a comment on Robin’s site, but the reply got long and this site has been idle for a while so I figured I’d post it here instead. I also think there are some serious issues with how Zettar are positioning themselves, and I feel a bit more free expressing my thoughts on those issues here. So, here are my reactions.

First, if I were at Zetta, I’d be livid about a company in a very similar space using a very similar name like this. That’s discourteous at best, even if you don’t think it’s unethical/illegal. It’s annoying when the older company has to spend time in every interaction dispelling any confusion between the two, and that’s if they even get the chance before someone writes them off entirely because they looked at the wrong website. The fact that Robin felt it necessary to add a disambiguating note in his post indicates that it’s a real problem.

Second, Amazon etc. might not sell you their S3 software, but there are other implementations – Project Hail’s tabled, Eucalyptus’s Walrus, ParkPlace. Then there’s OpenStack Storage (Swift) which is not quite the same API but similar enough that anyone who can use one can pretty much use the other. People can and do run all of these in their private clouds just fine.

Third, there are several packages that provide a filesystem interface on top of an S3-like store – s3fs, s3fuse, JungleDisk, even my own VoldFS (despite the name). Are any of these production-ready? Perhaps not, but several are older than Zettar and several are open source. The difference in deployment risk for Zettar vs. alternatives is extremely small.

Lastly, Zettar’s benchmarks are a joke. The comparison between local Zettar and remote S3 are obviously useless, but even the comparison with S3 from within EC2 is deceptive. Let’s just look at some of the things they did wrong:

  • They compare domU-to-domU for themselves vs. EC2 instances which are likely to be physically separate.
  • They fail to disclose what physical hardware their own tests were run on. It’s all very well to say that you have a domU with an xxx GHz VCPU and yyy GB of memory, but physical hardware matters as well.
  • Disk would have mattered if their tests weren’t using laughably small file and total sizes (less than memory).
  • 32-bit? EC2 m1.small? Come on.
  • They describe one set of results, failing to account for the fact that EC2 performance is well known to vary over time. How many runs did they really do before they picked these results to disclose?
  • To measure efficiency, they ran a separate set of tests using KVM on a notebook. WTF? Of course their monitoring showed negligible load, because their test presents negligible load.

There’s nothing useful about these results at all. Even Phoronix can do better, and that’s the very lowest rung of the storage-benchmarking ladder.

This looks like a distinctly non-enterprise product measured in distinctly non-enterprise ways. They even refer to cloud storage 2.0, I guess to set themselves apart from other cloud-storage players. They’ve set themselves apart, all right. Beneath.

Disclaimer: I work directly with the Project Hail developers, less so with the OpenStack Storage folks. My own CloudFS (work) and VoldFS (personal) projects could also be considered to be in the same space. This is all my own personal opinion, nothing to do with Red Hat, etc.

In my last post, I described several common data-loss scenarios and took people to task for what I feel is a very unbalanced view of the problem space. It would be entirely fair for someone to say that it would be even more constructive for me to explain some ways to avoid those problems, so here goes.

One of the most popular approaches to ensuring data protection is immutable and/or append-only files, using ideas that often go back to Seltzer et al’s log structured filesystem paper in 1993. One key justification for that seminal project was the observation that operating-system buffer/page caches absorb most reads, so the access pattern as it hits the filesystem is write-dominated and that’s the case for which the filesystem should be optimized. We’ll get back to that point in a moment. In such a log-oriented approach, writes are handled as simple appends to the latest in a series of logs. Usually, the size of a single log file is capped, and when one log file fills up another is started. When there are enough log files, old ones are combined or retired based on whether they contain updates that are still considered relevant – a process called compaction in several current projects, but also known by other names in other contexts. Reads are handled by searching through the accumulated logs for updates which overlap with what the user requested. Done naively, this could take linear time relative to the number of log entries present, so in practice the read path is often heavily optimized using Bloom filters and other techniques so it can actually be quite efficient. This leads me to a couple of tangential observations about how such solutions are neither as novel nor as complete as some of their more strident champions would have you believe.

  • The general outline described above is pretty much exactly what Steven LeBrun and I came up with in 2003/2004, to handle “timeline” data in Revivio’s continuous data protection system. This predates the publication of details about Dynamo in 2007, and therefore all of Dynamo’s currently-popular descendants as well.
  • Some people seem to act as though immutable files are always and everywhere superior to update-in-place solutions (including soft updates or COW), apparently unaware that they’re just making the complexity of update-in-place Somebody Else’s Problem. When you’re creating and deleting all those immutable files within a finite pool of mutable disk blocks, somebody else – i.e. the filesystem – has to handle all of the space reclamation/reuse issues for you, and they do so with update-in-place.

Despite those caveats, the log-oriented approach can be totally awesome and designers should generally consider it first especially when lookups are by a single key in a flat namespace. You could theoretically handle multiple keys by creating separate sets of Bloom filters etc. for each key, but that can quickly become unwieldy. It also makes writes less efficient, and – as noted previously – write efficiency is one of the key justifications for this approach in the first place. At some point, or for some situations, a different solution might be called for.

The other common approach to data protection is copy on write or COW (as represented by WAFL, ZFS, or btrfs) or its close cousin soft updates. In these approaches, blocks are updated in place, but with very careful attention paid to where and/or when individual block updates actually hit disk. Most commonly, all blocks are either explicitly or implicitly related as parts of a tree. Updates occur from leaves to root, copying old blocks into newly allocated space and then modifying the new copies. Ultimately all of this new space is spliced into the filesystem with an atomic update at the root – the superblock in a filesystem. It’s contention either at the root or on the way up to it that accounts for much of the complexity in such systems, and for many of the differences between them. The soft-update approach diverges from this model by doing more updates in place instead of into newly allocated space, avoiding the issue of contention at the root but requiring even more careful attention to write ordering. Here are a few more notes.

  • When writes are into newly allocated space, and the allocator generally allocates seqential blocks, the at-disk access pattern can be strongly sequential just as with the more explicitly log-oriented approach.
  • The COW approach lends itself to very efficient snapshots, because each successive version of the superblock (or equivalent) represents a whole state of the filesystem at some point in time. Garbage collection becomes quite complicated as a result, but the complexity seems well worth it.
  • There’s a very important optimization that can be made sometimes when a write is wholly contained within a single already-allocated block. In this case, that one block can simply be updated in place and you can skip a lot of the toward-the-root rigamarole. I should apply this technique to VoldFS. Unfortunately, it doesn’t apply if you have to update mtime or if you’re at a level where “torn writes” (something I forgot to mention in my “how to lose data” post) are a concern.

It’s worth noting also that, especially in a distributed environment, these approaches can be combined. For example, VoldFS itself uses a COW approach but most of the actual or candidate data stores from which it allocates its blocks are themselves more log-oriented. As always it’s horses for courses, and different systems – or even different parts of the same system – might be best served by different approaches. That’s why I thought it was worth describing multiple alternatives and the tradeoffs between them.