Canned Platypus

Saving the world one byte at a time since 2000

Archive for the ‘distributed’ Category

David Strauss tweeted an interesting comment about using filesystems (actually he said “block devices” but I think he really meant filesystems) for scale and high availability. I thought I was following him (I definitely am now) but in fact I saw the comment when it was retweeted by Jonathan Ellis. The conversation went on a while, but quickly reached a point where it became impossible to fit even a minimally useful response under 140 characters, so I volunteered to extract the conversation into blog form.

Before I start, I’d like to point out that I know both David and Jonathan. They’re both excellent engineers and excellent people. I also don’t know the context in which David originally made his statement. On the other hand, NoSQL/BigData folks pissing all over things they’re too lazy to understand has been a bit of a hot button for me lately (e.g. see Stop the Hate). So I’m perfectly willing to believe that David’s original statement was well intentioned, perhaps a bit hasty or taken out of context, but I also know that others with far less ability and integrity than he has are likely to take such comments even further out of context and use them in their ongoing “filesystems are irrelevant” marketing campaign. So here’s the conversation so far, rearranged to show the diverging threads of discussion and with some extra commentary from me.

DavidStrauss Block devices are the wrong place scale and do HA. It’s always expensive (NetApp), unreliable (SPOF), or administratively complex (Gluster).

Obdurodon Huh? GlusterFS is *less* administratively complex than e.g. Cassandra. *Far* less. Also, block dev != filesystem.

Obdurodon It might not be the right choice for any particular case, but for reasons other than administrative complexity.
What reasons, then? Wrong semantics, wrong performance profile, redundant wrt other layers of the system, etc. I think David and I probably agree that scale and HA should be implemented in the highest layer of any particular system, not duplicated across layers or pushed down into a lower layer to make it Somebody Else’s Problem (the mistake made by every project to make the HDFS NameNode highly available). However, not all systems have the same layers. If what you need is a filesystem, then the filesystem layer might very well be the right place to deal with these issues (at least as they pertain to data rather than computation). If what you need is a column-oriented database, that might be the right place. This is where I think the original very general statement fails, though it seems likely that David was making it in a context where layering two systems had been suggested.

DavidStrauss GlusterFS is good as it gets but can still get funny under split-brain given the file system approach: http://t.co/nRu1wNqI
I was rather amused by David quoting my own answer (to a question on the Gluster community site) back at me, but also a bit mystified by the apparent change of gears. Wasn’t this about administrative complexity a moment ago? Now it’s about consistency behavior?

Obdurodon I don’t think the new behavior (in my answer) is markedly weirder than alternatives, or related to being a filesystem.

DavidStrauss It’s related to it being a filesystem because the consistency model doesn’t include a natural, guaranteed split-brain resolution.

Obdurodon Those “guarantees” have been routinely violated by most other systems too. I’m not sure why you’d single out just one.
I’ll point out here that Cassandra’s handling of Hinted Handoff has only very recently reached the standard David seems to be advocating, and was pretty “funny” (to use his term) before that. The other Dynamo-derived projects have also done well in this regard, but other “filesystem alternatives” have behavior that’s too pathetic to be funny.

DavidStrauss I’m not singling out Gluster. I think elegant split-brain recovery eludes all distributed POSIX/block device systems.
Perhaps this is true of filesystems in practice, but it’s not inherent in the filesystem model. I think it has more to do with who’s working on filesystems, who’s working on databases, who’s working on distributed systems, and how people in all of those communities relate to one another. It just so happens that the convergence of database and distributed-systems work is a bit further along, but I personally intend to apply a lot of the same distributed-system techniques in a filesystem context and I see no special impediment to doing so.

DavidStrauss #Gluster has also come a long way in admin complexity, but high-latency (geo) replication still requires manual failover.

Obdurodon Yes, IMO geosync in its current form is tres lame. That’s why I still want to do *real* wide-area replication.

DavidStrauss Top-notch geo replication requires embracing split-brain as a normal operating mode and having guaranteed, predictable recovery.

Obdurodon Agreed wrt geo-replication, but that still doesn’t support your first general statement since not all systems need that.

DavidStrauss Agreed on need for geo-replication, but geo-repl. issues are just an amplified version of issues experienced in any cluster.
As I’ve pointed out before, I disagree. Even systems that do need this feature need not – and IMO should not – try to do both local/sync and remote/async replication within a single framework. They’re different beasts, most relevantly with respect to split brain being a normal operating mode. I’ve spent my share of time pointing out to Stonebraker and other NewSQL folks that partitions really do occur even within a single data center, but they’re far from being a normal case there and that does affect how one arranges the code to handle it.

Obdurodon I’m loving this conversation, but Twitter might not be the right forum. I’ll extract into a blog post.

DavidStrauss You mean complex, theoretical distributed systems issues aren’t best handled in 140 characters or less? :-)

I think that about covers it. As I said, I disagree with the original statement in its general form, but might find myself agreeing with it in a specific context. As I see it, aggregating local filesystems to provide a single storage pool with a filesystem interface and aggregating local filesystems to provide a single storage pool with another interface (such as a column-oriented database) aren’t even different enough to say that one is definitely preferable to the other. The same fundamental issues, and many of the same techniques, apply to both. Saying that filesystems are the wrong way to address scale is like saying that a magnetic #3 Phillips screwdriver is the wrong way to turn a screw. Sometimes it is exactly the right tool, and other times the “right” tool isn’t as different from the “wrong” tool as its makers would have you believe.

While we were in Ann Arbor last month, we stopped by the abolutely amazing Kaleidoscope used and rare bookstore. (I’d link, but can’t find a website.) I knew from our last visit that they have an excellent collection of old sci-fi magazines, so I decided to see if they had any from the month I was born – April 1965. Sure enough, they had a Galaxy from that month. I was surprised how many of the authors I recognized. Here are the stories mentioned on the cover:

  • “Wasted on the Young” by John Brunner
  • “War Against the Yukks” by Keith Laumer
  • “A Wobble in Wockii Futures” by Gordon R. Dickson
  • “Committee of the Whole” by Frank Herbert

That’s an all-star cast right there. However, the story that really made an impression on me was by someone I had never heard of – “The Decision Makers” by Joseph Green. It’s about an alien-contact specialist sent to decide whether a newly discovered species met relevant definitions of intelligence which would interfere with a planned terraforming operation. That’s pretty standard stuff for the SF of the time, but there’s a twist; the aliens, which are called seals, have a sort of collective intelligence which complicates the protagonist’s job. This leads to the passage that might be of interest to my usual technical audience.

Our group memory is an accumulated mass of knowledge which is impressed on the memory areas of young individuals at birth, at least three such young ones for each memory segment. We are a short-lived race, dying of natural causes after eight of your years. As each individual who carries a share of the memory feels death approaching he transfers his part to a newly born child, and thus the knowledge is transferred from generation to generation, forever.

Try to remember that this was written in 1965, long before the networked computer systems today were even imagined, and that the author wasn’t even writing about computers. He was trying to tell a completely different kind of story; the entire excerpt above could have been omitted entirely without affecting the plot. Nonetheless, he managed to describe a form of what we would now call sharding, with replication and even deliberate re-replication to preserve availability. The result should be instantly recognizable to anyone who has studied modern distributed databases such as Voldemort or Riak or Cassandra. A lot of people think of this stuff as cutting edge, but it’s also an incidental part of a barely-remembered story from 1965. Somehow I find that both humbling and hilarious.

Tom Trainer wrote what was supposed to be a thoughtful examination of what “cloud storage” should mean, but it came across as a rather nasty anti-Isilon hit piece. I tried to reply there, but apparently my comment won’t go through until I register with “UBM TechWeb” so they can sell me some crap, so I’m posting my response here. Besides being a defense of an unfairly maligned competitor – mine as well as Tom’s unnamed employer’s – it might help clarify some of the issues around what is or is not “real” cloud storage.

As the project lead for CloudFS, which addresses exactly the kinds of multi-tenancy and encryption you mention, I agree with many of your main points about what features are necessary for cloud storage. Where I disagree is with your (mis)characterization of Isilon to make those points.

* First, their architecture is far from monolithic. Yes, OneFS is proprietary, but that’s a *completely* different thing.

* Second, scaling to 144 servers is actually pretty good. When you look closely at what many vendors/projects claim, you find out that they’re actually talking about clients . . . and any idiot can put together thousands of clients. Conflating node counts with server counts was a dishonest trick when I caught iBrix doing it years ago, and it’s a dishonest trick now. Even the gigantic “Spider” system at ORNL only has 192 servers, and damn few installations need even half of that. It’s probably a support limit rather than an architectural limit. No storage vendor supports configurations bigger than they’ve tested, and testing even 144 servers can get pretty expensive – at least if you do it right. I’m pretty sure that Isilon would raise that limit if somebody asked them for a bigger system and let them use that configuration for testing.

Third, Isilon does have a “global” namespace as that term is usually used – i.e. at a logical level, to mean that the same name means the same thing across multiple servers, just like a “global variable” represents the same thing across multiple modules or processes. Do you expect global variables to be global in a physical sense too? In common usage, people use terms like “WAN” or “multi-DC” or “geo” to mean distribution across physical locations, and critiquing a vendor for common usage of a term makes your article seem like even more of a paid-for attack piece.

Disclaimer: I briefly evaluated and helped deploy some Isilon gear at my last job (SiCortex). I respect the product and I like the people, but I have no other association with either.

Amazon has posted their own explanation of the recent EBS failure. Since I had offered some theories earlier, I think it’s worthwhile to close this out by comparing my theories with Amazon’s explanation. Specifically, I had suggested two things.

  • EBS got into a state where it didn’t know what had been replicated, and fell back to re-replicating everything.
  • There was inadequate flow control on the re-replication/re-mirroring traffic, causing further network overload.

It turns out that both theories were slightly correct but mostly incorrect. Here’s the most relevant part of Amazon’s account.

When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.

the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly

The first part refers to the sort of full re-mirroring that I had mentioned, although it was re-mirroring to a new replica instead of an old one. The last part is a classic congestion-collapse pattern: transient failure, followed by too-aggressive retries that turn the transient failure into a persistent one. I had thought this would apply to the data traffic, but according to Amazon it affected the “control plane” instead. This is also what caused it to affect multiple availability zones, since the control plane – unlike the data plane – spans availability zones within a region.

The most interesting parts, to me, are the mentions of actual bugs – one in EBS and one in RDS. Here are the descriptions.

There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror.

Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required.

These bugs represent an important lesson for distributed-system designers: bugs strike without regard for location. Careful sharding and replication across machines and even sites won’t protect you against a bug that exists in every instance of the code. A while back, when I was attending the UCB retreats because of OceanStore, the Recovery Oriented Computing” folks were doing some very interesting work on correlated failures. I remember some great discussions about distributing a system not just across locations but across software types and versions as well. This lesson has stuck with me ever since. For example, in iwhd the extended replication-policy syntax was developed with a specific goal of allowing replication across different back-end types (e.g. S3, OpenStack) or operating systems as well as different locations. Maybe distributing across different software versions wouldn’t have helped in Amazon’s specific case if the bugs involved have been in there long enough, but it’s very easy to imagine a related scenario in which having different versions with different mirror-retry strategies in play (same theory behind multiple hashes in Stochastic Fair Blue BTW) might at least have avoided one factor contributing to the meltdown.

Since my last article on the subject, a couple of other folks have tried to use the EBS failure to pimp their own competing solutions. Joyent went first, with Network Storage in the Cloud: Delicious but Deadly. He makes some decent points, e.g. about “read-only” mounts not actually being read-only, until he goes off the rails about here.

This whole experience — and many others like it — left me questioning the value of network storage for cloud computing. Yes, having centralized storage allowed for certain things — one could “magically” migrate a load from one compute node to another, for example — but it seemed to me that these benefits were more than negated by the concentration of load and risk in a single unit (even one that is putatively highly available).

What’s that about “concentration of load and risk in a single unit”? It’s bullshit, to put it simply. Note the conflation of “network storage” in the first sentence with “centralized storage” in the second. As Bryan himself points out in the very next paragraph, the fallback to local storage has forced them to “reinvest in technologies” for replication, migration, and backup between nodes. That’s not reinvesting, that’s reinventing – of wheels that work just fine in systems beyond those Bryan knows. Real distributed storage doesn’t involve that concentration of load and risk, because it’s more than just a single server with failover. Those of you who follow me on Twitter probably noticed my tweet about people whose vision of “distributed” doesn’t extend beyond that slight modification to an essentially single-server world view. Systems like RBD/Sheepdog, or Dynamo and its derivatives if you go a little further afield, don’t have the problems that naive iSCSI or DRBD implementations do.

Next up is Heroku, with their incident report which turned into an editorial. They actually make a point I’ve been making for years.

2) BLOCK STORAGE IS NOT A CLOUD-FRIENDLY TECHNOLOGY. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can. Block storage has physical locality that can’t easily be transferred.

OK, that last part isn’t quite right. Block storage has no more or less physical locality than file or database storage; it all depends on the implementation. However, block storage does have another property that makes it cloud-unfriendly: there’s no reasonable way to share it. Yes, cluster filesystems that allow such sharing do exist. I even worked on one a decade ago. There are a whole bunch of reasons why they’ve never worked out as well as anyone hoped, and a few reasons why they’re a particularly ill fit for the cloud. In the cloud you often want your data to be shared, but the only way to share block storage is to turn it into something else (e.g. files, database rows/columns, graph nodes) at which point you’re sharing that something else instead of sharing the block storage itself. Just about every technology you might use to do this can handle its own sharding/replication/etc. so you might as well cut out the middle man and run them on top of local block storage. That’s the only case where local block storage makes sense, because it explicitly does not need to be shared and is destined for presentation to users in some other form. Even in the boot-image case, which might seem to involve non-shared storage, there’s actually sharing involved if your volume is a snapshot/clone of a shared template. Would you rather wait for every block in a multi-GB image to be copied to local disk before your instance can start, or start up immediately and only copy blocks from a snapshot or shared template as needed? In all of these cases, the local block storage is somehow virtualized or converted ASAP instead of being passed straight through to users. The only reason for the pass-through approach is performance, but if you’re in the cloud you should be achieving application-level performance via horizontal scaling rather than hyper-optimization of each instance anyway so that’s a weak reason to rely on it except in a few very specialized cases such as virtual appliances which are themselves providing a service to the rest of the cloud.

Apparently the AWS data center in Virginia had some problems today, which caused a bunch of sites to become unavailable. It was rather amusing to see which of the sites I visit are actually in EC2. It was considerably less amusing to see all of the people afraid that cloud computing will make their skills obsolete, taking the opportunity to drum up FUD about AWS specifically and cloud computing in general. Look, people: it was one cloud provider on one day. It says nothing about cloud computing generally, and AWS still has a pretty decent availability record (performance is another matter). Failures occur in traditional data centers too, whether outsourced or run by in-house staff. Whether you’re in the cloud or not, you should always “own your availability” and plan for failure of any resource on which you depend. Sites like Netflix that did this in AWS, by setting up their systems in multiple availability zones, were able to ride out the problems just fine. The problem was not the cloud; it was people being lazy and expecting the cloud to do their jobs for them in ways that the people providing the cloud never promised. Anybody who has never been involved in running a data center with availability at least as good as Amazon’s, but who has nevertheless used this as an excuse to tell people they should get out of the cloud, is just an ignorant jerk.

The other interesting thing about the outage is Amazon’s explanation.

8:54 AM PDT We’d like to provide additional color on what were working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones

I find this interesting because of what it implies about how EBS does this re-mirroring. How does a network event trigger an amount of re-mirroring (apparently still in progress as I write this) so far in excess of the traffic during the event? The only explanation that occurs to me, as someone who designs similar systems, is that the software somehow got into a state where it didn’t know what parts of each volume needed to be re-mirrored and just fell back to re-mirroring the whole thing. Repeat for thousands of volumes and you get exactly the kind of load they seem to be talking about. Ouch. I’ll bet somebody at Amazon is thinking really hard about why they didn’t have enough space to keep sufficient journals or dirty bitmaps or whatever it is that they use to re-sync properly, or why they aren’t using Merkle trees or some such to make even the fallback more efficient. They might also be wondering why the re-mirroring isn’t subject to flow control precisely so that it won’t impede ongoing access so severely.

Without being able to look “under the covers” I can’t say for sure what the problem is, but it certainly seems that something in that subsystem wasn’t responding to failure the way it should. Since many of the likely-seeming failure scenarios (“split brain” anyone?) involve a potential for data loss as well as service disruption, if I were a serious AWS customer I’d be planning how to verify the integrity of all my EBS volumes as soon as the network problems allow it.

At the Linux Foundation’s recent End User Summit, I had the pleasure of meeting K.S. Bhaskar from FIS. Recently he wrote an article on his blog about Eventual State Consistency vs. Eventual Path Consistency in which he has some particularly interesting things about different kinds of consistency guarantees.

there are applications where detection of colliding updates does not suffice to ensure Consistency (where Consistency is defined as the database always being within the design parameters of the application logic – for example, the “in balance” guarantee that the sums of the assets and liabilities columns of a balance sheet are always equal).

He then gives an example showing an apparent problem with two financial transactions and their associated service charges, across two sites while a service-charge rate change is still “in flight” between them. I originally responded there, but my reply seems to have disappeared. Maybe it got lost due to a conflict with a subsequent update. ;) In any case, I might as well respond here because I think his example highlights an important issue. I don’t think Bhaskar’s example really demonstrates the problem he had described. In the last step he says that

B detects a collision, since it observes that the data that was read by the application logic on A to compute the result of transaction P has changed on B

How could B observe such a thing? Only if it knew either the data that was read on A (i.e. the service-charge rate in effect for the transaction was included as part of the replication request) or the exact replication state on A at the time P was processed there (e.g. by using vector clocks or similar). Either way, it would have enough information to replicate the transaction in a consistent fashion.

The real problem would be if B didn’t know whether or not the rate change had reached A yet when P was processed there. That would result in B needing to distinguish between two possible states that would have to be handled differently, but with no way to make that distinction. The general rule to avoid these kinds of unresolvable conflicts is: don’t pass around references to values that might be inconsistent across systems. It’s like passing a pointer from one address space to a process in another; you just shouldn’t expect it to work. Either pass around the actual values or do calculations involving those values and replicate the result. For example, consider the following replication requests.

# $var indicates immediate substitution from the original context
# %var indicates a transaction-local variable

# Wrong: sc_rate is passed by reference and interpreted at destination
replicate transaction "transfer #zzz" {
    acct_x -= $amt * (1.0 + sc_rate);
    acct_y += $amt;
}

# Right: sc_rate is interpreted at source and passed by value
replicate transaction "transfer #zzz" {
    %sc_rate = $sc_rate;
    acct_x -= $amt * (1.0 + %sc_rate);
    acct_y += $amt;
}

# Right: service charge is calculated at source
# works, but not good for auditing
amt_with_sc = amt * (1.0 + sc_rate)
replicate transaction "transfer #zzz" {
    acct_x -= $amt_with_sc;
    acct_y += $amt;
}

# Right: service charge as separate transaction
sc = amt * sc_rate;
replicate transaction "transfer #zzz" {
    acct_x -= $amt;
    acct_y += $amt;
}
replicate transaction "service charge for #zzz" {
    acct_x -= $sc;
}

In an ideal world, the interface and behavior for the replication subsystem would disallow or strongly discourage the wrong form. For example, it could require that any values meant to be interpreted or modified at the destination must be explicitly listed or tagged, and reject anything that abuses “extraneous” variables as in the first form above. (Auto-conversion of the first form into the second is likely to introduce its own kinds of unexpected behavior.) That would force people to use one of the methods that actually works.

When Alex Popescu wrote about scalable storage solutions and I said that the omission of distributed filesystems made me cry, he suggested that I could write an introduction. OK. Here it is.

All filesystems – even local ones – have a similar data model and API model. The data model consists of files inside directories, where both have user-assigned names. In most modern filesystems directories can be nested, file contents are byte-addressable, and names are free-form character sequences. The API model, commonly referred to as POSIX after the standard of the same name, includes two broad categories of calls – those that operate on files, and those that operate on the “namespace” of files within directories. Examples of the first category include open, close, read, write, and fsync. Examples of the second category include opendir, readdir, link, unlink, and rename. People who actually develop filesystems, especially in a Linux context, often talk in terms of a different three-way distinction (file/inode/dirent operations) but that has more to do with filesystem internals than with the API users see. The other thing about filesystems is that they’re integrated into the operating system. Any program should be able to use any filesystem without using special libraries. That makes real filesystems a bit harder to implement, but it also makes them more generally useful than impostors that just have “FS” in the name to imply functionality they don’t have. There are many ways to categorize filesystems – according to how they’re accessed, how they’re implemented, what they’re good for, and so on. In the context of scalable storage solutions, though, the most important groupings are these.

  • A local filesystem is local to a single machine, in the sense that only a process on the same machine can make POSIX calls to it. That process might in fact be a server for some “higher level” kind of filesystem, and in fact local filesystems are an essential building block for most others, but for this to work the server must make a new local-filesystem call which is not quite the same as continuing the client’s call.
  • A network filesystem is one that can be shared, but where each client communicates with a single server. NFS (versions 2 and 3) and CIFS (a.k.a. SMB which is what gives Samba its name) are the best known examples of this type. Servers can of course be clustered and made highly available and so on, but this must be done transparently – almost behind the clients’ backs or under their noses. This approach fundamentally only allows vertical scaling, and the trickery necessary to scale horizontally within a single-server basic model can become quite burdensome.
  • A distributed filesystem is one in which clients actually know about and directly communicate with multiple servers (of one or more types). Lustre, PVFS2, GlusterFS, and Ceph all fit into this category despite their other differences. Unfortunately, the term “distributed filesystem” makes no distinction between filesystems distributed across a fast and lossless LAN and those distributed across a WAN with exactly opposite characteristics. I sometimes use “near-distributed” and “far-distributed” to make this distinction, but as far as I know there’s no concise and commonly accepted terms. AFS is the best known example of a far-distributed filesystem, and one of the longest-lived filesystems in any category (still in active large-scale use at several places I know of).
  • A parallel filesystem is a distributed filesystem in which a single file, or even a single I/O request, can be striped across multiple servers. This is primarily beneficial in terms of performance, but can also help to distribute capacity more evenly than if every file lives on exactly one server. I’ve often used the term to refer to near-distributed filesystems as distinct from their far-distributed cousins, because there’s a high degree of overlap, but it’s not technically correct. There are near-distributed filesystems that aren’t parallel filesystems (GlusterFS is usually configured this way) and parallel filesystems that are not near-distributed (Tahoe-LAFS and other crypto-oriented filesystems might fit this description).
  • A cluster or shared-storage filesystem is one in which clients are directly attached to shared block storage. GFS2 and OCFS2 are the best known examples of this category, which also includes MPFS. Once touted as a performance- or scale-oriented solution, these are now positioned mainly as availability solutions with a secondary emphasis on strong data consistency (compared to the weak consistency offered by many other network and distributed filesystems). Due to this focus and the general characteristics of shared block storage, the distribution in this case is always near.

This set of distinctions is certainly neither comprehensive nor ideal, as illustrated by pNFS which allows multiple “layout” types. With a file layout, pNFS would be a distributed filesystem by these definitions. With a block layout, it would be a cluster filesystem. With an object layout, a case could be made for either, and yet all three are really the same filesystem with (mostly) the same protocol and (definitely) the same warts.

One of the most important distinctions among network/distributed/cluster filesystems, from a scalability point of view, is whether it’s just data that’s being distributed or metadata as well. One of the issues I have with Lustre, for example, is that it relies on a single metadata server (MDS). The Lustre folks would surely argue that having a single metadata server is not a problem, and point out that Lustre is in fact used at some of the most I/O-intensive sites in the world without issue. I would point out that I have actually watched the MDS melt down many times when confronted with any but the most embarrassingly metadata-light workloads, and also ask why they’ve expended such enormous engineering effort – on at least two separate occasions – trying to make the MDS distributed if it’s OK for it not to be. Similarly, with pNFS you get distributed data but only some pieces of the protocol (and none of any non-proprietary implementation) to distribute metadata as well. Anybody who wants a filesystem that’s scalable in the same way that non-filesystem data stores such as Cassandra/Riak/Voldemort are scalable would and should be very skeptical of claims made by advocates of a distributed filesystem with non-distributed metadata.

A related issue here is of performance. While near-distributed parallel filesystems can often show amazing megabytes-per-second numbers on large-block large-file sequential workloads, as a group they’re notoriously poor for random or many-small-file workloads. To a certain extent this is the nature of the beast. If files live on dozens of servers, you might have to contact dozens of servers to list a large directory, or the coordination among those servers to maintain consistency (even if it’s just metadata consistency) can become overwhelming. It’s harder to do things this way than to blast bits through a simple pipe between one client and one server without any need for further coordination. Can Ma’s Pomegranate project deserves special mention here as an effort to overcome this habitual weakness of distributed filesystems, but in general it’s one of the reasons many have sought alternative solutions for this sort of data.

So, getting back to Alex’s original article and my response to it, when should one consider using a distributed filesystem instead of an oh-so-fashionable key/value or table/document store for one’s scalable data needs? First, when the data and API models fit. Filesystems are good at hierarchical naming and at manipulating data within large objects (beyond the whole-object GET and PUT of S3-like systems), but they’re not so good for small objects and don’t offer the indices or querying of databases (SQL or otherwise). Second, it’s necessary to consider the performance/cost curve of a particular workload on a distributed filesystem vs. that on some other type of system. If there’s a fit for data model and API and performance, though, I’d say a distributed filesystem should often be preferred to other options. The advantage of having something that’s accessible from every scripting language and command-line tool in the world, without needing special libraries, shouldn’t be taken lightly. Getting data in and out, or massaging it in any of half a million ways, is a real problem that isn’t well addressed by any storage system with a “unique” API (including REST-based ones) no matter how cool that storage system might be otherwise.

I’m in Tempe for the Fedora Users and Developers Conference, a.k.a. FUDCon. Here are some random thoughts.

  • Enhanced pat-downs aren’t so bad.
  • The weather’s nice. I should have expected the palm trees, but I totally didn’t expect to see orange trees with ripe fruit hanging just out of arm’s reach (because the ASU students picked everything lower already).
  • The ASU campus is much more interesting and varied architecturally than any other campus I’ve been on. Sure, the color palette is a bit limited – light brown, dark brown, reddish brown – but the shapes and textures make up for it. Actually there was one nice splash of color, which was a gigantic wild rose bush clinging to the side of one building. That ugly bump just north of campus doesn’t do much for me, though.
  • I haven’t seen a single squirrel on campus. I did see two cats, though – fluffy persians who must be very uncomfortable in all this heat. I’ve seen and heard lots of unfamiliar birds, too – mostly grackles, I think
  • Meeting people in person is great. The Fedora crowd is notably casual, international, and friendly – even by technical-conference standards, in all three regards. I’d particularly like to thank Robyn Bergeron and Seth Vidal, very busy leaders in that community who have nonetheless gone out of their way to make me feel welcome and included. It was also especially nice to meet Pete Zaitcev and Major Hayden, because we’ve interacted so much online but never met until now.

Here’s a Flickr set for some of the pictures I’ve taken while here. OK, enough of the fluff. What about the real stuff? More bullet points, because that’s how I roll.

  • The whole “Bar Camp” style of pitching and voting on sessions was new to me. It did seem to work, though.
  • The first talk I attended was Marek Goldmann talking about BoxGrinder. I was pretty familiar with this work from my own involvement with Deltacloud/Aeolus, but Marek deserves kudos for presenting it well and even giving a live demo.
  • After lunch, it was Steven Dake talking about Sheepdog. Again, it’s work I’m familiar with. I think Steven and I will never quite agree on the value/importance of Sheepdog. On the one hand, the notion of distributed block storage has been very appealing to me for a long time. It’s why I went to Conley in 1998, and worked on C3D at EMC a few years later. On the other hand, block storage using a single specialized application interface which isn’t even as complex as the real system-level block device interface seems a bit unambitious to me. It just limits the applicability of the result too much IMO, and that seems a meager payoff for all that work solving the harder distributed-data problems. Of course, in this case it’s all NTT’s effort anyway. As far as the talk, a comparison to RBD would have been nice since anybody who’s interested in one should definitely check out the other as well.
  • Next up was Mike McGrath, talking about how cloud computing is going to displace non-cloud computing. Even as somebody who’s working on cloud stuff, I’m a little bit skeptical. Still, it was a good talk to get people thinking about all the implications.
  • I’ll skip the next talk, since it was mine and I’ll have more to say below.
  • The last talk of the day, for me, was Chris Lalancette talking about cloud management – especially Deltacloud and Aeolus. Having worked for a while on this project (and sitting about twenty feet from Chris most days) this was also pretty familiar territory, and Chris did a good job presenting on a complex subject. I apologize to both him and to Tobias Kunze (with whom I had an awesome chat later in the evening BTW) for putting them on the spot about the relationship between Makara and Aeolus.

So, how did my own presentation go? Somebody pointed out that I’d seemed a bit on edge the night before. Partly that was just the stress of travel and of being an introvert mingling with an unfamiliar group of people, but there’s another factor that I hadn’t even consciously realized until I was writing this post. I’ve presented about CloudFS privately and/or in fairly abstract terms so many times that I’d actually forgotten this was the first truly public presentation about a concrete thing that I’ll actually be delivering in the near future. That’s a big deal. I was a bit concerned at first because they’d put me in the largest room and at five past the hour it was still three-quarters empty. Nobody likes talking to an empty room. Shortly after I started, though, the room was pretty much full – not standing-room-only full, but I don’t remember seeing many empty seats. Not that I was trying too hard to count, of course; I was otherwise occupied. Even better, people were engaged. There were many questions, and they were good questions – questions that to me indicated genuine curiosity and constructive intent, not just the “I’m going to prove I’m smart” or “if you don’t get this one right your project will look silly” kinds of questions that one often gets. The post-presentation chatter even went on so long that Chris had to kick us away from the lectern. Good problem to have. :)

The best part of all, in my opinion, was outside of the talk itself. In at least two other presentations, and in even more hallway conversations, the possibility of using CloudFS to solve some problem or add some functionality came up. Also, at least one person had clearly given the code a pretty detailed look since my talk, asking questions and making comments about internal details that he could not have known about otherwise. That is so cool. It’s all very well to have people’s attention for an hour or so before people move on to the next new thing, but when something you’ve talked about shows up in colleagues’ own thinking about how to solve their own problems that’s an even surer measure of being on the right track. Thank you, everyone, for letting me be part of the broader progress we’re all making together.

In case people are wondering why I haven’t been posting here, it’s partly being busy at work, partly being busy with Christmas stuff, partly being sick with what has come to be known in the office as the Creeping Death, and partly because I’ve been writing a lot over at CloudFS.org. If you’re interested in the why and what of CloudFS, go check it out.