Archive for December, 2010

Initial Performance Results

My pseudo-vacation seemed like a good time to run some performance/stress tests on EC2. As I expected, this gave me a chance to fix some problems that I’d never seen on my own machines at Red Hat, or in smaller-scale testing, so it was definitely worthwhile just for that. The real goal, though, was to establish some performance baselines and measure overhead for some of the CloudFS functionality, so that I can make sure we’re not paying to dearly as that functionality becomes more complex. I set up three m1.large servers and one m1.large client, all running the official Fedora 14 x86_64 images for EC2. Actually I tried running Amazon’s own Linux distro, and even figured out how to make the GlusterFS packaging recognize it as a copy of RHEL (mostly /etc/system-release instead of redhat-release), but then the performance was dreadful. This is similar to what happened when I tried running similar tests on Rackspace a few weeks back. I could debug either case, but it doesn’t seem like it should be a high priority when I already have something that seems to work pretty well. To continue the methodological description…

  • My own build of GlusterFS+CloudFS, all with -O2
  • Simple three-way DHT setup (no replication) with write-behind and io-cache disabled
  • iozone -c -e -r 1m -s 1280m -i 0 -i 1 -l 30

I used 30 threads to get a decent distribution across the three servers, and moderate parallelism on the client. That times the file size means that the data is about 20% more than the total memory of all four systems; this doesn’t entirely eliminate cache effects, but does bring them down to real-world levels. Here are the results for a “vanilla” config, for vanilla plus only the “cloud” (tenant-isolation) translator, and for vanilla plus only the “crypt” (at-rest encryption) translator.

(MB/s) Initial Write Re-write Initial Read Re-read
Baseline 47.6 84.8 82.2 82.2
+cloud 55.2 85.9 82.1 82.1
+crypt 37.1 37.8 30.9 31.0

The baseline figure is actually pretty respectable, though the delta between first and subsequent writes was a bit surprising. There was definitely some evidence of contention/starvation, as one server or another would sit at zero blocks read/written for several seconds at a time. As it turns out, the distribution between the three servers was also surprisingly uneven, so the third server would always finish significantly before the other two. That brings the results down a bit, but it’s more realistic than if I’d forced a more even distribution.

The biggest surprise was that the cloud-translator results were actually slightly better than the baseline. Mostly I think that’s just the variability of performance that EC2 is known for, but at least it shows that the overhead for adding the translator is below the noise. That’s good news; we’ll see if the situation remains the same when I add UID/GID mapping. The general lousiness of the crypto results was not a surprise at all. The starvation behavior was even more pronounced, with servers sometimes sitting idle for most of a minute before getting more work to do (remember that this is pure client-side encryption). The single glusterfs process on the client was pegged during writes, and nearly so during reads. That’s actually good news too, since it implies that there should be a pretty easy-to-achieve performance boost from multi-threading the encryption/decryption stage either using the io-threads translator or otherwise. The best news of all, in my opinion, is that all of these tests managed to complete without failures or data errors. I did have one memory leak in the crypt translator, which I fixed, but things looked very good after that.

That should be enough performance testing for this year. ;) When things pick up after the weekend, I’ll probably have more results like these to report.


CloudFS: How

One fundamental tenet of CloudFS is that, with a modern horizontally scalable filesystem as a technology base, a single large filesystem on dedicated hardware and run by dedicated experts can be more robust and more efficient than having non-expert users configure and run their own filesystems on hardware provisioned primarily for a virtualized compute/network workload. To do this, though, we need to enable secure sharing of that filesystem resource, and that’s what CloudFS does for GlusterFS. To understand how it does that, it’s easiest to follow the steps in processing a single I/O request to see how the various GlusterFS and CloudFS components on both clients and servers work together. Before we do that, though, we have to nail down a single fundamental concept: the translator.

A GlusterFS translator is an example of a filter or proxy pattern, where a software component takes a request expressed in some format and protocol, then translates it into one or more requests in the same format and protocol for further processing by the next component. This pattern allows many different kinds of functionality to be layered on top of one another in all sorts of different orders, and even to be relocated from one side to another of the client/server divide. It’s a lot like UNIX shell pipelines (using ssh/expect or similar to execute some commands elsewhere) or STREAMS/sockets except that the data items are filesystem requests instead of text lines or network messages. Most of GlusterFS’s functionality is implemented as translators using a common core and a common API to communicate with one another. For example, the “dht” or “distribute” translator takes a request and uses consistent hashing to direct it to one of several “subvolumes” which usually (but not necessarily) live on different servers. The “afr” or “replicate” translator takes a request and directs it to multiple subvolumes, bracketed by extended-attribute requests that it generates internally to ensure that the proper state can be recovered if the servers behind any of those subvolumes fail. Communication is handled by “client” and “server” translators, which are really kind of half-translators using the translator API on one side and an RPC protocol on the other. Then there are several other performance translators to do things like read-ahead, write-behind, directory/attribute prefetching and caching. These can all be added and removed and reordered and relocated, and even duplicated – e.g. caching both on the client and server – to suit users’ needs. In accordance with the “minimize communication” and “most work at the most numerous nodes” rules of scalable systems, translators that redirect or fan out requests among multiple subvolumes (such as dht or afr) are usually implemented at the client, while translators that pass requests through to single subvolume or turn them around internally (such as read-ahead or write-behind) can go anywhere – but that’s just a rule of thumb, not a restriction imposed by the software.

So how does CloudFS fit in? If you were to look at the CloudFS git tree right now, you’d see three new translators. CloudFS also uses/enhances some existing (but infrequently used) translators. The easiest way to see how this all works is to follow a single request – let’s say it’s a simple write – through all of the translators it hits, and see what each one does.

  1. We start at the “fuse” translator on the client, which is really another half-translator converting FUSE (filesystem) calls into their translator-API equivalents. Once the conversion is made, this just gets passed “down the stack” to the next translator.
  2. The next stop in the CloudFS world is the existing “quota” translator, which keeps track of how much space the cloud tenant controlling this machine has allocated and generates an error if they try to use too much. This would ideally be done on the server side, but remember that there are many servers. It’s much more convenient to do this above some of the fan-out translators that will follow. Don’t worry, though; usage is tracked for billing purposes on the server, so nobody’s going to over-consume resources (and possibly deny them to others) without paying for them. Assuming we’re within our quota limit, we just pass the request on.
  3. Now we go through the standard client-side translator stack, consisting of several performance translators followed by DHT (to distribute data) and lastly AFR (to replicate it). At this point we’re actually dealing with multiple requests to multiple servers, but we’ll just follow one for the time being.
  4. Next up for this particular sub-request is the new “crypt” translator, which encrypts our data using a key that is never on any of the servers.
  5. Now we go through the new (in fact not written yet) “auth-helper” translator, which just passes the request straight through. Huh? The reason for this apparent no-op is that the auth-helper translator actually does most of its work at connection time, establishing that this connection is associated with a particular authenticated user (“tenant”). Because of the way translator configuration works we’re still in the request path after that, but all we do is hand requests straight through.
  6. At this point we finally cross the client/server divide, using the “client” translator on our side to contact the “server” translator for this sub-request’s target server. This gets us to a “brick” or translator stack that the server had previously exported for clients to use.
  7. This is where CloudFS really takes over from GlusterFS for a moment, through the new “cloud” translator. At this point we determine what tenant is associated with this connection, based on our earlier handshake with auth-helper. Using that identity, we map all of the tenant-specific filesystem identifiers (UIDs and GIDs) contained in the I/O request into their server-side counterparts, and redirect the whole request to a tenant-specific subvolume with its own translator. A lot of this is based on the configuration-rewriting script, which takes a single brick definition and effectively turns it into multiple bricks which get demultiplexed here.
  8. At the top of our tenant-specific translator stack is the new (in fact still to be written) “usage” translator, which updates the usage information – e.g. space used, requests and bytes in/out – for this tenant before passing through. This information will later be collected and aggregated across all servers by the provider to generate tenants’ bills.
  9. Now we go through the standard server-side translator stack, consisting of more performance translators before we finally get to the “posix” translator. This is another half-translator, converting translator-API requests into actual filesystem requests – in our case a write.
  10. Lastly, everything – especially status for the operation – starts flowing back. There’s no need to go through each step, but there are a few special cases worth noting. If this had been a read, then what comes back would include data to be decrypted by the crypt translator on the client. In other cases the returned information might contain global UID or GID information (as stored/retrieved by our local filesystem) to be mapped back into their tenant-specific forms by the cloud translator on the server. In most cases, the usage translator on the server and the quota translator on the client will also update their information as the request flies by.

Phew. That sure is a lot of work, isn’t it? Make no mistake; there is a cost associated with all of this mapping and redirecting and de/multiplexing and encrypting and authenticating and everything else. TANSTAAFL. However, both the hand-offs between translators and a lot of the “fast paths” within translators have been heavily optimized so that the overhead is no more than for other ways of achieving the same functionality (where it exists). Also, some of that overhead gets “bought back” by running native I/O on dedicated and tuned I/O nodes instead of virtual I/O on compute nodes. Lastly, and most importantly, the whole idea of horizontal scalability is that you can add more I/O nodes to get near-linear scaling up to whatever aggregate I/O capability you need. That doesn’t work infinitely, of course, but for real-world problem sizes it’s enough to make good on the claim that it’s better to use the provider’s shared filesystem infrastructure instead of setting up your own.


Fighting the Speed of Light

I just happened across Bound by the Speed of Light, about why NFS can’t be made to go fast over a WAN. The explanation’s the same one you’ll see many places, but I think the interesting thing is that this didn’t start as a research project. This was just people trying to do something that, to them, seemed pretty obvious: extend a single filesystem across their enterprise. We “experts” might know better than to attempt such a thing, but the fact is that real people with real business interests keep trying to do it and will almost certainly continue trying for the foreseeable future. This tendency will only increase as users access their data in and/or from the cloud, which is why we need filesystems that anticipate such “naive” usage.


The Feature That Wasn’t

People who’ve heard me talk about CloudFS have almost certainly heard me talk about what I (and others) consider its killer feature: multi-site replication (MSR). This is an area I’ve been working on for most of my career, it’s why I came to Red Hat to work on CloudFS, it’s probably why they hired me instead of someone else to do it, so why isn’t it represented in the Fedora feature page or in my previous writings here? The simple answer is because I want to do it right. This is a really hard problem that a lot of very smart people have been struggling with for a long time. Those other features might not be as sexy, but they are important. Getting something with those features out there for users to use and potential developers to see is important. Doing that and doing MSR all at once in the same short time-frame just isn’t feasible. There’s also a danger in putting something out there that’s too weak. There was a time when distributed filesystems over-promised and under-delivered, which led to a long period when users turned away from anything bearing that name, and I don’t want to see that history repeat itself. If the goal is to have an open and user-deployable alternative to overpriced proprietary “wide area file service” or “file virtualization” appliances, then we can’t release something half-baked.

So, what is MSR really? The first thing is that it’s primarily a replication model, rather than caching, as I have tried to distinguish the two elsewhere recently. Caching is not precluded by any means, but it’s either an extension or a separate feature. The model that I have in mind is of a distributed enterprise using replication between a small number of large sites or clouds to meet availability and disaster-recovery goals, plus caching at smaller branch offices and such to improve performance. In this model, MSR needs to have the following features.

  • It must preserve the other characteristics of CloudFS, including performance/scalability and encryption.
  • Replication must be distributed across local servers, because funneling all writes through a single replication node would create a bottleneck and violate the requirement to preserve scalability.
  • Replication must be asynchronous, because waiting for a response from a remote site many milliseconds away before acknowledging a write would violate the requirement to preserve performance.
  • Because replication is asynchronous, consistency must be eventual. As I’ve pointed out to people before, “eventual” doesn’t mean slow or unordered or uncertain. It just means consistency is not guaranteed at the precise moment that a write is allowed to complete. In practice, one would be hard pressed to observe any inconsistency during normal operation.
  • Every site must be continuously active and writable, never a passive or read-only standby requiring an explicit role switch.
  • The system must support many sites – more than two (a woefully common restriction), up to dozens or perhaps even hundreds if there can be multiple “sites” within an actual physical location or if branch-office caches are handled within the same framework.
  • The system must not rely on manual or external conflict reconciliation, as in Coda. All conflicts must be resolved internally and simply, even if that means lower consistency.

Without going into too much detail (partly because it’d be tiresome and partly because it might change), the basic approach is as follows:

  • Writes (actually any operations which modify state) are preserved in their original form, not broken up and treated as separate block writes. This is “operation transfer” in the terminology of the excellent paper by Saito and Shapiro.
  • The data store is divided into a “copy of record” containing fully committed data, and a “patch queue” containing operations which might still require some reconciliation with updates occurring elsewhere.
  • Writes are processed first by the replication layer, which stores them in the patch queue. From there, they will be propagated to other sites where they land on those sites’ patch queues.
  • Patches might need to be inserted in the middle of a patch queue, not just at the end.
  • Reads check the local patch queue and “overlay” any relevant patches (operations) atop the copy of record to construct the current data.
  • Patches are “retired” into the copy of record when no conflict (i.e. insertion of another patch before) remains possible.

There’s a lot more, of course. For example, I’ve omitted any description of ordering rules, because that’s such a complex issue that I’ll need a whole post just for that. Other issues that might be discussed in future posts include how to distribute replication traffic among (local and remote) servers, how to handle patch queues efficiently, how to determine when an update no longer requires reconciliation, and how to deal with conflicting updates to opaquely encrypted blocks.

The real point of my mentioning all of this is to illustrate what a large and complex task MSR will be. A lot of work has already been done, but a lot of work still remains to be done and will be done . . . but not right now. I’ve listed the work and I’m working the list, and this particular item just doesn’t fit anywhere but near the end.


CloudFS: What?

Yesterday, I wrote about why there’s a need for CloudFS. Today, I’m going to write about how CloudFS attempts to satisfy that need. As before, there’s a “filesystem” part and a “cloud” part, which I’ll address in that order. Let’s start with the idea that writing a whole new cloud filesystem from scratch would be a mistake. It takes a lot of time and resources to develop, then it’s another long struggle to have it accepted by users. Instead, let’s start by considering what an existing filesystem would need to be like for us to use it as a base for further cloud-specific enhancements.

  • It must support basic filesystem functionality, as outlined in the last post.
  • It must be open source.
  • It must be mature enough that users will at least consider trusting their data to it.
  • It must be shared, i.e. accessible over a network.
  • It must be horizontally scalable. Cloud economics are based on maximizing the ratio between users and resource pools, so the bigger we can make the pools the more everyone will benefit. Since this is a general-purpose and not HPC-specific environment, we need to consider metadata scalability (ops/second) as well as data (MB/second) and we should be very wary of “solutions” that centralize any kind of operation in one place ( distributing too broadly either).
  • It must provide decent data protection, even on commodity hardware with only internal storage. I’m not even going to get into the technical merits of shared storage vs. “shared nothing” because the “fact on the ground” is that the people building large clouds prefer the latter regardless of what we might think.
  • It should ideally support RDMA as well as TCP/IP. This is not strictly necessary, but is highly useful for a server-private “back end” network and is becoming more and more relevant for “front end” networks as higher-performance cloud environments gain traction.

There are really only three production-level distributed/parallel filesystems that come close to these requirements: Lustre, PVFS2, and GlusterFS. There’s also pNFS, which is a protocol rather than a ready-to-use implementation and addresses neither metadata scaling nor data protection, and Ceph, which is in many ways more advanced than any of the others but is still too young to meet the “trust it in production” standard for most people. Then there are dozens of other projects, most of which either don’t provide basic filesystem behavior, aren’t sufficiently mature, and/or don’t provide real horizontal scalability. Some day one of those might well emerge and distinguish itself, but there’s work to be done now.

As tempting as it might be to go into detail about the precise reasons why I consider various alternatives unsuitable, I’m going to refrain. The key point is that GlusterFS is the only one that can indisputably claim to meet all of the requirements above. It does so at a cost, though. It is almost certainly the slowest of those I’ve mentioned, per node on equivalent hardware, at least as such things are traditionally measured. This is not primarily the result of it being based on FUSE, which is often maligned – without a shred of supporting evidence – as the source of every problem in filesystems based on it. The difference is minimal unless you’re using replication, which is necessary to provide data protection within the filesystem itself instead of falling back to the “just use RAID” and “just install failover software (even though we can’t tell you how)” excuses used by some of the others. Every one of these options has at least one major “deficiency” making it an imperfect fit for CloudFS, so the question becomes which problems you’d rather be left with when it comes time to deploy your filesystem as shared infrastructure. When you already have horizontal scalability, low per-node performance is a startlingly easy problem to work around.

The other main advantage of GlusterFS, compared to the others, is extensibility. Precisely because it’s FUSE-based, and because it has a modular “translator” interface besides, it makes extension such as we need to do much easier than if the new code had to be strongly intertwined with old code and debugged in the kernel. Many of the libraries we’ll need to use, such as encryption, are trivially available out in user-space but would require a major LKML fight to use from the kernel. No, thanks. The advantages of higher development “velocity” are hard to understate, and are already apparent in how far GlusterFS has come relative to its competitors since about two years ago (when I tried it and it was awful). If somebody decides later on that they do want to wring every last bit of performance out of CloudFS by moving the implementation into the kernel, that’s still easier after the algorithms and protocols have been thoroughly debugged. I’ve been developing kernel code by starting in user space for a decade and a half, and don’t intend to stop now.

At this point, we’re ready to ask: what features are still missing? If we have decent scalability and decent data protection and so on, what must we still add before we can deploy the result as a shared resource in a cloud environment? It turns out that most of these features have to do with protection – with making it safe to attempt such a deployment.

  • We need strong authentication, not so much for its own sake but because the cloud user’s identity is an essential input to some of the other features we need.
  • We need encryption. Specifically, we need encryption to protect not only against other users seeing our data but also against the much more likely “insider threat” of the cloud provider doing so. This rules out the easy approaches that involve giving the provider either symmetric or private keys, no matter how securely such a thing might be done. Volume-level encryption is out, both for that reason but also because users share volumes so there’s no one key to use. The only encryption that really works here is at the filesystem level and on the client side.
  • We need namespace isolation – e.g. separate subdirectories for each cloud user.
  • We need UID/GID isolation. Since we’re using a filesystem interface, each request is going to be tagged with a user ID and one or more group IDs which serve as the basis for access control. However, my UID=50 is not the same as your UID=50, nor is it the same as the provider’s UID=50. Nonetheless, my UID=50 needs to be stored as some unique UID within CloudFS on the server side, and then converted back later.
  • We need to be able to set and enforce quota, to prevent users from consuming more space than they’re entitled to.
  • We need to track usage, precisely, including space used and data transferred in/out, on a per-user basis for billing purposes.

This is where the translator architecture really comes into its own. Every one of these features can be implemented as a separate translator, although in some cases it might make sense to combine several functions into one translator. For example, the namespace and UID/GID isolation are similar enough that they can be done together, as are quota and usage tracking. What we end up with is a system offering each user similar performance and similar security to what they’d have with a privately deployed filesystem using volume-level encryption on top of virtual block devices, but with only one actual filesystem that needs to be managed by the provider.


CloudFS: Why?

As the name implies, CloudFS is a filesystem for the cloud. What does that mean? First, it means that it’s a filesystem, with the behaviors that people – and programs – expect of filesystems, and not some completely different set of behaviors characteristic of a database or blob store or something else. Here are some examples:

  • You access data in a filesystem by mounting it and issuing a familiar set of open/close/read/write calls so that every language and library and program under the sun that can use a filesystem can use this one.
  • Files are arranged into directories which can be nested arbitrarily, without requiring the user to establish and follow some separate convention on top of a single-level hierarchy.
  • The data model is of a byte stream – not blocks, not records or rows – in which reads and writes can be done at any offset for any length.
  • Performance and consistency for small writes in large files are not reduced to near zero by doing read/modify/write on whole files.
  • Files and directories have owners, permissions, and other information associated with them besides their contents.

You might notice some things that are missing – e.g. locks or atomic cross-directory rename. That’s because most applications don’t rely on these features, often because they’re supported poorly or inconsistently by existing filesystems, and they’re things that anyone working specifically in the cloud should try to avoid. That last part, in turn, is because some features are impossible (or at least nearly so) to implement acceptably in the cloud. If nobody would be satisfied with the result anyway then trying is a waste of time – time that can be better spent implementing other features that really will be needed.

Many of the things that need a filesystem are not whole applications being developed from the ground up. using every possible filesystem feature. They’re libraries and frameworks that are used by other applications, and they just need basic filesystem functionality as I’ve outline above. If you’ve constructed your application out of a dozen such pieces, the very last thing you want to do or should be doing is to dive into each and every one of these alien bits of code to make them use some other kind of data store instead. If you can build your entire application from the ground up to use something else that’s a better fit for your needs than a traditional filesystem, then that’s great. More power to you. Meanwhile, I believe that a much larger number of programmers would be better served by having a plain old-fashioned filesystem . . . albeit one that’s based on the latest technology.

OK, so much for the filesystem part. What about the cloud part? Aren’t there already filesystems you can use in the cloud? There sure are. In fact, I do exactly that myself all the time. It’s a fine thing. However, there are a few problems with doing things this way. One is that you have to manage the servers and their configuration yourself. Not only is that just one more burden as you’re trying to do Something Else, but it doesn’t allow you to take advantage of shared-service economies. Part of the cloud value proposition is supposed to be that when you aggregate resources across many users their individually unpredictable growth or bursts balance out (James Hamilton’s “non-correlated peaks” idea). The resulting aggregate predictability (plus the ability to amortize the cost of hiring real experts) allows things like capacity provisioning and monitoring to be done more efficiently in one place than if they had to be done separately by each user. Also, when you run heavy I/O within your compute resources you’re using the wrong tool for the job. It’s much better to run physical I/O on machines provisioned and tuned for that kind of thing than to run virtual I/O on machines provisioned and tuned for something else entirely. For all of these reasons, having the provider set up a filesystem as a permanent, shared resource is preferable to having each user set up their own . . . so long as it’s done in a way that preserves the users’ and providers’ needs. On the user side, you want sharing between all of a user’s machines but you don’t want users reading each others’ data (let alone writing it). On the provider side, you need features such as quotas and accurate billing so users can actually be charged for what they use.

This brings us to the “why” of CloudFS: because sometimes people need a filesystem, because anything you put in the cloud as a shared service needs to meet certain requirements, and because none of the filesystems that might otherwise fit people’s needs meet those requirements. Without getting too deep into the details of exactly what features CloudFS provides to fill this gap – that will be my next post – I think it’s safe to say a big gap that needs filling.


First post!

Welcome to the new website for CloudFS. ¬†Over the next few weeks I’ll be posting about what CloudFS is, why it might be interesting to people, and – probably most important of all – how it works. Some of the content might be copied from my own site where I’ve written about some of this stuff before, but for the most part I intend to keep what I write here original and separate from what I write there.