Archive for June, 2011

User Space File Systems Again

My previous post on this subject seems to have attracted a lot of interest. I got links from Phoronix, Heise (German), and most of all from my friends at Gluster. I was particularly amused by the Gluster response, because usually AB is the good cop and I’m the bad cop but this time he seems to have taken a more aggressive position than me. I had been thinking of writing a follow-up, and had put it off, but then as I composed a reply to a comment on the previous article I realized that I had covered most of the points I’d intended to make anyway. Being the lazy sort that I am, I’m just going to re-post the comment and my reply here. First, here’s the challenge from P.B.Shelley.

How about admitting that with FUSE, data has to be copied to kernel, then your user space component, and then back to kernel to write to disk? If you claim that this overhead can result in better performance, please PROVE it, instead of citing how many successful user space filesystems you have out there. All of them can perform better if they live in the kernel!

Here’s my response to that challenge.

You’re missing the point on several levels, P.B. I already mentioned the issue of extra data copies, but also made the point that it doesn’t relegate all user-space file systems to toys. Let’s see how many reasons there might be for that.

(1) The copies you mention are artifacts of the Linux FUSE implementation, and are not inherent to user-space file systems in general. Other systems do this more intelligently. PVFS2 does it more intelligently *on Linux*. With RDMA, communication could go directly from application to application, without even the overhead of going through the kernel. FUSE itself could be more efficient if resistance from the kernel grognards and their sycophants could be overcome. Even if one could make the case that filesystems based on FUSE as it exists today are all toys, Linus’s statement *as he made it* would still be untrue.

(2) The copies don’t matter in many environments, especially in distributed systems. If your system is network, memory, or I/O bound anyway – whether that’s because of provisioning or algorithms – then the copies are only consuming otherwise-idle CPU cycles. This is especially true since most systems sold today are way out of balance in favor of CPU/memory over network or disk I/O anyway.

(3) There’s an important distinction between latency and throughput. The FUSE overheads mostly affect latency. If latency is your chief concern, then you probably shouldn’t be using any kind of distributed file system regardless of whether it’s in kernel or user space. If throughput is your chief concern, which is the more common case, you need a system that allows you to aggregate the power of many servers without hitting algorithmic limits. Such systems are hard enough to scale and debug already, without the added difficulty of putting them into the kernel prematurely. I’m not against putting code in the kernel *when all of the algorithms are settled*, but projects can go well beyond “toy” status well before that.

(4) There are concerns besides performance. There are bazillions of libraries that one can use easily from user space. Many of them can not and should not ever be reimplemented in the kernel simply because that would bloat the kernel beyond belief. In some cases there would be other serious implications, such as a kernel-ported security library losing its certification in the process.

(5) Results from actual field usage trump synthetic micro-benchmarks any day, and either trumps empty theorizing like yours. If Argonne and Pandora and dozens of others can use PVFS2 and GlusterFS and HDFS for serious work, then they’re not toys. The point is already proven. End of story.
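
Point (2) above can be illustrated with a very small bottleneck model. This is only a sketch: it assumes the stages of an I/O path overlap like a pipeline, so sustained throughput is set by the slowest stage, and every rate below is a made-up number chosen for illustration, not a measurement of FUSE or of any real system.

```python
def pipeline_throughput(stage_rates_mb_s):
    """Sustained throughput of overlapping stages is limited by the slowest one."""
    return min(stage_rates_mb_s.values())


def bottleneck(stage_rates_mb_s):
    """Name of the stage that limits throughput."""
    return min(stage_rates_mb_s, key=stage_rates_mb_s.get)


# Hypothetical rates in MB/s (illustrative assumptions, not measurements).
in_kernel = {"network": 110.0, "disk": 150.0, "cpu_copies": 2000.0}
# Assume the extra user-space copies halve the rate at which the CPU can move data.
user_space = {"network": 110.0, "disk": 150.0, "cpu_copies": 1000.0}

for name, path in (("in-kernel", in_kernel), ("user-space", user_space)):
    print(name, pipeline_throughput(path), "MB/s, limited by", bottleneck(path))
```

Under these assumptions both paths are limited by the network, so the extra copies cost CPU time but no throughput. The model says nothing about latency, which is where point (3) says the copies mostly show up.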

The real point here is that user-space file systems might not be better than kernel file systems in the terms that people like Linus and P.B. care about. I don’t think anybody has claimed that they were. However, they can be better in other ways. The importance of making it easier to develop and integrate user-space file systems in already-challenging environments, and the greater ease of hiring developers to work on them, can not be lightly dismissed. The user-space file system that’s finished beats the kernel file system that remains mired in pre-release debugging. Most of the algorithms that underlie modern distributed file systems, including kernel-based ones such as pNFS or Ceph, were developed in user space first. Often, the user-space prototype turned out to be complete enough and fast enough for some real-life purpose that putting it in the kernel was no longer worth the effort.

For your root file system, you should probably go with a traditional kernel-based write-in-place file system (not a copy-on-write file system because they’re bad in almost the same latency and CPU-usage terms that user-space file systems are). For data, if latency is your primary concern and you’re not hitting other limits before CPU and you can’t be bothered fixing the interfaces, then by all means develop your fancy new file system in the kernel. If you’re more concerned about throughput or your primary constraints are network-related or you’re willing to use/implement some interface besides FUSE, then maybe there are wiser choices you could make.


User Space Filesystems

Apparently Linus has made another of his grand pronouncements, on a subject relevant to this project (thanks to Pete Zaitcev for bringing it to my attention).

People who think that userspace filesystems are realistic for anything but toys are just misguided.

I beg to differ, on the basis that many people are deploying user-space filesystems in production to good effect, and that by definition means they’re not toys. Besides the obvious example of GlusterFS, PVFS2 is almost entirely in user space and it has been used to solve some very serious problems on some seriously large systems for years. Everything Linus has worked on is a toy compared to this. There are several other examples, but that one should be sufficient.

So where does Linus’s dismissive attitude come from? Only he can say, of course, but I’ve seen the same attitude from many kernel hackers and in many cases I do know where it comes from. A lot of people who have focused their attention on the minutiae of what’s going on inside processors and memory and interrupt controllers tend to lose track of things that might happen past the edge of the motherboard. This is a constant annoyance to people who work on external networking or storage, and the problem is particularly acute with distributed systems that involve both. Sure there are inefficiencies in moving I/O out to user space, but those can be positively dwarfed by inefficiencies that occur between systems. A kernel implementation of a bad distributed algorithm is most emphatically not going to beat a user-space implementation of a better one. When you’re already dealing with the constraints of a high-performance distributed system, having to deal with the additional constraints of working in the kernel might actually slow you down. It’s not that it can’t be done; it’s just not the best way to address that class of problems.

The inefficiency of moving I/O out to user space is also somewhat self-inflicted. A lot of that inefficiency has to do with data copies, but let’s consider the possibility that there might be fewer such copies if there were better ways for user-space code to specify actions on buffers that it can’t actually access directly. We actually implemented some of these at Revivio, and they worked. Why aren’t such things part of the mainline kernel? Because the gatekeepers don’t want them to be. Linus’s hatred of microkernels and anything like them is old and well known. Many other kernel developers have similar attitudes. If they think a feature only has one significant use case, and it’s a use case they oppose for other reasons, are they going to be supportive of work to provide that feature? Of course not. They’re going to reject it as needless bloat and complexity, which shouldn’t be allowed to affect the streamlined code paths that exist to do things the way they think things should be done. There’s not actually anything wrong with that, but it does mean that when they claim that user-space filesystems will incur unnecessary overhead they’re not expressing an essential truth about user-space filesystems. They’re expressing a truth about their support of user-space filesystems in Linux, which is quite different.

A lot of user-space filesystems – perhaps even a majority – really are toys. Then again, is anybody using kernel-based exofs or omfs more seriously than Argonne is using PVFS? If you make something easier to do, more people will do it. Not all of those people will be as skilled as those who would have done it The Hard Way. FUSE has definitely made it easier to write filesystems, and a lot of tyros have made toys with it, but it’s also possible for serious people to make serious filesystems with it. Remember, a lot of people once thought Linux and the machines it ran on were toys. Many still are, even literally. I always thought that broadening the community and encouraging experimentation were supposed to be good things, without which Linux itself wouldn’t have succeeded. Apparently I’m misguided.

Note: some of the comments have been promoted into a follow-up post

CloudFS = GlusterFS + ???

One of the questions I’m asked most frequently is about the exact relationship between GlusterFS and CloudFS. Even people pretty close to the matter often seem to misunderstand this, as I was reminded recently by a post to an OpenStack mailing list that I felt didn’t represent the differences accurately. The very short version is that CloudFS simply adds options to GlusterFS. It doesn’t take away anything, it doesn’t replace anything, and most of the code you’ll be running when you run CloudFS is actually in GlusterFS. That’s the idea, anyway. In the nitty-gritty world of pushing patches and building packages, with all of the time delays those things involve, there might transiently be cases where CloudFS as it is delivered to users does in fact change or replace something that’s already in GlusterFS, but all such artifacts are intended to disappear over time. Think of it as eventual consistency for code. ;) Here are some more details about how GlusterFS and CloudFS relate in terms of specific cloud features.

  • Namespace isolation
    The way that CloudFS gives each tenant their own separate namespace is exactly the same way that a user would be able to do the same thing without CloudFS – by creating tenant-specific subdirectories under each original brick, exporting those subdirectories as bricks themselves, then creating tenant-specific volumes from those. It didn’t always work this way, but it has for a few months now, so any claim that the GlusterFS method is somehow more secure is untrue. They’re exactly the same in terms of security, because they’re exactly the same in terms of how the bricks and volumes are defined. What’s different is that CloudFS provides tools to manage this proliferation of per-tenant bricks and volumes – M bricks times N tenants can get out of hand when M is in the dozens and N is in the hundreds. It also exports multiple per-tenant bricks from a single server (glusterfsd) process, for similar reasons. At the most basic level, though, a request will go through the same set of translators on the client and then again on the server once that server has received it.
  • ID isolation
    The claim that actually inspired this post is that ID isolation has been an essential part of GlusterFS since its inception, and that its form of this is more secure than CloudFS’s. I find this strange, because GlusterFS doesn’t even have any kind of UID/GID isolation comparable to CloudFS’s uidmap translator. There is, or perhaps was once, a translator that could do a much simpler and more static kind of UID mapping, but as far as I know it has never been supported or used. Even if I’m wrong on that last point, there’s nothing about it to support a claim that it’s more secure than uidmap. In fact, I can point to places where it distinctly fails to do some of the mappings that it should, and I know to look for them because they came up while Kaleb was developing uidmap. For all practical purposes, if you need this feature you need CloudFS.
  • Network encryption
    This is the place where the “only add, never change or remove” relationship between CloudFS and GlusterFS might temporarily break down. I’ve submitted patches to add OpenSSL support to the GlusterFS socket transport, intending them eventually to become part of GlusterFS. This naturally provides authentication as well as encryption. It also includes performance changes, made necessary by the fact that each GlusterFS process has a single-threaded event loop: with SSL, work which would previously have been done in that loop needs to be done in separate per-connection threads to achieve adequate parallelism. Gluster has tentatively said they’re interested in accepting these patches, but both the code and the contributor agreement are still under review. It’s not at all clear whether this process will complete by the time CloudFS needs to be frozen for Fedora 16, so there’s a distinct possibility that the modified transport will have to be packaged as part of CloudFS until things work themselves out. Because replacing GlusterFS’s socket module with CloudFS’s would never pass package review (in turn because it would lead to support nightmares for all involved), this might also require a small patch in the Fedora version of GlusterFS so that the configuration-parsing code can handle options specifying the SSL-enabled transport instead of the non-SSL-enabled one. That’s all temporary, though. As soon as the necessary patches make their way into the Gluster code base, these CloudFS-specific changes to both packages could be backed out in favor of (what would then be) vanilla GlusterFS.
  • Disk encryption
    This is still, and will probably remain, an undisputed additional feature for CloudFS. Gluster has never claimed this as a feature (unless you count the rot13 translator) or expressed an interest in using the code developed as part of CloudFS, nor has it ever been offered to them.
  • Quota
    GlusterFS has a quota translator, dormant for a long time but recently the subject of fairly intense activity. In GlusterFS this is deployed on the client side, which I believe is insecure and also fails to deal with quota shared among many client machines. In CloudFS this same translator is likely to be deployed on the server side, or in some cases bypassed in favor of similar features in some local file systems (e.g. XFS). Now we have a quota-across-servers problem instead of a quota-across-clients problem. As part of CloudFS 2.0, therefore, it’s highly likely that there will be code to balance/adjust quota across servers either as part of the existing management daemons or as a separate daemon itself.
  • Management interface
    CloudFS does include its own management web interface and CLI, as described here previously. This does not replace the existing gluster/glusterd management structure, and in fact depends on that infrastructure for some things. However, it does understand how to configure bricks and volumes to take advantage of all the features mentioned above, and if it is used instead of the Gluster management tools then the result will be different in some ways (e.g. the assignment of bricks to server processes). These two interfaces might converge at some point, or they might not; it largely depends on whether Gluster makes it possible to extend their tools/interfaces in the ways that CloudFS needs. In the absence of a rich enough extension interface, users will – rather unfortunately, I admit – be saddled with two management interfaces instead of one.
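
The namespace-isolation layout described in the first item above (a tenant-specific subdirectory under each original brick, exported as a brick of its own) can be sketched in a few lines. This is only an illustration of the bookkeeping involved, not CloudFS’s actual tooling: the function name, the server names, and the `server:/path/tenant` naming scheme are all assumptions made up for the example.

```python
from itertools import product


def tenant_bricks(bricks, tenants):
    """Map each tenant to one sub-brick per original brick.

    bricks  -- list of (server, path) pairs for the original bricks
    tenants -- list of tenant names
    The "server:/path/tenant" form is a hypothetical naming scheme.
    """
    layout = {tenant: [] for tenant in tenants}
    for tenant, (server, path) in product(tenants, bricks):
        layout[tenant].append(f"{server}:{path.rstrip('/')}/{tenant}")
    return layout


# Hypothetical servers and tenants, for illustration only.
bricks = [("srv1", "/export/brick1"), ("srv2", "/export/brick1")]
tenants = ["acme", "globex"]

layout = tenant_bricks(bricks, tenants)
for tenant, sub_bricks in layout.items():
    print(tenant, "->", " ".join(sub_bricks))

# The total grows as M bricks times N tenants.
print(sum(len(b) for b in layout.values()), "per-tenant bricks to manage")
```

Each tenant’s list is what would then be assembled into that tenant’s own volume. With a couple of dozen bricks and a few hundred tenants the same function produces thousands of entries, which is the proliferation the management tools are meant to handle.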

As you can probably see, CloudFS looks a lot like GlusterFS with a few extra bits tacked on. I expect that to continue for some time. Even if there are some management bits and some minor-feature translators that CloudFS users don’t run, the two projects will continue to share a large and important common core which is developed by Gluster and not by me. Thank you, guys. Even if CloudFS eventually includes its own translators to perform major functions such as distribution and replication, so that the GlusterFS translators to do those things are never used in CloudFS, I expect that the two projects will remain complementary (and complimentary) working together for a long time to come.


Quick Note on Encryption

I noticed a few hits here from Slashdot. I wouldn’t normally go there, but I figured I should at least see what was being said about the project. I ended up writing a reply about the strength of the at-rest encryption, and it seems worth repeating here. If you find anything in the first paragraph scary – and you should, really – then please don’t just run away without reading the second. Long story short, the current encryption isn’t that great but the stuff we release with will be.

To be quite clear on this, the at-rest encryption that’s currently in CloudFS is not as secure as we’d like it to be, or as secure as it will be when it’s released. To put it another way, it’s more secure than Dropbox or Jungledisk have proven to be, it’s probably more secure than a couple of dozen other similar cloud-storage options (it’s hard to tell since so many are not open source), but it does have flaws. To be more specific, it’s secure against inspection by someone who only has the ciphertext – such as your cloud provider. However, it is not secure against transparent modification (flipping a bit in the ciphertext flips the corresponding bit in the plaintext). Also, since it’s currently CTR-mode encryption, if someone has both ciphertext and plaintext for the same part of a file then that part of the file becomes readable from just ciphertext thereafter. These flaws are not acceptable; the current code is only a stopgap. This is exactly why I made the point on Twitter recently that even the strongest ciphers with long keys can still result in weak protection if used improperly. I’m sick of seeing cloud-storage providers crow about how strong their transport encryption is but say nothing about on-disk encryption, or mention using “military grade AES-256” on disk but say nothing about how. Worst of all are the ones who require that you give them keys – which for all you know will be stored unprotected right next to the data.

The good news is that I’ve been consulting with some real crypto experts – I admit I’m not one myself – on this. We’ve worked out a block-based scheme that all involved believe will address the above flaws, while also handling concurrent writes correctly (something most “personal backup” alternatives fail to do). The performance cost is more than I’d like, but I think it’s no more than necessary and the parallelism inherent in the underlying system should still yield more-than-adequate performance. I’ve already begun implementation, and will fully disclose all the details once I get a bit further along.
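
The two flaws of the current CTR-mode scheme described above are easy to demonstrate. The sketch below is not CloudFS code and does not use AES: it substitutes a toy keystream built from SHA-256 so that it runs with only the Python standard library. What matters is the structure shared by any CTR-style cipher, namely that ciphertext is plaintext XORed with a keystream. The second half also assumes that a rewritten region of a file is encrypted with the same keystream as before, which is the situation the paragraph above describes.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy counter-mode keystream (SHA-256 standing in for AES; illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ctr_crypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: in CTR mode the two operations are identical."""
    return xor(data, keystream(key, nonce, len(data)))


key, nonce = b"k" * 32, b"file-42"  # hypothetical key and per-file nonce

# Flaw 1: transparent modification. Flipping ciphertext bits flips the same
# plaintext bits, and nothing detects the change.
ciphertext = bytearray(ctr_crypt(key, nonce, b"pay alice 100"))
ciphertext[10] ^= ord("1") ^ ord("9")  # attacker needs no key for this
tampered = ctr_crypt(key, nonce, bytes(ciphertext))

# Flaw 2: known plaintext. Ciphertext XOR plaintext yields the keystream for
# that region, which then decrypts anything later written there.
old_plain = b"public header 01"
old_cipher = ctr_crypt(key, nonce, old_plain)
leaked_keystream = xor(old_cipher, old_plain)
new_cipher = ctr_crypt(key, nonce, b"secret payload!!")
recovered = xor(new_cipher, leaked_keystream)

print(tampered, recovered)
```

Neither attack touches the key, which is why key length and cipher strength do not help here. Closing the first hole generally requires some form of integrity check over the ciphertext, and closing the second requires never reusing a keystream for different data.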


CloudFS Article in ;login: Magazine

I met Rik Farrow at FUDcon in Tempe, back at the end of January. After my talk about CloudFS, he introduced himself as the editor of USENIX ;login: magazine (a name that I’m sure has driven years of library-catalog maintainers crazy) and asked if I’d be interested in writing an article to appear there. Of course I would! We discussed the idea further as we shared a nice breakfast with Ric Wheeler at FAST’11 in San Jose, and the process picked up a bit from there. I have to say it was a very easy process – working both with Rik and with Jane-Ellen Long was a pleasure – leading to the article some of you might see in this month’s issue. Subscription is required for now, but I believe the content will become freely available at some point and I still have the drafts. ;)

The content should be very familiar to those who’ve seen my Red Hat Summit or FUDcon presentations, or indeed read what I’ve written here. Mostly I make the same points I’ve been making for a year or more, about requirements for a genuine cloud filesystem[1] and how CloudFS in particular tries to meet them. It’s just in a more coherent and continuous form, instead of being broken up into many slides or blog posts. If you read it, I’d appreciate hearing what you think.