When Alex Popescu wrote about scalable storage solutions and I said that the omission of distributed filesystems made me cry, he suggested that I could write an introduction. OK. Here it is.
All filesystems – even local ones – have a similar data model and API model. The data model consists of files inside directories, where both have user-assigned names. In most modern filesystems directories can be nested, file contents are byte-addressable, and names are free-form character sequences. The API model, commonly referred to as POSIX after the standard of the same name, includes two broad categories of calls – those that operate on files, and those that operate on the “namespace” of files within directories. Examples of the first category include open, close, read, write, and fsync. Examples of the second category include opendir, readdir, link, unlink, and rename. People who actually develop filesystems, especially in a Linux context, often talk in terms of a different three-way distinction (file/inode/dirent operations) but that has more to do with filesystem internals than with the API users see.
The other thing about filesystems is that they’re integrated into the operating system. Any program should be able to use any filesystem without using special libraries. That makes real filesystems a bit harder to implement, but it also makes them more generally useful than impostors that just have “FS” in the name to imply functionality they don’t have.
There are many ways to categorize filesystems – according to how they’re accessed, how they’re implemented, what they’re good for, and so on. In the context of scalable storage solutions, though, the most important groupings are these.
- A local filesystem is local to a single machine, in the sense that only a process on the same machine can make POSIX calls to it. That process might in fact be a server for some “higher level” kind of filesystem, and in fact local filesystems are an essential building block for most others, but for this to work the server must make a new local-filesystem call which is not quite the same as continuing the client’s call.
- A network filesystem is one that can be shared, but where each client communicates with a single server. NFS (versions 2 and 3) and CIFS (a.k.a. SMB, which is what gives Samba its name) are the best known examples of this type. Servers can of course be clustered and made highly available and so on, but this must be done transparently – almost behind the clients’ backs or under their noses. This approach fundamentally only allows vertical scaling, and the trickery necessary to scale horizontally within a single-server basic model can become quite burdensome.
- A distributed filesystem is one in which clients actually know about and directly communicate with multiple servers (of one or more types). Lustre, PVFS2, GlusterFS, and Ceph all fit into this category despite their other differences. Unfortunately, the term “distributed filesystem” makes no distinction between filesystems distributed across a fast and lossless LAN and those distributed across a WAN with exactly opposite characteristics. I sometimes use “near-distributed” and “far-distributed” to make this distinction, but as far as I know there are no concise and commonly accepted terms. AFS is the best known example of a far-distributed filesystem, and one of the longest-lived filesystems in any category (still in active large-scale use at several places I know of).
- A parallel filesystem is a distributed filesystem in which a single file, or even a single I/O request, can be striped across multiple servers. This is primarily beneficial in terms of performance, but can also help to distribute capacity more evenly than if every file lives on exactly one server. I’ve often used the term to refer to near-distributed filesystems as distinct from their far-distributed cousins, because there’s a high degree of overlap, but it’s not technically correct. There are near-distributed filesystems that aren’t parallel filesystems (GlusterFS is usually configured this way) and parallel filesystems that are not near-distributed (Tahoe-LAFS and other crypto-oriented filesystems might fit this description).
- A cluster or shared-storage filesystem is one in which clients are directly attached to shared block storage. GFS2 and OCFS2 are the best known examples of this category, which also includes MPFS. Once touted as a performance- or scale-oriented solution, these are now positioned mainly as availability solutions with a secondary emphasis on strong data consistency (compared to the weak consistency offered by many other network and distributed filesystems). Due to this focus and the general characteristics of shared block storage, the distribution in this case is always near.
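To make the striping idea in the parallel-filesystem entry concrete, here is a minimal Python sketch of how a single I/O request might be split across servers. It assumes plain round-robin striping with a fixed stripe size; the names `Chunk` and `stripe_request` and the parameters are made up for illustration, and real parallel filesystems use their own (usually more elaborate) layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    server: int         # index of the server holding this piece
    server_offset: int  # byte offset within that server's portion of the file
    length: int         # number of bytes in this piece


def stripe_request(offset, length, stripe_size, num_servers):
    """Split one byte-range request into per-server pieces.

    Assumes round-robin striping: stripe 0 goes to server 0, stripe 1 to
    server 1, and so on, wrapping around after num_servers stripes.
    """
    chunks = []
    end = offset + length
    while offset < end:
        stripe_index = offset // stripe_size
        within = offset % stripe_size
        server = stripe_index % num_servers
        # Each server stores every num_servers-th stripe back to back.
        server_offset = (stripe_index // num_servers) * stripe_size + within
        n = min(stripe_size - within, end - offset)
        chunks.append(Chunk(server, server_offset, n))
        offset += n
    return chunks


if __name__ == "__main__":
    # A 10-byte read at offset 0 with 4-byte stripes over 3 servers
    # touches all three servers.
    for chunk in stripe_request(0, 10, stripe_size=4, num_servers=3):
        print(chunk)
```

With a layout like this, one large sequential request fans out to several servers at once, which is where the performance benefit comes from; a request smaller than one stripe still lands on a single server.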
This set of distinctions is certainly neither comprehensive nor ideal, as illustrated by pNFS which allows multiple “layout” types. With a file layout, pNFS would be a distributed filesystem by these definitions. With a block layout, it would be a cluster filesystem. With an object layout, a case could be made for either, and yet all three are really the same filesystem with (mostly) the same protocol and (definitely) the same warts.
One of the most important distinctions among network/distributed/cluster filesystems, from a scalability point of view, is whether it’s just data that’s being distributed or metadata as well. One of the issues I have with Lustre, for example, is that it relies on a single metadata server (MDS). The Lustre folks would surely argue that having a single metadata server is not a problem, and point out that Lustre is in fact used at some of the most I/O-intensive sites in the world without issue. I would point out that I have actually watched the MDS melt down many times when confronted with any but the most embarrassingly metadata-light workloads, and also ask why they’ve expended such enormous engineering effort – on at least two separate occasions – trying to make the MDS distributed if it’s OK for it not to be. Similarly, with pNFS you get distributed data but only some pieces of the protocol (and none of any non-proprietary implementation) to distribute metadata as well. Anybody who wants a filesystem that’s scalable in the same way that non-filesystem data stores such as Cassandra/Riak/Voldemort are scalable would and should be very skeptical of claims made by advocates of a distributed filesystem with non-distributed metadata.
A related issue here is performance. While near-distributed parallel filesystems can often show amazing megabytes-per-second numbers on large-block large-file sequential workloads, as a group they’re notoriously poor for random or many-small-file workloads. To a certain extent this is the nature of the beast. If files live on dozens of servers, you might have to contact dozens of servers to list a large directory, or the coordination among those servers to maintain consistency (even if it’s just metadata consistency) can become overwhelming. It’s harder to do things this way than to blast bits through a simple pipe between one client and one server without any need for further coordination. Can Ma’s Pomegranate project deserves special mention here as an effort to overcome this habitual weakness of distributed filesystems, but in general it’s one of the reasons many have sought alternative solutions for this sort of data.
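The directory-listing problem can be illustrated with a toy model. The sketch below assumes files are placed on servers by hashing their names, which is only one possible placement scheme and not a description of any particular filesystem; `server_for` and `servers_to_contact` are hypothetical helpers.

```python
import hashlib


def server_for(name, num_servers):
    """Pick a server for a file name by hashing it (assumed placement scheme)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def servers_to_contact(names, num_servers):
    """Return the set of servers that hold at least one entry of a directory."""
    return {server_for(name, num_servers) for name in names}


if __name__ == "__main__":
    small_dir = ["README"]
    large_dir = [f"file-{i:04d}.dat" for i in range(1000)]
    print(len(servers_to_contact(small_dir, 24)), "server(s) for the small directory")
    print(len(servers_to_contact(large_dir, 24)), "server(s) for the large directory")
```

Under this kind of placement, a directory with many entries is spread over most or all of the servers, so a complete listing needs a round trip to each of them unless some separate index is maintained.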
So, getting back to Alex’s original article and my response to it, when should one consider using a distributed filesystem instead of an oh-so-fashionable key/value or table/document store for one’s scalable data needs? First, when the data and API models fit. Filesystems are good at hierarchical naming and at manipulating data within large objects (beyond the whole-object GET and PUT of S3-like systems), but they’re not so good for small objects and don’t offer the indices or querying of databases (SQL or otherwise). Second, it’s necessary to consider the performance/cost curve of a particular workload on a distributed filesystem vs. that on some other type of system. If there’s a fit for data model and API and performance, though, I’d say a distributed filesystem should often be preferred to other options. The advantage of having something that’s accessible from every scripting language and command-line tool in the world, without needing special libraries, shouldn’t be taken lightly. Getting data in and out, or massaging it in any of half a million ways, is a real problem that isn’t well addressed by any storage system with a “unique” API (including REST-based ones) no matter how cool that storage system might be otherwise.
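As a small illustration of the "no special libraries" point, the sketch below uses only Python's standard `os` module, which wraps the POSIX calls mentioned earlier, to exercise both the file operations and the namespace operations. The function name `exercise_posix_calls` and the file names are made up for the example, and it assumes the target filesystem supports hard links; the same code should run unchanged against any mounted filesystem, local or distributed.

```python
import os
import tempfile


def exercise_posix_calls(root):
    """Run a few file and namespace operations under the directory `root`."""
    path = os.path.join(root, "notes.txt")

    # File operations: open, write, fsync, close, then open, read, close.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.write(fd, b"hello, filesystem\n")
        os.fsync(fd)  # ask the filesystem to push the data to stable storage
    finally:
        os.close(fd)

    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, 4096)
    finally:
        os.close(fd)

    # Namespace operations: link, unlink, rename, and a directory listing.
    alias = os.path.join(root, "alias.txt")
    final = os.path.join(root, "final.txt")
    os.link(path, alias)     # second name for the same file (hard link)
    os.unlink(path)          # remove the original name; the data stays
    os.rename(alias, final)  # rename the remaining name
    names = sorted(os.listdir(root))
    return data, names


if __name__ == "__main__":
    # A temporary directory stands in for a mount point here.
    with tempfile.TemporaryDirectory() as root:
        print(exercise_posix_calls(root))
```

Pointing `root` at a directory on an NFS, GlusterFS, or other mount would exercise that filesystem instead, with no change to the code.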