Distributed File System SBFAQ

There's a lot of hype around distributed file systems and their relatives, such as object stores. Every week, it seems, there's a new project claiming to be the fastest, most scalable, most robust, most space-efficient distributed file system ever, sweeping all predecessors before it. Nine times out of ten, those claims are simply ridiculous. A distributed file system is a complex thing. Design choices and tradeoffs must be made. Anybody who claims to be the best in all of these categories is simply lying, and most new projects break new ground in only one direction - all too often, in zero. Instead of trying to identify the ways in which each new braggart is lying to you, I've compiled this list of questions so you can figure it out for yourself.

  • Is it, in fact, a file system at all? A real file system must be mountable and usable in the same way as a local file system, e.g. to store your source code and run your applications. That means it must have indefinitely hierarchical directories, byte addressability within files, UNIX-style permissions, and so on. It should also meet all or at least most POSIX requirements around issues such as consistency and durability. For example, it is not OK for a file system to support only whole-file or appending writes, or to ignore fsync. Not everybody needs a real file system - many people are quite happy with object stores and good for them - but if your project doesn't meet the basic definition of a file system then don't call it one.
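
To make this concrete, here's a minimal smoke test (in Python, against a hypothetical mount point) for the two properties mentioned above - byte-addressable overwrites in the middle of a file, and an fsync that at least doesn't error out. It's a sketch, not a conformance suite; passing it is necessary but nowhere near sufficient.

```python
import os
import tempfile

def posix_smoke_test(mountpoint):
    """Tiny sanity check for POSIX-style semantics on a mounted file
    system: byte-addressable overwrites and a working fsync."""
    path = os.path.join(mountpoint, "posix_check.tmp")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        os.write(fd, b"hello world")
        # Overwrite bytes in the middle of the file. Stores that only
        # support whole-file or appending writes fail right here.
        os.pwrite(fd, b"WORLD", 6)
        # fsync must be honored, not silently ignored (from user space
        # we can only verify that the call succeeds).
        os.fsync(fd)
        return os.pread(fd, 11, 0) == b"hello WORLD"
    finally:
        os.close(fd)
        os.unlink(path)
```

Run it with the mount point of the file system under test; a real file system returns True without complaint.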

  • How is metadata distributed? This is probably the biggest distinction among distributed file systems. Serious practitioners have known for a decade or more that single-metadata-server designs are bad for both reliability and scalability. Old-school active/passive failover doesn't even fully address reliability - the entire file system goes down during the failure-detection interval - and fails to address scalability at all. Having to provision two special ultra-powerful machines as your active and standby metadata servers should not be acceptable any more, but to this day new projects based on this approach continue to appear and still claim to be the best. Systems that are designed to have as many metadata servers as you want are common enough that you shouldn't have to settle for anything less.

  • How is data distributed and/or replicated? Does each file live on only one node, or can files be split/striped across multiple nodes? Is replication (or erasure coding) built in, or will you have to rely on some other piece of software to protect against disk or node failures? Can data be replicated across data centers? If so, with what kinds of consistency/ordering guarantees?
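
As one illustration of what "striped across multiple nodes" means mechanically, here's a sketch of simple round-robin striping - mapping a byte offset within a file to a node and an offset within that node's chunk. This is a hypothetical layout for illustration only; real systems add replication, placement policies, and failure handling on top.

```python
def stripe_location(offset, stripe_size, n_nodes):
    """Map a byte offset within a file to (node index, offset within
    that node's local chunk file) under round-robin striping."""
    stripe_index = offset // stripe_size          # which stripe unit
    node = stripe_index % n_nodes                 # round-robin node choice
    local_offset = (stripe_index // n_nodes) * stripe_size + offset % stripe_size
    return node, local_offset
```

With a 64 KiB stripe over 4 nodes, offset 0 lands on node 0, offset 65536 on node 1, and offset 262144 wraps back to node 0 at local offset 65536.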

  • What access methods are supported? Even when accessing a distributed file system as a distributed file system, there are differences among native protocols (implemented either in the kernel or user space), NFS (multiple versions), and SMB (ditto). It might also be useful to have an S3/Swift object-store interface, or a block-device interface.

  • What network configurations are supported? Does the system support InfiniBand or other kinds of RDMA? Does it use one network for both client-to-server and server-to-server communication, or can/must those be segregated? Does it support IPv6?

  • How easy is it to set up and use? This is another major differentiator. Distributed file systems have traditionally been one of the most difficult kinds of software to work with. With only a couple of exceptions, setting one up will require an inordinate amount of editing and copying files around manually, on each node and usually for each layer. Does adding or removing a node require more manual reconfiguration? Can it even be done online, or does it require a cold restart? How about a software upgrade?

  • What security features does it have? Does it support ACLs? Which flavor? Does it support SELinux? Can data be encrypted on the network? On disk? Where are the keys, in either case? How are identities managed for authorization? Can it use Kerberos, or LDAP/AD?

  • How efficiently is data stored? Replication can (obviously) require N times as much disk space, but often offers the best performance. Erasure coding is more storage-efficient, but typically slower. Are compression and/or deduplication supported? Are there block-size or other concerns that might also lead to wasted space?
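
The arithmetic behind that tradeoff is simple enough to write down. Usable capacity as a fraction of raw capacity is 1/N for N-way replication and k/(k+m) for erasure coding with k data and m parity fragments:

```python
def replication_efficiency(copies):
    """Usable fraction of raw capacity with N-way replication."""
    return 1.0 / copies

def erasure_efficiency(data_fragments, parity_fragments):
    """Usable fraction of raw capacity with k data + m parity
    fragments (a k+m erasure code)."""
    return data_fragments / (data_fragments + parity_fragments)
```

So 3-way replication gives you about 33% of your raw disk space while tolerating two failures, whereas an 8+3 erasure code gives you about 73% while tolerating three - at the cost of reconstruction reads and CPU on every degraded access.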

  • Are snapshots supported? Multiple snapshots? Snapshots of snapshots? Writable snapshots (clones)? Snapshots/clones of clones? How space-efficient are they? What dependencies (e.g. LVM or ZFS) do they introduce?

  • How does the system detect failures, and subsequently repair files for which one or more copies/fragments were lost to those failures? Is the process fully automatic, or does it require some sort of manual intervention? How does it affect ongoing performance? How is "split brain" handled? Is some sort of quorum enforced to prevent it? How is it reported? Can it be repaired automatically? Can it be repaired manually?
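
The usual answer to the split-brain question is a strict-majority quorum, which can be sketched in one line: a partition may continue serving writes only if it can see more than half the cluster, so two disjoint partitions can never both proceed. This is a deliberately simplified illustration; real systems layer leases, fencing, and arbiter nodes on top.

```python
def has_quorum(reachable_nodes, cluster_size):
    """Strict-majority quorum check. Because two disjoint partitions
    cannot both contain a strict majority, at most one side of a
    network partition keeps serving writes (no split brain)."""
    return reachable_nodes > cluster_size // 2
```

In a 5-node cluster split 3/2, only the 3-node side has quorum; in a 4-node cluster split 2/2, neither side does, which is why even-sized clusters often get a tiebreaker node.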

  • How does the system migrate data when nodes are added or removed? Again, is this automatic or does it require manual intervention? How does it affect ongoing performance? How well do the rebalancing algorithms work to ensure that a minimal amount of data is moved? How flexible are those algorithms, or are there multiple algorithms that the user can choose?
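
The classic answer to "move a minimal amount of data" is consistent hashing: place both nodes and keys on a hash ring, and adding one node to an N-node cluster relocates only about 1/(N+1) of the keys, instead of reshuffling nearly everything as naive modulo placement would. A hypothetical sketch:

```python
import hashlib
from bisect import bisect_right

def build_ring(nodes, vnodes=100):
    """Build a consistent-hash ring with virtual nodes for smoother
    balance. Illustrative only; real placement engines (e.g. CRUSH)
    also weight nodes and respect failure domains."""
    points = []
    for node in nodes:
        for v in range(vnodes):
            h = int(hashlib.md5(f"{node}:{v}".encode()).hexdigest(), 16)
            points.append((h, node))
    return sorted(points)

def locate(points, key):
    """Find the first ring point at or after the key's hash,
    wrapping around at the end of the ring."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    i = bisect_right(points, (h, "")) % len(points)
    return points[i][1]
```

Comparing placements before and after adding a fourth node to a three-node ring shows only roughly a quarter of the keys moving - and every key that moves goes to the new node, not shuffled among the old ones.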

That's a lot, isn't it? And we haven't even gotten to performance yet. Everybody wants to skip ahead to performance. There's a saying in the military, that amateurs talk about strategy (without actually understanding even that) while professionals worry about logistics. In storage, amateurs talk about performance (without actually understanding even that) while professionals worry about things like robustness and security and operational simplicity - all of the stuff above. Still, performance does matter, so there's a whole separate set of questions to ask about any performance claims people make.

  • Is anything about the configuration, workload, or testing methodology described? Claims like "24x better response time" or "7x better IOPS" are utterly worthless without any of that information. If I try hard enough, I can find situations where GlusterFS is 7x faster than HDFS, others where HDFS is 7x faster than Ceph, and still others where Ceph is 7x faster than GlusterFS. If you see nothing but one or two headline numbers, you might as well ignore those.

  • How many servers were involved? How many clients? What kind of network between them? I'm not against micro-benchmarks involving small numbers of clients and servers - I've run a few and posted about the results myself - but it's important to understand that they can only measure one aspect of the system. Even when it's a very important aspect, and very likely to be reflected in a broader kind of test, it's still a bit like looking at a few cells under a microscope vs. looking at the whole animal.

  • How many disks were involved? What kind? What kinds of RAID controllers? What partitioning scheme or local file systems were involved? Many distributed file systems are highly sensitive to the speed of these underlying components, either generally or in specific roles (e.g. as log/journal or metadata targets). It's very easy to configure a system in a way that's optimal for one competitor and awful for another. Look up "short stroking" and "head thrashing" to get some idea of the tricks that commercial storage vendors use both among themselves and as a bulwark against true software-defined storage.

  • What did the workload look like? Reads or writes? Large requests or small? Sequential or random? If random, was it all blocks exactly once but in random order, or true random, or weighted somehow? How many concurrent I/O streams per process, per client, or overall? What queue depth or fsync interval was each thread using? The ideal here is to show the actual iozone or fio command line and/or job files, to remove any ambiguity, but any information at all is better than none.
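
For the sake of illustration, here's the kind of fio job file that ought to accompany any benchmark claim - a hypothetical random-write job against an assumed mount point, not a recommendation. Every parameter in it (block size, queue depth, fsync interval, direct I/O) changes the results, which is exactly why they all need to be disclosed.

```ini
; Hypothetical fio job file -- publish something like this, or the
; equivalent command line, alongside any benchmark numbers.
[global]
directory=/mnt/dfs        ; mount point under test (assumption)
ioengine=libaio
direct=1                  ; bypass the client page cache
runtime=300
time_based=1
group_reporting=1

[random-write-4k]
rw=randwrite
bs=4k
iodepth=32
numjobs=8
size=10g
fsync=32                  ; issue an fsync every 32 writes
```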

  • How large were the datasets, and how long were the runs? Did requests mostly go only to client caches/buffers, to the same on the server, or to actual disk? What parts of the system were really exercised? Note that the answers might be different based on different file systems' caching and other strategies. Exploiting these differences to generate misleading numbers is another favorite big-dollar-storage vendor trick.

  • How did the performance scale along various axes? Number of clients, servers, disks? Speed of network or disks? Number of worker threads? Replication level? This is where you would normally expect to see some back-and-forth between alternatives, as their respective tradeoff spaces are explored.

  • How did the performance vary over time? Was it steady, or glitchy? Did some clients/threads race along while others starved? Don't just look at averages over an entire run. Look at IOPS and 99th-percentile latencies for each client and for each interval within a run. Your application's performance might be bound by the worst sample in that entire set, so make sure you know what that worst case is and how often it's approached.
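
Here's a toy example of why the average alone misleads, using a crude nearest-rank percentile (the sample data is made up for illustration):

```python
def percentile(samples, p):
    """Nearest-rank percentile -- crude, but enough to show why
    averages hide tail behavior."""
    ordered = sorted(samples)
    k = int(round(p / 100.0 * len(ordered))) - 1
    return ordered[max(0, min(len(ordered) - 1, k))]

# Hypothetical run: 985 requests at 1 ms, 15 stalls at 11 ms.
latencies_ms = [1.0] * 985 + [11.0] * 15
mean_ms = sum(latencies_ms) / len(latencies_ms)   # 1.15 ms -- looks fine
p99_ms = percentile(latencies_ms, 99)             # 11 ms -- the real story
```

A run that averages 1.15 ms can still have a 99th percentile of 11 ms, and if your application waits on the slowest of many parallel requests, the tail is what it actually experiences.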

Phew. There's a lot more, of course, but if you can get the answers to these questions you should have a pretty good handle on whether the thing in front of you is a serious distributed file system or just someone's research project that they expect you to underwrite. If you can compare the answers for two or more distributed file systems, you should have a good idea which one will really suit your needs. I'm sure I forgot something. Please let me know if you find out what it was.