There's a lot of hype around distributed file systems and their relatives, such
as object stores. Every week, it seems, there's a new project claiming to be
the fastest, most scalable, most robust, most space-efficient distributed
file system ever, sweeping all precursors before it. Nine times out of ten,
those claims are simply ridiculous. A distributed file system is a complex
thing. Design choices and tradeoffs must be made. Anybody who claims to be
the best in all of these categories is simply lying, and most new projects
break new ground in only one direction - all too often, in zero. Instead of
trying to identify the ways in which each new braggart is lying to you, I've
compiled this list of questions so you can figure it out for yourself.
Is it, in fact, a file system at all? A real file system must be mountable
and usable in the same way as a local file system, e.g. to store your source
code and run your applications. That means it must have indefinitely
hierarchical directories, byte addressability within files, UNIX-style
permissions, and so on. It should also meet all or at least most POSIX
requirements around issues such as consistency and durability. For example,
it is not OK for a file system to support only whole-file or appending
writes, or to ignore fsync. Not everybody needs a real file system - many
people are quite happy with object stores, and good for them - but if your
project doesn't meet the basic definition of a file system, then don't call it one.
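As a quick illustration, here's a minimal smoke test (in Python, against a stand-in directory; point it at your real mount instead) for two of those basics: in-place byte addressability and a working fsync call. It can't prove fsync actually reaches stable storage - only that the system accepts partial overwrites and doesn't error out on sync - so treat it as a sanity check, not a conformance test.

```python
import os
import tempfile

# Stand-in so the sketch runs anywhere; replace with the mount point
# of the file system you're actually evaluating.
MOUNT_POINT = tempfile.gettempdir()

def smoke_test(dirpath):
    """Check two 'real file system' basics: byte addressability
    (an overwrite in the middle of an existing file) and an fsync
    call that at least succeeds rather than failing or being absent."""
    path = os.path.join(dirpath, "fs_smoke_test")
    with open(path, "wb") as f:
        f.write(b"A" * 1024)
        f.flush()
        os.fsync(f.fileno())        # must not raise
    with open(path, "r+b") as f:
        f.seek(512)                 # partial, in-place overwrite
        f.write(b"B" * 16)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        data = f.read()
    os.unlink(path)
    # byte 511 untouched, bytes 512..527 rewritten
    return data[511:512] == b"A" and data[512:528] == b"B" * 16

print(smoke_test(MOUNT_POINT))
```

A system that only supports whole-file or appending writes fails the second open/seek/write step outright.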
How is metadata distributed? This is probably the biggest distinction among
distributed file systems. Serious practitioners have known for a decade or
more that single-metadata-server designs are bad for both reliability and
scalability. Old-school active/passive failover doesn't even fully address
reliability - the entire file system goes down during the
failure-detection interval - and fails to address scalability at all.
Having to provision two special ultra-powerful machines as your active and
standby metadata servers should not be acceptable any more, but to this day
new projects based on this approach continue to appear and still claim to be
the best. Systems that are designed to have as many metadata servers as
you want are common enough that you shouldn't have to settle for anything less.
How is data distributed and/or replicated? Does each file live on only
one node, or can files be split/striped across multiple nodes? Is
replication (or erasure coding) built in, or will you have to rely on some
other piece of software to protect against disk or node failures? Can data
be replicated across data centers? If so, with what kinds of consistency and
performance tradeoffs?
What access methods are supported? Even when accessing a distributed file
system as a distributed file system, there are differences among native
protocols (implemented either in the kernel or user space), NFS (multiple
versions), and SMB (ditto). It might also be useful to have an S3/Swift
object-store interface, or a block-device interface.
What network configurations are supported? Does the system support
InfiniBand or other kinds of RDMA? Does it use one network for both
client-to-server and server-to-server communication, or can/must those be
segregated? Does it support IPv6?
How easy is it to set up and use? This is another major differentiator.
Distributed file systems have traditionally been one of the most difficult
kinds of software to work with. With only a couple of exceptions, setting
one up will require an inordinate amount of editing and copying files around
manually, on each node and usually for each layer. Does adding or removing
a node require more manual reconfiguration? Can it even be done online, or
does it require a cold restart? How about a software upgrade?
What security features does it have? Does it support ACLs? Which flavor?
Does it support SELinux? Can data be encrypted on the network? On disk?
Where are the keys, in either case? How are identities managed for
authorization? Can it use Kerberos, or LDAP/AD?
How efficiently is data stored? Replication can (obviously) require N times
as much disk space, but often offers the best performance. Erasure coding
is more storage-efficient, but typically slower. Are compression and/or
deduplication supported? Are there block-size or other concerns that might
also lead to wasted space?
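To put rough numbers on that tradeoff, the arithmetic is simple: n-way replication stores n raw bytes per user byte, while a k+m erasure code (k data fragments, m parity fragments) stores (k+m)/k. This sketch ignores metadata, journal, and block-rounding overhead, which is exactly what the question above is probing.

```python
def storage_overhead_replication(n_replicas):
    """Raw bytes stored per byte of user data with n-way replication."""
    return float(n_replicas)

def storage_overhead_ec(k, m):
    """Raw bytes stored per byte of user data with a k+m erasure code
    (k data fragments plus m parity fragments per stripe)."""
    return (k + m) / k

print(storage_overhead_replication(3))  # 3.0
print(storage_overhead_ec(8, 4))        # 1.5
```

Halving the raw storage is why erasure coding is attractive, but reconstructing a lost fragment requires reading from k nodes instead of one, which is where the performance cost hides.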
Are snapshots supported? Multiple snapshots? Snapshots of snapshots?
Writable snapshots (clones)? Snapshots/clones of clones? How
space-efficient are they? What dependencies (e.g. LVM or ZFS) do they have?
How does the system detect failures, and subsequently repair files for which
one or more copies/fragments were lost to those failures? Is the process
fully automatic, or does it require some sort of manual intervention? How
does it affect ongoing performance? How is "split brain" handled? Is some
sort of quorum enforced to prevent it? How is it reported? Can it be
repaired automatically? Can it be repaired manually?
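The usual defense against split brain is a majority quorum, which can be sketched in a couple of lines. Note the strict inequality: on an even split (say 2 of 4), neither side has quorum, which is exactly the point - two partitions accepting conflicting writes is worse than one partition refusing them.

```python
def has_quorum(reachable, cluster_size):
    """A partition may serve writes only if it can reach a strict
    majority of the cluster. On an even split, both sides fail this
    test, so neither can diverge from the other."""
    return reachable > cluster_size // 2

print(has_quorum(2, 3))  # True
print(has_quorum(2, 4))  # False: even split, neither side wins
```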
How does the system migrate data when nodes are added or removed? Again, is
this automatic or does it require manual intervention? How does it affect
ongoing performance? How well do the rebalancing algorithms work to ensure
that a minimal amount of data is moved? How flexible are those algorithms,
or are there multiple algorithms that the user can choose?
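As a rough illustration of why the placement algorithm matters here, the sketch below (with made-up node and key names) compares consistent hashing against naive modulo placement when a fifth node joins a four-node cluster: the former moves roughly a fifth of the keys, the latter roughly four fifths.

```python
import hashlib
from bisect import bisect

def ring(nodes, vnodes=100):
    """Build a consistent-hash ring: each node gets many virtual points."""
    points = []
    for n in nodes:
        for v in range(vnodes):
            h = int(hashlib.md5(f"{n}:{v}".encode()).hexdigest(), 16)
            points.append((h, n))
    return sorted(points)

def owner(r, key):
    """A key belongs to the first ring point clockwise of its hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    i = bisect(r, (h,))
    return r[i % len(r)][1]

def mod_owner(key, n):
    """Naive placement: hash modulo node count."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"file-{i}" for i in range(10000)]
old_nodes = [f"node{i}" for i in range(4)]
r_old, r_new = ring(old_nodes), ring(old_nodes + ["node4"])

moved_hash = sum(owner(r_old, k) != owner(r_new, k) for k in keys)
moved_mod = sum(mod_owner(k, 4) != mod_owner(k, 5) for k in keys)

print(moved_hash / len(keys))  # roughly 0.2
print(moved_mod / len(keys))   # roughly 0.8
```

Real systems layer replication, failure domains, and weighting on top of this, but the core question - what fraction of data moves per membership change - is the same.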
That's a lot, isn't it? And we haven't even gotten to performance yet.
Everybody wants to skip ahead to performance. There's a saying in the
military, that amateurs talk about strategy (without actually understanding
even that) while professionals worry about logistics. In storage, amateurs
talk about performance (without actually understanding even that) while
professionals worry about things like robustness and security and operational
simplicity - all of the stuff above. Still, performance does matter, so
there's a whole separate set of questions to ask about any performance claims.
Is anything about the configuration, workload, or testing methodology
described? Claims like "24x better response time" or "7x better IOPS" are
utterly worthless without any of that information. If I try hard
enough, I can find situations where GlusterFS is 7x faster than HDFS, others
where HDFS is 7x faster than Ceph, and still others where Ceph is 7x faster
than GlusterFS. If you see nothing but one or two headline numbers, you
might as well ignore those.
How many servers were involved? How many clients? What kind of network
between them? I'm not against micro-benchmarks involving small numbers of
clients and servers - I've run a few and posted about the results myself -
but it's important to understand that they can only measure one aspect of
the system. Even when it's a very important aspect, and very likely to be
reflected in a broader kind of test, it's still a bit like looking at a few
cells under a microscope vs. looking at the whole animal.
How many disks were involved? What kind? What kinds of RAID controllers?
What partitioning scheme or local file systems were involved? Many
distributed file systems are highly sensitive to the speed of these
underlying components, either generally or in specific roles (e.g. as
log/journal or metadata targets). It's very easy to configure a system in a
way that's optimal for one competitor and awful for another. Look up "short
stroking" and "head thrashing" to get some idea of the tricks that
commercial storage vendors use both among themselves and as a bulwark
against true software-defined storage.
What did the workload look like? Reads or writes? Large requests or small?
Sequential or random? If random, was it all blocks exactly once but in
random order, or true random, or weighted somehow? How many concurrent I/O
streams per process, per client, or overall? What queue depth or fsync
interval was each thread using? The ideal here is to show the actual
iozone or fio command line and/or job files, to remove any ambiguity,
but any information at all is better than none.
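For instance, a job file like this hypothetical one answers most of those questions at a glance - request size, mix, queue depth, fsync interval, concurrency - where a bare headline number answers none of them. (The directory and sizes here are placeholders, not a recommended benchmark.)

```ini
; Hypothetical fio job: 4 KiB random writes at queue depth 16,
; with an fsync every 32 writes, on the mount under test.
[global]
directory=/mnt/dfs-under-test
size=4g
runtime=300
time_based=1
ioengine=libaio
direct=1

[randwrite-4k]
rw=randwrite
bs=4k
iodepth=16
fsync=32
numjobs=8
```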
How large were the datasets, and how long were the runs? Did requests
mostly go only to client caches/buffers, to equivalent caches on the server, or to
actual disk? What parts of the system were really exercised? Note that the
answers might be different based on different file systems' caching and
other strategies. Exploiting these differences to generate misleading
numbers is another favorite big-dollar-storage vendor trick.
How did the performance scale along various axes? Number of clients,
servers, disks? Speed of network or disks? Number of worker threads?
Replication level? This is where you would normally expect to see some
back-and-forth between alternatives, as their respective tradeoff spaces are explored.
How did the performance vary over time? Was it steady, or glitchy? Did
some clients/threads race along while others starved? Don't just look at
averages over an entire run. Look at IOPS and 99th-percentile latencies for
each client and for each interval within a run. Your application's
performance might be bound by the worst sample in that entire set, so make
sure you know what that worst case is and how often it's approached.
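Here's a small illustration, with synthetic numbers, of why per-interval percentiles matter: the two intervals below have nearly indistinguishable means, but their 99th-percentile latencies differ by two orders of magnitude.

```python
def p99(samples):
    """99th-percentile of a list of latency samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

# latencies_ms[i] = synthetic samples collected during interval i
latencies_ms = [
    [1.0] * 99 + [2.0],      # steady interval
    [1.0] * 99 + [250.0],    # one stall: similar mean, ugly tail
]

for i, interval in enumerate(latencies_ms):
    mean = sum(interval) / len(interval)
    print(f"interval {i}: mean={mean:.2f} ms  p99={p99(interval):.1f} ms")
```

An application that issues dependent requests - read, then decide, then write - runs at the speed of that tail, not the mean.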
Phew. There's a lot more, of course, but if you can get the answers to these
questions you should have a pretty good handle on whether the thing in front of
you is a serious distributed file system or just someone's research project
that they expect you to underwrite. If you can compare the answers for two or
more distributed file systems, you should have a good idea which one will
really suit your needs. I'm sure I forgot something. Please let me know if
you find out what it was.