A lot of people don’t understand storage. A lot of people don’t understand distributed systems. It should be no surprise, then, that very few really seem to understand distributed storage. Even experts in one of the two relevant specialties often seem lost when attempting to deal with their intersection, or make amazingly bad decisions when they try to implement something in that space. Wherever there is confusion or a general lack of knowledge, there is also an opportunity for vendors to sling FUD. I was reminded of this by a MaxiScale white paper called Distributed Metadata: The Key to Achieving Web-Scale File Systems. My intent in addressing it is not to beat up MaxiScale – though I will be doing some of that at the end – but to use the white paper’s shortcomings as an opportunity to clear up some of the ignorance and confusion that they’re trying to use to their own advantage. Let’s examine some of their worst claims.

The simplest approach to building an infrastructure with a large number of storage systems is known as a “shared nothing” cluster (Figure 1). The shared nothing method consists of a cluster of file server nodes built on low-cost, commodity hardware, connected to clients via IP with a shared back-end. The entire cluster is then connected to a large storage facility, usually via a SAN.

No, no, no! The very name “shared nothing” indicates that the storage is not shared. In fact, many architecturally shared-nothing systems do actually use shared storage to enable transparent failover, to ease provisioning, or to take advantage of other features such as snapshots or wide-area replication, but that’s just a deployment method. A shared-nothing architecture also allows storage to be entirely internal to each server, and such filesystems are also often deployed in that way.

A big disadvantage is that the file system namespace is fragmented across nodes; clients must understand on which nodes individual files are located. It is also inefficient because there is no way to effectively load balance between the nodes.

This is not a problem with shared-nothing; it is a problem with static partitioning – which I’ll discuss more in a little while. In fact, the partitioning in a shared-nothing system can be quite dynamic, even including migration of live data. It’s not easy to write and debug the code for that, especially when trying to deal with the performance issues that are involved, but there’s nothing in the architecture to preclude it.
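To make that concrete, here’s a minimal sketch (in Python, with made-up node and path names) of the kind of dynamic partition map a shared-nothing system might keep. Placement is computed per file, and a rebalancer can override it by migrating data, without any change to the architecture.

```python
# A minimal sketch of a dynamic partition map for a shared-nothing cluster.
# All names here are hypothetical; this illustrates the idea, not any
# particular product's implementation.

import hashlib

class PartitionMap:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.overrides = {}            # files moved after initial placement

    def node_for(self, path):
        """Default placement: hash the path onto a node, unless a
        rebalancer has recorded a newer home for it."""
        if path in self.overrides:
            return self.overrides[path]
        h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def migrate(self, path, new_node):
        """Record that a file's data now lives on another node.  A real
        system copies the data first and flips this mapping atomically;
        doing that without hurting performance is the hard part, not
        anything inherent to shared-nothing."""
        self.overrides[path] = new_node

pmap = PartitionMap(["node-a", "node-b", "node-c"])
print(pmap.node_for("/home/alice/report.txt"))
pmap.migrate("/home/alice/report.txt", "node-b")   # rebalance one hot file
print(pmap.node_for("/home/alice/report.txt"))
```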

While it is possible to create a single root hierarchy that appears to have all files in a single directory, the reality is that requests must still target the node associated with the file to be accessed. A distributed lock manager manages file access by multiple clients. This is difficult, time consuming, and risky to implement correctly. Distributed lock managers fail to scale beyond a small number of nodes because internodal communication overhead quickly increases with node count.

DLMs are more characteristic of what are variously called cluster or SAN filesystems (which we’ll get to in a minute) than of shared-nothing filesystems. Notice how all of the criticisms offered so far of shared-nothing filesystems are either unrelated to the architecture or more applicable to alternatives. Maybe that’s because MaxiScale’s own architecture is shared-nothing, but they want to differentiate themselves from others who have explored that territory before them. Let’s take a look at what they have to say about other kinds of architectures.

Any storage node can serve any file in a clustered file system. Typical clustered file systems rely on synchronized metadata across all nodes to ensure consistency in handling file requests. Synchronization processes do not scale well as more nodes are added and each node must communicate metadata changes to all other nodes.

Here, MaxiScale takes aim at a target that doesn’t exist. There are no filesystems relying on such broadcast-style handling of metadata, at least not that anyone takes seriously.

Clusters usually require a dedicated control network such as InfiniBand or Fibre Channel to exchange metadata and coordinate their activity.

Bearing in mind that they’re still talking about a non-existent type of system, this criticism is no more true of any other system than of MaxiScale’s own. The simple fact is that if you want to replicate data or metadata N>1 times – and you’d better, for reliability reasons – then you need to move it N times. You can do that by shipping it N times from the client on a front-end network, or you can do it by shipping it once on the front-end network and N-1 times on a private back-end network. Most such systems – e.g. Lustre, PVFS2, GlusterFS – do not require any separate back-end network. Others – e.g. Panasas in the SAN-filesystem space, Isilon in the clustered-NAS space – do require such networks to ensure sufficient bandwidth for such replication, but it’s really more of a product-configuration choice (related to testing and support matrices) than an architectural one. If MaxiScale themselves have replicated data or metadata, then they also face the same problems with the same solutions.
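The arithmetic is easy to check. Here’s a back-of-the-envelope sketch, with purely illustrative numbers, of where the bytes go for an N-way replicated write under each approach.

```python
# Where replication traffic lands for an N-way replicated write.
# The write size and replica counts are arbitrary, for illustration only.

def client_fanout(write_bytes, replicas):
    """Client ships the data to each replica itself: every copy crosses
    the front-end network."""
    return {"front_end": write_bytes * replicas, "back_end": 0}

def server_forwarding(write_bytes, replicas):
    """Client ships the data once; the receiving server forwards it to the
    other replicas over a private back-end network."""
    return {"front_end": write_bytes, "back_end": write_bytes * (replicas - 1)}

for n in (2, 3):
    print(n, client_fanout(1_000_000, n), server_forwarding(1_000_000, n))

# Either way the data moves N times; the only choice is which wire carries
# the extra N-1 copies.  Forwarding between servers on the front-end
# network would move the same bytes while competing with client traffic.
```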

SAN-based systems allow multiple clients to directly access raw block-storage

Clients communicate directly with SAN storage resources once necessary control information is obtained from the metadata server. Each client runs a significant amount of file system software because it must be knowledgeable about the lower-level workings of the file system itself.

Untrue. The client has to understand a data structure that describes where the data are, but that data structure might have nothing to do with the internal structures used by the metadata servers, and parsing it does not require knowing anything about anyone’s “lower-level workings” at all. The client does have to run a significant amount of filesystem software, but that’s already true with NFS or any other standard protocol. Yet again, MaxiScale’s criticism applies to themselves as much as to anyone.
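For the record, the kind of data structure such a client actually has to parse tends to be little more than a list of extents. Here’s a hypothetical example; the field names and the single-extent lookup helper are my own illustration, not any particular filesystem’s format.

```python
# A hypothetical layout map handed to a client by a metadata server.
# Nothing in it exposes the server's internal structures; it only says
# where the bytes live.

from dataclasses import dataclass

@dataclass
class Extent:
    device: str        # LUN or storage target
    offset: int        # starting byte on that device
    length: int        # bytes covered by this extent

layout = [
    Extent(device="lun-7",  offset=0,       length=1 << 20),
    Extent(device="lun-12", offset=1 << 30, length=1 << 20),
]

def locate(layout, file_offset, nbytes):
    """Map a file offset to (device, device_offset, length).  Handles a
    read that fits in one extent, which is enough for illustration."""
    pos = 0
    for ext in layout:
        if pos <= file_offset < pos + ext.length:
            skip = file_offset - pos
            return (ext.device, ext.offset + skip,
                    min(nbytes, ext.length - skip))
        pos += ext.length
    raise ValueError("offset past end of file")

print(locate(layout, file_offset=(1 << 20) + 100, nbytes=512))
# -> ('lun-12', 1073741924, 512)
```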

The metadata server quickly becomes a bottleneck to overall system performance when multiple clients perform simultaneous requests.

Another target that doesn’t exist. None of the systems I’ve mentioned have a single metadata server. A few did once, but they learned their lesson before MaxiScale came along.

OK, enough about MaxiScale’s analysis of problems that don’t exist. How about problems that do exist? They’re right that metadata management is a central problem. If you have multiple servers, you have three choices for how to deal with the metadata.

  • At one extreme, you mirror metadata everywhere. This makes queries fast and easy to implement, but imposes a hideous penalty on updates, so nobody does it this way.
  • At the other extreme, you put each piece of metadata in only one place but spread the pieces around. This makes updates fast, and with most of the common hash-based approaches queries can be fast too in the normal case, but this approach can add significant complexity to handle other cases such as when servers are added or removed.
  • In between, you try to strike a balance between updates and queries by replicating each piece to N places, where 1 < N < all. The implementation is usually very similar to that for the one-place approach (there’s a small sketch of this right after the list).
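Here’s a minimal sketch of that middle-ground approach, with made-up server names and a simple hash-and-walk placement rule; real systems differ in the details, but the shape is the same.

```python
# Each piece of metadata lives on N replica servers chosen by hashing:
# a query is one hash away, an update touches N servers instead of all
# of them.  Server names and placement rule are illustrative only.

import hashlib

SERVERS = ["mds-0", "mds-1", "mds-2", "mds-3", "mds-4"]
N = 3                                  # 1 < N < number of servers

def metadata_replicas(path, servers=SERVERS, n=N):
    """Pick n distinct servers for a path, deterministically."""
    h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
    start = h % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(n)]

print(metadata_replicas("/exports/build/output.log"))
# A query can go to any one of the n servers; an update goes to all n.
# The messy part (shared with the one-place approach) is redoing this
# placement when servers are added or removed.
```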

It seems that MaxiScale, like everyone else, has chosen the last approach. Have they added anything new? It’s hard to tell. They might have. On the other hand, their technology might actually be no better – might indeed be no different – than what had already been invented elsewhere. Having obviously failed to survey the state of the art, they might even have succeeded in reinventing technology that others left behind a generation ago. What’s certain is that they’re still hand-waving about some important issues that others have already addressed.

  • Their vaunted “Peer Sets” reek of static partitioning, which has long been recognized as a poor way to allocate either capacity or load. I don’t want my files sorted into big buckets just because that makes the metadata-management coder’s job easier (which, BTW, sounds suspiciously similar to what IBRIX was already doing). I want a system where placement decisions are made at file granularity, and re-made to keep load and capacity balanced. I very much doubt that MaxiScale would fare well in any serious comparison of their placement strategy vs. those employed by competitors such as Isilon. Peer sets are a liability, not a strength.
  • N-way replication is hardly state of the art, and doesn’t avoid the problem of having to move data N times. Since they try to discredit both separate back-end networks and client knowledge of actual data locations, the only remaining possibility is that they’re transparently forwarding data between servers on the front-end networks. That’s a “worst of both worlds” approach. Now the load is not only still there but contending with the clients’ own traffic.
  • There doesn’t seem to be anything on their site about actions taken to ensure data integrity, which is especially important since they do mention using inexpensive SATA drives.
  • There’s no mention of striping, either, though that’s another critical technique for large-scale systems. Being able to read from any member of a peer set is fine, but having to write whole files to every member is lousy (see the quick comparison after this list).
  • There’s also nothing about other kinds of functionality such as thin provisioning, replication, multi-tenancy, snapshots, etc. These might not be part of a high-scale storage system’s core functionality, but they’re often available in such systems and often require (or at least can benefit from) integration with the techniques used in that core technology.
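To put a rough number on the striping point above: here’s a quick comparison using made-up sizes and bandwidths. The figures are assumptions for illustration, not measurements of anyone’s product.

```python
# Write path with and without striping, using arbitrary example numbers.
# "Peer set" here just means a fixed group of nodes that all hold the file.

def peer_set_write(file_bytes, members, node_bw):
    """Whole file written to every member: each node must absorb the
    entire file, so elapsed time is bounded by one node's ingest rate no
    matter how many members there are."""
    return {"elapsed_s": file_bytes / node_bw,
            "bytes_moved": file_bytes * members}

def striped_write(file_bytes, stripe_width, replicas, node_bw):
    """File split across stripe_width nodes, each piece replicated; every
    node absorbs only its share, and the pieces land in parallel."""
    return {"elapsed_s": (file_bytes / stripe_width) / node_bw,
            "bytes_moved": file_bytes * replicas}

GIB = 1 << 30
print(peer_set_write(10 * GIB, members=3, node_bw=100e6))
print(striped_write(10 * GIB, stripe_width=8, replicas=3, node_bw=100e6))
# Same total bytes moved for the same replica count, but the striped write
# finishes roughly stripe_width times sooner.
```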

Maybe MaxiScale has solved a whole bunch of difficult problems, and solved them well. Or maybe not. You can rarely trust what vendors say about their own products, and you can never trust what they say about competitors (or competing approaches). Believing MaxiScale’s claims about their own capabilities would require faith, and I for one don’t tend to put much faith in people who have already demonstrated a propensity for misleading or outright false claims. Maybe this white paper was written by the people who are really behind MaxiScale’s technology. Maybe it was just written by some marketing intern who really didn’t understand the subject matter. It certainly reads more like the latter, but in any case this kind of material does its authors more harm than good.