Canned Platypus

Making the world better, one byte at a time.

Jan
16

Fighting the MaxiScale FUD

A lot of people don’t understand storage. A lot of people don’t understand distributed systems. It should be no surprise, then, that very few really seem to understand distributed storage. Even experts in one of the two relevant specialties often seem lost when attempting to deal with their intersection, or make amazingly bad decisions when they try to implement something in that space. Wherever there is confusion of general lack of knowledge, there is also an opportunity for vendors to sling FUD. I was reminded of this by a MaxiScale white paper called Distributed Metadata: The Key to Achieving Web-Scale File Systems. My intent in addressing it is not to beat up MaxiScale – though I will be doing some of that at the end – but to use the white paper’s shortcomings as an opportunity to clear up some of the ignorance and confusion that they’re trying to use to their own advantage. Let’s examine some of their worst claims.

The simplest approach to building an infrastructure with a large number of storage systems is known as a “shared nothing” cluster (Figure 1). The shared nothing method consists of a cluster of file server nodes built on low-cost, commodity hardware, connected to clients via IP with a shared back-end. The entire cluster is then connected to a large storage facility, usually via a SAN.

No, no, no! The very name “shared nothing” indicates that the storage is not shared. In fact, many architecturally shared-nothing systems do actually use shared storage to enable transparent failover, to ease provisioning, or to take advantage of other features such as snapshots or wide-area replication, but that’s just a deployment method. A shared-nothing architecture also allows storage to be entirely internal to each server, and such filesystems are also often deployed in that way.

A big disadvantage is that the file system namespace is fragmented across nodes; clients must understand on which nodes individual files are located. It is also inefficient because there is no way to effectively load balance between the nodes.

This is not a problem with shared-nothing; it is a problem with static partitioning – which I’ll discuss more in a little while. In fact, the partitioning in a shared-nothing system can be quite dynamic, even including migration of live data. It’s not easy to write and debug the code for that, especially when trying to deal with the performance issues that are involved, but there’s nothing in the architecture to preclude it.

While it is possible to create a single root hierarchy that appears to have all files in a single directory, the reality is that requests must still target the node associated with the file to be accessed. A distributed lock manager manages file access by multiple clients. This is difficult, time consuming, and risky to implement correctly. Distributed lock managers fail to scale beyond a small number of nodes because internodal communication overhead quickly increases with node count.

DLMs are more characteristic of what are variously called cluster or SAN filesystems (which we’ll get to in a minute) than of shared-nothing filesystems. Notice how all of the criticisms offered so far of shared-nothing filesystems are either unrelated to the architecture or more applicable to alternatives. Maybe that’s because MaxiScale’s own architecture is shared-nothing, but they want to differentiate themselves from others who have explored that territory before them. Let’s take a look at what they have to say about other kinds of architectures.

Any storage node can serve any file in a clustered file system. Typical clustered file systems rely on synchronized metadata across all nodes to ensure consistency in handling file requests. Synchronization processes do not scale well as more nodes are added and each node must communicate metadata changes to all other nodes.

Here, MaxiScale takes aim at a target that doesn’t exist. There are no filesystems relying on such broadcast-style handling of metadata, at least not that anyone takes seriously.

Clusters usually require a dedicated control network such as InfiniBand or Fibre Channel to exchange metadata and coordinate their activity.

Bearing in mind that they’re still talking about a non-existent type of system, this criticism is no more true of any other system than of MaxiScale’s own. The simple fact is that if you want to replicate data or metadata N>1 times – and you’d better, for reliability reasons – then you need to move it N times. You can do that by shipping it N times from the client on a front-end network, or you can do it by shipping it once on the front-end network and N-1 times on a private back-end network. Most such systems – e.g. Lustre, PVFS2, GlusterFS – do not require any separate back-end network. Others – e.g. Panasas in the SAN-filesystem space, Isilon in the clustered-NAS space – do require such networks to ensure sufficient bandwidth for such replication, but it’s really more of a product-configuration choice (related to testing and support matrices) than an architectural one. If MaxiScale themselves have replicated data or metadata, then they also face the same problems with the same solutions.

SAN-based systems allow multiple clients to directly access raw block-storage

Clients communicate directly with SAN storage resources once necessary control information is obtained from the metadata server. Each client runs a significant amount of file system software because it must be knowledgeable about the lower-level workings of the file system itself.

Untrue. The client has to understand a data structure that describes where the data are, but those data structures might have nothing to do with the internal structures used by the metadata servers and parsing the information does not require knowing anything about anyone’s “lower-level workings” at all. The client does have to run a significant amount of filesystem software just as they do already with NFS or any other standard protocol. Yet again, MaxiScale’s criticism applies to themselves as much as to anyone.

The metadata server quickly becomes a bottleneck to overall system performance when multiple clients perform simultaneous requests.

Another target that doesn’t exist. None of the systems I’ve mentioned have a single metadata server. A few did once, but they learned their lesson before MaxiScale came along.

OK, enough about MaxiScale’s analysis of problems that don’t exist. How about problems that do exist? They’re right that metadata management is a central problem. If you have multiple servers, you have three choices for how to deal with the metadata.

  • At one extreme, you mirror metadata everywhere. This makes queries fast and easy to implement, but imposes a hideous penalty on updates, so nobody does it this way.
  • At the other extreme, you put each piece of metadata only one place but spread the pieces around. This makes updates fast, and with most of the common hash-based approaches queries can be fast too in the normal case, but this approach can add significant complexity to handle other cases such as when servers are added or removed.
  • In between, you try to strike a balance between updates and queries by replicating to all>N>1 places. The implementation is usually very similar to that for the one-place approach.

It seems that MaxiScale, like everyone else, has chosen the last approach. Have they added anything new? It’s hard to tell. They might have. On the other hand, their technology might actually be no better – might indeed be no different – than what had already been invented elsewhere. Having obviously failed to survey the state of the art, they might even have succeeded in reinventing technology that others left behind a generation ago. What’s certain is that they’re still hand-waving about some important issues that others have already addressed.

  • Their vaunted “Peer Sets” reek of static partitioning, which has long been recognized as a poor way to allocate either capacity or load. I don’t want my files sorted into big buckets just because that makes the metadata-management coder’s job easier (and BTW sounds suspiciously similar to what IBRIX was already doing). I want a system where placement decisions are made at file granularity, and re-made to keep load and capacity balanced. I very much doubt that MaxiScale would fare well in any serious comparison of their placement strategy vs. those employed by competitors such as Isilon. Peer sets are a liability, not a strength.
  • N-way replication is hardly state of the art, and doesn’t avoid the problem of having to move data N times. Since they try to discredit both separate back-end networks and client knowledge of actual data locations, the only remaining possibility is that they’re transparently forwarding data between servers on the front-end networks. That’s a “worst of both worlds” approach. Now the load is not only still there but contending with the clients’ own traffic.
  • There doesn’t seem to be anything on their site about actions taken to ensure data integrity, which is especially important since they do mention using inexpensive SATA drives.
  • There’s no mention of striping, either, though that’s another critical technique for large-scale systems. Reading from any member of a peer set is fine for reads, but writing to all is lousy.
  • There’s also nothing about other kinds of functionality such as thin provisioning, replication, multi-tenancy, snapshots, etc. These might not be part of a high-scale storage system’s core functionality, but they’re often available in such systems and often require (or at least can benefit from) integration with the techniques used in that core technology.

Maybe MaxiScale has solved a whole bunch of difficult problems, and solved them well. Or maybe not. You can rarely trust what vendors say about their own products, and you can never trust what they say about competitors (or competing approaches). Believing MaxiScale’s claims about their own capabilities would require faith, and I for one don’t tend to put much faith in people who have already demonstrated a propensity for misleading or outright false claims. Maybe this white paper was written by the people who are really behind MaxiScale’s technology. Maybe it was just written by some marketing intern who really didn’t understand the subject matter. It certainly reads like more of the latter, but in any case this kind of material does its authors more harm than good.

Comments

  1. Excellent analysis Jeff

  2. Good blog post, currently looking into building our own shared nothing cluster.

    T

  3. check out either Polyserve or Ibrix.

    What they are describing is exactly how they both work.

    -J

  4. I don’t believe that’s true in either case. There’s a difference between a DLM-based system where metadata is primarily stored on disk and one where metadata is communicated directly to every other node. Polyserve is an example of the former, but MaxiScale’s white paper is explicitly (in the excerpt I used) referring to the latter. As for Ibrix, all of the “area code” blather in their “segmented” architecture clearly points toward a more fully distributed system much like MaxiScale’s own – which is why I mentioned them a second time in reference to MaxiScale’s “peer sets” placement method. Lastly, one or both of these HP products might be considered to fail the “taken seriously” test.

  5. Sorry Jeff – I missed the peer sets part, yes this was what IBRIX did initially.

    Both products require dedicated backend networks, polyserve for the SCLD and DLM portion, and IBRIX also for the data/metadata portion. Polyserve was always less scalable on our clusters than IBRIX, but was (as you mentioned) a different architecture for a different problem. Matrix clusters were never intended to be internet scale, and as you noted they dont scale well past 12-16 node clusters based on my running of them in RAC environments. (Huge DLM traffic)

    IBRIX on the other hand while they had default file/metadata placement, in the last generation of the software before they were purchased by HP allowed the administrator to dynamically rebalance these portions of the system according to our needs. We used this pretty extensively to balance out 3PB of disk with hundreds of millions of files across 3 data centers. They were one of the first we found to allow rebalancing/mirroring/failover across multiple geographic datacenters, although it was clear we were pushing the product at that point.

    Based on your current research where are the interesting companies/spaces now in this area?

    -J

  6. IBRIX/X9000 from HP is very relevant in a cloud scale-out environment and we have customers deploying this technology where they need scalability in the multi-PB range, billions of addressable objects in the same namespace and geographic content distribution/replication. IBRIX from an architectural standpoint does not require peer sets nor static partitioning. Data placement occurs at the inode granularity and can be intelligently placed on nodes or different tiers of storage varying by performance/cost characteristics or an individual file can be migrated or replicated post-ingestion based on user-defined metadata policies or frequency of access. Files, entire segments of the namespace, network interfaces and other resources can migrate or be replicated from any node to any other node in the cluster. Also, rebalancing of files based on node capacity utilization is a common feature used by our customers and this file movement is seamless to the end-user.

    As you stated earlier in your article, many topology choices are made for deployment reasons. IBRIX allows you to deploy in a classic share-all storage back-end, pairs of nodes sharing DAS storage or nodes with completely captive DAS storage. They are all valid topologies with different characteristics in terms of performance, HA, capacity utilization and price. If you or any of your readers want to learn more about the architecture, please contact me any time.

    BTW, great blog. File system discussions are awesome!

  7. Jason, I’m glad to say that this is a pretty lively field after a long period of near dormancy. Here are some suggestions of companies/projects worth looking at.

    Parallel filesystems: Lustre, PVFS2, GlusterFS, Ceph, XtreemFS
    Clustered NAS: Isilon, Exanet, NetApp GX, EMC Celerra
    Non-POSIX cloud “filesystems”: MogileFS, Kosmos/CloudStore, HDFS, GoogleFS
    Block level (but modern techniques): Parascale, Mezeo, Zetta, Nasuni
    Services: Mozy, Iron Mountain, Nirvanix, Dropbox, ZumoDrive
    Other: Tahoe-LAFS, EMC Atmos

    FWIW, I’m working at Red Hat on a next generation of cloud filesystem. The parallel filesystems mentioned above offer some useful functionality, but there are still some holes that need to be filled: multi-tenancy, WAN distribution, and non-disruptive addition/removal of servers are highest on that list. Enhancing an existing codebase in these ways should allow me to create maximum functionality with minimum effort. I’d like to say more, but there are still some things that have to happen before I’d be comfortable doing that. Soon…

  8. oleg kiselev Says: January 20th, 2010 at 9:24 pm

    one correction: parascale is file-level, not block-level. it’s multi-tentant, access is via nfs, http, ftp, webdav, CIFS (next release), and REST (next++).

  9. Jeff, nice post!

    Interesting that Redhat is going to build a cloud filesystem. We run on top of Redhat 5 and offer a free download of our enterprise ready product. We do policy-based placement and replication of unstructured data along with all the management features allowing for non-disruptive addition/deletion of storage nodes/etc.

    As Oleg mentions, we’re focused on file storage and export our underlying object storage architecture through standard protocols (natively and without a gateway) and because of this and other features can’t be easily categorized into these buckets defined above.

    Keep these great posts coming.

  10. Jeff:

    Why is Tahoe-LAFS in a different category (and the dreaded “other” at that!) from:

    Non-POSIX cloud “filesystems”: MogileFS, Kosmos/CloudStore, HDFS, GoogleFS

    This is an honest question — my motivation is more to learn about your thought process than to get you to change it.

    Regards,

    Zooko

  11. I put Tahoe-LAFS in a different category mainly because its primary goals are different – secure/private storage across a broadly distributed environment, rather than performance/scalability in a local and trusting environment. This is not to say that Tahoe-LAFS can’t serve well in those environments also, but that’s not what seems to set it apart. No insult was intended; being in the “other” category means a three-word categorization wouldn’t suffice, and I consider that a good thing.

Leave a Comment