I had a nice chat with some folks from MaxiScale this evening. Yes, I know, some people will think I’m forever tarnished by having actually talked to somebody who makes money, but I figure if you write something about someone then it’s only fair that you give them a chance to respond. All in all, I think it’s a good thing that they’re willing to go mano a mano with critics, and they were far more gracious toward me than I had previously been toward them, so they deserve some credit. Before I get into the technical content, though, I’d like to clear up a few things.

  • Some people have characterized my previous article about MaxiScale as a review of their product. It was not. It was, mostly, a review of a particular white paper they had published. At least one person also accused me of fighting FUD with more FUD. Guilty as charged. I’m sure my readers know that when my ire is aroused I can be pretty acerbic. The white paper – which BTW is apparently undergoing some revision – annoyed me. I gave it a harsh review, and some of that harshness extended to playing “turn the tables” a bit. If I wanted to review the product, I’d want to see it in operation first.
  • My conversation with MaxiScale was predicated on an explicit agreement that nothing confidential would be discussed. Due to the nature of my own work I cannot be privy to their secrets, nor am I authorized to share my employer’s. Everything I have to say here should be considered public information.

Much of the conversation was about “peer sets” and placement strategies. It turns out that MaxiScale’s approach is based on some of the same techniques I’ve talked about here. Each file is hashed to identify a peer set which will handle its metadata, but then the members of that peer set might determine that the data should be placed elsewhere. The term “consistent hashing” wasn’t actually used, but I’d have to guess that what they have is either that or a moral equivalent. Similarly, I’m sure there’s some “special sauce” in how they determine which peer set should receive the data, and I’m content to leave it that way. What’s important is the general approach, and their hash-based method is IMO very consistent with what I wrote yesterday about good design for distributed systems.
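To make the general shape of that approach concrete, here's a minimal sketch of hash-based peer-set lookup in Python. Everything in it – the peer-set table, the hash choice, the placement stub – is my own illustration, not MaxiScale's actual implementation, and a real system would use consistent hashing (or its moral equivalent) so that adding or removing peer sets doesn't reshuffle everything.

```python
import hashlib

# Purely illustrative cluster layout: each peer set is a small group of
# nodes that jointly owns the metadata for the files hashed to it.
PEER_SETS = [
    ["nodeA1", "nodeA2", "nodeA3"],
    ["nodeB1", "nodeB2", "nodeB3"],
    ["nodeC1", "nodeC2", "nodeC3"],
]

def metadata_peer_set(path):
    """Hash the file's path to pick the peer set that owns its metadata."""
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    return PEER_SETS[int.from_bytes(digest[:8], "big") % len(PEER_SETS)]

def place_file(path):
    """The metadata owners may then direct the data somewhere else entirely.
    How they choose is the interesting part (load, capacity, locality...);
    here it's just a stub that keeps the data with the metadata."""
    owners = metadata_peer_set(path)
    data_set = owners   # placeholder for the real placement decision
    return owners, data_set

print(place_file("/home/alice/mail/inbox"))
```

The property that matters is that any client can find a file's metadata owners with nothing more than the path and the peer-set table – no central lookup service sitting in the data path.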

On another issue, I’m only half convinced. Apparently they have their own protocol which does replication via multicast. This was a possibility I hadn’t considered, even though I’ve seen other parallel filesystems that do it. I’m not really a big fan of multicast. It might or might not actually involve less data on the wire than client-driven replication, depending on implementation and network topology. It could also be argued that if handling storage failures and retries at the node level is a good idea (instead of relying on RAID) then the exact same principle should be applied to network failures and retries as well (instead of relying on multicast). Using multicast isn’t entirely a bad choice, but it’s not a clear winner vs. server-driven replication on a back-end network either.
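For contrast, here's roughly what the alternative I have in mind looks like: plain unicast replication where every copy gets its own acknowledgement and retry loop, so network failures are handled in the same place storage failures are. This is my own sketch of the general technique, not anything MaxiScale described; the hosts, ports, and ack byte are all made up.

```python
import socket

REPLICAS = [("10.0.0.11", 7000), ("10.0.0.12", 7000), ("10.0.0.13", 7000)]
MAX_TRIES = 3

def replicate_block(block_id, data):
    """Send one block to every replica over ordinary TCP, retrying each
    replica independently and reporting any that never acknowledged."""
    failed = []
    for host, port in REPLICAS:
        for _ in range(MAX_TRIES):
            try:
                with socket.create_connection((host, port), timeout=2) as sock:
                    sock.sendall(block_id.to_bytes(8, "big") + data)
                    if sock.recv(1) == b"\x01":   # replica's ack byte
                        break                     # this replica is done
            except OSError:
                pass                              # fall through and retry
        else:
            failed.append((host, port))           # exhausted retries
    return failed                                 # empty list == fully replicated
```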

This leads straight into another thorny issue: platform compatibility. Having clients run "a significant amount of file system software" is not bad only "because it must be knowledgeable about the lower-level workings of the file system," as the white paper says in its criticism of SAN filesystems. Using a proprietary protocol means having to implement that protocol yourself on every platform you intend to support. It also means being dependent on each platform's support for not-quite-universal features like multicast. When I asked about platform support, the answer was that "major manufacturers" were supported. There was, notably, no mention of Linux in general or of any particular Linux distribution. Since I was representing myself, not my employer, I didn't press further. According to a follow-up email, there is a Linux client which is known to run on RHEL, SLES, Ubuntu/Debian, and Gentoo.

The last significant technical issue we discussed was striping. They don't do it. The reason given was that they're focused on small-file workloads – mention was made of retrieving files under 1MB with a single disk operation – and that striping could be a waste or even a negative in such cases. That's absolutely true. I've worked with several parallel filesystems. They tend to be good at delivering lots of MB/s, but they're often poor for IOPS and downright lousy for metadata ops/second. This is not a strictly necessary consequence of striping, but it often relates to the complexity of needing a file to be created in multiple places but then have different states in each (as opposed to replication, where the states are identical). Just think for a while about how stat(2) should return a correct value for st_size when a file is striped across several servers, and you'll see what I mean. For the systems I design, striping is pretty much essential, but they're hitting a different design point and it's fair to say that for them it might be a mistake.
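To make the stat(2) point concrete, here's a toy version of the calculation, assuming plain round-robin striping with a fixed stripe unit size (both assumptions mine): st_size is the offset just past the last byte held by any server, so you need the size of every stripe object – or coherent cached state about all of them – to answer what is, for a replicated file, a trivial one-server question.

```python
STRIPE_SIZE = 64 * 1024   # stripe unit size in bytes (assumed)

def st_size_round_robin(object_sizes):
    """Reconstruct st_size for a file striped round-robin across
    len(object_sizes) servers, where object_sizes[i] is the length of the
    stripe object held by server i."""
    n = len(object_sizes)
    end = 0
    for i, size in enumerate(object_sizes):
        if size == 0:
            continue
        last = size - 1                        # last byte's offset within this object
        unit = (last // STRIPE_SIZE) * n + i   # which stripe unit of the file that is
        end = max(end, unit * STRIPE_SIZE + (last % STRIPE_SIZE) + 1)
    return end

# Example: 3 servers, 64 KiB stripe units, a 200 KiB file.
# Server 0 holds units 0 and 3 (the 8 KiB tail), servers 1 and 2 hold one full unit each.
assert st_size_round_robin([72 * 1024, 64 * 1024, 64 * 1024]) == 200 * 1024
```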

Overall, I was pretty impressed. They didn't do everything the way I would have, and they didn't give all the answers I would have liked, but it seems like they made reasonable choices and – just as importantly – are willing to explain those choices even to folks like me. On the particular issue of data distribution, their hashed peer-set approach seems to be on the right track. It's a hard problem, at the core of scalable storage-system design, and their approach seems to avoid many of the SPOFs and bottlenecks I've seen plague other systems in this space. It'll be interesting to see where they're able to go with it, and I wish them luck.