I had a nice chat with some folks from MaxiScale this evening. Yes, I know, some people will think I’m forever tarnished by having actually talked to somebody who makes money, but I figure if you write something about someone then it’s only fair that you give them a chance to respond. All in all, I think it’s a good thing that they’re willing to go mano a mano with critics, and they were far more gracious toward me than I had previously been toward them, so they deserve some credit. Before I get into the technical content, though, I’d like to clear up a few things.
- Some people have characterized my previous article about MaxiScale as a review of their product. It was not. It was, mostly, a review of a particular white paper they had published. At least one person also accused me of fighting FUD with more FUD. Guilty as charged. I’m sure my readers know that when my ire is aroused I can be pretty acerbic. The white paper – which BTW is apparently undergoing some revision – annoyed me. I gave it a harsh review, and some of that harshness extended to playing “turn the tables” a bit. If I wanted to review the product, I’d want to see it in operation first.
- My conversation with MaxiScale was predicated on an explicit agreement that nothing confidential would be discussed. Due to the nature of my own work I cannot be privy to their secrets, nor am I authorized to share my employer’s. Everything I have to say here should be considered public information.
Much of the conversation was about “peer sets” and placement strategies. It turns out that MaxiScale’s approach is based on some of the same techniques I’ve talked about here. Each file is hashed to identify a peer set which will handle its metadata, but then the members of that peer set might determine that the data should be placed elsewhere. The term “consistent hashing” wasn’t actually used, but I’d have to guess that what they have is either that or a moral equivalent. Similarly, I’m sure there’s some “special sauce” in how they determine which peer set should receive the data, and I’m content to leave it that way. What’s important is the general approach, and their hash-based method is IMO very consistent with what I wrote yesterday about good design for distributed systems.
On another issue, I’m only half convinced. Apparently they have their own protocol which does replication via multicast. This was a possibility I hadn’t considered, even though I’ve seen other parallel filesystems that do it. I’m not really a big fan of multicast. It might or might not actually involve less data on the wire than client-driven replication, depending on implementation and network topology. It could also be argued that if handling storage failures and retries at the node level is a good idea (instead of relying on RAID) then the exact same principle should be applied to network failures and retries as well (instead of relying on multicast). Using multicast isn’t entirely a bad choice, but it’s not a clear winner vs. server-driven replication on a back-end network either.
This leads straight into another thorny issue: platform compatibility. Having clients run “a significant amount of file system software” on clients is not bad only “because it must be knowledgeable about the lower-level workings of the file system” as in the white paper’s criticism of SAN filesystems. Using a proprietary protocol means having to implement that protocol yourself on every platform you intend to support. It also means being dependent on each platform’s support for not-quite-universal features like multicast. When I asked about platform support, the answer was that “major manufacturers” were supported. There was, notably, no mention of Linux in general or of any particular Linux distribution. Since I was representing myself, not my employer, I didn’t press further. According to a follow-up email, there is a Linux client which is known to run on RHEL, SLES, Ubuntu/Debian, and Gentoo.
The last significant technical issue we discussed was striping. They don’t do it. The reason given was that they’re focused on small-file workloads – mention was made of retrieving files under 1MB with a single disk operation – and that striping could be a waste or even a negative in such cases. That’s absolutely true. I’ve worked with several parallel filesystems. They tend to be good at delivering lots of MB/s, but they’re often poor for IOPS and downright lousy for metadata ops/second. This is not a strictly necessary consequence of striping, but it often relates to the complexity of needing files to be created multiple places but then have different states (as opposed to replication where the states are identical). Just think for a while about how stat(2) should return a correct value for st_size when a file is striped across several servers, and you’ll see what I mean. For the systems I design striping is pretty much essential, but they’re hitting a different design point and it’s fair to say that for them it might be a mistake.
Overall, I was pretty impressed. They didn’t do everything the way I would have, and they didn’t give all the answers I would have liked, but it seems like they made reasonable choices and – just as importantly – are willing to explain those choices even to folks like me. On the particular issue of data distribution, their hashed peer-set approach seems to be on the right track. It’s a hard problem, at the core of scalable storage-system design, and their design seems to avoid many of the SPOFs and bottlenecks I’ve seen plague other designs in this space. It’ll be interesting to see where they’re able to go with it, and I wish them luck.
Jeff, thanks for the chance to discuss further. I’ll add a few things to clarify what MaxiScale is doing. (Disclaimer to all: I am MaxiScale’s director of product management.)
Client OS support – MaxiScale’s client is lightweight (see next point) and supports a broad range of platforms. We’re running on many flavors of Linux including RHEL, SLES, Ubuntu, Gentoo, Debian, CentOS. We’re also running on Windows flavors: 2K8, 2K3, XP. We can test with others – it’s really based on customer demand how we prioritize our efforts.
Client module – you raise a fair question and point back to our whitepaper. MaxiScale’s client is just a few KB in size because it doesn’t do any data processing or buffering/caching. It’s just there to take POSIX calls and hash the file pathname to determine where to direct metadata requests (open, close, etc.). We refer to this as client-side routing of file operations. File management logic resides in our cluster which cuts down on client side processing. The application or OS doesn’t have to deal with low-level filesystem management functions – just POSIX calls to open, read, write, close, etc.
Consistent hashing – yes, though we prefer to say deterministic hashing. At any point in time, the hash has to be consistent across the cluster for obvious reasons. We update the hash as the cluster changes in size to take advantage of more available resources. This also allows us to add capacity on-demand while client application file operations continue.
It’s very interesting to see the recent advances in scalable filesystem development across the industry. There are many approaches, each of which targets a slightly different mix of workloads. In large Web environments, there are good papers and blog posts out there explaining what Amazon, Facebook, Google, Yahoo! and others have done to create their own storage infrastructure on commodity hardware. To avoid repeating, here’s a blog post I did to learn more about these implementations: http://blog.maxiscale.com/2009/09/watching-the-big-players-amazon-facebook-google-yahoo.html
I think you mean open, close, read, write, fsync, stat, truncate, readdir, readlink (if you support symbolic links), getxattr (if you support xattrs), lock (if you support POSIX locks), etc. My un-favorites are link and rename, which introduce new first-class paths that might hash differently than the ones you’re using. It’s good to keep the client uninvolved in caching, but there’s still some mess involved both in handling ~40 different request types and in handling the various failures that can come back. Give those engineers a pat on the back for me, OK?
BTW, what is MaxiScale’s consistency model? Once a block or directory entry gets into the client’s cache, what (if anything) can cause it to be invalidated?
Re: file consistency across clients. At a very high level, when a reader opens a file that has been modified (prior to the open) it will receive the new contents of the file. The reader is protected against seeing in-progress changes from another writer that may be internally inconsistent.
That’s about as far as I can go on consistency before getting into secret sauce.
FWIW – many of the workloads we see in Web/Cloud are read-heavy, where files change less often relative to more traditional enterprise applications.
Fair enough, Mark, and thanks for answering. There are many options in this space, from lock-based full consistency through eventual consistency or MVCC all the way to just letting the chips fall where they may. For many workloads, just about any of them will work so long as the rules are clear and consistently followed, so it comes down to performance and ease of implementation.