My recent post about real parallel filesystems generated an unexpected traffic spike, largely because Wes Felter posted it on Hacker News. Thanks, Wes. In the comments there, I ended up saying a few things about Lustre. Those who read this site or who have worked with me mostly know that I’m not really a big fan of Lustre, despite having worked on it a lot for two-plus years at SiCortex, and that I’d rather work with just about any other parallel filesystem. Just so nobody accuses me of whispering behind anyone’s back, I’m going to lay out the reasons from my perspective as a developer; users, administrators, and business folks might have others, but these are mine.

  • Single metadata server. Some might argue that Lustre doesn’t need distributed metadata because the single MDS performs so well, but then why has even Sun repeatedly attempted – and as of today still not succeeded – to make the metadata role more distributed? Most of the other problems I’ll mention wouldn’t have mattered nearly as much if they’d gotten this one right. It’s just an outdated architecture.
  • Thread-pool execution model. I was telling people how wrong this is before Lustre even existed, and I was hardly alone. It might have worked OK on some developer’s desktop test system, but in the real world, and especially on architectures with high context-switch costs, it led to that one MDS thrashing itself to death under even moderate load.
  • Poor binding of messages to threads. Naive thread-pool implementations are bad enough, but what’s worse is letting a bunch of blocked requests eat up all the threads so that the message that might unblock them all can’t find a thread to run on (see the sketch after this list).
  • No admission control. There was some flow control at the LND level, but that was just to deal with link-level resource issues. There was nothing at a global level to prevent a thousand clients from sending one request apiece to an MDS that only had a hundred threads for processing.
  • Relying too much on timeouts. I already wrote about this in Evil Timeouts, after one too many times when the previous two items conspired to create a deadlock that was “resolved” by a timeout.
  • Poor fault isolation. When a request did time out, the response would affect more than just that one request – often blowing away a whole connection if not rendering the entire system inoperable.
  • Lousy logging. I’ve lost too much time during my career dealing with “unique” logging systems that could only provide two kinds of information – too little or too much. This was just one more example: you got nothing useful at all until you opened the spigot, at which point the logging overhead perturbed the whole operating milieu and the one piece of useful information was buried under a thousand messages that some developer had found useful exactly once during unit testing and that remained in the code ever after. About the only use I ever had for Lustre log output was to search for some of its strings in their bug database.
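
To make the thread-exhaustion deadlock concrete, here is a minimal sketch in Python. None of this is Lustre code; the pool size, the request names, and the single Event standing in for a resource held elsewhere are invented for illustration, but the shape of the problem is the same: every service thread is occupied by a blocked request, and the one message that would release them can only sit in the queue.

```python
import time
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4                      # stand-in for the MDS service thread count
lock_released = threading.Event()  # stand-in for a resource held by some other client

def blocked_request(i):
    # Each of these occupies a worker thread until the event fires.
    lock_released.wait()
    return f"request {i} completed"

def release_request():
    # The one message that would unblock everyone -- but it needs a free thread.
    lock_released.set()
    return "release processed"

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

# Fill every worker with a blocked request...
futures = [pool.submit(blocked_request, i) for i in range(POOL_SIZE)]
# ...then send the request that would unblock them. It just sits in the queue.
futures.append(pool.submit(release_request))

time.sleep(1)
print(f"completed after 1s: {sum(f.done() for f in futures)} of {len(futures)}")  # 0: deadlocked

# Breaking the logjam from outside (here, by hand) lets everything drain at once.
lock_released.set()
pool.shutdown(wait=True)
print(f"completed after release: {sum(f.done() for f in futures)} of {len(futures)}")  # all 5
```

Nothing drains until something outside the pool breaks the logjam, and in Lustre’s case that something was a timeout rather than anything graceful.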

It’s easy to see that these are all related. If there had been distributed metadata or global admission control, the thread-exhaustion deadlock wouldn’t have been a problem and the timeouts wouldn’t have fired. As it was, though, the Lustre developers managed to create a perfect storm of implementation artifacts that made the result highly unstable. To avoid context thrashing you’d want to configure relatively few threads; to avoid deadlocks you’d want to configure relatively many; the problem was that the two safe ranges didn’t overlap. Any system would be vulnerable to one problem or the other, and usually to both. Then a timeout that should have caused a simple request retry instead blew up the whole system, and the broken logging made it impossible to figure out anything useful about what had happened. The patterns were always the same, but never actionable short of rewriting half of the codebase. At that point, what sane developer wouldn’t be investigating alternatives?
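
For contrast, here is the same sketch with a crude form of global admission control bolted on: cap the number of potentially-blocking requests below the pool size so that one thread is always free for the message that releases the others. Again, this is illustrative Python, not anything from the Lustre code base, and the cap of pool size minus one is just the simplest possible policy.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4
MAX_BLOCKING = POOL_SIZE - 1              # always keep at least one thread free
admission = threading.BoundedSemaphore(MAX_BLOCKING)
lock_released = threading.Event()

def blocked_request(i):
    # Refuse the request up front instead of letting it occupy the last free thread.
    if not admission.acquire(blocking=False):
        return f"request {i} rejected -- client retries later"
    try:
        lock_released.wait()
        return f"request {i} completed"
    finally:
        admission.release()

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
futures = [pool.submit(blocked_request, i) for i in range(POOL_SIZE)]
futures.append(pool.submit(lock_released.set))   # the "release" message

pool.shutdown(wait=True)   # drains: the release message always finds a thread
for f in futures[:-1]:
    print(f.result())
```

A rejected request can be retried, which is annoying but recoverable; a wedged server that only a timeout can “rescue” is not.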

I’ve often told people that the Lustre architecture is fine. Distributed metadata has always been part of that architecture, and most of the other things I mention above are implementation-level or at most design-level phenomena. Lustre does perform better than GlusterFS or most of its other competitors on a per-node basis, though shortly before I left SiCortex I was able to get better numbers with PVFS2, and in any case per-node performance is the wrong figure of merit for anything parallel. Lustre does have a good feature set, including flexible striping (even better with OST pools) and HSM integration. There are some talented people working on it, who might yet succeed in making the positives outweigh the negatives. Having developed code based on it and supported that code in the field throughout the 1.6 series and into 1.8, though, I’m not about to recommend it to anyone without some serious proof that things have improved.