History of MPFS

In case it's not obvious, this is my history of MPFS. It reflects my memories, my feelings, etc. In places I'm sure my memory or perspective is imperfect, but the only people who can really say so are the others who were there. I've seen a lot of comments about MPFS - its place within EMC, its place within computing - from people who weren't there. Their opinions carry no weight whatsoever.

1. What Is It?
Before I really get started, I should explain what MPFS is. It's a parallel filesystem, which means that - unlike with traditional NFS or CIFS - a single client communicates with multiple servers instead of just one to read and write data. Having multiple servers not only allows a single client to get better performance, but also allows the entire system of many servers and clients to scale to performance levels beyond what any single server could deliver. There are many approaches to parallel filesystems, but one main distinction is between those where storage is segregated and hidden behind servers vs. those where storage is combined into one pool and directly accessible to clients. Most of the parallel filesystems you'll see nowadays - e.g. Lustre and PVFS - fall into the first category. MPFS was in the second category, which made a lot of sense at a time when 2Gb/s Fibre Channel connections to storage were more mature than 1Gb/s Ethernet connections between clients and servers. That tradeoff has changed since, but I'll wait until the end to talk about that.
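To make that distinction concrete, here's a minimal sketch of the two read paths. Everything in it - class names, the block-map shape, the call signatures - is invented for illustration; it shows only the topology of who moves the data, not real MPFS, NFS, or Lustre code.

```python
# Hypothetical sketch of the two categories described above.
# All names here are made up for illustration.

class FileServer:
    """Category 1: data flows through the server itself (classic NFS/CIFS style)."""
    def __init__(self, blocks):
        self.blocks = blocks                      # the server owns the storage

    def read(self, start, count):
        return b"".join(self.blocks[start:start + count])


class MetadataServer:
    """Category 2: the server only hands out locations ("maps")."""
    def __init__(self, block_map):
        self.block_map = block_map                # logical block -> physical block

    def get_map(self, start, count):
        return [self.block_map[i] for i in range(start, start + count)]


class SharedStorage:
    """A storage pool the clients can reach directly, e.g. over Fibre Channel."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, physical):
        return self.blocks[physical]


if __name__ == "__main__":
    disk = [bytes([i]) * 4 for i in range(8)]     # eight tiny "blocks"

    # Category 1: one request, but all of the data moves through the server.
    print(FileServer(disk).read(2, 3))

    # Category 2: one small map request, then the client reads the blocks itself.
    mds = MetadataServer({i: i for i in range(8)})
    san = SharedStorage(disk)
    print(b"".join(san.read_block(p) for p in mds.get_map(2, 3)))
```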

2. Historical Context
While I'm speaking of alternatives, I want to point out that I've never claimed that we (the MPFS team at EMC) were the first to think along these lines. We were certainly doing this before Lustre or Polyserve even existed, but I for one was well aware of xFS (not SGI's later XFS) and Frangipani and others having done it even earlier. During my time working on MPFS I actually attended talks on Sistina's (later Red Hat's) GFS and SGI's CXFS while both were under development. Commercially, there were a couple of projects I knew about - QFS was out there, and Fusion-something at Mercury. What we did, though, was release a commercial multi-platform (UNIX and Windows) multi-protocol (NFS and CIFS) shared-storage filesystem when nobody else had such a thing.

3. Names
That brings me to naming. The first name was Parallel NFS, but it wasn't just NFS, so we chose MPFS - Multi Path primarily, but Multi Platform and Multi Protocol were in our minds as well. It eventually got released as HighRoad, and now it's apparently MPFS again while pNFS is being used for another relative. None of the names were ever associated with people's names, contrary to certain logically-impossible rumors started by people who weren't there.

4. Opening
In 1998, shortly after I had joined Conley, they were bought by EMC. The Virtual Storage Server I'd been working on seemed to be of little interest to our new masters - though every aspect of it was eventually revived - so I found myself looking for a new project. Ric Calvillo was still in charge then, and either he or Ron Searls introduced me to Percy Tzelnic and Uresh Vahalia in the Network Storage Group. They had been doing clustered NFS servers since long before the current crop of vendors in that space came along, but their architecture had begun hitting a performance wall because of the communication between the "data movers" that made up a Celerra. They'd obviously been thinking for some time about some of the technical directions discussed above as a way to get past that wall. They hadn't worked on any operating system outside their direct control since before they left DEC, though, so there were certain technical skills they lacked. They needed us to develop clients a lot more than we needed them to develop servers. What they did have was the organizational clout to get the project approved and funded, and so we formed our partnership.

5. Midgame
We didn't really go in a nice orderly line from project initiation to specs to development to product, but it's easier to talk about things that way. The way that MPFS works is that, instead of clients reading and writing data through servers, they communicate with the servers about data locations and then read/write the data themselves using their own connection to storage. If the clients had to ask for locations on every I/O, this would be even worse than classic NFS, so they actually ask for far more locations than they immediately need and cache the "maps" to those locations. In practice, clients can do quite a lot of I/O in between requests to the server. However, whenever you're caching, you have to deal with issues such as coherency and recovery, and the key piece of making this work is defining a robust communications protocol. In our case this was called the File Mapping Protocol, or FMP.
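To give a feel for that map-caching idea - though not for FMP itself, whose actual messages, locking, and recovery semantics were far richer - here's a minimal sketch of a client that over-requests maps and caches them. Every class, name, and number in it is invented for illustration.

```python
# Hypothetical sketch of map caching, not the actual FMP protocol.

PREFETCH_BLOCKS = 1024           # ask for far more map entries than needed right now


class FakeMetadataServer:
    """Stand-in for the real server; just answers map requests from a table."""
    def __init__(self, block_map):
        self.block_map = block_map
        self.requests = 0

    def get_map(self, start, count):
        self.requests += 1
        end = min(start + count, len(self.block_map))
        return {i: self.block_map[i] for i in range(start, end)}


class FakeStorage:
    """Stand-in for a shared disk array the client can reach directly."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, physical):
        return self.blocks[physical]


class MapCachingClient:
    def __init__(self, mds, storage):
        self.mds = mds
        self.storage = storage
        self.map_cache = {}      # logical block -> physical block

    def read_block(self, logical):
        if logical not in self.map_cache:
            # One round trip covers the next PREFETCH_BLOCKS blocks, so most
            # subsequent reads never talk to the server at all.
            self.map_cache.update(self.mds.get_map(logical, PREFETCH_BLOCKS))
        return self.storage.read_block(self.map_cache[logical])

    def invalidate(self, logicals):
        # Coherency hook: the server would revoke maps when another client
        # writes; recovery after a crash similarly starts with an empty cache.
        for block in logicals:
            self.map_cache.pop(block, None)


if __name__ == "__main__":
    n = 10000
    mds = FakeMetadataServer({i: i for i in range(n)})
    client = MapCachingClient(mds, FakeStorage([bytes([i % 256]) for i in range(n)]))
    for block in range(n):
        client.read_block(block)
    print("map requests for", n, "block reads:", mds.requests)   # 10, not 10000
```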

Now I'm going to say something that I know other participants will dispute: I wrote the first serious FMP spec. By that, I don't mean that every idea in it was mine. In fact there was a great deal of discussion, often contentious, on almost every point. We all won some battles, and lost others. However, I do think that I contributed more to the final product than any other single person. In one important way this is incontrovertible: I was the one who wrote it down, compiling these often-contradictory ideas into a reasonably coherent spec. I still have early versions of the "rainbow document" - so named because I used different colors for different people's comments/questions - to prove it, should anyone else try to claim credit.

Besides shepherding the protocol spec along, I was also responsible for implementing the first client. The very first prototype client was a library (LD_PRELOAD) hack on Solaris, running against a very primitive server not integrated with the rest of the Celerra software. In fact, in the very early stages I had to run against my own FMP server because no other existed yet. In successive versions, all the way to full in-kernel production code, the Solaris/NFS client was generally ready first, closely followed by the server. The Windows/CIFS client always followed more slowly and was usually marked by annoying incompatibilities with everything else, because the person working on it was an annoyingly incompatible kind of guy. I'd be amazed if his EMC non-compete agreement, along with those of others at iBrix who had previously worked on MPFS, wasn't used as a bargaining chip in negotiations between those two companies.

Besides the code evolving, there were many other changes along the way. Mark Kaufman came to EMC Cambridge and took over after Ric Calvillo left. Nick Vasilatos became the manager of the MPFS group in Cambridge, which grew to include several people - I think Jason Glasgow was the first hire besides myself and Boris, but it might have been Peter Lombardo. There was an endless revolving door over at NSG, but Xiaoye Jiang was a constant and welcome presence - a notable fact given that tensions between the Cambridge and Hopkinton parts of the project often ran high. The other constant was that they always had far more people there than we did in Cambridge. This always bothered me, because Uresh tended to make every project-leader decision in favor of his more numerous crew even when it made life much more difficult for us on the client side, and that was often an impediment to progress. At some point Uresh also went patent-crazy, because patents are a key to moving up the technical hierarchy within EMC. I got named as fifth inventor on something that was wholly my idea and that he had bitterly opposed, but hey, he did the tedious work of writing the darn thing just as I had with the FMP spec, so I guess he earned the right to have his name first.

6. Endgame
Eventually we shipped MPFS under the name HighRoad, which pretty much nobody who had actually been involved in its development seemed to like. Oh well, the important thing is that we shipped. Then we waited for the market reaction. We got some good press, Storage Product of the Year kind of stuff, but actual customers stayed away in droves. I guess the sales folks didn't understand what was special about MPFS, and without understanding that, they didn't know how to sell it. Or maybe the commission structure didn't motivate them; it wouldn't be the first or last time EMC internal politics were reflected in commissions inversely proportional to a product's inherent quality or value. In any case, I was ready to move on technically and also very tired of all the political games. You probably figured out that second part from some of my comments above, but if you really want to see just how tired I was, try reading my old Project From Hell page. I spent another year and a half or so working on another project somewhat related to MPFS, and then I left EMC altogether.

Meanwhile, pNFS has come to refer to something else - not unrelated, but not the same either. Nowadays it's commonly thought of as a set of extensions to NFSv4 to do the same things that MPFS did, in a more standards-y kind of way. The pNFS block layout proposal is, as its authors admit, based on FMP. The overall pNFS spec owes a greater, and less acknowledged, debt to FMP. I remember the debates at EMC around many of the issues it addresses, and the invention of some of its terminology, but you'd never guess that from the predominance of Panasas names on the document. Nonetheless, there's more good than bad about the fact that the technology lives on. If somebody considered those ideas good enough to be worth "borrowing", then there must have been some value to them.

7. Looking Back
I learned a lot from the MPFS project. I learned a lot about organizational behavior, that's for sure - about how to recognize and deal with various personal or institutional behaviors in the workplace - but I'm mostly going to focus on the technical stuff. Remember what I said about 2Gb/s Fibre Channel vs. 1Gb/s Ethernet? When I talk about clients going directly to storage now, people look at me funny, because nowadays if you're building a cluster you already have 20Gb/s Infiniband (or something similar), which is more mature than 8Gb/s Fibre Channel. They're right, of course, but they're right now. Back then, I still think the MPFS choice was a sensible one. Infiniband was barely even on the horizon then, and there was no particular reason to believe that Fibre Channel wouldn't keep up. If you wanted scalable I/O, why wouldn't you use an interconnect that's both faster and more tuned to that task?

That was then, this is now, and I've come to believe in more of a server-oriented approach like those used by PVFS or Lustre. For one thing, the interconnect tradeoffs have changed. The fabric within a SiCortex machine, for example, is two to six times as fast as 8Gb/s Fibre Channel (depending on how you count) and far more scalable. Even more importantly, most of the nodes cannot have their own connection to storage. They physically lack any way to make such a connection except by using another node through the fabric as some kind of server. Of course, that's still considered an exotic architecture, but a similar "only one interconnect" principle applies to other clusters as well. People building large clusters already have Infiniband or some other fast interconnect; adding a whole separate set of cards and cables and switches just for storage is likely to be unwelcome for economic, logistical, performance, and reliability reasons. Once you're in that world, what's the difference between using your one interconnect to talk to a disk array and using it to talk to another node? Not much, except that the other node is running software you (the parallel filesystem designer) control. It's much easier to have it implement the access-control model you want, or initiate communication - with its peers, with metadata servers, with clients - when you need it to, etc. These were all issues we struggled with in MPFS, where we were talking to dumb storage. Yes, the Symmetrix or Clariion firmware ("microcode", even though that term hasn't really ever applied in either case) could have been much more than dumb storage, but the people responsible for them showed absolutely zero interest in making that happen. Another of those non-technical lessons I mentioned: don't rely on groups with no economic interest in your project. The server-oriented approach has always been easier to implement, and in the modern world the "feeds and speeds" favor that approach too, even where both approaches are possible.

If I had to do it again, I wouldn't connect clients to storage, at least not at an architectural level. I'd leave the door open to having the same physical node be both client and server, or metadata and data server, but they'd still be separate pieces of software. Of course, I'd make them capable of coexistence, unlike a certain parallel filesystem that will deadlock itself in that case, and only an idiot would ever design such a system with only a single metadata server even as a temporary measure (which turned out not to be so temporary). But I'm getting ahead of myself. I have a whole 'nother article to write about my ideas in that area.