AGPL Again

Blah. There’s another round of lies being spread about AGPL, and the culprit is the same as last time. Let’s deal with them one by one, starting with the claim that using AGPL’ed code will force someone to open-source their own code. This is clearly and explicitly contradicted by the text of the license itself. The first part of this contradiction is the definition of “convey”, as follows (emphasis mine).

To “convey” a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

That last sentence renders all provisions of section 5, on which most claims of “virality” are based, completely irrelevant. If you’re using modified AGPL code to provide services over a network, the actually relevant section is not 5 but 13.

if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your version
supports such interaction) an opportunity to receive the Corresponding
Source of your version

That’s really pretty darn clear. Anybody who tries to interpret it as meaning that AGPL requires them to open-source their own code is just desperate for a way to rationalize using AGPLed code without honoring its license. OK, that takes care of the Big Lie; on to some of the smaller ones.

First, releasing open-source code is not “outsourcing labor”. That would imply some expectation or requirement that labor be provided, but neither AGPL nor any other open-source license has that expectation or requirement. People are completely free to use the software without returning anything at all to the authors. They’re completely free to distribute it to others, likewise. That’s nothing like outsourcing, nor are most people who release open-source software bragging about how open and free they are, so both elements of the supposed hypocrisy Joe talks about are completely made up.

Second, dual licensing is not “bait and switch”. That refers to offering one thing to attract interest, then withdrawing that offer and replacing it with something less appealing. You want real bait and switch? That would be offering a web service that’s initially free but whose provider’s business plan clearly calls for making it non-free (including freemium) at some point. Offer X, then provide Y. Bait, then switch. The term does not apply to something that remains available in its exact original form and under its exact original terms indefinitely. It’s just another term abused for emotive effect as part of a campaign to promote captruism. Gee, it sure would be nice to be the only person allowed to act selfishly while everyone else is handicapped by a requirement to act altruistically, but only someone afraid of competing symmetrically would desire such status. To use your own words, Joe: grow a pair. Capitalism is supposed to be about voluntary exchange. If you want a different license, go negotiate for one. Don’t whine like a baby about how the license on code you got for free isn’t the one you wanted.

I’m not a license zealot, BTW. I don’t think everything should be under any form of GPL. If people want to use BSD or Apache licenses instead, or even keep their code proprietary, that’s fine with me. As authors, they have that right. They also have the right, when they offer their work as a gift, to ensure that the “service loophole” is not used to implement data lock-in or other kinds of bait-and-switch contrary to the original spirit of the gift. Users have the right to use some other software instead, or write their own, but not to force their own moral compromises on any author of code they want to use. If I don’t want my code used to screw people, or to kill them, or to invade their privacy, then the license is my only way to have any influence on that. Licenses and contracts are a core part of a free-market system, and by their nature can contain whatever arbitrary provisions their authors wish (so long as they’re legal and legally enforceable). Denying others the right to define those provisions is tantamount to promoting a fiat system in which exchange can only occur under certain mandated conditions. I very much doubt whether the authors of the code Joe uses, even the Apache/BSD/whatever code, would ever support such a system.

P.S. Would someone who lacks the courage to allow comments on their own site be a hypocrite for coming here and leaving comments on mine?

One Born Every Minute

Do you use Twitter? Do you make sure all your “tweets” are backed up? No? What’s wrong with you? After all, knowing where you went for lunch last Tuesday might be invaluable to future biographers as they document your rise to glory. Not to worry. Now you can use Backupiphish to make sure all of your important online data gets backed up. Just give us all of your social-networking site passwords – you know, the same ones you probably use on your banking sites as well, even though you’ve been told not to do that – and we’ll make sure all of your stuff is nice and safe in our very own Swiss bank accounts. Um, I mean, your data will be on our secure servers in a secure facility protected by the very best guards we can hire for minimum wage, plus our patented 1025-bit encryption. And the best part? It’s completely free! You don’t have to pay us a cent for our useless service. Just your passwords. Don’t forget to send your passwords. You won’t mind if the very first thing we do with your Twitter account info is impersonate you to advertise our service, right? That shows you can trust us. Oh, and using four-letter words makes us seem all edgy and stuff, so make sure you follow #itsour****now on Twitter.

This post has nothing whatsoever to do with Backupify. No sir, not at all. Pure coincidence.

In Memory Data Grids

In my professional life as well as here (is there really any difference?) one of the things I do a lot is “evangelize” the idea that applications need many different kinds of storage. You shouldn’t shoehorn everything you have into an RDBMS. You shouldn’t shoehorn everything you have into a filesystem. Ditto for key/value stores, column or document stores, graph DBs, etc. When I’m talking about different kinds of storage that make different CAP/performance/durability tradeoffs, somebody often mentions In Memory Data Grids (henceforth IMDGs). Occasionally references are made to tuple spaces, or to Stonebraker’s or Gray’s writings about data fitting in memory, but the message always seems to be the same: everything can run at RAM speed and so storage doesn’t need to be part of the operational equation. To that I say: bunk. I’ve set up a 6TB shared data store across a thousand nodes, bigger than many IMDG advocates have ever seen or will see for at least a few more years, but 6TB is still nothing in storage terms. It was used purely as scratch space, as a way to move intermediate results between stages of a geophysical workflow. It was a form of IPC, not storage; the actual datasets were orders of magnitude larger, and lived on a whole different storage system.

But wait, the IMDG advocates say, we can spill to disk so capacity’s not a limitation. Once you have an IMDG that spills to disk, using memory as cache, you have effectively the same thing as a parallel filesystem, only without the durability characteristics. Without a credible backup story, or ILM story, or anything else that has grown up around filesystems. How the heck is that a win? There are sites that generate 25TB of log info per day. Loading it into memory, even with spill-to-disk, is barely feasible and certainly not cost-effective. There are a very few applications that need random access to that much data; the people running those applications are the ones who keep hyper-expensive big SMP (like SGI’s UltraViolet) alive, and a high percentage of them work at a certain government agency. For the rest of us, the typical processing model for big data is sequential, not random. RAM is not so much a random-access cache as a buffer that constantly fills at one end and empties at the other. That’s why the big-data folks are so enchanted with Hadoop, which is really just a larger-scale version of what your video player does. VLC doesn’t load your entire video into memory. It probably can’t, unless it’s a very small video or you have a very large memory, and you don’t need random access anyway. What it does instead is buffer data into memory, with one thread keeping the buffer full while another empties it for playback. The point is that memory is used for processing, not storage. The storage for that data, be it 4.7GB of video or 25TB of logs, is still likely to be disks and filesystems.
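
To make that fill-at-one-end, empty-at-the-other model concrete, here’s a minimal sketch in Python – purely illustrative, not taken from Hadoop or VLC or anything else real, and the chunk size, the /etc/services test file, and all the names are just for the demo. One thread reads sequentially from storage into a bounded buffer while another drains it, so memory only ever holds the window currently being processed no matter how big the file is.

```python
import threading
import queue

CHUNK = 1 << 20                 # 1 MB reads
buf = queue.Queue(maxsize=8)    # bounded: ~8 MB of RAM regardless of file size

def producer(path):
    """Keep the buffer full by reading sequentially from storage."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            buf.put(chunk)          # blocks when the consumer falls behind
            if not chunk:           # empty bytes object marks end of file
                break

def consumer(process):
    """Drain the buffer; 'process' stands in for playback, map tasks, whatever."""
    while True:
        chunk = buf.get()
        if not chunk:
            break
        process(chunk)

if __name__ == "__main__":
    total = 0
    def count(chunk):
        global total
        total += len(chunk)
    t = threading.Thread(target=producer, args=("/etc/services",))  # any handy file
    t.start()
    consumer(count)
    t.join()
    print("streamed", total, "bytes through a bounded 8-chunk buffer")
```

Swap the file for a day’s worth of logs and count() for real processing and you have the shape of most big-data pipelines; nothing in it needs the whole dataset in RAM.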

I’m not saying that IMDGs aren’t valuable. They can be a very valuable part of an application’s computation or communication model. When it comes to that same application’s storage model, though, IMDGs are irrelevant and shouldn’t be presented as alternatives to various kinds of storage. (Aside to the IMDG weenie who derided cloud databases and key/value stores as “hacks”: let’s talk about implementing persistence by monkey-patching the JVM you’re running in before we start talking about what’s a hack, OK?) Maybe when we make the next quantum leap in memory technology, so that individual machines can have 1TB each of non-volatile memory without breaking the bank, then IMDGs will be able to displace real storage in some cases. Or maybe not, since data needs will surely have grown too by then and there’s still no IMDG backup/ILM story worth telling. Maybe it’s better to continue treating memory as memory and storage as storage – two different things, each necessary and each involving its own unique challenges.

Tweet Dump

I’ve settled into a pattern of using Twitter for short technical stuff, Facebook for short non-technical stuff, and this site for longer (almost always technical) stuff. It works, but it means that people who don’t follow me in all of those might miss some stuff. For example, family members who only follow me here have probably noticed a dearth of pictures lately because they’re all on Facebook. Similarly, technical folks might notice that my articles have gotten longer, because the short blips are going on Twitter. For them, here are some of my tweets that I consider worth repeating/saving.

You can build a cloud without virtualization, but why? Users’ standard unit of provisioning is smaller than a modern physical machine.

There ought to be a special place in hell for people who call malloc several times in the main code path for a single I/O.

Should “cloud computing” be c12g or c13g? Does the space count?

“It works fine on my machine” will soon be displaced by “it works fine in my cloud” as a developer’s favorite excuse. #quoteme

What’s Wrong With Lustre

My recent post about real parallel filesystems generated an unexpected traffic spike, largely because Wes Felter posted it on Hacker News. Thanks, Wes. In the discussion there, I had occasion to say a few things about Lustre. Those who read this site or who have worked with me mostly know that I’m not really a big fan of Lustre despite having worked on it a lot for two-plus years at SiCortex, and that I’d rather work with any other parallel filesystem. Just so nobody accuses me of whispering behind anyone’s back, I’m going to lay out the reasons from my perspective as a developer; users and administrators and business folks might have others, but these are mine.

  • Single metadata server. Some might argue that Lustre doesn’t need distributed metadata because the single MDS performs so well, but then why has even Sun repeatedly attempted – and as of today still not succeeded – to make the metadata role more distributed? Most of the other problems I’ll mention wouldn’t have mattered nearly as much if they’d gotten this one right. It’s just an outdated architecture.
  • Thread-pool execution model. I was telling people how wrong this is before Lustre even existed, and I was hardly alone. It might have worked OK on some developer’s desktop test system, but in the real world and especially on architectures with high context-switch costs it led to that one MDS thrashing itself to death under even moderate load.
  • Poor binding of messages to threads. Naive thread-pool implementations are bad enough, but what’s worse is letting a bunch of blocked requests eat up all the threads so that the message which might unblock them all can’t find a thread to run on.
  • No admission control. There was some flow control at the LND level, but that was just to deal with link-level resource issues. There was nothing at a global level to prevent a thousand clients from sending one request apiece to an MDS that only had a hundred threads for processing (see the toy sketch after this list).
  • Relying too much on timeouts. I already wrote about this in Evil Timeouts, after one too many times when the previous two items conspired to create a deadlock that was “resolved” by a timeout.
  • Poor fault isolation. When a request did time out, the response would affect more than just that one request – often blowing away a whole connection if not rendering the entire system inoperable.
  • Lousy logging. I’ve lost too much time during my career dealing with “unique” logging systems that could only provide two kinds of information – too little or too much. This was just one more example: you got nothing useful at all until you opened the spigot, at which point the logging overhead perturbed the whole operating milieu and the one piece of useful information was buried under a thousand messages that some developer found useful once during unit testing and that stayed in the code ever after. About the only use I ever had for Lustre log output was to search for some of the strings in their bug database.
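
To show how the thread-pool, message-binding, and admission-control items interact, here’s a toy sketch in Python. It is emphatically not Lustre code and every name in it is made up; it just demonstrates how a fixed pool of workers can be eaten alive by requests blocked on a resource, and how a crude form of global admission control – rejecting work up front and keeping a worker in reserve – lets the message that would unblock everyone actually run.

```python
import threading
import time

POOL_SIZE = 4           # worker threads available on our toy "server"
RESERVED = 1            # workers we refuse to hand out to blockable requests

lock = threading.Lock()                        # a resource some remote client holds
admission = threading.BoundedSemaphore(POOL_SIZE - RESERVED)
pool = threading.Semaphore(POOL_SIZE)          # models the finite thread pool

def blockable_request(i):
    """A request that needs 'lock' and sits on a worker thread while it waits."""
    if not admission.acquire(blocking=False):
        print(f"request {i}: rejected up front (admission control)")
        return
    with pool:                                  # occupy a worker
        with lock:                              # may block until the unlock message runs
            print(f"request {i}: serviced")
    admission.release()

def unlock_message():
    """The message that frees the resource; it must be able to find a worker."""
    with pool:
        lock.release()
        print("unlock message: processed, blocked requests can now drain")

if __name__ == "__main__":
    lock.acquire()                              # simulate the resource being held elsewhere
    reqs = [threading.Thread(target=blockable_request, args=(i,))
            for i in range(POOL_SIZE + 2)]
    for r in reqs:
        r.start()
    time.sleep(0.5)
    # With RESERVED = 0, blocked requests can occupy every worker and this
    # message never gets a thread: the thread-exhaustion deadlock described above.
    threading.Thread(target=unlock_message).start()
    for r in reqs:
        r.join()
```

Set RESERVED to 0 and the demo hangs, which is essentially the deadlock described above, minus the timeout Lustre used to paper over it.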

It’s easy to see that these are all related. If there had been distributed metadata or global admission control, the thread-exhaustion deadlock wouldn’t have been a problem and the timeouts wouldn’t have fired. As it was, though, the Lustre developers managed to create a perfect storm of implementation artifacts that made the result highly unstable. To avoid context thrashing you’d want to configure relatively few threads. To avoid deadlocks you’d want to configure relatively many. The problem is that the two safe ranges didn’t overlap. Any system would be vulnerable to one problem or the other, and often both; then a timeout that should have caused a simple request retry instead blew up the whole system, and the broken logging made it impossible to figure out anything useful about what had happened. The patterns were always the same, but never actionable short of rewriting half of the codebase. At that point, what sane developer wouldn’t be investigating alternatives?

I’ve often told people that the Lustre architecture is fine. Distributed metadata has always been part of that architecture, and most of the other things I mention above are implementation-level or at most design-level phenomena. Lustre does perform better than GlusterFS or most of its other competitors on a per-node basis, though shortly before I left SiCortex I was able to get better numbers with PVFS2 and in any case per-node performance is the wrong figure of merit for anything parallel. Lustre does have a good feature set, which includes flexible striping (even better with OST pools) and HSM integration. There are some talented people working on it, who might yet succeed in making the positives outweigh the negatives. Having developed code based on it and supported that code in the field throughout the 1.6 series and into 1.8, though, I’m not about to recommend it to anyone without some serious proof that things have improved.

Cloud Appropriate

In my cloud-storage slides for the cloud forum last week – available at Red Hat but you’ll probably have to register until I get clearance to post a copy here – one of the points I made is that you have many options for storage in the cloud but whatever options you choose should have certain “cloud appropriate” characteristics. Here, I’ll dive more into what I think that term means.

First, though, I have to talk a little bit about cloud-service deployment models. Just about any cloud service can be deployed privately by the users themselves, taking advantage of the elasticity and isolation already provided at the instance and network level. For example, I run GlusterFS this way on Amazon or Rackspace pretty regularly and it works fine. On the other hand, what if the provider offered GlusterFS as a permanent shared resource, like S3 but at a filesystem level or like CloudFiles but with a POSIX interface? The servers could be doing native instead of virtualized I/O, on specially provisioned and optimized hardware. This would much better capture those economies of scale and expertise that James mentions, and also take advantage of his “non-correlated peaks” to bring the cloud advantage of more efficient provisioning to storage as well as computation. That’s the deployment model I have in mind for this discussion.

Availability and Quorum

In a comment to my availability and partition tolerance post, Sergio Bossa asked a very interesting question that I thought was worth answering in a full post.

is it correct to say that a CA system can simply be turned into a CP one by just forcing “partitioned nodes” down, so that:
1) Partitioned nodes are completely non-functional.
2) Live nodes are still functional because locks held by the partitioned ones have been broken.

I’ve been thinking about this all afternoon, and I still haven’t decided on a single answer. What I’d say is that in a mathematical sense the answer is no, but in practice the answer is likely to be yes. In many distributed systems where consistency has been considered before availability and partition tolerance, locking or leases are used to ensure that consistency. When a partition occurs, therefore, the “natural” outcome is that requests can get stuck waiting for locks or leases held on the other side of the partition. According to the Brewer/Lynch terminology, this would be CA behavior. In those particular kinds of systems, enforcing quorum can change the behavior to CP – requests no longer get blocked during the partition; instead, the non-quorum nodes themselves are forced down.

What makes my answer less than straightforward is this: what if it’s not just locks or leases that are stranded on the non-quorum side of a partition, but actual data? In particular, in an MVCC-based system, it’s quite possible that the current version of some datum might only exist on the non-quorum side. Forcing that node down would not help if requests still had to wait for that datum, and discarding it (i.e. discarding the transaction of which it was a part) would violate consistency. For that type of system, then, it would not be true that enforcing quorum would turn a CA system into a CP one. Maybe that’s why I can’t immediately think of an MVCC system that applies quorum across a WAN; most only use MVCC inside a local cluster where partitions are considered unlikely and then switch to a different model for replication.
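
As a toy illustration of what “enforcing quorum” means here – not modeled on any particular system, and with hypothetical names like has_quorum invented just for the sketch – the rule is simply that nodes which can’t see a strict majority stop serving, which keeps the majority side both consistent and live at the cost of availability on the minority side:

```python
def has_quorum(cluster, reachable):
    """CP-style rule: only the side that can see a strict majority keeps serving."""
    return len(reachable) > len(cluster) // 2

def handle_request(cluster, reachable, op):
    if not has_quorum(cluster, reachable):
        # Non-quorum nodes force themselves down (or just return errors) rather
        # than serve possibly-stale data or hold locks the majority can't break.
        raise RuntimeError("no quorum: refusing to serve")
    return op()

# A 5-node cluster split 3/2 by a partition:
cluster = {"a", "b", "c", "d", "e"}
print(handle_request(cluster, {"a", "b", "c"}, lambda: "ok"))   # majority side serves
try:
    handle_request(cluster, {"d", "e"}, lambda: "ok")           # minority side refuses
except RuntimeError as err:
    print(err)
```

The MVCC caveat above is exactly where this breaks down: forcing the minority down only restores both consistency and progress if everything the majority needs – locks, leases, and the latest versions of the data – is reachable on the quorum side.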

New Tools

Ever since my rant about the sad fate of KScope, I’ve been using vim and cscope and I’m happy to report that the combination is working very well for me. The tutorial is adequate, but I seem to recall a fair amount of web-searching and plain old experimentation before I really felt comfortable with it. The nicest thing is that it works just as well when I’m logged in from home as it does when I’m sitting at the machine where the files reside.

In a semi-related note, I finally got tired of the constant “not quite transparent” updates to Firefox. At least once a week, Firefox updates itself, and can supposedly do it without a restart, but what really happens is that it starts getting more and more flaky until I get so sick of it that I restart anyway. For example, one update caused all dialog boxes to show up in teeny-tiny little windows which were not much more than a truncated title bar and about ten vertical pixels’ worth of content. Other times things just start to hang. I’ve been burned by this at least a dozen times, and finally felt motivated to do something about it. After experimenting with Chrome, I’ve switched to Opera. Built-in bookmark synchronization is one thing that’s nice about it. Also, Opera does lots of things besides browsing, so for example I’m also using it for IRC now so I have one less window lying around. I’d consider using it for email too, thus getting rid of Thunderbird as well as Firefox, but the email client seems to lack a lot of features including LDAP support so I probably won’t. There are also some things I’m used to in Firefox that I might have to find Opera-friendly equivalents for. Chief among these:

FoxyProxy
I use this one a lot, so I might have to start running TinyProxy or similar (preferably something with an easy interface for switching between one rule set and another).
AdBlock Plus
I haven’t noticed any problems with ads or popups, so maybe I won’t need it, but if I do I’ll probably look into auto-converting EasyList into Privoxy scripts or something.
Tree Style Tab
This has become one of my favorites. At least I can put my Opera tab bar on the left, but I had gotten used to TST’s hierarchical nature. Opening a folder full of bookmarks as a tree which can be collapsed or all closed with a single click is very intuitive and useful. Ditto for clicking on all of the interesting links as I’m reading an article, or for opening up a bunch of tabs while surfing eBay or CafePress. I don’t know if there’s much I can do about this one, except decide whether I miss the functionality enough to put up with the Firefox update SNAFU some more.

Necessity is the mother of change, and change is good sometimes. If nothing else, using different tools gives me a bit of that “learning cool new stuff” experience I remember from my early days with computers, and which I don’t get enough of nowadays. It almost doesn’t matter whether I continue using the new tools or switch back to the old ones.

Real Parallel Filesystems

One of the dangers of making something easier to do is that a lot of less skilled people will start doing it. One familiar example of this is writing multi-threaded code. All of a sudden everyone’s doing it, the vast majority without any understanding of the principles behind writing good multi-threaded code, so an awful lot of them make a complete hash of it. The same is beginning to be true of distributed code. The example that has been on my mind lately, though, is filesystems. FUSE has made it a lot easier to write filesystems, so a lot more people are doing it. I generally consider that a good thing, and (unlike many of my kernel-filesystem-developer colleagues) I’m not going to look down my nose at FUSE filesystems just because they’re FUSE. After all, I just finished writing CassFS in my spare time. On the one hand, it illustrates just how easy it can be to slap a basic filesystem interface on top of something else. It took me about twenty hours’ worth of spare time, and I’m not Zed Shaw so I’ll give credit to FUSE instead of pretending it’s proof of my own awesomeness. On the other hand, CassFS is also an example of how badly a FUSE filesystem can suck. I won’t go into details here, since I already did, but my point is that CassFS is no worse than a bunch of other FUSE filesystems out there, and some of those projects’ authors still act like their little brain-fart is equal to far more mature efforts. That does bug me. It’s great that technologies like FUSE allow people to do something that would previously have been out of reach for them. It’s not so great that the people who’ve been working on the truly hard problems in this area for ten years or more, and who might expect credit or even profit for those efforts, have to “share the stage” with people who just got basic read/write of a single file by a single process working.
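
To show just how low FUSE has set the bar, here’s a minimal read-only filesystem backed by a plain Python dict. It uses the fusepy bindings purely because they make for a short sketch – it’s not how CassFS is built, and everything in it (DictFS, the canned file contents) is made up for illustration. Note how conspicuously it skips all the hard parts: no writes, no permissions, no concurrency, no durability.

```python
import errno
import stat
import sys

from fuse import FUSE, FuseOSError, Operations   # pip install fusepy

class DictFS(Operations):
    """Read-only filesystem whose entire contents live in one Python dict."""

    def __init__(self, files):
        self.files = files                        # {"/name": b"contents"}

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path in self.files:
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.files[path]))
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", ".."] + [name[1:] for name in self.files]

    def read(self, path, size, offset, fh):
        return self.files[path][offset:offset + size]

if __name__ == "__main__":
    # Usage: python dictfs.py /some/empty/mountpoint
    FUSE(DictFS({"/hello": b"hello, world\n"}), sys.argv[1], foreground=True)
```

Mount it on an empty directory and you can ls and cat its one file, and that’s roughly where some of the projects I’m grumbling about below stop.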

That brings me to the real topic of this article. There are a lot of parallel/distributed filesystems and other data stores out there nowadays. Some of their authors are making pretty grandiose claims because their pet project does exactly one thing well, and when they tested that one thing against better-known alternatives it didn’t do too badly. Well, sorry, but that doesn’t cut it. It’s like “racing” the guy in the car next to you who doesn’t even know you’re there, because he’s busy doing what he should be doing: paying attention to conditions up ahead. If you want your p/d filesystem to be taken seriously, it has to meet at least the following criteria.

  1. Support practically all of the standard filesystem entry points with reasonable behavior – not just read/write but link/symlink operations, chown/chmod, rename, stat returning sane info, etc. (a quick smoke test for this follows the list).
  2. Have distributed metadata, not a single metadata-server SPOF/bottleneck.
  3. Provide intra-file striping for high performance access to a single file from one or many nodes/processes (the latter precluding whole-file locks) and for even data distribution across servers.
  4. Support RDMA-style as well as socket-style interconnects, also for high performance.
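
For criterion 1, here’s the kind of quick smoke test I have in mind – a hypothetical Python snippet, nowhere near a conformance suite, with made-up names like smoke_test and probe – that just pokes at the namespace operations a surprising number of “filesystems” get wrong. Point it at a mountpoint of whatever you’re evaluating.

```python
import os
import stat

def smoke_test(mountpoint):
    """Poke at the namespace entry points from criterion 1 on an already-mounted fs."""
    path = os.path.join(mountpoint, "probe")
    with open(path, "w") as f:                      # create + write
        f.write("hello\n")
    os.link(path, path + ".hard")                   # hard link
    os.symlink("probe", path + ".sym")              # symlink (relative target)
    os.chmod(path, 0o640)                           # chmod
    # os.chown(path, uid, gid)                      # chown usually needs privileges; try it too
    os.rename(path + ".hard", path + ".renamed")    # rename
    st = os.stat(path)                              # stat should return sane metadata
    assert stat.S_ISREG(st.st_mode) and st.st_size == 6
    assert os.readlink(path + ".sym") == "probe"
    for name in (path, path + ".renamed", path + ".sym"):
        os.unlink(name)                             # and unlink should actually remove things
    print("basic namespace operations look OK on", mountpoint)

# smoke_test("/mnt/whatever")   # point it at the filesystem under test
```

If any of those calls fails or lies in its stat results, we’re done talking about criteria 2 through 4.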

I’m aware of only three open-source alternatives that meet this standard, and dozens that don’t. Lustre failed criterion 2 when I worked on it, but claims to have gotten past that and I’ll give them the benefit of the doubt. PVFS2 also passes; some might quibble about whether their explicit rejection of certain obscure POSIX requirements allows them to meet criterion 1, but I think they’re close enough. GlusterFS also passes, though there’s some room for improvement on criterion 4. Of the rest, I suspect NFS4/pNFS advocates are the most likely to show up and object, but I don’t think NFS4/pNFS are even in the right space. They’re protocols, not implementations, and the existing open-source implementations don’t even address how to use the protocol features that were put in for this sort of thing. As far as I know, most if not all multi-server NFS4/pNFS implementations have used some other parallel filesystem on the back end to handle that, and it’s those other parallel filesystems (PVFS2 in one case but more often proprietary) that I’d consider.

If what you want is a real, mature parallel filesystem to deploy today, these are the ones you should look at. In another year or two, maybe some other very exciting and promising projects will join the list. Ceph is my favorite candidate, along with POHMELFS and HAMMER. Such things are great to play with, but I don’t think I’ll be putting my home directory on one. Come to think of it, I never got around to putting my home directory on any of the Big Three either. Maybe once I’m done with my current subproject I’ll take a big bite of my own dogfood.

Checking In

I know I’ve been kind of absent lately. Part of it was traveling to Michigan to see my mother, brother, and cousin. Good times. We flew this time, and I was worried that it would be awful. Last time the three of us flew through DTW, Northworst took six hours there and four hours back for what should be a two-hour flight. That’s a lot of time in a plane on the tarmac trying to keep a three-year-old entertained with the few things you can carry on. The last time I went through DTW myself, I found that they’d scheduled a dozen flights at exactly 6am on a Sunday morning, leading to a huge security-theatre backup and to me missing my flight. I ended up getting routed through a very busy O’Hare – which I’d just left – before finally getting back to Boston. Considering all that, and that there was an “incident” there not too long ago, I thought it would be crazy, but in fact it all went smoothly.

The other reason I’ve been quiet here is that I’ve been busy doing actual work. I’ve been writing lots of code for my way-cool GlusterFS translator, for one. I’ve reached the point where I can run tests and see how well it works, which I’m pleased to say is very well. Now I just have to slog through all of the entry points I haven’t bothered with yet, figure out the GlusterFS object-lifecycle rules so I can make sure there are no memory leaks, make sure I return consistent error codes, and then run some real functional tests like fsx. More about that later, I’m sure.

The other thing I’ve been busy with is techno-evangelism. I’ve already mentioned the podcast, plus I gave a half-hour presentation about cloud storage at Red Hat’s Cloud Computing Forum yesterday. I’ll post a link to the archive when I get a public one myself (all I have is a private one that I’m not sure is usable by others); meanwhile you’ll have to read what The Register had to say about my talk and the others.

OK, now back to that code.