They Killed KScope!

For quite a while, my favorite source editor/browser has been KScope. It combines one of the better source editors I’ve seen (Kate) with a very robust cross-referencing tool (Cscope). Unfortunately, when KDE went from version 3 to version 4, they made changes that broke KScope. What’s a little odd is that Kate still works. You’d think that if they’d fixed Kate and maintained its interface, KScope would be unaffected. Apparently, though, whoever made the KDE4 changes to Kate decided to extend the breakage instead of containing it. Nice call, whoever you are. Thanks.

This wouldn’t be worth a rant if I could get KScope running again in reasonable time, but that’s not the case. The KScope developers’ response to the breakage was to push out one last 1.9 release, containing only the features that still worked and stripping out everything that made the project useful in the first place, and then stop development. That doesn’t seem like the best response to me, but perhaps they no longer had the time/inclination to deal with this level of breakage. I haven’t yet figured out how to put together enough of a KDE3 environment in Fedora 12 to build KScope 1.6 myself, though this is supposedly possible and I might yet find the magic formula. The lack-of-resources argument doesn’t work quite as well for Fedora as for KScope, but they might also lack volunteers in this particular area. I’m not going to blame them for the way KDE stabbed everyone in the back.

I’ve historically been a fan of KDE. I’ve been using KDE versions of natively-GNOME distributions for quite a while now, because I think it’s better both technologically and at a user-experience level. They totally botched the KDE4 transition, though. As far as I can tell, furthermore, there were plenty of people predicting this disaster. They could have listened, but they did what they wanted to anyway out of sheer arrogance and insensitivity to either users’ or external developers’ needs. I’m simply not a KDE fan any more. If I want to get away from GNOME, I can use XFCE. I might still use some specific KDE programs, but I don’t trust them to provide a complete environment any more.

So, what are my options? I used Source Navigator quite happily for many years, but it now seems crufty and unpleasant. I’ve been using CodeLite a bit because it has tolerable Cscope integration, so I can navigate large bodies of code that I’m not intimately familiar with. I’m not a big fan of the Studio/Eclipse workspace/project/virtual-folder model where the IDE tries to write your (warthog-butt-ugly) makefiles for you, though, and the editor’s not the best either. Using CodeLite from home is also barely tolerable, whether it’s the program or the files that are remote. Somebody seems unaware that redraws and/or filesystem operations might be expensive enough to expose lazy coding. Anjuta seems like a roughly equivalent offering, and might be an improvement in some areas. I don’t want to use Eclipse CDT because it’s bloated or KDevelop because it’s KDE, but I might be forced to use one or the other if Anjuta doesn’t pan out. Sigh. I suppose I could fix KScope if I didn’t mind giving up all of my remaining free time for the next long while, but I wish such choices weren’t made necessary by others’ bad behavior.

I’m a Star!

OK, maybe just a brown dwarf. In any case, on Friday I joined an episode of the Infosmack Podcast to talk about cloud storage, NOSQL, and a bunch of other stuff. If anybody wants to know what Jeff-with-a-cold sounds like, here’s your chance.

Site Trends

Just for fun, I decided to spend part of my lunch break generating a graph of my top twenty posts. The first graph I did was total hits vs. date posted. I know this site has become more popular lately, and I wasn’t too surprised to see that the increase in popularity outweighs the effect of older posts having had longer to rack up the numbers. What I thought might be more interesting was a graph of hits per day instead of total hits. Since #20 was a too-recent anomaly, I only graphed the top nineteen. Here’s the result.

[Graph: hits per day vs. date posted]

Not bad. Of course, for this graph the general effect of age is the opposite of what it would be for total hits. I’d like to graph “hits in first month” but that would require a lot more log processing than I can do on my lunch break. The thing I find most interesting is what this tells me about my evolving readership. Eight out of the ten most recent posts to make the list are technical, as are six out of the top ten by total hits and seven out of the top ten by hits per day. While I’ve gone through periods of less technical blogging, and some of the results do make the top twenty, the technical stuff is clearly what people come here for. Most of my family and some of my friends have learned to look for other stuff (including family pictures) on Facebook, and I’ve deliberately cut back on the political stuff in general (my recent blip about the election notwithstanding). As I predicted back in my “Unemployed!” post – #15 total and #8 per day – this blog has become and will probably continue to be more technical than it was during the mid-oughties.

Yesterday’s Election

A short break from the purely technical stuff…

The voters of Massachusetts sent a message yesterday. That message was, “Waaaahhhh!” The people who created the United States Senate as a place where statesmen could soberly reflect on issues of lasting importance, somewhat insulated from the fleeting passions whipped up by demagogues, are all turning over in their graves.

MaxiScale, Round Two

I had a nice chat with some folks from MaxiScale this evening. Yes, I know, some people will think I’m forever tarnished by having actually talked to somebody who makes money, but I figure if you write something about someone then it’s only fair that you give them a chance to respond. All in all, I think it’s a good thing that they’re willing to go mano a mano with critics, and they were far more gracious toward me than I had previously been toward them, so they deserve some credit. Before I get into the technical content, though, I’d like to clear up a few things.

  • Some people have characterized my previous article about MaxiScale as a review of their product. It was not. It was, mostly, a review of a particular white paper they had published. At least one person also accused me of fighting FUD with more FUD. Guilty as charged. I’m sure my readers know that when my ire is aroused I can be pretty acerbic. The white paper – which BTW is apparently undergoing some revision – annoyed me. I gave it a harsh review, and some of that harshness extended to playing “turn the tables” a bit. If I wanted to review the product, I’d want to see it in operation first.
  • My conversation with MaxiScale was predicated on an explicit agreement that nothing confidential would be discussed. Due to the nature of my own work I cannot be privy to their secrets, nor am I authorized to share my employer’s. Everything I have to say here should be considered public information.

Much of the conversation was about “peer sets” and placement strategies. It turns out that MaxiScale’s approach is based on some of the same techniques I’ve talked about here. Each file is hashed to identify a peer set which will handle its metadata, but then the members of that peer set might determine that the data should be placed elsewhere. The term “consistent hashing” wasn’t actually used, but I’d have to guess that what they have is either that or a moral equivalent. Similarly, I’m sure there’s some “special sauce” in how they determine which peer set should receive the data, and I’m content to leave it that way. What’s important is the general approach, and their hash-based method is IMO very consistent with what I wrote yesterday about good design for distributed systems.
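
The term wasn’t used and I didn’t press for details, so here’s only a minimal sketch of the general hash-to-peer-set idea as I understand it – the names, table, and parameters below are mine, not theirs:

    # A minimal sketch of hash-based peer-set selection, assuming a fixed table of
    # small replication groups; this illustrates the general idea, not MaxiScale's code.
    import hashlib

    PEER_SETS = [                       # each peer set is a small group of servers
        ("serverA", "serverB", "serverC"),
        ("serverD", "serverE", "serverF"),
        ("serverG", "serverH", "serverI"),
    ]

    def metadata_peer_set(path):
        """Hash the file's path to pick the peer set that owns its metadata."""
        digest = hashlib.sha1(path.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(PEER_SETS)
        return PEER_SETS[index]

    # The chosen peer set owns the metadata; its members may then decide that the
    # data itself belongs on some other peer set (the "special sauce" part).
    print(metadata_peer_set("/home/jeff/notes.txt"))

A real system would also need the mapping itself to change gracefully as servers come and go – which is exactly the problem consistent hashing (or a moral equivalent) is meant to solve.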

On another issue, I’m only half convinced. Apparently they have their own protocol which does replication via multicast. This was a possibility I hadn’t considered, even though I’ve seen other parallel filesystems that do it. I’m not really a big fan of multicast. It might or might not actually involve less data on the wire than client-driven replication, depending on implementation and network topology. It could also be argued that if handling storage failures and retries at the node level is a good idea (instead of relying on RAID) then the exact same principle should be applied to network failures and retries as well (instead of relying on multicast). Using multicast isn’t entirely a bad choice, but it’s not a clear winner vs. server-driven replication on a back-end network either.
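
For what it’s worth, here’s the back-of-the-envelope comparison I have in mind – my own numbers and simplifying assumptions (R replicas, a fixed payload, ideal switches that duplicate multicast frames for free), not a model of MaxiScale’s actual protocol:

    # Rough per-write traffic under three replication strategies; retries after
    # packet loss, and switches that don't honor multicast end to end, will erode
    # the multicast numbers shown here.

    def traffic(data_bytes, replicas, strategy):
        """Return (bytes over the client's uplink, bytes on the back-end network)."""
        if strategy == "client-driven":      # client writes to each replica itself
            return data_bytes * replicas, 0
        if strategy == "server-driven":      # client writes once; a primary fans out
            return data_bytes, data_bytes * (replicas - 1)
        if strategy == "multicast":          # one transmission, duplicated in the network
            return data_bytes, 0
        raise ValueError(strategy)

    for s in ("client-driven", "server-driven", "multicast"):
        print(s, traffic(1 << 20, 3, s))

Under those assumptions multicast only beats server-driven replication on back-end bandwidth, and it pays for that by depending on the network to handle loss and duplication – which is exactly the trade-off I’m ambivalent about.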

This leads straight into another thorny issue: platform compatibility. Having clients run “a significant amount of file system software” is not bad only “because it must be knowledgeable about the lower-level workings of the file system,” as in the white paper’s criticism of SAN filesystems. Using a proprietary protocol means having to implement that protocol yourself on every platform you intend to support. It also means being dependent on each platform’s support for not-quite-universal features like multicast. When I asked about platform support, the answer was that “major manufacturers” were supported. There was, notably, no mention of Linux in general or of any particular Linux distribution. Since I was representing myself, not my employer, I didn’t press further. According to a follow-up email, there is a Linux client which is known to run on RHEL, SLES, Ubuntu/Debian, and Gentoo.

The last significant technical issue we discussed was striping. They don’t do it. The reason given was that they’re focused on small-file workloads – mention was made of retrieving files under 1MB with a single disk operation – and that striping could be a waste or even a negative in such cases. That’s absolutely true. I’ve worked with several parallel filesystems. They tend to be good at delivering lots of MB/s, but they’re often poor for IOPS and downright lousy for metadata ops/second. This is not a strictly necessary consequence of striping, but it often relates to the complexity of having files created in multiple places that then have different states (as opposed to replication, where the states are identical). Just think for a while about how stat(2) should return a correct value for st_size when a file is striped across several servers, and you’ll see what I mean. For the systems I design, striping is pretty much essential, but they’re hitting a different design point and it’s fair to say that for them it might be a mistake.
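
To make the stat(2) point concrete, here’s a toy illustration – mine, not any particular filesystem’s – of why st_size gets awkward once a file is striped: the true size depends on the highest-offset stripe unit, so in the worst case every server holding a stripe has to be consulted (or the metadata has to be updated on every extending write):

    STRIPE_SIZE = 64 * 1024          # bytes per stripe unit (arbitrary choice)

    def file_size(local_extents):
        """local_extents maps server -> list of (stripe_index, bytes_in_that_unit)."""
        end = 0
        for server, extents in local_extents.items():   # one round trip per server
            for stripe_index, nbytes in extents:
                end = max(end, stripe_index * STRIPE_SIZE + nbytes)
        return end

    # Stripe units 0..4 round-robin across four servers; the last unit is partial.
    extents = {
        "s0": [(0, 65536), (4, 1000)],
        "s1": [(1, 65536)],
        "s2": [(2, 65536)],
        "s3": [(3, 65536)],
    }
    print(file_size(extents))        # 263144, known only after hearing from everyone

With plain replication the copies are identical, so any one replica can answer by itself.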

Overall, I was pretty impressed. They didn’t do everything the way I would have, and they didn’t give all the answers I would have liked, but it seems like they made reasonable choices and – just as importantly – are willing to explain those choices even to folks like me. On the particular issue of data distribution, their hashed peer-set approach seems to be on the right track. It’s a hard problem, at the core of scalable storage-system design, and their design seems to avoid many of the SPOFs and bottlenecks I’ve seen plague other designs in this space. It’ll be interesting to see where they’re able to go with it, and I wish them luck.

Thinking at Cloud Scale

Scalable systems are not just bigger versions of their smaller brethren. If you take a ten-node system and just multiply it by ten, you’ll probably get a system that performs poorly. If you multiply it by a hundred, you’ll probably get a system that doesn’t work at all. Scalable systems are fundamentally and pervasively different, because they have to be. I’d been meaning to write about some aspects of this for a while, but my recent post about MaxiScale brought a couple of particular points to mind. Here are two Things People Don’t Get about building scalable systems.

  • Everything has to be distributed.
  • Change has to be handled online.

To the first point, almost everyone knows by now that any “master server” can become a limit on system scalability. What’s less obvious is that universal replication is just as bad if not worse. For an extremely read-dominated workload, spreading reads across many nodes and not worrying about a few writes here and there might work. For most systems, though, the overhead of replicating those more-than-a-few writes will kill you. What you have to do instead is spread data around, which means that those who want it have to be able to find each piece separately instead of just assuming they all exist in one convenient place.

The second point is a bit less obvious. As node count increases, so does the likelihood that users will represent conflicting schedules and priorities that preclude bringing down the whole system for a complete sanity check and tune-up. This is the down side of James Hamilton’s non-correlated peaks. Despite the “no good time for everyone” feature of large systems, though, change will continue to occur. Nodes will be added and removed, and possibly upgraded. Lists will get long, space will fragment, and cruft will generally accumulate. Latent errors will appear, and they’ll remain latent until they’re fixed online or until they cause a catastrophic failure. Individual nodes will reboot, and some will argue that they should be rebooted even if they seem fine, but “planned downtime” for the system as a whole will be no more than a fond memory.

As it turns out, these two rules combine in a particularly nasty way. If everything has to be distributed and then found, and changes to the system are inevitable, then you would certainly hope that your method of finding things can handle those inevitable changes. Unfortunately, this is not always the case. For example, consider one of the many systems based on Dynamo-style consistent hashing with N replicas. Now add N nodes, such that they’re all adjacent in the space between file hash X and server hash Y. Many systems support an update-or-insert operation, but if such an operation is attempted at this point it will create a new datum on the N new nodes, separate from and inconsistent with the existing datum at Y and its successors. This is just bad in a bunch of ways – inconsistent data on the new nodes, stale data on the old ones, perhaps even an unexpected and hard-to-resolve conflict between the two if one of the new nodes then fails. This might seem to be an unlikely scenario, but the third key lesson of scalable systems is this:

  • Given enough nodes and enough time, even rare scenarios become inevitable.

In other words, you can never sweep the icky bits under the rug. You have to anticipate them, deal with them, and test the way that you deal with them. I can guarantee that certain well regarded data stores, implemented by generally competent people, mishandle the case I’ve just outlined. I’ve watched their developers hotly deny that such cases can even occur, proving only that they hadn’t thought about how to handle them. It’s not that they’re bad people, or stupid people, but they clearly weren’t thinking at cloud scale.
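
Here’s a toy version of that case, using explicit ring positions instead of real hashes so the new nodes can be placed exactly where the scenario requires – entirely my own construction, not any particular product’s code:

    # N new nodes land between a key's hash and its current first replica; an
    # update-or-insert then writes fresh copies to the new nodes while the old
    # replicas silently go stale, because nothing was rebalanced.
    import bisect

    N = 3  # replica count

    class Ring:
        def __init__(self):
            self.points = []                 # sorted list of (position, node)
            self.store = {}                  # node -> {key: value}
        def add(self, pos, node):
            bisect.insort(self.points, (pos, node))
            self.store[node] = {}            # note: no data migration here
        def replicas(self, key_pos):
            i = bisect.bisect(self.points, (key_pos, chr(0x10FFFF)))
            return [self.points[(i + k) % len(self.points)][1] for k in range(N)]
        def upsert(self, key_pos, key, value):
            for node in self.replicas(key_pos):
                self.store[node][key] = value

    ring = Ring()
    for pos, node in [(100, "a"), (200, "b"), (300, "c"), (400, "d"), (500, "e")]:
        ring.add(pos, node)

    ring.upsert(150, "file-X", "v1")         # stored on b, c, d

    for pos, node in [(160, "p"), (170, "q"), (180, "r")]:
        ring.add(pos, node)                  # three new nodes between 150 and 200

    ring.upsert(150, "file-X", "v2")         # now stored on p, q, r
    print(sorted(n for n in ring.store if ring.store[n].get("file-X") == "v2"))  # new copies
    print(sorted(n for n in ring.store if ring.store[n].get("file-X") == "v1"))  # stale copies

Real systems have to migrate data or redirect requests during membership changes; the toy just shows what happens when they don’t.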

Trying to reason about systems that have no single authoritative frame of reference is hard. It’s like a store where every customer tries to use a different currency, with the exchange rates changing every minute. Building systems that can never go down for more than a moment, or perhaps never at all, is hard too. It’s no wonder people have trouble with the combination. Nonetheless, that’s what people who make “cloud scale” or “web scale” or “internet scale” products have to get used to. A ten-node cloud is just a puff of warm moist air, and anyone can produce one of those.

Fighting the MaxiScale FUD

A lot of people don’t understand storage. A lot of people don’t understand distributed systems. It should be no surprise, then, that very few really seem to understand distributed storage. Even experts in one of the two relevant specialties often seem lost when attempting to deal with their intersection, or make amazingly bad decisions when they try to implement something in that space. Wherever there is confusion or a general lack of knowledge, there is also an opportunity for vendors to sling FUD. I was reminded of this by a MaxiScale white paper called Distributed Metadata: The Key to Achieving Web-Scale File Systems. My intent in addressing it is not to beat up MaxiScale – though I will be doing some of that at the end – but to use the white paper’s shortcomings as an opportunity to clear up some of the ignorance and confusion that they’re trying to use to their own advantage. Let’s examine some of their worst claims.

CassFS 0.01 Released

You can see it here. As I mentioned in my update last night, it works well enough to extract and build – but not run – iozone. That was the milestone I had previously established for making the code available, but I’d like to make sure one thing is very clear.

This is not code to use. This is code to study and hack on. I wouldn’t even use it myself for anything but coding practice. It’s incomplete, unreliable, and slow.

Here are some more specifics on what’s wrong with it.

  • Incomplete: no delete/rename, no maintenance of owner or permissions, no truncate, file-size limits, …
  • Unreliable: unchecked return values, probably memory leaks, and so on. You will definitely hit race conditions in the Thrift code if you forget to use the “-s” (single-threaded) flag for the FUSE daemon. Who would have thought that code would be even lamer than my own? Well, it is, and not just because of that.
  • Slow: the code rewrites inodes and superblocks way too often and does way too much copying (thanks again to Thrift).

All that said – and when I’m so critical of everything else out there how could I be any less critical of my own work? – I still think it’s kind of cool. It’s neat being able to store and manipulate data as Plain Old Files even though the backing store is about as much unlike a Plain Old Disk as you’re ever going to see, with a minimum of configuration fuss as well. I’m going to set it aside for a while now, mostly because I’m having plenty of fun solving some not-entirely-unrelated problems during the day now, but the next time I feel like doing some hobby programming there are all sorts of interesting things left to do with this code. Maybe someone else will even look at it and get some ideas.

Software Licensing as an Environmental Issue

After I suggested that I might use AGPL for my Cassandra filesystem, somebody asked why I would choose that particular license. I answered briefly in a comment, but I think the subject deserves a fuller explanation.

First, I’ll try to explain what AGPL means from my perspective. The basic intent of the GPL generally is to ensure a healthy “commons” of software anyone can use free of charge. It does this by requiring that people who distribute software based on GPL code make that code, including their own modifications, available to others. Unfortunately, the way GPLv2 was worded, this did not cover the case of the GPL software being used to provide a service. The code itself never gets onto users’ machines (which makes me wonder about the case of Java bytecodes and such) and thus the requirement to publish never gets triggered. Many companies have taken advantage of this loophole, but I feel it violates the spirit of the license under which they received the code. Even though I have historically considered GPL-style licenses to be anti-profit and have preferred BSD-style alternatives, I appreciate the efforts Affero made to preserve the intent of the GPL better than GPLv2 itself did.

My preference for BSD-style licenses was also predicated on a belief that most people would try to do the right thing even if they weren’t required to. Unfortunately, that belief has been weakened over time. The final nail in the coffin was a recent case when someone I follow on Twitter claimed that a particular piece of software had the “worst possible licensing,” which precluded using it. He said that AGPL would force him to publish all of his own code – a lie explicitly contradicted by the AGPL text itself – and that dual licensing was a “bait and switch” tactic. I wonder if he yells at the free-sample folks in the grocery store too. Sadder still was the limited reaction he got. One author of the software in question correctly referred to his attitude as “captruism” – capitalism for what I want to sell, altruism for what I want to use. Kudos to him. Others who should know better – people who have historically taken a more aggressive stance than I have on software freedom, and who have pressured me to release my code before it’s even usable – not only remained silent but continue to aid this individual in his effort to promote himself and profit from the work of others. You know who you are. Shame on you. Saddest of all, this is not an isolated case. This combination of “aggressive freeloading” by some, aided and abetted by others’ apathy or bought silence, is quite common.

Here’s the thing: I actively want to prevent such people from using my work, even if that means I forego some chance to profit along with them. Consider it an example of the Ultimatum Game; even ignoring other considerations, I might use AGPL as a deliberate goad to those who have expressed antipathy toward it. To answer the facile but inevitable counterattack, this is not because I despise wealth creation but because I value it. Monetizing others’ code is not stealing – I’m not that kind of extremist – but it’s not wealth creation either. It’s conversion of raw resources into refined ones, and an abundant pool of freely available software representing all kinds of innovation is an essential resource for software-related wealth creation. If those who profit from it fail to replenish the resource pool, that’s in the same category as overfishing or strip mining and I don’t condone that sort of thing. There is nothing in AGPL to preclude somebody else taking my code, combining or aggregating it with their own innovative code licensed however they want, and profiting from the result. More power to them, but if they modify my code then yes, I darn well want those changes contributed to the commons even if they’re hidden behind an online service. Don’t strip-mine other people’s property.

Cassandra Filesystem

Over the holidays I had planned to work on a FUSE interface to Cassandra. Yeah, it’s a silly idea. I’m not doing it because it’s useful. Mostly I’m just doing it because I can. I like to play with code even when I’m not working, so even though this involves two work-related technologies I consider it a form of leisure. As it turns out, I didn’t get much of a chance to work on it. I always thought vacations were supposed to be voluntary and either restful or enjoyable, but when the timing is dictated by my employer and most of the time is spent enabling someone else’s rest or enjoyment then I think a different term is necessary. I was able to squeeze in three or four evenings’ worth of coding around my second job, though, and I don’t know when I’ll be able to get back to it, so this is a status update of sorts. Here’s what I have so far.

  • Data structures and key-naming conventions – roughly equivalent to the on-disk format of a disk-based filesystem.
  • Code to manipulate those structures in several important ways, including inode and block allocation.
  • Code to create/mount a filesystem and create/list arbitrarily nested subdirectories.
  • Code to create and read/write string-valued “files” within those subdirectories (including rewrite).

That’s really not much, but what’s probably more important than the current functionality is the structure that holds it all together. If I’d set out to implement FTP-like get/put on whole files I would have done that within a much simpler structure, but I just don’t consider that functionality interesting. I very consciously took the slower route of implementing things the way they’ll need to be for FUSE, and that integration is the obvious next step. I should be able to knock out mount, lookup, mkdir, and create/open pretty quickly at this point. I consider incremental read/write of only the affected portions within an arbitrarily large file (as opposed to reading or writing the whole thing) to be the most important feature of this whole project, and I’ve structured things so that full read/write support shouldn’t be difficult – though it’s necessarily a bit tedious. After implementing a few more calls (e.g. stat/fstat, opendir/readdir, maybe even symlink operations) the result might even be useful to someone besides myself. Then there are a lot of other things I could do…
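
Since the interesting part is the block-level layout rather than the FUSE glue, here’s a hypothetical sketch of the kind of key naming involved – the real CassFS conventions aren’t spelled out in this post, so the block size and key format below are mine – just to illustrate why incremental read/write only has to touch the affected block keys:

    BLOCK_SIZE = 64 * 1024   # assumed block size, not necessarily what CassFS uses

    def block_keys(inode, offset, length):
        """Keys that a read/write of [offset, offset+length) would have to touch."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        return ["inode:%d:block:%d" % (inode, b) for b in range(first, last + 1)]

    # A 4KB write in the middle of a 100MB file touches exactly one block key,
    # instead of rewriting the whole file the way a naive get/put scheme would.
    print(block_keys(inode=42, offset=10 * 1024 * 1024, length=4096))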

  • I’d like to port the same functionality to other stores such as Voldemort, Tokyo/LightCloud, and/or Hail. Nothing so far particularly precludes that.
  • I probably won’t optimize around Cassandra’s multi-column data model, because that’s largely at odds with porting to other stores. Yes, I could implement yet another layer of abstraction between FUSE and the “block” level so that Cassandra could do certain things using columns and simpler stores could do them using similarly named keys, but it just wouldn’t be any fun. If anybody wants to pay me then my attitude might change, but as long as it’s a leisure-time project this seems unlikely.
  • I do intend to fix certain inefficiencies in how my own code works right now. For example, inode and block allocation hit the “superblock” key way too often. I have a very specific plan for how to do that better, but haven’t bothered to implement it. Similarly, file and directory creation both involve rewriting the entire parent directory and that’s nasty. Incremental directory updates are similar to incremental data updates, so once I have those done I’ll adapt the code.
  • I don’t intend to fix inefficiencies in how Cassandra works right now. The Thrift interface is ludicrously string-centric, forcing all kinds of copies and transformations that really shouldn’t be necessary, but fixing that would require a whole new bunch of work that I wouldn’t enjoy. See previous comment about for-pay work vs. leisure.
  • I do intend to fix some of my own general sloppiness – unchecked return values, probably memory leaks, general lack of modularity in some places.
  • I do not intend to implement any kind of multi-machine or multi-user support. That’s the kind of stuff I do for my day job; unless you want to offer me a new day job (for a lot of money) it’s both too much work and too much conflict of interest. That’s absolutely positively off the table as long as this is a hobby project.

I don’t quite feel ready to post the code yet, though I might be persuaded. If you think it’s something you might actually be interested in working on, then by all means let me know and I’d be glad to let you have it privately. I just don’t see any point in posting it for every wannabe to pick at when at least half of it will probably change soon anyway. When I get to the point where I can mount via FUSE and unpack/build/run iozone within the resulting mountpoint, even if it’s slow and ugly, then I’ll probably put it on SourceForge under AGPL unless someone suggests another site/license.