Snowflakes Melt

The first rule of self-help scam artists is to tell people what they want to hear, without being too obvious about it. Tell them that there’s an easy way for them to get whatever they want, or that everything bad is somebody else’s fault, but wrap it up in convincing language that makes it seem somehow logical. One example of this that has really been driving me nuts lately is Seth Godin’s “linchpin” idea. Apparently the idea is that everyone should aspire to be a post-industrial “artist” instead of just a cog in a machine, everyone has the potential to do “emotional work” at a “high level” and thus become indispensable, etc. It’s a very positive message, but I’m just not buying it.

First, while I believe it’s not possible to predict who will rise to the challenge of becoming a linchpin, and thus that we should give everyone the chance, that doesn’t mean absolutely everyone can. Many people really are cogs in the corporate machine, are not particularly capable of being anything else, and are even happy that way – saving their creative energy for other pursuits such as family, hobbies, sports, and so on. Ninety percent of the people who proudly portray themselves as linchpins quite notably do not meet the criteria, and I’d even say they’re less likely than average to be true linchpins, because the significant time they spend on self-affirmation and self-promotion is time not spent actually doing anything that would make them real linchpins.

My second objection to Godin’s idea is that even if many people could be linchpins individually, they can’t all be linchpins at once; it’s like expecting every child to be above average. Show me an organization where every single contributor is indispensable, and I’ll show you an organization that is guaranteed to fail as soon as the normal course of events makes any one of them unavailable. Like it or not, only a few people in any group can be linchpins, and if you want to be one then you’ll be competing with others for that niche.

We really need to get over the idea that every worker should be a unique and pretty snowflake to deserve a place in a high-functioning organization. Somebody has to do the things that anybody can do, and as long as that’s the case there’s a need for discipline as well as self-expression. If everybody thinks they’re leading, nobody really is. Most people are unique and special in some way, but usually not in a way that any employer or client can or should care about. Ordinary people, or people who are special in non-work-related ways, can still play a valuable and even essential role in the most creative and innovative environments. Uniqueness is not a requirement. A beautiful snowstorm would still be beautiful even if every single snowflake looked exactly alike. Real self-help would mean teaching people how to find a creative environment, function in it, and enjoy being part of it, not just telling them that they can be among its leaders. For many of them, it’s simply not true.

KumofsFS

I had a little time and energy to hack on VoldFS today, but not enough of either to get into anything really complicated. My highest priority is to get the contended-write cases working properly, and that’s likely to be a bit of a slog, so I decided to do something really easy and fun instead. Specifically, I built and installed kumofs (http://kumofs.sourceforge.net/) to see whether my previously implemented memcached-protocol support would work with it as well as with Memcached Classic.

Well, it does. All I had to do was set VOLDFS_DB=mc and voila! The built-in unit tests worked fine, so I did my manual test: mount, copy in a directory, unmount, remount, get file sums in the original and copied directories, verify that they match. Everything was fine, so I can now say that VoldFS is a true multi-platform FUSE interface to at least two fully distributed data stores. Now I guess I’ll have to work on concurrency, performance, and features.
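
For the record, the manual test is nothing fancy. It amounts to something like the following, where the mount point and the directory being copied are just examples; I’m assuming voldfs.py takes the mount point as its argument (check the script for its real usage, and run the mounts in another terminal if they stay in the foreground):

$ export VOLDFS_DB=mc                      # use the memcached-protocol back end
$ mkdir -p /mnt/voldfs
$ ./mkfs.py
$ ./voldfs.py /mnt/voldfs                  # mount (argument assumed; see voldfs.py)
$ cp -r /usr/include /mnt/voldfs/include
$ fusermount -u /mnt/voldfs                # unmount...
$ ./voldfs.py /mnt/voldfs                  # ...and remount
$ (cd /usr/include && find . -type f | sort | xargs md5sum) > /tmp/orig.sums
$ (cd /mnt/voldfs/include && find . -type f | sort | xargs md5sum) > /tmp/copy.sums
$ diff /tmp/orig.sums /tmp/copy.sums && echo "sums match"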

Apples and Oranges and Llamas

Everyone has their own unique set of interests. Professionally, mine is distributed storage systems. I’m mostly not very interested in systems that are limited to a single machine, which is not to say that I think nobody should be interested but just that it’s not my own personal focus. I believe somewhat more strongly, and I’m sure more controversially, that “in-memory storage” is an oxymoron. Memory is part of a computational system, not a storage system, even though (obviously) even storage systems have computational elements as well. “Storage” systems that are limited to a single system’s memory are therefore doubly uninteresting to me. Any junior in a reputable computer science program should be able to whip up some sort of network service that can provide access to a single system’s memory, and any senior should be able to make it perform fairly well. Yawn. It was in this context that I read the following tweet today (links added).

what does membase do that redis can’t ? #redis #membase

Yeah, I guess you could also ask “what does membase do that iPhoto can’t” and it would make almost as much sense. They’re just fundamentally not the same thing. One is distributed and the other isn’t. I don’t mean that membase is better just because it’s distributed, by the way. It’s not clear whether it’s a real data store or just another “runs from memory with snapshots to/from disk” system targeting durability but not capacity. In fact many such systems don’t even provide real durability if they’re based on mmap/msync and thus can’t guarantee that writes occur in an order which facilitates later recovery, and by failing to make proper use of either rotating or solid-state storage they definitely fail to provide a cost-effective capacity solution. In addition to that, membase looks to me like a fairly incoherent collection of agents to paper over the gaping holes in the memcache “architecture” (e.g. rebalancing). No, I’m no particular fan of membase, but the fact that it’s distributed makes it pretty non-comparable to Redis. It might make more sense to compare it to Cassamort or Mongiak. It would make more sense still to compare it to LightCloud or kumofs, which already solved essentially the same set of problems via distribution add-ons to existing projects using the same protocol as membase. Comparing to Redis just doesn’t make sense.

But wait, I’m sure someone’s itching to say, there are sharding projects for Redis. Indeed there are, but there are two problems with saying that they make Redis into a distributed system. First, adding a sharding layer to something else doesn’t make that something else distributed; it only makes the combination distributed. Gizzard can add partitioning and replication to all kinds of non-distributed data stores, but that doesn’t make them anything but non-distributed data stores. Second, the distribution provided by many sharding layers – and particularly those I’ve seen for Redis – is often of a fairly degenerate kind. If you don’t solve the consistency, data aggregation/dependency, and node addition/removal problems that come with making data live on multiple machines, you have a pretty weak distributed system. I’m not saying you have to provide full SQL left-outer-join functionality with foreign-key constraints, full ACID guarantees, and partition-tolerant replication across long distances, but you can’t just slap some basic consistent hashing on top of several single-machine data stores and claim to be in the same league as some of the real distributed data stores I’ve mentioned. You need a reasonable level of partitioning, replication, and membership-change handling integrated into the base project to be taken seriously in this realm.

Lest anyone think I’m setting the bar too high, consider this list of projects. That list is a year and a half old, and I count seven projects on it that meet the standard I’ve described. There are a few more that Richard missed, and more have appeared since then. There are already close to two dozen more-or-less mature projects in this space, not even counting things like distributed filesystems and clustered databases that still meet these criteria even if they don’t offer partition tolerance. It’s already too crowded a space to justify throwing every manner of non-distributed or naively sharded system into the same category, even if they have other features in common. Redis and Terrastore, for example, are fine projects that are based on great technology and offer great value to their users, but my phone pretty much fits that description too and I don’t put it in the same category either. Let’s at least compare apples to other fruit.

Memcached support in VoldFS

A few days ago, I pushed VoldFS to GitHub. I was rather pleased to see that it then spent two days at or near the top of the “trending repos” list on their front page, whatever that means. It’s hard to believe I’m even in the top thousand for views per hour or per day, or that the views were still increasing at the end of that period, so I’m not sure that standing was really deserved, but I still appreciate the exposure while it lasted. Last night, I pushed my first update, adding support for the memcached protocol using python-memcached. If you want to play with it using an instance of memcached running locally on the default port, you’ll need the most recent version of memcache.py, which supports the “cas” operation (terribly misnamed, because it’s not a compare-and-swap of contents at all but rather a conditional put based on version numbers). Anyway, if you have that, then all you need is:

$ export VOLDFS_DB=mc
$ ./mkfs.py
$ ./voldfs.py …

The point of adding this support actually has nothing to do with memcached as we all know and love it – in that “look at the cute little kid trying to act all grown up” kind of way. It’s a common protocol for other things as well, including the Gear6 and NorthScale commercial memcache appliances as well as projects like kumofs (which is the alternative that led me to explore this). Supporting the memcached protocol means that VoldFS can provide a filesystem interface across any of several underlying technologies, expanding the potential user base greatly. I might well add support for other data stores at some point.
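
As a quick sanity check before pointing VoldFS at one of these alternatives, it helps to confirm that whatever is listening on the memcached port really does speak the memcached text protocol. Against memcached itself (assumed here to be on the default localhost:11211) something like this works; other servers may implement only a subset of the protocol, so a missing command isn’t necessarily fatal:

$ printf 'version\r\nquit\r\n' | nc localhost 11211    # expect a reply like: VERSION 1.4.5
$ printf 'set probe 0 0 5\r\nhello\r\nget probe\r\nquit\r\n' | nc localhost 11211
                                                       # expect STORED, then VALUE/END with the data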

VoldFS

Those of you who follow me on Twitter have probably seen me mention VoldFS. It’s my latest spare-time coding project, in which I’m using FUSE (and Python) to implement a filesystem interface on top of Voldemort. To satisfy the curious, here’s a first cut at a FAQ.

Using GlusterFS for Filesystem Migration

Someone on Slashdot asked how to do filesystem snapshots on Linux. Many respondents pointed out somewhat reasonably that the current setup with ext3 on raw disks didn’t support that, and that the poster should migrate onto another filesystem or LVM to get that functionality, but nobody seemed to have much to say about how to do that initial migration non-disruptively. I’ve had to do filesystem migrations involving many terabytes and many millions of files, and it’s a non-trivial exercise. There are a lot of ways to get it wrong and ruin your data. Based on things I’ve done since then, here’s the approach I’d investigate first if I had to do it now.

One of the less obvious tricks I’ve learned to do with GlusterFS is to run it all on one machine. The design very deliberately lets you stack “translators” on top of one another in all sorts of arbitrary ways, and the network protocol modules use the same translator interface as well. I often run the “distribute” translator, or translators I’ve written myself, directly on top of the “storage/posix” local-filesystem translator. It works fine, and it’s much more convenient for development than having to run across multiple machines. GlusterFS also has a replication translator, and one of the functions it necessarily provides is “self-heal” to rebuild directories and files on a failed (and possibly replaced) sub-volume.

Do you see where I’m going with this? You can set up an old local filesystem and a new (empty) local filesystem as a replica pair under GlusterFS, and then poke it into “self-healing” all of the files from old to new while the filesystem is live and in active use. GlusterFS doesn’t care that the two filesystems might be of different types (e.g. ext3 vs. btrfs) and/or on different kinds of storage (e.g. raw devices vs. LVM), so long as they both support POSIX locks and extended attributes. All the while, it keeps track of operations in progress on the composite filesystem, so the whole exercise is effectively transparent to users, who just see essentially the same filesystem they always saw. When you’re done, you just take GlusterFS back out of the picture and mount the new, fully populated filesystem directly. The configuration sketched below takes an existing directory at /root/m_old and combines it with an empty directory at /root/m_new to create a replica pair; after it come the commands to mount the result and force self-heal.
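
Roughly, the volfile looks like this. Treat it as an illustrative sketch rather than a tested, drop-in configuration: the volume names are arbitrary, and the exact translator and option names may need adjusting for your GlusterFS version.

cat > /usr/etc/glusterfs/migrate.vol << 'EOF'
# The old (populated) filesystem, with POSIX locks layered on top.
volume old-posix
  type storage/posix
  option directory /root/m_old
end-volume

volume old-locks
  type features/locks
  subvolumes old-posix
end-volume

# The new (empty) filesystem, with the same stack.
volume new-posix
  type storage/posix
  option directory /root/m_new
end-volume

volume new-locks
  type features/locks
  subvolumes new-posix
end-volume

# Export both stacks through a purely local server (see the notes below
# about why the server/client pair seems to be needed at all).
volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.old-locks.allow 127.0.0.1
  option auth.addr.new-locks.allow 127.0.0.1
  subvolumes old-locks new-locks
end-volume

# Re-import both over the loopback interface...
volume old-client
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume old-locks
end-volume

volume new-client
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume new-locks
end-volume

# ...and join them into a replica pair, so that self-heal copies old to new.
volume migrate
  type cluster/replicate
  subvolumes old-client new-client
end-volume
EOF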

glusterfs -f /usr/etc/glusterfs/migrate.vol /mnt/migrate
ls -alR /mnt/migrate

I should warn people that I’ve only done a very basic sanity test on this. It seems to work as expected for a non-trivial but still small directory tree, but you’d certainly want to test it more thoroughly before using it to migrate production data (and of course you should absolutely make sure you have a backup that works before you attempt any such thing). There are a couple of non-obvious things about the configuration that I should also point out.

  • Both filesystems should, as mentioned previously, support POSIX locks and extended attributes. You need to load the features/locks translator to use the former.
  • For some reason this doesn’t seem to work without the server and client modules involved. This hasn’t been the case in my experience with other composite translators, and it shouldn’t be necessary here either, but at least the networking is all local so it’s not too terrible.

I’m sure there are other live-migration approaches that should work just as well if not better. I suspect there’s at least one approach using union mounts, for example. There are also a lot of approaches I can think of that would fall prey to subtle issues involving links (or other things that aren’t plain files), sparse files, and so on. It’s a lot easier to suggest an answer than to implement one that actually works. I’ve even thought (since SiCortex) of writing a FUSE filesystem specifically to do this kind of migration, but it would require a significant effort. This seems like an easy and fairly robust way to do it using tools that already exist.

SeaMicro Comes Out of Stealth Mode

Here’s the product page. It should certainly look very familiar to my ex-SiCortex readers, and some of the commentary from James Hamilton even more so.

Four of these modules will . . . deliver more work done per joule, work done per dollar, and more work done per rack than the more standard approaches currently on the market.

Short version: up to 512 servers in 10U, each a 1.6GHz Atom processor with 2GB of (non-ECC) memory, connected in a three-dimensional torus with 1.2Tb/s of aggregate bandwidth, plus virtualized I/O. Computationally it’s a bit denser than the SiCortex systems: 3.3e12 instructions/sec in a rack vs. 4.1e12 in an SC5832, which was two racks wide (and deeper as well). In terms of power the density is about the same – 2kW per unit, or 8kW per rack, vs. 20kW for the two-rack SC5832. The interconnect is where it looks weaker. 1.2Tb/s works out to only about 20Gb/s per eight-processor board (512 processors at eight per board is 64 boards, so 1.2Tb/s split 64 ways is a shade under 20Gb/s each) vs. 48Gb/s per processor for the SC5832, and with that topology the average hop count for the 128-node box is likely to be higher than for the entire SC5832 (with no mention of a way to connect that fabric across all four boxes in a rack), so communication-intensive workloads are likely to run into a few problems.

On the other hand, the SeaMicro I/O virtualization seems a bit more robust than what we had. Sure, we had scethernet, but that was emulation rather than virtualization (we explicitly avoided the “virtual NIC” approach) and we had nothing similar for storage. And of course it’s x86, so all the n00bs who can’t comprehend there even being another architecture with different performance tradeoffs can just plop their code on it and go. That plus the I/O virtualization might mean that even things like operating systems can run unaware that this is something physically different from what they’re used to.

Overall, there might be a few areas (especially interconnect) where the SM10K might seem less sexy than the SC5832 was, but there are also some (I/O virtualization) where it takes things a bit further and it’s clearly better positioned for adoption by a much larger target market. It’s a very interesting and welcome development. It will be particularly interesting to see how things play out between them and Smooth Stone with their ARM-based architecture.

If anybody from either of those companies is reading this, and looking for a parallel/distributed filesystem (or other data store) to run on such systems, let me know. I might be able to help. ;)