Canned Platypus

Making the world better, one byte at a time.

I just thought of a new metric: followers per tweet. I’m at about 0.43, which is pretty middle of the road. I see some people who are flirting with the 0.1 mark. At the other end of the scale I see some who are at 3.0 or better. Not too surprisingly, the first group are disproportionately likely to end up on my “whale-jumpers” list which I check less often, and a similarly disproportionate number of my favorite tweeple seem to be in the second group. I could therefore expect to improve the “quality” of my own personal Twitter stream by checking this ratio for people I’m thinking of following . . . and you could do the same for your own personal stream as well.

BTW, if you want to help me pump up my own ratio, I’m @Obdurodon. ;)

Amazon has announced Cluster Compute Instances for EC2. This is a very welcome development. Having come from SiCortex, where we provided a somewhat cloud-like ability for users to allocate large numbers of very well-connected nodes on demand, I’ve been talking to people about the idea of provisioning cloud resources on special machines like this for at least the past year. In that light, I find a couple of things about the announcement a bit surprising. Let’s go down the specs first.

  • There’s a single type of HPC instance – dual quad-core “Nehalem” processors with 23GB. The Amazon page points out that this small additional amount of transparency about the exact CPU type allows people to do processor-specific optimization that they generally can’t do elsewhere in EC2.
  • Each instance comes with 1.7TB of instance storage. Performance is not mentioned, but at modern disk drive sizes that might well be just two drives.
  • Connectivity is via 10GbE (NIC and switch vendors not specified). Yuk. 10GbE still lags behind InfiniBand in terms of both bandwidth and latency, both absolute and per dollar. Much has been made lately of the significant and increasing dominance of IB in the HPC world, especially in the Top 500, and the customers Amazon is trying to attract are likely to consider 10GbE a strange choice at best.
  • There is a default limit of eight Cluster Compute Instances without filling out a request form. Eight machines is not enough for serious work of this type, even when the machines are this powerful, so that’s going to affect – and annoy – practically every user.
  • The instances are $1.60 per hour, which is $38.40 per day or roughly $1,150 for a 30-day month. There are others far better qualified to comment on the economics, so I’ll leave it at that.
  • Cluster Compute Instances are only available in one availability zone.

My first thought is that the new offering as currently specified is nowhere near as interesting as it could be – and might be, as the service continues to evolve. Faster interconnects are one obvious way to make it more interesting. Removing the eight-machine default limit – which I strongly suspect is related to the capacity of the switches they’re using – is another. Then it gets even more interesting. When I’ve talked to people about heterogeneous clouds, which is what we’re heading towards here, I’ve generally meant far more kinds of specialization than this. How about instances which are optimized for communications instead of computation, such as with the same 10GbE (or better) but less powerful processors? How about instances which are optimized for disk I/O with multiple spindles and/or SSDs? How about special GPU-equipped instances? Once you can deal with the kind of heterogeneity that today’s CCIs represent, it’s but a short step to handling these other variations as well, so today’s announcement might merely foreshadow even bigger things to come.

The other thought I have about this is that it’s not just about the individual instances. The ability to specify that several instances should be provisioned close to one another – probably on the same switch for reasons I mentioned above – is interesting with respect to both the user experience and the infrastructure needed to support it. Location transparency might be a defining feature of cloud computing, but that’s only in the sense of absolute location. Relative location is still a very valid parameter for allocation of cloud-computing resources. When you define a “cluster placement group” in EC2 you’re effectively saying that these instances should all be close to one another, regardless of where they all are relative to anything else. In other situations, such as disaster recovery, you might want to say certain instances should definitely be far from each other instead. We’ve been thinking through a lot of these issues on Deltacloud, but this isn’t a work blog so it would be both unwise and distasteful to say much more about that right now. Suffice it to say that facilitating this kind of placement requires a much more sophisticated cloud infrastructure than “grab whatever’s free wherever it is” which is pretty much the current standard. When you consider relationships not only between instances but also between instances and the data or connectivity they need, it can become quite a science project. The possibility that Amazon might be doing some of that science is, to me, one of the most exciting things about this announcement.
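To make the "near each other" versus "far from each other" distinction concrete, here is a minimal sketch of relative placement. It is not how EC2 or Deltacloud actually allocates anything; the `place` function, the host-to-switch map, and the policy names `cluster` and `spread` are all hypothetical, and each host is assumed to have exactly one free slot.

```python
from collections import defaultdict

def place(hosts, count, policy):
    """Pick `count` hosts under a relative-placement policy.

    hosts:  dict of host name -> switch name (one free slot per host, by assumption).
    policy: "cluster" = all instances behind the same switch,
            "spread"  = every instance behind a different switch.
    Returns a list of host names, or None if the request cannot be satisfied.
    """
    by_switch = defaultdict(list)
    for host, switch in sorted(hosts.items()):
        by_switch[switch].append(host)

    if policy == "cluster":
        # Find any single switch with enough free hosts.
        for switch in sorted(by_switch):
            if len(by_switch[switch]) >= count:
                return by_switch[switch][:count]
        return None

    if policy == "spread":
        # Take one host from each of `count` distinct switches.
        if len(by_switch) < count:
            return None
        return [by_switch[switch][0] for switch in sorted(by_switch)[:count]]

    raise ValueError(f"unknown policy: {policy}")
```

Even this toy version shows why "grab whatever's free" is not enough: a cluster request can fail while plenty of hosts are free, simply because they sit behind the wrong switches.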

I’ve been on vacation for the last few days, and while I was (mostly) gone a few interesting things seem to have happened here on the blog. The first is that, after a totally unremarkable first week, my article It’s Faster Because It’s C suddenly had a huge surge in popularity. In a single day it has become my most popular post ever, more than 2x its nearest competitor, and it seems to have spawned a couple of interesting threads on Hacker News and Reddit as well. I’m rather amused that the “see, you can use Java for high-performance code” and the “see, you can’t…” camps seem about evenly matched. Some people seem to have missed the point in even more epic fashion, such as by posting totally useless results from trivial “tests” where process startup dominates the result and the C version predictably fares better, but overall the conversations have been interesting and enlightening. One particularly significant point several have made is that a program doesn’t have to be CPU-bound to benefit from being written in C, and that many memory-bound programs have that characteristic as well. I don’t think it changes my main point, because memory-bound programs were only one category where I claimed a switch to C wouldn’t be likely to help. Also, programs that store or cache enough data to be memory-bound will continue to store and cache lots of data in any language. They might hit the memory wall a bit later, but not enough to change the fundamental dynamics of balancing implementation vs. design or human cycles vs. machine cycles. Still, it’s a good point and if I were to write a second version of the article I’d probably change things a bit to reflect this observation.

(Side point about premature optimization: even though this article has been getting more traffic than most bloggers will ever see, my plain-vanilla WordPress installation on budget-oriented GlowHost seems to have handled it just fine. Clearly, any time spent hyper-optimizing the site would have been wasted.)

As gratifying as that traffic burst was, though, I was even more pleased to see that Dan Weinreb also posted his article about the CAP Theorem. This one was much less of a surprise, not only because he cites my own article on the same topic but also because we’d had a pretty lengthy email exchange about it. In fact, one part of that conversation – the observation that the C in ACID and the C in CAP are not the same – had already been repeated a few times and taken on a bit of a life of its own. I highly recommend that people go read Dan’s post, and encourage him to write more. The implications of CAP for system designers are subtle, impossible to grasp from reading only second-hand explanations – most emphatically including mine! – and every contribution to our collective understanding of it is valuable.

That brings us to what ties these two articles together – besides the obvious opportunity for me to brag about all the traffic and linkage I’m getting. (Hey, I admit that I’m proud of that.) The underlying theme is dialog. Had I kept my thoughts on these subjects to myself or discussed them only with my immediate circle of friends/colleagues, or had Dan done so, or had any of the re-posters and commenters anywhere, we all would have missed an opportunity to learn together. It’s the open-source approach to learning – noisy and messy and sometimes seriously counter-productive, to be sure, but ultimately leading to something better than the “old way” of limited communication in smaller circles. Everyone get out there and write about what interests you. You never know what the result might be, and that’s the best part.

(Dedication: to my mother, who did much to teach me about writing and even more about the importance of seeing oneself as a writer.)

Not long after I wrote about some of the less-savory tactics that are often used in positioning technical products/projects, another ugly example of FUD has raised its head in the NoSQL world. No, I’m not referring to the “Facebook is abandoning Cassandra” silliness, which I think Eric Evans dealt with quite adequately. I’m talking about the MongoDB will throw away your data silliness, where somebody involved in a competing project expresses “sincere concern” about data durability. Mikeal actually raises some excellent points about data durability. I have worked for many years in environments where such things are taken very seriously, and he’s correct on many of the technical points such as the inadequacy of mmap-based approaches for ensuring recoverability. I do like MongoDB’s feature set, which has led me to use it for one project, but I admit that my faith in their ability and intention to fix some of the durability issues is just that – faith. It’s faith based on knowing something about the people involved, and knowing – contra Mikeal – that the problems do not run so deep as to be unsolvable without a major overhaul, but it’s faith rather than current reality. Mikeal’s points would be better made without some of the exaggerations and misrepresentations that others have taken pains to correct in the comments, and especially without remarks like this:

Using an append-only file is the preferred, sane and most assured way to handle data loss or corruption.
[after being corrected and reminded of the soft-update alternative]
while soft updates are safe their inventor and implementor in UFS have admitted are very hard to get right.

“Hard to get right” and “insane” are not the same. Some things are intrinsically hard to get right but are still worth doing. That includes all of the hard work kernel people do on placement/alignment and ordering/flushing to make sure those append-only files work so well even for systems-programming tyros. It also includes some of the compaction/vacuuming problems that append-only files introduce (closely related to “segment cleaning” in log-structured filesystems). Two of the best-respected filesystems out there – ZFS and btrfs – are primarily based on COW rather than append-only logs/journals, though ZFS does have an intent log as well. VoldFS also uses COW, and I’ll gladly debate Mikeal on the merits of my “insane” choice for that environment or use case.
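For readers who have not built one, here is a minimal sketch of the append-only idea under discussion. It is not CouchDB's or any other project's on-disk format; the record layout (length plus CRC32 header), the `append_record` and `recover` names, and the choice to fsync on every append are assumptions made purely for illustration.

```python
import os
import struct
import zlib

# Hypothetical record header: payload length and CRC32 of the payload.
HEADER = struct.Struct("<II")

def append_record(path, payload):
    """Append one record and force it to stable storage before returning."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())

def recover(path):
    """Return every intact record, stopping at the first torn or corrupt one."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete or damaged tail: ignore everything after it
        records.append(payload)
        offset = start + length
    return records
```

Recovery is simple precisely because old data is never overwritten, but note what the sketch leaves out: nothing here ever reclaims space, which is the compaction/vacuuming cost mentioned above.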

The claim that the approach used by CouchDB is the only sane and assured one doesn’t help anyone. It’s merely partisan, not constructive. As Kristina Chodorow points out in the comments, it’s irritating when somebody just says “MongoDB ate my data” or “CouchDB is slow” without providing any specifics that can be addressed. (Since Mikeal seems to feel differently, I’m sure he won’t mind me mentioning that “CouchDB is a hog” is exactly the response I got when I suggested using it to replace MongoDB in that afore-mentioned project because I prefer its replication model.) As Dwight Merriman is also quoted in the comments as saying, the days of one-size-fits-all storage are over. MongoDB’s choices and roadmap might not suit everybody. Neither do CouchDB’s. There’s nothing wrong with proponents of one project engaging in constructive dialog regarding issues in others, but it does help if such criticism is in fact constructive and if people don’t consistently offer far more or sharper criticism than they are willing to accept themselves.

I was recently drawn into another discussion about a claim that project Foo was faster than project Bar because Foo is written in C (or maybe C++) and Bar is written in Java. In my experience, as a long-time kernel programmer and as someone who often codes in C even when there are almost certainly better choices, such claims are practically always false. The speed at which a particular piece of code executes only has a significant effect if your program can find something else to do after that piece is done – in other words, if your program is CPU-bound and/or well parallelized. Most programs are neither. The great majority of programs fit into one or more of the following categories.

  • I/O-bound. Completing a unit of work earlier just means waiting longer for the next block/message.
  • Memory-bound. Completing a unit of work earlier just means more time spent thrashing the virtual-memory system.
  • Synchronization-bound (i.e. non-parallel). Completing a unit of work earlier just means waiting longer for another thread to release a lock or signal an event – and for the subsequent context switch.
  • Algorithm-bound. There’s plenty of other work to do, and the program can get to it immediately, but it’s wasted work because a better algorithm would have avoided it altogether. We did all learn in school why better algorithms matter more than micro-optimization, didn’t we?
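One rough way to put numbers on the categories above is the standard Amdahl's-law estimate, which is a framing borrowed for illustration rather than something measured here. The function name and the example fractions are hypothetical.

```python
def overall_speedup(cpu_fraction, cpu_speedup):
    """Estimate whole-program speedup when only the CPU-bound part gets faster.

    cpu_fraction: share of wall-clock time actually spent executing code (0..1).
    cpu_speedup:  how much faster that code runs after the rewrite (e.g. 2.0).
    The rest of the time (waiting on I/O, memory, or locks) is assumed unchanged.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_speedup <= 0:
        raise ValueError("cpu_speedup must be positive")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)
```

If a program spends 10% of its time computing and a rewrite doubles the speed of that code, `overall_speedup(0.1, 2.0)` gives about 1.05x. The same rewrite on a program that is 90% compute gives about 1.82x, which is the kind of program where the language choice can actually show up.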

If you look at this excellent list of performance problems based on real-world observation, you’ll see that most of the problems mentioned (except #5) fit this characterization and wouldn’t be solved by using a different language. It’s possible to run many synchronization-bound programs on one piece of hardware, with or without virtualization, but the fewer resources these programs share the more likely it becomes that you’ll just become memory-bound instead. On the flip side, if a program is purely disk-bound or memory-bound then you can obtain more of those resources by distributing work across many machines, but if you don’t know how to implement distributed systems well you’ll probably just become network-bound or synchronization-bound. In fact, the class of programs that exhibit high sensitivity to network latency – a combination of I/O-boundedness and synchronization-boundedness – is large and growing.

So, you have a program that uses efficient algorithms with a well-parallelized implementation, and it’s neither I/O-bound nor memory-bound. Will it be faster in C? Yes, it very well might. It might also be faster in Fortran, which is why many continue to use it for scientific computation but that hardly makes it a good choice for more general use. Everyone thinks they’re writing the most performance-critical code in the world, but in reality maybe one in twenty programmers are writing code where anything short of the most egregious bloat and carelessness will affect the performance of the system overall. (Unfortunately, egregious bloat and carelessness are quite common.) There are good reasons for many of those one in twenty to be writing their code in C, but even then most of the reasons might not be straight-line performance. JIT code can be quite competitive with statically compiled code, and even better in many cases, once it has warmed up, but performance-critical code often has to be not only fast but predictable. GC pauses, JIT delays, and unpredictable context-switch behavior all make such languages unsuitable for truly performance-critical tasks, and many of those effects remain in the runtime libraries or frameworks/idioms even when the code is compiled. Similarly, performance-critical code often needs to interact closely with other code that’s already written in C, and avoiding “impedance mismatches” is important. Most importantly, almost all programmers need to be concerned with making their code run well on multiple processors. I’d even argue that the main reason kernel code tends to be efficient is not because it’s written in C but because it’s written with parallelism and reentrancy in mind, by people who understand those issues. A lot of code is faster not because it’s written in C but for the same reasons that it’s written in C. It’s common cause, not cause and effect. 
The most common cause of all is that C code tends to be written by people who have actually lived outside the Java reality-distortion bubble and been forced to learn how to write efficient code (which they could then do in Java but no longer care to).

For those other nineteen out of twenty programmers who are not implementing kernels or embedded systems or those few pieces of user-level infrastructure such as web servers (web applications don’t count) where these concerns matter, the focus should be on programmer productivity, not machine cycles. “Horizontal scalability” might seem like a euphemism for “throw more hardware at it” and I’ve been conditioned to abhor that as much as anyone, but hyper-optimization is only a reasonable alternative when you have a long time to do it. Especially at startups, VC-funded or otherwise, you probably won’t. Focus on stability and features first, scalability and manageability second, per-unit performance last of all, because if you don’t take care of the first two nobody will care about the third. If you’re bogged down chasing memory leaks or implementing data/control structures that already exist in other languages instead of on better algorithms or new features, you’re spending your time on the wrong things. Writing code in C(++) won’t magically make it faster where it counts, across a whole multi-processor (and possibly multi-node) system, and even if it did that might be missing the point. Compare results, not approaches.

Everyone knows that the market for products can be very competitive, and that the competition can often take on a distinctly shady character. What seems to be less appreciated is that the “marketplace of ideas” can be just as competitive, and often just as shady. The battles for “mind share” between one project and another, or even between one approach and another, can be fierce. At the more obviously commercial end of this spectrum are debates like NAS vs. SAN, 10GbE vs. IB, iSCSI vs. FCOE, end-to-end vs. embedded functionality. Often the commercial interests and biases of the participants are obvious, but other times less so. When Joe Random Blogger weighs in on one of these debates, finding out that Joe spent months and thousands of dollars to become a CCIE might shed some light on their vested interest. Other times, there’s no such obvious marker but the vested interest is just as real. This is why Stephen Foskett and others have called for explicit disclosure of such interests by bloggers. Full disclosure: I’ve met Stephen, I like Stephen, we’ve done an Infosmack podcast together and he invited me to drop by at a Tech Field Day where I got to enjoy some free food at EMC’s expense. See how easy that was?

Where things get a little murkier is where there’s not an obvious product involved, or where some of the projects are non-commercial. Some of the participants in the fierce SQL vs. NoSQL debates have a commercial stake, some have a personal stake, and some really have no stake at all beyond a desire to see open discussion advance the state of the art. Quick: which category do I fit into? I’m not even sure myself. I like to think I fit into the last category, but this stuff is not entirely unrelated to the job that puts food on my family’s table so arguments that I belong in one of the other categories have plenty of merit. It’s funny that in my day job I rarely get exposed to such conflicts of interest but as a blogger I do. I’ve been asked to write “objectively” about certain products or projects, or sometimes to refrain from writing about them. I’ve tried to ignore those requests as much as possible, and just write about what interests me . . . but I digress. What are we to make of sniping between Cassandra and HBase advocates, for example, when both are open source? At an even more abstract level, what about centralized metadata vs. “floating master” vs. peer-to-peer distribution, or using the same vs. separate algorithms for wide-area and local replication? What about public cloud vs. private cloud and the people who claim one or another is beneath contempt? The resolutions of such debates have clear implications for certain projects, and the participants are hardly unaware of that, but the debates are not directly about the projects.

All of this creates an environment ripe for manipulation by the less ethical. Here on this blog I’ve often posted about apparent instances of FUD and astroturf, which are two forms of such manipulation. Bloggers or Twitterers with undisclosed interests in one side of the debate are everywhere. I’ve seen one fellow with a vested interest in certain NoSQL projects repeatedly bash other projects quite savagely, more than once, for faults that his own pet projects still have or had until only a week before, without disclosing his direct involvement in the alternatives that remain after the bashing is done. One of the dark sides of open source is the practice of mining competitors’ code for flaws not so they can be fixed but so that they can be used as ammunition in the war of ideas. Perhaps the most effective technique I’m aware of in this area, though, is the wooing of converts. As much as we all like to pride ourselves on being completely rational and empirical, technology is still a social enterprise and nothing can help one side of a debate more than winning over a prominent member of the other team. “I used to believe in SAN/embedded/centralized but I’ve seen the error of my ways” can be very powerful. It strongly implies that the evolution from novice to expert and that from one position to the other are somehow linked. Novices are fooled into believing X, but experts have figured out Y. Sometimes I’m sure the change in heart is legitimate and sincere, driven by increased knowledge just as it seems or perhaps by the ever-changing tradeoffs we all have to make. (I’ve done this myself, with regard to storage vs. processing networks and the balance of traffic between the two in a distributed filesystem.) Other times, I’m just as sure that someone’s change of heart is the result of deliberate persuasion. Contact might have been deliberately made, perhaps through a mutual friend. 
The strongest features, situations, and (perhaps non-obvious) future plans/directions for one alternative might have been shared, all deliberately planned but presented as innocent exchange of ideas between colleagues. Sometimes a convert can be won this way, and since nobody is ever as zealous as a new convert the result can often be advocacy of the new preference even in situations where the old preference remains objectively better. Yes, you can buy publicity like that. What you can also do is create “moles” who gain some level of notoriety within a community – it’s really not that hard with emerging technologies – and then very noisily “defect” to an opposing camp.

I don’t know for sure how prevalent this sort of manipulation is. I’ve certainly seen plenty of astroturf and FUD, I’ve seen some of the attempts to persuade “thought leaders” one way or another, but I don’t know for sure if I’ve ever actually seen a mole. What I do know is that I don’t see everything, and I’d be a fool to believe these things don’t happen. The means, motive, and opportunity are all there. There are “social media experts” who are paid – and paid quite well – to do almost exactly what I’ve described; I’m sure not all of them have 100% clean hands. I’ve yet to meet a VP of marketing at a startup who would have any qualms, who would hesitate one second, over paying someone to do these things if they thought that person had the capability. I’m not saying we should all start jumping at shadows, but if you see a prominent advocate of low-cost open-source scale-out solutions suddenly start singing the praises of a vendor who is notoriously opposed to all three features, maybe you should at least consider the possibility that their change of position is something other than a total accident.

The first rule of self-help scam artists is to tell people what they want to hear, without being too obvious about it. Tell them that there’s an easy way for them to get whatever they want, or that everything bad is somebody else’s fault, but wrap it up in convincing language that makes it seem somehow logical. One example of this that has really been driving me nuts lately is Seth Godin’s “linchpin” idea. Apparently the idea is that everyone should aspire to be a post-industrial “artist” instead of just a cog in a machine, everyone has the potential to do “emotional work” at a “high level” and thus become indispensable, etc. It’s a very positive message, but I’m just not buying it.

First, while I believe it’s not possible to predict who will rise to the challenge of becoming a linchpin, and thus that we should give everyone the chance, that doesn’t mean absolutely everyone can. Many people really are cogs in the corporate machine, are not particularly capable of being anything else, and are even happy that way – saving their creative energy for other pursuits such as family, hobbies, sports, and so on. Ninety percent of the people who proudly portray themselves as linchpins quite notably do not meet the criteria, and I’d even say that they’re less likely than average to be true linchpins because the significant time they spend on self-affirmation and self-promotion is time not spent actually doing anything that would make them real linchpins. My second objection to Godin’s idea is that just because some people can be linchpins doesn’t mean all can. It’s like every child being above average. Show me an organization where every single contributor is indispensable, and I’ll show you an organization that is guaranteed to fail as the normal course of events makes any one of them unavailable. Like it or not, only a few people in any group can be linchpins and if you want to be one then you’ll be in competition with others to find that niche.

We really need to get over the idea that every worker should be a unique and pretty snowflake to deserve a place in a high-functioning organization. Somebody has to do the things that anybody can do, and as long as that’s the case there’s a need for discipline as well as self-expression. If everybody thinks they’re leading, nobody really is. Most people are unique and special in some way, but usually not in a way that any employer/client can or should care about. Ordinary people, or people who are special in non-work-related ways, can still play a valuable and even essential role in even the most creative and innovative environments. Uniqueness is not a requirement. A beautiful snowstorm would still be beautiful even if every single snowflake looked exactly alike. Real self-help would mean teaching people how to find and function in and enjoy being in a creative environment, not just telling them that they can be among its leaders. For many of them, it’s simply not true.

KumofsFS (Jun 28)

I had a little time and energy to hack on VoldFS today, but not enough of either to get into anything really complicated. My highest priority is to get the contended-write cases working properly, and that’s likely to be a bit of a slog. I decided to do something really easy and fun, so I did. Specifically, I installed/built Kumofs (http://kumofs.sourceforge.net/) to see if my previously implemented memcached-protocol support would work with that as well as with Memcached Classic.

Well, it does. All I had to do was set VOLDFS_DB=mc and voila! The built-in unit tests worked fine, so I did my manual test: mount, copy in a directory, unmount, remount, get file sums in the original and copied directories, verify that they match. Everything was fine, so I can now say that VoldFS is a true multi-platform FUSE interface to at least two fully distributed data stores. Now I guess I’ll have to work on concurrency, performance, and features.
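The last step of that manual test (get file sums in both directories and verify that they match) could be scripted roughly as follows. This is a sketch rather than the procedure actually used: the choice of SHA-256 and the `file_sums` and `trees_match` names are assumptions, and the mount/unmount steps are left out.

```python
import hashlib
import os

def file_sums(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 1 MiB chunks so large files don't need to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            sums[os.path.relpath(full, root)] = digest.hexdigest()
    return sums

def trees_match(original, copy):
    """True if both trees hold the same relative paths with identical contents."""
    return file_sums(original) == file_sums(copy)
```

Calling `trees_match` on the source directory and on the copy inside the remounted filesystem would then report whether the data survived the round trip.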

Everyone has their own unique set of interests. Professionally, mine is distributed storage systems. I’m mostly not very interested in systems that are limited to a single machine, which is not to say that I think nobody should be interested but just that it’s not my own personal focus. I believe somewhat more strongly, and I’m sure more controversially, that “in-memory storage” is an oxymoron. Memory is part of a computational system, not a storage system, even though (obviously) even storage systems have computational elements as well. “Storage” systems that are limited to a single system’s memory are therefore doubly uninteresting to me. Any junior in a reputable computer science program should be able to whip up some sort of network service that can provide access to a single system’s memory, and any senior should be able to make it perform fairly well. Yawn. It was in this context that I read the following tweet today (links added).

what does membase do that redis can’t ? #redis #membase

Yeah, I guess you could also ask “what does membase do that iPhoto can’t” and it would make almost as much sense. They’re just fundamentally not the same thing. One is distributed and the other isn’t. I don’t mean that membase is better just because it’s distributed, by the way. It’s not clear whether it’s a real data store or just another “runs from memory with snapshots to/from disk” system targeting durability but not capacity. In fact many such systems don’t even provide real durability if they’re based on mmap/msync and thus can’t guarantee that writes occur in an order which facilitates later recovery, and by failing to make proper use of either rotating or solid-state storage they definitely fail to provide a cost-effective capacity solution. In addition to that, membase looks to me like a fairly incoherent collection of agents to paper over the gaping holes in the memcache “architecture” (e.g. rebalancing). No, I’m no particular fan of membase, but the fact that it’s distributed makes it pretty non-comparable to Redis. It might make more sense to compare it to Cassamort or Mongiak. It would make more sense still to compare it to LightCloud or kumofs, which already solved essentially the same set of problems via distribution add-ons to existing projects using the same protocol as membase. Comparing to Redis just doesn’t make sense.

But wait, I’m sure someone’s itching to say, there are sharding projects for Redis. Indeed there are, but there are two problems with saying that they make Redis into a distributed system. Firstly, adding a sharding layer to something else doesn’t make that something else distributed; it only makes the combination distributed. Gizzard can add partitioning and replication to all kinds of non-distributed data stores, but that doesn’t make them anything but non-distributed data stores. Secondly, the distribution provided by many sharding layers – and particularly those I’ve seen for Redis – is often of a fairly degenerate kind. If you don’t solve the consistency or data aggregation/dependency problems or node addition/removal problems that come with making data live on multiple machines, it’s a pretty weak distributed system. I’m not saying you have to provide full SQL left-outer-join functionality with foreign-key constraints and full ACID guarantees and partition-tolerant replication across long distances, but you can’t just slap some basic consistent hashing on top of several single-machine data stores and claim to be in the same league as some of the real distributed data stores I’ve mentioned. You need to have a reasonable level of partitioning and replication and membership-change handling integrated into the base project to be taken seriously in this realm.
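To show how little "basic consistent hashing" amounts to on its own, here is a minimal sketch of such a layer. It is not taken from any particular Redis sharding project; the `Ring` class, the use of MD5 for ring positions, and the 64 virtual nodes per server are all assumptions for illustration.

```python
import bisect
import hashlib

def _point(label):
    """Map a string to a position on the hash ring (first 8 bytes of its MD5)."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest()[:8], "big")

class Ring:
    """Bare-bones consistent-hash ring: key lookup only, nothing else."""

    def __init__(self, nodes, vnodes=64):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node gets `vnodes` points on the ring to even out the load.
        self._points = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._positions = [position for position, _node in self._points]

    def node_for(self, key):
        """Return the node owning the first ring point after the key's position."""
        idx = bisect.bisect_right(self._positions, _point(key)) % len(self._points)
        return self._points[idx][1]

def moved_keys(keys, old_nodes, new_nodes):
    """List the keys whose owner changes when membership changes."""
    old, new = Ring(old_nodes), Ring(new_nodes)
    return [key for key in keys if old.node_for(key) != new.node_for(key)]
```

The ring tells you which keys change owner when a node joins, and that is all it does. Actually migrating that data, serving reads and writes consistently while it moves, and keeping replicas in step are the parts left unsolved.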

Lest anyone think I’m setting the bar too high, consider this list of projects. That’s a year and a half old, and I count seven projects that meet the standard I’ve described. There are a few more that Richard missed, and more have appeared since then. There are already close to two dozen more-or-less mature projects in this space, not even counting things like distributed filesystems and clustered databases that still meet these criteria even if they don’t offer partition tolerance. It’s already too crowded to justify throwing every manner of non-distributed or naively-sharded system into the same category, even if they have other features in common. Redis or Terrastore, for example, are fine projects that are based on great technology and offer great value to their users, but my phone pretty much fits that description too and I don’t put it in the same category either. Let’s at least compare apples to other fruit.

A few days ago, I pushed VoldFS to GitHub. I was rather pleased to see that it then spent two days at or near the top of the “trending repos” on their front page, whatever that means. It’s hard to believe I’m even in the top thousand for views per hour/day, or that the views were still increasing at the end of that period, so I’m not sure that standing was really deserved but I still appreciate the exposure while it lasted. Last night, I pushed my first update, adding support for the memcached protocol using python-memcached. If you want to play with it using an instance of memcached running locally on the default port, you’ll need the most recent version of memcache.py which supports the “cas” operation (which is terribly misnamed because it’s not a compare and swap of contents at all but rather a conditional put based on version numbers). Anyway, if you have that then all you need is:

$ export VOLDFS_DB=mc
$ ./mkfs.py
$ ./voldfs.py …
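As for why “cas” is a conditional put on version numbers rather than a compare and swap of contents, a toy in-memory model may help. This is not python-memcached's API or memcached's implementation; the `VersionedStore` class is hypothetical, and only its `gets`/`cas` method names are borrowed from the protocol's commands.

```python
class VersionedStore:
    """Toy model of version-checked writes in the style of memcached gets/cas."""

    def __init__(self):
        self._data = {}          # key -> (value, version)
        self._next_version = 1

    def _store(self, key, value):
        # Every successful write gets a fresh version, even if the value is unchanged.
        self._data[key] = (value, self._next_version)
        self._next_version += 1

    def set(self, key, value):
        """Unconditional write."""
        self._store(key, value)

    def gets(self, key):
        """Return (value, version), or (None, None) if the key is absent."""
        return self._data.get(key, (None, None))

    def cas(self, key, value, version):
        """Write only if nobody has written the key since `version` was read."""
        current = self._data.get(key)
        if current is None or current[1] != version:
            return False
        self._store(key, value)
        return True
```

If one client reads a value with `gets` and another client then rewrites the key with identical contents, the first client's `cas` still fails, because the version moved even though the bytes did not.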

The point of adding this support is actually nothing to do with memcached as we all know and love it – in that “look at the cute little kid trying to act all grown up” kind of way. It’s a common protocol for other things as well, including the Gear6 and Northscale commercial memcache appliances as well as projects like kumofs (which is the alternative that led me to explore this). Supporting the memcached protocol means that VoldFS can provide a filesystem interface across any of several underlying technologies, expanding the potential user base greatly. I might well add support for other data stores as well at some point.