Canned Platypus

Making the world better, one byte at a time.

Archive for the ‘systems’ Category

In my last post, I described several common data-loss scenarios and took people to task for what I feel is a very unbalanced view of the problem space. It would be entirely fair for someone to say that it would be even more constructive for me to explain some ways to avoid those problems, so here goes.

One of the most popular approaches to ensuring data protection is immutable and/or append-only files, using ideas that often go back to Seltzer et al’s log structured filesystem paper in 1993. One key justification for that seminal project was the observation that operating-system buffer/page caches absorb most reads, so the access pattern as it hits the filesystem is write-dominated and that’s the case for which the filesystem should be optimized. We’ll get back to that point in a moment. In such a log-oriented approach, writes are handled as simple appends to the latest in a series of logs. Usually, the size of a single log file is capped, and when one log file fills up another is started. When there are enough log files, old ones are combined or retired based on whether they contain updates that are still considered relevant – a process called compaction in several current projects, but also known by other names in other contexts. Reads are handled by searching through the accumulated logs for updates which overlap with what the user requested. Done naively, this could take linear time relative to the number of log entries present, so in practice the read path is often heavily optimized using Bloom filters and other techniques so it can actually be quite efficient. This leads me to a couple of tangential observations about how such solutions are neither as novel nor as complete as some of their more strident champions would have you believe.

  • The general outline described above is pretty much exactly what Steven LeBrun and I came up with in 2003/2004, to handle “timeline” data in Revivio’s continuous data protection system. This predates the publication of details about Dynamo in 2007, and therefore all of Dynamo’s currently-popular descendants as well.
  • Some people seem to act as though immutable files are always and everywhere superior to update-in-place solutions (including soft updates or COW), apparently unaware that they’re just making the complexity of update-in-place Somebody Else’s Problem. When you’re creating and deleting all those immutable files within a finite pool of mutable disk blocks, somebody else – i.e. the filesystem – has to handle all of the space reclamation/reuse issues for you, and they do so with update-in-place.

Despite those caveats, the log-oriented approach can be totally awesome and designers should generally consider it first especially when lookups are by a single key in a flat namespace. You could theoretically handle multiple keys by creating separate sets of Bloom filters etc. for each key, but that can quickly become unwieldy. It also makes writes less efficient, and – as noted previously – write efficiency is one of the key justifications for this approach in the first place. At some point, or for some situations, a different solution might be called for.

The other common approach to data protection is copy on write or COW (as represented by WAFL, ZFS, or btrfs) or its close cousin soft updates. In these approaches, blocks are updated in place, but with very careful attention paid to where and/or when individual block updates actually hit disk. Most commonly, all blocks are either explicitly or implicitly related as parts of a tree. Updates occur from leaves to root, copying old blocks into newly allocated space and then modifying the new copies. Ultimately all of this new space is spliced into the filesystem with an atomic update at the root – the superblock in a filesystem. It’s contention either at the root or on the way up to it that accounts for much of the complexity in such systems, and for many of the differences between them. The soft-update approach diverges from this model by doing more updates in place instead of into newly allocated space, avoiding the issue of contention at the root but requiring even more careful attention to write ordering. Here are a few more notes.

  • When writes are into newly allocated space, and the allocator generally allocates seqential blocks, the at-disk access pattern can be strongly sequential just as with the more explicitly log-oriented approach.
  • The COW approach lends itself to very efficient snapshots, because each successive version of the superblock (or equivalent) represents a whole state of the filesystem at some point in time. Garbage collection becomes quite complicated as a result, but the complexity seems well worth it.
  • There’s a very important optimization that can be made sometimes when a write is wholly contained within a single already-allocated block. In this case, that one block can simply be updated in place and you can skip a lot of the toward-the-root rigamarole. I should apply this technique to VoldFS. Unfortunately, it doesn’t apply if you have to update mtime or if you’re at a level where “torn writes” (something I forgot to mention in my “how to lose data” post) are a concern.

It’s worth noting also that, especially in a distributed environment, these approaches can be combined. For example, VoldFS itself uses a COW approach but most of the actual or candidate data stores from which it allocates its blocks are themselves more log-oriented. As always it’s horses for courses, and different systems – or even different parts of the same system – might be best served by different approaches. That’s why I thought it was worth describing multiple alternatives and the tradeoffs between them.

As I mentioned in my last post, I’ve been getting increasingly annoyed at a lot of the flak that has been directed toward MongoDB over data-protection issues. I’m certainly no big fan of systems that treat memory as primary storage (with or without periodic flushes to disk) instead of a cache or buffer for the real thing. I’ve written enough here to back that up, but I’ve also written plenty about something that bugs me even more: FUD. Merely raising an issue isn’t FUD, but the volume and tone and repetition of the criticism are all totally out of proportion when there are so many other data-protection issues we should also worry about. Here are just a few ways to lose data.

  • Don’t provide full redundancy at all levels of your system. It’s amazing how many “distributed” systems out there aren’t really distributed at all, leaving users entirely vulnerable to loss or extended unreachability of a single node, without one peep of protest from the people who are so quick to point the finger at systems which can at least survive that most-common failure mode.
  • Be careless about non-battery-backed disk caches. If data gets stranded in a disk cache when the power goes out, it’s no different than if it was stranded in memory, and yet many projects do absolutely nothing to detect let alone correct for obvious problems in this area.
  • Be careless about data ordering in the kernel. My colleagues who work on local filesystems and pieces of the block-device subsystem in Linux (and others working on other OSes) have done a great deal of too-little-appreciated work to provide the very highest levels of data safety that they can without sacrificing any more performance than necessary. Then folks who preach the virtues of append-only files without knowing anything at all about how they work turn around and subvert all that effort by giving mount-command and fstab-line examples that explicitly put filesystems into async mode, turn off barriers, etc.
  • A special case of the previous point is when people actually do seem to know the options that assure data protection, but forego those options for the sake of getting better benchmark numbers. That’s simply dishonest. You can’t claim great performance and great data protection if users can only really get one or the other depending on which options they choose. Pick one, and shut up about the other.
  • Be careless about your own data ordering. A single I/O operation can require several block-level updates. Many overlapping operations can create a huge bucket of such updates, conflicting in complex ways and requiring very careful attention to the order in which the updates actually occur. If you screw it up just once, and it takes a special brand of arrogance to believe that could never happen to you, then you corrupt data. If you corrupt metadata, you might well lose the user data it points to. If you corrupt user data that can be even worse than losing it, because there are security implications as well. It’s not nice when some of your confidential data becomes part of somebody else’s file/document/whatever. At least with mmap-based approaches, it’s fairly straightforward to do things with msync and fork and hypervisor/filesystem/LVM snapshots to at least guarantee that the state on disk remains consistent even if it’s not absolutely current.
  • Don’t provide any reasonable way to take a backup, which would protect against the nightmare scenario where data is lost not because of a hardware failure but because of a bug or user error that makes your internal redundancy irrelevant.

Of course, some of these issues won’t apply to Your Favorite Data Store, e.g. if it doesn’t have a hierarchical data model or a concept of multiple users. Then again, the list is also incomplete because the real point I’m making is that there are plenty of data-protection pitfalls and plenty of people falling into them. Some of the loudest complainers already had to suspend their FUD campaign to deal with their own data-corruption fiasco. Others are vulnerable to having the same thing happen – I can tell by looking at their designs or code – but those particular chickens haven’t come home to roost yet.

Look, I laughed at the “redundant coffee mug” joke too. It was funny at the time, but that was a while ago. Since then it’s been looking more and more like junior-high-school cliquishness, poking fun at a common target as a way to fit in with the herd. It’s not helping users, it’s not advancing the state of the art, and it’s actively harming the community. As one of the worst offenders once had the gall to tell me, be part of the solution. Find and fix new data-protection issues in whichever projects have them, instead of going on and on about the one everybody already recognizes.

Pomegranate is a new distributed filesystem, apparently oriented toward serving many small files efficiently (thanks to @al3xandru for the link). Here are some fairly disconnected thoughts/impressions.

  • The HS article says that “Pomegranate should be the first file system that is built over tabular storage” but that’s not really accurate. For one thing, Pomegranate is only partially based on tabular storage for metadata, and relies on another distributed filesystem – Lustre is mentioned several times – for bulk data access. I’d say Ceph is more truly based on tabular storage (RADOS) and it’s far more mature than Pomegranate. I also feel a need to mention my own CassFS and VoldFS, and Artur Bergman’s RiakFuse, as filesystems that are completely based on tabular storage. They’re not fully mature production-ready systems, but they are counterexamples to the original claim.
  • One way of looking at Pomegranate is that they’ve essentially replaced the metadata layer from Lustre/PVFS/Ceph/pNFS with their own while continuing to rely on the underlying DFS for data. Perhaps this makes Pomegranate more of a meta-filesystem or filesystem sharding/caching layer than a full filesystem in and of itself, but there’s nothing wrong with that just as there’s nothing wrong with similar sharding/caching layers for databases. Compared to Lustre, this is a significant step forward since Pomegranate’s metadata is fully distributed. Compared to Ceph, though, it’s not so clearly innovative. Ceph already has a distributed metadata layer, based on advanced distribution algorithms to distribute load etc. Pomegranate’s use of ring-based consistent hashing suits my own preference a little better than Ceph’s tree-based approach (CRUSH), but there are many kinds of ring-based hashing and it looks like Pomegranate won’t really catch up to Ceph in this regard until their scheme is tweaked a few times.
  • I’m really not wild about the whole “in-memory architecture” thing. If your update didn’t make it to disk because it was at the end of the in-memory queue and hadn’t been flushed yet, that’s no better for reliability than if you just left it in memory for ever (though it does improve capacity) and if you acknowledged the write as complete then you lied to the user. Prompted by some of the hyper-critical and hypocritical comments I’ve seen lately bashing one project for lack of durability, I have another blog post I’m working on about how the critics’ own toys can lose or corrupt data, and how claiming superior durability while using “unsafe” settings for benchmarks is dishonest, so I’ll defer most of that conversation for now. Suffice it to say that if I were to deploy Pomegranate in production one of the first things I’d do would be to force the cache to be properly write-through instead of write-back.
  • I can see how the Pomegranate scheme efficiently supports looking up a single file among billions, even in one directory (though the actual efficacy of the approach seems unproven). What’s less clear is how well it handles listing all those files, which is kind of a separate problem similar to range queries in a distributed K/V store. This is something I spent a lot of time pondering for VoldFS, and I’m rather proud of the solution I came up with. I think that solution might be applicable to Pomegranate as well, but need to investigate further. Can Ma, if you read this, I’d love to brainstorm further on this.
  • Another thing I wonder about is the scalability of Pomegranate’s approach to complex operations like rename. There’s some mention of a “reliable multisite update service” but without details it’s hard to reason further. This is a very important issue because this is exactly where several efforts to distribute metadata in other projects – notably Lustre – have foundered. It’s a very very hard problem, so if one’s goal is to create something “worthy for [the] file system community” then this would be a great area to explore further.

Some of those points might seem like criticism, but they’re not intended that way – or at least they’re intended as constructive criticism. They’re things I’m curious about, because I know they’re both difficult and under-appreciated by those outside the filesystem community, and they’re questions I couldn’t answer from a cursory examination of the available material. I hope to examine and discuss these issues further, because Pomegranate really does look like an interesting and welcome addition to this space.

I’ve been on vacation for the last few days, and while I was (mostly) gone a few interesting things seem to have happened here on the blog. The first is that, after a totally unremarkable first week, my article It’s Faster Because It’s C suddenly had a huge surge in popularity. In a single day it has become my most popular post ever, more than 2x its nearest competitor, and it seems to have spawned a couple of interesting threads on Hacker News and Reddit as well. I’m rather amused that the “see, you can use Java for high-performance code” and the “see, you can’t…” camps seem about evenly matched. Some people seem to have missed the point in even more epic fashion, such as by posting totally useless results from trivial “tests” where process startup dominates the result and the C version predictably fares better, but overall the conversations have been interesting and enlightening. One particularly significant point several have made is that a program doesn’t have to be CPU-bound to benefit from being written in C, and that many memory-bound programs have that characteristic as well. I don’t think it changes my main point, because memory-bound programs were only one category where I claimed a switch to C wouldn’t be likely to help. Also, programs that store or cache enough data to be memory-bound will continue to store and cache lots of data in any language. They might hit the memory wall a bit later, but not enough to change the fundamental dynamics of balancing implementation vs. design or human cycles vs. machine cycles. Still, it’s a good point and if I were to write a second version of the article I’d probably change things a bit to reflect this observation.

(Side point about premature optimization: even though this article has been getting more traffic than most bloggers will ever see, my plain-vanilla WordPress installation on budget-oriented GlowHost seems to have handled it just fine. Clearly, any time spent hyper-optimizing the site would have been wasted.)

As gratifying as that traffic burst was, though, I was even more pleased to see that Dan Weinreb also posted his article about the CAP Theorem. This one was much less of a surprise, not only because he cites my own article on the same topic but also because we’d had a pretty lengthy email exchange about it. In fact, one part of that conversation – the observation that the C in ACID and the C in CAP are not the same – had already been repeated a few times and taken on a bit of a life of its own. I highly recommend that people go read Dan’s post, and encourage him to write more. The implications of CAP for system designers are subtle, impossible to grasp from reading only second-hand explanations – most emphatically including mine! – and every contribution to our collective understanding of it is valuable.

That brings us to what ties these two articles together – besides the obvious opportunity for me to brag about all the traffic and linkage I’m getting. (Hey, I admit that I’m proud of that.) The underlying theme is dialog. Had I kept my thoughts on these subjects to myself or discussed them only with my immediate circle of friends/colleagues, or had Dan done so, or had any of the re-posters and commenters anywhere, we all would have missed an opportunity to learn together. It’s the open-source approach to learning – noisy and messy and sometimes seriously counter-productive, to be sure, but ultimately leading to something better than the “old way” of limited communication in smaller circles. Everyone get out there and write about what interests you. You never know what the result might be, and that’s the best part.

(Dedication: to my mother, who did much to teach me about writing and even more about the importance of seeing oneself as a writer.)

Jun
28
KumofsFS

I had a little time and energy to hack on VoldFS today, but not enough of either to get into anything really complicated. My highest priority is to get the contended-write cases working properly, and that’s likely to be a bit of a slog. I decided to do something really easy and fun, so I did. Specifically, I installed/built Kumofs to see if my previously implemented memcached-protocol support would work with that http://kumofs.sourceforge.net/I as well as with Memcached Classic.

Well, it does. All I had to do was set VOLDFS_DB=mc and voila! The built-in unit tests worked fine, so I did my manual test: mount, copy in a directory, unmount, remount, get file sums in the original and copied directories, verify that they match. Everything was fine, so I can now say that VoldFS is a true multi-platform FUSE interface to at least two fully distributed data stores. Now I guess I’ll have to work on concurrency, performance, and features.

Everyone has their own unique set of interests. Professionally, mine is distributed storage systems. I’m mostly not very interested in systems that are limited to a single machine, which is not to say that I think nobody should be interested but just that it’s not my own personal focus. I believe somewhat more strongly, and I’m sure more controversially, that “in-memory storage” is an oxymoron. Memory is part of a computational system, not a storage system, even though (obviously) even storage systems have computational elements as well. “Storage” systems that are limited to a single system’s memory are therefore doubly uninteresting to me. Any junior in a reputable computer science program should be able to whip up some sort of network service that can provide access to a single system’s memory, and any senior should be able to make it perform fairly well. Yawn. It was in this context that I read the following tweet today (links added).

what does membase do that redis can’t ? #redis #membase

Yeah, I guess you could also ask “what does membase do that iPhoto can’t” and it would make almost as much sense. They’re just fundamentally not the same thing. One is distributed and the other isn’t. I don’t mean that membase is better just because it’s distributed, by the way. It’s not clear whether it’s a real data store or just another “runs from memory with snapshots to/from disk” system targeting durability but not capacity. In fact many such systems don’t even provide real durability if they’re based on mmap/msync and thus can’t guarantee that writes occur in an order which facilitates later recovery, and by failing to make proper use of either rotating or solid-state storage they definitely fail to provide a cost-effective capacity solution. In addition to that, membase looks to me like a fairly incoherent collection of agents to paper over the gaping holes in the memcache “architecture” (e.g. rebalancing). No, I’m no particular fan of membase, but the fact that it’s distributed makes it pretty non-comparable to Redis. It might make more sense to compare it to Cassamort or Mongiak. It would make more sense still to compare it to LightCloud or kumofs, which already solved essentially the same set of problems via distribution add-ons to existing projects using the same protocol as membase. Comparing to Redis just doesn’t make sense.

But wait, I’m sure someone’s itching to say, there are sharding projects for Redis. Indeed there are, but there are two problems with saying that they make Redis into a distributed system Firstly, adding a sharding layer to something else doesn’t make that something else distributed; it only makes the combination distributed. Gizzard can add partitioning and replication to all kinds of non-distributed data stores, but that doesn’t make them anything but non-distributed data stores. Secondly, the distribution provided by many sharding layers – and particularly those I’ve seen for Redis – is often of a fairly degenerate kind. If you don’t solve the consistency or data aggregation/dependency problems or node addition/removal problems that come with making data live on multiple machines, it’s a pretty weak distributed system. I’m not saying you have to provide full SQL left-outer-join functionality with foreign-key constraints and full ACID guarantees and partition-tolerant replication across long distances, but you can’t just slap some basic consistent hashing on top of several single-machine data stores and claim to be in the same league as some of the real distributed data stores I’ve mentioned. You need to have a reasonable level of partitioning and replication and membership-change handling integrated into the base project to be taken seriously in this realm.

Lest anyone think I’m setting the bar too high, consider this list of projects. That’s a year and a half old, and I count seven projects that meet the standard I’ve described. There are a few more that Richard missed, and more have appeared since then. There are already close to two dozen more-or-less mature projects in this space, not even counting things like distributed filesystems and clustered databases that still meet these criteria even if they don’t offer partition tolerance. It’s already too crowded to justify throwing every manner of non-distributed or naively-sharded system into the same category, even if they have other features in common. Redis or Terrastore, for example, are fine projects that are based on great technology and offer great value to their users, but my phone pretty much fits that description too and I don’t put it in the same category either. Let’s at least compare apples to other fruit.

A few days ago, I pushed VoldFS to GitHub. I was rather pleased to see that it then spent two days at or near the top of the “trending repos” on their front page, whatever that means. It’s hard to believe I’m even in the top thousand for views per hour/day, or that the views were still increasing at the end of that period, so I’m not sure that standing was really deserved but I still appreciate the exposure while it lasted. Last night, I pushed my first update, adding support for the memcached protocol using python-memcached. If you want to play with it using an instance of memcached running locally on the default port, you’ll need the most recent version of memcache.py which supports the “cas” operation (which is terribly misnamed because it’s not a compare and swap of contents at all but rather a conditional put based on version numbers). Anyway, if you have that then all you need is:

$ export VOLDFS_DB=mc
$ ./mkfs.py
$ ./voldfs.py …

The point of adding this support is actually nothing to do with memcached as we all know and love it – in that “look at the cute little kid trying to act all grown up” kind of way. It’s a common protocol for other things as well, including the Gear6 and Northscale commercial memcache appliances as well as projects like kumofs (which is the alternative that led me to explore this). Supporting the memcached protocol means that VoldFS can provide a filesystem interface across any of several underlying technologies, expanding the potential user base greatly. I might well add support for other data stores as well at some point.

Jun
16
VoldFS

Those of you who follow me on Twitter have probably seen me mention VoldFS. It’s my latest spare-time coding project, in which I’m using FUSE (and Python) to implement a filesystem interface on top of Voldemort. To satisfy the curious, here’s a first cut at a FAQ.
Read the rest of this entry »

Someone on Slashdot asked how to do filesystem snapshots on Linux. Many respondents pointed out somewhat reasonably that the current setup with ext3 on raw disks didn’t support that, and that the poster should migrate onto another filesystem or LVM to get that functionality, but nobody seemed to have much to say about how to do that initial migration non-disruptively. I’ve had to do filesystem migrations involving many terabytes and many millions of files, and it’s a non-trivial exercise. There are a lot of ways to get it wrong and ruin your data. Based on things I’ve done since then, here’s the approach I’d investigate first if I had to do it now.

One of the less obvious tricks I’ve learned to do with GlusterFS is to run it all on one machine. The design very deliberately lets you stack “translators” on top of one another in all sorts of arbitrary ways, and the network protocol modules use the same translator interface as well. I often run the “distribute” translator, or those I’ve written, directly on top of the “storage/posix” local-filesystem translator. It works fine, and it’s much more convenient for development than having to run across machines. GlusterFS also has a replication translator, and one of the functions it necessarily provides is “self-heal” to rebuild directories and files on a failed (and possibly replaced) sub-volume. Do you see where I’m going with this? You can set up an old local filesystem and a new (empty) local filesystem as a replica pair under GlusterFS, and then poke it to “self-heal” all the files from old to new while the filesystem is live and in active use. GlusterFS doesn’t care that the two filesystems might be of different types (e.g. ext3 vs. btrfs) and/or using different kinds of storage (e.g. raw devices vs. LVM) so long as they both support POSIX locks and extended attributes. All the while it keeps track of operations in progress to the composite filesystem so this activity is effectively transparent to users who just see essentially the same filesystem they always saw. When you’re done, you just take GlusterFS back out of the picture and mount the new fully-populated filesystem directly. Here’s a configuration file to do just what I’ve described. It takes an existing directory at /root/m_old and combines that with an empty directory at /root/m_new to create a replica pair. Here are the commands to mount it and force self-heal.

mount -f /usr/etc/glusterfs/migrate.vol /mnt/migrate
ls -aR /mnt/migrate

I should warn people that I’ve only done a very basic sanity test on this. It seems to work as expected for a non-trivial but still small directory tree, but you’d certainly want to test it more thoroughly before using it to migrate production data (and of course you should absolutely make sure you have a backup that works before you attempt any such thing). There are a couple of non-obvious things about the configuration that I should also point out.

  • Both filesystems should, as mentioned previously, support POSIX locks and extended attributes. You need to load the features/locks translator to use the former.
  • For some reason this doesn’t seem to work without the server and client modules involved. This hasn’t been the case in my experience with other composite translators, and it shouldn’t be necessary here either, but at least the networking is all local so it’s not too terrible.

I’m sure there are other live-migration approaches that should work just as well if not better. I suspect there’s at least on approach using union mounts, for example. There are also a lot of approaches I can think of that would fall prey to subtle issues involving links (or other things that aren’t plain files), sparse files, and so on. It’s a lot easier to suggest an answer than to implement one that actually works. I’ve even thought (since SiCortex) of writing a FUSE filesystem specifically to do this kind of migration, but it would require a significant effort. This seems like an easy and fairly robust way to do it using tools that already exist.

Here’s the product page. It should certainly look very familiar to my ex-SiCortex readers, and some of the commentary from James Hamilton even more so.

Four of these modules will . . . deliver more work done joule, work done per dollar, and more work done per rack than the more standard approaches currently on the market.

Short version: up to 512 servers in 10U, each a 1.6GHz Atom processor with 2GB (non-ECC) memory, connected in a three-dimensional torus with 1.2Tb/s aggregate bandwidth, and virtualized I/O. Computationally it’s a bit denser than the SiCortex systems at 3.3e12 instructions/sec in a rack vs. 4.1e12 in an SC5832 which was two racks wide and deeper as well. In terms of power the density is about the same – 2kW/unit or 8kW/rack vs. 20kW for the SC5832. 1.2Tb/s works out to only 20Gb/s per eight-processor board (vs. 48Gb/s per processor for the SC5832) and with that topology the average hop count for the 128-node box is likely to be higher than for the entire SC5832 (with no mention of a way to connect that fabric across all four boxes in a rack) so for communication-intensive workloads it’s likely to run into a few problems. On the other hand, the SeaMicro I/O virtualization seems a bit more robust than what we had. Sure, we had scethernet, but that was emulation rather than virtualization (we explicitly avoided the “virtual NIC” approach) and we had nothing similar for storage. And of course it’s x86, so all the n00bs who can’t comprehend there even being another architecture with different performance tradeoffs can just plop their code on it and go. That plus the I/O virtualization might mean that even things like operating systems can run unaware that this is something physically different than what they’re used to.

Overall, there might be a few areas (especially interconnect) where the SM10K might seem less sexy than the SC5832 was, but there are also some (I/O virtualization) where it takes things a bit further and it’s clearly better positioned for adoption by a much larger target market. It’s a very interesting and welcome development. It will be particularly interesting to see how things play out between them and Smooth Stone with their ARM-based architecture.

If anybody from either of those companies is reading this, and looking for a parallel/distributed filesystem (or other data store) to run on such systems, let me know. I might be able to help. ;)