This Is Competition?

As I’m sure you’ve all noticed by now, I’ve become a bit sensitive about people bashing GlusterFS performance. It’s really hard to make even common workloads run well when everything has to go over a network. It’s impossible to make all workloads run well in that environment, and when people blame me for the speed of light I get a bit grouchy. There are a couple of alternatives that I’ve gotten particularly tired of hearing about, not because I fear losing in a fair fight but because I feel that their reality doesn’t match their hype. Either they’re doing things that I think a storage system shouldn’t do, or they don’t actually perform all that well, or both. When I found out that I could get my hands on some systems with distributed block storage based on one of these alternatives, it didn’t take me long to give it a try.

The first thing I did was check out the basic performance of the systems, without even touching the new distributed block storage. I was rather pleased to see that my standard torture test (random synchronous 4KB writes) would ramp very smoothly and consistently up to 25K IOPS. That’s more than respectable. That’s darn good – better IOPS/$ than I saw from any of the alternatives I mentioned last time. So I spun up some of the distributed stuff and ran my tests with high hopes.
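
For anyone who wants to reproduce something in the same spirit, a minimal sketch of that kind of test looks like the following. This isn't the exact harness behind these numbers; the path, file size, and run length are placeholders, and real runs ramp up the number of client threads.

    # Minimal sketch of a random synchronous-write test (not the actual harness
    # behind the numbers above; path, file size, and duration are placeholders).
    import os, random, time

    PATH = "/data/testfile"        # hypothetical target on the storage under test
    FILE_SIZE = 1 << 30            # 1GB working set
    BLOCK = 4096                   # 4KB writes
    DURATION = 60                  # seconds per run

    def run():
        fd = os.open(PATH, os.O_RDWR | os.O_CREAT | os.O_SYNC, 0o644)
        os.ftruncate(fd, FILE_SIZE)
        buf = os.urandom(BLOCK)
        blocks = FILE_SIZE // BLOCK
        done = 0
        start = time.time()
        while time.time() - start < DURATION:
            # random offset; O_SYNC means each write is durable before it returns
            os.pwrite(fd, buf, random.randrange(blocks) * BLOCK)
            done += 1
        os.close(fd)
        print("IOPS: %.0f" % (done / (time.time() - start)))

    if __name__ == "__main__":
        run()

The buffered sequential test mentioned below is the same idea with O_SYNC dropped, 64KB blocks, and sequential offsets.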

[figure: synchronous IOPS]

Ouch. That’s not totally awful, but it’s not particularly good and it’s not particularly consistent. Certainly not something I’d position as high-performance storage. At the higher thread counts it gets into a range that I wouldn’t be too terribly ashamed of for a distributed filesystem, but remember that this is block storage. There’s a local filesystem at each end, but the communication over the wire is all about blocks. It’s also directly integrated into the virtualization code, which should minimize context switches and copies. Thinking that the infrastructure just might not be handling the hard cases well, I tried throwing an easier test at it – sequential buffered 64KB writes.

[figure: buffered IOPS]

WTF? That’s even worse than the synchronous result! You can’t see it at this scale, but some of those lower numbers are single digit IOPS. I did the test three times, because I couldn’t believe my eyes, then went back and did the same for the synchronous test. I’m not sure whether the consistent parts (such as the nose-dive from 16 to 18 threads all three times) or the inconsistent parts bother me more. That’s beyond disappointing; it’s beyond bad; it’s into shameful territory for everyone involved. Remember 25K IOPS for this hardware using local disks? Now even the one decent sample can’t reach a tenth of that, and that one sample stands out quite a bit from all the rest. Who would pay one penny more for so much less?

Yes, I feel better now. The next time someone mentions this particular alternative and says we should be more like them, I’ll show them how the lab darling fared in the real world. That’s a load off my mind.

Rackspace Block Storage

A while ago, Rackspace announced their own block storage. I hesitate to say it’s equivalent to Amazon’s EBS, them being competitors and all, but that’s the quickest way to explain what it is/does. I thought the feature itself was long overdue, and the performance looked pretty good, so I said so on Twitter. I also resolved to give it a try, which I was finally able to do last night. Here are some observations.

  • Block storage is only available through their “next generation” (OpenStack based) cloud, and it’s clearly a young product. Attaching block devices to a server often took a disturbingly long time, during which the web interface would often show stale state. Detaching was even worse, and in one case took a support ticket and several hours before a developer could get it unstuck. If I didn’t already have experience with Rackspace’s excellent support folks, this might have been enough to make me wander off.
  • Even before I actually got to the block storage, I was pretty impressed with the I/O performance of the next-gen servers themselves. In my standard random-sync-write test, I was seeing over 8000 4KB IOPS. That’s a kind of weird number, clearly well beyond the typical handful of local disks but pretty low for SSD. In any case, it’s not bad for instance storage.
  • After seeing how well the instance storage did, I was pretty disappointed by the block storage I’d come to see. With that, I was barely able to get beyond 5000 IOPS, and it didn’t seem to make any difference at all if I was using SATA- or SSD-backed block storage. Those are still respectable numbers at $15/month for a minimum 100GB volume. Just for comparison, at Amazon’s prices that would get you a 25-IOPS EBS volume of the same size. Twenty-five, no typo. With the Rackspace version you also get a volume that you can reattach to a different server, while in the Amazon model the only way to get this kind of performance is with storage that’s permanently part of one instance (ditto for Storm on Demand).
  • Just for fun, I ran GlusterFS on these systems too. I used a replicated setup for comparison to previous results, getting up to 2400 IOPS vs. over 4000 for Amazon and over 5000 for Storm on Demand. To be honest, I think these numbers mostly reflect the providers’ networks rather than their storage. Three years ago when I was testing NoSQL systems, I noticed that Amazon’s network seemed much better than their competitors’ and that more than made up for a relative deficit in disk I/O. It seems like little has changed.

The bottom line is that Rackspace’s block storage is interesting, but perhaps not enough to displace others in this segment. Let’s take a look at IOPS per dollar for a two-node replicated GlusterFS configuration.

  • Amazon EBS: 1000 IOPS (provisioned) for $225/month or 4.4 IOPS/$ (server not included)
  • Amazon SSD: 4300 IOPS for $4464/month or 1.0 IOPS/$ (that’s pathetic)
  • Storm on Demand SSD: 5500 IOPS for $590/month or 9.3 IOPS/$
  • Rackspace instance storage: 3400 IOPS for $692/month (8GB instances) or 4.9 IOPS/$
  • Rackspace with 4x block storage per server: 9600 IOPS for $811/month or 11.8 IOPS/$ (hypothetical, assuming neither CPU nor network becomes a bottleneck)
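
The arithmetic behind those figures is nothing fancier than monthly IOPS divided by monthly cost; here’s a quick check using only the numbers in the list above.

    # Sanity-check the IOPS-per-dollar figures above (numbers copied from the list).
    configs = {
        "Amazon EBS (provisioned)":   (1000, 225),
        "Amazon SSD":                 (4300, 4464),
        "Storm on Demand SSD":        (5500, 590),
        "Rackspace instance storage": (3400, 692),
        "Rackspace 4x block storage": (9600, 811),
    }
    for name, (iops, dollars_per_month) in configs.items():
        print("%-28s %5.1f IOPS/$" % (name, iops / dollars_per_month))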

Some time I’ll have to go back and actually test that last configuration, because I seriously doubt that the results would really be anywhere near that good and I suspect Storm would still remain on top. Maybe if the SSD volumes were really faster than the SATA volumes, which just didn’t seem to be the case when I tried them, things would be different. I should also test some other less-known providers such as CloudSigma or CleverKite, which also offer SSD instances at what seem to be competitive prices (though after Storm I’m wary of providers who do monthly billing with “credits” for unused time instead of true hourly billing).

Pulling Up Weeds

Some software projects are obviously hard. Nobody thinks writing a compiler or an operating system, especially one comparable to existing production-grade examples, will be easy. Other projects seem to be easy until you get into them. Unfortunately, distributed storage is one of those. It’s really not that hard to put together some basic server/connection management and some consistent hashing, add a simple REST interface, and have a pretty useful type of distributed object store. The problem is that some people do that and then claim to have reinvented Swift, or to have invented something that’s even better than GlusterFS because it’s simpler. Um, no. A real distributed storage system has to go well beyond those simple requirements, and I’m not even talking about all of the complexity imposed by POSIX. I’ve beaten the NoPOSIX drum a few times, calling for simplification of semantics for distributed filesystems. I have nothing against object stores like Swift, which simplify even further. However, NoPOSIX and NoFS and NoSQL don’t mean giving up all requirements and expectations altogether. A minimum standard still includes things like basic security, handling disk-full or node-failure errors gracefully, automating the process of adding/removing/rearranging servers (preferably without downtime), and so on. That complexity is there for a reason. Comparing something that has these features to something that doesn’t isn’t just incorrect. It’s dangerous. Users are even less capable than fellow developers of evaluating claims about such systems. Overstating your capabilities increases risk that users will choose systems which aren’t really ready for serious use, and might even cost them their valuable data.
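
Just to be concrete about how low that bar is, a toy consistent-hashing placement layer (the heart of one of those weekend object stores) fits in a couple dozen lines. This is an illustrative sketch, not code from any system mentioned here.

    # Toy consistent-hash ring: the easy part of a distributed object store.
    # Illustrative only; no replication, failure handling, rebalancing, or
    # security, which is exactly the point being made above.
    import bisect, hashlib

    class Ring:
        def __init__(self, nodes, vnodes=64):
            self._ring = []                     # sorted list of (hash, node)
            for node in nodes:
                for i in range(vnodes):
                    h = self._hash("%s-%d" % (node, i))
                    self._ring.append((h, node))
            self._ring.sort()
            self._keys = [h for h, _ in self._ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, key):
            # First virtual node clockwise from the key's hash owns the key.
            i = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
            return self._ring[i][1]

    ring = Ring(["server-a", "server-b", "server-c"])
    print(ring.node_for("some/object/key"))

Everything the previous paragraph calls a minimum standard is what’s missing from a toy like this, and it’s where all the real work goes.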

What brought all of this to mind was some recent Quora spamming by the author of Weed-FS, claiming that it would perform better than existing systems because of its great simplicity. In some ways it’s unfair to pick on Weed-FS specifically, but it represents a general category of “existing data stores are too complex” and “we invented a better data store” BS that I’ve been seeing entirely too much lately. Also, I kind of promised/threatened to run some performance tests myself if the author was too lazy or scared, so here we go.

STRIKE 1: no real packaging. I fired up a Rackspace cloud server, and went to see if I could install Weed-FS on it. No such luck. The only build packages are for Windows/Darwin/Linux amd64, but that’s a relative rarity on cloud services, so I cloned the source tree and tried to build from that. Too bad there are no makefiles. Apparently the author builds using Eclipse, and didn’t bother including any of that build information in the source tree. Nonetheless, it only took me a few minutes to figure out the correct build order and build the single executable using gccgo.

STRIKE 2: barely usable interface. Unlike most object-store interfaces, Weed-FS has no buckets/containers/directories and insists on assigning its own keys to objects. Therefore you can’t use keys that are meaningful to you; you have to use theirs and store the mapping in some other kind of database. There also seems to be no enumeration function (I guess we don’t need to bother measuring that kind of performance) so if you ever lose the mapping between your own key and theirs then you’ll never find your data again. Similarly, there are no functions to get/set metadata on objects, so there’s pretty much no way to use Weed-FS except by pairing it with a database and wrapping a library around the whole thing. Oh, and there’s no delete either. Too bad if you ever want to reclaim any space.
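
What that pairing looks like in practice is roughly this. It’s a sketch, not production code, and the HTTP details (a /dir/assign call on the master returning a fid and a volume-server URL, then an upload to that server) are from the Weed-FS docs as I remember them, so treat them as assumptions.

    # The wrapper you end up writing anyway: Weed-FS hands out its own keys (fids),
    # so a side database has to remember which fid belongs to which of *your* names.
    # Endpoint and JSON field names are assumptions based on the Weed-FS docs.
    import sqlite3
    import requests

    MASTER = "http://localhost:9333"    # weed master (placeholder address)

    db = sqlite3.connect("keymap.db")
    db.execute("CREATE TABLE IF NOT EXISTS keymap (name TEXT PRIMARY KEY, fid TEXT, url TEXT)")

    def put(name, data):
        assign = requests.get(MASTER + "/dir/assign").json()   # master picks the key
        fid, url = assign["fid"], assign["url"]
        requests.post("http://%s/%s" % (url, fid), files={"file": data})  # upload to volume server
        # Lose this table and you will never find the object again; there is no enumeration.
        db.execute("INSERT OR REPLACE INTO keymap VALUES (?,?,?)", (name, fid, url))
        db.commit()

    def get(name):
        row = db.execute("SELECT fid, url FROM keymap WHERE name=?", (name,)).fetchone()
        if row is None:
            raise KeyError(name)
        fid, url = row
        return requests.get("http://%s/%s" % (url, fid)).content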

STRIKE 3: poor performance. Despite all the above, I set about writing some scripts to test performance. Before I could read a million files I had to create a million files. Just to be on the safe side, I decided to try creating 100K files first to make sure it wasn’t going to take forever – both making this exercise very tedious and costing me money in extra instance hours. It didn’t take forever, but it did take over 14 minutes. That’s over 8ms per object create, or over two hours just to set up for a real test. It’s particularly egregious since there doesn’t seem to be any evidence of using O_SYNC/fsync, so it’s not even clear that the index file is sure to have been updated. I tried speeding it up by running five client threads in parallel, but one by one they hung waiting for a response from one of the volume servers – probably related to the “unexpected EOF” errors that the volume server would spit out periodically. I guess concurrency isn’t a strong suit, and neither is error reporting, since I had already noticed that the servers would return an HTTP 200 response even when requests failed. Just for comparison, GlusterFS completed the same setup in about seven minutes. That’s with full data durability, plus a result that has real user-friendly file names (plus extended attributes) and directory listings and actual security. At this point I decided it wasn’t even worth moving on to the test I’d meant to do. I’d seen enough.

Some people might think I’m being too harsh here, but I disagree. As I said in the first paragraph, systems like Weed-FS can be quite useful. As I also said, representing them as more than they are is not only incorrect but dangerous. The author as developer has done some interesting stuff and deserves encouragement. This feedback might not be what he wants to hear, but it’s a kind of feedback that developers need to hear, plus I’ve already done some testing and scripting work that might be useful. On the other hand, the author as social-media marketer deserves nothing but contempt. This is not a system that one can trust with real data, or that is in any significant way comparable to systems that already existed, and yet it was blithely presented as something actually superior to alternatives. That’s not acceptable. Building real storage is hard and often tedious work. The people who do it – including my competitors – don’t deserve to have their efforts trivialized by comparison to half-baked spare-time projects. They deserve better, and users deserve better, and anybody who doesn’t respect that deserves a few harsh words.

Storage Developer Conference 2012

Teaser: erasure codes and SMB, shingled disks and CDMI, all with a side of snark.

Last week I went to SNIA’s Storage Developer Conference in Santa Clara. It was a bit of a change of pace for me, in a bunch of ways. For one thing, most of the conferences I go to tend to be heavily open-source oriented, whereas SNIA tends to be dominated by the closed-source Big Storage folks. I’m not judging anyone, but this does imply a different vocabulary and emphasis to get one’s point across. This was also a bit different because I was explicitly not representing my employer. I was originally invited to participate as a member of Stephen Foskett’s Tech Field Day roundtable broadcast live from the event, for which I am very grateful to SNIA and to Stephen. Once I was already set to attend, I ended up doing a BoF and even a talk on the filesystem track, not that I mind but it did make my role there a little hard to explain. Anyway, enough about all that. What was going on technologically?

The most notable thing at the conference was the prevalence of SMB over NFS. It’s hard to interpret this as a function of the audience, since the big NFS players are also to a large degree the big SNIA players and were there in force. NFS hardly showed up in any of the talks, though. By way of contrast, SMB was either the topic or at least mentioned in a great many talks, and there was a big SMB plugfest going on downstairs the whole time. I kept hearing that Codenomicon was causing some havoc, but people couldn’t say any more because I wasn’t actually part of the plugfest and it’s all under NDA to avoid spooking the peasantry. It’s important to note that finding bugs at a plugfest is a Very Good Thing, because whatever’s found and fixed there won’t be found and exploited in the wild. In any case, I really came away with a strong feeling that the fragmentation and territoriality among the NFSv4+ server vendors has finally taken its toll. Now that the Microsoft folks and the Samba folks and others are playing together more nicely, they seem well on the way to becoming the technology base of choice for that kind of thing. And no, I don’t feel threatened by that, because I see both of them as more front-end technologies that in fact work very well with a fully distributed back-end technology such as GlusterFS.

Another technology that seemed to get mentioned many times was object storage, and particularly CDMI (Cloud Data Management Interface). With things like S3 and Swift already out there, I always thought it would be difficult for something like CDMI – a 224-page spec with 24 of them boilerplate before the overview – to gain any traction at all, but CDMI interfaces were mentioned prominently in several talks and by all accounts the interop demo (part of the cloud plugfest which was also running concurrently with SDC itself) was pretty impressive. The Cleversafe guys were there in force, as always, as were the Scality folks making some very ambitious claims. One of the big surprises for me was that GoDaddy is apparently planning to offer an object-storage service, following in the footsteps of Amazon and Rackspace and DreamHost. Speaking of DreamHost and their Ceph-powered DreamObjects, the RADOS layer of Ceph seems to have grown a new “class” abstraction to do the same sort of server-side filtering/transformation as we’ve considered doing in GlusterFS with translators. In any case, as we’ve seen with our own “UFO” object-storage interface, object storage just seems to be hot hot hot.

Another entity in the object-storage space is Microsoft, whose Azure Storage offers that paradigm along with several others (which is cool already). What’s even cooler is what they’ve done with erasure coding in that storage, and if I had to recommend one set of slides to peruse it would be that one. Basically the problem with erasure codes is that reading and especially repairing data can require talking to many more servers than would be the case with simple replication – to which erasure codes are preferable mostly because of their better storage efficiency. What the Azure guys did was to develop a new kind of erasure code that allows for more efficient repair, and even for a very flexible tradeoff of storage efficiency vs. repair cost (mirroring the tradeoffs that are already possible for storage efficiency vs. protection against N simultaneous disk failures). Very cool stuff, and they did a great job explaining it too.
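
To make the tradeoff concrete, here’s some back-of-the-envelope arithmetic comparing repair cost and overhead for plain Reed-Solomon, a locally repairable layout, and triple replication. The parameters are illustrative, not necessarily the ones Azure actually deploys.

    # Back-of-envelope repair-cost comparison. Parameters are illustrative:
    # k data fragments, g global parities, and for the "local" variant,
    # l local parities each covering a group of k/l data fragments.
    def reed_solomon(k, g):
        return {"overhead": (k + g) / k, "reads_per_repair": k}

    def local_codes(k, l, g):
        # Repairing a single data fragment only needs its local group.
        return {"overhead": (k + l + g) / k, "reads_per_repair": k // l}

    print("RS(12,4):      ", reed_solomon(12, 4))    # repair reads 12 fragments
    print("LRC(12,2,2):   ", local_codes(12, 2, 2))  # repair of a data fragment reads 6
    print("3-way replica: ", {"overhead": 3.0, "reads_per_repair": 1})

The knobs are the number of local groups and global parities: more of them means more overhead but fewer fragments read per repair, which is exactly the flexible tradeoff described above.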

The last thing I’ll mention is Garth Gibson’s talk about shingled disks. The idea there is that disk vendors are under immense pressure to deliver ever higher densities at very low cost. The rise of solid-state storage at the $/IOPS design point forces them even further toward the $/GB design point, and even with technologies like heat-assisted recording and patterned media they’re reaching the point where they just can’t write tracks that are any thinner. They can still read thinner tracks, though, and that’s where the shingled approach comes in. The idea is that writing each track (within a “band”) actually overwrites 2/3 or so of the previous track as well. Thus, that previous track is still readable but not rewritable without also rewriting the second track that overwrote it, and the third track that overwrote the second, and so on to the end of the band. Besides being terrible for performance, this creates a huge window where a failure in the middle of the process could lose data. A more logical approach would be to treat a shingled disk somewhat like a WORM drive, using a log-structured filesystem. Unfortunately, those have fallen into disrepair and disrepute as everyone has gone gaga over btree/COW filesystems, so some resurrection would be necessary. Also, it would be nice to have at least some easily rewritable space on a mostly shingled disk, even if it’s just to store log-cleaning state. Most people seemed to think you should just pair the shingled device with a smaller non-shingled device, but that doesn’t really solve the “have to guess the ratio” problem any better than one unshingled track at the end of each band (not per disk, because of seek-time issues) would, and I know users would hate the logistical issues of paired devices. I really hope that we don’t end up with the highest-capacity kinds of disks being almost impossible to use effectively because you have to provision two kinds of storage and then deal with both sets of performance anomalies for a single volume.
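
To put a rough number on “terrible for performance”: if rewriting any track means rewriting it plus everything after it in the band, then a random single-track update in an N-track band rewrites about half the band on average. A trivial model (the band size is made up):

    # Toy model of in-place updates on a shingled band: updating track i forces a
    # read-modify-write of tracks i..N-1, since each write clobbers most of the
    # next track. Illustrative arithmetic only.
    def tracks_rewritten(band_tracks, updated_track):
        return band_tracks - updated_track      # the updated track plus everything after it

    N = 64                                      # tracks per band (made-up number)
    avg = sum(tracks_rewritten(N, i) for i in range(N)) / N
    print("average tracks rewritten per random 1-track update: %.1f of %d" % (avg, N))
    # -> roughly N/2, which is why append-only (log-structured) layouts look attractive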

There was certainly a lot of other stuff at the conference as well, including the inevitable bubble-inflating dose of “big data” hadoop-la, but those didn’t really interest me as much as the things above. So thanks to everyone for inviting me, or educating me, or just hanging out with me, and see you all next year.

Comments on Parallels “Cloud Storage”

As someone who was once hired to work on a “cloud file system” I was quite intrigued by this tweet from Kir Kolyshkin.

@kpisman This is our very own distributed fs, somewhat similar to Gluster or CEPH but (of course) better. http://t.co/DlysXvve

Trying to ignore the fact that what the link describes is explicitly not a real filesystem, I immediately responded that the file/block discussion seemed misguided, and – more importantly – the code seemed to be MIA. The link is not to an implementation, in either source or binary form. It’s not even to an architecture or design. It’s just a discussion of high-level requirements, similar to what I did for HekaFS before I even considered writing the first line of code. Naturally, Kir challenged me to elaborate, so I will. Let’s start with what he has to say about scalability.

It’s interesting to note that a 64-node rack cluster with a fast Ethernet switch supporting fabric-switching technology can, using nothing more than 1Gb network cards and fairly run-of-the-mill SATA devices, deliver an aggregate storage bandwidth of around 50GB/s

I’ve actually seen a storage system deliver 50GB/s. I doubt that Kir has, because it’s not that common and if he had I’m pretty sure it would be mentioned somewhere in the document. Even if we assume dual Gb/s full-duplex NICs per node, that’s only 250MB/s/node or 16GB/s total. At 64 nodes per rack I don’t think you’re going to be cramming in more NICs, plus switches, so basically he’s just off by 3x. I work on the same kind of distributed “scale-out” storage he’s talking about, so I’m well aware of how claims like that should and do set off alarm bells for anybody who’s serious about this kind of thing. Let’s move on to the point I originally addressed.
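
(For anyone who wants to check that 3x: the 64 nodes and 1Gb cards come from the quote above, and the dual full-duplex NICs per node are my own generous assumption.)

    # Arithmetic behind the claim above: 64 nodes, dual 1Gb/s full-duplex NICs each.
    nodes = 64
    nics_per_node = 2                 # generous assumption; the doc only says 1Gb cards
    gbits_per_nic = 1.0
    bytes_per_gbit = 1e9 / 8

    per_node_MBps = nics_per_node * gbits_per_nic * bytes_per_gbit / 1e6
    total_GBps = nodes * per_node_MBps / 1000
    print("%.0f MB/s per node, %.0f GB/s per rack" % (per_node_MBps, total_GBps))
    # -> 250 MB/s per node, 16 GB/s per rack: roughly a third of the claimed 50 GB/s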

each VE root contains a large number of small files, and aggregating them in a file environment causes the file server to see a massively growing number of objects. As a result, metadata operations will run into bottlenecks. To explain this problem further: if each root has N objects and there are M roots, tracking the combined objects will require an N times M scaling of effort.

How does this require “N times M” effort any more for M>1 servers than for M=1? The only explanation I can think of is that Kir is thinking of each client needing to have a full map of all objects, but that’s simply not the case. Clients can cache the locations of objects they care about and look up any locations not already in cache. With techniques such as consistent hashing, even those rare lookups won’t be terribly expensive. Servers only care about their own objects, so “N times M” isn’t true for any entity in the system. This is not entirely a solved problem, but both GlusterFS and Ceph (among many others) have been doing things this way for years so anybody claiming to have innovated in this space should exhibit awareness of the possibility. Let’s move on.

use of sparse objects typically is not of interest to hosting providers because they already generally have more storage than they need.

O RLY? My customers – who I’d guess are probably more “enterprise-y” than Kir’s – certainly don’t seem to be wallowing in idle storage. On the contrary, they seem to be buying lots of new storage all the time and are very sensitive to its cost. That’s why one of the most frequently requested features for GlusterFS is “network RAID” or erasure coding instead of full-out replication, and deduplication/compression are close behind. They’re all geared toward wringing the most out of the storage people already have so that they don’t need to buy more. That hardly sounds like “more than they need” does it?

Because of these misunderstandings, I don’t think Parallels “cloud storage” is really comparable to GlusterFS, so I’m not sure why he mentioned it or why I’d care. It seems a lot more like RBD or Sheepdog, leaving open the question of why Parallels didn’t use one of those. Maybe they specifically wanted something that was closed source (or open source but you’re not supposed to know you’re paying them for something free). What’s really striking is what Kir never even mentions. For example, there’s no mention at all of security, privacy, or multi-tenancy. Surely, if this is supposed to be cloud storage, some mention should be made of accounts and authentication etc. There’s also no mention of management. If this is supposed to be all cloudy, shouldn’t there be something about how easy it is to add capacity or provision user storage from that pooled capacity? Without so much as an architectural overview it’s impossible to tell how well the result meets either the requirements Kir mentions or those he omits, and with such a start it’s hard to be optimistic.

Scaling Filesystems vs. Other Things

David Strauss tweeted an interesting comment about using filesystems (actually he said “block devices” but I think he really meant filesystems) for scale and high availability. I thought I was following him (I definitely am now) but in fact I saw the comment when it was retweeted by Jonathan Ellis. The conversation went on a while, but quickly reached a point where it became impossible to fit even a minimally useful response under 140 characters, so I volunteered to extract the conversation into blog form.

Before I start, I’d like to point out that I know both David and Jonathan. They’re both excellent engineers and excellent people. I also don’t know the context in which David originally made his statement. On the other hand, NoSQL/BigData folks pissing all over things they’re too lazy to understand has been a bit of a hot button for me lately (e.g. see Stop the Hate). So I’m perfectly willing to believe that David’s original statement was well intentioned, perhaps a bit hasty or taken out of context, but I also know that others with far less ability and integrity than he has are likely to take such comments even further out of context and use them in their ongoing “filesystems are irrelevant” marketing campaign. So here’s the conversation so far, rearranged to show the diverging threads of discussion and with some extra commentary from me.

DavidStrauss Block devices are the wrong place scale and do HA. It’s always expensive (NetApp), unreliable (SPOF), or administratively complex (Gluster).

Obdurodon Huh? GlusterFS is *less* administratively complex than e.g. Cassandra. *Far* less. Also, block dev != filesystem.

Obdurodon It might not be the right choice for any particular case, but for reasons other than administrative complexity.
What reasons, then? Wrong semantics, wrong performance profile, redundant wrt other layers of the system, etc. I think David and I probably agree that scale and HA should be implemented in the highest layer of any particular system, not duplicated across layers or pushed down into a lower layer to make it Somebody Else’s Problem (the mistake made by every project to make the HDFS NameNode highly available). However, not all systems have the same layers. If what you need is a filesystem, then the filesystem layer might very well be the right place to deal with these issues (at least as they pertain to data rather than computation). If what you need is a column-oriented database, that might be the right place. This is where I think the original very general statement fails, though it seems likely that David was making it in a context where layering two systems had been suggested.

DavidStrauss GlusterFS is good as it gets but can still get funny under split-brain given the file system approach: http://t.co/nRu1wNqI
I was rather amused by David quoting my own answer (to a question on the Gluster community site) back at me, but also a bit mystified by the apparent change of gears. Wasn’t this about administrative complexity a moment ago? Now it’s about consistency behavior?

Obdurodon I don’t think the new behavior (in my answer) is markedly weirder than alternatives, or related to being a filesystem.

DavidStrauss It’s related to it being a filesystem because the consistency model doesn’t include a natural, guaranteed split-brain resolution.

Obdurodon Those “guarantees” have been routinely violated by most other systems too. I’m not sure why you’d single out just one.
I’ll point out here that Cassandra’s handling of Hinted Handoff has only very recently reached the standard David seems to be advocating, and was pretty “funny” (to use his term) before that. The other Dynamo-derived projects have also done well in this regard, but other “filesystem alternatives” have behavior that’s too pathetic to be funny.

DavidStrauss I’m not singling out Gluster. I think elegant split-brain recovery eludes all distributed POSIX/block device systems.
Perhaps this is true of filesystems in practice, but it’s not inherent in the filesystem model. I think it has more to do with who’s working on filesystems, who’s working on databases, who’s working on distributed systems, and how people in all of those communities relate to one another. It just so happens that the convergence of database and distributed-systems work is a bit further along, but I personally intend to apply a lot of the same distributed-system techniques in a filesystem context and I see no special impediment to doing so.

DavidStrauss #Gluster has also come a long way in admin complexity, but high-latency (geo) replication still requires manual failover.

Obdurodon Yes, IMO geosync in its current form is tres lame. That’s why I still want to do *real* wide-area replication.

DavidStrauss Top-notch geo replication requires embracing split-brain as a normal operating mode and having guaranteed, predictable recovery.

Obdurodon Agreed wrt geo-replication, but that still doesn’t support your first general statement since not all systems need that.

DavidStrauss Agreed on need for geo-replication, but geo-repl. issues are just an amplified version of issues experienced in any cluster.
As I’ve pointed out before, I disagree. Even systems that do need this feature need not – and IMO should not – try to do both local/sync and remote/async replication within a single framework. They’re different beasts, most relevantly with respect to split brain being a normal operating mode. I’ve spent my share of time pointing out to Stonebraker and other NewSQL folks that partitions really do occur even within a single data center, but they’re far from being a normal case there and that does affect how one arranges the code to handle it.

Obdurodon I’m loving this conversation, but Twitter might not be the right forum. I’ll extract into a blog post.

DavidStrauss You mean complex, theoretical distributed systems issues aren’t best handled in 140 characters or less? :-)

I think that about covers it. As I said, I disagree with the original statement in its general form, but might find myself agreeing with it in a specific context. As I see it, aggregating local filesystems to provide a single storage pool with a filesystem interface and aggregating local filesystems to provide a single storage pool with another interface (such as a column-oriented database) aren’t even different enough to say that one is definitely preferable to the other. The same fundamental issues, and many of the same techniques, apply to both. Saying that filesystems are the wrong way to address scale is like saying that a magnetic #3 Phillips screwdriver is the wrong way to turn a screw. Sometimes it is exactly the right tool, and other times the “right” tool isn’t as different from the “wrong” tool as its makers would have you believe.

Stop The Hate

I’ve noticed a significant increase lately in the number of complaints people are making about the operating systems they use, particularly Linux and most especially the storage stack. No, I’m not thinking of a certain foul-mouthed SSD salesman, who has made such kvetching the centerpiece of his Twitter persona. I’m talking about several people I know in the NoSQL/BigData world, who I’ve come to respect as very smart and generally reasonable people, complaining about things like OS caches, virtual memory in general, CPU schedulers, I/O schedulers, and so on. Sometimes the complaints are just developers being developers, which (unfortunately) seems to mean being disrespectful of developers in other specialties. Sometimes the complaints take the form of an unexamined assumption that OS facilities just can’t be trusted, get in the way, and kill performance. The meme seems to be that the way to get better application performance is to get the OS out of the way as much as possible and reinvent half of what it does within your application. That’s wrong. No matter how the complaint is framed, it’s highly likely to reflect more negatively on the complainer than on the thing they’re complaining about.

Look, folks, operating-system developers can’t read minds. They have to build very complex, very general systems. They set defaults that suit the most common use cases, and they provide knobs to tune for something different. Learn how to use those knobs to tune for your exotic workload, or STFU. Does your code perform well in every possible use on every possible configuration, without tuning? Not so much, huh? I’ve probably seen your developers deliver a very loud “RTFM” when users visit mailing lists or IRC channels looking for help with a “wrong” use or config. I’ve probably seen them say far worse, even. How can the same person do that, and then turn around to complain about an OS they haven’t learned properly, and not be a hypocrite? When you do find those tuning knobs, often after having been told about them because you had already condemned the things they control as broken, don’t try and pass it off as your personal victory over the lameness of operating systems and their developers. You just turned a knob, which was put there by someone else in the hopes that you’d be smart enough to use it before you complained. They did the hard work – not you.

I’m not going to say that all complaints about operating systems are invalid, of course. I still think it’s ridiculous that Linux requires swap space even when there’s plenty of memory, and behaves poorly when it can’t get any. I think the “OOM Killer” is one of the dumbest ideas ever, and the implementation is even worse than the idea. I won’t say that operating-system documentation is all that it should be, either. Still, if you haven’t even tried to find out what you can tune through /proc and /sys and fcntl/setsockopt/*advise, or gone looking in the Documentation subdirectory of your friendly neighborhood kernel tree, or accepted an offer of help from a kernel developer who came to you to help make things better, you’re just in no position to complain or criticize. It’s like complaining that your manual-transmission car stalled, when you never even learned to drive it. Not knowing something doesn’t make you a fool, but complaining instead of asking does. Maybe if you actually engaged with your peers instead of pissing on them, they could help you build better applications.

Thoughts on Fowler’s LMAX Architecture

I have the best readers. One sent me email expressing a hope that I’d write about Martin Fowler’s LMAX Architecture. I’d be glad to. In fact I had already thought of doing so, but the fact that at least one reader has already expressed an interest makes it even more fun. The architecture seems to incorporate three basic ideas.

  • Event sourcing, or the idea of using a sequentially written log as the “system of record” with the written-in-place copy as a cache – an almost direct inversion of roles compared to standard journaling.
  • The “disruptor” data/control structure.
  • Fitting everything in memory.

I don’t really have all that much to say about fitting everything in memory. I’m a storage guy, which almost by definition means I don’t get interested until there’s more data than will fit in memory. Application programmers should IMO strive to use storage only as a system of record, not as an extension of memory or networking (“sending” data from one machine to another through a disk is a pet peeve). If they want to cache storage contents in memory that’s great, and if they can do that intelligently enough to keep their entire working set in memory that’s better still, but if their locality of reference doesn’t allow that then LMAX’s prescription just won’t work for them and that’s that. The main thing that’s interesting about the “fit in memory” part is that it’s a strict prerequisite for the disruptor part. LMAX’s “one writer many readers” rule makes sense because of how cache invalidation and so forth work, but disks don’t work that way so the disruptor’s advantage over queues is lost.

With regard to the disruptor structure, I’ll also keep my comments fairly brief. It seems pretty cool, not that dissimilar to structures I’ve used and seen used elsewhere; some of the interfaces to the SiCortex chip’s built-in interconnect hardware come to mind quickly. I think it’s a mistake to contrast it with Actors or SEDA, though. I see them as complementary, with Actors and SEDA as high-level paradigms and the disruptor as an implementation alternative to the queues they often use internally. The idea of running these other models on top of disruptors doesn’t seem strange at all, and the familiar nature of disruptors doesn’t even make the combination seem all that innovative to me. It’s rather disappointing to see useful ideas dismissed because of a false premise that they’re alternatives to one another instead of being complementary.

The really interesting part for me, as a storage guy, is the event-sourcing part. Again, this has some pretty strong antecedents. This time it recalls Seltzer et al.’s work on log-structured filesystems, which is even based on many of the same observations, e.g. about locality of reference and relative costs of random vs. sequential access. That work’s twenty years old, by the way. Because event sourcing is so similar to log-structured filesystems, it runs into some of the same problems. Chief among these is the potentially high cost of reads that aren’t absorbed by the cache, and the complexity involved with pruning no-longer-relevant logs. Having to scan logs to find the most recent copy of a block/record/whatever can be extremely expensive, and building indices carries its own expense in terms of both resources and complexity. It’s not a big issue if your system has very high locality of reference, which time-oriented systems such as LMAX (and several other types of systems) tend to have, but it can be a serious problem in the general case. Similarly, the cleanup problem doesn’t matter if you can simply drop records from the tail or at least stage them to somewhere else, but it’s a big issue for files that need to stay online – with online rather than near-line performance characteristics – indefinitely.
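
For anyone who hasn’t seen the pattern, here’s a deliberately naive sketch of event sourcing with the log as the system of record and an in-memory map as the cache. The expensive parts discussed above, indexing the log for reads that miss the cache and pruning old entries, are exactly what this toy version leaves out.

    # Minimal event-sourcing sketch: the append-only log is the system of record,
    # the in-memory dict is just a cache of current state rebuilt by replay.
    # Deliberately naive; log indexing and compaction are omitted.
    import json, os

    class EventStore:
        def __init__(self, path):
            self.path = path
            self.state = {}                      # cached current values
            if os.path.exists(path):             # recovery == replay the log
                with open(path) as log:
                    for line in log:
                        event = json.loads(line)
                        self.state[event["key"]] = event["value"]
            self.log = open(path, "a")

        def update(self, key, value):
            # The write of record is the log append; the cache is updated afterward.
            self.log.write(json.dumps({"key": key, "value": value}) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())
            self.state[key] = value

        def read(self, key):
            return self.state[key]               # only works if everything fits in memory

    store = EventStore("/tmp/events.log")
    store.update("balance:42", 100)
    print(store.read("balance:42"))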

In conclusion, then, the approach Fowler describes seems like a good one if your data characteristics are similar to LMAX’s, but probably not otherwise. Is it an innovative approach? Maybe in some ways. Two out of the three main features seem strongly reminiscent of technology that already existed, and combinations of existing ideas are so common in this business that this particular combination doesn’t seem all that special. On the other hand, there might be more innovation in the low-level details than one would even expect to find in Fowler’s high-level overview. It’s interesting, well worth reading, but I don’t think people who have dealt with high data volumes already will find much inspiration there.

SeaMicro’s New Machines

I read about this a few days ago on the Green Data Center Blog, and tried to comment there, but apparently Dave Ohara either isn’t checking his moderation queue or doesn’t like skeptical comments about companies whose press releases he’s re-publishing. I’ll try to reconstruct the gist of that comment here.

Since working at SiCortex, I’ve kept a bit of an eye on other companies trying to produce high-density high-efficiency systems, like SeaMicro and Calxeda. Frankly, I find SeaMicro’s “throughput computing” spin very unconvincing. A little over a year ago I took a look at their architecture and reached the conclusion that all those processors were going to starve – not enough memory to let data rest, not enough network throughput to keep it moving. Now they have somewhat more memory and a lot more CPU cycles, but as far as I can tell the exact same network and storage capabilities, so the starvation problem is going to be even more severe. Even worse, the job posting Dave forwards for them (if he doesn’t get a fee for that he should) seems to indicate that they’re still pursuing an Ethernet-centric interconnect strategy. That will keep them well behind where SiCortex, Cray, IBM and others were years ago in terms of internal bandwidth for similar systems.

On the other hand, even if Calxeda’s more radical departure from “me too” computing seems more likely to yield something useful, it would be unfair to contrast their still-hoped-for systems to SeaMicro’s actually-shipping ones. Come on, Calxeda, ship something so we can actually make that comparison.

Efficiency, Performance, and Locality

While it might have been overshadowed by events on my other blog, my previous post on Solid State Silliness did lead to some interesting conversations. I’ve been meaning to clarify some of the reasoning behind my position that one should use SSDs for some data instead of all data, and that reasoning applies to much more than just the SSD debate, so here goes.

The first thing I’d like to get out of the way is the recent statement by everyone’s favorite SSD salesman that “performant systems are efficient systems”. What crap. There are a great many things that people do to get more performance (specifically in terms of latency) at the expense of wasting resources. Start with every busy-wait loop in the world. Another good example is speculative execution. There, the waste is certain – you know you’re not going to execute both sides of a branch – but it’s often done anyway because it lowers latency. It’s not efficient in terms of silicon area, it’s not efficient in terms of power, it’s not efficient in terms of dollars, but it’s done anyway. (This is also, BTW, why a system full of relatively weak low-power CPUs really can do some work more efficiently than one based on Xeon power hogs, no matter how many cores you put on each hog.) Other examples of increased performance without increased efficiency include most kinds of pre-fetching, caching, or replication. Used well, these techniques actually can improve efficiency as requests need to “penetrate” fewer layers of the system to get data, but used poorly they can be pure waste.

If you’re thinking about performance in terms of throughput rather than latency, then the equation of performance with efficiency isn’t quite so laughable, but it’s still rather simplistic. Every application has a certain ideal balance of CPU/memory/network/storage performance. It might well be the case that thinner “less performant” systems with those performance ratios are more efficient – per watt, per dollar, whatever – than their fatter “more performant” cousins. Then the question becomes how well the application scales up to the higher node counts, and that’s extremely application-specific. Many applications don’t scale all that well, so the “more performant” systems really would be more efficient. (I guess we can conclude that those pushing the “performance = efficiency” meme are used to dealing with systems that scale poorly. Hmm.) On the other hand, some applications really do scale pretty well to the required node-count ranges, and then the “less performant” systems would be more efficient. It’s a subject for analysis, not dogmatic adherence to one answer.

The more important point I want to make isn’t about efficiency. It’s about locality instead. As I mentioned above, prefetch and caching/replication can be great or they can be disastrous. Locality is what makes the difference, because these techniques are all based on exploiting locality of reference. If you have good locality, fetching the same data many times in rapid succession, then these techniques can seem like magic. If you have poor locality, then all that effort will be like the effort you make to save leftovers in the refrigerator to save cooking time . . . only to throw those leftovers away before they’re used. One way to look at this is to visualize data references on a plot, using time on the X axis and location on the Y axis, using Z axis or color or dot size to represent density of accesses . . . like this.


[figure: time/location access-density plot]
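
Making that kind of plot from an access trace is easy enough; here’s a sketch assuming a hypothetical trace file with one timestamp/offset pair per line.

    # Sketch of the time/location plot described above, from a hypothetical trace
    # file with one "timestamp block_offset" pair per line. A 2-D histogram does
    # the density-as-color part.
    import matplotlib.pyplot as plt

    times, locations = [], []
    with open("access_trace.txt") as trace:     # made-up trace file name/format
        for line in trace:
            t, loc = line.split()
            times.append(float(t))
            locations.append(int(loc))

    plt.hist2d(times, locations, bins=200)      # color = access density
    plt.xlabel("time")
    plt.ylabel("location (block offset)")
    plt.colorbar(label="accesses")
    plt.savefig("access_density.png")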

It’s easy to see patterns this way. Vertical lines represent accesses to a lot of data in a short amount of time, often in a sequential scan. If the total amount of data is greater than your cache size, your cache probably isn’t helping you much (and might be hurting you) because data accessed once is likely to get evicted before it’s accessed again. This is why many systems bypass caches for recognizably sequential access patterns. Horizontal lines represent constant requests to small amounts of data. This is a case where caches are great. It’s what they’re designed for. In a multi-user and/or multi-dataset environment, you probably won’t see many thick edge-to-edge lines either way. You’ll practically never see the completely flat field that would result from completely random access either. What you’ll see the most of are partial or faint lines, or (if your locations are grouped/sorted the right way) rectangles and blobs representing concentrated access to certain data at certain times.

Exploiting these blobs is the real fun part of managing data-access performance. Like many things, they tend to follow a power-law distribution – 50% of the accesses are to 10% of the data, 25% of the accesses are to the next 10%, and so on. This means that you very rapidly reach the point of diminishing returns, and adding more fast storage – be it more memory or more flash – is no longer worth it. When you consider time, this effect becomes even more pronounced. Locality over short intervals is likely to be significantly greater than that over long intervals. If you’re doing e-commerce, certain products are likely to be more popular at certain times and you’re almost certain to have sessions open for a small subset of your customers at any time. If you can predict such a transient spike, you can migrate the data you know you’ll need to your highest-performance storage before the spike even begins. Failing that, you might still be able to detect the spike early enough to do some good. What’s important is that the spike is finite in scope. Only a fool, given such information, would treat their hottest data exactly the same as their coldest. Only a bigger fool would fail to gather that information in the first place.
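
Running the numbers on that distribution makes the diminishing returns obvious; this just iterates the rule stated above, halving the remaining accesses for each additional 10% of the data you promote to fast storage.

    # Cumulative accesses captured as you put successive 10% slices of the data on
    # fast storage, using the half-the-remaining-accesses-per-slice rule above.
    captured = 0.0
    remaining = 1.0
    for decile in range(1, 6):
        captured += remaining / 2       # this slice takes half the remaining accesses
        remaining /= 2
        print("hottest %2d%% of data -> %6.3f%% of accesses" % (decile * 10, captured * 100))

After a few slices there’s almost nothing left to capture, which is the whole point.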

Since this all started with Artur Bergman’s all-SSD systems, let’s look at how these ideas might play out at a place like Wikia. Wikia runs a whole lot of special-interest wikis. Their top-level categories are entertainment, gaming, and lifestyle, though I’m sure they host wikis on other kinds of subjects as well. One interesting property of these wikis is that each one is separate, which seems ideal for all kinds of partitioning and differential treatment of data. At the very grossest level, it seems like it should be trivial to keep some of the hottest wikis’ data on SSDs and relegate others to spinning disks. Then there’s the temporal-locality thing. The access pattern for a TV-show wiki must be extremely predictable, at least while the show’s running. Even someone as media-ignorant as me can guess that there will be a spike starting when an episode airs (or perhaps even a bit before), tailing off pretty quickly after the next day or two. Why on Earth would someone recommend the same storage for content related to a highly rated and currently running show as for a show that was canceled due to low ratings a year ago? I don’t know.

Let’s take this a bit further. Using Artur’s example of 80TB and a power-law locality pattern, let’s see what happens. What if we have a single 48GB machine, with say 40GB available for caching? Using the “50% of accesses to 10% of the data” pattern, that means only 3.125% of accesses even have to go beyond memory. No matter what the latency difference between flash and spinning disks might be, it’s only going to affect that 3.125% of accesses so it’s not going to affect your average latency that much. Even if you look at 99th-percentile latency, it’s fairly easy to see that adding SSD up to only a few times memory size will reduce the level of spinning-disk accesses to noise. Factor in temporal locality and domain-specific knowledge about locality, and the all-SSD case gets even weaker. Add more nodes – therefore more memory – and it gets weaker still. Sure, you can assume a flatter access distribution, but in light of all these other considerations you’d have to take that to a pretty unrealistic level before the all-SSD prescription starts to look like anything but quackery.

Now, maybe Artur will come along to tell me about how my analysis is all wrong, how Wikia really is such a unique special flower that principles applicable to a hundred other systems I’ve seen don’t apply there. The fact is, though, that those other hundred systems are not well served by using SSDs profligately. They’ll be wasting their owners’ money. Far more often, if you want to maximize IOPS per dollar, you’d be better off using a real analysis of your system’s locality characteristics to invest in all levels of your memory/storage hierarchy appropriately.