Playing with Scratch

Amy has been interested in creating her own video game for a while now, so I started looking for easy ways to get an eight-year-old started with programming. It quickly became clear that Scratch was the best choice, not only because of the environment itself but also because of the community that has built up around it, so I downloaded the Mac version and we started playing some of the demos. There’s one called Fruit RPG that seemed pretty close to the kind of game we wanted to create. You get up, walk around collecting fruit, then go into the “fruit place” where someone rewards you with a fruit platter – simple stuff, mostly there as an example of lists in Scratch. After Amy went to bed, I started digging in a bit so I could be several steps ahead of her and ready to answer questions when she asked instead of having to figure them out on the spot. My idea was to learn by adding a couple of new features – a poisonous mushroom that would kill you when you picked it up, and a dinosaur that would follow you around trying to steal your fruit (shades of Adventure there). I was able to implement both fairly quickly, but also discovered a couple of bugs.

  • The fruit were infinitely regenerating. Every time you went into a building and then came back out, all of the fruit were there even if you had already collected some.
  • If you walked to the spot on the screen where the fruit-place guy would be if you were in the store, the ending sequence would start even if you were outside.

I’m sure the people who wrote this little example didn’t care about the first, and probably not about the second either, since this was just an example of using lists anyway. Nonetheless, it seemed likely that I’d learn something by fixing them. It turned out to be a bit more of an exercise than I’d thought, because the “scripts” that make things appear and disappear as you change location are intricately tied to the language’s message-passing paradigm. You leave your house by moving over an invisible door object, which as part of its operation broadcasts a message. The background object responds to this message by drawing the outdoor scenery, the fruit objects respond by making themselves visible, and so on.

This got me thinking about some aspects of the message-passing system. For example, does a “broadcast and wait” action mean that the message gets enqueued everywhere before it’s processed anywhere, or would various overlaps be possible? This was important because it would affect whether a second message sent from one of the response scripts was guaranteed to come after the original message, and the viability of some possible solutions hinged on the answer. I wondered how many young programmers had been frustrated by the weird semantics of the “forever when X” action (not “while X” but “if X” within an infinite loop), by having the wrong expectation of whether the “green flag” message is processed synchronously or asynchronously, by not realizing that multiple instances of the same script could run concurrently, and so on.
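
To make that last distinction concrete, here is a minimal Python sketch (the names and structure are mine, not Scratch’s) of why “forever when X” behaves like an “if” inside an infinite loop rather than a “while” loop: the condition is re-armed forever, so the action fires again every time X becomes true, and the script itself never ends.

    import time

    def forever_if(condition, action, poll=0.05):
        # Scratch-style "forever when X": re-check the condition forever,
        # running the action on every pass where it happens to be true.
        # This loop never returns.
        while True:
            if condition():
                action()
            time.sleep(poll)

    def plain_while(condition, action, poll=0.05):
        # A conventional "while X": once the condition goes false,
        # the loop ends and the action can never fire again.
        while condition():
            action()
            time.sleep(poll)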

Needless to say, I fixed the bugs and added my features pretty easily. Along the way, I came up with a general set of “best practices” (essentially actor model) that avoid the pitfalls mentioned above. That’s all nice, but it’s not the real point here. The real point is that careful specification of a system’s behavior isn’t just something that you do for the sake of advanced programmers. Even a simple system like Scratch can involve significant complexity and ambiguity. Even the most inexperienced programmers trying to do the simplest things might get tripped up by details that you didn’t bother to explain. Scratch is a nice little system, but it might be nicer if it didn’t leave my daughter and other children like her trying to debug race conditions. Think about that next time you’re tempted to say that an implementation choice was obvious because users would be able to figure it out.
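
For the curious, those “best practices” amount to treating every sprite as an actor: one inbox, messages handled strictly one at a time, and all state changes made inside the message handler rather than poked at from other scripts. A rough Python sketch of the discipline (my own illustration, obviously not Scratch code):

    import queue
    import threading

    class Actor:
        """One inbox, one handler, messages processed strictly one at a time."""
        def __init__(self, name):
            self.name = name
            self.inbox = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def send(self, message):
            # Other sprites only ever call send(); they never touch state.
            self.inbox.put(message)

        def _run(self):
            while True:
                self.handle(self.inbox.get())  # never two handlers at once

        def handle(self, message):
            pass

    class Fruit(Actor):
        def __init__(self, name):
            super().__init__(name)
            self.picked = False
            self.visible = True

        def handle(self, message):
            # All state changes happen here, in message order, so there is
            # no race between "went outside" and "was picked up".
            if message == "went outside":
                self.visible = not self.picked
            elif message == "picked up":
                self.picked = True
                self.visible = False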

P.S. I wrote most of this a couple of days ago and wasn’t even sure I’d ever get around to posting it (I have dozens of drafts like that), but it seems particularly timely in light of this book review that was posted today so I might as well let it fly.

The Future of Storage for Big Data

I probably shouldn’t post this. It’s unduly inflammatory in general, and unnecessarily insulting to the HDFS developers in particular. It will probably alienate some people, deepening divisions that are already too deep. On the other hand, there’s a “manifest destiny” attitude in the Hadoop community that really needs to be addressed. Throwing a bucket of ice water on someone is an obnoxious thing to do, but sometimes that’s what it takes to get their attention.

The other day I got into a discussion on Quora with Cloudera’s Jeff Hammerbacher about how Impala uses storage. Now, Jeff is someone I like and respect immensely, but he said something in that discussion that I find deeply disturbing (emphasis mine).

MapReduce and Impala are just the first two of what are likely to be many frameworks for parallel processing that will run well over data stored in HDFS, HBase, and future data storage solutions built above HDFS.

I agree that there will always be multiple data stores within an organization. I suspect that as the storage and computational capabilities of Hadoop expand, the amount of data stored within Hadoop will come to dominate the amount of data stored outside Hadoop, and most analyses will be performed within the cluster.

That’s a pretty bold prediction, and a pretty HDFS-centric strategy for getting there. If HDFS is central to the Hadoop storage strategy, and most data is stored in Hadoop, then what he’s really saying is that most data will be stored in HDFS, and I really hope that’s not the plan because any Hadoop company that has that as a plan will lose. I’m a firm believer in polyglot storage. Some storage is optimized for reads and some for writes, some for large requests and some for small requests, some for semantically complex operations and some for simple operations, some for strong consistency and some for high performance. I don’t think the storage I work on is good for every purpose, and that’s even less the case for HDFS. It’s not even a complete general-purpose filesystem, let alone a good one. It was designed to support exactly one workload, and it shows. There’s no track record to show that the people interested in seeing it used more widely could pull off the fundamental change that would be required to make it a decent general-purpose filesystem. Thus, if the other parts of Hadoop are only tuned for HDFS, only tested with HDFS, stubbornly resistant to the idea of integrating with any other kind of storage, then users will be left with a painful choice.

  • Put data into HDFS even though it’s a poor fit for their other needs, because it’s the only thing that works well with the rest of Hadoop.
  • Put data into purpose-appropriate storage, with explicit import/export between that and HDFS.

Painful choices tend to spur development or adoption of more choices, and there are plenty of competitors ready to pick up that dropped ball. That leaves the HDFS die-hards with their own two choices.

  • Hire a bunch of real filesystem developers to play whack-a-mole with the myriad requirements of real general-purpose storage.
  • Abandon the “one HDFS to rule them all” data-lock-in strategy, and start playing nice with other kinds of storage.

Obviously, I think it’s better to let people use the storage they want, as and where it already exists, to serve their requirements instead of Hadoop developers’. That means a storage API which is stable, complete, well documented, and consistently used by the rest of Hadoop – i.e. practically the opposite of where Impala and libhdfs are today. It means actually testing with other kinds of storage, and actually collaborating with other people who are working on storage out in the real world. There’s no reason why those future parallel processing frameworks should be limited to working well over HDFS, or why those future storage systems should be based on HDFS. There’s also little chance that they will be, in the presence of alternatives that are both more open and more attuned to users’ needs.
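
To make “stable, complete, well documented” slightly less abstract, here is a deliberately tiny Python sketch of the kind of pluggable interface I have in mind. The names are mine alone – this is not Hadoop’s actual filesystem API or anyone else’s – but it shows the shape of the thing: frameworks code to an abstraction, and HDFS becomes just one provider among several.

    from abc import ABC, abstractmethod

    class Storage(ABC):
        """What a parallel-processing framework actually needs from storage."""

        @abstractmethod
        def open_read(self, path, offset=0, length=None):
            """Return a readable file-like object for a byte range of path."""

        @abstractmethod
        def open_write(self, path, append=False):
            """Return a writable file-like object for path."""

        @abstractmethod
        def list(self, path):
            """Return the entries under path."""

        @abstractmethod
        def locations(self, path, offset, length):
            """Best-effort locality hints: which hosts hold this byte range."""

    class PosixStorage(Storage):
        """One of several possible providers; HDFS or an object store would be others."""
        def open_read(self, path, offset=0, length=None):
            f = open(path, "rb")
            f.seek(offset)
            return f

        def open_write(self, path, append=False):
            return open(path, "ab" if append else "wb")

        def list(self, path):
            import os
            return os.listdir(path)

        def locations(self, path, offset, length):
            return ["localhost"]  # local disks have no better hint to offer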

Perhaps some day most data will be connected to Hadoop, but it will never be within. The people who own that data won’t stand for it.

Thoughts on Multitasking

A lot of people, especially in the geek community, have historically taken pride in their ability to multi-task. More recently, a lot of research has shown that multi-tasking is less effective than people think, leading many to the conclusion that multi-tasking really doesn’t exist. I think both sides are full of bunk.

On a CPU, there is a cost associated with switching from one task to another. Whether the switch is worth the cost depends on which task needs those cycles more. If the old task is likely to be blocked anyway and the new one is ready to go, then the switch is likely to be worth it. Conversely, if the old task still has work to do and the new one isn’t ready yet, then a switch would be a complete waste of time. As it turns out, on a typical computer most tasks are blocked most of the time, waiting for disks or networks or people. It’s relatively easy to detect and distinguish between these conditions, so multi-tasking works really well.
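
As a toy model of that decision (my own simplification, not anything lifted from a real scheduler), the switch logic on a CPU looks roughly like this, and the same function gives the opposite answer for humans simply because the numbers are so different:

    def should_switch(current_blocked, other_ready, switch_cost, expected_idle):
        # Switching pays off only when the current task can't use the CPU
        # anyway and the time it would otherwise sit idle exceeds the cost
        # of the switch itself.
        if current_blocked and other_ready:
            return expected_idle > switch_cost
        return False

    # For a CPU: switch cost in microseconds, disk/network waits in
    # milliseconds, so the answer is almost always "switch".
    print(should_switch(True, True, switch_cost=0.00001, expected_idle=0.010))

    # For a person: switch cost measured in seconds (plus lost context),
    # so the same policy says "stay put" far more often.
    print(should_switch(True, True, switch_cost=30.0, expected_idle=5.0))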

The problem is that we’re not like computers. For one thing, while a lot of things can happen at once in a human brain, we only have one “core” devoted to higher-level activities like coding, writing, or carrying on a conversation. For another, our brains are actually quite slow, so we don’t typically have a lot of idle cycles on either side of the “should we switch” equation. That slowness also means that our task switches are many orders of magnitude more expensive than those on computers – possibly seconds, depending on the complexity of the task we’re setting aside and the task we’re taking up, instead of microseconds. For human multi-tasking to work, we must make much more intelligent decisions about when to switch and when not to, based on much more subtle features of the old and new tasks. Even the decision to accept or reject an interruption takes significant time, which is why interruptions harm productivity so much. People who say they multi-task well usually mean that they can make accept/reject decisions quickly, but that doesn’t mean they make those decisions well – and there’s still the effect of the switch itself to consider. Besides being very slow, for us a task switch often SQUIRREL! Quick, can you remember what I was just saying three sentences ago? I doubt it, and that’s the point: unlike computers, when we switch our recall of where we were can be highly imperfect. We can get through the accept/reject part and the switch part quickly and still lose because in the process we’ve forgotten more context than switching was worth. We would still have been better off single-tasking.

The upshot is that you can train yourself to multi-task more efficiently, but it’s an ability you should be reluctant to exercise. Unless you’re in one of those situations where you really should stop thinking about something because you’re overanalyzing or going around in circles (the infamous “finally figured out that bug while driving home” scenario), you should probably stick to what you’re doing until you’re done. Learn to schedule exactly the amount of work that you’re really able to do well, and do it in an organized way, instead of trying to be a hero by multi-tasking through too many things and doing them all poorly.

Why I Was Late To Work

I’ve been working almost entirely from home since March, mostly so that I can be home in case Amy needs something and Cindy is unavailable due to her sleep issues. On most days I also get to save an hour otherwise spent commuting, but that’s usually soaked up by extra time “at work” or in other ways. One such other way happened this morning. We had some heavy rain and thunderstorms, which seemed to have subsided when we heard a *crack* from outside. We’re all too familiar with that sound from last October’s storm, so we immediately went to the windows to see which tree or limb had fallen where. It turned out to be a pretty big one, effectively blocking our street in both directions. Our street happens to be a popular commuter shortcut, and during rush hour often has a line of cars half a mile long. It’s also heavily used by emergency vehicles getting across town or over to the next town, so a blockage is serious business.

First thing to do was call 911 and let them know of the issue. Next was to put on some shoes and see if I could clear at least one lane. This limb was big enough that I seriously doubted I could move it very far myself, but I figured if people saw me trying then someone else might come out to lend a hand. That would have been in their own best interest, after all. Sure enough, two guys immediately got out of their cars to help. Between the three of us we were able to get the limb swung around enough to clear one lane. Then a miracle happened. A guy with a big truck pulled out of the queue, got out, and pulled out a tow strap. He immediately set about wrapping it around the tree while I directed traffic. Naturally, I let the two guys who had helped get on their way first. When Truck Guy was ready we pulled the limb further around, and then the two of us rolled it enough so that both lanes were free. Then he went on his way as well.

The police showed up just after the action concluded. That’s not a knock against them; they clearly had plenty of other things to deal with. I went out and talked to the officer who was there, who said they’d get DPW to take a look at it. When I got out of my shower, I was totally unsurprised to see guys with chainsaws and a wood chipper across the street, and everything’s cleared away as I write this – a little over an hour after the branch fell. What I think stands out about this is that four total strangers spontaneously got together to deal with a problem that wasn’t really theirs (the tree might be in front of my house but technically it’s on town land right next to the street), in a way that was completely complementary to the contributions of the local government, then the whole setup just dissolved when it was no longer needed. It’s like a flash mob with a purpose.

All in all, this wasn’t quite as dramatic as saving the drunk guy in the water on Long Beach Island, but it was certainly a good jolt of adrenaline to get my day started. I think my boss will understand why I’m answering email a bit later than usual this morning.

Patch Based Filesystem

Even when I’m happy with my current project, I often like to think about alternative solutions to the same basic set of distributed-data problems that are my area of expertise. So it is with GlusterFS. I think that’s on a good track, with a robust set of features already and a clear path to being the best available solution for an even broader area, but . . . there’s always a “but” isn’t there? As I’ve worked on GlusterFS for the last few years, there have been certain patterns of problems and feature additions that have proven unexpectedly difficult. A lot of these come down to the issue of write granularity and conflicting simultaneous writes. For example, what if two HekaFS clients using block-based encryption try to write different parts of the same cipher block at the same time? They’re not conflicting from the user’s point of view, so one shouldn’t be allowed to undo the other, but they are conflicting in the sense that every byte written affects every byte of that cipher block. It’s impossible to merge the two writes without keys, and of course the server can’t have keys in our model. Erasure coding runs into much the same problem. In both cases, the solution is to serialize the read-modify-write sequences used for partial writes. We can have all sorts of fun with various optimistic locking schemes and so on, but the net result is still a bit sad. When these issues are considered in the context of asynchronously replicating those writes across geographic locations, things become even uglier . . . and that’s the train of thought leading to this post. Is there a fundamentally different approach, absolutely infeasible to implement within GlusterFS, that would handle this combination (async replication plus erasure coding plus encryption) more gracefully? If I were to design such a system from the ground up, here’s what it might look like. While I’m at it, I’ll throw in some other ideas that aren’t really related to this fundamental problem, just for fun.

  • Would I even want the result to be a filesystem? I’ve often said that one of the reasons I got involved with GlusterFS was because I really didn’t want to develop an entire filesystem, and I still feel that way. There’s tremendous value in the result, but if I were to do something on my own maybe I’d go for something less complicated. Object (or “blob”) stores like S3 or Swift are about twenty times easier to implement, and many people seem willing to adopt them instead of filesystems, so maybe I’d do that. With things like HTTP PATCH coming along I’d still have to deal with overlapping byte-aligned writes within a file/object (I’m just going to say “file” from here on for simplicity), but that’s OK.
  • Distribution would still be based on consistent hashing. I have many of my own favorite tweaks in this area, most of which are equally applicable to GlusterFS and PBFS. (A minimal sketch of the basic technique appears after this list.)
  • The first major change I’d make is to get out of the business of dealing with cluster membership and consensus with respect to configuration myself. There’s other software available to do just that, so there’s no need to implement an inferior home-grown version or put up with its deficiencies.
  • The other major change would be a move from a “write in place” model to more of a log-structured or “event sourcing” model. When a client writes data, the entire write is encrypted and erasure coded according to its own internal alignment, regardless of how it’s aligned with respect to the file. The pieces are then distributed among the servers responsible for that file. Voila, no more server-side merging or read-modify-write sequences.
  • As pieces of the write’s contents are being written to several servers, information about the write as a whole is fully replicated to a smaller set of servers. Any one of these “catalog” servers can then, when asked about that write, identify which servers got erasure fragments and also provide any authentication/integrity metadata necessary to validate the combined result.
  • Since the entire write is maintained as a single unit, that unit can be transmitted verbatim to another site. It can even be erasure-coded differently on that end, to satisfy different requirements e.g. at a primary vs. DR site. This simplifies matters greatly compared to approaches that require writes to be merged somehow before being propagated.
  • As with the patch-based systems that are well known and understood in the source-control world, merging happens on the client side during reads. To read a byte range within a file, you ask one of its catalog servers for all current writes overlapping that range. What you get back is potentially a list of writes, with extent and ordering information. It’s up to you to decrypt and merge them to get the bytes you need. In some cases, it would make sense to re-encrypt and write back the result so that the next reader only gets a single chunk, but that’s optional. In that case, the write-back might cause some or all of the previous chunks to be marked as no longer current and (eventually) garbage-collected. (A rough sketch of this read path appears after the list.)
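
On the distribution point above, here is a minimal consistent-hashing sketch in Python. It is illustrative only – a real implementation would add virtual nodes, weights, and the various tweaks I alluded to – but it shows the basic property: adding or removing a server only remaps the keys adjacent to it on the ring.

    import bisect
    import hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        """Map file names to servers around a fixed hash ring."""
        def __init__(self, servers):
            self.ring = sorted((_hash(s), s) for s in servers)

        def server_for(self, filename):
            points = [point for point, _ in self.ring]
            i = bisect.bisect(points, _hash(filename)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["server-a", "server-b", "server-c"])
    print(ring.server_for("photos/cat.jpg"))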
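
And here is a rough sketch of the read path from the last item: the catalog hands back a list of (possibly overlapping) writes with ordering information, and the client flattens them into the bytes it asked for. Decryption and validation are omitted, and all names are mine rather than from any existing codebase – this only shows the merge.

    def merge_reads(writes, start, length):
        """Flatten a list of writes into one byte range.

        writes is a list of (sequence, offset, data) tuples as a catalog
        server might return them; higher sequence numbers are newer.
        Bytes never written come back as zeros, like a sparse file.
        """
        result = bytearray(length)
        for _, offset, data in sorted(writes):   # oldest first, newest wins
            lo = max(offset, start)
            hi = min(offset + len(data), start + length)
            if lo < hi:
                result[lo - start:hi - start] = data[lo - offset:hi - offset]
        return bytes(result)

    writes = [
        (1, 0, b"aaaaaaaaaa"),   # first write covers bytes 0-9
        (2, 4, b"BBBB"),         # a later write overlaps bytes 4-7
    ]
    print(merge_reads(writes, 0, 10))   # b'aaaaBBBBaa'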

As I’m sure you can see, this is a very different kind of animal than GlusterFS – or, for that matter, any other distributed filesystem. Both the read and write paths are fundamentally different. Writes can be significantly more efficient, but reads are more complicated and rely more on high scale to deliver good aggregate performance. It would certainly be an interesting exercise, if only I had time…

Confessions of a MOO Programmer

Most people nowadays seem to learn about object-oriented programming via Python or Ruby. Before that it was C++ or Java, and before that Smalltalk or (some flavor of) LISP. My first exposure to object-oriented programming was through none of these, but instead through LambdaMOO. Best known as a persistent multi-user environment, LambdaMOO also had its own fairly unique programming language. For one thing, it was prototype-based, whereas most other OOP languages tend to be class-based. This means that you never have to create a class just so you can create one instance. You just create objects, which can be used directly, (sort of) as a class, or even both. There is no distinction between virtual and non-virtual methods, nor between static and instance methods, and no abstract classes or singleton-pattern nonsense. You could build almost all of these things yourself, but you don’t have to conform to just one model.
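
For a rough feel of what prototype-style objects are like, here is a Python approximation (Python is class-based, so this delegation trick only imitates what MOO gives you natively): any object can serve directly as the “parent” of another, no class required.

    class ProtoObject:
        """An object that delegates anything it doesn't define to its parent."""
        def __init__(self, parent=None, **slots):
            self.parent = parent
            self.slots = dict(slots)

        def get(self, name):
            if name in self.slots:
                return self.slots[name]
            if self.parent is not None:
                return self.parent.get(name)
            raise AttributeError(name)

        def set(self, name, value):
            self.slots[name] = value

    # No class needed: "generic thing" is itself a usable object and also
    # serves as the prototype (parent) for more specific objects.
    generic_thing = ProtoObject(name="thing", describe=lambda o: o.get("name"))
    rock = ProtoObject(parent=generic_thing, name="a grey rock")
    print(rock.get("describe")(rock))   # -> a grey rock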

However, the most interesting thing about MOOcode is its approach to permissions. Because it was designed to work in a multi-user environment, where users were often neither trusting nor trustworthy but still called each others’ code constantly, MOO needed a pretty robust permissions system. This was no lame private/protected/public model that lacks even the concept of an owner, and thus the ability to infer rights based on ownership. Each object, each “verb” and each property has an owner. The language includes primitives to determine the object for the previous call (caller), the operative permissions in that call (caller_perms), or even the whole call stack (callers). This lets you implement basically whatever permissions scheme you want. For example, “caller==this” is kind of like “protected” in C++, and “caller==#12345” is kind of like declaring #12345 as a “friend”. At the other extreme, one could go looking further up the stack to see if the current verb is being called in some particular context, even if there are multiple “unexpected” calls in between.
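
Here is a tiny Python analogue of that style of check. It is my own illustration: MOO hands you caller and caller_perms as language primitives, whereas here the caller has to be passed in explicitly, but the ownership-based reasoning is the same.

    class MooObject:
        def __init__(self, obj_id, owner, friends=()):
            self.obj_id = obj_id
            self.owner = owner
            self.friends = set(friends)

        def read_secret(self, caller):
            # "caller == this" -- roughly C++ 'protected'
            if caller is self:
                return self._secret()
            # "caller == #12345" -- roughly declaring #12345 a 'friend'
            if getattr(caller, "obj_id", None) in self.friends:
                return self._secret()
            raise PermissionError(f"{caller.obj_id} may not read {self.obj_id}")

        def _secret(self):
            return "the combination to the safe"

    vault = MooObject("#100", owner="frank", friends={"#12345"})
    trusted = MooObject("#12345", owner="amy")
    stranger = MooObject("#666", owner="nobody")
    print(vault.read_secret(trusted))    # allowed: caller is a declared friend
    # vault.read_secret(stranger)        # would raise PermissionError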

The most unusual thing about the MOO permission system is that every verb runs with the permissions of its author – not the user who caused it to be invoked. This is kind of like every program in UNIX being “set UID” by default, which seems crazy but actually works quite well. It makes most kinds of “trojan horse” attacks impossible, for one thing. The person who has to worry about improper access to data is also the person – the verb author/owner – who can add code to prevent it. The exact workings of MOO property ownership and inheritance were a bit strange sometimes, but most MOO programmers learned the basics and were able to secure their code pretty quickly.

Because of all these features, programming in MOOcode was a very fluid and enjoyable experience. Python comes closest among the languages I know well, though I’ve dabbled with Lua and it seems even closer. If I ever decide to spend time on inventing my own language instead of using one to get something else done, it would probably be a prototype-based OOP language with extra features for concurrency/distribution, and in that context MOO’s permission model is about as good a starting point as I’ve seen. It’s too bad that most people who could benefit from studying it are probably put off by its origins in a game-like environment.

Gluster Acquisition

For some pretty obvious reasons, everybody’s asking me about this already and will probably continue to do so, so I might as well get some thoughts written down in semi-coherent form. First, though, let’s take care of some administrivia.

I do not – ever – represent Red Hat online. I neither can nor want to speak for them. Also, I was not directly involved in the acquisition. I’m sure my well known opinions about Gluster helped put the idea in people’s heads, and I’m sure I’ll be quite busy helping figure out exactly where to go from here, but it would have been neither appropriate nor useful for me to have been involved in between. Everybody who was involved knew and respected that, as did I, so I was not at all surprised to read about it in public sources first. I’m posting this on my personal site instead of the HekaFS site to underscore the fact that this is my personal, unofficial opinion as someone who is affected by but not responsible for this decision.

OK, enough of that. Personally, then, I am delighted by this. Let’s enumerate some of the things I’ve been feeling and saying about Gluster and GlusterFS since I joined Red Hat and started the CloudFS/HekaFS project.

  • This is an area where open source has needed to make a stronger play vs. proprietary solutions.
  • GlusterFS has a strong overall architecture – e.g. leveraging local filesystems, adding modular functionality – for dealing with emerging needs regarding unstructured data, cloud deployment, etc. Sure, there are some parts of the implementation that I think could improve, but I’d rather build on a strong base than have to rip out and throw away stuff built on a weak one.
  • The Gluster folks are as committed to open source as Red Hat is. Not only their code but their process is open, and so are their minds. Just think for a moment how open-minded someone must be to listen when I get all opinionated about their work. Despite my abrasive style, they have always listened and responded constructively.
  • Community matters, and the very strong Gluster/FS community has been one of the best parts of my job for the last couple of years.

All of this adds up to a move that somebody in open source had to make, and these were the best two companies to make it. Proprietary Big Storage has defined the field too long. This will make it a lot easier to implement not only my vision for HekaFS, but other visions as well. Scale-out shared-nothing storage that’s easy to configure, easy to tune, easy to monitor, is a powerful tool. It can be used to serve up files – or objects – directly to users, as part of either a traditional or cloud environment. It can be used to serve up the virtual-machine (and other disk) images that are an essential part of cloud computing. It can be used for many other things besides, either as-is or via extension modules. Compression or deduplication, snapshots or versioning, custom access controls, inline format conversions . . . the sky’s the limit. Layering separate functionality on top of dumb blocks/files/objects, each oblivious to the other, is so yesterday, but it’s all that competitors will ever make possible. When people have access to a strong and stable core, plus the ability to tinker with it, much more ambitious visions both within and outside Red Hat become possible. What would you do, if you could build storage that was exactly what you need?

And Now For Something Totally Random

I was talking about an XKCD comic with my wife, and realized I could do a little experiment myself. Because I’m lazy, I didn’t install gnuplot on my Mac but instead searched for an online graph generator. The first one I came up with was this one which I’m linking here mainly because I might want to point Amy at it some day. Anyway, here’s the graph I came up with.

'I got fired' vs. 'I quit my job' by day of week

I found the difference in Monday:Friday ratios amusing. At first I was also very intrigued by the fact that more people seemed to be quitting on Saturday than on Wednesday, but you won’t see that on the graph: I realized that people were quitting other things besides jobs on Saturday, so I changed my search from “I quit” to “I quit my job” and the anomaly went away. That’s an interesting illustration of how subtle changes in wording can affect experimental results. Anyway, it kept me amused for a few minutes. What’s your favorite XKCD-style Google experiment?

Efficiency, Performance, and Locality

While it might have been overshadowed by events on my other blog, my previous post on Solid State Silliness did lead to some interesting conversations. I’ve been meaning to clarify some of the reasoning behind my position that one should use SSDs for some data instead of all data, and that reasoning applies to much more than just the SSD debate, so here goes.

The first thing I’d like to get out of the way is the recent statement by everyone’s favorite SSD salesman that “performant systems are efficient systems”. What crap. There are a great many things that people do to get more performance (specifically in terms of latency) at the expense of wasting resources. Start with every busy-wait loop in the world. Another good example is speculative execution. There, the waste is certain – you know you’re not going to execute both sides of a branch – but it’s often done anyway because it lowers latency. It’s not efficient in terms of silicon area, it’s not efficient in terms of power, it’s not efficient in terms of dollars, but it’s done anyway. (This is also, BTW, why a system full of relatively weak low-power CPUs really can do some work more efficiently than one based on Xeon power hogs, no matter how many cores you put on each hog.) Other examples of increased performance without increased efficiency include most kinds of pre-fetching, caching, or replication. Used well, these techniques actually can improve efficiency as requests need to “penetrate” fewer layers of the system to get data, but used poorly they can be pure waste.

If you’re thinking about performance in terms of throughput rather than latency, then the equation of performance with efficiency isn’t quite so laughable, but it’s still rather simplistic. Every application has a certain ideal balance of CPU/memory/network/storage performance. It might well be the case that thinner “less performant” systems with those performance ratios are more efficient – per watt, per dollar, whatever – than their fatter “more performant” cousins. Then the question becomes how well the application scales up to the higher node counts, and that’s extremely application-specific. Many applications don’t scale all that well, so the “more performant” systems really would be more efficient. (I guess we can conclude that those pushing the “performance = efficiency” meme are used to dealing with systems that scale poorly. Hmm.) On the other hand, some applications really do scale pretty well to the required node-count ranges, and then the “less performant” systems would be more efficient. It’s a subject for analysis, not dogmatic adherence to one answer.

The more important point I want to make isn’t about efficiency. It’s about locality instead. As I mentioned above, prefetch and caching/replication can be great or they can be disastrous. Locality is what makes the difference, because these techniques are all based on exploiting locality of reference. If you have good locality, fetching the same data many times in rapid succession, then these techniques can seem like magic. If you have poor locality, then all that effort will be like the effort you make to save leftovers in the refrigerator to save cooking time . . . only to throw those leftovers away before they’re used. One way to look at this is to visualize data references on a plot, using time on the X axis and location on the Y axis, using Z axis or color or dot size to represent density of accesses . . . like this.


time/location plot

It’s easy to see patterns this way. Vertical lines represent accesses to a lot of data in a short amount of time, often in a sequential scan. If the total amount of data is greater than your cache size, your cache probably isn’t helping you much (and might be hurting you) because data accessed once is likely to get evicted before it’s accessed again. This is why many systems bypass caches for recognizably sequential access patterns. Horizontal lines represent constant requests to small amounts of data. This is a case where caches are great. It’s what they’re designed for. In a multi-user and/or multi-dataset environment, you probably won’t see many thick edge-to-edge lines either way. You’ll practically never see the completely flat field that would result from completely random access either. What you’ll see the most of are partial or faint lines, or (if your locations are grouped/sorted the right way) rectangles and blobs representing concentrated access to certain data at certain times.
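
A quick simulation shows the “vertical line” problem: a repeated sequential scan that is bigger than the cache gets essentially zero hits from a plain LRU cache, while a small hot set gets nearly all hits. (This is a toy model, not any particular system’s cache.)

    from collections import OrderedDict

    def lru_hit_rate(accesses, cache_size):
        cache, hits = OrderedDict(), 0
        for block in accesses:
            if block in cache:
                hits += 1
                cache.move_to_end(block)
            else:
                cache[block] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)   # evict least recently used
        return hits / len(accesses)

    scan = list(range(1000)) * 3                  # 1000 blocks, scanned 3 times
    hot = [block % 50 for block in range(3000)]   # 50 blocks, hit over and over

    print(lru_hit_rate(scan, cache_size=500))   # ~0.0: the scan defeats the cache
    print(lru_hit_rate(hot, cache_size=500))    # ~1.0: small hot set, cache wins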

Exploiting these blobs is the real fun part of managing data-access performance. Like many things, they tend to follow a power-law distribution – 50% of the accesses are to 10% of the data, 25% of the accesses are to the next 10%, and so on. This means that you very rapidly reach the point of diminishing returns, and adding more fast storage – be it more memory or more flash – is no longer worth it. When you consider time, this effect becomes even more pronounced. Locality over short intervals is likely to be significantly greater than that over long intervals. If you’re doing e-commerce, certain products are likely to be more popular at certain times and you’re almost certain to have sessions open for a small subset of your customers at any time. If you can predict such a transient spike, you can migrate the data you know you’ll need to your highest-performance storage before the spike even begins. Failing that, you might still be able to detect the spike early enough to do some good. What’s important is that the spike is finite in scope. Only a fool, given such information, would treat their hottest data exactly the same as their coldest. Only a bigger fool would fail to gather that information in the first place.
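
To put numbers on those diminishing returns, here is the arithmetic under the “50% of accesses go to the hottest 10% of the data” rule, with each further 10% of the data capturing half of whatever is left:

    def accesses_served(cached_fraction, decade=0.10, share=0.50):
        # Each successive `decade` of the data (hottest first) serves `share`
        # of the accesses that the previous decades did not.
        tranches = cached_fraction / decade
        return 1.0 - (1.0 - share) ** tranches

    for fraction in (0.10, 0.20, 0.30, 0.50):
        print(f"cache hottest {fraction:.0%} of data -> "
              f"serve {accesses_served(fraction):.1%} of accesses")
    # cache hottest 10% of data -> serve 50.0% of accesses
    # cache hottest 20% of data -> serve 75.0% of accesses
    # cache hottest 30% of data -> serve 87.5% of accesses
    # cache hottest 50% of data -> serve 96.9% of accesses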

Since this all started with Artur Bergman’s all-SSD systems, let’s look at how these ideas might play out at a place like Wikia. Wikia runs a whole lot of special-interest wikis. Their top-level categories are entertainment, gaming, and lifestyle, though I’m sure they host wikis on other kinds of subjects as well. One interesting property of these wikis is that each one is separate, which seems ideal for all kinds of partitioning and differential treatment of data. At the very grossest level, it seems like it should be trivial to keep some of the hottest wikis’ data on SSDs and relegate others to spinning disks. Then there’s the temporal-locality thing. The access pattern for a TV-show wiki must be extremely predictable, at least while the show’s running. Even someone as media-ignorant as me can guess that there will be a spike starting when an episode airs (or perhaps even a bit before), tailing off pretty quickly after the next day or two. Why on Earth would someone recommend the same storage for content related to a highly rated and currently running show as for a show that was canceled due to low ratings a year ago? I don’t know.

Let’s take this a bit further. Using Artur’s example of 80TB and a power-law locality pattern, let’s see what happens. What if we have a single 48GB machine, with say 40GB available for caching? Using the “50% of accesses to 10% of the data” pattern, only 3.125% of accesses even have to go beyond memory. No matter what the latency difference between flash and spinning disks might be, it’s only going to affect that 3.125% of accesses, so it’s not going to affect your average latency that much. Even if you look at 99th-percentile latency, it’s fairly easy to see that adding SSD up to only a few times memory size will reduce the level of spinning-disk accesses to noise. Factor in temporal locality and domain-specific knowledge about locality, and the all-SSD case gets even weaker. Add more nodes – therefore more memory – and it gets weaker still. Sure, you can assume a flatter access distribution, but in light of all these other considerations you’d have to take that to a pretty unrealistic level before the all-SSD prescription starts to look like anything but quackery.

Now, maybe Artur will come along to tell me about how my analysis is all wrong, how Wikia really is such a unique special flower that principles applicable to a hundred other systems I’ve seen don’t apply there. The fact is, though, that those other hundred systems are not well served by using SSDs profligately. They’ll be wasting their owners’ money. Far more often, if you want to maximize IOPS per dollar, you’d be better off using a real analysis of your system’s locality characteristics to invest in all levels of your memory/storage hierarchy appropriately.

Solid State Silliness

Apparently Artur Bergman did a very popular talk about SSDs recently. It’s all over my Twitter feed, and led to a pretty interesting discussion at High Scalability. I’m going to expand a little on what I said there.

I was posting to comp.arch.storage when Artur was still a little wokling, so I’ve had ample opportunity to see how a new technology gets from “exotic” to mainstream. Along the way there will always be some people who promote it as a panacea and some who condemn it as useless. Neither position requires much thought, and progress always comes from those who actually think about how to use the Hot New Thing to complement other approaches instead of expecting one to supplant the other completely. So it is with SSDs, which are a great addition to the data-storage arsenal but cannot reasonably be used as a direct substitute either for RAM at one end of the spectrum or for spinning disks at the other. Instead of putting all data on SSDs, we should be thinking about how to put the right data on them. As it turns out, there are several levels at which this can be done.

  • For many years, operating systems have implemented all sorts of ways to do prefetching to get data into RAM when it’s likely to be accessed soon, and bypass mechanisms to keep data out of RAM when it’s not (e.g. for sequential I/O). Processor designers have been doing similar things going from RAM to cache, and HSM folks have been doing similar things going from tape to disk. These basic approaches are also applicable when the fast tier is flash and the slow tier is spinning rust.
  • At the next level up, filesystems can evolve to take better advantage of flash. For example, consider a filesystem designed to keep not just journals but actual metadata on flash, with the actual data on disks. In addition to the performance benefits, this would allow the two resources to be scaled independently of one another. Databases and other software at a similar level can make similar improvements.
  • Above that level, applications themselves can make useful distinctions between warm and cool data, keeping the former on flash and relegating the latter to disk (a minimal sketch of such a policy follows this list). It even seems that the kind of data being served up by Wikia is particularly well suited to this, if only they decided to think and write code instead of throwing investor money at their I/O problems.
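
As a sketch of that last point (hypothetical names and thresholds, not anyone’s production code), an application-level placement policy can be as simple as bucketing objects by recent access count and only keeping the hot buckets on the fast tiers:

    def choose_tier(accesses_last_day, flash_threshold=100, ram_threshold=10000):
        """Pick a storage tier from an access count -- many buckets, not two."""
        if accesses_last_day >= ram_threshold:
            return "ram"
        if accesses_last_day >= flash_threshold:
            return "ssd"
        return "disk"

    # e.g. the wiki for a currently airing show vs. one canceled a year ago
    print(choose_tier(250000))   # ram
    print(choose_tier(800))      # ssd
    print(choose_tier(3))        # disk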

Basically what it all comes down to is that you might not need all those IOPS for all of your data. Don’t give me that “if you don’t use your data” false-dichotomy sound bite either. Access frequency falls into many buckets, not just two, and a simplistic used/not-used distinction is fit only for a one-bit brain. If you need a lot of machines for their CPU/memory/network performance anyway, and thus don’t need half a million IOPS per machine, then spending more money to get them is just a wasteful ego trip. By putting just a little thought into using flash and disk to complement one another, just about anyone should be able to meet their IOPS goals for lower cost and use the money saved on real operational improvements.