Canned Platypus

Saving the world one byte at a time since 2000

Archive for the ‘uncategorized’ Category

Most people nowadays seem to learn about object-oriented programming via Python or Ruby. Before that it was C++ or Java, and before that Smalltalk or (some flavor of) LISP. My first exposure to object-oriented programming was through none of these, but instead through LambdaMOO. Best known as a persistent multi-user environment, LambdaMOO also had its own fairly unusual programming language. For one thing, it was prototype-based whereas most other OOP languages tend to be class-based. This means that you never have to create a class just so you can create one instance. You just create objects, which can be used directly or (sort of) as a class, or even both. There is no distinction between virtual and non-virtual methods, nor between static and instance methods, no abstract classes or singleton-pattern nonsense. You can build almost all of these things yourself, but you don’t have to conform to just one model.
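For anyone who hasn’t used a prototype-based language, here’s a rough Python sketch of the idea. It is not MOOcode: the Proto class and the lookup and call names are invented for illustration, and the lookup rule is a simplified assumption rather than a description of how the LambdaMOO server works.

```python
class Proto:
    """A bare object: a bag of slots plus an optional parent to inherit from."""

    def __init__(self, parent=None, **slots):
        self.parent = parent
        self.slots = dict(slots)

    def lookup(self, name):
        # Walk up the parent chain until some object defines the slot.
        obj = self
        while obj is not None:
            if name in obj.slots:
                return obj.slots[name]
            obj = obj.parent
        raise AttributeError(name)

    def call(self, name, *args):
        # The verb is found by inheritance but always runs against the receiver.
        return self.lookup(name)(self, *args)


# An ordinary object, usable directly...
generic_pet = Proto(
    name="pet",
    sound="...",
    speak=lambda this: f"{this.lookup('name')} says {this.lookup('sound')}",
)

# ...and also usable as a parent, with no separate class declaration.
rex = Proto(parent=generic_pet, name="Rex", sound="woof")

print(generic_pet.call("speak"))
print(rex.call("speak"))
```

The same object serves as both a working instance and a template for its children, which is the “directly or (sort of) as a class, or even both” point above.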

However, the most interesting thing about MOOcode is its approach to permissions. Because it was designed to work in a multi-user environment, where users were often neither trusting nor trustworthy but still called each other’s code constantly, MOO needed a pretty robust permissions system. This was no lame private/protected/public model, without even the concept of an owner and thus without the ability to infer rights based on ownership. Each object, each “verb” and each property has an owner. The language includes primitives to determine the object for the previous call (caller), the operative permissions in that call (caller_perms), or even the whole call stack (callers). This lets you implement basically whatever permissions scheme you want. For example, “caller==this” is kind of like “protected” in C++, and “caller==#12345” is kind of like declaring #12345 as a “friend” likewise. At the other extreme, one could go looking further up the stack to see if the current verb is being called in some particular context even if there are multiple “unexpected” calls in between.

The most unusual thing about the MOO permission system is that every verb runs with the permissions of its author – not the user who caused it to be invoked. This is kind of like every program in UNIX being “set UID” by default, which seems crazy but actually works quite well. It makes most kinds of “trojan horse” attacks impossible, for one thing. The person who has to worry about improper access to data is also the person – the verb author/owner – who can add code to prevent it. The exact workings of MOO property ownership and inheritance were a bit strange sometimes, but most MOO programmers learned the basics and were able to secure their code pretty quickly.

Because of all these features, programming in MOOcode was a very fluid and enjoyable experience. Python comes closest among the languages I know well, though I’ve dabbled with Lua and it seems even closer. If I ever decide to spend time on inventing my own language instead of using one to get something else done, it would probably be a prototype-based OOP language with extra features for concurrency/distribution, and in that context MOO’s permission model is about as good a starting point as I’ve seen. It’s too bad that most people who could benefit from studying it are probably put off by its origins in a game-like environment.

For some pretty obvious reasons, everybody’s asking me about this already and will probably continue to do so, so I might as well get some thoughts written down in semi-coherent form. First, though, let’s take care of some administrivia.

I do not – ever – represent Red Hat online. I neither can nor want to speak for them. Also, I was not directly involved in the acquisition. I’m sure my well known opinions about Gluster helped put the idea in people’s heads, and I’m sure I’ll be quite busy helping figure out exactly where to go from here, but it would have been neither appropriate nor useful for me to have been involved in between. Everybody who was involved knew and respected that, as did I, so I was not at all surprised to read about it in public sources first. I’m posting this on my personal site instead of the HekaFS site to underscore the fact that this is my personal, unofficial opinion as someone who is affected by but not responsible for this decision.

OK, enough of that. Personally, then, I am delighted by this. Let’s enumerate some of the things I’ve been feeling and saying about Gluster and GlusterFS since I joined Red Hat and started the CloudFS/HekaFS project.

  • This is an area where open source has needed to make a stronger play vs. proprietary solutions.
  • GlusterFS has a strong overall architecture – e.g. leveraging local filesystems, adding modular functionality – for dealing with emerging needs regarding unstructured data, cloud deployment, etc. Sure, there are some parts of the implementation that I think could improve, but I’d rather build on a strong base than have to rip out and throw away stuff built on a weak one.
  • The Gluster folks are as committed to open source as Red Hat is. Not only their code but their process is open, and so are their minds. Just think for a moment how open-minded someone must be to listen when I get all opinionated about their work. Despite my abrasive style, they have always listened and responded constructively.
  • Community matters, and the very strong Gluster/FS community has been one of the best parts of my job for the last couple of years.

All of this adds up to a move that somebody in open source had to make, and these were the best two companies to make it. Proprietary Big Storage has defined the field too long. This will make it a lot easier to implement not only my vision for HekaFS, but other visions as well. Scale-out shared-nothing storage that’s easy to configure, easy to tune, easy to monitor, is a powerful tool. It can be used to serve up files – or objects – directly to users, as part of either a traditional or cloud environment. It can be used to serve up the virtual-machine (and other disk) images that are an essential part of cloud computing. It can be used for many other things besides, either as-is or via extension modules. Compression or deduplication, snapshots or versioning, custom access controls, inline format conversions . . . the sky’s the limit. Layering separate functionality on top of dumb blocks/files/objects, each oblivious to the other, is so yesterday, but it’s all that competitors will ever make possible. When people have access to a strong and stable core, plus the ability to tinker with it, much more ambitious visions both within and outside Red Hat become possible. What would you do, if you could build storage that was exactly what you need?

I was talking about an XKCD comic with my wife, and realized I could do a little experiment myself. Because I’m lazy, I didn’t install gnuplot on my Mac but instead searched for an online graph generator. The first one I found was this one, which I’m linking here mainly because I might want to point Amy at it some day. Anyway, here’s the graph I came up with.

'I got fired' vs. 'I quit my job' by day of week

I found the difference in Monday:Friday ratios amusing. At first I was also very intrigued by the fact that more people seemed to be quitting on Saturday than Wednesday. You won’t see that on the graph, though, because I realized that people were quitting other things besides jobs on Saturday, so I changed my search from “I quit” to “I quit my job” and the anomaly went away. That’s an interesting illustration of how subtle changes in wording can affect experimental results. Anyway, it kept me amused for a few minutes. What’s your favorite XKCD-style Google experiment?

While it might have been overshadowed by events on my other blog, my previous post on Solid State Silliness did lead to some interesting conversations. I’ve been meaning to clarify some of the reasoning behind my position that one should use SSDs for some data instead of all data, and that reasoning applies to much more than just the SSD debate, so here goes.

The first thing I’d like to get out of the way is the recent statement by everyone’s favorite SSD salesman that “performant systems are efficient systems”. What crap. There are a great many things that people do to get more performance (specifically in terms of latency) at the expense of wasting resources. Start with every busy-wait loop in the world. Another good example is speculative execution. There, the waste is certain – you know you’re not going to execute both sides of a branch – but it’s often done anyway because it lowers latency. It’s not efficient in terms of silicon area, it’s not efficient in terms of power, it’s not efficient in terms of dollars, but it’s done anyway. (This is also, BTW, why a system full of relatively weak low-power CPUs really can do some work more efficiently than one based on Xeon power hogs, no matter how many cores you put on each hog.) Other examples of increased performance without increased efficiency include most kinds of pre-fetching, caching, or replication. Used well, these techniques actually can improve efficiency as requests need to “penetrate” fewer layers of the system to get data, but used poorly they can be pure waste.

If you’re thinking about performance in terms of throughput rather than latency, then the equation of performance with efficiency isn’t quite so laughable, but it’s still rather simplistic. Every application has a certain ideal balance of CPU/memory/network/storage performance. It might well be the case that thinner “less performant” systems with those performance ratios are more efficient – per watt, per dollar, whatever – than their fatter “more performant” cousins. Then the question becomes how well the application scales up to the higher node counts, and that’s extremely application-specific. Many applications don’t scale all that well, so the “more performant” systems really would be more efficient. (I guess we can conclude that those pushing the “performance = efficiency” meme are used to dealing with systems that scale poorly. Hmm.) On the other hand, some applications really do scale pretty well to the required node-count ranges, and then the “less performant” systems would be more efficient. It’s a subject for analysis, not dogmatic adherence to one answer.

The more important point I want to make isn’t about efficiency. It’s about locality instead. As I mentioned above, prefetch and caching/replication can be great or they can be disastrous. Locality is what makes the difference, because these techniques are all based on exploiting locality of reference. If you have good locality, fetching the same data many times in rapid succession, then these techniques can seem like magic. If you have poor locality, then all that effort will be like the effort you make to save leftovers in the refrigerator to save cooking time . . . only to throw those leftovers away before they’re used. One way to look at this is to visualize data references on a plot, using time on the X axis and location on the Y axis, using Z axis or color or dot size to represent density of accesses . . . like this.


time/location plot

It’s easy to see patterns this way. Vertical lines represent accesses to a lot of data in a short amount of time, often in a sequential scan. If the total amount of data is greater than your cache size, your cache probably isn’t helping you much (and might be hurting you) because data accessed once is likely to get evicted before it’s accessed again. This is why many systems bypass caches for recognizably sequential access patterns. Horizontal lines represent constant requests to small amounts of data. This is a case where caches are great. It’s what they’re designed for. In a multi-user and/or multi-dataset environment, you probably won’t see many thick edge-to-edge lines either way. You’ll practically never see the completely flat field that would result from completely random access either. What you’ll see the most of are partial or faint lines, or (if your locations are grouped/sorted the right way) rectangles and blobs representing concentrated access to certain data at certain times.
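The difference between the two kinds of lines is easy to reproduce with a toy LRU cache. This is just a sketch: lru_hit_rate is a made-up helper, and the block counts and cache size are arbitrary numbers chosen to make the effect obvious.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Replay a list of block numbers through an LRU cache; return the hit fraction."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used block
    return hits / len(accesses)


scan = list(range(1000)) * 3          # "vertical line": 1000 blocks, swept three times
hot = [i % 50 for i in range(3000)]   # "horizontal line": the same 50 blocks over and over

print(lru_hit_rate(scan, capacity=100))
print(lru_hit_rate(hot, capacity=100))
```

With a 100-block cache the repeated scan never hits at all, because every block is evicted long before the sweep comes back around, while the hot set misses only on its first 50 accesses.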

Exploiting these blobs is the real fun part of managing data-access performance. Like many things, they tend to follow a power-law distribution – 50% of the accesses are to 10% of the data, 25% of the accesses are to the next 10%, and so on. This means that you very rapidly reach the point of diminishing returns, and adding more fast storage – be it more memory or more flash – is no longer worth it. When you consider time, this effect becomes even more pronounced. Locality over short intervals is likely to be significantly greater than that over long intervals. If you’re doing e-commerce, certain products are likely to be more popular at certain times and you’re almost certain to have sessions open for a small subset of your customers at any time. If you can predict such a transient spike, you can migrate the data you know you’ll need to your highest-performance storage before the spike even begins. Failing that, you might still be able to detect the spike early enough to do some good. What’s important is that the spike is finite in scope. Only a fool, given such information, would treat their hottest data exactly the same as their coldest. Only a bigger fool would fail to gather that information in the first place.
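The diminishing returns fall straight out of the arithmetic. Here’s a minimal sketch, assuming exactly the halving pattern described above (each successive tenth of the data gets half the accesses of the one before); the coverage function is a made-up name, and real workloads obviously won’t be this tidy.

```python
def coverage(tenths_cached):
    """Fraction of accesses served when the hottest N tenths of the data are on fast storage."""
    # 1/2 + 1/4 + ... + 1/2**N, which sums to 1 - 1/2**N
    return 1 - 0.5 ** tenths_cached


for n in range(1, 6):
    print(f"hottest {n * 10}% of data -> {coverage(n):.3%} of accesses")
```

Under that assumption the first tenth buys you 50% of accesses, the second only another 25%, and by the fifth you’re adding about 3% for the same amount of fast storage, leaving 3.125% for everything colder.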

Since this all started with Artur Bergman’s all-SSD systems, let’s look at how these ideas might play out at a place like Wikia. Wikia runs a whole lot of special-interest wikis. Their top-level categories are entertainment, gaming, and lifestyle, though I’m sure they host wikis on other kinds of subjects as well. One interesting property of these wikis is that each one is separate, which seems ideal for all kinds of partitioning and differential treatment of data. At the very grossest level, it seems like it should be trivial to keep some of the hottest wikis’ data on SSDs and relegate others to spinning disks. Then there’s the temporal-locality thing. The access pattern for a TV-show wiki must be extremely predictable, at least while the show’s running. Even someone as media-ignorant as me can guess that there will be a spike starting when an episode airs (or perhaps even a bit before), tailing off pretty quickly after the next day or two. Why on Earth would someone recommend the same storage for content related to a highly rated and currently running show as for a show that was canceled due to low ratings a year ago? I don’t know.

Let’s take this a bit further. Using Artur’s example of 80TB and a power-law locality pattern, let’s see what happens. What if we have a single 48GB machine, with say 40GB available for caching? Using the “50% of accesses to 10% of the data” pattern, that means only 3.125% of accesses even have to leave memory. No matter what the latency difference between flash and spinning disks might be, it’s only going to affect that 3.125% of accesses so it’s not going to affect your average latency that much. Even if you look at 99th-percentile latency, it’s fairly easy to see that adding SSD up to only a few times memory size will reduce the level of spinning-disk accesses to noise. Factor in temporal locality and domain-specific knowledge about locality, and the all-SSD case gets even weaker. Add more nodes – therefore more memory – and it gets weaker. Sure, you can assume a flatter access distribution, but in light of all these other considerations you’d have to take that to a pretty unrealistic level before the all-SSD prescription starts to look like anything but quackery.

Now, maybe Artur will come along to tell me about how my analysis is all wrong, how Wikia really is such a unique special flower that principles applicable to a hundred other systems I’ve seen don’t apply there. The fact is, though, that those other hundred systems are not well served by using SSDs profligately. They’ll be wasting their owners’ money. Far more often, if you want to maximize IOPS per dollar, you’d be better off using a real analysis of your system’s locality characteristics to invest in all levels of your memory/storage hierarchy appropriately.

Apparently Artur Bergman did a very popular talk about SSDs recently. It’s all over my Twitter feed, and led to a pretty interesting discussion at High Scalability. I’m going to expand a little on what I said there.

I was posting to comp.arch.storage when Artur was still a little wokling, so I’ve had ample opportunity to see how a new technology gets from “exotic” to mainstream. Along the way there will always be some people who promote it as a panacea and some who condemn it as useless. Neither position requires much thought, and progress always comes from those who actually think about how to use the Hot New Thing to complement other approaches instead of expecting one to supplant the other completely. So it is with SSDs, which are a great addition to the data-storage arsenal but cannot reasonably be used as a direct substitute either for RAM at one end of the spectrum or for spinning disks at the other. Instead of putting all data on SSDs, we should be thinking about how to put the right data on them. As it turns out, there are several levels at which this can be done.

  • For many years, operating systems have implemented all sorts of ways to do prefetching to get data into RAM when it’s likely to be accessed soon, and bypass mechanisms to keep data out of RAM when it’s not (e.g. for sequential I/O). Processor designers have been doing similar things going from RAM to cache, and HSM folks have been doing similar things going from tape to disk. These basic approaches are also applicable when the fast tier is flash and the slow tier is spinning rust.
  • At the next level up, filesystems can evolve to take better advantage of flash. For example, consider a filesystem designed to keep not just journals but actual metadata on flash, with the actual data on disks. In addition to the performance benefits, this would allow the two resources to be scaled independently of one another. Databases and other software at a similar level can make similar improvements.
  • Above that level, applications themselves can make useful distinctions between warm and cool data, keeping the former on flash and relegating the latter to disk. It even seems that the kind of data being served up by Wikia is particularly well suited to this, if only they decided to think and write code instead of throwing investor money at their I/O problems.
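At the application level, the warm/cool split can be as simple as counting accesses and filling flash with the hottest items first. The following is a deliberately naive sketch: place is a made-up function, the item names and sizes are hypothetical, and a real system would also have to consider recency, migration cost and so on.

```python
from collections import Counter


def place(access_log, sizes_gb, flash_gb):
    """Greedy placement: hottest items go to flash until it is full, the rest to disk."""
    counts = Counter(access_log)
    placement, free = {}, flash_gb
    # Hottest first; ties broken by name so the result is deterministic.
    for item in sorted(sizes_gb, key=lambda k: (-counts[k], k)):
        if sizes_gb[item] <= free:
            placement[item] = "flash"
            free -= sizes_gb[item]
        else:
            placement[item] = "disk"
    return placement


# Hypothetical per-wiki access log and sizes.
log = ["show_a"] * 500 + ["game_b"] * 120 + ["show_c"] * 3
sizes = {"show_a": 40, "game_b": 50, "show_c": 30}

print(place(log, sizes, flash_gb=100))
```

With 100GB of flash, the two busy items land on flash and the nearly idle one goes to disk, which is the whole point: the expensive tier only has to be big enough for the data that earns it.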

Basically what it all comes down to is that you might not need all those IOPS for all of your data. Don’t give me that “if you don’t use your data” false-dichotomy sound bite either. Access frequency falls into many buckets, not just two, and a simplistic used/not-used distinction is fit only for a one-bit brain. If you need a lot of machines for their CPU/memory/network performance anyway, and thus don’t need half a million IOPS per machine, then spending more money to get them is just a wasteful ego trip. By putting just a little thought into using flash and disk to complement one another, just about anyone should be able to meet their IOPS goals for lower cost and use the money saved on real operational improvements.

Today’s XKCD reminded me of an idea I thought of the other day. I’m sure most of my readers have encountered triangles before. X, Y, Z: pick two. Probably the best known is good/fast/cheap, but there’s the CAP Theorem, Zooko’s Triangle, even a few I’ve devised. So here’s my meta-triangle, of properties a triangle itself (especially in the customary equilateral presentation) might have.

  • Easy to understand.
  • Useful or insightful.
  • Accurately reflects reality.

Pick two. In particular, pick which two properties that triangle itself has, and which one it lacks. Ouch, now my brain hurts.

I’m in Tempe for the Fedora Users and Developers Conference, a.k.a. FUDCon. Here are some random thoughts.

  • Enhanced pat-downs aren’t so bad.
  • The weather’s nice. I should have expected the palm trees, but I totally didn’t expect to see orange trees with ripe fruit hanging just out of arm’s reach (because the ASU students picked everything lower already).
  • The ASU campus is much more interesting and varied architecturally than any other campus I’ve been on. Sure, the color palette is a bit limited – light brown, dark brown, reddish brown – but the shapes and textures make up for it. Actually there was one nice splash of color, which was a gigantic wild rose bush clinging to the side of one building. That ugly bump just north of campus doesn’t do much for me, though.
  • I haven’t seen a single squirrel on campus. I did see two cats, though – fluffy persians who must be very uncomfortable in all this heat. I’ve seen and heard lots of unfamiliar birds, too – mostly grackles, I think.
  • Meeting people in person is great. The Fedora crowd is notably casual, international, and friendly – even by technical-conference standards, in all three regards. I’d particularly like to thank Robyn Bergeron and Seth Vidal, very busy leaders in that community who have nonetheless gone out of their way to make me feel welcome and included. It was also especially nice to meet Pete Zaitcev and Major Hayden, because we’ve interacted so much online but never met until now.

Here’s a Flickr set for some of the pictures I’ve taken while here. OK, enough of the fluff. What about the real stuff? More bullet points, because that’s how I roll.

  • The whole “Bar Camp” style of pitching and voting on sessions was new to me. It did seem to work, though.
  • The first talk I attended was Marek Goldmann talking about BoxGrinder. I was pretty familiar with this work from my own involvement with Deltacloud/Aeolus, but Marek deserves kudos for presenting it well and even giving a live demo.
  • After lunch, it was Steven Dake talking about Sheepdog. Again, it’s work I’m familiar with. I think Steven and I will never quite agree on the value/importance of Sheepdog. On the one hand, the notion of distributed block storage has been very appealing to me for a long time. It’s why I went to Conley in 1998, and worked on C3D at EMC a few years later. On the other hand, block storage using a single specialized application interface which isn’t even as complex as the real system-level block device interface seems a bit unambitious to me. It just limits the applicability of the result too much IMO, and that seems a meager payoff for all that work solving the harder distributed-data problems. Of course, in this case it’s all NTT’s effort anyway. As far as the talk, a comparison to RBD would have been nice since anybody who’s interested in one should definitely check out the other as well.
  • Next up was Mike McGrath, talking about how cloud computing is going to displace non-cloud computing. Even as somebody who’s working on cloud stuff, I’m a little bit skeptical. Still, it was a good talk to get people thinking about all the implications.
  • I’ll skip the next talk, since it was mine and I’ll have more to say below.
  • The last talk of the day, for me, was Chris Lalancette talking about cloud management – especially Deltacloud and Aeolus. Having worked for a while on this project (and sitting about twenty feet from Chris most days) this was also pretty familiar territory, and Chris did a good job presenting on a complex subject. I apologize to both him and to Tobias Kunze (with whom I had an awesome chat later in the evening BTW) for putting them on the spot about the relationship between Makara and Aeolus.

So, how did my own presentation go? Somebody pointed out that I’d seemed a bit on edge the night before. Partly that was just the stress of travel and of being an introvert mingling with an unfamiliar group of people, but there’s another factor that I hadn’t even consciously realized until I was writing this post. I’ve presented about CloudFS privately and/or in fairly abstract terms so many times that I’d actually forgotten this was the first truly public presentation about a concrete thing that I’ll actually be delivering in the near future. That’s a big deal. I was a bit concerned at first because they’d put me in the largest room and at five past the hour it was still three-quarters empty. Nobody likes talking to an empty room. Shortly after I started, though, the room was pretty much full – not standing-room-only full, but I don’t remember seeing many empty seats. Not that I was trying too hard to count, of course; I was otherwise occupied. Even better, people were engaged. There were many questions, and they were good questions – questions that to me indicated genuine curiosity and constructive intent, not just the “I’m going to prove I’m smart” or “if you don’t get this one right your project will look silly” kinds of questions that one often gets. The post-presentation chatter even went on so long that Chris had to kick us away from the lectern. Good problem to have. :)

The best part of all, in my opinion, was outside of the talk itself. In at least two other presentations, and in even more hallway conversations, the possibility of using CloudFS to solve some problem or add some functionality came up. Also, at least one person had clearly given the code a pretty detailed look since my talk, asking questions and making comments about internal details that he could not have known about otherwise. That is so cool. It’s all very well to have people’s attention for an hour or so before people move on to the next new thing, but when something you’ve talked about shows up in colleagues’ own thinking about how to solve their own problems that’s an even surer measure of being on the right track. Thank you, everyone, for letting me be part of the broader progress we’re all making together.

Every year at this time, my scary pumpkin post starts getting a lot of hits. This year, I have a couple of pumpkin-carving ideas I’d actually consider implementing myself (though the subject hasn’t come up at home and I won’t be the first to mention it).

Big Mac pumpkin
(from Neatorama)
Death Star pumpkin
(from Fantasy Pumpkins)

About eight years ago, I wrote a series of posts about server design, which I then combined into one post. That was also a time when debates were raging about multi-threaded vs. event-based programming models, about the benefits and drawbacks of TCP, etc. For a long time, my posts on those subjects constituted my main claim to fame in the tech-blogging community, until more recent posts on startup failures and CAP theorem and language wars started reaching an even broader audience, and that server-design article was the centerpiece of that set. Now some of those old debates have been revived, and Matt Welsh has written a SEDA retrospective, so maybe it’s a good time for me to follow suit to see what I and the rest of the community have learned since then.

Before I start talking about the Four Horsemen of Poor Performance, it’s worth establishing a bit of context. Processors have actually not gotten a lot faster in terms of raw clock speed since 2002 – Intel was introducing a 2.8GHz Pentium 4 then – but they’ve all gone multi-core with bigger caches and faster buses and such. Memory and disk sizes have gotten much bigger; speeds have increased less, but still significantly. Gigabit Ethernet was at the same stage back then that 10GbE is at today. Java has gone from being the cool new kid on the block to being the grumpy old man the new cool kids make fun of, with nary a moment spent in between. Virtualization and cloud have become commonplace. Technologies like map/reduce and NoSQL have offered new solutions to data problems, and created new needs as well. All of the tradeoffs have changed, and of course we’ve learned a bit as well. Has any of that changed how the Four Horsemen ride?

And now for something completely different…

I got this camera a few months ago after I’d lost its predecessor, and I think it has become my favorite out of the half-dozen or so digital cameras I’ve had over the years. You can find specs etc. anywhere, but here are some of the highlights from my perspective.

  • It’s small, easy to use, and takes video as well as stills, so it’s very convenient to bring everywhere.
  • Battery life seems excellent.
  • Start-up and between-picture times are better than average. This was one of my main selection criteria.
  • It doesn’t have any of the focus, exposure, or color-balance problems that I’ve seen in other cameras (especially its immediate predecessor).
  • When I zoom in, the pictures seem remarkably noise-free, so they resize well (using iPhoto – Facebook’s auto-resize is amazingly bad so don’t judge by that) and I can often skip a clean-up stage.

That’s really it. I’m no expert, just a casual family photographer, but that’s a description that fits many people so maybe someone will find my positive experience useful.