Canned Platypus

Making the world better, one byte at a time.

Archive for the ‘design’ Category

About eight years ago, I wrote a series of posts about server design, which I then combined into one post. That was also a time when debates were raging about multi-threaded vs. event-based programming models, about the benefits and drawbacks of TCP, etc. For a long time, my posts on those subjects constituted my main claim to fame in the tech-blogging community, until more recent posts on startup failures and CAP theorem and language wars started reaching an even broader audience, and that server-design article was the centerpiece of that set. Now some of those old debates have been revived, and Matt Welsh has written a SEDA retrospective, so maybe it’s a good time for me to follow suit to see what I and the rest of the community have learned since then.

Before I start talking about the Four Horsemen of Poor Performance, it’s worth establishing a bit of context. Processors have actually not gotten a lot faster in terms of raw clock speed since 2002 – Intel was introducing a 2.8GHz Pentium 4 then – but they’ve all gone multi-core with bigger caches and faster buses and such. Memory and disk sizes have gotten much bigger; speeds have increased less, but still significantly. Gigabit Ethernet was at the same stage back then that 10GbE is at today. Java has gone from being the cool new kid on the block to being the grumpy old man the new cool kids make fun of, with nary a moment spent in between. Virtualization and cloud have become commonplace. Technologies like map/reduce and NoSQL have offered new solutions to data problems, and created new needs as well. All of the tradeoffs have changed, and of course we’ve learned a bit as well. Has any of that changed how the Four Horsemen ride?
Read the rest of this entry »

I was recently drawn into another discussion about a claim that project Foo was faster than project Bar because Foo is written in C (or maybe C++) and Bar is written in Java. In my experience, as a long-time kernel programmer and as someone who often codes in C even when there are almost certainly better choices, such claims are practically always false. The speed at which a particular piece of code executes only has a significant effect if your program can find something else to do after that piece is done – in other words, if your program is CPU-bound and/or well parallelized. Most programs are neither. The great majority of programs fit into one or more of the following categories.

  • I/O-bound. Completing a unit of work earlier just means waiting longer for the next block/message.
  • Memory-bound. Completing a unit of work earlier just means more time spent thrashing the virtual-memory system.
  • Synchronization-bound (i.e. non-parallel). Completing a unit of work earlier just means waiting longer for another thread to release a lock or signal an event – and for the subsequent context switch.
  • Algorithm-bound. There’s plenty of other work to do, and the program can get to it immediately, but it’s wasted work because a better algorithm would have avoided it altogether. We did all learn in school why better algorithms matter more than micro-optimization, didn’t we?

If you look at this excellent list of performance problems based on real-world observation, you’ll see that most of the problems mentioned (except #5) fit this characterization and wouldn’t be solved by using a different language. It’s possible to run many synchronization-bound programs on one piece of hardware, with or without virtualization, but the fewer resources these programs share the more likely it becomes that you’ll just become memory-bound instead. On the flip side, if a program is purely disk-bound or memory-bound then you can obtain more of those resources by distributing work across many machines, but if you don’t know how to implement distributed systems well you’ll probably just become network-bound or synchronization-bound. In fact, the class of programs that exhibit high sensitivity to network latency – a combination of I/O-boundedness and synchronization-boundedness – is large and growing.

So, you have a program that uses efficient algorithms with a well-parallelized implementation, and it’s neither I/O-bound nor memory-bound. Will it be faster in C? Yes, it very well might. It might also be faster in Fortran, which is why many continue to use it for scientific computation but that hardly makes it a good choice for more general use. Everyone thinks they’re writing the most performance-critical code in the world, but in reality maybe one in twenty programmers are writing code where anything short of the most egregious bloat and carelessness will affect the performance of the system overall. (Unfortunately, egregious bloat and carelessness are quite common.) There are good reasons for many of those one in twenty to be writing their code in C, but even then most of the reasons might not be straight-line performance. JIT code can be quite competitive with statically compiled code, and even better in many cases, once it has warmed up, but performance-critical code often has to be not only fast but predictable. GC pauses, JIT delays, and unpredictable context-switch behavior all make such languages unsuitable for truly performance-critical tasks, and many of those effects remain in the runtime libraries or frameworks/idioms even when the code is compiled. Similarly, performance-critical code often needs to interact closely with other code that’s already written in C, and avoiding “impedance mismatches” is important. Most importantly, almost all programmers need to be concerned with making their code run well on multiple processors. I’d even argue that the main reason kernel code tends to be efficient is not because it’s written in C but because it’s written with parallelism and reentrancy in mind, by people who understand those issues. A lot of code is faster not because it’s written in C but for the same reasons that it’s written in C. It’s common cause, not cause and effect. The most common cause of all is that C code tends to be written by people who have actually lived outside the Java reality-distortion bubble and been forced to learn how to write efficient code (which they could then do in Java but no longer care to).

For those other nineteen out of twenty programmers who are not implementing kernels or embedded systems or those few pieces of user-level infrastructure such as web servers (web applications don’t count) where these concerns matter, the focus should be on programmer productivity, not machine cycles. “Horizontal scalability” might seem like a euphemism for “throw more hardware at it” and I’ve been conditioned to abhor that as much as anyone, but hyper-optimization is only a reasonable alternative when you have a long time to do it. Especially at startups, VC-funded or otherwise, you probably won’t. Focus on stability and features first, scalability and manageability second, per-unit performance last of all, because if you don’t take care of the first two nobody will care about the third. If you’re bogged down chasing memory leaks or implementing data/control structures that already exist in other languages instead of on better algorithms or new features, you’re spending your time on the wrong things. Writing code in C(++) won’t magically make it faster where it counts, across a whole multi-processor (and possibly multi-node) system, and even if it did that might be missing the point. Compare results, not approaches.

Caching is a neat trick. It’s a really really neat trick, using one level of the storage hierarchy to hide or work around slowness in the next, and we wouldn’t be able to do much of what we do without caches. They’re still a trick, though, because the illusion that they create always fades eventually. Inconsistencies start to creep in, or the cost of maintaining consistency starts to outweigh the advantage of having caches in the first place. Even if you manage to avoid both of those problems, request rates or problem sizes will practically always increase faster than cache sizes until those caches are too small and you’d better have another solution ready. HPC folks have known about this for a long time. They don’t think much of CPU caches, because the kind of stuff they do blows through those in no time. Increasingly, their attitude toward in-memory caching of disk data is the same, because they tend to blow through those caches too. After a few rounds of this, it’s natural to stop thinking of caches as performance magic and start thinking of them as just another tool in the box.

Some folks seem to have trouble learning that lesson. Even if they’ve learned it about computational tasks, they still cling to a belief that caches can last forever when it comes to I/O. Most recently, I’ve seen this in the form of people claiming that modern applications don’t need fast I/O because they can just keep everything in (distributed) memory. The H-Store papers behind some of this advocacy make some good points regarding OLTP specifically, but OLTP is only one kind of computing. While OLTP workloads are still extremely common and important, they account for an ever-decreasing share of storage capacity and bandwidth. Let’s run some numbers to see how inapplicable those conclusions are to most people. A modern server such as the IBM x3650 M2 can accommodate 128GB in 2U. That’s barely 1TB per rack after you account for the fact that some of the space has to be used for a top-of-rack switch to provide bandwidth to that data, and that you have to replicate the in-memory data for some semblance of resistance to faults (though many would say even that is far from sufficient for serious work). The SiCortex systems were built for high density, and even they only got to about 8TB in something less than 3x the physical space. Those are piddling numbers for storage, when installations with petabytes are no longer uncommon. It’s also a very costly way to get that much capacity, paying for RAM and the processors that go with it and the power to run both. It better be worth it. Is it? Only if some small and easily identified part of your data is red-hot, and the rest is all ice-cold, and your operations on that data distribute nice and evenly so you don’t have communication hot spots that would make memory vs. disk irrelevant. That’s a pretty limited set of conditions.

Still not convinced? OK, look at it this way. A lot of people have workloads more like Facebook than TPC-C, and it just so happens that there’s some information out there about how well caching has worked for Facebook. According to James Hamilton, as of last April Facebook was serving 475K images per second out of 6.5B total. Those numbers are certainly higher today – I’ve heard 600K and 15B – but they’re the basis for some other interesting numbers so let’s stick with them. Using 1TB of memory cache across 40 servers they get a 92% hit rate. That’s still multiple gigabytes per second that have to be served from disk – just for pictures, a year ago. For everything, today, they surely need dozens of servers to achieve the necessary bandwidth, and dozens of disks per server to achieve the necessary capacity. Facebook is far from typical, but also far from the “in-memory everything” crowd’s model and moving further away every day. Where Facebook is leading, many others will follow.

Sooner or later, your caches (including memory used as cache rather than as a functionally different operational resource) won’t be big enough to provide adequate performance if the next level in your storage hierarchy is too slow. Caches can still be useful as a way to optimize use of that next level, such as by avoiding redundant requests or turning random small-block I/O into large-block sequential, but they can’t hide its weaknesses. Parallelism and horizontal distribution are the real long-term keys to scalability, for I/O as well as for computation. If you really want scalability, you have to put parallelism at the center of your design with caches in a supporting role.

In the context of responding to a post about C++, I realized that part of what I was addressing was the fairly common attitude that brevity in a programming language was indicative of the language’s power or expressiveness. This is common in many communities, especially among perl programmers, but probably its best known expression is from Paul Graham.

making programs short is what high level languages are for. It may not be 100% accurate to say the power of a programming language is in inverse proportion to the length of programs written in it, but it’s damned close. Imagine how preposterous it would sound if someone said “The program is 10 lines of code in your language and 50 in my language, but my language is more powerful.” You’d be thinking: what does he mean by power, then?

Like many high-profile programmers, Paul tends to assume that if he can’t think of an answer then nobody can. He almost certainly meant the above as a rhetorical question with no good answer, but in fact it’s not hard to answer at all. A diesel-powered truck is likely to be more powerful than a Prius. It might take more to start it up, it might be the wrong vehicle for a daily commute or a trip to the grocery store, but once it gets going it can do things that a Prius never could. In other words, power has to do with ultimate capability, not initial cost. What if modifying the 10-line example in Paul’s example to run across many processors required increasing its size by an order of magnitude, but modifying the 50-line example required not one extra line because half of what the original fifty lines did was set up the infrastructure for that? Which language is more powerful now? This same argument has recently played out at the server-framework level in discussions of Twisted vs. Tornado. Twisted is more complex, it’s harder to write simple apps in it, but few would last long arguing that it’s not also more powerful. (I’m not actually a big Twisted fan, BTW, but it does illustrate this particular point well.) Writing a shorter “hello world” program is not interesting. Writing a shorter complete application that does something real, in a real world where performance and scalability and robust handling of unusual conditions all matter, is much closer to the true measure of a language’s (or framework’s) power.

I say “much closer” because brevity does not truly equal power even in the context of a serious program. Part of my initial point about C++ is that so much of its brevity is bad brevity. If you have deep class hierarchies with complex constructors, and you use lots of operator overloading, then a single line of your C++ code might translate into many thousands of instructions. That same line of C++ code might, under other circumstances, result in only a few instructions. The problem is largely that the person writing that line might not know – might not even be able to find out without trying – what the results will be with respect to instructions and cache misses and page faults and stack depth and all the other things that it might be important to know. I would modify Graham’s claim by saying that recognizable and predictable brevity is an indicator of programming-language power. Any programmer in a given language will immediately recognize that certain constructs might cause lots of code to be emitted by a compiler or executed by an interpreter, most often by explicitly redirecting the flow of execution – loops, function calls, throwing exceptions, etc. Decent programmers in just about any language know that operating on large data structures, even to assign them, might incur a large cost. They don’t need to know how someone else wrote their part of the program to know which operations are likely to be expensive; they just need to know the language itself. Contrast this with C++, where a simple declaration or assignment of a simple object might or might not – through the “magic” of constructors and destructors, templates, smart pointers, and so on – take whole milliseconds to execute.

If a language allows you to express a complicated operation simply, with no ambiguity and with the execution complexity clearly recognizable by any competent programmer in that language, then that might be a legitimate marker of the language’s power and expressiveness. Paul Graham’s Arc language might in fact be considered powerful by that standard. On the other hand, if understanding a single simple line of code’s likely execution complexity requires careful study of code elsewhere, then that language might not be more powerful or expressive. It might just be more obfuscated, or even broken. C++ completely fails this test, though it’s worth noting that close cousins on either side – C and Java – do much better. Even perl does better, despite being terrible in other ways. The real point of brevity is not fewer lines or characters, but more efficient communication of intent and effects to another programmer. If your three lines communicate those things clearly, then congratulations. If they just leave the reader confused or needing to look elsewhere, then you have abused whatever language you’re writing in. If that language makes it inevitable for abuse to crowd out good use, then it is a bad language and programmers interested in writing good code should avoid it.

At the end of my original actor-model post, I suggested that I might post some example code to show how it works. More importantly, I want to show how it avoids some common problems that arise in the lock-based model, and makes others at least a little bit more tractable. Before we dive in, then, I need to explain some of what those problems are and how they happen.
Read the rest of this entry »

All of a sudden, everybody is talking about concurrent programming. The driving factor, of course, is the fact that multi-core processors have moved beyond the server room, into the machines people use for web surfing and games. Whether that’s a good kind of evolution is a question best left for another time. Right now, much of the debate is over how to make it possible for programmers to take full advantage of all that extra processing power. Because of what’s driving it, most of this conversation is focused on systems where all processors share equal access to a single memory space. In fact, neither the access times nor the memory layout are always completely uniform, but that’s not really important here. In the past, I’ve written quite a bit about programming in this kind of environment. It’s probably why many of you are here, but today I’ll be going in a slightly different direction. I say “slightly” because I won’t be going to the other extreme – internet scale with dozens-of-millisecond latencies and high node/link failure rates – either. For now, I’ll be somewhere in between – clusters of machines with separate memory, connected with some kind of modern interconnect such as InfiniBand or 10GbE.
Read the rest of this entry »

Jul
7
Scalability

One of the luxuries that I’ve gotten used to at some of the places I’ve worked is being surrounded by people who understand scalability. In my recent job hunt and other between-job activities, I’ve been reminded that a great many people in this industry lack such understanding. So, here’s the thing. If you draw a graph of a system’s component (e.g. server) count on the x axis, and aggregate performance on the y axis, then

Scalability is about the slope, not the height.

The line for a scalable system will have a positive slope throughout the range of interest. By contrast, the line for a non-scalable system will level off or even turn downward as contention effects outweigh parallelism. Note that a non-scalable system might still outperform a scalable system throughout that range of interest, if the scalable system has poor per-unit performance. Scalability is not the same as high performance; it enables higher performance, but does not guarantee it.

Similarly, building a scalable system might also increase single-request latency or decrease single-stream bandwidth due to higher levels of complexity or indirection. It would be wrong, though, to dismiss a scalable design based on such concerns. If low single-request latency or high single-stream bandwidth are hard requirements with specific numbers attached, then the more scalable system might not suit that particular purpose, but in the majority of cases it’s the aggregate requests or bytes per second that matter most so it’s a good tradeoff. The key to scalability is enabling the addition of more servers, more pipes, more disks, more widgets, not in making any one server or pipe etc. faster. Can you make better use of a network by using RDMA instead of messaging? Sure, that’s nice, it might even be all that’s needed to reach some people’s goals, but it’s a complete no-op where scalability is concerned. Ditto for “parallel” filesystems that only make data access parallel but do nothing to address the metadata bottleneck.

Scaling – more precisely what the true cognoscenti would recognize as horizontal scaling – across all parts of a system is an important key to performance in scientific and enterprise computing, in the grid and the cloud. That’s why all the biggest systems in the world, from Google and Amazon to Roadrunner and Jaguar, work that way. Anybody who doesn’t grasp that, in fact anyone who hasn’t internalized it until it’s instinct, is not qualified to be designing or writing about modern large computer systems. (Ditto, by the way, for anybody who thinks a hand-wave about their favorite protocol possibly allowing such scalability is the same as that protocol and implementation explicitly supporting it. Such people are frauds, to put it nicely.)

I’ll probably be writing more about scaling issues for a while, for reasons that will soon become apparent. Watch this space.

It was kind of fun getting involved in the fsync debate, so now I’ll get involved in another ongoing controversy: software transactional memory. If you haven’t already heard of STM, Wikipedia has a good intro – albeit one obviously written by an STM proponent. Such proponents often try to position STM as the solution to all of our concurrent-programming problems. However, as often happens when the advocacy for something becomes too strident, there seems to be a growing backlash from people pointing out the supposed panacea’s limits and problems.

One example of this is Patrick Logan’s rant about STM, and the ensuing discussion at Lambda the Ultimate. In the discussion, there seems to be a general consensus that locking and shared state don’t scale very well, but then there’s a split between those who propose STM as a solution and those who propose more Erlang-like “shared nothing” approach based on messaging. Every once in a while, but too rarely in my opinion, somebody seems to notice that putting operations under a lock and putting them inside a transaction are actually very similar. Transactions offer some desirable properties such as isolation and composability, but the two approaches have much in common and require similar diligence from the programmer. Races, deadlock/livelock, starvation, and so on are still entirely possible with STM if the programmer is careless – and we all know they are.

On LtU, Tim Sweeney weighs in on the pro-STM side.

You could implement this using message passing, but you’d be writing and debugging your own ad-hoc transactional protocol for each of the tens of thousands of updates and state transitions present in the program.

I understand his concern. Data-consistency problems are often unrecognized as such by programmers who lack the proper training or experience. People given messaging often end up implementing their own kind of distributed state sharing, whether they realize they’re doing it or not, and almost invariably implement it poorly so that they have stale or inconsistent data everywhere. Why not give them STM-based state sharing that avoids at least some of these problems? Because each transaction having its own view creates the same problems as each process having its own view, and becomes more likely as people take advantage of all the composability that is supposedly such a great feature of STM.

Also, and perhaps more importantly to me than to some, we have to consider the distributed case. Distributed STM (DSTM) is essentially the same problem as software distributed shared memory (SDSM). I’ve actually worked on SDSM systems, so I’m pretty familiar with its problems and limits. Over the years I’ve watched one group after another make grandiose claims that they’ve solved the problems and overcome the limits, only to have the actual results disappoint. That’s why I view claims such as ScaleMP’s with some skepticism, and claims backed by even less proof than theirs with something like derision. Show me. Tell me the details, and show me the results – on real applications, not some rigged comparison where an optimized STM implementation is compared to an unoptimized lock-based one to achieve the desired result. Until then, I’ll continue to believe that performance of DSTM is likely to be so bad that programmers will avoid it in most (if not all) cases despite its supposed advantages, so at most DSTM will be relegated to niche use.

Experience trumps theory, of course, so it’s interesting that the most empirical observation I saw in these discussions was from Mental on Patrick’s site.

Having written several implementations of both STM and various message-passing-based concurrency models for Ruby lately, I’m a lot less sunny on STM than I used to be even a few weeks ago.

I was having fun implementing STM until I realized that I was able to implement the Actor model correctly in a few hours, versus several weeks for getting the fiddly aspects of STM down.

The biggest immediate problem for STM is starvation — a large transaction can just keep retrying, and I’m not sure there is a way to address that without breaking composability. …and composability is the whole raison d’etre of STM in the first place.

That’s a pretty good example of STM failing to solve a problem – in this case process starvation – that its proponents commonly portray as characteristic of lock-based systems. How, indeed, does one guarantee fairness and forward progress, or even avoid livelock, in an optimistic-concurrency model where you just retry everything when contention is detected? I’m sure somebody has come up with solutions. I’m equally sure those solutions are rarely implemented in actual STM code, so real programmers will encounter these real and hard-to-debug problems when they try to follow the STM advocates’ advice.

Probably the most damning point about the limitations of STM was made by Sriram Srinivasan on LtU.

Second, there are many areas that don’t come under the purview of STM (file activity, screen updates, real db transactions with different tx semantics). They don’t work under an arbitrary retry model, unlike in a locking case where you know you own the critical section.

I love the idea of STM, particularly MVCC STM, and have spent a lot of enjoyable time musing about how I would implement it or use it in programs. Nonetheless, after reading all this, I can’t help but conclude that STM might be too limited to be of much interest to me in the things I do. It might be great for computation within a single process (Tim Sweeney’s example is a good one) but it seems to fall apart in the distributed case or where side effects like I/O are important. Maybe there are some constrained distributed cases in which the benefits would outweigh the costs. After all, many people do useful work within the messaging constraints imposed by MPI. A lot of big simulations are done by partitioning objects or physical regions between processes, having each process calculate one time-step’s worth of change for its objects/regions, then all getting together to share the results before launching the next time step. Maybe some restricted form of DSTM would allow all of this to happen more efficiently using transactions instead of barriers, and displace MPI in some cases. For the more general case, though, I remain rather skeptical.

I’ve been tempted a few times to sketch out what I think a modern parallel filesystem should look like, building on both my own experience and that of others who have worked in this space. While far from a complete description, here are some hints at the directions I’d take.
Read the rest of this entry »

This morning in the shower and on the way to work, I was musing about the difference between component performance and system performance in computing, and particularly about the role that efficiency plays in the latter. It often seems that one can make a small piece of a system perform better by using resources profligately, but making the whole system perform better often requires the exact opposite. Let me give an example. In many situations, two processes or nodes that are communicating with one another can achieve maximum communications performance when the receiver spins waiting for an indication that a message has arrived. However, that completely wastes the processor on which the receiver is spinning, and often the spender spins too waiting for an acknowledgement. Some people think the answer is to have communicating processes synchronize so that they’re sending and receiving at the same time and don’t need to wait for long before they go back to computation. Blech. Synchronized waste is still waste. Besides that, such an approach is hard to implement even when everything from applications down to hardware is designed with that in mind, and it can hardly be considered general when a great many applications simply do not lend themselves to such synchronization at all. That approach is fundamentally flawed. In contrast to this micro-optimizers’ favorite fantasy, the systems-level answer is to make sending and receiving messages as resource-efficient as possible so that applications can get some useful work done without spinning at all. Asynchrony plays a big role here, because by now everyone knows that parallelism is essential to high performance and a parallel program that relies on frequent or tight synchronization isn’t really parallel while it’s synchronizing.

If you want an even more concrete example of this correspondence between resource efficiency and performance, look no further than the Green 500, which lists the most efficient of Top 500 fastest computers in the world according to efficiency instead of absolute performance. If you look at the latest list from June 2008, you can hardly help but notice that only two system types are represented in the top ten – the top three are all based on the PowerXCell 8i and QS22, while the next seven are all based on BlueGene/P (both IBM architectures). One of those top three is Roadrunner, which also happens to be the fastest computer on the planet. Performance and efficiency are not only compatible but complementary. In a particularly amusing example of synchronicity, after thinking about this on the drive in I found a reference to IBM’s Kittyhawk project in my mailbox. Clearly, IBM is continuing to apply the lesson that you can’t make it big unless you make it efficient.

Since many readers here know I work at SiCortex, which is also all about a combination of efficiency and performance, I’ll point out here that they had nothing to do with this article. In fact, the “consume resources to boost performance” idea I lambaste above is often the subject of debates that I have with my colleagues and some of them will undoubtedly take issue with what I say. What I will point out is that our latest systems would probably be less than 1% short of the top spot on the Green 500, putting them in fourth place; the reason they don’t appear on that list currently is that being in the Top 500 is a prerequisite (something colleague Matt Reilly has written about more eloquently than I could) and we haven’t built any systems that large. Our next-gen systems might change that and eclipse the current MFLOPS/W numbers as well, but of course the rest of the market – especially IBM who seem to be the only others who really Get It – won’t be standing still either so nobody knows what the list will look like in 2010 except that the numbers will be higher.