Server Design in Serbo-Croatian

Ten and a half years ago, I wrote an article on server design. Considering that I probably worked harder on that than on anything I’ve posted since, I’m pleased that it has continued to be one of the most popular articles on the site despite its increasing age. It’s even more gratifying to know that some people are including it in their academic course materials – maybe half a dozen instances that I know of. Now, thanks to Anja Skrba, it has been translated into Serbo-Croatian. I’m totally unqualified to comment on the accuracy of the translation, but it’s a commendable contribution to the community nonetheless. Feedback on either the content or the translation would surely be appreciated by both of us. Thank you, Anja!

The “Gather, Prepare, Cook, Eat” Design Pattern

This post is actually about an anti-pattern, which I’ll call the “grazing” pattern. Code wanders around, consuming new bits of information here and there, occasionally excreting new bits likewise. This works well “in the small” because all of the connections between inputs and outputs are easy to keep in your head. In code that has to make complex and important choices, such as what response to a failure will preserve users’ data instead of destroying it, such a casual approach quickly turns into a disaster. You repeatedly find yourself in some later stage of the code, responsible for initiating some kind of action, and you realize that you might or might not have some piece of information you need based on what code path you took to get there. So you recalculate it, or worse you re-fetch it from its origin. Or it’s not quite the information you need so you add a new field that refers to the old ones, but that gets unwieldy so you make a near copy of it instead (plus code to maintain the two slightly different versions). Sound familiar? That’s how (one kind of) technical debt accumulates.

Yes, of course I have a particular example in mind – GlusterFS’s AFR (Automatic File Replication) translator. There, we have dozens of multi-stage operations, which rely on up to maybe twenty pieces of information – booleans, status codes, arrays of booleans or status codes, arrays of indices into the other arrays, and so on. That’s somewhere between one and ten thousand “is this data current” questions the developer might need to ask before making a change. There’s an awful lot of recalculation and duplication going on, leading to something that is (some coughing, something that might rhyme with “butter frappe”) hard to maintain. This is not a phenomenon unique to this code. It’s how all old code seems to grow without frequent weeding, and I’ve seen the pattern elsewhere more times than I can count. How do we avoid this? That’s where the title comes in.

  • Gather
    Get all of the “raw” information that will be relevant to your decision, in any code path.
  • Prepare
    Slice and dice all the information you got from the real world, converting it into what you need for your decision-making process.
  • Cook
    This is where all the thinking goes. Decide what exactly you’re going to do, then record the decision separately from the data that led to it.
  • Eat
    Execute your decision, using only the information from the previous stage.

The key here is to never go back. Time is an arrow, not a wheel. The developer should iterate on decisions; the code should not. If you’re in the Cook or Eat stage and you feel a need to revisit the Gather or Prepare stages, it means that you didn’t do the earlier stages properly. If you’re worried that gathering all data for all code paths means always gathering some information that the actual code path won’t need, it probably means that you’re not structuring that data in the way that best supports your actual decision process. There are exceptions; I’m not going to pretend that a structure like this will solve all of your complex-decision problems for you, but what this pattern does is make all of those dependencies obvious enough to deal with. Having dozens of fields that are “private” to particular code flows and ignored by others is how we got into this mess. (Notice how OOP tends to make this worse instead of better?) Having those fields associated with stages is how we get out of it, because the progression between stages is much more obvious than the (future) choice of which code path to follow. All of those artifacts that lead to “do we have that” and “not quite what I wanted” sit right there and stare you in the face instead of lurking in the shadows, so they get fixed.
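
To make the four stages concrete, here is a minimal C++ sketch. The toy “replica write” domain and every name in it (Gathered, Prepared, Decision, and so on) are invented for illustration, not anything from AFR; the only point is that each stage consumes the output of the previous one and never reaches back.

    // Minimal sketch of the Gather/Prepare/Cook/Eat pattern in C++. The
    // "replica write" domain and every name here are invented for illustration;
    // the point is that each stage consumes only the output of the previous one.

    #include <cstdio>
    #include <vector>

    // Gather: raw facts from the outside world, collected up front on every path.
    struct Gathered {
        std::vector<int> replica_status;    // e.g. 0 = ok, nonzero = error code
    };

    // Prepare: the raw facts reshaped into exactly what the decision needs.
    struct Prepared {
        std::vector<bool> usable;           // derived from replica_status
        int usable_count = 0;
    };

    // Cook: the decision itself, recorded separately from the data behind it.
    struct Decision {
        bool proceed = false;
        int target = -1;                    // replica to write first
    };

    Gathered gather() {
        // Real code would query replicas, read pending-operation flags, etc.
        return Gathered{{0, 0, 5}};
    }

    Prepared prepare(const Gathered& g) {
        Prepared p;
        for (int status : g.replica_status) {
            bool ok = (status == 0);
            p.usable.push_back(ok);
            if (ok) {
                ++p.usable_count;
            }
        }
        return p;
    }

    Decision cook(const Prepared& p) {
        Decision d;
        int n = static_cast<int>(p.usable.size());
        d.proceed = (p.usable_count * 2 > n);        // simple majority quorum
        for (int i = 0; d.proceed && i < n; ++i) {
            if (p.usable[i]) {
                d.target = i;
                break;
            }
        }
        return d;
    }

    void eat(const Decision& d) {
        // Execute using only the decision; never reach back into earlier stages.
        if (d.proceed) {
            std::printf("writing to replica %d first\n", d.target);
        } else {
            std::printf("refusing the write: no quorum\n");
        }
    }

    int main() {
        eat(cook(prepare(gather())));
        return 0;
    }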

Thoughts on Fowler’s LMAX Architecture

I have the best readers. One sent me email expressing a hope that I’d write about Martin Fowler’s LMAX Architecture. I’d be glad to. In fact I had already thought of doing so, but the fact that at least one reader has already expressed an interest makes it even more fun. The architecture seems to incorporate three basic ideas.

  • Event sourcing, or the idea of using a sequentially written log as the “system of record” with the written-in-place copy as a cache – an almost direct inversion of roles compared to standard journaling.
  • The “disruptor” data/control structure.
  • Fitting everything in memory.

I don’t really have all that much to say about fitting everything in memory. I’m a storage guy, which almost by definition means I don’t get interested until there’s more data than will fit in memory. Application programmers should IMO strive to use storage only as a system of record, not as an extension of memory or networking (“sending” data from one machine to another through a disk is a pet peeve). If they want to cache storage contents in memory that’s great, and if they can do that intelligently enough to keep their entire working set in memory that’s better still, but if their locality of reference doesn’t allow that then LMAX’s prescription just won’t work for them and that’s that. The main thing that’s interesting about the “fit in memory” part is that it’s a strict prerequisite for the disruptor part. LMAX’s “one writer many readers” rule makes sense because of how cache invalidation and so forth work, but disks don’t work that way so the disruptor’s advantage over queues is lost.

With regard to the disruptor structure, I’ll also keep my comments fairly brief. It seems pretty cool, not that dissimilar to structures I’ve used and seen used elsewhere; some of the interfaces to the SiCortex chip’s built-in interconnect hardware come to mind quickly. I think it’s a mistake to contrast it with Actors or SEDA, though. I see them as complementary, with Actors and SEDA as high-level paradigms and the disruptor as an implementation alternative to the queues they often use internally. The idea of running these other models on top of disruptors doesn’t seem strange at all, and the familiar nature of disruptors doesn’t even make the combination seem all that innovative to me. It’s rather disappointing to see useful ideas dismissed because of a false premise that they’re alternatives to one another instead of being complementary.
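
For readers who haven’t seen the disruptor writeups, here is a deliberately simplified sketch of the general shape: one writer publishing into a pre-allocated ring, with each reader chasing it on its own cursor rather than popping from a shared, locked queue. This is not the LMAX Disruptor (which adds cache-line padding, batching, smarter wait strategies, and more); the names, sizes, and busy-wait strategy below are all made up.

    // Simplified single-writer, two-reader ring: each reader has its own
    // cursor, and the writer only waits on the slowest reader. Not the LMAX
    // Disruptor; just a sketch of the shape of the thing.

    #include <algorithm>
    #include <atomic>
    #include <cstdio>
    #include <thread>

    constexpr long RING_SIZE = 1024;                   // power of two, arbitrary
    constexpr long NUM_EVENTS = 100000;

    long ring[RING_SIZE];
    std::atomic<long> write_cursor{0};                 // next slot the writer fills
    std::atomic<long> read_cursor[2] = {{0}, {0}};     // one cursor per reader

    void writer() {
        for (long seq = 0; seq < NUM_EVENTS; ++seq) {
            // Wait until the slowest reader has consumed the slot we will reuse.
            while (std::min(read_cursor[0].load(), read_cursor[1].load())
                   <= seq - RING_SIZE) {
                std::this_thread::yield();
            }
            ring[seq % RING_SIZE] = seq * 2;           // the "event"
            write_cursor.store(seq + 1);               // publish it
        }
    }

    void reader(int id) {
        long sum = 0;
        for (long seq = 0; seq < NUM_EVENTS; ++seq) {
            while (write_cursor.load() <= seq) {       // wait for the writer
                std::this_thread::yield();
            }
            sum += ring[seq % RING_SIZE];
            read_cursor[id].store(seq + 1);            // let the writer move on
        }
        std::printf("reader %d sum %ld\n", id, sum);
    }

    int main() {
        std::thread r0(reader, 0), r1(reader, 1), w(writer);
        w.join();
        r0.join();
        r1.join();
        return 0;
    }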

The really interesting part for me, as a storage guy, is the event-sourcing part. Again, this has some pretty strong antecedents. This time it recalls Seltzer et al.’s work on log-structured filesystems, which is even based on many of the same observations, e.g. about locality of reference and relative costs of random vs. sequential access. That work’s twenty years old, by the way. Because event sourcing is so similar to log-structured file systems, it runs into some of the same problems. Chief among these is the potentially high cost of reads that aren’t absorbed by the cache, and the complexity involved with pruning no-longer-relevant logs. Having to scan logs to find the most recent copy of a block/record/whatever can be extremely expensive, and building indices carries its own expense in terms of both resources and complexity. It’s not a big issue if your system has very high locality of reference, which time-oriented systems such as LMAX (and several other types of systems) tend to have, but it can be a serious problem in the general case. Similarly, the cleanup problem doesn’t matter if you can simply drop records from the tail or at least stage them to somewhere else, but it’s a big issue for files that need to stay online – with online rather than near-line performance characteristics – indefinitely.
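
To make that read-path tradeoff concrete, here is a tiny event-sourcing sketch. The “account ledger” domain is hypothetical and has nothing to do with LMAX’s actual code; it just shows the append-only log as the system of record, the in-place state as a disposable cache, and what a read costs when that cache isn’t there.

    // Minimal event-sourcing sketch: the log is the system of record, the
    // in-place state is only a cache. Hypothetical "ledger" domain.

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Event {
        std::string account;
        long delta;
    };

    class Ledger {
    public:
        void apply(const Event& e) {
            log_.push_back(e);             // sequential append: the cheap path
            cache_[e.account] += e.delta;  // in-place copy, maintained as a cache
        }

        // Cache hit: O(1).
        long balance(const std::string& account) const {
            auto it = cache_.find(account);
            return it == cache_.end() ? 0 : it->second;
        }

        // What a read costs if the cache is cold or lost: replay the whole log.
        long balance_by_replay(const std::string& account) const {
            long total = 0;
            for (const Event& e : log_) {
                if (e.account == account) {
                    total += e.delta;
                }
            }
            return total;
        }

    private:
        std::vector<Event> log_;                       // append-only
        std::unordered_map<std::string, long> cache_;  // derived, disposable
    };

    int main() {
        Ledger l;
        l.apply({"alice", 100});
        l.apply({"bob", 40});
        l.apply({"alice", -30});
        std::printf("%ld %ld\n", l.balance("alice"), l.balance_by_replay("alice"));
        return 0;
    }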

In conclusion, then, the approach Fowler describes seems like a good one if your data characteristics are similar to LMAX’s, but probably not otherwise. Is it an innovative approach? Maybe in some ways. Two out of the three main features seem strongly reminiscent of technology that already existed, and combinations of existing ideas are so common in this business that this particular combination doesn’t seem all that special. On the other hand, there might be more innovation in the low-level details than one would even expect to find in Fowler’s high-level overview. It’s interesting, well worth reading, but I don’t think people who have dealt with high data volumes already will find much inspiration there.

Server Design Revisited

About eight years ago, I wrote a series of posts about server design, which I then combined into one post. That was also a time when debates were raging about multi-threaded vs. event-based programming models, about the benefits and drawbacks of TCP, etc. For a long time, my posts on those subjects, with that server-design article as the centerpiece, constituted my main claim to fame in the tech-blogging community, until more recent posts on startup failures and the CAP theorem and language wars started reaching an even broader audience. Now some of those old debates have been revived, and Matt Welsh has written a SEDA retrospective, so maybe it’s a good time for me to follow suit and see what I and the rest of the community have learned since then.

Before I start talking about the Four Horsemen of Poor Performance, it’s worth establishing a bit of context. Processors have actually not gotten a lot faster in terms of raw clock speed since 2002 – Intel was introducing a 2.8GHz Pentium 4 then – but they’ve all gone multi-core with bigger caches and faster buses and such. Memory and disk sizes have gotten much bigger; speeds have increased less, but still significantly. Gigabit Ethernet was at the same stage back then that 10GbE is at today. Java has gone from being the cool new kid on the block to being the grumpy old man the new cool kids make fun of, with nary a moment spent in between. Virtualization and cloud have become commonplace. Technologies like map/reduce and NoSQL have offered new solutions to data problems, and created new needs as well. All of the tradeoffs have changed, and of course we’ve learned a bit as well. Has any of that changed how the Four Horsemen ride?

It’s Faster Because It’s C

I was recently drawn into another discussion about a claim that project Foo was faster than project Bar because Foo is written in C (or maybe C++) and Bar is written in Java. In my experience, as a long-time kernel programmer and as someone who often codes in C even when there are almost certainly better choices, such claims are practically always false. The speed at which a particular piece of code executes only has a significant effect if your program can find something else to do after that piece is done – in other words, if your program is CPU-bound and/or well parallelized. Most programs are neither. The great majority of programs fit into one or more of the following categories.

  • I/O-bound. Completing a unit of work earlier just means waiting longer for the next block/message.
  • Memory-bound. Completing a unit of work earlier just means more time spent thrashing the virtual-memory system.
  • Synchronization-bound (i.e. non-parallel). Completing a unit of work earlier just means waiting longer for another thread to release a lock or signal an event – and for the subsequent context switch.
  • Algorithm-bound. There’s plenty of other work to do, and the program can get to it immediately, but it’s wasted work because a better algorithm would have avoided it altogether. We did all learn in school why better algorithms matter more than micro-optimization, didn’t we?

If you look at this excellent list of performance problems based on real-world observation, you’ll see that most of the problems mentioned (except #5) fit this characterization and wouldn’t be solved by using a different language. It’s possible to run many synchronization-bound programs on one piece of hardware, with or without virtualization, but the fewer resources these programs share the more likely it becomes that you’ll just become memory-bound instead. On the flip side, if a program is purely disk-bound or memory-bound then you can obtain more of those resources by distributing work across many machines, but if you don’t know how to implement distributed systems well you’ll probably just become network-bound or synchronization-bound. In fact, the class of programs that exhibit high sensitivity to network latency – a combination of I/O-boundedness and synchronization-boundedness – is large and growing.
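
To put a rough number on the I/O-bound case, here is the standard Amdahl-style arithmetic with invented but plausible figures: if a request spends 95% of its time waiting on I/O, then even doubling the speed of the remaining code buys you less than 3% overall.

    \[
    S_{\text{overall}} = \frac{1}{(1 - f) + f / S_{\text{code}}}
    \qquad
    f = 0.05,\ S_{\text{code}} = 2
    \;\Rightarrow\;
    S_{\text{overall}} = \frac{1}{0.95 + 0.025} \approx 1.026
    \]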

So, you have a program that uses efficient algorithms with a well-parallelized implementation, and it’s neither I/O-bound nor memory-bound. Will it be faster in C? Yes, it very well might. It might also be faster in Fortran, which is why many continue to use it for scientific computation, but that hardly makes it a good choice for more general use. Everyone thinks they’re writing the most performance-critical code in the world, but in reality maybe one in twenty programmers is writing code where anything short of the most egregious bloat and carelessness will affect the performance of the system overall. (Unfortunately, egregious bloat and carelessness are quite common.) There are good reasons for many of those one in twenty to be writing their code in C, but even then most of the reasons might not be straight-line performance. JIT code can be quite competitive with statically compiled code once it has warmed up, and even better in many cases, but performance-critical code often has to be not only fast but predictable. GC pauses, JIT delays, and unpredictable context-switch behavior all make such languages unsuitable for truly performance-critical tasks, and many of those effects remain in the runtime libraries or frameworks/idioms even when the code is compiled. Similarly, performance-critical code often needs to interact closely with other code that’s already written in C, and avoiding “impedance mismatches” is important. Most importantly, almost all programmers need to be concerned with making their code run well on multiple processors. I’d even argue that the main reason kernel code tends to be efficient is not because it’s written in C but because it’s written with parallelism and reentrancy in mind, by people who understand those issues. A lot of code is faster not because it’s written in C but for the same reasons that it’s written in C. It’s common cause, not cause and effect. The most common cause of all is that C code tends to be written by people who have actually lived outside the Java reality-distortion bubble and been forced to learn how to write efficient code (which they could then do in Java but no longer care to).

For those other nineteen out of twenty programmers who are not implementing kernels or embedded systems or those few pieces of user-level infrastructure such as web servers (web applications don’t count) where these concerns matter, the focus should be on programmer productivity, not machine cycles. “Horizontal scalability” might seem like a euphemism for “throw more hardware at it” and I’ve been conditioned to abhor that as much as anyone, but hyper-optimization is only a reasonable alternative when you have a long time to do it. Especially at startups, VC-funded or otherwise, you probably won’t. Focus on stability and features first, scalability and manageability second, per-unit performance last of all, because if you don’t take care of the first two nobody will care about the third. If you’re bogged down chasing memory leaks or implementing data/control structures that already exist in other languages, instead of working on better algorithms or new features, you’re spending your time on the wrong things. Writing code in C(++) won’t magically make it faster where it counts, across a whole multi-processor (and possibly multi-node) system, and even if it did that might be missing the point. Compare results, not approaches.


Scaling Beyond Caches

Caching is a neat trick. It’s a really really neat trick, using one level of the storage hierarchy to hide or work around slowness in the next, and we wouldn’t be able to do much of what we do without caches. They’re still a trick, though, because the illusion that they create always fades eventually. Inconsistencies start to creep in, or the cost of maintaining consistency starts to outweigh the advantage of having caches in the first place. Even if you manage to avoid both of those problems, request rates or problem sizes will practically always increase faster than cache sizes until those caches are too small and you’d better have another solution ready. HPC folks have known about this for a long time. They don’t think much of CPU caches, because the kind of stuff they do blows through those in no time. Increasingly, their attitude toward in-memory caching of disk data is the same, because they tend to blow through those caches too. After a few rounds of this, it’s natural to stop thinking of caches as performance magic and start thinking of them as just another tool in the box.

Some folks seem to have trouble learning that lesson. Even if they’ve learned it about computational tasks, they still cling to a belief that caches can last forever when it comes to I/O. Most recently, I’ve seen this in the form of people claiming that modern applications don’t need fast I/O because they can just keep everything in (distributed) memory. The H-Store papers behind some of this advocacy make some good points regarding OLTP specifically, but OLTP is only one kind of computing. While OLTP workloads are still extremely common and important, they account for an ever-decreasing share of storage capacity and bandwidth. Let’s run some numbers to see how inapplicable those conclusions are to most people. A modern server such as the IBM x3650 M2 can accommodate 128GB in 2U. That’s barely 1TB per rack after you account for the fact that some of the space has to be used for a top-of-rack switch to provide bandwidth to that data, and that you have to replicate the in-memory data for some semblance of resistance to faults (though many would say even that is far from sufficient for serious work). The SiCortex systems were built for high density, and even they only got to about 8TB in something less than 3x the physical space. Those are piddling numbers for storage, when installations with petabytes are no longer uncommon. It’s also a very costly way to get that much capacity, paying for RAM and the processors that go with it and the power to run both. It better be worth it. Is it? Only if some small and easily identified part of your data is red-hot, and the rest is all ice-cold, and your operations on that data distribute nice and evenly so you don’t have communication hot spots that would make memory vs. disk irrelevant. That’s a pretty limited set of conditions.
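
Spelling out that rack arithmetic (the servers-per-rack count and the replication factor are my assumptions): a 42U rack holds roughly twenty of those 2U servers once you set aside space for the top-of-rack switch, so

    \[
    20 \times 128\,\text{GB} \approx 2.5\,\text{TB raw per rack}
    \qquad
    \frac{2.5\,\text{TB}}{2\ \text{or}\ 3\ \text{replicas}} \approx 0.9\text{--}1.3\,\text{TB usable}
    \]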

Still not convinced? OK, look at it this way. A lot of people have workloads more like Facebook than TPC-C, and it just so happens that there’s some information out there about how well caching has worked for Facebook. According to James Hamilton, as of last April Facebook was serving 475K images per second out of 6.5B total. Those numbers are certainly higher today – I’ve heard 600K and 15B – but they’re the basis for some other interesting numbers so let’s stick with them. Using 1TB of memory cache across 40 servers they get a 92% hit rate. That’s still multiple gigabytes per second that have to be served from disk – just for pictures, a year ago. For everything, today, they surely need dozens of servers to achieve the necessary bandwidth, and dozens of disks per server to achieve the necessary capacity. Facebook is far from typical, but also far from the “in-memory everything” crowd’s model and moving further away every day. Where Facebook is leading, many others will follow.
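
Working those numbers through (the ~50KB average served-image size is my assumption, not one of Hamilton’s figures):

    \[
    475{,}000\ \text{images/s} \times (1 - 0.92) = 38{,}000\ \text{images/s from disk}
    \qquad
    38{,}000 \times \sim\!50\,\text{KB} \approx 1.9\,\text{GB/s}
    \]

And that is an average rather than a peak, for images only.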

Sooner or later, your caches (including memory used as cache rather than as a functionally different operational resource) won’t be big enough to provide adequate performance if the next level in your storage hierarchy is too slow. Caches can still be useful as a way to optimize use of that next level, such as by avoiding redundant requests or turning random small-block I/O into large-block sequential, but they can’t hide its weaknesses. Parallelism and horizontal distribution are the real long-term keys to scalability, for I/O as well as for computation. If you really want scalability, you have to put parallelism at the center of your design with caches in a supporting role.

Brevity is Not Power

In the context of responding to a post about C++, I realized that part of what I was addressing was the fairly common attitude that brevity in a programming language was indicative of the language’s power or expressiveness. This is common in many communities, especially among perl programmers, but probably its best known expression is from Paul Graham.

making programs short is what high level languages are for. It may not be 100% accurate to say the power of a programming language is in inverse proportion to the length of programs written in it, but it’s damned close. Imagine how preposterous it would sound if someone said “The program is 10 lines of code in your language and 50 in my language, but my language is more powerful.” You’d be thinking: what does he mean by power, then?

Like many high-profile programmers, Paul tends to assume that if he can’t think of an answer then nobody can. He almost certainly meant the above as a rhetorical question with no good answer, but in fact it’s not hard to answer at all. A diesel-powered truck is likely to be more powerful than a Prius. It might take more to start it up, it might be the wrong vehicle for a daily commute or a trip to the grocery store, but once it gets going it can do things that a Prius never could. In other words, power has to do with ultimate capability, not initial cost. What if modifying the 10-line program in Paul’s example to run across many processors required increasing its size by an order of magnitude, but modifying the 50-line program required not one extra line because half of what the original fifty lines did was set up the infrastructure for that? Which language is more powerful now? This same argument has recently played out at the server-framework level in discussions of Twisted vs. Tornado. Twisted is more complex and it’s harder to write simple apps in it, but few would last long arguing that it’s not also more powerful. (I’m not actually a big Twisted fan, BTW, but it does illustrate this particular point well.) Writing a shorter “hello world” program is not interesting. Writing a shorter complete application that does something real, in a real world where performance and scalability and robust handling of unusual conditions all matter, is much closer to the true measure of a language’s (or framework’s) power.

I say “much closer” because brevity does not truly equal power even in the context of a serious program. Part of my initial point about C++ is that so much of its brevity is bad brevity. If you have deep class hierarchies with complex constructors, and you use lots of operator overloading, then a single line of your C++ code might translate into many thousands of instructions. That same line of C++ code might, under other circumstances, result in only a few instructions. The problem is largely that the person writing that line might not know – might not even be able to find out without trying – what the results will be with respect to instructions and cache misses and page faults and stack depth and all the other things that it might be important to know. I would modify Graham’s claim by saying that recognizable and predictable brevity is an indicator of programming-language power. Any programmer in a given language will immediately recognize that certain constructs might cause lots of code to be emitted by a compiler or executed by an interpreter, most often by explicitly redirecting the flow of execution – loops, function calls, throwing exceptions, etc. Decent programmers in just about any language know that operating on large data structures, even to assign them, might incur a large cost. They don’t need to know how someone else wrote their part of the program to know which operations are likely to be expensive; they just need to know the language itself. Contrast this with C++, where a simple declaration or assignment of a simple object might or might not – through the “magic” of constructors and destructors, templates, smart pointers, and so on – take whole milliseconds to execute.
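
Here is a contrived but legal C++ fragment showing the kind of bad brevity I mean: one innocent-looking assignment whose cost depends entirely on declarations the reader may never see.

    // One innocent-looking line whose cost is invisible at the call site.
    // Contrived example; the point is only what the '=' can hide in C++.

    #include <cstdio>
    #include <vector>

    struct Config {
        std::vector<double> samples;

        // "Default" object: ten million doubles, roughly 80MB of zeros.
        Config() : samples(10 * 1000 * 1000, 0.0) {}

        // Overloaded '=': a deep copy of all ten million doubles.
        Config& operator=(const Config& other) {
            samples = other.samples;
            return *this;
        }
    };

    Config g_template;    // constructed (and its 80MB allocated) before main()

    int main() {
        Config c;         // another 80MB, again invisible at this line
        c = g_template;   // reads like a cheap assignment; actually a deep copy
        std::printf("%zu\n", c.samples.size());
        return 0;
    }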

If a language allows you to express a complicated operation simply, with no ambiguity and with the execution complexity clearly recognizable by any competent programmer in that language, then that might be a legitimate marker of the language’s power and expressiveness. Paul Graham’s Arc language might in fact be considered powerful by that standard. On the other hand, if understanding a single simple line of code’s likely execution complexity requires careful study of code elsewhere, then that language might not be more powerful or expressive. It might just be more obfuscated, or even broken. C++ completely fails this test, though it’s worth noting that close cousins on either side – C and Java – do much better. Even perl does better, despite being terrible in other ways. The real point of brevity is not fewer lines or characters, but more efficient communication of intent and effects to another programmer. If your three lines communicate those things clearly, then congratulations. If they just leave the reader confused or needing to look elsewhere, then you have abused whatever language you’re writing in. If that language makes it inevitable for abuse to crowd out good use, then it is a bad language and programmers interested in writing good code should avoid it.

Actor Model Example

At the end of my original actor-model post, I suggested that I might post some example code to show how it works. More importantly, I want to show how it avoids some common problems that arise in the lock-based model, and makes others at least a little bit more tractable. Before we dive in, then, I need to explain some of what those problems are and how they happen.
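
The example code itself isn’t reproduced here, but as a preview of the shape of the thing, here is a minimal actor sketch (my own, not the example from that follow-up): all state private to one thread, all interaction through a mailbox, so there is no shared state for callers to lock or forget to lock.

    // Minimal actor sketch in C++: private state, a mailbox, and one thread
    // that owns both. Illustrative only; names and message format are made up.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    class CounterActor {
    public:
        static constexpr int STOP = -1;

        CounterActor() : worker_(&CounterActor::run, this) {}

        ~CounterActor() {
            send(STOP);
            worker_.join();
        }

        // The only way anyone else touches this actor's state.
        void send(int msg) {
            {
                std::lock_guard<std::mutex> lk(mtx_);
                mailbox_.push(msg);
            }
            cv_.notify_one();
        }

    private:
        void run() {
            for (;;) {
                int msg;
                {
                    std::unique_lock<std::mutex> lk(mtx_);
                    cv_.wait(lk, [this] { return !mailbox_.empty(); });
                    msg = mailbox_.front();
                    mailbox_.pop();
                }
                if (msg == STOP) {
                    break;
                }
                total_ += msg;           // state touched by this thread only
            }
            std::printf("total = %ld\n", total_);
        }

        long total_ = 0;                 // private state: nothing shared, no races
        std::queue<int> mailbox_;
        std::mutex mtx_;
        std::condition_variable cv_;
        std::thread worker_;             // declared last, so it starts after
                                         // the members above are constructed
    };

    int main() {
        CounterActor counter;
        for (int i = 1; i <= 100; ++i) {
            counter.send(i);
        }
        return 0;                        // destructor sends STOP and joins
    }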

Actor Model Concurrency

All of a sudden, everybody is talking about concurrent programming. The driving factor, of course, is the fact that multi-core processors have moved beyond the server room, into the machines people use for web surfing and games. Whether that’s a good kind of evolution is a question best left for another time. Right now, much of the debate is over how to make it possible for programmers to take full advantage of all that extra processing power. Because of what’s driving it, most of this conversation is focused on systems where all processors share equal access to a single memory space. In fact, neither the access times nor the memory layout are always completely uniform, but that’s not really important here. In the past, I’ve written quite a bit about programming in this kind of environment. It’s probably why many of you are here, but today I’ll be going in a slightly different direction. I say “slightly” because I won’t be going to the other extreme – internet scale with dozens-of-millisecond latencies and high node/link failure rates – either. For now, I’ll be somewhere in between – clusters of machines with separate memory, connected with some kind of modern interconnect such as InfiniBand or 10GbE.

Scalability

One of the luxuries that I’ve gotten used to at some of the places I’ve worked is being surrounded by people who understand scalability. In my recent job hunt and other between-job activities, I’ve been reminded that a great many people in this industry lack such understanding. So, here’s the thing. If you draw a graph of a system’s component (e.g. server) count on the x axis, and aggregate performance on the y axis, then

Scalability is about the slope, not the height.

The line for a scalable system will have a positive slope throughout the range of interest. By contrast, the line for a non-scalable system will level off or even turn downward as contention effects outweigh parallelism. Note that a non-scalable system might still outperform a scalable system throughout that range of interest, if the scalable system has poor per-unit performance. Scalability is not the same as high performance; it enables higher performance, but does not guarantee it.
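
A quick way to picture this with numbers (the per-node figures and the contention coefficient are invented for illustration): let system A do 500 ops/s on a single node but suffer contention as nodes are added, while system B does only 100 ops/s per node but scales linearly.

    \[
    T_A(N) = \frac{500\,N}{1 + 0.5\,(N - 1)}
    \qquad
    T_B(N) = 100\,N
    \]
    \[
    T_A(1) = 500 > T_B(1) = 100,
    \qquad
    T_A(16) \approx 941 < T_B(16) = 1600
    \]

A has the height, B has the slope; past a dozen or so nodes the slope wins, and A never gets past 1000 ops/s no matter how many nodes you buy.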

Similarly, building a scalable system might also increase single-request latency or decrease single-stream bandwidth due to higher levels of complexity or indirection. It would be wrong, though, to dismiss a scalable design based on such concerns. If low single-request latency or high single-stream bandwidth are hard requirements with specific numbers attached, then the more scalable system might not suit that particular purpose, but in the majority of cases it’s the aggregate requests or bytes per second that matter most, so it’s a good tradeoff. The key to scalability is enabling the addition of more servers, more pipes, more disks, more widgets, not making any one server or pipe or disk faster. Can you make better use of a network by using RDMA instead of messaging? Sure, that’s nice; it might even be all that’s needed to reach some people’s goals, but it’s a complete no-op where scalability is concerned. Ditto for “parallel” filesystems that only make data access parallel but do nothing to address the metadata bottleneck.

Scaling – more precisely what the true cognoscenti would recognize as horizontal scaling – across all parts of a system is an important key to performance in scientific and enterprise computing, in the grid and the cloud. That’s why all the biggest systems in the world, from Google and Amazon to Roadrunner and Jaguar, work that way. Anybody who doesn’t grasp that, in fact anyone who hasn’t internalized it until it’s instinct, is not qualified to be designing or writing about modern large computer systems. (Ditto, by the way, for anybody who thinks a hand-wave about their favorite protocol possibly allowing such scalability is the same as that protocol and implementation explicitly supporting it. Such people are frauds, to put it nicely.)

I’ll probably be writing more about scaling issues for a while, for reasons that will soon become apparent. Watch this space.