Server Design I: Data Copies

One of the reasons I’ve been writing so little here is that I’ve been working on a longer-than-usual article about server architectures. I have a pretty good outline and some phrasing in my head, but there’s still a lot of wordsmithing to do so I’ve only finished the introduction and the first section on avoiding data copies. I think I’ve finished them, anyway. In case people don’t feel like clicking through, here’s the data-copy section:

This could be a very short section, for one very simple reason: most people have learned this lesson already. Everybody knows data copies are bad; it’s obvious, right? Well, actually, it probably only seems obvious because you learned it very early in your computing career, and that only happened because somebody started putting out the word decades ago. I know that’s true for me, but I digress. Nowadays it’s covered in every school curriculum and in every informal how-to. Even the marketing types have figured out that “zero copy” is a good buzzword.

Despite the after-the-fact obviousness of copies being bad, though, there still seem to be nuances that people miss. The most important of these is that data copies are often hidden and disguised. Do you really know whether any code you call in drivers or libraries does data copies? It’s probably more than you think. Guess what “Programmed I/O” on a PC refers to. An example of a copy that’s disguised rather than hidden is a hash function, which has all the memory-access cost of a copy and also involves more computation. Once it’s pointed out that hashing is effectively “copying plus” it seems obvious that it should be avoided, but I know at least one group of brilliant people who had to figure it out the hard way. If you really want to get rid of data copies, either because they really are hurting performance or because you want to put “zero-copy operation” on your hacker-conference slides, you’ll need to track down a lot of things that really are data copies but don’t advertise themselves as such.

The tried and true method for avoiding data copies is to use indirection, and pass buffer descriptors (or chains of buffer descriptors) around instead of mere buffer pointers. Each descriptor typically consists of the following:

  • A pointer and length for the whole buffer.
  • A pointer and length, or offset and length, for the part of the buffer that’s actually filled.
  • Forward and back pointers to other buffer descriptors in a list.
  • A reference count.

Now, instead of copying a piece of data to make sure it stays in memory, code can simply increment a reference count on the appropriate buffer descriptor. This can work extremely well under some conditions, including the way that a typical network protocol stack operates, but it can also become a really big headache. Generally speaking, it’s easy to add buffers at the beginning or end of a chain, to add references to whole buffers, and to deallocate a whole chain at once. Adding in the middle, deallocating piece by piece, or referring to partial buffers will each make life increasingly difficult. Trying to split or combine buffers will simply drive you insane.
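To make that concrete, here’s a minimal sketch in C of what such a descriptor and its reference counting might look like. The names and layout are invented for illustration, not taken from any particular stack; a real implementation would need atomic reference counts and a smarter allocator.

        #include <stdlib.h>

        /* A hypothetical buffer descriptor, roughly as described above. */
        struct buf_desc {
                char            *base;    /* the whole buffer */
                size_t           size;    /* length of the whole buffer */
                size_t           offset;  /* start of the filled portion */
                size_t           filled;  /* length of the filled portion */
                struct buf_desc *next;    /* forward link in a chain */
                struct buf_desc *prev;    /* back link in a chain */
                unsigned int     refcnt;  /* reference count */
        };

        /* Instead of copying the data, just take another reference... */
        static void buf_hold(struct buf_desc *bd)
        {
                bd->refcnt++;
        }

        /* ...and free the buffer only when the last reference is dropped. */
        static void buf_release(struct buf_desc *bd)
        {
                if (--bd->refcnt == 0) {
                        free(bd->base);
                        free(bd);
                }
        }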

I don’t actually recommend using this approach for everything, though. Why not? Because it gets to be a huge pain when you have to walk through descriptor chains every time you want to look at a header field. There really are worse things than data copies. I find that the best thing to do is to identify the large objects in a program, such as data blocks, make sure those get allocated separately as described above so that they don’t need to be copied, and not sweat too much about the other stuff.

This brings me to my last point about data copies: don’t go overboard avoiding them. I’ve seen way too much code that avoids data copies by doing something even worse, like forcing a context switch or breaking up a large I/O request. Data copies are expensive, and when you’re looking for places to avoid redundant operations they’re one of the first things you should look at, but there is a point of diminishing returns. Combing through code and making it twice as complicated just to get rid of the last few data copies is usually a waste of time that could be better spent in other ways.

No Bug Like an Old Bug

Serendipitously, today I found an old post about a TCP bug, from when I was at Encore. October 11, 1989. It’s not quite as old as the very first post of mine that I could find, which is also somewhat relevant to the current discussions, but it’s close. Note the old capitalization/punctuation of my name, and the UUCP return address. The more things change, the more they remain the same.

Band-Aids != Brain Surgery

Bram Cohen has decided to join the TCP fray, and has written his own article in response to mine. In it, he says:

I have to tease Jeff a bit for having suggested that I fix a transfer rate problem I had by turning off Nagle. I instead implemented pipelining, and the problem immediately disappeared. If a monkey is having trouble learning a trick, brain surgery is usually the last resort.

Well, issuing a single ioctl never seemed like brain surgery to me. :-P
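For the record, on any reasonably modern sockets API it’s one setsockopt() call, something like this:

        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>

        /* Turn off the Nagle algorithm on a connected TCP socket. */
        static int disable_nagle(int sock)
        {
                int one = 1;
                return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                                  &one, sizeof(one));
        }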

More seriously, though, I don’t remember the exchange to which Bram refers, and I suspect that there might be some misinterpretation involved. The Nagle algorithm doesn’t have a direct effect on transfer rates. It does, however, have a direct and often catastrophic effect on latency. Therefore, if your higher-level protocol has a tendency to stop and wait for a round trip to occur, what started as a latency issue can very quickly become a throughput issue. Perhaps that is why Bram’s implementation of pipelining fixed things. I can’t be sure, though, because Bram sometimes assigns novel or very project-specific meanings to common terms and then expects the rest of us to follow along. In this case “pipelining” might mean something very different to him than it does to me, so I can’t say for certain that adding it to BitTorrent solved an above-TCP “stop and wait” problem, but based on common definitions it seems fairly likely.

What’s in the CEO’s Inbox Today?

Lately, there’ve been people with laptops sitting on a bench in the tiny little park right next to the building at work. It’s a different person each time, but here’s what caught my attention: the laptop is usually on and open, but not actively used; often, the owner is reading a book for an extended period of time. That’s odd because most laptop users I know are absolutely paranoid about draining the battery when the machine is not plugged in, but these folks don’t seem to care.

Am I paranoid to suppose that maybe these folks are collecting 802.11b packets?

Linus, Types, and Information Hiding

Linus Torvalds says, in an LKML email about coding style and C typedefs:

For example, some people like to do things like

        typedef unsigned int counter_t;

and then use “counter_t” all over the place. I think that’s not just ugly, but stupid and counter-productive. It makes it much harder to do things like “printk()” portably, for example (“should I use %u, %l or just %d?”), and generally adds no value. It only _hides_ information, like whether the type is signed or not.

Yes, it hides information, and that is exactly the point. Linus was literally in diapers when the concept of information hiding was developed by Dijkstra and named by Parnas in 1972. The idea behind information hiding is pretty simple: if you don’t know something you’re a lot less likely to depend on it, and therefore to be affected when it changes. Reducing dependencies is a good thing.

Let’s consider the example of three kinds of values that you often use in your program – e.g. block number, opcode and status. Maybe you could use plain ints for all three, but they have very different meanings within the code and none should ever be used where another is expected. By using typedefs you gain two benefits:

  • Whenever you use one of the three types it’s immediately clear to even the laziest programmer not only what machine type (# of bits, signed vs. unsigned) is expected, but what the value is supposed to represent. The code is clearer, and likelihood of error is reduced.
  • If you ever need to change the size of one type, it’s very easy to do by changing the typedef. Contrast this with what would happen if you had to change the size of the block-number type when all three types were declared as int: you’d have to examine every place where you used an int, determine whether it was supposed to be a block number, and change that instance by hand.
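As a minimal sketch, with names invented for the example, that might look like this in C:

        #include <stdint.h>

        /* Distinct names for distinct concepts. If block numbers ever need
         * to grow, the change happens on exactly one line. */
        typedef uint32_t blocknum_t;
        typedef uint16_t opcode_t;
        typedef int32_t  status_t;

        /* The signature alone says what each argument is supposed to be. */
        status_t submit_request(opcode_t op, blocknum_t block);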

One advantage you unfortunately would not get in C is compiler help in detecting assignments between these types, such as passing an opcode where a block number is expected. In C, a typedef does not introduce a new type as far as the compiler’s type checking is concerned; it only introduces a new name for an existing type. Many other languages’ equivalents of typedef actually create distinct types within the compiler, even when those types look and behave exactly the same, and those languages are therefore better at catching this sort of error.
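To illustrate, reusing the hypothetical typedefs from the sketch above: the following is wrong by intent, yet a C compiler will accept it without complaint, because each typedef is only another name for the underlying integer type.

        #include <stdint.h>

        typedef uint32_t blocknum_t;
        typedef uint16_t opcode_t;

        int read_block(blocknum_t block);       /* hypothetical */

        int oops(opcode_t op)
        {
                /* An opcode where a block number belongs: legal C, and the
                 * compiler has no basis for a warning. */
                return read_block(op);
        }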

Linus does raise an interesting point about typedefs making it harder to print out values, but there’s a very simple answer to his concern and it doesn’t involve abandoning typedef: you shouldn’t be formatting the display of other modules’ types anyway. If you really want to print them out, a function to create a displayable representation of the type – like Python’s __repr__ – should be part of the interface between modules. Even with C’s lame handling of strings and memory allocation, almost all of the details can be hidden in printk.
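A sketch of what that might look like in C, again with invented names; the caller formats the value through the owning module’s function and never needs to know how an opcode_t is represented.

        #include <stdint.h>

        /* In the module that owns the type (hypothetical): */
        typedef uint16_t opcode_t;

        const char *opcode_name(opcode_t op)
        {
                switch (op) {
                case 0:  return "NOP";
                case 1:  return "READ";
                case 2:  return "WRITE";
                default: return "UNKNOWN";
                }
        }

        /* In a caller, which never looks inside the type:
         *
         *      printk("op=%s\n", opcode_name(op));
         */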

In short, there’s just no good reason to avoid typedefs. Yet again (previous example), I find myself wondering whether Linus hasn’t reached the point of doing more harm than good in the technical community.

Yet Another Site Update

I’ll try to keep this short. Both pl.atyp.us and www.platypus.ro have been running at the new site for a couple of days now, with no visible problems. At least, nobody has complained, and several people have commented on content posted since the move. The statistics on the old site show a sharp drop in hits, but not quite to zero, because sometimes old DNS entries don’t expire when they should. Interesting tidbit for geeks: several search-engine robots seem to be ignoring DNS timeouts on purpose. In any case, when I do see those hits drop to zero, I’ll kill the old site.

Live, from (TCP) Vegas!

Welcome to round five of the TCP Smackdown. In the blue corner we have Ziv Caspi, with whom I’ve continued to exchange email since Messages vs. Streams. Ziv has made a good case that I was wrong when I said that most traffic is still stream-oriented, and seems to abhor TCP’s message-boundary-erasing habits even more than I do. In the red corner we have Luke Gorrie, who last appeared here in TCP Apologists Considered Annoying. Luke seems to think that I’m being too hard on TCP, and that the problems I mention are figments of my imagination.

Since my own views are somewhere between these two fine contestants’ you might think I’d be inclined to let them fight it out between themselves, with me making sure they get equal time, calling fouls, etc. Well, if you thought that you obviously don’t understand the role of the referee in modern sports. From professional boxing and wrestling to the World Cup, referees taking sides has become the norm, and that’s exactly what I’ll do. Here’s the coin flip…oh, sorry Luke, looks like I’ll be on Ziv’s side today. Better luck next time. Without further ado, then, here’s my contribution to the next round.

By pay-as-you-use, I mean that most congestion-control features are only actually used when you encounter congestion. Or more accurately, when you encounter packet loss, severe packet reordering, or large “spikes” in delay, which TCP interprets as signs of congestion.

On a fast and reliable “specially provisioned” network, I’m assuming these things are extremely rare. That being the case, I don’t see why TCP congestion control should cause any problems.

Unfortunately, that is not the case. Consider the following:

  • Round-trip times and window-size calculations are being made all the time, congestion or no congestion. That might seem like a trivial cost, but you’d be surprised; if you want your times to be accurate and consistent your timer accesses can be more expensive than you think.
  • There is no window-based congestion avoidance method that will not occasionally throttle traffic unnecessarily. It’s a logical impossibility.
  • In my experience, most TCP implementations seem to be kind of “twitchy” – i.e. they’ll respond to the smallest change in RTT or packet-loss rate as though the sky is falling, with terrible effects on throughput.

TCP’s congestion control – which is mostly congestion avoidance – is a very good thing to have on almost any general-purpose network that’s running anywhere near capacity, and most particularly on the public internet, but it’s not “pay as you use” by any means. There are very definite costs, even in environments where the congestion TCP is trying to avoid never happens.

It would be more interesting to know what mistakes are showing up in packet traces, and whether they’re caused by TCP implementation bugs or by network quirks that the protocol intrinsically doesn’t handle well. I’d want to determine this before concluding that TCP’s congestion control is expensive, and certainly before turning it off or designing alternatives.

Sometimes it’s an implementation bug, sometimes it’s a protocol issue, sometimes it’s a little of both, sometimes it’s hard to tell. One thing that’s always true, though, is that TCP’s very complexity makes all classes of bugs more likely; witness, for example, the mess that’s developing with respect to mixing “Reno” and “Vegas” dialects of TCP.

The last part of your comment, though, is something I think we should all be able to agree on. TCP, for all of its imperfections, is a good starting point. Only experimentation with the real higher-level protocols on a real network can really show whether tuning or abandoning TCP will yield any benefit, and if the application designer has done their job in keeping things modular such experiments should be fairly easy to perform.

The reason that TCP proxies can do so much transparently is that TCP and SOCK_STREAM leave so much freedom to the transport, compared with e.g. a datagram protocol where application level frames must correspond to IP packets. With streams, write() isn’t defining a frame, it’s just writing the next sequence of bytes in the stream, so there is no frame information to be preserved.

It seems that any time someone suggests that preserving message boundaries would be a good thing, someone else assumes they’re talking about forcing complete equivalence between messages and network-layer packets. How absurd. In fact it’s not even possible, since application-level messages can be arbitrarily large and network MTUs only get so big. What we’re talking about here – what I’m talking about, anyway – is preserving application-level message boundaries regardless of what has to be done within the transport layer or below.

Streams give tremendous flexibility to the transport layer: its only restriction is to ultimately deliver the bytes in order. Any way it wants to chop them up into network-layer packets for better flow-control, retransmission, transfer efficiency, or to fit the receiver’s buffer is no problem.

…except that it’s lacking a piece of information that would allow it to do all of those things more effectively. One of the most important decisions that a transport or lower layer has to make is when to actually pass a packet onward. The no-brainer cases are to send when there’s a full destination-path MTU’s worth of data available or upon explicit request. What if neither of these is true? Should we send immediately, or wait for more? One way can lead to lots of very small packets, which is terrible for bandwidth; the other way (Nagle) is much more bandwidth-efficient but has absolutely terrible effects on latency. It’s a lousy tradeoff either way.

If we know where the message boundaries are, though, we retain all of our ability to fragment and reassemble however we want but now we have one more piece of information that enables us to do so more effectively. We’re much less likely to guess wrong about whether to wait or send, and forward half a message to an application that’s not going to do anything with it until the rest arrives anyway. That’s pretty important. Nagle himself realized that his clever invention was no substitute for line-mode telnet when that was all applications needed. Tracing and filtering also become much easier when fields can be specified as offsets into transport-visible messages instead of requiring specific and separate knowledge of each application’s framing methods. If you want “pay as you use (and get something back)” this is it.
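To make the contrast concrete, here’s a toy sketch of that “send now or wait?” decision, with and without message boundaries. None of this is real stack code, and all the names are invented.

        #include <stddef.h>

        enum send_action { SEND_NOW, WAIT_FOR_MORE };

        /* Stream transport: no message boundaries, so it has to guess. */
        enum send_action decide_stream(size_t queued, size_t mss,
                                       int app_pushed, int unacked_data)
        {
                if (queued >= mss || app_pushed)
                        return SEND_NOW;        /* the no-brainer cases */
                if (unacked_data)
                        return WAIT_FOR_MORE;   /* Nagle-style: good for
                                                   bandwidth, bad for latency */
                return SEND_NOW;                /* send immediately: good for
                                                   latency, bad for bandwidth */
        }

        /* Message-aware transport: one extra fact removes the guesswork. */
        enum send_action decide_message(size_t queued, size_t mss,
                                        int at_msg_boundary)
        {
                if (queued >= mss || at_msg_boundary)
                        return SEND_NOW;        /* a whole message is useful now */
                return WAIT_FOR_MORE;           /* half a message usually isn't */
        }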

It’s also easy to put application frames on top of streams. Writing and reading a 2- or 4- byte in-band length header is cheap and simple.

If only it were so. As I’ll detail in a moment, though, it’s often a bit of a pain.
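For what it’s worth, here’s roughly what the “cheap and simple” receive side looks like once short reads, byte order, and oversized lengths are handled; this sketch still ignores non-blocking sockets and partial writes on the sending side.

        #include <arpa/inet.h>          /* ntohl */
        #include <errno.h>
        #include <stdint.h>
        #include <unistd.h>

        /* Read exactly len bytes, coping with short reads and EINTR. */
        static int read_full(int fd, void *buf, size_t len)
        {
                char *p = buf;
                while (len > 0) {
                        ssize_t n = read(fd, p, len);
                        if (n < 0) {
                                if (errno == EINTR)
                                        continue;
                                return -1;      /* real error */
                        }
                        if (n == 0)
                                return -1;      /* peer closed mid-message */
                        p += n;
                        len -= (size_t)n;
                }
                return 0;
        }

        /* Read one length-prefixed message into buf (at most bufsize bytes). */
        static ssize_t read_message(int fd, void *buf, size_t bufsize)
        {
                uint32_t netlen, len;

                if (read_full(fd, &netlen, sizeof(netlen)) < 0)
                        return -1;
                len = ntohl(netlen);
                if (len > bufsize)
                        return -1;              /* or grow the buffer, or give up */
                if (read_full(fd, buf, len) < 0)
                        return -1;
                return (ssize_t)len;
        }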

So for instance, if a program is doing a bunch of separate write() calls, this might cause the data to go out in separate IP packets and ethernet frames, when it could have all fit into one. Thoughtful use of writev(3) and TCP_CORK could help out for things like this.

TCP_CORK is kind of nice, but let’s not forget that it’s highly platform-specific. For some reason, people who write network code often write it to run on multiple platforms, not all of which will have TCP_CORK. Writev comes a little nearer the mark, but is no panacea either. For one thing, my experience has been that writev is poorly implemented for network connections on some platforms, and some of the error handling is very ugly indeed. Also, while writev does avoid some of the stupider behavior related to application-level framing, it doesn’t do much more than that. In particular, the tail end of a message that’s larger than an MTU might still get “stranded” waiting for a transmission timeout.
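A sketch of the technique Luke describes, for Linux specifically since TCP_CORK is a Linux-ism; error handling is deliberately minimal, and a portable version would need a fallback for stacks without TCP_CORK.

        #include <netinet/in.h>
        #include <netinet/tcp.h>        /* TCP_CORK: Linux-specific */
        #include <sys/socket.h>
        #include <sys/uio.h>            /* writev */

        /* Send a header and payload with one writev() instead of two write()s,
         * corking the socket so the stack doesn't emit a runt packet between
         * them. */
        static ssize_t send_framed(int sock, const void *hdr, size_t hdrlen,
                                   const void *body, size_t bodylen)
        {
                int one = 1, zero = 0;
                struct iovec iov[2];
                ssize_t n;

                iov[0].iov_base = (void *)hdr;
                iov[0].iov_len  = hdrlen;
                iov[1].iov_base = (void *)body;
                iov[1].iov_len  = bodylen;

                setsockopt(sock, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));
                n = writev(sock, iov, 2);       /* may still be a short write */
                setsockopt(sock, IPPROTO_TCP, TCP_CORK, &zero, sizeof(zero));
                return n;
        }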

The closest equivalent to a true message-boundary interface is actually {MSG,TCP}_PUSH. Proper use of this facility, with Nagle left on, can provide many of the same benefits as a message-boundary-preserving transport…but only at the sender. The in-network advantages discussed a couple of paragraphs ago still only occur with The Real Thing.

Day 1: Still Tired from the Move

The transition to the new site is complete, or at least I think it is. Transferring the data from PlatSpot was easier than I expected, and the DNS updates occurred more quickly than expected. As far as I can tell, www.platypus.ro and pl.atyp.us now refer to the same site hosted by my new provider, and likewise for email; the only way to get to the old site is by IP address. I’ll be checking the statistics to make sure that’s the case, and if so then in about a week I’ll shut it all down. As always, let me know ASAP if you find any glitches.

While I’m here, and since I’m too lazy to make a separate post about it, I added some simple <div> tags that should make RSS feeds a little easier. Thanks to Ziv Caspi for bugging me about it.

Net-ural Selection

Mail to platypus.ro was FUBAR for a few days, courtesy of JTLnet. It’s all better now (I think) so if you sent me anything and it bounced please retry.

The continuing problems I’ve had with JTLnet lately have made me decide to change providers. The new pl.atyp.us site is already live at the new location, except for PlatSpot; as soon as I’ve finished getting everything set up there, I’ll be changing my DNS info to point platypus.ro there as well. This should all be totally transparent to anyone but me, but there is a potential for some glitches during the transition, so please bear with me.

So long, JTLnet. It was good while your One Guy With A Clue still worked there, but obviously he left and it’s been all downhill since then.

New Domain

First, the bad news. I tried to get the domain platyp.us, only to find that it had been snapped up by a domain speculator. After multiple attempts, and pointing out that domain speculation doesn’t work very well if you skip the selling part, I finally got a response. In it, the squatter claims both that he has a non-resale use in mind for the domain, and that “in the past we’ve set a minimum of $1000 for resale”. Those are both lies, in my opinion, and hard to reconcile with one another. What a fool.

So, I went and registered atyp.us instead. Why? So that this site could be reachable as http://pl.atyp.us. Try it. ;-) Don’t worry, old links will still be valid; the old domain isn’t going away, now I just have two names for the exact same site.

Teaching the tech at JTLnet how to set up Apache to recognize the two domains as the same, and especially how to ensure that the hostname “pl” was accepted, was quite a chore, and one that I was surprised to find necessary considering that virtual hosts are the core of how such companies do business. Here’s a hint for the guy I spoke to: in this business you can often get away with being incompetent or being rude, but if you’re both at once you’ll lose business. Either figure out how to do your job, or lose the attitude.