Cloud Computing Definition

Cloud computing is where you can’t get your work done because a service you’ve never heard of has failed somewhere halfway around the world. Yes, I know it’s a variant on an old joke, which was about a server instead of a service and it was down the hall instead of around the world, but there is definitely an “everything old is new again” aspect to a lot of the cloud-computing hype so we might as well recycle the jokes too.

P.S. If anyone can provide an authoritative source for the original, let me know and I’ll be glad to cite it. I know it’s not mine and I am in no way trying to take credit for it, but my Google-fu seems to be weak this morning.

P.S. Thanks to Paddy for identifying Leslie Lamport as the source of the original remark.

Thoughts on Network Diameter

I actually started composing this in my head before Matt Reilly’s post about Green Computing, but it kind of fits in with some of the points he makes about communication speed as a factor in overall performance. One of the properties of a Kautz graph, such as we at SiCortex use in our systems, is that

For a fixed degree M and number of vertices V = (M + 1)M^N, the Kautz graph has the smallest diameter of any possible directed graph with V vertices and degree M. (from the Wikipedia article)

UPDATE 2008-11-10: I happened to read this while looking for something else, and realized that there are better torus routing methods than those I had considered. I think I’ve invented an even better one than the current general practice, but for now I’ve updated the text below to reflect the general-practice numbers (which don’t affect my argument at all).

Yeah, lovely, what does it mean? What it means is that if you want to make a system of a certain size, let’s say approximately 1000 nodes, and you’re constrained as to how many links per node you can have, you can achieve the smallest network diameter by arranging your nodes in a Kautz graph. For example, our 972-node system using three outbound links per node (i.e. degree=3) has diameter=6. That’s the maximum number of hops to get from a node X to any other node Y; the average is approximately five. By contrast, for a 10x10x10 hypertorus also using three outbound links per node, the diameter would be 18 and the average hop count would be almost 10. I’m not actually familiar with the proof for the statement quoted above and wouldn’t understand it even if it were shown to me, but it seems very believable based on my experience. Every time I try to think about tweaking some other topology to reduce the average hop count or bring the links per node down to reasonable levels (you can’t feasibly build a system that requires dozens of links on every node), the result starts to look more and more like a Kautz graph in terms of routing, partitioning and tiling characteristics, etc.
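
If you want to check those Kautz numbers yourself, a few lines of Python will do it (a toy graph-builder of my own, nothing to do with our actual routing code): build the degree-3 Kautz graph on 972 nodes as words of length six over a four-symbol alphabet with no two adjacent symbols equal, with an edge from each word to every word formed by dropping its first symbol and appending a different one, then breadth-first search from every node.

  from collections import deque
  from itertools import product

  M, L = 3, 6                  # degree 3, word length 6: (M + 1) * M**(L - 1) = 972 nodes
  symbols = range(M + 1)
  nodes = [w for w in product(symbols, repeat=L)
           if all(a != b for a, b in zip(w, w[1:]))]
  # Kautz edges: shift off the first symbol, append any symbol other than the last.
  edges = {w: [w[1:] + (s,) for s in symbols if s != w[-1]] for w in nodes}

  def hops_from(src):
      """Breadth-first search: shortest hop count from src to every other node."""
      dist = {src: 0}
      queue = deque([src])
      while queue:
          u = queue.popleft()
          for v in edges[u]:
              if v not in dist:
                  dist[v] = dist[u] + 1
                  queue.append(v)
      return [d for d in dist.values() if d > 0]

  all_hops = [d for n in nodes for d in hops_from(n)]
  # Prints the node count, the diameter, and the average hop count.
  print(len(nodes), max(all_hops), sum(all_hops) / len(all_hops))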

So far, so good, but how does that translate into anything useful? Well, if communication speed is a factor in overall performance, and communication speed falls through the floor when the system’s communication capacity is exceeded, then that capacity matters a lot. The common way to compare the communication capacity between different systems is to compare bisection bandwidth – how much capacity must be removed to divide the machine into two equal parts? It turns out that measuring the bisection bandwidth of a Kautz graph is not straightforward, and even the Kautz cognoscenti at work seem to disagree about the result. Personally, I think the whole bisection-bandwidth approach is bass-ackwards. Why measure the negative (minimum capacity you could remove) instead of the positive (maximum capacity you could use)? I think it happened because a naive attempt to measure maximum usable capacity would allow a system to be partitioned into small islands communicating amongst themselves, yielding inflated figures that don’t match what a real application running across the entire system could get. That’s fixable in a measure-the-positive approach, though. All you need to do is say that you’re going to measure the maximum bandwidth when each node in one half of the system sends equal amounts of traffic to each node in the other. That way, if you want to combine the figures for two islands (often corresponding to two switches or cabinets) then you have to pass half of your traffic between them, and what you get is a pretty fair approximation of what an actual full-scale application could get.

This is where network diameter and particularly average hop counts start to matter. The maximum usable communication capacity of a system is the aggregate link speed divided by the average hop count. Therefore, even if you can distribute your traffic perfectly across all links in the system for the test I propose above, your performance ceiling will be determined by the average hop count in your topology. For example, in our 972-node Kautz graph that’s 2916 links’ worth divided by an average hop count of five, or 583.2 links’ worth total. For that 1000-node hypertorus, it’s 3000 links’ worth divided by an average hop count of 10, or only 300 links’ worth for a similarly sized system. Looked at another way, your links would have to be almost twice as fast just to keep up. If you made those links bidirectional you’d double the number of (half-duplex) links and reduce your average hop count to around 7.5, so you’d be getting 800 links’ worth of total capacity, but then you’d be talking about a system with vastly greater hardware complexity and cost than the others. If you wanted to compare apples to apples, a thousand-node Kautz graph with degree=6 would have the same 6000 links with an average hop count around 3.5, so the hypertorus would still be at a serious disadvantage. Then you have to consider that we’re talking about ceilings here, and that even in the abstract different topologies will allow for different levels of “perfection” as regards distributing traffic among all the links in the system. Maybe I’ll explore some of those issues at some time in the future. For now the key point is that, much as a more efficient algorithm will always eventually win out over a faster implementation, a more efficient topology will always eventually win out over raw bandwidth. In my opinion, bisection bandwidth as usually measured doesn’t tell that story well enough.
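
The arithmetic behind those numbers is simple enough to spell out (the figures below are just the ones quoted above):

  def usable_capacity(links, avg_hops):
      """Ceiling on usable capacity, in links' worth of bandwidth: every message
      occupies one link per hop, so aggregate link bandwidth gets divided by
      the average hop count."""
      return links / avg_hops

  print(usable_capacity(2916, 5))    # 972-node Kautz graph, degree 3   -> 583.2
  print(usable_capacity(3000, 10))   # 1000-node unidirectional torus   -> 300.0
  print(usable_capacity(6000, 7.5))  # torus with bidirectional links   -> 800.0
  print(usable_capacity(6000, 3.5))  # 1000-node Kautz graph, degree 6  -> ~1714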

Helpful Interpeople

I’m in lovely Bangor (Maine) for work, and chose to stay at the Sheraton Four Points hotel and construction zone because it’s near the bus station. Yes, I took the bus, because flying would take just as long for ten times the price and driving wouldn’t have been good for my brand new car (more about that after I get home). One of the features here, as most everywhere nowadays, is free wi-fi via stayhome a.k.a. e-centre. I could browse OK, but I was having trouble with email and ssh tunnels. They’d just lock up. After trying many things to debug the problem, I had sort of concluded that the provider must be blocking traffic, so I googled for their name plus various relevant terms.

It turns out that they’re not blocking intentionally; they’re just screwed up. I pretty quickly found one article on an Eee forum (I’m using an Eee but wasn’t specifically searching for anything related to that) which pointed me to a solution – turn off TCP window scaling. Huh? I know a thing or two about TCP, and I’m still having trouble imagining how e-centre could have screwed things up so that mishandled window scaling affects email but not web browsing (which is more likely to transfer large amounts of data). Clearly they’re aware of it as a problem affecting users, so why don’t they fix their routers?
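
For the record, on a Linux box like the Eee the workaround comes down to flipping one kernel knob (you need root, and of course this just papers over the provider’s brokenness rather than fixing it):

  # Disable TCP window scaling (the workaround from the Eee forum) by writing to
  # the standard Linux sysctl entry; write "1" to turn it back on later.
  with open("/proc/sys/net/ipv4/tcp_window_scaling", "w") as f:
      f.write("0\n")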

Oh well, at least it works now.

Catching Strays

If I were ever to teach an actual class, it would probably be about how to write networking code that actually works. A lot of people teach about protocols and algorithms, and that stuff’s definitely important, but I see a lot less out there about the nitty-gritty details of how to implement those things in a debuggable and maintainable kind of way. There seems to be a huge gap between the things CS professors tell us we should be doing and the actual facilities that are present on most systems, and it’s increasingly apparent that a lot of programmers are falling into that gap. Even if you already have a complete and clear description of the protocols you’ll be implementing, actually turning that into working (and perhaps even efficient) code is a hard problem all by itself. Here’s another example of a “defensive programming technique” that you can use to save time on mere debugging so you can spend more on the fun stuff.

One of the most common problems network programmers face is the prospect of a message arriving later than it should. Often the incoming message is a reply to one you had sent earlier, but by the time it arrives the object it referred to has already been deleted or has changed state. This is part of the reason why I don’t like timeouts very much, and it’s one of many reasons why it’s a bad idea to send out pointers and expect them to be valid when they’re sent back. Textbooks are full of acknowledgment and sliding-window protocols to handle this for the case of messages sent through a single channel. In some cases you can take advantage of that by closing an old channel and opening a new one, but that’s not a general solution. Closing and opening channels frequently is disruptive and inefficient, you might be dealing with orders of magnitude too many objects to make such an approach feasible, you might have other reasons for not using a protocol that imposes ordering as well as providing exactly-once semantics, etc. For these or other reasons, you might need to deal with this issue yourself.

The way to avoid “stale message” problems is to observe that every message has a context. There’s some reason you’re getting it, something the sender knows – or thinks he knows – about the state or information you hold. If your state changes, it might invalidate the sender’s assumptions and that can be reflected by creating a new context to replace the old. The easiest way to do this is to represent the relevant context in the form of an object, and to establish a rule that every message you handle must identify an object to which it pertains. These objects must have three basic attributes:

  • A unique ID, so you can look it up.
  • A generation number, so you can reject messages from a previous “epoch” of the object’s existence (including another free/allocate cycle). Note that the generation number needs to be large enough to avoid wraparound, just like all that textbook stuff for the single-channel case.
  • A state, so you can reject messages that don’t make sense any more (or perhaps never did).

The objects used to establish a message’s context might not be objects as they would otherwise exist in your program. They could also represent requests, transactions, asynchronous event handlers, or groups of any of these things – in fact, whatever contexts you use when you think about your program. In many of these cases, the object also provides a handy place to maintain lock state, timestamps, or other information associated with handling the message. It’s OK to create a “global” object to establish context for messages that have a genuinely global effect. The ID can be an index into a table, and in fact that’s where the generation number becomes most useful. Similarly, it can be very useful if the ID that you expose externally is actually an ID+generation internally, so the beginning of your message handler can look like this:

  1. Extract the external ID from the message header.
  2. Separate the external ID into the real ID and the generation number.
  3. Use the real ID to look up the appropriate object.
  4. Check the message’s generation number against the object’s, and reject it (loudly) if they don’t match.
  5. Check the message type against the object’s state (more about this in a moment) and reject etc.
  6. Process the message, secure in the knowledge that it’s now safe to do so.
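
Putting those steps together, the prologue might look something like the following rough sketch (in Python for brevity; the names, the 16-bit generation field, and the table layout are all illustrative choices, not prescriptions):

  GEN_BITS = 16                       # wide enough that wraparound is a non-issue here
  GEN_MASK = (1 << GEN_BITS) - 1

  class MsgContext:
      def __init__(self, index):
          self.index = index          # slot in the context table
          self.generation = 0         # bumped on every free/allocate cycle
          self.state = "IDLE"
          self.valid_states = {}      # message type -> states that may accept it

      @property
      def external_id(self):
          # The ID handed out externally packs the table index and the generation.
          return (self.index << GEN_BITS) | (self.generation & GEN_MASK)

  contexts = []                       # the context table, indexed by MsgContext.index

  def handle_message(msg_type, external_id, payload):
      index = external_id >> GEN_BITS                    # 1-2: extract and split the ID
      generation = external_id & GEN_MASK
      if index >= len(contexts):                         # 3: look up the object
          raise ValueError("stray message: no such context %d" % index)
      ctx = contexts[index]
      if generation != (ctx.generation & GEN_MASK):      # 4: stale epoch, reject loudly
          raise ValueError("stray message: generation mismatch on context %d" % index)
      if ctx.state not in ctx.valid_states.get(msg_type, ()):   # 5: wrong state
          raise ValueError("unexpected %s in state %s" % (msg_type, ctx.state))
      # 6: process the message, secure in the knowledge that it's now safe to do so.
      ...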

The part about checking state is also important. Basically, the idea here is that certain messages are only valid in certain states, and it’s the same idea as in Microsoft’s Singularity project. If a request object should only get a RequestComplete message while it’s in a RequestInFlight state and not when it’s in a RequestOnQueue state, that can be enforced even before real processing begins by assigning the appropriate states. Watch out for synchronization issues around state changes, though. If you’re using the kind of staged execution model that I’ve recommended elsewhere, these checks should happen before releasing the lock that allows another thread to take over the message-dispatch role. Also, keep it simple. States and messages that are uniquely associated with one another are pretty safe, but if you have too many rules more complex than “if type is X then state must (not) be Y” then you’re probably asking for trouble. More complex state models can protect you from more things, but the one thing we’re really concerned with here is stray (i.e. delayed or duplicate) messages, and chasing lots of false alarms from a too-complex state model is a cure worse than the disease.

Adopting this methodology won’t protect you from all stray-message problems. For example, if a message to you is duplicated and the duplicates arrive back to back, then they might get processed in the same generation and state. If that’s a possibility for you, you’ll need to work out another mechanism to deal with it and then subject that mechanism to some more rigorous formal verification. What these simple tricks represent is more of a coding style that inherently avoids some of the most common problems from timeouts and retransmissions and race conditions, complementary to any protocol-specific analysis you might do.

Evil Timeouts

A few days ago, Pierre Phaneuf revived a discussion about timeouts on a very old article. I’ve spent the last few days at the 2008 Lustre User Group meeting in beautiful Sonoma CA and, since I’m stuck at SFO waiting for my red-eye home (blech blech blech), it seems like a good time to write some more about why I think timeouts are evil.

The Lustre folks are making a bunch of changes to deal with timeout issues on very large systems. For example, they’re moving to adaptive instead of fixed timeouts, and introducing a new “Network Request Scheduler” that basically does fair queuing at the Lustre level. Why? Because administrators of large and even medium-sized Lustre systems have gotten sick and tired of logs filled with “RPC timeout” messages and “haven’t heard from X for a while so I’m evicting him” and worse. In essence, Lustre is doing a good job of proving that per-request timeouts don’t scale. NRS will help to prevent outright starvation, but that’s only the most immediate manifestation of the fundamental problem. Let’s say node X sends a message and expects a reply. What happens when 100K other nodes have messages in the queue ahead of X’s, and each one takes more than timeout/100K to execute? Same as now, basically. There are many solutions, but pretty much all of them involve making NRS more complex and increasing protocol overhead without addressing the fundamental problem.
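
To put rough numbers on that (purely illustrative arithmetic, not Lustre measurements):

  # With 100,000 requests queued ahead of X's, each one needs to average only
  # timeout/100,000 of service time before X's fixed timeout fires, even though
  # nothing has actually failed.
  timeout = 100.0                       # seconds; a deliberately generous timeout
  queued_ahead = 100_000
  per_request_budget = timeout / queued_ahead
  print(per_request_budget)             # 0.001 s -- exceed 1 ms apiece and X "times out"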

In various systems that I have designed and that have actually worked, I’ve followed a fairly simple rule: there is only one component in the system that can determine anything about the health of another node. Everybody else who sent something to that node will wait indefinitely until they get either a response or an indication from that one subsystem that the other end is dead. That way only the heartbeat subsystem has to be designed – and it does have to be designed carefully – to deal with all of the reliability and scalability issues of determining who’s alive or dead. You don’t have N components in the system each responding in different ways and at different time horizons to a real or perceived node failure. That way inevitably leads to disaster.
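
Here is roughly what that rule looks like in code (a bare-bones Python sketch of my own, not lifted from any of those systems): the waiter has no timeout parameter at all, and only the heartbeat subsystem is allowed to end the wait with bad news.

  import threading

  class PendingReply:
      """What a sender blocks on after sending a request to another node."""

      def __init__(self):
          self._event = threading.Event()
          self.reply = None
          self.peer_died = False

      def complete(self, reply):
          """Called by the receive path when the response finally arrives."""
          self.reply = reply
          self._event.set()

      def abort_peer_dead(self):
          """Called ONLY by the heartbeat subsystem, never by individual callers."""
          self.peer_died = True
          self._event.set()

      def wait(self):
          """Deliberately takes no timeout: wait for a reply or a death notice."""
          self._event.wait()
          if self.peer_died:
              raise ConnectionError("heartbeat subsystem declared the peer dead")
          return self.reply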

Note, however, that decisions and actions other than detecting and responding to a node failure can still be timeout-based. The best example is retransmissions over an unreliable network. It’s entirely possible, and in a large system likely, that a message will get lost and the sender will wait indefinitely while the heartbeat subsystem never sees a problem and thus rightly never sends any node-failure indication to end the wait. It’s entirely legitimate and necessary to retransmit in such a case, though of course the retransmission policy should be considered with an eye toward scalability and the implementation should ideally be invisible to higher software layers.
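
The retransmission side of that distinction might look like this (again just a sketch with made-up names): the sender resends on a timer because packets can get lost, but a missing reply never turns into a verdict on the other node’s health.

  import threading

  def send_until_replied(send, message, reply_event: threading.Event,
                         retransmit_interval=1.0):
      """Retransmit on a timer, but never give up; only the heartbeat subsystem
      may decide the peer is dead and end the wait by other means."""
      send(message)
      while not reply_event.wait(timeout=retransmit_interval):
          # No reply yet -- maybe the message or its reply was lost. Resend and
          # keep waiting; draw no conclusions about the peer's health.
          send(message)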

To pick the obvious example, what might the world be like if Lustre had followed this rule? NRS or something like it might still be necessary for other reasons, but not to prevent the ill effects of prematurely deciding a node is dead merely because it’s too overloaded to answer one request quickly enough. Adaptive timeouts wouldn’t even make sense. Perhaps most significantly, users wouldn’t have to implement failover using a completely separate package that triggers action independently of when Lustre itself detects and responds to RPC failures. That’s a tuning nightmare at best, and it’s highly questionable whether you could ever really trust such a setup not to tie itself up in knots at the most inopportune time (i.e. during heavy load).

Narrow Vision

I don’t have much time to write more about the whole congestion-control thing right now, but if I hear one more person blather about how the next version of DOCSIS will or won’t solve the problem, then I will fly to wherever that person is and strangle them with some optical fiber or twisted pair (their choice). A problem that exists within only one last-mile technology is a problem for the service provider alone, and a solution implemented only within one last-mile technology is no solution at all. When you act as though all the world’s a cable modem, you only show that you don’t really understand the problem.

Congestion Control II

A couple of days ago, I wrote an article about congestion control on the internet. In the comments, I promised a more technical exploration of the subject, and this is it.

Congestion is a problem on the internet. It will always be a problem on the internet, just as it is on our roadways. Yes, bandwidth keeps increasing. So does the number of users, and the number of network applications per user, and the required bandwidth per application. Supply will never stay ahead of demand for long, if at all. Sooner or later, it will become logistically or economically infeasible for service providers to add more bandwidth all throughout their networks. Sooner or later, enough people will be sending enough traffic over the network’s weakest links that it won’t matter if other parts of the network are overbuilt. The “more bandwidth” crowd will lose to those who recognized the need for congestion control and did something about it.

When thinking about how to handle congestion, it’s possible to look at things from many different perspectives. Here are some examples:

  • Different kinds/definitions of fairness.
  • The core of the network vs. the edge.
  • Distinguishing flows vs. hosts vs. “economic entities” (i.e. users).

To keep this from turning into a major dissertation, I’m going to say little about these.

  • On fairness, I think Briscoe has really said all that needs to be said about flow-rate fairness vs. cost fairness. Note that “flow-rate fairness” does not necessarily refer to flows in the sense that they’re often discussed in the congestion-control literature. It refers to the rate at which things (packets) flow, whether they’re associated with connections or hosts or anything else. It’s an unfortunate re-use of terminology, but that’s life. I think cost fairness is the proper goal/standard for congestion control. If you disagree, keep it to yourself. I might discuss things in terms of cost fairness, but for the most part the concerns I’ll address are directly translatable to a flow-rate-fairness world and I don’t want to get bogged down trying to make the flow-rate zealots happy.
  • As for the core vs. the edge, I bring it up because they’re very different environments calling for very different kinds of congestion management. While routers at the edge can reasonably hope to make flow/host/user distinctions (modulo the mess that NAT introduces), routers at the core have no such hope. They must also process packets at a much higher rate, so the most they can do is make hopefully-useful guesses about what packets to drop when they’re overloaded. The dumbest approach is simply to drop the last packets to arrive. The various flavors of RED (Random Early Detect) are slightly better, and I believe Stochastic Fair Blue (which I’ve written about before) is even better, but they’re all still basically guesses. While congestion control in the core is a fascinating subject, it’s not what I’ll be writing about here.
  • On flows vs. hosts vs. users, most of the people weighing in on these issues have tended to focus on users and I’ll follow suit. NAT makes a hopeless muddle of the whole thing anyway, so the only thing you can really be sure of is one entity paying the bill for one physical line. There’s a lot of merit to the argument that if there are multiple users huddled behind NAT then that’s their problem anyway.

OK, enough stage-setting. On with the show.

Congestion Control

I had been meaning to write about network congestion control for a while anyway, but it seems like now I have an extra reason. George Ou just posted an article called Fixing the unfairness of TCP congestion control, lauding (and somewhat misrepresenting) Bob Briscoe’s excellent paper on Flow Rate Fairness: Dismantling a Religion. Because Ou is an ardent opponent of so-called “network neutrality” the response has been predictable; the net-neuts have all gone nuts attacking Ou, while all but ignoring Briscoe. One example, sadly, is Wes Felter.

This is fine, but for the same cost as detecting unfair TCP, ISPs could probably just implement fair queueing.

I think you’re a great guy, Wes, but did you actually read Briscoe’s paper? You cite Nagle’s suggestion of queuing at ingress points, as though it somehow contradicts Briscoe, but that is in fact pretty much what Briscoe ends up recommending. The only difference is in how we define “fair” and Briscoe makes a pretty good case for cost fairness instead of flow-rate fairness. He points out, for example, how current techniques merely create an “arms race” between service providers and the application developers, creating new problems as the race continues but never really solving the old one. He even takes aim at some of the tricks that providers like Comcast have been using.

While everyone prevaricates, novel p2p applications have started to thoroughly exploit this architectural vacuum with no guilt or shame, by just running more flows for longer. Application developers assume, and they have been led to assume, that fairness is dealt with by TCP at the transport layer. In response some ISPs are deploying kludges like volume caps or throttling specific applications using deep packet inspection. Innocent experimental probing has turned into an arms race. The p2p community’s early concern for the good of the Internet is being set aside, aided and abetted by commercial concerns, in pursuit of a more pressing battle against the ISPs that are fighting back. Bystanders sharing the same capacity are suffering heavy collateral damage.

That sounds like a pretty serious indictment of forging RST packets and so on. What more could you want? The simple fact is that congestion control is necessary, it will always be necessary, and current methods aren’t doing a very good job. This issue is almost orthogonal to network neutrality, as it’s about responses to actual behavior and not preemptive special treatment based on source or destination before any behavior is manifested, so don’t let opposition to Ou or to his views on network neutrality color your evaluation.

Gaming-resistance is just as desirable a property in congestion control as it is in other areas we’ve both studied, and right now real users are seeing degraded performance because a few developers are gaming this system. I remember talking to Bram Cohen about this issue while BitTorrent was being developed, for example. He was very well aware that congestion control would sometimes slow his transfer rates and very deliberately designed BitTorrent to circumvent it. That benefits BitTorrent users, and I just used BitTorrent to download an ISO yesterday so I’m not unaware of its value, but really, what makes BitTorrent users so special? Why should those who create congestion not feel its effects first and most? How, exactly, is it “fair” to anyone else that they don’t?

Virtual Hosting on Amazon

Just for fun, I decided to experiment a bit with hosting some of the files from here through Amazon’s S3, which I’ve talked about before. First, this meant I had to create an S3 “bucket” to hold the files. Then I had to create a subdomain that redirects requests to S3. I chose “womb.atyp.us” because “wombatypus” is the punchline (as it were) of a children’s book we have involving a wombat and a platypus. It has nothing to do with wombs. Then I just had to upload the files and set their permissions, and voila! Here are the three videos from my previous post, hosted through S3: Counting Ears, Magnet Book, and Caterpillar. One interesting twist is that the files are also available via BitTorrent, just by tacking “?torrent” on the end, like this: Counting Ears, Magnet Book, and Caterpillar. Neat, huh?

There’s an interesting tradeoff here. Right now I have a hosting plan at eMax that gives me a certain amount of bandwidth. I’ve been using about 60% of that, so the other 40% is effectively free for now. By contrast, I pay for exactly what I use on Amazon, so as long as I’m below my current hosting-account limit hitting the pages through Amazon actually costs me extra. It’s literally pennies, so don’t think you’re going to send me to the poor-house that way. On the other hand, let’s say my traffic doubled. If I wanted to host everything at eMax, that would mean moving up to the next service level. The cost of hosting some files through S3 is probably less than the delta between the old and new service levels, and it’s definitely less than the extra-bandwidth charges I’d incur at eMax by staying at my current service level even though my traffic was higher. The S3 cost is even lower if people download through BitTorrent. Lastly, although I’ve had no complaints with availability at eMax, I’d be willing to bet that Amazon’s servers are even more reliable.

The upshot is that cheap hosting accounts are so cheap that it probably makes sense for most small personal websites to stay with them. It doesn’t take long, though, before the S3 alternative starts to seem rather appealing. I’m not there yet, but it wouldn’t take much for me to be, and I’m not exactly a net celebrity. On the other hand, S3 can only serve static files, so for blogs and other sites that rely on server-side scripts and databases it’s not an option at all. That’s why you can see the videos through S3, but not this page. What I think we’re likely to start seeing is more web hosts providing some level of integration with S3, so that the web host provides the computational and database resources but (some or all) files live on S3. Then there’s the possibility that Amazon’s Elastic Compute Cloud will catch on, so that a new class of resellers will spring up to use Amazon’s virtualized services instead of traditional data centers’ physical ones. It makes me wonder whether the next logical step for Amazon would be to provide a virtualized database service, at which point all of the facilities currently provided by traditional web hosts could also be had virtually through Amazon. Just a thought.

Backup Using Amazon’s S3

After reading Jeremy Zawodny’s analysis of backup-system costs, I decided to give Amazon’s Simple Storage Service another look. Shortly after that, he posted a list of resources that was also very helpful. I immediately tried Carbonite and JungleDisk. (Yes, I know Carbonite might or might not be S3 based. Speculation abounds, but there doesn’t seem to be any hard evidence either way. In any case, it plays in the same space.) I found both intolerably slow, each using less than one tenth of the bandwidth I know I have either at work or at home. Carbonite is also a for-pay service (with a free trial) so it’s a lot less interesting. JungleDisk, on the other hand, is free for now, but the author clearly means to profit from it some day. It’s also a little careless about things like leaving your Amazon “secret” key lying around in plaintext on your PC, and I don’t appreciate that kind of thing. I gave S3Drive a look, intrigued by the fact that it’s implemented as a true filesystem which works for all programs and not just as an Explorer-limited “web folder” or namespace extension, or (most limiting of all) as a separate standalone program. Unfortunately, even the author admits that it’s slow and uses a lot of memory, and doesn’t recommend storing more than 5MB. Sorry, but that’s not even enough for testing. I’ll pass for now; maybe I’ll come back and check it out again in a few more months. Currently I’m giving S3 Backup a try. It does suffer from being a standalone program, but it does seem to transfer data much faster and it’s more respectful of users’ privacy. Also, the pace of development seems high so there’s promise of it getting better.

The idea that really interests me, though, is of using S3 to back up this website. Yeah, the one you’re looking at right now. You see, I have decent bandwidth at home and at work, but it’s still nowhere near what either my web host or Amazon has. Why suck all that data through the thin pipe to do a backup when there’s a fat pipe between where the data reside now and a secure backup location? What if I need to change hosts yet again? Why bounce everything through my home connection instead of through S3? I think it would be far better to use the thin pipe only for control, to set up a “third party transfer” directly between the website and S3. That product space seems a lot more thinly populated, so instead of looking for programs I’ve been looking for libraries so I can write my own simple backup program. The language of interest here is PHP, for two reasons:

  • I sort of know PHP, and I don’t know Ruby (yet). I’m far more interested right now in getting things done and possibly learning something about S3 than in learning yet another language, thankyouverymuch.
  • A lot of other people who might find anything I produce useful are likely to have web hosts who support PHP but not Ruby.

Amazon has an example of using S3 from PHP, but it’s pretty basic. As far as I’ve been able to tell so far, the canonical PHP interface to S3 is neurofuzzy’s library so that’s probably where I’ll start. If nothing else, it should be a helpful guide to what the underlying S3 API looks like in real life and not just on paper.
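
For the curious, here’s roughly what that underlying API amounts to, sketched in Python only because that’s quicker for me to sketch than PHP (the bucket, key, and credentials are obviously placeholders, and this is the plain REST interface with its HMAC-SHA1 request signing, not any particular library):

  import base64, hashlib, hmac, http.client
  from email.utils import formatdate

  ACCESS_KEY = "YOUR_ACCESS_KEY"          # placeholders, obviously
  SECRET_KEY = b"YOUR_SECRET_KEY"
  BUCKET, KEY = "my-bucket", "backups/site.tar.gz"

  def s3_put(data, content_type="application/octet-stream"):
      """Minimal signed PUT of one object, straight against the S3 REST API."""
      date = formatdate(usegmt=True)
      resource = "/%s/%s" % (BUCKET, KEY)
      # String-to-sign: verb, Content-MD5 (empty here), type, date, resource.
      string_to_sign = "PUT\n\n%s\n%s\n%s" % (content_type, date, resource)
      signature = base64.b64encode(
          hmac.new(SECRET_KEY, string_to_sign.encode("utf-8"), hashlib.sha1).digest()
      ).decode("ascii")
      conn = http.client.HTTPSConnection("s3.amazonaws.com")
      conn.request("PUT", resource, body=data, headers={
          "Date": date,
          "Content-Type": content_type,
          "Authorization": "AWS %s:%s" % (ACCESS_KEY, signature),
      })
      resp = conn.getresponse()
      return resp.status, resp.read()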

One last thought, included here just for the sake of having a place to put it. If it turns out that S3 is too heavily optimized toward storing large objects (i.e. that performance is limited by number of object operations rather than number of bytes) then it seems like a one-to-one mapping of user files to S3 objects might not be a very good idea. The question, then, is how to aggregate user files into larger S3 objects without having to rewrite an entire large object whenever one small object within it changes. One approach I’ve been toying with is to use something like a log-structured or atomic-update filesystem, with S3 objects representing (an infinite supply of) disk slices instead of files. As you write, you actually write into a new slice. When it’s full, or at other “strategic” times, it gets linked into the overall filesystem hierarchy to supplant earlier versions. The ratio between user actions and S3 actions can therefore be extremely high without sacrificing data integrity. I don’t know yet whether such an approach is really a good idea, but maybe it’s something other filesystem geeks would like to chew on.
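
To make the slice idea slightly more concrete, here’s a toy sketch (in-memory stand-ins for S3 objects, no persistence and no atomic index update, all names mine):

  SLICE_LIMIT = 4 * 1024 * 1024              # flush a slice at roughly 4 MB

  class SliceStore:
      """Pack many small user files into large, append-only slices."""

      def __init__(self):
          self.slices = {}                   # slice name -> bytes (stand-in for S3 objects)
          self.index = {}                    # path -> (slice name, offset, length)
          self.current = bytearray()
          self.slice_no = 0

      def _current_name(self):
          return "slice-%06d" % self.slice_no

      def write(self, path, data):
          """Append a (new version of a) small file to the open slice."""
          if len(self.current) + len(data) > SLICE_LIMIT:
              self.flush()
          self.index[path] = (self._current_name(), len(self.current), len(data))
          self.current.extend(data)

      def flush(self):
          """A 'strategic time': seal the slice so it can supplant earlier versions."""
          self.slices[self._current_name()] = bytes(self.current)
          self.current = bytearray()
          self.slice_no += 1

      def read(self, path):
          name, offset, length = self.index[path]
          blob = self.slices.get(name, bytes(self.current))   # may still be the open slice
          return blob[offset:offset + length]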