Canned Platypus

Making the world better, one byte at a time.

Archive for the ‘networks’ Category

One of the dangers of deploying in a public cloud is that Evil People are very much attracted to public clouds. This is for two main reasons:

  • Public clouds are a target-rich environment for evil-doers. The IP ranges tend to be densely packed with servers, each of which might contain a large trove of easily-exploitable information. Contrast this with a public connectivity provider such as Verizon or Comcast, where any given IP address is much more likely to be either unused or assigned to an individual PC containing only a few pieces of exploitable information.
  • Most public clouds’ network virtualization and firewall setups make it hard to do proper intrusion detection, and providers’ own intrusion detection is likely to be of little use. Heck, you can’t even rely on most of them doing any intrusion detection, since they won’t tell you whether they do.

This isn’t just random paranoia; it’s actual experience. I’ve been running my own little server in the Rackspace cloud for a little while. Here’s a little tally of failed ssh login attempts for that one machine over a mere week and a half.

382 195.149.118.43
136 61.83.228.112
103 89.233.173.91
60 195.149.118.43
54 210.51.47.177
53 74.205.222.27
37 200.27.127.95
36 74.205.222.27
32 61.83.228.112
20 74.126.30.189
12 222.122.161.197
10 74.126.30.189
8 210.51.47.177
6 202.85.216.252
2 78.110.170.108
2 222.122.161.197

That’s about four attempts per hour, for a site that nobody really has any reason to know exists. Of course, they’re not evenly spaced. The most recent attack came at one attempt per two seconds. The attacks are also coming from all over; the top three addresses above are from Poland, Korea, and Germany respectively. It’s also worth looking at what accounts people are trying to break into.

532 root
18 ts
16 admin
15 postgres
14 test
14 oracle
14 nagios
12 mysql
11 shoutcast
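
For anyone who wants to produce tallies like the two above for their own machine, here is a minimal sketch. It assumes the usual OpenSSH "Failed password for ... from ... port ..." log lines; the function name and the sample addresses are mine, and your log format or location may differ.

```python
import re
from collections import Counter

# Assumed OpenSSH log shape, for example:
#   "Failed password for invalid user admin from 203.0.113.7 port 4711 ssh2"
FAILED = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)

def tally_failures(lines):
    """Return (attempts per source address, attempts per account name)."""
    by_ip, by_user = Counter(), Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            by_ip[match["ip"]] += 1
            by_user[match["user"]] += 1
    return by_ip, by_user
```

Feeding it the lines of the ssh log and printing `most_common()` for each counter gives a count-then-key listing in the same shape as the ones shown here.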

Note that this is just one (particularly blatant) kind of attack, on one unremarkable machine, over a short time period. Imagine what the numbers must be for all attacks across a whole farm of machines, especially if early probes had shown encouraging signs that the machines were weakly protected. I’m sure Rackspace makes some efforts to defeat or at least detect intrusion attempts, but how much can it be? If you were a cloud operator, what would you think of the following pattern?

  • Many ssh connections from the same external host, previously unknown to be associated with the internal one.
  • All connections spaced exactly two seconds apart.
  • Each connection made and then abandoned with practically zero data transfer (not even enough for a login prompt).
  • The same pattern repeated for other internal hosts belonging to different customers, either simultaneously or in quick succession.

That seems like one of the most glaringly obvious intrusion signatures I can think of, worthy of notifying someone. For all I know Rackspace does detect such patterns, and either cuts off or throttles the offending IP address, but there seems to be little sign of that. This is not to pick on Rackspace, either. I picked them for a reason, and I’ll bet the vast majority of other providers are even less secure. The real point is that if you’re in the cloud, even if it’s a good cloud, you need to be extra careful not to leave ports or accounts easily accessible to the sorts of folks who are aggressively probing your provider’s address space looking for such open doors.
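
To show how little it would take to spot the signature listed above, here is a rough sketch of a detector. Everything about it is an assumption for illustration: the event tuple layout, the function names, and the thresholds are mine, not anything a real provider is known to use.

```python
from collections import defaultdict

def looks_scripted(conns, min_conns, max_jitter, max_bytes):
    """conns: list of (timestamp_seconds, bytes_transferred) for one source/target pair."""
    if len(conns) < max(min_conns, 2):
        return False
    times = sorted(ts for ts, _ in conns)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    # Machine-like regularity: every gap is close to the mean gap.
    evenly_spaced = all(abs(gap - mean_gap) <= max_jitter for gap in gaps)
    # Connections abandoned with practically no data exchanged.
    nearly_empty = all(nbytes <= max_bytes for _, nbytes in conns)
    return evenly_spaced and nearly_empty

def suspicious_sources(events, min_conns=10, max_jitter=0.25,
                       max_bytes=64, min_targets=2):
    """events: iterable of (timestamp_seconds, src_ip, dst_ip, bytes_transferred)."""
    per_pair = defaultdict(list)
    for ts, src, dst, nbytes in events:
        per_pair[(src, dst)].append((ts, nbytes))
    hits = defaultdict(set)
    for (src, dst), conns in per_pair.items():
        if looks_scripted(conns, min_conns, max_jitter, max_bytes):
            hits[src].add(dst)
    # Flag only sources that repeat the pattern against several internal hosts.
    return sorted(src for src, dsts in hits.items() if len(dsts) >= min_targets)
```

The per-pair grouping matters: a source probing several customers at once still shows the two-second rhythm against each target individually, even though its combined traffic looks less regular.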

Maybe this feature already exists, and I never saw it mentioned anywhere. When somebody goes on vacation, they often set a vacation auto-reply to let people know that their message won’t be read until the vacation ends. Some people think this is good etiquette; some think it’s bad. Some employers require it; others forbid it. Anyway, let’s say that you work in a place where a lot of your email is automatically generated, such as by bug trackers or source-control systems. It seems such a waste for your mailer to attempt a vacation auto-reply which will just go into a black hole somewhere. Wouldn’t it be better to prevent such an auto-reply from being generated? Something like this:

X-Allow-Auto-Reply: no

I know there are some headers, like List-Xxx, that can indicate mail is from a list and give a pretty strong hint that an auto-reply would be pointless, but that’s not quite the same thing. Not all auto-generated mail is generated by a list manager or destined for a list, for one thing. For another, I’m sure many people would like to have their mail program insert such a header into every email they send even though they’d be perfectly capable of receiving a reply. Surely I’m not the only person to have noticed this easy-to-eliminate waste, but there doesn’t seem to be any well publicized solution. Can anyone point me to one that might actually be implemented somewhere?
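
Here is a small sketch of how a vacation responder might honor such a header. The X-Allow-Auto-Reply header is only the hypothetical one proposed above, not an existing standard, and the function name is made up; the List-* check reflects the weaker hint already mentioned.

```python
from email import message_from_string

def should_auto_reply(msg):
    """Decide whether a vacation responder should answer `msg`."""
    # Hypothetical opt-out header from this post; absent means replies are fine.
    allow = msg.get("X-Allow-Auto-Reply", "yes").strip().lower()
    if allow == "no":
        return False
    # Existing hint: any List-* header suggests mailing-list traffic.
    if any(name.lower().startswith("list-") for name in msg.keys()):
        return False
    return True

bug_mail = message_from_string(
    "From: tracker@example.com\n"
    "X-Allow-Auto-Reply: no\n"
    "Subject: Bug 1234 changed\n"
    "\n"
    "Status changed to FIXED.\n"
)
print(should_auto_reply(bug_mail))
```

The check is cheap enough to run on every incoming message before the responder composes anything.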

One of the problems that seems to occur again and again when “computing at scale” is some sort of server getting overwhelmed by requests from relatively much more numerous clients. Sooner or later, every kind of server has to deal with running out of buffers or request structures or (if the designers were fools) threads/processes, or just about any other kind of resource. Having the entire system grind to a halt because one server couldn’t handle overload gracefully and instead died a messy death is both unpleasant and unnecessary. It’s unnecessary because there are solutions that work. Unfortunately, there are other solutions that don’t work, but get tried anyway because they seem easier.

One such non-solution is to have lots and lots of servers, and spread the load between them as evenly as possible. Most people who have actually tried this have eventually realized that spreading the load that well is very hard if not impossible. Sooner or later, an access pattern appears that causes one server to get overloaded. Then it fails, increasing load on its peers (not just the shifted operational load but now the recovery load as well) and quite likely causing them to fail as well, and so on. If this sounds a bit like the northeast US power blackout in 2003, it should. Even if your load-balancing is really good and you’re committed to running your servers at 10% of capacity, a physical or configuration error could leave you with this sort of imbalance/failure cascade. The solution is to handle the condition, not avoid it, and that means some form of flow control. In other words, you have to make requests queue at the (more numerous) clients instead of within the servers.

Flow control can be implemented in many ways. It can be implemented at a low level to maximize generality and code reuse or at a high level to maximize efficiency and applicability to all kinds of resources. It can be implemented via credits that clients must hold or obtain before sending a request, or via “slow down” messages that are sent from servers to clients only when needed. The preachers of statelessness would say that the latter approach is less stateful and therefore preferable, but I think they’re mostly deluding themselves. For one thing, a lot of “stateless” servers have really just moved their session-layer state somewhere else (e.g. a database maintained by the application layer) instead of truly eliminating it. For another, the state after receiving a “slow down” message is still state that must be maintained. If clients can simply ignore such a message, or “forget” that they got it, you’ve achieved nothing whatsoever. If they’re bound to respect it, and especially if the server attempts to enforce it, then you’re just as stateful as you would be with full credit accounting but limited to only zero or one credit.

So, if you’re going to use a credit approach, how should credit be allocated? Again, many will be tempted to use non-scalable approaches. Often the easiest thing to do is to allocate a worst-case set of resources (and associated credit) to every client. That can work OK with few clients, but rapidly leads to unacceptable levels of resources being allocated but idle when the node counts stretch into the hundreds or thousands. The opposite end of the spectrum is to require that all resources and associated credit be explicitly obtained, but this can lead to unacceptable first-request latency and performance anomalies as each batch of credit is consumed. In my experience, a hybrid approach works better: preallocate just enough resources and credit for each client to keep it happy while it explicitly requests and obtains more from a common pool. The common pool can then be large enough to satisfy the maximum worst-case load for the entire system, which is often much less than the sum of the worst-case numbers for each node, allowing support for many more clients at the same resource level. Also, the allocation requests and replies can often be piggy-backed on other messages so they carry little additional cost.

The one remaining problem is how credit gets returned to the common pool when a client no longer needs it. This can be driven either by the client (when it recognizes that it no longer needs the resources) or by the server (when it needs to replenish the common pool). Since it’s generally hard to tell when a client doesn’t need credit any more, the client-driven approach usually involves giving up credit after a timeout. The server-driven approach, on the other hand, requires implementation of a credit-revocation exchange parallel to the credit-granting one. It’s even possible to combine approaches, and in fact I usually do so that a client might give up credit either on its own initiative or in response to a server message while reusing most of the same code for both cases.

With this kind of scheme – small amounts of per-client credit plus explicit requests and revocations of any credit over that amount, with credit-level changes potentially driven by either side – it’s possible to avoid server overload without either starving clients or wasting server resources. It’s not really as complicated as it might sound, and can be implemented with a negligible impact on common-case performance.
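
To make the accounting concrete, here is a minimal sketch of the server side of such a scheme. It is only an illustration under simplifying assumptions: the class and method names are mine, everything is synchronous and in one process, and a real implementation would carry these calls as (possibly piggy-backed) messages and deal with credit tied up in in-flight requests.

```python
class CreditServer:
    """Hybrid credit accounting: a small per-client base plus a shared pool."""

    def __init__(self, base_credit, pool_size):
        self.base_credit = base_credit  # preallocated to every client
        self.pool = pool_size           # shared credit not currently granted
        self.extra = {}                 # client -> credit held beyond the base

    def register(self, client):
        """A new client starts with just the preallocated base credit."""
        self.extra[client] = 0
        return self.base_credit

    def request(self, client, wanted):
        """Grant up to `wanted` extra credits from the common pool."""
        granted = min(wanted, self.pool)
        self.pool -= granted
        self.extra[client] += granted
        return granted

    def release(self, client, amount):
        """Return extra credit to the pool (client timeout or server revocation)."""
        returned = min(amount, self.extra[client])
        self.extra[client] -= returned
        self.pool += returned
        return returned

    def revoke(self, needed):
        """Server-driven path: reclaim extra credit until the pool holds `needed`."""
        for client in sorted(self.extra, key=self.extra.get, reverse=True):
            if self.pool >= needed:
                break
            self.release(client, needed - self.pool)
        return self.pool
```

Note that the client-driven and server-driven paths both end up in `release`, which is the code reuse mentioned above. A client that holds no credit simply keeps its requests queued locally until `request` grants some.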

Don’t blame me for the comparison. It’s actually Walter Pinson’s.

It was once said back in the early ‘90s that “Client/server computing is a little like teenage sex – everyone talks about it, few actually do it, and even fewer do it right. Nevertheless, many people believe client/server computing is the next major step in the evolution of corporate information systems.”

Can the same be said about cloud computing, today?

I contend that cloud computing is like teenage sex in another way: teenagers act like they invented sex, annoying their elders who thought that they invented it back when they themselves were teenagers. As Pinson’s reference to client/server computing makes clear, there’s a lot about cloud computing that’s not new. There are even aspects that go back even further. When people talk about how to bill for cloud computing, or how to insulate users from one another, it all starts to sound a lot like the old time-sharing days. It’s time-sharing on a new kind of system, but it’s time-sharing nonetheless.

There are people creating new technology in the cloud computing space, to be sure. (This is where the teenage-sex analogy breaks down.) I used to be one of them, and might be again in the not-too-distant future. There are far more people merely reinventing old technology in the cloud computing space. If anyone really wants to understand cloud technology and how it might best be deployed to create value, I think it’s important to understand which parts are actually new and how they’re new vs. what parts have already been done or tried.

P.S. While we’re talking about cloud analogies, Bruce Sterling had another good one.

Okay, “webs” are not “platforms.” I know you’re used to that idea after five years, but consider taking the word “web” out, and using the newer sexy term, “cloud.” “The cloud as platform.” That is insanely great. Right? You can’t build a “platform” on a “cloud!” That is a wildly mixed metaphor! A cloud is insubstantial, while a platform is a solid foundation! The platform falls through the cloud and is smashed to earth like a plummeting stock price!

There’s a lot of other randomness in there too, but the fiction author’s comparison of cloud-computing fiction to financial-market fiction is worth thinking through.

At my last job, I had to work with InfiniBand. Believe me, this did not lead to an enduring love of IB. Before InfiniBand, I had worked with Fibre Channel and seen how overburdened it was with every vendor’s favorite feature or format or protocol variation, often with some little bit hidden somewhere to tell you which of several possible (and mutually incompatible) behaviors you were expected to exhibit in response. Compared to IB, FC is a model of streamlined simplicity. How’s that for scary? Nonetheless, now that all those thousands of person-hours have been poured into it, IB does actually manage to deliver somewhat on its original promise of high bandwidth and low latency at low cost.

So along comes 10-gigabit Ethernet (10GbE), which is so many levels removed even from the thing that people called Ethernet after original Ethernet had been dead and buried that nothing remains but the brand name. It seems that some folks are sure it’s going to displace IB as a cluster interconnect Any Day Now. Hitching themselves to that belief, they’ve started flinging FUD about IB’s “misleading” bandwidth numbers. Here’s one of the more egregious examples.

We will tear this black cable bandit down to size one claim at a time. First they assert that it’s 20Gbps, how about 12Gbps on it’s best day with all the electrons flowing in the same direction. Infiniband employs what is know as 8b/10b encoding to put the bits on the wire. For every 10 signal bits there are 8 useful data bits. Ethernet uses the same method, the difference is that Ethernet for the past 30 years has advertised the actual data rate while Infiniband promotes the 25% larger and useless signal rate. Using Infiniband math Ethernet would then be 12.5Gbps instead of the 10Gbps it actually is. So using Ethernet math Infiniband’s Double Data Rate (DDR) is actually only 16Gbps and not the 20Gbps they claim.

Apparently, according to “10GbE math” 16Gb/s is less than 10Gb/s. Spare me. DDR IB is at approximate price parity with 10GbE, and still 60% faster than 10GbE – with QDR products already available. How does that make 10GbE the superior choice, again? Wait, you say. Those are only nominal bandwidths, right? True enough, and just as true for 10GbE as for IB. It would be a little disingenuous to point out that IB doesn’t really achieve 16Gb/s except “on it’s best day with all the electrons flowing in the same direction” without also pointing out that 10GbE is subject to the same effects (and the vast majority of cards according to 10GbE.net’s own price lists aren’t even physically capable of more than 13Gb/s across two ports).
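
For the record, the arithmetic is not hard. This little sketch just applies the 8b/10b ratio from the quoted passage to the advertised signal rates; the function name is mine, and these are nominal figures only.

```python
def data_rate(signal_rate_gbps, data_bits=8, line_bits=10):
    """Usable data rate for a link that sends `line_bits` on the wire per `data_bits`."""
    return signal_rate_gbps * data_bits / line_bits

ddr_ib = data_rate(20.0)  # DDR InfiniBand is advertised by signal rate
ten_gbe = 10.0            # 10GbE is advertised by data rate, per the quoted post
advantage = ddr_ib / ten_gbe - 1.0

print(f"DDR IB data rate: {ddr_ib:g} Gb/s")
print(f"Advantage over 10GbE: {advantage:.0%}")
```

Even after conceding the encoding overhead in full, the nominal comparison is 16 against 10.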

The writing style on 10GbE.net is strikingly similar to that of a certain Cisco employee. Instead of launching all this FUD from behind a screen of anonymity, would it be too much to ask that the author be a little more honest about his associations? When he can show repeatable, verifiable results indicating that DDR IB doesn’t still trounce 10GbE at the same price point, then we can have a real discussion about cluster interconnects.

The hot topic in the blogosphere right now is about BitTorrent Inc. switching to a UDP-based protocol for their popular uTorrent client. Most of the links on this subject ultimately lead back to Bittorrent declares war on VoIP, gamers at El Reg. Having talked to Richard Bennett before, I’m well aware of his penchant for saying outrageous things to get attention, but most of his critics are behaving even worse. For example, here’s Janko Roettgers.

Bennet’s piece is based on a belief that UDP traffic is “aggressive” and uncontrollable, whereas TCP is the nice and proper protocol that can be easily managed. This notion ignores the basic fact that P2P developers, in order to make the protocol work at all, need to implement TCP-like functionalities on top of UDP, one of which includes congestion control. You simply can’t operate a P2P client that eats up all of its users’ bandwidth, much less build a successful business model on top of it.

That’s an unfortunately, not to say selfishly, client-centric perspective. For one thing, eating up all of the user’s own bandwidth is not the issue; crowding out other users’ traffic is. For another, Roettgers completely ignores what happens to packets in between two PCs on the internet. The fact is that all those routers have mechanisms in place to do congestion control, and – while there are proposals for TCP-friendly rate control out there – many people have traditionally turned to UDP for the specific purpose of bypassing congestion control. As George Ou pointed out a while ago, P2P applications tend to “game the system” in effect and probably by intent. I know from my own conversations with Bram Cohen that he has never liked TCP, so it doesn’t really seem all that unlikely that the UDP switch is very intentionally meant to avoid congestion control – despite the inevitable “collateral damage” to non-BitTorrent users, and therefore quite antisocially. Bennett offers one further reason to believe this.

Upset about Bell Canada’s system for allocating bandwidth fairly among internet users, the developers of the uTorrent P2P application have decided to make the UDP protocol the default transport protocol for file transfers.

I don’t know where Bennett got the Bell Canada angle, and considering the source I’d be interested in any pointers to relevant statements by Cohen or anyone else at BitTorrent, but it certainly seems plausible. What I’d be most interested in, though, is any credible alternative explanation. “It will make BitTorrent downloads faster without affecting anyone else” is just another way of saying there is such a thing as a free lunch. The claims about BitTorrent’s uTP doing congestion control better than TCP are suspect too, since details on uTP don’t seem to be available for us to review and the one detail I’ve seen mentioned – using latency to detect congestion – doesn’t exactly represent the current state of knowledge among people who really understand congestion control. Can anyone explain, without hand-waves and exaggeration, why uTorrent is switching to UDP other than to consume a greater share of bandwidth than existing mechanisms to preserve robustness and fairness in the internet would allow?

No, this isn’t about open source, though frequent readers could be excused for thinking it would be. Denver International Airport’s free wi-fi is the biggest piece of junk ever. First they make you sit through an ad before you can use it. Then they wrap everything with their own stupid frame, in the process breaking things like Google Reader even if you Ad-Block it. Then they do allow ssh, but it (and I’ll guess anything encrypted so they can’t see inside) is so slow that it’s practically unusable. Thanks a lot, jerks. I already have a subscription to Boingo, and if you hadn’t bought a monopoly I’d be using that productively. It’s too bad you don’t have anyone competent or ethical working for you.

Oct 16: Hype vs. FUD

One of the computing trends I’ve been watching for a while is so-called “cloud computing” which is really just the latest name for a kind of distributed computing that has been around in nearly identical form for just under a decade and in not-too-different forms for at least twice that long. Heck, I was working on one of the keystones of cloud computing – globally distributed storage – back in 2000, so when I react to the current hype wave with terms like “clown computing” it’s hardly because I’m afraid of something new.

In Cloud Computing is Scary – But the FUD Has to Stop, Dan Morrill complains about the FUD supposedly being directed at cloud computing. Well, Dan, with any new (or even supposedly new) technology there will be hype-mongers at one end and FUD-slingers at the other, with most of the computing community in between. Those with a vested interest in promoting a technology tend to tar everyone less enthusiastic than themselves with the “FUD” brush, just as those with a vested interest in suppressing it tend to tar everyone more enthusiastic with the “hype” brush. In this case the fact is that there are a lot of people who can’t even define cloud computing making all sorts of grandiose and often blatantly false claims. Saying that cloud computing is inherently flawed, that it can never work, would be FUD; pointing out that the claims being made by or about specific vendors and products remain unproven, or that there are still problems to be addressed, is just healthy skepticism. By portraying that skepticism as FUD, you only put yourself further on the “hype” end of the spectrum.

It is long past time to continue with the same old tired refrain of “no” and move on to where business is going.

It is time to start embracing where business is going, and trying to make sure that they are doing it in the safest way possible.

Where business is going, eh? Got any evidence of that? No, of course not. That’s where you would like business to go, but that’s not at all the same thing. Such grandiose “this is the wave of the future so don’t miss it” rhetoric does very little to allay people’s legitimate concern that it’s really a wave of hype. You’d be better off presenting cloud computing as a still-to-be-embraced opportunity, not a fait accompli in the business world.

Business has taken to virtualization in a big way, which I think is misguided for a whole different set of reasons (I believe it’s better to build and deploy smaller servers which can be combined into larger complexes instead of larger ones which then have to be sliced up). There is some correspondence and synergy between virtualization and cloud computing, but I can’t recall any cloud computing proponents articulating that connection as a coherent and usable business strategy. Riding on virtualization’s coat-tails isn’t enough. Some day very soon, somebody in the cloud computing camp needs to do a better job of explaining their Grand Concept’s very own value proposition separate from virtualization.

There are very few information security experts in cloud computing.

What security professionals need to be doing rather than creating their own FUD is work out ways to make it safer.

There might be very few information-security experts in the inbred cabal of people who push the “cloud computing” brand in blogs and such, but there are plenty of people who have been working at the intersection of security and distributed computing for years. Do you think the people behind Amazon, or Allmydata.org Tahoe, or Mozy (across the street from the folks at RSA), or Iron Mountain, don’t have a few clues about this stuff? I know many of them, have worked with some, and I know you’d be dead wrong. Securing data across the net is a well-studied problem. So is securing computation across the net, though that’s not my own specialty. It doesn’t mean all the answers are known, but it’s just not true that such expertise is rare or rarely applied.

it is time for information security folks to step up to the plate and get smart on how the technology works.

The best bet right now for the security engineer is to work through the process, and get smart now so that management can benefit from what you have learned.

No, maybe it’s time for cloud computing folks to get smart on how security technology works. Don’t try to push the burden of fixing your problems onto another community, and especially don’t try to hint that they’re “not smart” as you do it. That’s no way to get the help you need. If you cloud computing folks are such great innovators, take some responsibility for learning what’s already out there and using it to innovate your own solutions. When you act as though you invented the greatest thing ever and everyone else needs to catch up, you come across just like teenagers who act like they invented music or sex and that’s just really annoying. Customers don’t like to deal with annoying vendors.

There’s nothing wrong with consciousness-raising but, especially in this economic environment, people are suspicious of evangelists whose promises are incommensurate with their ability to demonstrate real working product with real business value. If you don’t want to sour everyone on the whole idea of cloud computing or anything like it for the next ten years, dial down the marketing and dial up the technical progress.

UPDATE: The Onion says Recession-Plagued Nation Demands New Bubble To Invest In.

A panel of top business leaders testified before Congress about the worsening recession Monday, demanding the government provide Americans with a new irresponsible and largely illusory economic bubble in which to invest.

“Perhaps the new bubble could have something to do with watching movies on cell phones,” said investment banker Greg Carlisle of the New York firm Carlisle, Shaloe & Graves. “Or, say, medicine, or shipping. Or clouds.

Cloud computing is where you can’t get your work done because a service you never heard of has failed somewhere half-way around the world. Yes, I know it’s a variant on an old joke, which was about a server instead of a service and it was down the hall instead of around the world, but there is definitely an “everything old is new again” aspect to a lot of the cloud-computing hype so we might as well recycle the jokes too.

P.S. If anyone can provide an authoritative source for the original, let me know and I’ll be glad to cite it. I know it’s not mine and I am in no way trying to take credit for it, but my Google-fu seems to be weak this morning.

P.S. Thanks to Paddy for identifying Leslie Lamport as the source of the original remark.

I actually started composing this in my head before Matt Reilly’s post about Green Computing, but it kind of fits in with some of the points he makes about communication speed as a factor in overall performance. One of the properties of a Kautz graph, such as we at SiCortex use in our systems, is that

For a fixed degree M and number of vertices V = (M + 1)M^N, the Kautz graph has the smallest diameter of any possible directed graph with V vertices and degree M. (from the Wikipedia article)

UPDATE 2008-11-10: I happened to read this while looking for something else, and realized that there are better torus routing methods than those I had considered. I think I’ve invented an even better one than the current general practice, but for now I’ve updated the text below to reflect the general-practice numbers (which don’t affect my argument at all).

Yeah, lovely, what does it mean? What it means is that if you want to make a system of a certain size, let’s say approximately 1000 nodes, and you’re constrained as to how many links per node you can have, you can achieve the smallest network diameter by arranging your nodes in a Kautz graph. For example, our 972-node system using three outbound links per node (i.e. degree=3) has diameter=6. That’s the maximum number of hops to get from a node X to any other node Y; the average is approximately five. By contrast, for a 10×10×10 hypertorus also using three outbound links per node, the diameter would be 18 and the average hop count would be almost 10. I’m not actually familiar with the proof for the statement quoted above and wouldn’t understand it even if it were shown to me, but it seems very believable based on my experience. Every time I try to think about tweaking some other topology to reduce the average hop count or bring the links per node down to reasonable levels (you can’t feasibly build a system that requires dozens of links on every node) the result starts to look more and more like a Kautz graph in terms of routing, partitioning and tiling characteristics, etc.
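
For readers who would rather check the numbers than take them on faith, here is a brute-force sketch. It uses the textbook string construction of a Kautz graph (vertices are strings of length N + 1 over M + 1 symbols with no two adjacent symbols equal, and each edge shifts in one new symbol), which is an assumption about the abstract graph only and says nothing about how any real machine is wired or routed. The function names are mine.

```python
from collections import deque
from itertools import product

def kautz_vertices(m, n):
    """Strings of length n + 1 over m + 1 symbols, no two adjacent symbols equal."""
    return [s for s in product(range(m + 1), repeat=n + 1)
            if all(a != b for a, b in zip(s, s[1:]))]

def successors(v, m):
    """Shift left and append any symbol that differs from the current last one."""
    return [v[1:] + (x,) for x in range(m + 1) if x != v[-1]]

def hop_stats(m, n):
    """Return (vertex count, diameter, average hops between distinct vertices)."""
    vertices = kautz_vertices(m, n)
    worst, total = 0, 0
    for start in vertices:
        # Plain breadth-first search from every vertex in turn.
        dist = {start: 0}
        todo = deque([start])
        while todo:
            v = todo.popleft()
            for w in successors(v, m):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    todo.append(w)
        worst = max(worst, max(dist.values()))
        total += sum(dist.values())
    count = len(vertices)
    return count, worst, total / (count * (count - 1))

print(hop_stats(3, 5))  # degree 3, N = 5: the 972-node case
```

The all-pairs search takes a few seconds at this size, which is fine for a sanity check but obviously not how you would analyze a large topology.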

So far, so good, but how does that translate into anything useful? Well, if communication speed is a factor in overall performance, and communication speed falls through the floor when the system’s communication capacity is exceeded, then that capacity matters a lot. The common way to compare the communication capacity between different systems is to compare bisection bandwidth – how much capacity must be removed to divide the machine into two equal parts? It turns out that measuring the bisection bandwidth of a Kautz graph is not straightforward, and even the Kautz cognoscenti at work seem to disagree about the result. Personally, I think the whole bisection-bandwidth approach is bass-ackwards. Why measure the negative (minimum capacity you could remove) instead of the positive (maximum capacity you could use)? I think it happened because a naive attempt to measure maximum usable capacity would allow a system to be partitioned into small islands communicating amongst themselves, yielding inflated figures that don’t match what a real application running across the entire system could get. That’s fixable in a measure-the-positive approach, though. All you need to do is say that you’re going to measure the maximum bandwidth when each node in one half of the system sends equal amounts of traffic to each node in the other. That way, if you want to combine the figures for two islands (often corresponding to two switches or cabinets) then you have to pass half of your traffic between them, and what you get is a pretty fair approximation of what an actual full-scale application could get.

This is where network diameter and particularly average hop counts start to matter. The maximum usable communication capacity of a system is the aggregate link speed divided by the average hop count. Therefore, even if you can distribute your traffic perfectly across all links in the system for the test I propose above, your performance ceiling will be determined by the average hop count in your topology. For example, in our 972-node Kautz graph that’s 2916 links’ worth divided by an average hop count of five, or 583.2 links’ worth total. For that 1000-node hypertorus, it’s 3000 links’ worth divided by an average hop count of 10, or only 300 links’ worth for a similarly sized system. Looked at another way, your links would have to be almost twice as fast just to keep up. If you made those links bidirectional you’d double the number of (half-duplex) links and reduce your average hop count to around 7.5, so you’d be getting 800 links’ worth of total capacity, but then you’d be talking about a system with vastly greater hardware complexity and cost than the others. If you wanted to compare apples to apples, a thousand-node Kautz graph with degree=6 would have the same 6000 links with an average hop count around 3.5 so the hypertorus would still be at a serious disadvantage. Then you have to consider that we’re talking about ceilings here, and that even in the abstract different topologies will allow for different levels of “perfection” as regards distributing traffic among all the links in the system. Maybe I’ll explore some of those issues at some time in the future. For now the key point is that, much as a more efficient algorithm will always eventually win out over a faster implementation, a more efficient topology will always eventually win out over raw bandwidth. In my opinion, bisection bandwidth as usually measured doesn’t tell that story well enough.
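
The ceiling calculation is simple enough to sketch in a few lines. The function name is mine, the node counts, link counts and hop counts are the updated figures from the paragraph above, and the result is expressed in units of one link’s bandwidth.

```python
def capacity_ceiling(nodes, links_per_node, avg_hops):
    """Upper bound on usable capacity, in units of one link's bandwidth."""
    return nodes * links_per_node / avg_hops

kautz = capacity_ceiling(972, 3, 5)           # 2916 links, about five hops
torus = capacity_ceiling(1000, 3, 10)         # 3000 links, about ten hops
bidir_torus = capacity_ceiling(1000, 6, 7.5)  # 6000 half-duplex links

print(f"Kautz: {kautz:.1f}, torus: {torus:.1f}, bidirectional torus: {bidir_torus:.1f}")
print(f"Torus links must be {kautz / torus:.2f}x as fast to match the Kautz graph")
```

Plugging in other topologies only requires their link count and average hop count, which is exactly why the average hop count is the number to watch.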