A couple of days ago, I wrote an article about congestion control on the internet. In the comments, I promised a more technical exploration of the subject, and this is it.
Congestion is a problem on the internet. It will always be a problem on the internet, just as it is on our roadways. Yes, bandwidth keeps increasing. So does the number of users, and the number of network applications per user, and the required bandwidth per application. Supply will never stay ahead of demand for long, if at all. Sooner or later, it will become logistically or economically infeasible for service providers to add more bandwidth all throughout their networks. Sooner or later, enough people will be sending enough traffic over the network’s weakest links that it won’t matter if other parts of the network are overbuilt. The “more bandwidth” crowd will lose to those who recognized the need for congestion control and did something about it.
When thinking about how to handle congestion, it’s possible to look at things from many different perspectives. Here are some examples:
- Different kinds/definitions of fairness.
- The core of the network vs. the edge.
- Distinguishing flows vs. hosts vs. “economic entities” (i.e. users).
To keep this from turning into a major dissertation, I’m going to say little about these.
- On fairness, I think Briscoe has really said all that needs to be said about flow-rate fairness vs. cost fairness. Note that “flow-rate fairness” does not necessarily refer to flows in the sense that they’re often discussed in the congestion-control literature. It refers to the rate at which things (packets) flow, whether they’re associated with connections or hosts or anything else. It’s an unfortunate re-use of terminology, but that’s life. I think cost fairness is the proper goal/standard for congestion control. If you disagree, keep it to yourself. I might discuss things in terms of cost fairness, but for the most part the concerns I’ll address are directly translatable to a flow-rate-fairness world and I don’t want to get bogged down trying to make the flow-rate zealots happy.
- As for the core vs. the edge, I bring it up because they’re very different environments calling for very different kinds of congestion management. While routers at the edge can reasonably hope to make flow/host/user distinctions (modulo the mess that NAT introduces), routers at the core have no such hope. They must also process packets at a much higher rate, so the most they can do is make hopefully-useful guesses about what packets to drop when they’re overloaded. The dumbest approach is simply to drop the last packets to arrive. The various flavors of RED (Random Early Detect) are slightly better, and I believe Stochastic Fair Blue (which I’ve written about before) is even better, but they’re all still basically guesses. While congestion control in the core is a fascinating subject, it’s not what I’ll be writing about here.
- On flows vs. hosts vs. users, most of the people weighing in on these issues have tended to focus on users and I’ll follow suit. NAT makes a hopeless muddle of the whole thing anyway, so the only thing you can really be sure of is one entity paying the bill for one physical line. There’s a lot of merit to the argument that if there are multiple uses huddled behind NAT then that’s their problem anyway.
OK, enough stage-setting. On with the show.
Detection and Notification
Congestion control actually involves two separable problems: identifying congestion and then responding to it. Resolving congestion ultimately means getting senders to send less, so we’ll look at it from the perspective of the sender. How does a sender know that its packets have been dropped due to congestion? The worst way is to infer that something was dropped because an expected reply did not come back within an expected interval. The reply might not come back for reasons unrelated to congestion, of course, and the interval might have been miscalculated, but nonetheless this method has dominated the internet until quite recently. The slightly less bad way to detect congestion is to piggyback explicit congestion notification (ECN) onto replies. This is at least quicker than waiting for timeouts, and avoids some of the problems I just mentioned, but it’s still not exactly foolproof. For example, the information provided by ECN is interpreted by the endpoint, which means a user’s computer. Unfortunately, the response to ECN is only defined for TCP (not UDP) and only takes effect at the connection (rather than host) level. As Briscoe points out, applications that use many connections can and do often respond by simply shifting their work to other connections that may reintroduce congestion either at the same place as before or elsewhere. That’s not helpful.
The fact that ECN only operates at the connection level brings us to another important question: how should congestion discovered on one connection affect others? On the one hand, if other connections go to the same place (or nearby) they will be subject to the same congestion and information should be “shared” between them. On the other hand, it would be a mistake to throttle the entire user/application in response to ECN on one connection, because other connections may go to completely different places through completely different routes. Basically, the more accurately you can pinpoint the location of the congestion, the better you can determine which of your connections is affected. At one extreme you could attribute congestion only to one host; at the other you could indict “the internet” in general. An interesting in-between possibility that might be worth exploring would be to indict the AS (“autonomous system” = the entity of interest in BGP routing) or the route one used to an AS, because those are readily discoverable and likely to correlate with the extent of the congestion.
The main problem with ECN, though, is that the network shouldn’t rely on users’ machines to manage congestion. Even if those machines don’t simply throw away ECN information or try to “game the system” in other ways (which many provably do), even if they try to be well-behaved and implement tolerable solutions to the problems mentioned above, they’re just too many and too uncoordinated, especially since they’re fundamentally in competition with one another. The network entities best situated to bear most of the congestion-control load are the “edge routers” – the ones owned by ISPs that interface between users on one side and the ISP’s own network (thence the rest of the internet) on the other. They should also be looking for congestion indicators, responding to congestion on their own if the endpoints don’t seem to be doing so properly. As a first step, they should also be paying attention to ECN flags. To handle the UDP and other cases where ECN as it currently exists doesn’t work, they should also be participating in other mechanisms – invisibly to users, requiring no effort on their part – to feed back congestion information. Imagine, for example, that a certain destination ISP/AS is experiencing congestion and can identify a host within a particular source ISP/AS as a significant contributor. It would be extremely useful if information could get back to the host’s corresponding edge router which could then ensure that appropriate action is taken.
Response
What does “appropriate action” mean? Before I answer that, I’d like to give an example of what it doesn’t mean. Imagine you’re an edge router connected to two subscribers A and B on one side, and to two remote locations X and Y on the other. Subscriber A is busy sending to X, whereas B is mostly idle but occasionally sends a burst to Y. Now you start finding out that there’s severe congestion near X, and almost all of A’s packets are being dropped. How do you respond? According to “fair queuing” it would be by accepting A’s and B’s packets at the same rate, cutting B’s burst bandwidth even though A’s packets are almost certainly a waste of everyone’s time (or worse, since they continue contributing to the congestion). That hardly seems fair to me. Instead of trying to equalize the packet-acceptance rate, how about trying to equalize the actual-delivery rate? As long as A keeps sending you packets that are likely to be dropped anyway, why not drop them early (saving everyone else the effort of passing them on before they get dumped) and use the savings to let B send at his full burst rate? If A actually responds to the congestion by throttling back his sends to the congested area (and perhaps sending more elsewhere instead) then the congestion indicators will stop coming in and we’ll be back to status quo ante.
Note that none of this ever matters unless you’re dropping packets due to local capacity constraints anyway. It’s a way of apportioning the overload effects, not of reducing apparent capacity while actual capacity remains. To be more precise, let’s continue the you-as-edge-router example from above. Let’s say that during B’s bursts you’re getting 1000 packets per second evenly split between the two, and you’re dropping 20% due to simple lack of buffer space. At the same time, you know that 80% of A’s packets are being dropped further on due to congestion, and none of B’s are. “Fair queuing” would require you to ignore that last fact, allocating equal-length queues to A and B. You’ll effectively be dropping 20% of A’s packets and 20% of B’s packets, which is pretty much exactly what would have happened with a single queue so you’ve achieved exactly nothing. B’s actual packet-delivery rate is still 400/second when it could have been 500, just so you can pass more of A’s traffic on to where it’s being thrown away. This kind of fairness can work when the congestion is local and link flow control prevents packets from getting off queues at all, but in the real world of the internet the drops might be happening very far away and it’s pretty useless in that case. It addresses the wrong problem.
For contrast, let’s see what happens when we add some cost fairness. We know that we need to drop 20% of the packets sent to us overall, and we’ll apportion those drops according to the downstream drop rates. In this example, 100% of our downstream drops are for A, so they’ll get 100% of our local drops. In other words, we’ll drop 20% of their packets (either by manipulating queue lengths or by other means) and none of B’s. B is now getting his full burst rate delivered, an improvement of 100pps, in return for which A is losing 20pps. It’s a net win for our own users collectively, plus the reduction in A’s packets to X might help to alleviate the congestion around X so it’s a win for the broader internet collective as well. We can actually do even better by sorting packets more carefully (e.g. according to source/destination pair instead of just source) and allocating drop rates in a more sophisticated way, but we’re already ahead of the game.
Conclusion
So, what have we learned? First, I hope I’ve shown that getting congestion information from where it is known to where it needs to be known is an important part of the problem, and one that is not very well solved by current approaches. Some additional mechanisms, probably involving “out of band” communication, could help. Second, even if the information problem is addressed, “fair queuing” as usually proposed won’t do anything very useful with that information. I’ve proposed one scheme, or at least the very germ of one, that I think will do better. There are probably other good schemes as well, some similar and some quite different but none of them based on faux fairness. For the sake of users and applications that are doing their best to cope with congestion instead of making it worse by circumventing or perverting the congestion-control mechanisms already in place, we should start exploring and moving toward such schemes.
In the final example above, we’re saying that when B is bursting, both A and B are transmitting 500pps, that 80% of A’s are being dropped downstream, and that 200 of these packets must be dropped each second. I can see the communal gain in targeting A’s packets for all 200 drops.
How would you propose the system would work if A was sending 250pps with 80% loss and B was sending 750 with 0% loss? Would the router still assign all 200 dropped packets to A?
Under the 20% each system, A would have 50 packets dropped and B would have 150 dropped. Under the downstream management, assigning all 200 to A, A now experiences a loss of 80% of his packets (250 * .2 = 50, 50 * .2 = 10). Yes, we’re only trading 40pps from A to gain B 150pps, so the community still gains, but I don’t feel this is the most fair way to do it.
When we have users broadcasting disproportionately, at what point do we begin restricting the higher flow, regardless of downstream drop rate?
Well, if A is getting 80% drop rates at 250pps then AIMD (on which congestion control in TCP is based) says he shouldn’t be sending at 250pps for long. Is AIMD unfair? Assuming a single destination, any sustained send rate over 50pps indicates that A is not responding properly to congestion and should suffer worse consequences than if he had responded properly (to ensure that there’s an incentive for Doing The Right Thing on one’s own).
If A’s remote drop rate decreases as a result, then so will his share of the local drops. He can then use the AIMD rules to ratchet up the send rate until equilibrium (send rate approximately equal to successful delivery rate) is reached, as should have happened in the first place. A got an extreme reaction because he exhibited extreme behavior. Sending a high-rate stream of packets into a black hole is antisocial and counterproductive. It imposes costs on the network, and those costs should be fed back to the one that incurred them. How is that not fair?
In the real internet, an edge router is likely to serve far more than two users and it’s highly unlikely that any one of those users would account for 100% of the downstream drops. Nonetheless, and despite what I just said in the previous paragraph, it’s often the case that resource-management protocols work better when they don’t result in sudden or extreme changes. Damping and hysteresis are commonly added to address this, as AIMD itself illustrates. In this case, for example, it would be trivial to apportion only some of the drops according to the cost-fairness rule and the rest some other way (including so-called fair queuing), or to phase in cost-apportionment changes over time.
When it is useful and fair to do so, and not just because it’s higher. B has rights too, among them the right to use more than a 1/N share of the local resource if network conditions allow it. Why should his peak rate be limited any more than A’s sustained rate? The answers all come back to definitions of fairness, and different kinds of fairness are often in conflict with one another. What’s fair by one criterion might seem very unfair by another. Sometimes it’s almost like those optical illusions where your brain switches between two complementary images. I know flow-rate fairness is the “gut feel” favorite, but I basically don’t care. Based on my own attempts to reconcile various abstract principles (e.g. equality vs. accountability), and to analyze the network-wide consequences of adopting/enforcing each definition, I’ve come to believe that “gut feel” is wrong this time.
“…getting congestion information from where it is known to where it needs to be known is an important part of the problem”
Exactly.
What we did in re-ECN was to force the sender to have to declare /expected/ congestion in packets. Networks nearer the receiver make honesty the best strategy by dropping from flows, if actual path congestion (ECN markings) are persistently greater than the declared expected congestion (which uses the last available bit in the IPv4 header) or a v6 extension.
Then you get congestion information visible as it crosses the trust boundary into the internetwork, just where it’s needed to do policing etc.
The whole thing is described in the comprehensive I-D linked off our ‘re-feedback’ project page:
URL of re-feedback project page got trimmed from last posting, ‘cos I enclosed it in angle brackets. Retrying…:
http://www.cs.ucl.ac.uk/staff/B.Briscoe/projects/refb/#retcp