It looks like the swap vs. no-swap discussion won’t die, with the LKML/KernelTrap threads remaining busy and a new one appearing on Slashdot, so I might as well add a little more than my previous two cents here.

If the total working set for all of your applications fits in RAM, only an extremely poorly designed system would swap anyway, so swapping’s really not an issue. Some people would say that you simply shouldn’t run a system where that’s not the case, and in the majority of cases I think they’re right. However, there are still going to be cases where your total working set exceeds your current RAM and it’s not feasible to increase the RAM (e.g. because your OS or motherboard won’t support it; price is not likely to be as much of an issue nowadays). Then you need to decide what to do. If you still believe that you should simply never allow the working set to exceed RAM, even at the cost of enforcing artificial restrictions on memory usage, then there’s just not much more to say, and people who hold that belief should IMO just depart the conversation. The real conversation starts with the assumption that we are going to run with a working set larger than RAM, and focuses on how best to do that.

If your total working set is larger than RAM, you’re simply doing more work than you have the hardware for. If you didn’t have swap space (which is used for VM paging in addition to traditional swapping) you just couldn’t run that workload. Even as you approach that point, if you’re committed to not swapping or paging, the only thing you’d be able to do would be to shrink the file cache, which will in turn reduce hit rates and generate more “unnecessary” I/O. In other words you’re now thrashing the file cache instead of swap. Some would say that’s OK, constituting a sort of “poor man’s admission control” to limit the amount of work entering the system, or that at least it’s more recoverable than a swap storm. I’m actually pretty inclined to agree with those beliefs. Swap space (really more likely to be paging space in this scenario) is not a solution to persistent memory overcommit.
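To make the file-cache half of that tradeoff concrete, here’s a toy simulation (the access pattern and cache sizes are invented for illustration): replaying a skewed block-access trace through an LRU cache shows that shrinking the cache drops the hit rate, and every lost hit is exactly the kind of “unnecessary” disk I/O I mean.

```python
from collections import OrderedDict
import random

def simulate_hits(cache_size, accesses):
    """Replay a block-access trace through an LRU cache of the given size
    and return the hit rate (each miss translates into a disk read)."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)      # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False) # evict the least recently used
            cache[block] = None
    return hits / len(accesses)

# A skewed trace: most accesses go to a small hot set, as real file
# access patterns tend to.  All numbers here are made up.
random.seed(0)
trace = [random.randrange(100) if random.random() < 0.8 else random.randrange(1000)
         for _ in range(20000)]

for size in (400, 200, 100, 50):
    print(f"cache of {size} blocks -> hit rate {simulate_hits(size, trace):.2f}")
```

Since LRU has the inclusion property, the hit rate can only fall as the cache shrinks; the misses you create by squeezing the cache are real I/O, just charged to a different account than swap.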

The one place where swapping (and here we must distinguish swapping from paging) can help is in that “near the edge” milieu, but only in cooperation with the scheduler, and only in the transient-overcommit case. If all processes are left resident and actively scheduled in an overcommit situation, they will contend with each other for available memory and thrashing will ensue. Swapping, however, gives us the option to take some players out of the race for a while. If you reduce the number of active processes so that the total working set for the remainder fits in RAM, and give those a chance to run to completion, the system will thrash less and you can return to a non-overcommitted steady state sooner. If they won’t run to completion, you’re back in the persistent-overcommit case I said you shouldn’t allow, and you’re just out of luck; nothing you can do will yield more than a slight benefit.
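The selection step above can be sketched in a few lines. This is a deliberately naive illustration, not how any real kernel picks swap candidates: the process names and working-set sizes are invented, and “smallest working sets first” is just one plausible policy (it maximizes the number of processes that keep running, while a real system would weigh priority, age, and much else).

```python
def choose_resident(procs, ram):
    """Greedy sketch: keep as many processes resident as will fit in RAM,
    favoring the smallest working sets; the rest are swapped out and
    rejoin the schedule once the residents finish and free their memory.
    `procs` maps process name -> working-set size."""
    resident, swapped, used = [], [], 0
    for name, wset in sorted(procs.items(), key=lambda kv: kv[1]):
        if used + wset <= ram:
            resident.append(name)
            used += wset
        else:
            swapped.append(name)
    return resident, swapped

# Invented workload: working sets in MB, 800 MB of RAM, total demand 1300 MB.
procs = {"db": 600, "web": 150, "batch": 500, "cron": 50}
resident, swapped = choose_resident(procs, ram=800)
print("run now: ", resident)   # total working set fits in RAM, so no thrashing
print("swap out:", swapped)    # scheduled again once memory frees up
```

The point is not the policy but the structure: the residents now have a combined working set that fits in RAM, so they can make real progress instead of all four processes thrashing together.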

I don’t for a moment pretend any of this is new insight, by the way. Swapping vs. paging vs. nothing at all has been debated for at least twenty years among operating system designers and implementors. It only seems new to the Linux crowd because of their persistent “not invented here” syndrome and reluctance to consult any sources that predate their own kernel-hacking experience. That’s why they end up arguing about whether there should even be swap when not only that issue but second- and third-order issues, such as how to identify swap candidates, are already pretty well understood in the broader community.