I’ve noticed a significant increase lately in the number of complaints people are making about the operating systems they use, particularly Linux and most especially the storage stack. No, I’m not thinking of a certain foul-mouthed SSD salesman, who has made such kvetching the centerpiece of his Twitter persona. I’m talking about several people I know in the NoSQL/BigData world, who I’ve come to respect as very smart and generally reasonable people, complaining about things like OS caches, virtual memory in general, CPU schedulers, I/O schedulers, and so on. Sometimes the complaints are just developers being developers, which (unfortunately) seems to mean being disrespectful of developers in other specialties. Sometimes the complaints take the form of an unexamined assumption that OS facilities just can’t be trusted, get in the way, and kill performance. The meme seems to be that the way to get better application performance is to get the OS out of the way as much as possible and reinvent half of what it does within your application. That’s wrong. No matter how the complaint is framed, it’s highly likely to reflect more negatively on the complainer than on the thing they’re complaining about.
Look, folks, operating-system developers can’t read minds. They have to build very complex, very general systems. They set defaults that suit the most common use cases, and they provide knobs to tune for something different. Learn how to use those knobs to tune for your exotic workload, or STFU. Does your code perform well in every possible use on every possible configuration, without tuning? Not so much, huh? I’ve probably seen your developers deliver a very loud “RTFM” when users visit mailing lists or IRC channels looking for help with a “wrong” use or config. I’ve probably seen them say far worse, even. How can the same person do that, and then turn around to complain about an OS they haven’t learned properly, and not be a hypocrite? When you do find those tuning knobs, often after having been told about them because you had already condemned the things they control as broken, don’t try and pass it off as your personal victory over the lameness of operating systems and their developers. You just turned a knob, which was put there by someone else in the hopes that you’d be smart enough to use it before you complained. They did the hard work – not you.
I’m not going to say that all complaints about operating systems are invalid, of course. I still think it’s ridiculous that Linux requires swap space even when there’s plenty of memory, and behaves poorly when it can’t get any. I think the “OOM Killer” is one of the dumbest ideas ever, and the implementation is even worse than the idea. I won’t say that operating-system documentation is all that it should be, either. Still, if you haven’t even tried to find out what you can tune through /proc and /sys and fcntl/setsockopt/*advise, or gone looking in the Documentation subdirectory of your friendly neighborhood kernel tree, or accepted an offer of help from a kernel developer who came to you to help make things better, you’re just in no position to complain or criticize. It’s like complaining that your manual-transmission car stalled, when you never even learned to drive it. Not knowing something doesn’t make you a fool, but complaining instead of asking does. Maybe if you actually engaged with your peers instead of pissing on them, they could help you build better applications.
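To make the "knobs" point concrete, here is a minimal sketch of the setsockopt family in use. The helper name `tune_socket` and its parameters are made up for illustration; the socket options themselves (TCP_NODELAY, SO_RCVBUF) are standard. It sets two per-socket options and then asks the kernel what it actually settled on, since the kernel is free to adjust a requested buffer size.

```python
import socket


def tune_socket(sock, nodelay=True, rcvbuf=None):
    """Apply two per-socket knobs and report what the kernel settled on.

    tune_socket is a hypothetical helper name, not part of any library.
    """
    if nodelay:
        # Disable Nagle's algorithm for latency-sensitive traffic.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    if rcvbuf is not None:
        # Request a receive buffer size; the kernel may round or cap it.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    return {
        "nodelay": bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)),
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }
```

Reading the value back matters as much as setting it: the number you asked for and the number you got are not guaranteed to match.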
I really like this guy’s way of thinking about OS/application interactions: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
Yes, that’s a good reference, Matt. I actually thought of citing PHK specifically to answer some of the complaints about OS caches and virtual memory, especially since I know some of the complainers are Varnish fans, but it slipped my mind when I was actually writing this. Thanks for the reminder.
But but you’re an expert! In theory it makes sense to use swap when there’s plenty of memory, right? ;-)
Heh,
Varnish outsourcing to the OS stresses the vm system in interesting ways. I don’t actually complain until I’ve dug out the kernel/glibc/varnish/etc, read the source, and attempted to fix it.
I wish madvise/fadvise and friends actually got new and extended semantics to let us do what we need. There are lots of undocumented “here be dragons” areas there, and it keeps changing with kernel releases. It is sad that msync actually syncs the entire vma, and not the range I gave it. It is annoying that I have to do MADV_DONTNEED on the memory and FADVISE_DONTNEED on the file to actually evict something. It is sad I can’t easily send TRIM on a region of the mmap. It is sad I can’t tell it to not fault in something from disk, but still back it from disk (useful for log files, for example).
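For readers who haven't met the two-step eviction described above, a minimal sketch follows, assuming Linux and Python 3.8 or later. The function name `evict_cached_file` is hypothetical, and the file-side constant is spelled POSIX_FADV_DONTNEED in the actual API. Both calls are only hints, so the sketch reports which hints it issued rather than claiming anything was evicted.

```python
import mmap
import os


def evict_cached_file(path):
    """Hint that a file's pages are no longer needed, via the mapping and the file.

    evict_cached_file is a hypothetical helper. Returns the hints that were issued.
    """
    issued = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size and hasattr(mmap, "MADV_DONTNEED"):
            with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
                _ = m[:1]  # touch the mapping so there is something to drop
                m.madvise(mmap.MADV_DONTNEED)  # step 1: the mapped pages
                issued.append("madvise")
        if hasattr(os, "posix_fadvise"):
            # A length of 0 means "through the end of the file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # step 2: the page cache
            issued.append("posix_fadvise")
    finally:
        os.close(fd)
    return issued
```

The file is opened read-only here, so there are no dirty pages to worry about; with a writable mapping the interaction between the two calls is exactly the murky area being complained about.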
Personally, I wish for the OS to do as much of the heavy lifting as possible. There is, however, no excuse for vm.zone_reclaim_mode suddenly defaulting to on.
Artur
I’m not sure why you consider this a problem. Presumably, you want it to be evicted immediately for one of two reasons: you’re deliberately manipulating the eviction order to be something other than LRU, or you know you’ll need the space soon and you want to avoid the regular eviction process at that time even if it means leaving the memory idle in the interim. The first suggests application-specific knowledge, and DONTNEED is the way to pass that on to the non-mind-reading OS. The second (weakly) suggests that you expect use-once behavior to be the norm. Are you suggesting some kind of per-file or global option to tell the OS that once, instead of having to do it repeatedly? It’s an interesting idea. If you’d like, I could casually suggest it to someone who actually works in that area – giving you credit, of course – and see what they think.
I’m not sure there’s an excuse for vm.zone_reclaim_mode at all. ;)
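Checking what a box is actually running with takes only a few lines. This is a minimal sketch, assuming the usual Linux layout where a dotted sysctl name maps onto a file under /proc/sys; the helper names `sysctl_path` and `read_sysctl` are made up for illustration, and the simple dot-splitting ignores the rare names that contain literal dots.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as 'vm.zone_reclaim_mode' to its file.

    sysctl_path and read_sysctl are hypothetical helper names.
    """
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if the knob is absent here."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None
```

`read_sysctl("vm.zone_reclaim_mode")` returns "0" when zone reclaim is off, and None on kernels that don't expose the knob at all (it only exists on NUMA-enabled builds). Changing it takes root, for example with `sysctl -w vm.zone_reclaim_mode=0`.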
If I call MADV_DONTNEED the page gets dropped to be refilled by the underlying store. FADVISE_DONTNEED however will write out the changes, so if I really don’t care about the changes, there is no way I can communicate that.
For use-once behavior you can set MADV_SEQUENTIAL (and then you can tune /sys/block/xxx/queue/read_ahead_kb if you don’t like the excessive read-ahead). This turns the LRU into a FIFO for those pages.
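As a rough sketch of that combination, assuming Linux and Python 3.8 or later: the snippet below hints a read-only mapping as sequential before scanning it once, and separately reads the block device's read-ahead setting. The function names are hypothetical, and the device name you pass (for example "sda") depends entirely on your machine.

```python
import mmap
import os


def read_ahead_kb(device):
    """Return /sys/block/<device>/queue/read_ahead_kb as an int, or None if unavailable."""
    try:
        with open(f"/sys/block/{device}/queue/read_ahead_kb") as f:
            return int(f.read())
    except (OSError, ValueError):
        return None


def checksum_sequential(path, chunk=1 << 16):
    """Sum every byte of a file through an mmap hinted as sequential access."""
    total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0  # mmap cannot map an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            if hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)  # advisory: one pass, front to back
            for off in range(0, size, chunk):
                total += sum(m[off:off + chunk])
    return total
```

The hint changes nothing about the result, only about how the kernel is invited to read ahead and reclaim behind the scan.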
If I want to help the eviction (we see vm efficiency rate down to 1-2% sometimes), I would love to be able to do MADV_FREE_CLEAN_PAGES, since you know, there is a race between mincore and madvise. However, since DONTNEED doesn’t actually drop it from the page cache, that isn’t needed. But relying on that seems to be a dangerous idea.
Then there is the fact that to write to a page I need to read it in from the backing store, even if I don’t want it (logging, for example, or discarded data). There is no way to say, “give me empty pages that are backed by this underlying store.”
There is also no way to actually tell the memory system to write a specific set of pages as one sequential write. msync() should be able to do so, but msync actually syncs the entire vma. And creating and destroying vmas is incredibly expensive when you are under high memory pressure, due to the write semaphore needed during that time.
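For reference, this is roughly what asking for a ranged sync looks like from user space, sketched with Python's `mmap.flush(offset, size)`, which calls msync on Unix. The helper name `flush_range` is hypothetical. It only shows how the range is passed (page-aligned, clamped to the mapping); how much the kernel really writes out is up to the kernel, which is the complaint above.

```python
import mmap


def flush_range(m, offset, length):
    """msync only the pages covering [offset, offset + length) of mapping m.

    flush_range is a hypothetical helper. Returns the (start, size) actually passed.
    """
    page = mmap.PAGESIZE
    start = (offset // page) * page               # round down to a page boundary
    end = -(-(offset + length) // page) * page    # round up to a page boundary
    end = min(end, len(m))                        # never run past the mapping
    m.flush(start, end - start)
    return start, end - start
```

Dirtying five bytes in the second page of a mapping and calling `flush_range(m, mmap.PAGESIZE + 10, 5)` requests a sync of exactly that one page.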
Artur
So you dirtied pages but then want to discard those changes? Something like XFS_IOC_FREESP64 or FALLOC_FL_PUNCH_HOLE for memory regions?
I have to assume you mean partial writes to already-allocated pages, since the newly-allocated case has been handled without reading for ages.