Created attachment 434098 [details]
dstat dump

Description of problem:
Sometimes my desktop machine thrashes the swap partition while there's plenty of free RAM. The machine is unusable for half a minute in such cases (the mouse cursor and the music from rhythmbox stop). After the swap-out a long swap-in session begins, during which the interactivity of the applications is terrible.

This mass swap-out event is rare and mostly triggered by some kind of disk input (e.g. ls -R /usr, tar /usr), but when it comes, it always comes with a long pause and huge disk (swap) writes.

I run 3x KVM (2x Win7 with 512MB RAM, 1x Debian with 256MB RAM; KSM enabled), NetBeans 6.8 (~500MB RSS), Firefox, Rhythmbox on KDE.

A thrashing session observed via dstat is attached: 354MB of free RAM was available when the swap-out began (~600MB swapped out at once!) and the kernel freed up 1604MB(!) - why?? The usual applications were running, I did nothing special.

Current /proc/meminfo (after some thrashing):

MemTotal:        4056324 kB
MemFree:          389520 kB
Buffers:          401332 kB
Cached:           641096 kB
SwapCached:       585312 kB
Active:           980964 kB
Inactive:        1736368 kB
Active(anon):     482120 kB
Inactive(anon):  1207236 kB
Active(file):     498844 kB
Inactive(file):   529132 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       3145720 kB
SwapFree:        2063420 kB
Dirty:               896 kB
Writeback:             0 kB
AnonPages:       1190272 kB
Mapped:            88892 kB
Shmem:             14396 kB
Slab:             691604 kB
SReclaimable:     593392 kB
SUnreclaim:        98212 kB
KernelStack:        3456 kB
PageTables:        45680 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     5173880 kB
Committed_AS:    3946884 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      343532 kB
VmallocChunk:   34359379452 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      194432 kB
DirectMap2M:     3997696 kB

/proc/buddyinfo:

Node 0, zone      DMA      3     2     3     3     3     1     0     0     1     1     3
Node 0, zone    DMA32  33009 12170  1327   309    16     0     1     0     0     0     1
Node 0, zone   Normal   3378   888   167     8     3     2     0     1     1     0     0

Custom /etc/sysctl.conf settings:
vm.swappiness = 0
vm.min_free_kbytes = 65536

Version-Release number of selected component (if applicable):
kernel 2.6.33.6-147.fc13.x86_64
qemu-system-x86-0.12.3-8.fc13.x86_64
qemu-img-0.12.3-8.fc13.x86_64
qemu-kvm-0.12.3-8.fc13.x86_64
qemu-common-0.12.3-8.fc13.x86_64
Created attachment 434099 [details] /proc/cpuinfo
This swap-thrashing behavior doesn't depend on KVM - I can reproduce it even if only the usual desktop applications are running. I simplified the trigger too: a non-recursive ls on a large dir (cifs share) is enough to begin the mass swap-out.

I forgot to mention before that the swap is on LVM over dm-crypt. I tried to put the swap into a file on an unencrypted filesystem over LVM, but that didn't help.

I stopped KVM and filled up the memory with memory-hungry applications like openoffice, gimp, digikam, totem, konqueror, firefox, seamonkey, netbeans, rhythmbox.

Memory reported by dstat:
used: 2766M  buffers: 149M  cache: 520M  free: 526M(!)

$ for i in /mnt/CIFS-mount/large-dirs/*; do ls $i | wc -l; done
(_the system paused for 1 minute_)
2297
1505
1801
1816
1404
1334
1002

Memory reported by dstat after 800MB(!) was swapped out:
used: 2467M  buffers: 54.3M  cache: 415M  free: 1024M(!)

(The complete dstat output is attached.)

Maybe memory fragmentation is the problem? How can I measure that? How can I be sure about it?
Created attachment 434146 [details] dstat output from a mass swap-out (only usual desktop apps ran)
I'm on track. In the past I set this in /etc/modprobe.d/local_options.conf (I'm using samba over the Internet):

options cifs CIFSMaxBufSize=65536 cifs_min_rcv=64

I knew a large CIFSMaxBufSize can hurt performance due to memory fragmentation, but I hoped a maximized cifs_min_rcv would compensate for the problem - I was wrong.

The first problem is that readdir *never uses* the cifs_min_rcv-allocated buffers:

$ while :; do grep cifs_request /proc/slabinfo ; sleep 0.1; done
cifs_request      66     66  61568    1   16 : [...]

doing ls in another terminal:

$ ls /mnt/cifs-mount/smalldir | wc -l
12

cifs_request      67     67  61568    1   16 : [...]
cifs_request      67     67  61568    1   16 : [...]
ls done.
cifs_request      66     66  61568    1   16 : [...]

The second problem is that CIFSMaxBufSize=64k takes up 32 contiguous pages of memory, so CIFSMaxBufSize=61440 is the right choice - it takes up only 16:

grep cifs_request /proc/slabinfo | awk '{ print $6 }'

Based on /proc/buddyinfo there's a much better chance of allocating 16 contiguous pages than 32. I fixed it.

I think allocating a large block of pages while there are 66 empty, pre-allocated buffers is a mistake. So I'm keeping this bug report open for now.

Now I'm using samba with CIFSMaxBufSize=61568 and no mass swap-outs.
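To make the page math concrete, here is a small sketch (not part of the original report) of how a given CIFSMaxBufSize maps to a buddy-allocator order. It assumes 4 KiB pages and a ~128-byte per-request header, inferred from the 61568-byte cifs_request object seen in the slabinfo output above; the helper name is made up for illustration.

```shell
# Hypothetical helper: given CIFSMaxBufSize, estimate the buddy
# allocation order the resulting cifs request buffer needs.
# Assumption: 4 KiB pages and a ~128-byte request header
# (61440-byte buffer -> 61568-byte object, as in slabinfo above).
alloc_order() {
    awk -v buf="$1" 'BEGIN {
        obj = buf + 128                   # buffer + assumed header
        pages = int((obj + 4095) / 4096)  # pages needed, rounded up
        order = 0; p = 1
        while (p < pages) { p *= 2; order++ }   # round up to 2^order
        printf "CIFSMaxBufSize=%d -> %d-byte object -> order-%d (%d pages)\n",
               buf, obj, order, p
    }'
}

alloc_order 65536   # order-5: 32 contiguous pages
alloc_order 61440   # order-4: 16 contiguous pages
alloc_order 16384   # the module default: order-3
```

This is why 61440 is the sweet spot: the buffer plus header still fits in an order-4 (16-page) block, while 65536 spills into order-5.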
Unfortunately, CIFSMaxBufSize=61568 isn't the solution. It reduces mass swap-outs, but doesn't stop them. I saw a 1GB swap-out today - I went to drink a coffee. :)

I'm beginning to think it's a VM problem and CIFS is just the trigger. It's not such an extraordinary thing to allocate 64KB of contiguous memory (hey, I've got 4GB), so why does 1GB of memory need to be evicted for it to succeed..? Maybe kswapd doesn't do the right thing when a high order allocation is requested (from my point of view it surely doesn't).

This time I traced /proc/pagetypeinfo too and extended dstat to show free pages by zone. After all, I don't understand exactly how the VM works (it's a complex beast), so I can't evaluate the results, but I hope someone can. What I see in the attached logs is that I had free pages at orders 0-6 in zone DMA32 but not in zone Normal when I ran 'ls /mnt/cifs/dir' - and kswapd came and did a massacre in all zones. So, why..?

(The daily backup was running while I made the logs.)
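For anyone who wants the same per-zone view without extending dstat, here is a minimal sketch that sums the order-4-and-above free block counts per zone straight from /proc/buddyinfo (the optional file argument is only there so the helper can be exercised against a saved dump; column layout as in the buddyinfo output above).

```shell
# Sum the free-block counts for orders >= 4 in each zone of
# /proc/buddyinfo.  Columns 5..15 hold the counts for orders 0..10,
# so orders 4..10 live in columns 9..15.
high_order_free() {
    awk '/zone/ {
        free = 0
        for (i = 9; i <= NF; i++) free += $i
        printf "%s order>=4 blocks: %d\n", $4, free
    }' "${1:-/proc/buddyinfo}"
}

# Watch zone Normal collapse while the swap-out happens:
#   while :; do date; high_order_free; sleep 0.5; done
```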
Created attachment 434295 [details] 1GB swap-out at once
Created attachment 434296 [details] cat /proc/pagetypeinfo in every 0.5s while 1GB swapped-out
Yeah, this seems like more of a VM problem than anything CIFS-specific, assuming your diagnosis is correct. One question though... were there a lot of dirty pagecache pages for files on the cifs share at the time?
Might also be interesting to redo this test:

$ ls /mnt/cifs-mount/smalldir | wc -l

...but change the 'ls' there to '/bin/ls'. That will disable the alias for colorized ls that bash usually sets up by default. Those aliases can cause ls to do a bunch of stat() calls for each file during its invocation. If that performs better, then that may help narrow things down to the stat() codepath.
The cifs mount is read-only, there were no dirty pages.

This problem is annoying me, so I went deeper today. I wrote a simple SystemTap script that prints details about the following events in real time:
- mm_page_alloc_zone_locked if order >= 4
- mm_page_alloc if order >= 4
- balance_pgdat function calls by kswapd

There's no doubt: processes that work on my cifs mount allocate high order pages at least on the following operations:
- {f,l}stat
- open (tested with ruby-irb> open "/mnt/cifs/testfile")
- getdents (tested with ruby-irb> Dir[/mnt/cifs/*])

The SystemTap script showed that when 'mm_page_alloc' failed, kswapd was kicked with order=4, which leads to mass page eviction. Not every kswapd order=4 call leads to mass eviction - only the first one, while the swap is still empty. After the first large swap-out, only 20-40MB is written to swap at once. But if I umount the cifs share, click through every application to force swap-in, and remount the cifs share, I can reproduce the large swap-out again.

After all, I have vm.swappiness set to 0, so I can't tolerate even the "small" 20-40MB swap-outs.

I smell two problems here:
- kswapd evicts too aggressively at high order and doesn't respect the vm.swappiness=0 setting.
- cifs allocates high order pages regularly; it shouldn't, as that can badly hurt performance. AFAIK Windows 7 uses a 64KB cifs block size by default, so it performs much better on WAN links by default. We'll never catch up if every simple stat() call causes a high order allocation! :)

Latest logs attached, .stap included.
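The same mm_page_alloc events can also be captured without SystemTap via the kmem:mm_page_alloc tracepoint (write 'order>=4' into its filter file under /sys/kernel/debug/tracing and read trace_pipe). Below is a hypothetical post-processor sketch, not from the report, that tallies such events per task from a saved trace; the input format is approximated from ftrace's usual "comm-pid ... mm_page_alloc: ... order=N ..." lines.

```shell
# Count mm_page_alloc events with order >= 4, grouped by the task that
# triggered them.  Input: saved ftrace output, e.g. from
#   echo 'order>=4' > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/filter
#   echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
#   cat /sys/kernel/debug/tracing/trace_pipe > alloc.log
count_high_order() {
    awk '/mm_page_alloc/ {
        o = -1
        for (i = 1; i <= NF; i++)                 # find the order=N field
            if ($i ~ /^order=/) { split($i, a, "="); o = a[2] }
        if (o >= 4) count[$1 " order=" o]++       # $1 is the comm-pid column
    }
    END { for (k in count) print count[k], k }' "$@" | sort
}
```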
Created attachment 434586 [details] dstat log
Created attachment 434587 [details] /proc/pagetypeinfo log
Created attachment 434588 [details] SystemTap script log
Created attachment 434589 [details] SystemTap script
I'm assuming that you've opened a separate bug for the VM issue so we can focus on the cifs piece here. I'd like to track the VM portion though, so if you can paste the bug number there I'd appreciate it...

For the CIFS piece... CIFS defaults to a 16k max buf size. You've quadrupled that here. Why? Does it offer some measurable performance gain? If so, how much? Without setting that module parm these would be order 2 or 3 allocations.

Certainly, it would be good to handle larger buffer sizes in cifs without needing higher-order allocations. The code already uses mempools however, so we do our best to minimize the thrashing.

CIFS is a rather cluttered protocol on the wire. It sort of requires rather large buffers. We could try to assemble these buffers out of individual pages, but the code isn't set up that way today. I don't personally have the time to tackle a project of that size at the moment. If you're willing and able to do so, let me know and I'll see what I can do to help. In the meantime, I'll keep this bug open as a reminder that this needs to be considered when we do have time.

It would actually be very helpful to open a bug for this at bugzilla.samba.org. There are people working on a filesystem for smb2 that I'm fairly certain uses the same sort of allocations. Opening a bug there may help encourage them to do the legwork for this problem.
About the VM issue: I tracked it down and it's the programmed behavior. Kswapd tries to free the high-watermark number of pages at the requested order (and above) in every zone. (I accept that it has a good reason to do so.) My high watermark is 20k in zone DMA32 and ~5k in zone Normal. If you look at https://bugzilla.redhat.com/attachment.cgi?id=434587 you'll see it does the job: it freed 20*4MB of memory from zone DMA32 and 5*4MB from zone Normal above order 4. The only problem is that the machine is unusable while swapping out. That's the fault of the VM, not of the mass eviction itself. I'll open a bug report about that soon.

About the CIFS issue: the 64KB block size gives double throughput in single-threaded sequential file reads over my Internet link. I have 60Mbit/s of bandwidth (in theory) but unfortunately with a high RTT: min/avg/max/mdev = 9.827/14.998/27.246/4.709 ms (measured with ping). The CIFS client doesn't use a sliding window (there's no large async read-ahead), so the RTT is a bottleneck for me. (By default I can copy from the cifs share at 800KB/s, with a 64KB rsize at 1800KB/s. The cumulative rate almost doubles if I start another copy in parallel.)

Oh, I regret that it's so hard to get rid of high order allocations in the cifs code. Maybe I'll look at the code, but unfortunately I don't have much time either; I already spent a lot of it learning about the VM (but hey, I know the basics of the Linux VM now, I'm so glad! :)

I'll check in at bugzilla.samba.org soon. Thanks for your attention!
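The RTT bottleneck described above is easy to sanity-check: with only one synchronous read in flight, throughput can't exceed rsize/RTT no matter how fat the link is. A rough sketch using the numbers from this report (15 ms average RTT; 16 KiB is the module default mentioned earlier, 60 KiB the enlarged buffer):

```shell
# Upper bound on single-stream read throughput when each rsize-sized
# read must wait one full round trip: throughput <= rsize / RTT.
rtt_cap() {
    awk -v rsize="$1" -v rtt_ms="$2" 'BEGIN {
        printf "rsize=%d RTT=%sms -> at most %.0f KB/s\n",
               rsize, rtt_ms, (rsize / 1024) / (rtt_ms / 1000)
    }'
}

rtt_cap 16384 15   # ~1067 KB/s ceiling; ~800 KB/s was measured
rtt_cap 61440 15   # ~4000 KB/s ceiling; ~1800 KB/s was measured
```

The measured rates sit below these ceilings, consistent with the claim that latency, not bandwidth, limits the single-threaded copy.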
It may not be *that* hard, but a lot of places in the code assume a contiguous buffer. If you switch this to using collections of individual pages, then all of the places that work with a pointer to the buffer will need to be changed to deal with it. IOW, it'll mean touching a lot of code.

Another possibility is to change the code to use vmalloc'ed buffers, but that has its own problems wrt performance and is frowned upon...
This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.