Created attachment 434098 [details]
dstat dump

Description of problem:
Sometimes my desktop machine thrashes the swap partition while there's plenty of free RAM. The machine is unusable for half a minute in such cases (the mouse cursor and the music from rhythmbox stop). After the swap-out a long swap-in session begins, during which the interactivity of the applications is terrible.

This mass swap-out event is rare and mostly triggered by some kind of disk input (e.g. ls -R /usr, tar /usr), but when it comes, it always comes with a long pause and huge disk (swap) writes.

I run 3x KVM (2x Win7 with 512MB RAM, 1x Debian with 256MB RAM; KSM enabled), NetBeans 6.8 (~500MB RSS), Firefox, Rhythmbox on KDE.

A thrashing session observed via dstat is attached: 354MB of free RAM was available when the swap-out began (~600MB swapped out at once!) and the kernel freed up 1604MB(!) - why?? The usual applications were running, I did nothing special.

Current /proc/meminfo (after some thrashing):

MemTotal:        4056324 kB
MemFree:          389520 kB
Buffers:          401332 kB
Cached:           641096 kB
SwapCached:       585312 kB
Active:           980964 kB
Inactive:        1736368 kB
Active(anon):     482120 kB
Inactive(anon):  1207236 kB
Active(file):     498844 kB
Inactive(file):   529132 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       3145720 kB
SwapFree:        2063420 kB
Dirty:               896 kB
Writeback:             0 kB
AnonPages:       1190272 kB
Mapped:            88892 kB
Shmem:             14396 kB
Slab:             691604 kB
SReclaimable:     593392 kB
SUnreclaim:        98212 kB
KernelStack:        3456 kB
PageTables:        45680 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     5173880 kB
Committed_AS:    3946884 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      343532 kB
VmallocChunk:   34359379452 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      194432 kB
DirectMap2M:     3997696 kB

/proc/buddyinfo:

Node 0, zone      DMA      3     2     3     3     3     1     0     0     1     1     3
Node 0, zone    DMA32  33009 12170  1327   309    16     0     1     0     0     0     1
Node 0, zone   Normal   3378   888   167     8     3     2     0     1     1     0     0

Custom /etc/sysctl.conf settings:
vm.swappiness = 0
vm.min_free_kbytes = 65536

Version-Release number of selected component (if applicable):
kernel 2.6.33.6-147.fc13.x86_64
qemu-system-x86-0.12.3-8.fc13.x86_64
qemu-img-0.12.3-8.fc13.x86_64
qemu-kvm-0.12.3-8.fc13.x86_64
qemu-common-0.12.3-8.fc13.x86_64
Created attachment 434099 [details] /proc/cpuinfo
This swap-thrashing behavior doesn't depend on KVM - I can reproduce it even if only the usual desktop applications are running. I simplified the trigger too: a non-recursive ls on a large dir (cifs share) is enough to begin the mass swap-out.

I forgot to mention before that the swap is on LVM over dm-crypt. I tried to put the swap into a file on an unencrypted filesystem over LVM, but that didn't help.

I stopped KVM and filled up the memory with memory-hungry applications like openoffice, gimp, digikam, totem, konqueror, firefox, seamonkey, netbeans, rhythmbox.

Memory reported by dstat:
used: 2766M  buffers: 149M  cache: 520M  free: 526M(!)

$ for i in /mnt/CIFS-mount/large-dirs/*; do ls $i | wc -l; done
(_the system paused for 1 minute_)
2297
1505
1801
1816
1404
1334
1002

Memory reported by dstat after 800MB(!) was swapped out:
used: 2467M  buffers: 54.3M  cache: 415M  free: 1024M(!)

(The complete dstat output is attached.)

Maybe memory fragmentation is the problem? How can I measure that? How can I be sure about it?
Created attachment 434146 [details] dstat output from a mass swap-out (only usual desktop apps ran)
I'm on track. In the past I set this in /etc/modprobe.d/local_options.conf (I'm using samba over the Internet):

options cifs CIFSMaxBufSize=65536 cifs_min_rcv=64

I knew a large CIFSMaxBufSize can hurt performance due to memory fragmentation, but I hoped a maximized cifs_min_rcv would compensate for the problem - I was wrong.

The first problem is that readdir *never uses* the cifs_min_rcv-allocated buffers:

$ while :; do grep cifs_request /proc/slabinfo ; sleep 0.1; done
cifs_request      66     66  61568    1   16 : [...]

doing ls in another terminal:

$ ls /mnt/cifs-mount/smalldir | wc -l
12

cifs_request      67     67  61568    1   16 : [...]
cifs_request      67     67  61568    1   16 : [...]
ls done.
cifs_request      66     66  61568    1   16 : [...]

The second problem is that CIFSMaxBufSize=64k takes up 32 contiguous pages of memory, so CIFSMaxBufSize=61440 is the right choice - it takes up only 16:

grep cifs_request /proc/slabinfo | awk '{ print $6 }'

Based on /proc/buddyinfo there's a much better chance of allocating 16 contiguous pages than 32. I fixed it.

I think allocating a large block of pages while there are 66 empty, pre-allocated buffers is a mistake. So I'm keeping this bug report open for now.

Now I'm using samba with CIFSMaxBufSize=61568 and no mass swap-outs.
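To make the page math concrete, here is a small sketch (not part of the original report) of how a given CIFSMaxBufSize maps to a buddy-allocator order. It assumes 4 KiB pages and a ~128-byte per-request header, inferred from the 61568-byte cifs_request object seen in the slabinfo output above; the helper name is made up for illustration.

```shell
# Hypothetical helper: given CIFSMaxBufSize, estimate the buddy
# allocation order the resulting cifs request buffer needs.
# Assumption: 4 KiB pages and a ~128-byte request header
# (61440-byte buffer -> 61568-byte object, as in slabinfo above).
alloc_order() {
    awk -v buf="$1" 'BEGIN {
        obj = buf + 128                   # buffer + assumed header
        pages = int((obj + 4095) / 4096)  # pages needed, rounded up
        order = 0; p = 1
        while (p < pages) { p *= 2; order++ }   # round up to 2^order
        printf "CIFSMaxBufSize=%d -> %d-byte object -> order-%d (%d pages)\n",
               buf, obj, order, p
    }'
}

alloc_order 65536   # order-5: 32 contiguous pages
alloc_order 61440   # order-4: 16 contiguous pages
alloc_order 16384   # the module default: order-3
```

This is why 61440 is the sweet spot: the buffer plus header still fits in an order-4 (16-page) block, while 65536 spills into order-5.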
Unfortunately, CIFSMaxBufSize=61568 isn't the solution. It reduces mass swap-outs, but doesn't stop them. I saw a 1GB swap-out today - I went to drink a coffee. :)

I'm beginning to think it's a VM problem and CIFS is just the trigger. It's not such an extraordinary thing to allocate 64KB of contiguous memory (hey, I've got 4GB), so why does 1GB of memory need to be evicted for it to succeed..? Maybe kswapd doesn't do the right thing when a high order allocation is requested (from my point of view it surely doesn't).

This time I traced /proc/pagetypeinfo too and extended dstat to show free pages by zone. After all, I don't understand exactly how the VM works (it's a complex beast), so I can't evaluate the results, but I hope someone can. What I see in the attached logs is that I had free pages at orders 0-6 in zone DMA32 but not in zone Normal when I ran 'ls /mnt/cifs/dir' - and kswapd came and did a massacre in all zones. So, why..?

(The daily backup was running while I made the logs.)
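For anyone who wants the same per-zone view without extending dstat, here is a minimal sketch that sums the order-4-and-above free block counts per zone straight from /proc/buddyinfo (the optional file argument is only there so the helper can be exercised against a saved dump; column layout as in the buddyinfo output above).

```shell
# Sum the free-block counts for orders >= 4 in each zone of
# /proc/buddyinfo.  Columns 5..15 hold the counts for orders 0..10,
# so orders 4..10 live in columns 9..15.
high_order_free() {
    awk '/zone/ {
        free = 0
        for (i = 9; i <= NF; i++) free += $i
        printf "%s order>=4 blocks: %d\n", $4, free
    }' "${1:-/proc/buddyinfo}"
}

# Watch zone Normal collapse while the swap-out happens:
#   while :; do date; high_order_free; sleep 0.5; done
```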
Created attachment 434295 [details] 1GB swap-out at once
Created attachment 434296 [details] cat /proc/pagetypeinfo in every 0.5s while 1GB swapped-out
Yeah, this seems like more of a VM problem than anything CIFS-specific, assuming your diagnosis is correct. One question though... were there a lot of dirty pagecache pages for files on the cifs share at the time?
Might also be interesting to redo this test:

$ ls /mnt/cifs-mount/smalldir | wc -l

...but change the 'ls' there to '/bin/ls'. That will disable the alias for colorized ls that bash usually sets up by default. Those aliases can cause ls to do a bunch of stat() calls for each file during its invocation. If that performs better, then that may help narrow things down to the stat() codepath.
The cifs mount is read-only, there were no dirty pages.

This problem is annoying me, so I went deeper today. I wrote a simple SystemTap script that prints details about the following events in real time:
- mm_page_alloc_zone_locked if order >= 4
- mm_page_alloc if order >= 4
- balance_pgdat function calls by kswapd

There's no doubt: processes that work on my cifs mount allocate high order pages at least on the following operations:
- {f,l}stat
- open (tested with ruby-irb> open "/mnt/cifs/testfile")
- getdents (tested with ruby-irb> Dir[/mnt/cifs/*])

The SystemTap script showed that when 'mm_page_alloc' failed, kswapd was kicked with order=4, which leads to mass page eviction. Not every kswapd order=4 call leads to mass eviction - only the first one, while the swap is still empty. After the first large swap-out, only 20-40MB is written to swap at once. But if I umount the cifs share, click through every application to force swap-in, and remount the cifs share, I can reproduce the large swap-out again.

After all, I have vm.swappiness set to 0, so I can't tolerate even the "small" 20-40MB swap-outs.

I smell two problems here:
- kswapd evicts too aggressively at high order and doesn't respect the vm.swappiness=0 setting.
- cifs allocates high order pages regularly; it shouldn't, as that can badly hurt performance. AFAIK Windows 7 uses a 64KB cifs block size by default, so it performs much better on WAN links by default. We'll never catch up if every simple stat() call causes a high order allocation! :)

Latest logs attached, .stap included.
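The same mm_page_alloc events can also be captured without SystemTap via the kmem:mm_page_alloc tracepoint (write 'order>=4' into its filter file under /sys/kernel/debug/tracing and read trace_pipe). Below is a hypothetical post-processor sketch, not from the report, that tallies such events per task from a saved trace; the input format is approximated from ftrace's usual "comm-pid ... mm_page_alloc: ... order=N ..." lines.

```shell
# Count mm_page_alloc events with order >= 4, grouped by the task that
# triggered them.  Input: saved ftrace output, e.g. from
#   echo 'order>=4' > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/filter
#   echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
#   cat /sys/kernel/debug/tracing/trace_pipe > alloc.log
count_high_order() {
    awk '/mm_page_alloc/ {
        o = -1
        for (i = 1; i <= NF; i++)                 # find the order=N field
            if ($i ~ /^order=/) { split($i, a, "="); o = a[2] }
        if (o >= 4) count[$1 " order=" o]++       # $1 is the comm-pid column
    }
    END { for (k in count) print count[k], k }' "$@" | sort
}
```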
Created attachment 434586 [details] dstat log
Created attachment 434587 [details] /proc/pagetypeinfo log
Created attachment 434588 [details] SystemTap script log
Created attachment 434589 [details] SystemTap script
I'm assuming that you've opened a separate bug for the VM issue so we can focus on the cifs piece here. I'd like to track the VM portion though, so if you can paste the bug number there I'd appreciate it...

For the CIFS piece... CIFS defaults to a 16k max buf size. You've quadrupled that here. Why? Does it offer some measurable performance gain? If so, how much? Without setting that module parm these would be order 2 or 3 allocations.

Certainly, it would be good to handle larger buffer sizes in cifs without needing higher-order allocations. The code already uses mempools however, so we do our best to minimize the thrashing.

CIFS is a rather cluttered protocol on the wire. It sort of requires rather large buffers. We could try to assemble these buffers out of individual pages, but the code isn't set up that way today. I don't personally have the time to tackle a project of that size at the moment. If you're willing and able to do so, let me know and I'll see what I can do to help. In the meantime, I'll keep this bug open as a reminder that this needs to be considered when we do have time.

It would actually be very helpful to open a bug for this at bugzilla.samba.org. There are people working on a filesystem for smb2 that I'm fairly certain uses the same sort of allocations. Opening a bug there may help encourage them to do the legwork for this problem.
About the VM issue: I tracked it down and it's the programmed behavior. Kswapd tries to free the high-watermark number of pages at the requested order (and above) in every zone. (I accept that it has a good reason to do so.) My high watermark is 20k in zone DMA32 and ~5k in zone Normal. If you look at https://bugzilla.redhat.com/attachment.cgi?id=434587 you'll see it does the job: it freed 20*4MB of memory from zone DMA32 and 5*4MB from zone Normal above order 4. The only problem is that the machine is unusable while swapping out. That's the fault of the VM, not of the mass eviction itself. I'll open a bug report about that soon.

About the CIFS issue: the 64KB block size gives double throughput in single-threaded sequential file reads over my Internet link. I have 60Mbit/s of bandwidth (in theory) but unfortunately with a high RTT: min/avg/max/mdev = 9.827/14.998/27.246/4.709 ms (measured with ping). The CIFS client doesn't use a sliding window (there's no large async read-ahead), so the RTT is a bottleneck for me. (By default I can copy from the cifs share at 800KB/s, with a 64KB rsize at 1800KB/s. The cumulative rate almost doubles if I start another copy in parallel.)

Oh, I regret that it's so hard to get rid of high order allocations in the cifs code. Maybe I'll look at the code, but unfortunately I don't have much time either; I already spent a lot of it learning about the VM (but hey, I know the basics of the Linux VM now, I'm so glad! :)

I'll check in at bugzilla.samba.org soon. Thanks for your attention!
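The RTT bottleneck described above is easy to sanity-check: with only one synchronous read in flight, throughput can't exceed rsize/RTT no matter how fat the link is. A rough sketch using the numbers from this report (15 ms average RTT; 16 KiB is the module default mentioned earlier, 60 KiB the enlarged buffer):

```shell
# Upper bound on single-stream read throughput when each rsize-sized
# read must wait one full round trip: throughput <= rsize / RTT.
rtt_cap() {
    awk -v rsize="$1" -v rtt_ms="$2" 'BEGIN {
        printf "rsize=%d RTT=%sms -> at most %.0f KB/s\n",
               rsize, rtt_ms, (rsize / 1024) / (rtt_ms / 1000)
    }'
}

rtt_cap 16384 15   # ~1067 KB/s ceiling; ~800 KB/s was measured
rtt_cap 61440 15   # ~4000 KB/s ceiling; ~1800 KB/s was measured
```

The measured rates sit below these ceilings, consistent with the claim that latency, not bandwidth, limits the single-threaded copy.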
It may not be *that* hard, but a lot of places in the code assume a contiguous buffer. If you switch this to using collections of individual pages, then all of the places that work with a pointer to the buffer will need to be changed to deal with it. IOW, it'll mean touching a lot of code.

Another possibility is to change the code to use vmalloc'ed buffers, but that has its own problems wrt performance and is frowned upon...
This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.