Description of problem:
A RHEL5U1 system eventually gets bogged down waiting for IO to complete to the system disk. kswapd is quite active. Using blktrace it is clear that kswapd is by far the largest producer of write IO to the system disk. Other services trying to read from the system disk must wait quite a long time for their requests to complete.

The system is a Dell 1950 w/ SMP 2.0GHz Intel Xeon (x86_64) and 1GB of memory. It has a qlogic 2300 card to access LUNs on a SAN.

Version-Release number of selected component (if applicable):
2.6.18-53.el5

How reproducible:
As frequently as 3 times a day or as infrequently as once every 3 days.

Steps to Reproduce:
1. This issue can't be easily reproduced; the system users (that I'm reporting this bug on behalf of) periodically hit this issue under varying workloads. E.g. After formatting 64 5GB disks in parallel.

Actual results:
the system's memory is not heavily used:

crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM   223890     874.6 MB         ----
      FREE   178189     696.1 MB   79% of TOTAL MEM
      USED    45701     178.5 MB   20% of TOTAL MEM
    SHARED     4951      19.3 MB    2% of TOTAL MEM
   BUFFERS      333       1.3 MB    0% of TOTAL MEM
    CACHED     1950       7.6 MB    0% of TOTAL MEM
      SLAB    18914      73.9 MB    8% of TOTAL MEM

TOTAL HIGH        0            0    0% of TOTAL MEM
 FREE HIGH        0            0    0% of TOTAL HIGH
 TOTAL LOW   223890     874.6 MB  100% of TOTAL MEM
  FREE LOW   178189     696.1 MB   79% of TOTAL LOW

TOTAL SWAP   510061       1.9 GB         ----
 SWAP USED    37417     146.2 MB    7% of TOTAL SWAP
 SWAP FREE   472644       1.8 GB   92% of TOTAL SWAP

various services that are running on the system become UN (uninterruptible) waiting for IO to complete. automount (which is _not_ being used but the service was never disabled) is quite busy.

kswapd is seemingly getting throttled by the VM:

crash> ps
...
    226     19   1  ffff81003f771080  UN   0.0       0      0  [kswapd0]
...
crash> bt 226
PID: 226    TASK: ffff81003f771080  CPU: 1   COMMAND: "kswapd0"
 #0 [ffff81003fb97c20] schedule at ffffffff80060f29
 #1 [ffff81003fb97d08] schedule_timeout at ffffffff80061839
 #2 [ffff81003fb97d58] io_schedule_timeout at ffffffff800611c7
 #3 [ffff81003fb97d88] blk_congestion_wait at ffffffff8003aa23
 #4 [ffff81003fb97dd8] kswapd at ffffffff80055b6d
 #5 [ffff81003fb97ee8] kthread at ffffffff800321d8
 #6 [ffff81003fb97f48] kernel_thread at ffffffff8005bfb1

Expected results:
kswapd should not be so active given that the system has considerable amounts of memory free.

Additional info:
Rik van Riel <riel> posted a fix for kswapd 5 months ago here:
http://lkml.org/lkml/2007/9/25/457

That fix only recently got committed upstream and was included in 2.6.24, the git commit was titled:
"kswapd should only wait on IO if there is IO"
commit: f1a9ee758de7de1e040de849fdef46e6802ea117

I had a look over other mm/vmscan.c changes that happened in the past year, here are some seemingly relevant ones:

included in 2.6.21:
throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocations:
commit: 232ea4d69d81169453344b7d05203425c88d973b

included in 2.6.23:
mm: prevent kswapd from freeing excessive amounts of lowmem:
commit: 32a4330d4156e55a4888a201f484dbafed9504ed

I don't know if any of these changes will actually fix the problem. I would really appreciate it if RedHat could help me understand if there is a newer released RHEL5 kernel that would possibly address this "kswapd gone crazy" problem.
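The kmem -i figures quoted in the report can be cross-checked with a little arithmetic. The sketch below assumes a 4 KiB page size (the usual base page size on x86_64, not stated in the report) and assumes that crash truncates percentages to whole numbers, which is what the quoted values suggest. The helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # bytes; assumed x86_64 base page size, not stated in the report


def pages_to_mb(pages, page_size=PAGE_SIZE):
    """Convert a page count to MB (2**20 bytes), rounded to one decimal place."""
    return round(pages * page_size / 2**20, 1)


def percent_of(part, total):
    """Whole-number percentage, truncated (assumed to mirror the quoted output)."""
    return part * 100 // total


# Page counts copied from the kmem -i output in the report.
total_mem, free_mem = 223890, 178189
swap_total, swap_used = 510061, 37417

print("total MB:", pages_to_mb(total_mem))
print("free MB:", pages_to_mb(free_mem))
print("free % of total:", percent_of(free_mem, total_mem))
print("swap used %:", percent_of(swap_used, swap_total))
```

With these inputs the free share comes out at roughly four fifths of RAM while some swap is in use, which is the mismatch the report is pointing at.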
One important fact I forgot to mention: if any of the system services (e.g. ntpd, autofs, etc.) is stopped, kswapd releases and the system disk's heavy IO subsides.

It should be noted that even when the system is experiencing the kswapd load, the system isn't heavily loaded:
load average: 1.80, 1.82, 1.90

Once a service (e.g. autofs, aka automount) is stopped, the load on the system returns to 0.
I suspect that one of the problems is that, when kswapd is started, almost no memory is freeable. This causes kswapd to free memory more and more aggressively, increasing its free targets.

Under some circumstances (I have not figured out the problem yet, even though I see it once a week or so on my own system) it looks like kswapd and other processes in the pageout code will continue to free pages even after the system already has lots of free pages.

I am not quite sure how to fix this, since sometimes the VM actually needs to do this, e.g. to satisfy higher order allocations.
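To illustrate why plenty of free pages may still not satisfy a higher order allocation, here is a minimal sketch that parses one line in the /proc/buddyinfo format and counts how many free pages sit in blocks of at least a given order. The sample line is invented for illustration, not taken from the affected system, and the function names are hypothetical.

```python
def parse_buddyinfo_line(line):
    """Parse one /proc/buddyinfo line into (node, zone, free_blocks_per_order)."""
    head, rest = line.split("zone", 1)
    node = int(head.replace("Node", "").replace(",", "").strip())
    fields = rest.split()
    zone = fields[0]
    counts = [int(x) for x in fields[1:]]  # index = order, value = free blocks
    return node, zone, counts


def free_pages_at_or_above(counts, order):
    """Pages held in free blocks whose order is >= the requested order."""
    return sum(n << o for o, n in enumerate(counts) if o >= order)


# Invented sample: many small free blocks, nothing at order 4 or above.
sample = "Node 0, zone   Normal    120     40     10      2      0      0      0      0      0      0      0"

node, zone, counts = parse_buddyinfo_line(sample)
print(zone, "free pages total:", free_pages_at_or_above(counts, 0))
print(zone, "free pages usable for an order-4 request:", free_pages_at_or_above(counts, 4))
```

In this made-up case the zone has free pages overall but none in blocks large enough for an order-4 request, which is the kind of situation where continued reclaim can be legitimate.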
I added the tunable /proc/sys/vm/pagecache to RHEL5-U1. This tunable (which defaults to 100) controls the percentage of memory that can be in the pagecache before we start reclaiming almost entirely from the pagecache. It works by having mark_page_accessed() place pagecache pages on the inactive list if the percentage of memory in the pagecache is over /proc/sys/vm/pagecache. This way, if you lower /proc/sys/vm/pagecache to 10, the inactive list is almost entirely pagecache pages, which are mostly clean due to pdflush and kupdate. The system therefore does not need to swap, especially if the majority of the page demand is via the pagecache.

Another observation is that min_free_kbytes controls the zone min, low and high watermarks. If you increase min_free_kbytes, low gets min*2 and high gets min*3. This will cause kswapd to be woken up earlier and to free more pages before it stops running. If this does not work, we probably need to change the scaling of low and high so they are much higher than 2 and 3 times min respectively. This way the allocator will wake up kswapd long before it drives the free list down below min, and kswapd has a chance of keeping up with the memory demand.

Larry Woodman

Oh, and one final comment: if you move the swap partition to a different device, kswapd's IO will not stall other processes' IO requests.
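The watermark scaling described in the comment above can be sketched as follows. This is a system-wide simplification using the 2x and 3x factors as stated in the comment; the kernel computes watermarks per zone and the exact scaling depends on the kernel version, so the factors are left as parameters. The function names are hypothetical and not kernel APIs.

```python
def watermarks_kb(min_free_kbytes, low_factor=2, high_factor=3):
    """Watermarks in kB, using the low = 2*min, high = 3*min scaling from the comment.

    Simplified: treats the whole system as one zone (an assumption).
    """
    return {
        "min": min_free_kbytes,
        "low": min_free_kbytes * low_factor,
        "high": min_free_kbytes * high_factor,
    }


def kswapd_should_wake(free_kb, marks):
    """kswapd is woken once free memory drops below the low watermark."""
    return free_kb < marks["low"]


def kswapd_can_sleep(free_kb, marks):
    """kswapd stops reclaiming once free memory is back at the high watermark."""
    return free_kb >= marks["high"]


for min_kb in (4096, 16384):
    marks = watermarks_kb(min_kb)
    print(min_kb, marks, "wake at 10000 kB free:", kswapd_should_wake(10000, marks))
```

Raising min_free_kbytes moves all three thresholds up together, so kswapd starts earlier and keeps going longer. Widening low_factor and high_factor models the alternative of changing the scaling itself.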
Mike, can you please provide some feedback on how your systems behave with the suggested tuning parameter settings?
Sorry for not getting back to you sooner (was away/busy). Turns out that the users of the systems that were having a problem resolved the issue simply by adding more physical memory (went from 1G to 2G). Testing with the suggested tunings will be tough seeing as the systems in question aren't under my control (or available to me). Has anyone that is looking at this issue reproduced this behavior (Rik seemed to say he had seen it periodically) or are you purely relying on me to make further progress?
I also have the same problem on Ubuntu on a Lenovo T520. I have 8GB of RAM, so it is definitely not a lack of RAM; it is a bug in kswapd.

$ uname -a
Linux dmugtasimov 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Linux Mint 13 Maya \n \l
$ free -m
             total       used       free     shared    buffers     cached
Mem:          7939       2301       5638          0         67       1051
-/+ buffers/cache:       1182       6757
Swap:        15999          0      15999