Description of problem:
A RHEL5U1 system eventually gets bogged down waiting for IO to complete to the
system disk. kswapd is quite active. Using blktrace it is clear that kswapd is
by far the largest producer of write IO to the system disk. Other services
trying to read from the system disk must wait quite a long time for their
requests to complete.
The system is a Dell 1950 w/ SMP 2.0GHz Intel Xeon (x86_64) and 1GB of memory.
It has a qlogic 2300 card to access LUNs on a SAN.
Version-Release number of selected component (if applicable):
How reproducible:
As frequently as 3 times a day or as infrequently as once every 3 days.
Steps to Reproduce:
1. This issue can't be easily reproduced; the system users (that I'm reporting
this bug on behalf of) periodically hit this issue under varying workloads.
E.g. after formatting 64 5GB disks in parallel.
The system's memory is not heavily used:
crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM   223890     874.6 MB         ----
      FREE   178189     696.1 MB   79% of TOTAL MEM
      USED    45701     178.5 MB   20% of TOTAL MEM
    SHARED     4951      19.3 MB    2% of TOTAL MEM
   BUFFERS      333       1.3 MB    0% of TOTAL MEM
    CACHED     1950       7.6 MB    0% of TOTAL MEM
      SLAB    18914      73.9 MB    8% of TOTAL MEM
TOTAL HIGH        0            0    0% of TOTAL MEM
 FREE HIGH        0            0    0% of TOTAL HIGH
 TOTAL LOW   223890     874.6 MB  100% of TOTAL MEM
  FREE LOW   178189     696.1 MB   79% of TOTAL LOW
TOTAL SWAP   510061       1.9 GB         ----
 SWAP USED    37417     146.2 MB    7% of TOTAL SWAP
 SWAP FREE   472644       1.8 GB   92% of TOTAL SWAP
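As a sanity check on the figures above, here is a small sketch that recomputes the MB and percentage columns from the raw page counts. It assumes 4 KB pages (the usual x86_64 page size, not stated in the report) and that crash truncates percentages rather than rounding them; the helper names are made up for illustration.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KB pages, the usual x86_64 default


def pages_to_mb(pages):
    """Convert a page count to megabytes."""
    return pages * PAGE_SIZE_KB / 1024


def percent_of(part, whole):
    """Integer percentage, truncated (not rounded)."""
    return part * 100 // whole


# Page counts copied from the kmem -i output above.
total, free, used, slab = 223890, 178189, 45701, 18914

for name, pages in [("FREE", free), ("USED", used), ("SLAB", slab)]:
    mb = pages_to_mb(pages)
    pct = percent_of(pages, total)
    print(f"{name:5} {pages:7d} {mb:8.1f} MB {pct:3d}% of TOTAL MEM")
```

The point of the exercise: roughly four fifths of physical memory is free while kswapd is still busy writing to swap.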
various services that are running on the system become UN (uninterruptible)
waiting for IO to complete. automount (which is _not_ being used but the
service was never disabled) is quite busy.
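On a live system (as opposed to a crash dump), tasks in this state can be spotted by reading /proc/<pid>/stat, where uninterruptible sleep is reported as state "D" (crash prints it as "UN"). The following is only a sketch; the function names are invented for illustration, and it reports nothing on a machine without /proc.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first " (" and the last ") ".
    """
    pid, rest = line.split(" (", 1)
    comm, after = rest.rsplit(") ", 1)
    return int(pid), comm, after.split()[0]


def uninterruptible_tasks(proc="/proc"):
    """List (pid, comm) for every task currently in state D."""
    tasks = []
    if not os.path.isdir(proc):
        return tasks
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # task exited during the scan, or the file was unreadable
        if state == "D":
            tasks.append((pid, comm))
    return tasks


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Running this a few times in a row while the system disk is saturated should show which services are stuck behind kswapd's writes.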
kswapd is seemingly getting throttled by the VM:
226 19 1 ffff81003f771080 UN 0.0 0 0 [kswapd0]
crash> bt 226
PID: 226 TASK: ffff81003f771080 CPU: 1 COMMAND: "kswapd0"
#0 [ffff81003fb97c20] schedule at ffffffff80060f29
#1 [ffff81003fb97d08] schedule_timeout at ffffffff80061839
#2 [ffff81003fb97d58] io_schedule_timeout at ffffffff800611c7
#3 [ffff81003fb97d88] blk_congestion_wait at ffffffff8003aa23
#4 [ffff81003fb97dd8] kswapd at ffffffff80055b6d
#5 [ffff81003fb97ee8] kthread at ffffffff800321d8
#6 [ffff81003fb97f48] kernel_thread at ffffffff8005bfb1
kswapd should not be so active given that the system has considerable amounts of
free memory.
Rik van Riel <email@example.com> posted a fix for kswapd 5 months ago here:
That fix only recently got committed upstream and was included in 2.6.24,
the git commit was titled: "kswapd should only wait on IO if there is IO"
I had a look over other mm/vmscan.c changes that happened in the past
year, here are some seemingly relevant ones:
included in 2.6.21:
throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocations:
included in 2.6.23:
mm: prevent kswapd from freeing excessive amounts of lowmem:
I don't know if any of these changes will actually fix the problem. I would
really appreciate it if Red Hat could help me understand whether there is a newer
released RHEL5 kernel that would possibly address this "kswapd gone crazy" problem.
One important fact I forgot to mention:
If any of the system services (e.g. ntpd, autofs) is stopped, kswapd
releases and the system disk's heavy IO subsides. It should be noted that even
when the system is experiencing the kswapd load, the system isn't heavily loaded:
load average: 1.80, 1.82, 1.90
Once a service (e.g. autofs, aka automount) is stopped, the load on the system
returns to 0.
I suspect that one of the problems is that, when kswapd is started, almost no
memory is freeable. This causes kswapd to free memory more and more
aggressively, increasing its free targets.
Under some circumstances - I have not figured out the problem yet, even though I
see it once a week or so on my own system - it looks like kswapd (and other
processes in the pageout code) will continue to free pages even after the system
has lots of free pages already.
I am not quite sure how to fix this, since sometimes the VM actually needs to do
this, e.g. to satisfy higher-order allocations.
I added the tunable /proc/sys/vm/pagecache to RHEL5-U1. This tunable (which
defaults to 100) controls the percentage of memory that can be in the pagecache
before we start reclaiming almost entirely from the pagecache. It works by
having mark_page_accessed() place pagecache pages on the inactive list if the
percentage of the pagecache is over /proc/sys/vm/pagecache. This way, if you
lower /proc/sys/vm/pagecache to 10, the inactive list is almost entirely
pagecache pages, which are mostly clean due to pdflush and kupdate. Therefore the
system does not need to swap, especially if the majority of the page demand is via
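To make the threshold behaviour concrete, here is a toy model of the decision described above. It is not the RHEL5 kernel code; the function names and the "active"/"inactive" return values are invented for illustration, and the real mark_page_accessed() logic has more states than this.

```python
def pagecache_percent(pagecache_pages, total_pages):
    """Share of memory currently held by the pagecache, in percent."""
    return pagecache_pages * 100 / total_pages


def list_for_accessed_page(pagecache_pages, total_pages, tunable=100):
    """Which LRU list a touched pagecache page ends up on in this toy model.

    tunable stands in for /proc/sys/vm/pagecache (default 100).
    """
    if pagecache_percent(pagecache_pages, total_pages) > tunable:
        return "inactive"  # over the limit: keep it easy to reclaim
    return "active"        # under the limit: normal promotion


# With the default of 100 the limit can never be exceeded.
print(list_for_accessed_page(50000, 223890))
# Lowered to 10, a pagecache holding about 22% of memory is over the limit.
print(list_for_accessed_page(50000, 223890, tunable=10))
```

In this model, lowering the tunable only changes behaviour once the pagecache share actually exceeds it, so on a system whose pagecache is already tiny the setting would have no visible effect.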
Another observation is that min_free_kbytes controls the zone min, low and high
watermarks. If you increase min_free_kbytes, low gets min*2 and high gets
min*3. This will cause kswapd to be woken up earlier and free more pages until
it stops running. If this does not work, we probably need to change the scaling
of low and high so they are much higher than 2 and 3 times min respectively.
This way the allocator will wake up kswapd well before it drives the
free list down below min, and kswapd has a chance of keeping up with the memory
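The watermark arithmetic above can be sketched as follows. The 2x and 3x factors are taken from this comment at face value rather than from kernel source, the real watermarks are computed per zone rather than system-wide, and the sample min_free_kbytes values are hypothetical.

```python
def watermarks_kb(min_free_kbytes, low_factor=2, high_factor=3):
    """Return (min, low, high) in KB using the scaling described above.

    Simplification: treats the whole system as a single zone.
    """
    return (
        min_free_kbytes,
        min_free_kbytes * low_factor,
        min_free_kbytes * high_factor,
    )


def kswapd_headroom_kb(min_free_kbytes, low_factor=2):
    """Free memory between kswapd's wake-up point (low) and min, in KB."""
    wm_min, wm_low, _ = watermarks_kb(min_free_kbytes, low_factor)
    return wm_low - wm_min


for min_kb in (4096, 16384):  # hypothetical settings, not from the report
    wm_min, wm_low, wm_high = watermarks_kb(min_kb)
    print(f"min={wm_min} low={wm_low} high={wm_high} "
          f"headroom={kswapd_headroom_kb(min_kb)} KB")
```

Raising min_free_kbytes widens the gap between low and min linearly, while changing the scaling factors (for example low_factor=4, high_factor=6) widens it without raising min itself, which is the alternative suggested above.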
Oh, and one final comment: if you move the swap partition to a different device,
kswapd's IO will not stall other processes' IO requests.
Mike, can you please provide some feedback on how your systems
behave after applying the suggested tuning parameter settings?
Sorry for not getting back to you sooner (was away/busy). Turns out that the
users of the systems that were having a problem resolved the issue simply by
adding more physical memory (went from 1G to 2G).
Testing with the suggested tunings will be tough seeing as the systems in
question aren't under my control (or available to me). Has anyone that is
looking at this issue reproduced this behavior (Rik seemed to say he had seen it
periodically) or are you purely relying on me to make further progress?
I also have the same problem with Ubuntu on a Lenovo T520. I have 8 GB of RAM. It is definitely not a lack of RAM; it is a bug in kswapd.
$ uname -a
Linux dmugtasimov 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Linux Mint 13 Maya \n \l
$ free -m
             total       used       free     shared    buffers     cached
Mem:          7939       2301       5638          0         67       1051
-/+ buffers/cache:       1182       6757
Swap:        15999          0      15999
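For readers unfamiliar with the "-/+ buffers/cache" row, this sketch recomputes it from the "Mem:" row. The results can differ from the printed values by 1 MB because free converts each column from KB separately; the variable names are chosen for illustration.

```python
# Values in MB, copied from the "Mem:" row of the free -m output above.
total, used, free, buffers, cached = 7939, 2301, 5638, 67, 1051

# Memory held by applications, excluding reclaimable buffers and cache.
used_by_apps = used - buffers - cached
# Memory available to applications if buffers and cache were dropped.
free_for_apps = free + buffers + cached

print("used by applications:", used_by_apps, "MB")
print("free for applications:", free_for_apps, "MB")
print("percent of RAM free for applications:", free_for_apps * 100 // total)
```

By this measure well over four fifths of RAM is available and swap is untouched, which supports the poster's point that memory shortage is not the trigger.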