Bug 437202

Summary: kswapd causing system disk to have 50% io wait
Product: Red Hat Enterprise Linux 5 Reporter: Mike Snitzer <snitzer>
Component: kernel Assignee: Peter Zijlstra <pzijlstr>
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: low    
Version: 5.1 CC: dmugtasimov, lwang, lwoodman, riel
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-05-09 14:09:34 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Mike Snitzer 2008-03-12 20:34:59 UTC
Description of problem:
A RHEL5U1 system eventually gets bogged down waiting for IO to complete to the
system disk.  kswapd is quite active.  Using blktrace it is clear that kswapd is
by far the largest producer of write IO to the system disk.  Other services
trying to read from the system disk must wait quite a long time for their
requests to complete.
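
A minimal sketch of the kind of blktrace session that shows this (assuming
/dev/sda is the system disk; not the exact commands used):

# blktrace -d /dev/sda -o kswapd_trace -w 30     (capture ~30s of block-layer events)
# blkparse -i kswapd_trace | grep kswapd0 | head

blkparse annotates queued requests with the PID and name of the issuing process,
so kswapd0's write requests are easy to pick out and count.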

The system is a Dell 1950 w/ SMP 2.0GHz Intel Xeon (x86_64) and 1GB of memory.
It has a qlogic 2300 card to access LUNs on a SAN.

Version-Release number of selected component (if applicable):
2.6.18-53.el5

How reproducible:
As frequently as 3 times a day or as infrequently as once every 3 days.

Steps to Reproduce:
1. This issue can't be easily reproduced; the system users (on whose behalf I'm
reporting this bug) hit it periodically under varying workloads, e.g. after
formatting 64 5GB disks in parallel (a sketch of that workload is below).
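
A rough sketch of that kind of workload (device names are hypothetical, and
mkfs.ext3 is only a stand-in for whatever filesystem was actually used):

for i in $(seq 1 64); do
    mkfs.ext3 -q /dev/mapper/lun$i &
done
wait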

Actual results:

the system's memory is not heavily used:
crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM   223890     874.6 MB         ----
      FREE   178189     696.1 MB   79% of TOTAL MEM
      USED    45701     178.5 MB   20% of TOTAL MEM
    SHARED     4951      19.3 MB    2% of TOTAL MEM
   BUFFERS      333       1.3 MB    0% of TOTAL MEM
    CACHED     1950       7.6 MB    0% of TOTAL MEM
      SLAB    18914      73.9 MB    8% of TOTAL MEM

TOTAL HIGH        0            0    0% of TOTAL MEM
 FREE HIGH        0            0    0% of TOTAL HIGH
 TOTAL LOW   223890     874.6 MB  100% of TOTAL MEM
  FREE LOW   178189     696.1 MB   79% of TOTAL LOW

TOTAL SWAP   510061       1.9 GB         ----
 SWAP USED    37417     146.2 MB    7% of TOTAL SWAP
 SWAP FREE   472644       1.8 GB   92% of TOTAL SWAP

Various services running on the system become UN (uninterruptible), waiting for
IO to complete.  automount (which is _not_ being used, but the service was never
disabled) is quite busy.

kswapd is seemingly getting throttled by the VM:

crash> ps
...
    226     19   1  ffff81003f771080  UN   0.0       0      0  [kswapd0]
...
crash> bt 226
PID: 226    TASK: ffff81003f771080  CPU: 1   COMMAND: "kswapd0"
 #0 [ffff81003fb97c20] schedule at ffffffff80060f29
 #1 [ffff81003fb97d08] schedule_timeout at ffffffff80061839
 #2 [ffff81003fb97d58] io_schedule_timeout at ffffffff800611c7
 #3 [ffff81003fb97d88] blk_congestion_wait at ffffffff8003aa23
 #4 [ffff81003fb97dd8] kswapd at ffffffff80055b6d
 #5 [ffff81003fb97ee8] kthread at ffffffff800321d8
 #6 [ffff81003fb97f48] kernel_thread at ffffffff8005bfb1

Expected results:
kswapd should not be so active given that the system has a considerable amount
of memory free.

Additional info:
Rik van Riel <riel> posted a fix for kswapd 5 months ago here:
http://lkml.org/lkml/2007/9/25/457

That fix only recently got committed upstream and was included in 2.6.24; the
git commit was titled "kswapd should only wait on IO if there is IO"
commit: f1a9ee758de7de1e040de849fdef46e6802ea117

I had a look over other mm/vmscan.c changes that happened in the past year;
here are some seemingly relevant ones:

included in 2.6.21:
throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocations:
commit: 232ea4d69d81169453344b7d05203425c88d973b

included in 2.6.23:
mm: prevent kswapd from freeing excessive amounts of lowmem:
commit: 32a4330d4156e55a4888a201f484dbafed9504ed
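
For reference, a sketch of how such changes can be located in an upstream git
clone (not necessarily the exact workflow used here):

$ git log --oneline v2.6.18..v2.6.24 -- mm/vmscan.c | grep -i kswapd
$ git show --stat f1a9ee758de7de1e040de849fdef46e6802ea117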

I don't know if any of these changes will actually fix the problem.  I would
really appreciate it if Red Hat could help me understand whether there is a newer
released RHEL5 kernel that might address this "kswapd gone crazy" problem.

Comment 1 Mike Snitzer 2008-03-12 20:42:36 UTC
One important fact I forgot to mention:

If any of the system services (e.g. ntpd, autofs, etc.) is stopped, kswapd backs
off and the system disk's heavy IO subsides.  It should be noted that even when
the system is experiencing the kswapd load, the system isn't particularly loaded:
load average: 1.80, 1.82, 1.90

Once a service (e.g. autofs, aka automount) is stopped, the load on the system
returns to 0.
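
A rough sketch of how this is observed (autofs is just the example service):

$ iostat -x 5            (watch %iowait and the system disk's utilization)
# service autofs stop    (as root, in another terminal)

Once the service is stopped, kswapd's writes subside and the load average
returns to 0.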

Comment 2 Rik van Riel 2008-05-12 18:59:46 UTC
I suspect that one of the problems is that, when kswapd is started, almost no
memory is freeable.  This causes kswapd to free memory more and more
aggressively, increasing its free targets.

Under some circumstances - I have not figured out the problem yet, even though I
see it once a week or so on my own system - it looks like kswapd (and other
processes in the pageout code) will continue to free pages even after the system
has lots of free pages already.

I am not quite sure how to fix this, since sometimes the VM actually needs to do
this, e.g. to satisfy higher-order allocations.

Comment 3 Larry Woodman 2008-05-12 19:36:13 UTC
I added the tunable /proc/sys/vm/pagecache to RHEL5-U1.  This tunable (which
defaults to 100) controls the percentage of memory that can be in the pagecache
before we start reclaiming almost entirely from the pagecache.  It works by
having mark_page_accessed() place pagecache pages on the inactive list if the
percentage of memory in the pagecache is over /proc/sys/vm/pagecache.  This way,
if you lower /proc/sys/vm/pagecache to 10, the inactive list is almost entirely
pagecache pages, which are mostly clean due to pdflush and kupdate, so the
system does not need to swap, especially if the majority of the page demand is
via the pagecache.
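
For example, to lower the limit to 10 (a value chosen purely for illustration,
not a recommendation):

# cat /proc/sys/vm/pagecache
100
# echo 10 > /proc/sys/vm/pagecache

To make the setting persistent, add "vm.pagecache = 10" to /etc/sysctl.conf.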

Another observation is that min_free_kbytes controls the zone min, low and high
watermarks.  If you increase min_free_kbytes, low gets min*2 and high gets
min*3.  This will cause kswapd to be woken up earlier and to free more pages
before it stops running.  If this does not work, we probably need to change the
scaling of low and high so they are much higher than 2 and 3 times min,
respectively.  That way the allocator will wake up kswapd well before it drives
the free list down below min, and kswapd has a chance of keeping up with the
memory demand.
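
For example (16384 is only an illustrative value, not a recommendation):

# cat /proc/sys/vm/min_free_kbytes            (current value)
# echo 16384 > /proc/sys/vm/min_free_kbytes   (raise min; low and high scale as min*2 and min*3)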

Larry Woodman

Oh, and one final comment: if you move the swap partition to a different device,
kswapd's IO will not stall other processes' IO requests.
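
A rough sketch of that (the device names below are hypothetical; adjust for the
actual layout and update the swap entry in /etc/fstab afterwards):

# swapoff /dev/sda2          (retire the swap partition on the system disk)
# mkswap /dev/sdb1           (hypothetical partition on a different device)
# swapon /dev/sdb1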


Comment 4 Linda Wang 2008-07-03 23:49:48 UTC
Mike, can you please provide some feedback on how your systems
behave after applying the suggested tuning parameter settings?

Comment 6 Mike Snitzer 2008-07-16 20:08:13 UTC
Sorry for not getting back to you sooner (was away/busy).  It turns out that the
users of the systems that were having the problem resolved the issue simply by
adding more physical memory (going from 1GB to 2GB).

Testing with the suggested tunings will be tough, seeing as the systems in
question aren't under my control (or available to me).  Has anyone looking at
this issue reproduced the behavior (Rik seemed to say he had seen it
periodically), or are you relying purely on me to make further progress?

Comment 10 Dmitry Mugtasimov 2013-02-20 10:03:06 UTC
I also have the same problem on my Ubuntu system on a Lenovo T520.  I have 8 GB of RAM.  It is definitely not a lack of RAM; it is a bug in kswapd.

$ uname -a
Linux dmugtasimov 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Linux Mint 13 Maya \n \l
$ free -m
             total       used       free     shared    buffers     cached
Mem:          7939       2301       5638          0         67       1051
-/+ buffers/cache:       1182       6757
Swap:        15999          0      15999