Red Hat Bugzilla – Bug 117902
kswapd consumes a large amount of CPU for an extended period of time
Last modified: 2007-11-30 17:06:54 EST
Description of problem:
On an active RHAS 2.1 system, kswapd can begin consuming a large amount of CPU for an extended period of time (hours). It stays at or near the top of a "top" listing the entire time, essentially consuming an entire CPU (on a multiprocessor system) throughout the interval. It appears that the required eliciting condition is an active system on which all memory is either allocated or has been used for buffers/cache (leaving only a very small amount of genuinely "free" memory). On a system where this was occurring, memory stats were as follows:

# free
             total       used       free     shared    buffers     cached
Mem:       3924400    3914832       9568    1203884     125956    1394120
-/+ buffers/cache:     2394756    1529644
Swap:      6289320     755704    5533616

# cat /proc/swaps
Filename        Type            Size     Used    Priority
/dev/sda6       partition       2096440  755704  -1
/dev/sda7       partition       2096440  0       -2
/dev/sda8       partition       2096440  0       -3

Note that this bug appears to be the same as bug 58406, which was opened against RH 7.2 but then closed because it was reportedly fixed in RH 7.3. Apparently whatever fix appeared in RH 7.3 was never backported to RHAS 2.1. Since RHAS 2.1 is in use by enterprise customers and will be supported for many years, it would seem that a fix needs to be backported.

Also, this bug appears to be related to bug 117460 (previously filed by us), in that they both manifest themselves on a RHAS 2.1 system on which all memory is in use (specifically, both bugs are showing up on an RHAS 2.1 system running Oracle, though we've seen the same behavior on a non-Oracle system as well). They are also related in the sense that they are both instances of bugs that were addressed in previous RH errata, but those fixes were not rolled into RHAS 2.1.

I'm marking this as high priority (like bug 117460) because it can directly impact the stability of a production system; since RHAS 2.1 is intended for enterprise use, this is clearly a problem.

Version-Release number of selected component (if applicable):
kswapd / kernel-smp-2.4.9-e.38

How reproducible:
See above. Unfortunately we've discovered no systematic way to reproduce the bug. It seems to require a system on which a fairly large amount of memory is allocated, and the bulk of the remainder is allocated for buffers/cache, leaving only a small amount of free RAM; basically, RHAS 2.1 doesn't deal with this condition very well.

Actual results:
kswapd consumes large amounts of CPU for extended periods of time.

Expected results:
kswapd consumes negligible amounts of CPU for extended periods of time.

Additional info:
While it's primarily kswapd that consumes CPU during these intervals, krefilld also occasionally shows up at the top of the "top" listing for minutes at a time (even above kswapd, while this is happening).
Please get /proc/slabinfo, top, and "AltSysrq M" outputs when the system is in this condition. Larry Woodman
I'll get you the /proc/slabinfo contents the next time kswapd goes crazy. Alt-Sysrq-M is a problem though, because this is a production database server and it cannot go down, ever--in fact we have kernel.sysrq disabled on there. Is there some non-invasive way to get the information you're looking for? Also, what specifically do you want from top...just the standard display? Other than the memory stats (which I showed you with the "free" output) I'm not sure what you might be looking for. If it's kswapd's CPU time, here's the current ps output for it:

root        10  0.3  0.0     0    0 ?  SW   Feb27  52:53 [kswapd]
I was able to get kswapd going on another database server (not production) using the method from bug 58406 of dumping a large filesystem. Not sure if that's useful to you or not, but it can't hurt while we're waiting for the production server to run into the problem again. Here are the results from that test:

# top
 9:09pm up 8 days, 5:58, 3 users, load average: 1.75, 1.74, 1.21
153 processes: 152 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 1.0% user, 33.0% system, 0.0% nice, 65.0% idle
CPU1 states: 0.0% user, 33.0% system, 0.0% nice, 66.0% idle
CPU2 states: 0.0% user, 14.0% system, 0.0% nice, 85.0% idle
CPU3 states: 1.0% user, 5.0% system, 0.0% nice, 93.0% idle
Mem: 3924400K av, 3916792K used, 7608K free, 0K shrd, 627528K buff
Swap: 6289320K av, 11360K used, 6277960K free   3098000K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
   10 root      16   0     0    0     0 SW   33.0  0.0   2:49 kswapd
22905 root      16   0  1012  480   264 D    14.0  0.0   0:06 dump
22903 root      15   0  1012  480   264 S    13.0  0.0   0:06 dump
   13 root      39   0     0    0     0 SW   12.0  0.0   0:22 bdflush
22904 root      15   0  1012  480   264 S     8.0  0.0   0:06 dump
22902 root      15   0  2108  512   264 S     3.0  0.0   0:02 dump
21556 root      15   0  1096 1096   824 R     2.0  0.0   0:05 top
    1 root      15   0   116   68    68 S     0.0  0.0   0:03 init
    2 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    3 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd

# cat /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache 80 80 244 5 5 1 : 252 126
nfs_read_data 0 0 384 0 0 1 : 124 62
nfs_inode_cache 2 34 224 2 2 1 : 252 126
nfs_write_data 0 0 384 0 0 1 : 124 62
nfs_page 0 0 96 0 0 1 : 252 126
ip_fib_hash 15 226 32 2 2 1 : 252 126
journal_head 27222 27222 48 349 349 1 : 252 126
revoke_table 6 253 12 1 1 1 : 252 126
revoke_record 0 0 32 0 0 1 : 252 126
clip_arp_cache 0 0 128 0 0 1 : 252 126
ip_mrt_cache 0 0 96 0 0 1 : 252 126
tcp_tw_bucket 30 30 128 1 1 1 : 252 126
tcp_bind_bucket 15 226 32 2 2 1 : 252 126
tcp_open_request 1 40 96 1 1 1 : 252 126
inet_peer_cache 0 0 64 0 0 1 : 252 126
ip_dst_cache 12 100 192 5 5 1 : 252 126
arp_cache 3 90 128 3 3 1 : 252 126
blkdev_requests 38400 38680 96 967 967 1 : 252 126
kioctx 0 0 96 0 0 1 : 252 126
kiocb 0 0 96 0 0 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
file lock cache 183 252 92 6 6 1 : 252 126
async poll table 0 0 140 0 0 1 : 252 126
fasync cache 0 0 16 0 0 1 : 252 126
uid_cache 5 226 32 2 2 1 : 252 126
skbuff_head_cache 1273 2040 160 85 85 1 : 252 126
sock 105 135 1312 45 45 1 : 60 30
sigqueue 58 58 132 2 2 1 : 252 126
kiobuf 0 0 8768 0 0 4 : 0 0
cdev_cache 18 236 64 4 4 1 : 252 126
bdev_cache 10 118 64 2 2 1 : 252 126
mnt_cache 19 118 64 2 2 1 : 252 126
inode_cache 52828 59013 448 6557 6557 1 : 124 62
dentry_cache 559 1950 128 65 65 1 : 252 126
dquot 0 0 128 0 0 1 : 252 126
filp 1097 1240 96 31 31 1 : 252 126
names_cache 3 3 4096 3 3 1 : 60 30
buffer_head 622070 657880 96 16444 16447 1 : 252 126
mm_struct 260 260 192 13 13 1 : 252 126
vm_area_struct 1525 2006 64 34 34 1 : 252 126
fs_cache 297 354 64 6 6 1 : 252 126
files_cache 117 117 416 13 13 1 : 124 62
signal_act 102 102 1312 34 34 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 4 4 65536 4 4 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 18 18 32768 18 18 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 0 0 16384 0 0 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 17 17 8192 17 17 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 44 44 4096 44 44 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 860 918 2048 459 459 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 252 252 1024 63 63 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 304 304 512 38 38 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 390 390 256 26 26 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 1044 1170 128 39 39 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 434 885 64 15 15 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 1525 4068 32 36 36 1 : 252 126

# free
             total       used       free     shared    buffers     cached
Mem:       3924400    3915392       9008          0     660120    3077868
-/+ buffers/cache:      177404    3746996
Swap:      6289320      11360    6277960

# cat /proc/swaps
Filename        Type            Size     Used   Priority
/dev/sda6       partition       2096440  10544  -1
/dev/sda7       partition       2096440  816    -2
/dev/sda8       partition       2096440  0      -3
We really need the AltSysrq-M output to debug this problem. Can you try "echo m > /proc/sysrq-trigger" and "dmesg" ? That should get the same results as the console keyboard. Larry
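Something along these lines should capture everything in one pass (the output file names are just examples; if the write to /proc/sysrq-trigger is rejected because kernel.sysrq is disabled, it may be necessary to set "sysctl -w kernel.sysrq=1" temporarily and then set it back):

  # echo m > /proc/sysrq-trigger
  # dmesg > /tmp/sysrq-m.txt
  # cat /proc/slabinfo > /tmp/slabinfo.txt
  # top -b -n 1 > /tmp/top.txt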
I know of another report of the same problem which says "this problem doesn't reproduce when RAM is 2GB rather than 4GB." (A borderline problem?)
Perhaps, but I still need that AltSysrq-M output to determine that. Any luck getting it yet? Larry
Akira: I can easily trigger kswapd via the dump test on a machine with 2GB of RAM, so I don't think it's restricted to machines with 4GB.

Larry: I've got the output you wanted from the other machine, after running another dump to trigger kswapd. I'm not sure if the dump test is actually giving you what you need, though--let me know (since this bug was already reported and fixed previously, I assume you're just looking for some kind of verification anyway). The production database server has been having the problem once a day, so with any luck I'll be able to capture the output you want from it fairly soon as well.

Here's the output from the dump test (a 0-level dump of a filesystem to /dev/null):

--> /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache 80 80 244 5 5 1 : 252 126
nfs_read_data 0 0 384 0 0 1 : 124 62
nfs_inode_cache 2 34 224 2 2 1 : 252 126
nfs_write_data 0 0 384 0 0 1 : 124 62
nfs_page 0 0 96 0 0 1 : 252 126
ip_fib_hash 15 226 32 2 2 1 : 252 126
journal_head 20106 31824 48 338 408 1 : 252 126
revoke_table 6 253 12 1 1 1 : 252 126
revoke_record 0 0 32 0 0 1 : 252 126
clip_arp_cache 0 0 128 0 0 1 : 252 126
ip_mrt_cache 0 0 96 0 0 1 : 252 126
tcp_tw_bucket 1 30 128 1 1 1 : 252 126
tcp_bind_bucket 14 226 32 2 2 1 : 252 126
tcp_open_request 1 40 96 1 1 1 : 252 126
inet_peer_cache 2 59 64 1 1 1 : 252 126
ip_dst_cache 44 140 192 7 7 1 : 252 126
arp_cache 3 90 128 3 3 1 : 252 126
blkdev_requests 38400 38680 96 967 967 1 : 252 126
kioctx 0 0 96 0 0 1 : 252 126
kiocb 0 0 96 0 0 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
file lock cache 126 252 92 6 6 1 : 252 126
async poll table 0 0 140 0 0 1 : 252 126
fasync cache 0 0 16 0 0 1 : 252 126
uid_cache 5 226 32 2 2 1 : 252 126
skbuff_head_cache 1171 2016 160 84 84 1 : 252 126
sock 131 147 1312 49 49 1 : 60 30
sigqueue 82 87 132 3 3 1 : 252 126
kiobuf 0 0 8768 0 0 4 : 0 0
cdev_cache 18 236 64 4 4 1 : 252 126
bdev_cache 10 118 64 2 2 1 : 252 126
mnt_cache 19 118 64 2 2 1 : 252 126
inode_cache 52948 59166 448 6574 6574 1 : 124 62
dentry_cache 687 2160 128 72 72 1 : 252 126
dquot 0 0 128 0 0 1 : 252 126
filp 1097 1240 96 31 31 1 : 252 126
names_cache 3 3 4096 3 3 1 : 60 30
buffer_head 913370 914000 96 22850 22850 1 : 252 126
mm_struct 280 280 192 14 14 1 : 252 126
vm_area_struct 1703 2006 64 34 34 1 : 252 126
fs_cache 290 413 64 7 7 1 : 252 126
files_cache 113 153 416 17 17 1 : 124 62
signal_act 103 105 1312 35 35 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 4 4 65536 4 4 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 18 18 32768 18 18 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 0 0 16384 0 0 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 17 17 8192 17 17 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 67 67 4096 67 67 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 874 928 2048 464 464 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 260 260 1024 65 65 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 304 304 512 38 38 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 546 675 256 45 45 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 1170 1170 128 39 39 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 685 944 64 16 16 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 1592 4859 32 43 43 1 : 252 126

--> top
12:17pm up 10 days, 21:05, 2 users, load average: 4.45, 1.78, 0.66
155 processes: 151 sleeping, 4 running, 0 zombie, 0 stopped
CPU0 states: 0.5% user, 21.3% system, 0.0% nice, 77.1% idle
CPU1 states: 0.3% user, 12.0% system, 0.0% nice, 87.1% idle
CPU2 states: 1.2% user, 34.0% system, 0.0% nice, 64.1% idle
CPU3 states: 0.0% user, 22.0% system, 0.0% nice, 77.4% idle
Mem: 3924400K av, 3918784K used, 5616K free, 0K shrd, 658120K buff
Swap: 6289320K av, 11284K used, 6278036K free   3087656K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
   10 root      15   0     0    0     0 SW   21.5  0.0   5:07 kswapd
 8043 root      16   0  1008  476   264 R    20.9  0.0   0:17 dump
 8041 root      15   0  1008  476   264 R    20.1  0.0   0:17 dump
 8042 root      15   0  1008  476   264 R    17.2  0.0   0:16 dump
   13 root      39   0     0    0     0 SW    4.9  0.0   0:31 bdflush
 8040 root      15   0  2104  508   264 S     3.9  0.0   0:04 dump
  377 root      15   0     0    0     0 DW    2.3  0.0   0:24 kjournald
 8021 root      15   0  1092 1092   824 R     0.3  0.0   0:00 top
    1 root      15   0   116   68    68 S     0.0  0.0   0:03 init
    2 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    3 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd

--> AltSysrq M
SysRq : Show Memory
Mem-info:
Free pages:       19892kB ( 2040kB HighMem)
( Active: 444607, inactive_dirty: 412756, inactive_clean: 66514, free: 4973 (638 1276 1914) )
179*4kB 9*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1300kB
active: 1424, inactive_dirty: 145, inactive_clean: 0, free: 325 (128 256 384)
3768*4kB 60*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 16560kB
active: 125235, inactive_dirty: 31299, inactive_clean: 0, free: 4149 (255 510 765)
2*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2040kB
active: 317948, inactive_dirty: 381312, inactive_clean: 66514, free: 510 (255 510 765)
Swap cache: add 2295856, delete 2294373, find 1121400/1121926
Page cache size: 775110
Buffer mem: 148767
Ramdisk pages: 0
Free swap:       6278036kB
1048576 pages of RAM
770028 pages of HIGHMEM
67476 reserved pages
713490 pages shared
1484 pages swap cached
22 pages in page table cache
33274 pages in slab cache
Buffer memory:   595068kB
   CLEAN: 518648 buffers, 2074496 kbyte, 236 used (last=517543), 2 locked, 0 protected, 0 dirty
  LOCKED: 23366 buffers, 93464 kbyte, 23366 used (last=23366), 12084 locked, 0 protected, 0 dirty
Ok, we just saw the kswapd issue occur naturally on the production database server. This wasn't a severe instance...we've had it go for hours in the past, and this was just for a few minutes. But it may give you what you need.

BTW, there are suggestions in the Oracle support forums about mitigating this problem by changing the cache settings via /proc/sys/vm/pagecache (for instance, setting it to "2 10 30"). Does this make sense as a workaround?

Here's the output you requested:

--> /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache 80 80 244 5 5 1 : 252 126
nfs_read_data 110 110 384 11 11 1 : 124 62
nfs_inode_cache 185 306 224 18 18 1 : 252 126
nfs_write_data 30 50 384 5 5 1 : 124 62
nfs_page 171 240 96 5 6 1 : 252 126
ip_fib_hash 17 339 32 3 3 1 : 252 126
journal_head 338 1560 48 20 20 1 : 252 126
revoke_table 6 253 12 1 1 1 : 252 126
revoke_record 0 0 32 0 0 1 : 252 126
clip_arp_cache 0 0 128 0 0 1 : 252 126
ip_mrt_cache 0 0 96 0 0 1 : 252 126
tcp_tw_bucket 4 60 128 2 2 1 : 252 126
tcp_bind_bucket 19 452 32 4 4 1 : 252 126
tcp_open_request 3 40 96 1 1 1 : 252 126
inet_peer_cache 0 0 64 0 0 1 : 252 126
ip_dst_cache 95 380 192 19 19 1 : 252 126
arp_cache 4 60 128 2 2 1 : 252 126
blkdev_requests 38400 38680 96 967 967 1 : 252 126
kioctx 0 0 96 0 0 1 : 252 126
kiocb 0 0 96 0 0 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
file lock cache 175 378 92 9 9 1 : 252 126
async poll table 0 0 140 0 0 1 : 252 126
fasync cache 0 0 16 0 0 1 : 252 126
uid_cache 6 226 32 2 2 1 : 252 126
skbuff_head_cache 1271 1896 160 79 79 1 : 252 126
sock 295 366 1312 122 122 1 : 60 30
sigqueue 4 58 132 2 2 1 : 252 126
kiobuf 0 0 8768 0 0 4 : 0 0
cdev_cache 234 236 64 4 4 1 : 252 126
bdev_cache 10 118 64 2 2 1 : 252 126
mnt_cache 19 177 64 3 3 1 : 252 126
inode_cache 999 3447 448 383 383 1 : 124 62
dentry_cache 828 2580 128 86 86 1 : 252 126
dquot 0 0 128 0 0 1 : 252 126
filp 5622 5840 96 146 146 1 : 252 126
names_cache 5 5 4096 5 5 1 : 60 30
buffer_head 48484 196240 96 4906 4906 1 : 252 126
mm_struct 344 660 192 33 33 1 : 252 126
vm_area_struct 9748 10266 64 174 174 1 : 252 126
fs_cache 346 708 64 12 12 1 : 252 126
files_cache 344 468 416 52 52 1 : 124 62
signal_act 302 336 1312 112 112 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 4 4 65536 4 4 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 18 18 32768 18 18 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 0 0 16384 0 0 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 17 17 8192 17 17 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 44 45 4096 44 45 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 960 960 2048 480 480 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 280 280 1024 70 70 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 288 288 512 36 36 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 330 330 256 22 22 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 1200 1200 128 40 40 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 705 1003 64 17 17 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 1359 5537 32 49 49 1 : 252 126

--> top
1:58pm up 13 days, 23:13, 2 users, load average: 1.86, 0.80, 0.54
324 processes: 321 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states: 3.3% user, 41.4% system, 0.0% nice, 54.3% idle
CPU1 states: 12.0% user, 14.0% system, 0.0% nice, 73.5% idle
CPU2 states: 11.5% user, 22.3% system, 0.0% nice, 65.3% idle
CPU3 states: 10.2% user, 30.2% system, 0.0% nice, 59.1% idle
Mem: 3924400K av, 3919436K used, 4964K free, 1189344K shrd, 98792K buff
Swap: 6289320K av, 683788K used, 5605532K free   1531032K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
   10 root      20   0     0    0     0 RW   64.1  0.0  94:08 kswapd
27055 oracle    18   0 1108M 1.1G 1101M R    44.9 28.7   9:06 oracle
29762 root      39   0  1208 1208   824 R    14.7  0.0   1:17 top
30322 oracle    15   0 1052M 1.0G 1044M S     7.6 27.3   4:16 oracle
 3935 oracle    15   0  207M 206M  200M S     6.2  5.3   0:05 oracle
26997 oracle    15   0 1144M 1.1G 1133M S     5.6 29.6  13:57 oracle
26706 oracle    15   0 1134M 1.1G 1126M S     3.2 29.4  11:24 oracle
 8748 oracle    16   0  171M 169M  169M S     2.3  4.4   0:29 oracle
27207 oracle    15   0 1111M 1.1G 1109M S     2.2 28.9   8:10 oracle
 8746 oracle    15   0  171M 169M  169M S     1.6  4.4   0:29 oracle

--> AltSysrq M
SysRq : Show Memory
Mem-info:
Free pages:        5148kB ( 2344kB HighMem)
( Active: 533299, inactive_dirty: 158519, inactive_clean: 173836, free: 1287 (638 1276 1914) )
1*4kB 1*8kB 13*16kB 3*32kB 9*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1020kB
active: 1865, inactive_dirty: 985, inactive_clean: 0, free: 255 (128 256 384)
15*4kB 2*8kB 73*16kB 15*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1788kB
active: 110216, inactive_dirty: 73392, inactive_clean: 0, free: 447 (255 510 765)
58*4kB 10*8kB 1*16kB 9*32kB 5*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 2344kB
active: 421218, inactive_dirty: 84142, inactive_clean: 173836, free: 586 (255 510 765)
Swap cache: add 1141819, delete 980856, find 2998592/3011923
Page cache size: 840946
Buffer mem: 24708
count_ramdisk_pages: pagemap_lru_lock locked
Ramdisk pages: 0
Free swap:       5605532kB
1048576 pages of RAM
770028 pages of HIGHMEM
67476 reserved pages
14755969 pages shared
160943 pages swap cached
37 pages in page table cache
5709597 pages in slab cache
Buffer memory:    98832kB
   CLEAN: 25621 buffers, 102106 kbyte, 151 used (last=25593), 0 locked, 0 protected, 0 dirty
   DIRTY: 70 buffers, 280 kbyte, 70 used (last=70), 0 locked, 0 protected, 39 dirty
Ah, I see the problem here; I missed the comment about the "dump test". This is causing the normal zone to be consumed with buffermem pages! Since buffermem pages must come from lowmem (from the DMA and/or normal zone), and because buffermem pages have no mapping (they are not linked on an inode) in AS2.1, excessive buffermem usage causes kswapd to work very hard without accomplishing enough. What is this "dump test" program, where can I get it, and do you really need to be running it as part of your production environment? I will look into making kswapd reclaim buffermem pages more appropriately and aggressively under these conditions. Larry Woodman
Larry, no offense, but I think you may need to read the case notes a bit more closely. I'm wondering what you missed (I now understand why you weren't answering some questions...), so I'll just repeat the most important points:

1) This bug appears to be a duplicate of bug 58406. It seems worthwhile for you to find out what fix was put in for kswapd between RH 7.2 and RH 7.3 that led Arjan to close that case, and consider backporting that fix to RHAS 2.1 (he specifically mentions some improvements in kernel 2.4.9-21, and then says it was completely fixed in RH 7.3).

2) The "dump test" I'm talking about is a 0-level dump of a filesystem to /dev/null (e.g., "dump -0f /dev/null /bigfs"). This methodology was suggested by bug 58406--it's not something I came up with or was doing before I reported this bug, and it's certainly not anything we do in production.

3) I only ran the dump test on a non-production database server, in order to try to get some useful output for you to look at while we were waiting for the production database server to exhibit the behavior. That output is in comment #7.

4) I was finally able to capture an instance of the problem on the production database server; that output is in comment #8. Again, I did NOT run the dump test to trigger the instance of the problem in comment #8; it happened naturally, as part of the regular operation of this server (which is completely dedicated to running Oracle).

Those are the main points. As I mentioned, I've seen a suggestion to change the settings in /proc/sys/vm/pagecache (and later also /proc/sys/vm/buffermem) as a workaround, though I'm still trying to determine reasonable values. I received a recommendation from the admin of an Oracle site that was running into the same bug to set them as follows:

vm.buffermem = 0 4 5
vm.pagecache = 2 10 20

He claimed this resolved the problem on their database server (although clearly it would be better to just fix the bug in the kernel itself).
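For anyone else who wants to experiment with that workaround, here is a minimal sketch of how the settings can be applied on the fly (it assumes the stock AS2.1 /proc/sys/vm/buffermem and /proc/sys/vm/pagecache tunables are present, and I can't vouch for these particular values on other workloads):

  # echo "0 4 5" > /proc/sys/vm/buffermem
  # echo "2 10 20" > /proc/sys/vm/pagecache

The same two settings can also go in /etc/sysctl.conf (as vm.buffermem and vm.pagecache) so they persist across reboots.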
John, I did read this case. However, once I read that 7.2 and 7.3 were being compared and that some 7.3 fix was missing from 2.4.9-21, I discounted most of it and focused on the actual problem you were seeing, because the VM code in 7.3 is so different from 7.2 that it's impossible to bring changes from 7.3 backward. AS2.1 is based on the same VM system that 7.2 is based on. From 7.3 on, *everything* changed in the VM, so much so that backporting anything to do with page reclamation is nearly impossible. And BTW, the "fix" that took place in the 7.3 kernel referenced in bug 58406 is pretty much a replacement of the VM system!

The problem you are having without the dump test running is that the number of pages in the slabcache is totally wrong ("5709597 pages in slab cache"), and kswapd uses that number to determine how hard it should work. The problem you are having with the dump test running is that most of lowmem is consumed in buffermem, and kswapd is not aggressive enough about reclaiming buffermem pages in AS2.1. I am building you a test kernel that fixes both of these problems.

The patch to fix the wrong slabcache page count is:

--- linux/mm/slab.c.orig
+++ linux/mm/slab.c
@@ -494,7 +494,8 @@ static inline void * kmem_getpages (kmem
 	 */
 	flags |= cachep->gfpflags;
 	addr = (void*) __get_free_pages(flags, cachep->gfporder);
-	slabpages += 1 << cachep->gfporder;
+	if (addr)
+		slabpages += 1 << cachep->gfporder;
 	/* Assume that now we have the pages no one else can legally
 	 * messes with the 'struct page's.
 	 * However vm_scan() might try to test the structure to see if

The patch to fix the excessive number of buffermem pages is:

--- linux/mm/vmscan.c.orig
+++ linux/mm/vmscan.c
@@ -639,10 +639,11 @@ dirty_page_rescan:
 		 *
 		 * 1) we avoid a writeout for that page if its dirty.
 		 * 2) if its a buffercache page, and not a pagecache
-		 *    one, we skip it since we cannot move it to the
-		 *    inactive clean list --- we have to free it.
+		 *    one, we skip(unless its a lowmem zone and buffermem
+		 *    is over maxpercent) it since we cannot move it to
+		 *    the inactive clean list --- we have to free it.
 		 */
-		if (zone_free_plenty(page->zone)) {
+		if (zone_free_plenty(page->zone) && !buffermem_over_max()) {
 			/* here be dragons! do not change this or it breaks */
 			if (!page->mapping || page_dirty(page)) {
 				list_del(page_lru);

--- linux/include/linux/pagemap.h.orig
+++ linux/include/linux/pagemap.h
@@ -35,6 +35,14 @@
 extern void page_steal_zone(struct zone_struct *, int);
 extern buffer_mem_t page_cache;
 
+static inline int buffermem_over_max(void)
+{
+	int buffermem = atomic_read(&buffermem_pages);
+	int limit = max_low_pfn * buffer_mem.max_percent / 100;
+
+	return buffermem > limit;
+}
+
 static inline int pagecache_over_max(void)
 {
 	int pagecache = atomic_read(&page_cache_size) - swapper_space.nrpages;
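As a rough cross-check of that bogus counter, the true number of slab pages can be estimated from /proc/slabinfo. This is just a sketch assuming the slabinfo 1.1 layout shown above; it counts fields from the end of each line (total slabs times pages per slab), so cache names containing spaces don't throw it off:

  # awk 'NR > 1 { pages += $(NF-4) * $(NF-3) } END { print pages, "pages in slab caches (approx)" }' /proc/slabinfo

In the AltSysrq-M output above, "5709597 pages in slab cache" is more than the machine's "1048576 pages of RAM", so that counter is clearly garbage accounting rather than real slab growth.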
Just for reporting purposes, we are also experiencing this same issue. We have a quad-processor Compaq server with 4GB of memory and are experiencing the same results (kswapd taking up large amounts of CPU and memory). Larry Grillo lg34@dcx.com
OK Guys, I made an attempt to fix this problem. When you get a chance please try out the appropriate kernel in: http://people.redhat.com/~lwoodman/.for_bug117902/ Larry Woodman
Actually you can easily reproduce/test this yourself: just run a "dump -0f /dev/null /usr" (or some other filesystem) repeatedly in one window, and run top in another window. I can't install this kernel on our production systems, but I did put it on a test system (2GB, 2 CPUs) and run the dump test there. Under the e.38 kernel kswapd consumes about 17-25% CPU during the dump test, but with the e.39.1 kernel that you provided kswapd consumes about 6-10% CPU. It also seems to be at the top of the top listing less frequently. So this seems like an improvement (but since I don't know what the expected behavior is, I don't know if this is back to what would be considered normal). One caveat though. In my first round of tests with the e.39.1 kernel, I stopped the dump test after 3 dumps or so, and free memory was listed as about 20MB in top. But even with the dumps stopped and no other major activity on the system (this system is idle except for system processes), kswapd continued to consume around 3% CPU...and it never stopped. I let it run for 30 minutes before rebooting the system to revert to the e.38 kernel, and throughout that period kswapd was continually consuming 3% CPU. Not only that, but free memory *stayed* at 20MB throughout that entire interval...so whatever kswapd was doing, it wasn't actually reclaiming any memory. However, I was unable to reproduce this behavior after subsequent reboots and tests. Actually, after reverting and testing e.38 I did see similar behavior, but in that case kswapd only continued running for one or two minutes (again, at 3% CPU) after the dump test was stopped.
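To spell out the test procedure I've been using (the filesystem path and the sample counts are just examples), I run this in one window:

  # while true; do dump -0f /dev/null /usr; done

and record what kswapd/krefilld are doing in another, something like:

  # top -b -d 10 -n 360 | egrep 'kswapd|krefilld' > /var/tmp/kswapd-during-dump.txt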
Ok.. We've installed it on our production server (since that'll be the real test - risky - but confident). The changes will take effect on our next reboot, which will be later this evening (unless the system crashes earlier). I appreciate all your help!! Larry Grillo lg34@dcx.com
any feedback from the production server?
Not sure what Larry (Grillo) has seen so far, but we're planning to install the test kernel on our production database server this weekend during our maintenance window. Based on our historic CPU data for the system, it looks like the behavior is normal for the first week after a reboot and the problems with kswapd don't show up until at least the second week, so it'll take some time before we can say whether or not it appears to have resolved the kswapd issue (and others...).
John, we are still waiting for feedback on whether these changes fixed the excessive kswapd usage problem you reported. Thanks, Larry Woodman
As I said above, it'll take at least two weeks to really see if it's doing anything--and this is only the start of week two for us. Our historic CPU data show that the behavior doesn't normally kick in until the end of the second week. If it's still behaving normally by the end of *next* week (4/9/2004), we'll have a reasonable indication that the kswapd issue is fixed by this kernel. We've only seen minor spikes in system CPU so far, and I'm not sure if those were due to kswapd. As I mentioned in bug 117460, though, it doesn't seem to have fixed the ENOBUFS problems.
Sorry it's taken so long to reply back. However, I have pasted our results below. At the time of writing this update, our server has been running on the new kernel for 7 days, 14:12. It seems like kswapd (although running better) is still running for very long periods of time. Check out the info below. Let me know what you think.. Thanks !!
-----------------------------------------------------
355 processes: 354 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
CPU1 states: 1.0% user, 5.0% system, 0.0% nice, 92.0% idle
CPU2 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
CPU3 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
Mem: 4112276K av, 3945928K used, 166348K free, 328K shrd, 551992K buff
Swap: 2097096K av, 1153768K used, 943328K free   1998780K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
30762 pvcsadm   15   0  1264 1260   828 R     7.3  0.0   0:00 top
    1 root      15   0   508  460   460 S     0.0  0.0   0:04 init
    2 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    3 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    5 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:03 ksoftirqd_CPU0
    7 root      34  19     0    0     0 SWN   0.0  0.0   0:01 ksoftirqd_CPU1
    8 root      34  19     0    0     0 SWN   0.0  0.0   0:02 ksoftirqd_CPU2
    9 root      34  19     0    0     0 SWN   0.0  0.0   0:03 ksoftirqd_CPU3
   10 root      15   0     0    0     0 SW    0.0  0.0  74:10 kswapd
   11 root      15   0     0    0     0 SW    0.0  0.0   0:00 kreclaimd
   12 root      15   0     0    0     0 SW    0.0  0.0   4:06 krefilld
   13 root      15   0     0    0     0 SW    0.0  0.0   0:01 bdflush
   14 root      15   0     0    0     0 SW    0.0  0.0   0:06 kupdated
------------------------------------------------------------------
Take Care, Larry Grillo lg34@dcx.com
Larry, is there another problem other than kswapd running for 74 minutes out of a week? In AS2.1, kswapd is quite CPU intensive because it walks the page tables of all processes looking for mapped pages to reclaim. Basically kswapd has been running for about 1/136th of the time, which is less than 1% of one CPU. This is not unusual for AS2.1 when there is memory pressure. Larry Woodman
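To put a number on that (just arithmetic, using the 74:10 of accumulated CPU time and the 7 days, 14:12 of uptime from the listing above):

  $ echo 'scale=2; (74*60+10)*100 / (((7*24+14)*60+12)*60)' | bc
  .67

So kswapd has averaged roughly 0.7% of one CPU since boot.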
Larry G: In case it helps you track it down, we see a few fingerprints for this issue: 1) system CPU totals much higher than normal (our standard workload is almost entirely user CPU), which you can verify with sar -u; 2) kswapd staying at or near the top of a top listing for extended periods. We're now monitoring the latter via "top -b -n 20160 -d 30 > /var/tmp/topout" so we'll have a record if it happens when nobody's watching. Larry W: looking at our own systems, 74 minutes in a week does seem to be a fairly high total for kswapd. The total CPU for kswapd is often a good indicator that a system is having the problem, but the real point is whether or not kswapd accrued a large amount of CPU in one extended burst. We did have one minor burst this morning, in which kswapd accrued 1 minute of CPU time over a 25 minute stretch (averaging about 4% CPU in the top listing, which makes sense). Still watching to see if it'll hit a dramatic spike as it used to on the e.38 kernel, though.
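For the record, pulling the kswapd samples back out of that capture is nothing fancy. Something like this counts how many 30-second samples had kswapd above 20% CPU (it assumes %CPU is the 9th field of this top's process lines, as in the listings above; adjust the field number if your layout differs):

  # grep ' kswapd$' /var/tmp/topout | awk '$9 > 20 { n++ } END { print n+0, "samples with kswapd above 20% CPU" }'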
Larry, please see Service Request #315520.
Ok, it's 4/9/2004 and we've not run into any extended CPU hits because of kswapd; the biggest kswapd-related hit remains the 25-minute stretch I described in comment 22, which was nothing to worry about. So that's at least one week into the period when we were seeing problems before--and we *were* seeing the ENOBUFS errors (as per bug 117460), which normally accompany the kswapd problems, so it seems likely that if kswapd were still misbehaving we'd have seen it do so this week. So, it looks to me like this test kernel (e.39.1) may have fixed the kswapd issue.
Our DBA raised an alarm with me about krefilld recently, saying that he's seeing it hit the top of a "top" listing several times a day on our production database server (running the e.40.8smp kernel). The spikes seem to be fairly short (2-3 minutes), but total CPU time for krefilld seems somewhat high--141:56 over 11 days of uptime. At least, higher than it was in the past, and in the ballpark that kswapd used to reach. I don't really think this is cause for alarm since the system is fairly busy and I haven't seen any of the dramatic spikes that we used to see with kswapd, but I thought I'd mention it.
We've been trying a 40.8 kernel downloaded from Jason Baron's people website with good results. We are running Red Hat Cluster Manager on a heavily loaded machine doing lots of I/O. Kswapd was going berserk, consuming hours of CPU time (the machine being used is a Compaq DL380 G3, dual Xeon 3.06GHz, 9GB of memory). Kswapd went from consuming hours, sometimes tens of hours, of CPU time under heavy loads to values in the minutes. Back on e27, cluster manager was experiencing false failovers, we were getting ENOMEMs from reads and writes, and response time to interactive commands was poor. All this, and the performance of our load processes (high I/O levels) was bad as kswapd and the loads got into tight loops looking for memory. I'm pleased to say all this has changed with 40.8; the system is behaving under the same loads in a very civilized manner. I'm wondering why the patch isn't being included in the regular errata releases.
These changes will be included in U5. A preview of U5 is available at: http://people.redhat.com/~jbaron/.private/u5/2.4.9-e.41.12/ As always, any feedback that you can provide on that kernel is much appreciated.
When pushing the system with high I/O traffic I'm seeing krefilld consume between 50 and 60% CPU (monitoring via "top") for several minutes at a time. Considering that this task is running at priority 25, little else is getting done on the system. This behavior with krefilld started somewhere between kernel builds 2.4.9-e.24smp and 2.4.9-e.41smp. When I was running kernel 2.4.9-e.24smp, the system exhibited the behavior mentioned above of kswapd going wild and consuming 77 to 89% CPU for extended periods of time (up to several minutes). After upgrading to 2.4.9-e.41smp the behavior of kswapd got much better, but the krefilld behavior described above started showing up. Recently I've upgraded to 2.4.9-e.41.8smp (I can't find Jason's 2.4.9-e.41.12 mentioned above) and kswapd seems to be running just fine, but krefilld is still a heavy hitter. Is anyone looking into this? Thanks, Jim
krefilld does virtual page scanning to try to free pages when we are low on memory. RHEL 2.1 does not have a reverse mapping from pages to virtual addresses, so this can be very time-consuming; it is a fundamental limitation of RHEL 2.1. The latest beta kernel is at: http://people.redhat.com/~jbaron/.private/u5/2.4.9-e.47.5.3/
Thanks Jason, I tried the 2.4.9-e.47.5.3 kernel and krefilld is behaving much better in this version as is kswapd. I only noticed krefilld kicking in when free memory was going below 5MB and ended when free was over 5MB ( usually no more than a min of run time ). I also noticed the %CPU of krefilld now isn't going above 70% which is a good thing since it's running at priority 25. Jim
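In case it helps anyone else watch for this, here's the sort of loop I've been using to correlate krefilld activity with free memory (just a sketch; it assumes /proc/meminfo reports a MemFree: line and that ps supports -eo comm,time, and the log path is only an example):

  #!/bin/sh
  # Log free memory plus kswapd/krefilld cumulative CPU time every 10 seconds.
  while true; do
      free_kb=`awk '/^MemFree:/ { print $2 }' /proc/meminfo`
      times=`ps -eo comm,time | egrep '^(kswapd|krefilld)' | tr '\n' ' '`
      echo "`date '+%Y-%m-%d %H:%M:%S'` free=${free_kb}kB  ${times}"
      sleep 10
  done >> /var/tmp/krefilld-watch.log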
I take it 2.4.9-e.47.5.3 is whatever patchset you're using applied to the base e.47 kernel...do you have it for e.48? Since there's a major security fix in that kernel. Also, what's the timeline for releasing this fix officially (which I guess actually means: when is U5 due out)?
Hot off the presses is the U5 beta respun with the security fixes, e.49; please find it at the usual place: http://people.redhat.com/~jbaron/.private/u5/ U5 was on schedule to ship August 18th, but the security issues have pushed that date out a bit now, maybe ~1 week later...
James, please retest with the newer kernel.
I tested with the e.49 kernel and got the same results as I did with the e.47.5.3 kernel. Kswapd seemed to be well behaved and krefilld behaved as mentioned above. So it looks good, at least from my standpoint. Thanks Jim
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-437.html
Can somebody please tell me whether I can compile a new kernel from kernel-source-2.4.9-e3smp to solve this problem? Must I get a new RPM-format kernel, or should I build a new kernel from a tarball of the kernel source? And if the latter, what would I need to do to build a kernel from tarball source that fixes this annoying kswapd bug?
You really need to download a new version of the AS2.1 kernel. I believe we are up to 2.4.9-e54 or something like that...
Kernels e.49 and beyond have the fix. IIRC, the current released version is e.59. You can download it from the errata section of Red Hat Network.
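To answer the question a few comments up: there should be no need to rebuild anything from source. If the box is registered with RHN, something like "up2date -f kernel-smp" ought to pull the current errata kernel (I believe -f is needed because kernel packages are skipped by default). Alternatively, download the kernel-smp RPM from the errata page and install it alongside the running kernel, for example:

  # rpm -ivh kernel-smp-2.4.9-e.59.i686.rpm

then reboot into it. The e.59/i686 file name is just an example; substitute whatever version and architecture apply. Installing with -i rather than -U keeps the old kernel available as a fallback.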
I'm running 2.4.9-e.59summit on an IBM x445 running Oracle. I'm seeing krefilld just sit at the top of top; it's consumed 487.49 minutes of CPU time, and the machine has only been up for 20 days. The really strange thing is that only half of the available memory is being used: there's 8GB used and 8GB free. Is there anything in the works to fix krefilld?