Bug 78759 - The system randomly hangs on kswapd
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
i686 Linux
medium Severity medium
Assigned To: Larry Woodman
Brian Brock
Reported: 2002-11-29 02:55 EST by Pietro Dania
Modified: 2007-11-30 17:06 EST (History)
Doc Type: Bug Fix
Last Closed: 2007-10-19 15:25:47 EDT

Attachments
System profiling (284.76 KB, application/octet-stream)
2003-01-21 00:55 EST, Need Real Name
meminfo slabinfo and top (7.73 KB, text/plain)
2003-02-24 15:55 EST, keith mannth

Description Pietro Dania 2002-11-29 02:55:37 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.0.0-10; Linux 2.4.18-18.7.x; X11; i686; en_US)

Description of problem:
I experienced random system hangs on an HP lp2000r with two 1.133 GHz P3 CPUs, 1 GB RAM, six 18 GB RAID5 disks and one DLT8000 tape drive whenever uptime was >= 2 days.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Additional info:

When the machine had been up for at least 2 days, it often became incredibly slow: 'ps aux' and 'vmstat 1' printed a line every 2 seconds, ten seconds were needed to get the prompt after password entry at console login, and so on. Running 'top', I saw, after a couple of minutes, a large 'TIME' value for the 'kswapd' process, with 0K of swap space used.
I tried all updated kernels (both normal and enterprise ones) with the same results.
I finally installed RHL-7.3 and got the machine to work as expected.
Comment 1 Larry Woodman 2002-12-02 09:19:02 EST
I am confident that we have fixed the AS2.1 kswapd hang problems.
Would it be possible to try the new kernel out on a test machine?

The kernel can be found in:


Thanks, Larry Woodman
Comment 2 Ray DeJean 2002-12-03 12:24:48 EST
I am trying out your kernel on a test machine and it does not appear to fix 
the problem.  
18 root      15   0     0    0     0 SW    0.0  0.0   0:18 kswapd  
kernel 2.4.9-e.8corruptsdata.17enterprise  
Is there anything else I can try?
Comment 3 Need Real Name 2002-12-12 02:23:19 EST
We have a machine which may be suffering from the same problem:
root        18  1.6  0.0     0    0 ?        SW   Nov28 350:06 [kswapd]

However, the machine is "stable": it doesn't crash. It regularly slows down
for 15-60 seconds. During this time the machine locks up for a few seconds
at a time and then recovers. "while usleep 500000; do date; done" will skip
seconds during the slowdowns.

This is a 4 (physical) processor Xeon box with Hyperthreading. Kernel:

Is there a "summit" kernel I can test on this machine?
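
The usleep/date loop in this comment can be turned into a small stall detector that reports how long each freeze lasted instead of relying on someone noticing skipped seconds. A minimal sketch in Python; the function names and the one-second tolerance are illustrative assumptions, not something taken from this report:

```python
import time


def find_stalls(timestamps, interval=0.5, tolerance=1.0):
    """Return (start, end, gap) for every gap longer than interval + tolerance."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > interval + tolerance:
            stalls.append((prev, cur, gap))
    return stalls


def sample(count, interval=0.5):
    """Sleep `interval` seconds `count` times, recording each wake-up time."""
    stamps = [time.monotonic()]
    for _ in range(count):
        time.sleep(interval)
        stamps.append(time.monotonic())
    return stamps
```

Calling find_stalls(sample(120)) covers about one minute of half-second ticks; any tuple it returns marks a period in which the process was not scheduled for well over the requested sleep.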
Comment 4 Ray DeJean 2002-12-12 11:11:44 EST
Mark, those are the same symptoms I am having.  You have an x360 or x440?  I see
the problem on our 2 x360s.  I am running the summit kernel, but have tried all
the other kernels with no success.  I finally called Red Hat and they asked me to
run some stats (kernel profiling) when the problem happens and send them in.  I
can send this email to you.  I would suggest you call Red Hat and report the
problem also...  If this isn't fixed soon I will have to switch back to 7.3.
Comment 5 Pietro Dania 2002-12-13 03:56:05 EST
I'd like to make some remarks. In order to evaluate 2.1AS, should my customer decide to purchase it, I rebuilt it from SRPMS and deployed it by hand on top of a 7.2 (ugly but easy way). I only had UNCERTIFIED hardware available for testing (HP lp1000r/lp2000r). Everything seemed to work fine; I set up a couple of FOS clusters running a very light service (a multicast streaming server) and didn't observe any slowdown. I then installed the machine described above and started experiencing the problem.
Comment 6 Need Real Name 2002-12-19 00:34:29 EST
Ray, yes, this is happening on an x440.
Comment 7 Scott Carlson 2003-01-02 20:44:57 EST
I'm also having this problem on a Compaq DL380 w/2 Proc & 4GB Ram with 
Multithreading enabled.  kswapd consumes 99.9% of the CPU for 30-45 minutes at 
a time once the available memory that I have gets rather low.  We're running a 
very high volume sendmail application on this server that occasionally gets 
3000 sendmail children processes running on it at a time.

   6 root      25   0     0    0     0 RW   99.9  0.0 671:16 kswapd

More stuff if needed.  I have the information to turn on profiling, but since 
this is a production system, we have not done that yet.

Linux version 2.4.9-e.8smp (bhcompile@daffy.perf.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-108.1)) #1 SMP Fri Jul 19 15:38:30 EDT 2002
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
 BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000008fffc000 (usable)
 BIOS-e820: 000000008fffc000 - 0000000090000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
 BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
Scanning bios EBDA for MXT signature
1407MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000f4fd0
hm, page 000f4000 reserved twice.
hm, page 000f5000 reserved twice.
hm, page 000f3000 reserved twice.
hm, page 000f4000 reserved twice.
On node 0 totalpages: 589820
zone(0): 4096 pages.
zone(1): 225280 pages.
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: COMPAQ   Product ID: PROLIANT     APIC at: 0xFEE00000
Processor #3 Pentium(tm) Pro APIC version 16
Processor #0 Pentium(tm) Pro APIC version 16
I/O APIC #8 Version 17 at 0xFEC00000.
I/O APIC #2 Version 17 at 0xFEC01000.
Processors: 2
Kernel command line: ro root=/dev/cciss/c0d0p9
Initializing CPU#0
Detected 1396.545 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 2785.28 BogoMIPS
Memory: 2308028k/2359280k available (1976k kernel code, 42672k reserved, 107k data, 244k init, 1441776k highmem)
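
As a cross-check on the boot log above, the usable ranges of the BIOS-e820 map can be summed to see how much RAM the kernel was actually handed. A minimal sketch in Python; the regular expression and function name are illustrative, and the sample reproduces only a subset of the map lines pasted above:

```python
import re

# Matches lines such as " BIOS-e820: 0000000000100000 - 000000008fffc000 (usable)"
E820_LINE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+)\)")


def usable_bytes(dmesg_text):
    """Sum the sizes of all ranges the BIOS map marks as usable."""
    total = 0
    for start, end, kind in E820_LINE.findall(dmesg_text):
        if kind == "usable":
            total += int(end, 16) - int(start, 16)
    return total


SAMPLE = """\
 BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
 BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000008fffc000 (usable)
 BIOS-e820: 000000008fffc000 - 0000000090000000 (ACPI data)
"""

usable = usable_bytes(SAMPLE)
print(f"usable RAM: {usable / 2**20:.1f} MiB")
```

The total comes to roughly 2.3 GB, which agrees with the 1407MB HIGHMEM plus 896MB LOWMEM lines in the same log. That is less than the 4GB mentioned at the top of this comment; the pasted log does not say why.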
Comment 8 Annie 2003-01-06 13:46:18 EST
I think we are seeing the same thing with a Proliant DL760R PIII900 with 8 
processors and 8GB RAM.  We had kswapd problems under RH6.2, which were 
resolved by upgrading the kernel from 2.4.2 to 2.4.16 (aka "the Google 
problem").   But it hung again when we upgraded from 4cpu/4GB RAM to this 
machine, so now we are testing Advanced Server.  Hoping to put this machine in 
production next week.  Is there a way to confirm this is the issue?  Any 
monitoring we should be doing to help diagnose?  Any update on resolution?
Comment 9 Need Real Name 2003-01-21 00:55:31 EST
Created attachment 89454 [details]
System profiling

Here is some profiling which was done while experiencing the slowdowns, as
Comment 10 Larry Woodman 2003-01-21 09:44:33 EST
We are getting ready to release an AS2.1 errata kernel to fix the kswapd
issue you are seeing.  You can try out a beta release of that errata kernel
if you like, it can be downloaded from:


Larry Woodman

Comment 11 Need Real Name 2003-02-23 18:06:34 EST
After applying the 2.4.9-e.12summit kernel errata, the problem still appears to
be occurring.
Comment 12 keith mannth 2003-02-24 12:45:29 EST
   I have been testing 2.4.9-e.12summit on a x440. I think I have found
the root of the problem.  kswapd can't ever reclaim slabcache memory.
After running some I/O heavy scripts the memory state of the machine 
can be described as 
        total:    used:    free:  shared: buffers:  cached:
      Mem:  16880508928 7220150272 9660358656        0   978944 6545244160
      Swap: 2146754560        0 2146754560
      MemTotal:     16484872 kB
      MemFree:       9433944 kB
      MemShared:           0 kB
      Buffers:           956 kB
      Cached:        6391840 kB
      SwapCached:          0 kB
      Active:        3731408 kB
      Inact_dirty:       444 kB
      Inact_clean:   2660808 kB
      Inact_target:  4121192 kB
      HighTotal:    15859160 kB
      HighFree:      9432412 kB
      LowTotal:       625712 kB
      LowFree:          1532 kB
      SwapTotal:     2096440 kB
      SwapFree:      2096440 kB
      BigPagesFree:        0 kB

slabinfo - version: 1.1 (SMP)
mnt_cache             13     30    128    1    1    1 :  252  126
inode_cache       698012 698145    512 99735 99735    1 :  124   62
dentry_cache         512   1860    128   62   62    1 :  252  126
dquot                  0      0    128    0    0    1 :  252  126
filp                 929   1110    128   37   37    1 :  252  126
names_cache           12     12   4096   12   12    1 :   60   30
buffer_head       953133 1265100    128 42170 42170    1 :  252  126
mm_struct            186    195    256   13   13    1 :  252  126

cat /proc/sys/vm/pagecache   
 2       50      70

  This snapshot was taken about an hour after the scripts had finished.
Checks on top show kswapd using lots of CPU time trying to reclaim
as much lowmem as it can.
   During this state the box can easily hang under a strong attack on
lowmem usage. If the rest of lowmem is needed, kswapd will
go into a loop and stay there (at least 10-20 min), which in my
book is hung.
  There needs to be a way for the slabcache pages to be reclaimed.
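
To put numbers on the slabinfo listing above, the per-cache memory can be estimated as total slabs times pages per slab times the page size. A minimal sketch in Python; it assumes 4 kB pages and that the slabinfo 1.1 columns are name, active objects, total objects, object size, active slabs, total slabs and pages per slab (my reading of the format, not something stated in this report):

```python
PAGE_KB = 4  # assumed page size on i686


def slab_usage_kb(slabinfo_text):
    """Map cache name -> kB held by its slabs (total slabs * pages per slab)."""
    usage = {}
    for line in slabinfo_text.splitlines():
        # Drop the SMP tunables after ':' and keep the seven leading columns.
        head = line.split(":")[0].split()
        if len(head) != 7 or not head[1].isdigit():
            continue  # header line or anything that is not a cache entry
        total_slabs, pages_per_slab = int(head[5]), int(head[6])
        usage[head[0]] = total_slabs * pages_per_slab * PAGE_KB
    return usage


SAMPLE = """\
slabinfo - version: 1.1 (SMP)
inode_cache       698012 698145    512 99735 99735    1 :  124   62
dentry_cache         512   1860    128   62   62    1 :  252  126
buffer_head       953133 1265100    128 42170 42170    1 :  252  126
"""

for name, kb in sorted(slab_usage_kb(SAMPLE).items(), key=lambda kv: -kv[1]):
    print(f"{name:<14} {kb:>8} kB")
```

Under those assumptions inode_cache and buffer_head together account for several hundred MB, which is most of the LowTotal value in the meminfo output above.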

Comment 13 Larry Woodman 2003-02-24 13:36:06 EST
Keith, can you get us an "AltSysrq m" output and a full /proc/slabinfo
output when the machine is in this state and attach it to this bug?

FYI, we did add code to AS2.1 to cause some of the kernel data structures
to be reclaimed from the slab when there is a lowmem shortage.  Perhaps
we need to be more aggressive for 16GB systems though.

Thanks, Larry Woodman
Comment 14 keith mannth 2003-02-24 15:55:13 EST
Created attachment 90320 [details]
meminfo slabinfo and top  

This is the output from meminfo, slabinfo and top when the kernel is
in a low-lowmem state. (Kernel is 2.4.9-e.12summit)
Comment 15 rob lojek 2003-02-25 20:48:09 EST
We have a 2x1400 mhz P-III dl380 w/4 gB running 2.4.9-e.12enterprise.

I think we might have the same problem:

This machine becomes unusable during 1) nfs copies of big files or 2) dd'ing of
big files (e.g., dd if=/dev/zero of=/opt/2gB-testfile bs=1024 count=2M).

You can't login to the machine on the console; commands like "ls", "free" take
minutes to return.

If we boot into non-RHAS 2.4.20 kernel, everything works as it should. Boot back
into 2.4.9-e.12 (or any of the older ones), then problem exists.
Comment 16 Jon Nangle 2003-03-11 09:31:06 EST
We are also seeing this on a 2x2.8GHz/6GB hyperthreaded DL380 G3 running 2.4.9-

Larry, I notice that you have uploaded a -14 build to your website - is this 
worth trying or is it still a work in progress?
Comment 17 Larry Woodman 2003-03-11 14:35:29 EST
Jon, please try the kernel in http://people.redhat.com/lwoodman/.private/
It specifically flushes inodes and buffer headers so they can be returned to
the slab cache, and then the slab cache pages can be reclaimed.  This should
free up lowmem.

Please let me know how this works.

Larry Woodman
Comment 18 Paulr 2003-03-14 18:08:27 EST
Yep I am having the same problem. It only happens on 440s and 360s.
Kernel I am using is 2.4.9-e.12enterprise.

I find that it is easy to reproduce. I take a machine with, say, 2 GB of memory
and look at top.
It shows 2 GB installed, 1.4 GB free, 100 MB buffers and 300 MB cached,

and then I run:

scp -pr root@localhost:/usr /other/area

For about 20 minutes the copy streams fine, then it crawls to a halt. A du
in /other/area shows approx. 1.4 GB copied over, and all activity is now
affected: a ps takes 14+ seconds. A top shows that the 1.4 GB of free memory
has moved to cache (not buffers) and that we now have only approx. 5 MB free,
not enough to run smoothly. kswapd is busy but not flat out. I have noticed
that bdflush does not kick into life that often and that, even though I have
only 4-6 MB free, swap is at 0% used. I have tried this on MDK 9 and RH 8.0;
both are OK.


Comment 19 Paulr 2003-03-17 10:04:10 EST
Hi Guys,
   I've been testing 2.4.9-e.14.1enterprise and it looks like the problem is
fixed. Although it drops down to 5 MB at times it often jumps back to 15 / 20
MB, then after about 20 Mins. free jumps to 80 MB.


Comment 20 keith mannth 2003-03-18 16:58:08 EST
Hello all,
   I have tested 2.4.9-e.14.1summit on a 16 GB x440 box and the problem seems
to be fixed.  It is way better than before; I have been unable to hang the box
due to kswapd issues (unlike e.12).  Thanks for looking into this.

Comment 21 Jon Nangle 2003-03-21 04:21:26 EST
The new enterprise kernel is working well for us too - nice job.

Comment 22 Need Real Name 2003-03-24 10:04:04 EST
I've noticed that a new kernel (2.4.9-e.16) has been released, and I've been
asked whether this kernel contains the fixes that were applied to 2.4.9-e.14.1.
Can you confirm whether the new release also contains this fix?

Comment 23 Jon Nangle 2003-04-01 02:54:56 EST
We're still seeing this, albeit nowhere near as often as before. One of our DB 
servers (running Sybase on raw partitions) is getting this about once a day, 
usually when we are backing up a large DB to the filesystem (with several 
stripes). Is there something I can tune under /proc/sys/vm that will alleviate 
the problem? I don't mind giving up some buffer/cache space if it means that 
kswapd can recover more gracefully when under very heavy pressure.
Comment 24 Need Real Name 2003-04-02 01:29:59 EST
We have found the problem is still occurring with the e.16summit kernel. However,
it seems to occur a bit less often now than with the earlier kernels.
Comment 25 Jon Nangle 2003-04-02 02:42:01 EST
I have increased the values in /proc/sys/vm/freepages to attempt to give the
system a little more room for manoeuvre.
Comment 27 Jure Pecar 2003-05-05 12:55:45 EDT
I might have some useful info to add to this bug. My setup is a ~500 GB reiserfs
volume used for a cyrus mailstore on a dual P3 1.26 GHz box with 2 GB RAM.

With 2.4.9-e.3 i could measure deadlocks caused by kswapd in tens of minutes, so
the box was useless as a server. kswapd used more than 1300 minutes of cpu time
in about two days of uptime. The box never touched swap partition at all.

Upgrading to 2.4.9-e.16 makes a noticeable difference. kswapd still kicks in
occasionally, but it has only spent ~50 minutes of CPU time in two and a half
days of uptime. What bothers me is that when it starts its job, the box still
slows down to the point of being unusable, but fortunately this time is now
measured in seconds.

FYI: for this specific workload (cyrus mailstore), i found out that -aa kernels
give much better performance and also much better 'feeling' (responsiveness) of
the server. I'll try to do some tests when i finish migrating users from the old
system to this new one.
Comment 28 Hisaaki SHibata 2003-05-21 12:25:53 EDT
Our 4 servers running RH-AS2.1 also randomly hang and/or slow down because kswapd eats the CPU.

At first, kernel-2.4.9-e3 hung very often: once or twice a day.
After we upgraded the kernels to 2.4.9-e16, we get slowdowns twice a week.

The following are our 'top' results during the latest slowdown.
At that time, only 3 cron batch jobs were running.

If you need more information, please let us know what we should provide.

And please raise the priority of this bug fix to HIGH.

Best Regards,
Hisaaki Shibata

 ===== 2003/05/20 23:59:19 =====

 11:59pm  up 5 days, 14:49,  0 users,  load average: 0.08, 0.06, 0.01
166 processes: 164 sleeping, 2 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
CPU1 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
CPU2 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
CPU3 states:  0.1% user,  0.0% system,  0.0% nice, 99.0% idle
Mem:  2058832K av, 2044540K used,   14292K free,     652K shrd,  219500K buff
Swap: 4192944K av,      76K used, 4192868K free                 1653216K cached

   10 root      15   0     0    0     0 SW    0.0  0.0   2:15 kswapd

 ===== 2003/05/21 00:10:52 =====

 12:10am  up 5 days, 15:00,  0 users,  load average: 73.86, 67.02, 36.97
191 processes: 189 sleeping, 2 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user,  0.1% system,  0.0% nice, 99.0% idle
CPU1 states:  0.1% user,  0.1% system,  0.0% nice, 98.1% idle
CPU2 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
CPU3 states: 16.0% user,  3.0% system,  0.0% nice, 80.0% idle
Mem:  2058832K av, 2045536K used,   13296K free,     652K shrd,  225480K buff
Swap: 4192944K av,      76K used, 4192868K free                 1644660K cached

   10 root      15   0     0    0     0 SW    0.0  0.0  12:17 kswapd

 ===== 2003/05/21 00:12:53 =====

 12:12am  up 5 days, 15:02,  0 users,  load average: 12.72, 45.89, 32.86
180 processes: 177 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user,  0.1% system,  0.0% nice, 99.0% idle
CPU1 states: 90.0% user,  9.0% system,  0.0% nice,  0.0% idle
CPU2 states:  0.1% user,  0.1% system,  0.0% nice, 98.0% idle
CPU3 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
Mem:  2058832K av, 2053496K used,    5336K free,     652K shrd,  240600K buff
Swap: 4192944K av,      76K used, 4192868K free                 1635976K cached

   10 root      15   0     0    0     0 SW    0.0  0.0  12:17 kswapd
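
The %CPU column in these snapshots reads 0.0 for kswapd, but the accumulated TIME column shows what happened between samples. A minimal sketch in Python that turns two snapshots into an average CPU share; it assumes the TIME field is minutes:seconds, and the function names are illustrative:

```python
from datetime import datetime

STAMP_FORMAT = "%Y/%m/%d %H:%M:%S"  # matches the "===== 2003/05/20 23:59:19 =====" headers


def cpu_seconds(top_time):
    """Convert a top TIME field such as '12:17' (minutes:seconds) to seconds."""
    minutes, seconds = top_time.split(":")
    return int(minutes) * 60 + int(seconds)


def cpu_share(stamp0, time0, stamp1, time1):
    """Average fraction of one CPU used between two snapshots."""
    wall = (datetime.strptime(stamp1, STAMP_FORMAT)
            - datetime.strptime(stamp0, STAMP_FORMAT)).total_seconds()
    return (cpu_seconds(time1) - cpu_seconds(time0)) / wall


# Values copied from the first two snapshots above.
share = cpu_share("2003/05/20 23:59:19", "2:15", "2003/05/21 00:10:52", "12:17")
print(f"kswapd averaged {share:.0%} of one CPU between the first two snapshots")
```

Between the first two snapshots kswapd accumulated about ten minutes of CPU time in under twelve minutes of wall-clock time; between the second and third its TIME value did not change.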

Comment 29 Ing. Christoph Pirchl 2003-05-27 16:34:37 EDT

I tried the memtest.sh script from http://people.redhat.com/dledford/memtest.html
The system crashes or hangs after the test has run for 20 minutes,
and kswapd is at 99.9% CPU load.
This seems to be the same problem as described above!?
I tried this with kernel-2.4.9-e.12, e.16 and also e.20.
The behavior is slightly different, but the symptoms are the same.

Any idea what else I can try?

System Info

IBM X440
4 x 2,4 GHz Xeon
8 GB Ram
ext. FAStT700 with QLogic QLA 2312

thanks in advance

Comment 30 Ray DeJean 2003-05-30 11:50:39 EDT
I extracted the kernel source from the Red Hat 9 kernel (rpm2cpio) and have been
running that on my AS systems. It has been going for about a week with no major
problems, and seems to do a good job of keeping memory free (with the kscand
threads). I believe this kernel (like -ac) uses the rmap VM.
I have given up on the AS kernel until Red Hat puts this bug at a higher priority and
fixes it for good. There are obviously still some major problems in the AS kernel VM.
We have been dealing with this bug since December and have not seen any
progress. It has left a very bad taste in my manager's mouth, and he will question the
use of AS or any Linux for future large-scale projects. Sorry to get political in a bug
report, but that is the reality. If any of you are still having problems, try the RH9 kernel.
Comment 31 Bastien Nocera 2003-05-30 12:15:17 EDT
As people are updating this bug, and I'm on the CC List. Please remember that
bugzilla is not a support channel. If you want focused help on your problems,
contact your local Red Hat support.

BTW, I believe that the 2.4.9-e.24 kernel has fixed these issues. See
for details

Comment 32 Ing. Christoph Pirchl 2003-06-02 06:46:50 EDT
Updated to kernel 2.4.9-e.24, still no change. After running the memtest.sh
script for 45 minutes, the server hangs!

12:40pm  up  1:42,  3 users,  load average: 3,79, 11,97, 17,68
113 processes: 110 sleeping, 3 running, 0 zombie, 0 stopped
CPU0 states:  1,5% user, 61,2% system,  0,5% nice, 36,4% idle
CPU1 states:  0,2% user, 49,0% system,  0,2% nice, 50,1% idle
CPU2 states:  2,3% user, 59,0% system,  0,0% nice, 38,0% idle
CPU3 states:  0,1% user, 84,4% system,  0,3% nice, 14,2% idle
Mem:  8240660K av, 4670672K used, 3569988K free,       0K shrd,    1288K buff
Swap: 2096220K av,       0K used, 2096220K free                 3837888K cached

   10 root      25   0     0    0     0 RW   99,9  0,0  35:42 kswapd
 2189 root      39   0  1112 1112   824 R    13,7  0,0  14:47 top

Any suggestions ?

Christoph Pirchl
Comment 33 Ing. Christoph Pirchl 2003-06-05 09:01:29 EDT
Hi everyone,

When i run the testscript with 4 GB RAM (instead of 8GB RAM), 
everything is working ;-))

Does anyone know, if the summit Kernel is compiled with HIGH MEMORY 

Thanks in advance

Comment 34 rob lojek 2003-06-05 15:02:28 EDT
This is just FYI:

We're running 2.4.9-e.16, which seems to be better than the really old kernels
as far as memory management goes, but we're now running into more problems with
general "sluggishness" on e.16 boxes, which matches some symptoms above.

Per RH's advice (see bug #85231), we adjusted these parameters:

echo 4000 > /proc/sys/vm/high_io_sectors
echo 2000 > /proc/sys/vm/low_io_sectors

but when we crank up oracle, we see periodic sluggishness on the console and
poor oracle performance compared to our current production kernel (2.4.17-rmap12f). 

We're in the same boat as Ray DeJean above--we can't deploy RHAS to our db farm
until this problem is solved, and we _have_ to run RHAS because oracle won't
support any other linux except SUSE-enterprise, which we may start testing soon.

We haven't tried e.24 yet, though, and it looks like there's a lot of VM work
from the errata notes on rhn.
Comment 35 Larry Woodman 2003-06-05 15:40:41 EDT
What's going on here is that the system is out of lowmem: the system is
managing 16GB, which requires ~300MB of lowmem for the mem-map, and there is
~6GB of memory in the pagecache, which consumes another ~500MB of lowmem for
the inodes and buffer headers, as can be seen in the slabinfo output.  So
kswapd can't reclaim the lowmem directly.  I did add additional memory
reclaiming logic to launder highmem pagecache memory when lowmem is consumed
by inodes and buffer headers.  This will free the inodes and buffer headers so
the slab memory can also be freed to reduce lowmem pressure.  This logic is in
the e.24 kernel but not e.16, so you should try that kernel out ASAP.

Larry Woodman
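
The lowmem budget described in this comment can be roughly re-derived from the figures posted in Comment 12. A minimal sketch in Python; it assumes the usual 896 MB lowmem limit of a 32-bit kernel (the figure printed in the Comment 7 boot log, which is from a different machine) and 4 kB pages, so the results are estimates only:

```python
KB_PER_MB = 1024
PAGE_KB = 4                      # assumed page size on i686

lowmem_limit_mb = 896            # assumed; "896MB LOWMEM available." in Comment 7
low_total_kb = 625712            # LowTotal from Comment 12
low_free_kb = 1532               # LowFree from Comment 12
inode_cache_slabs = 99735        # total slabs, from the slabinfo in Comment 12
buffer_head_slabs = 42170        # total slabs, from the slabinfo in Comment 12

# Lowmem that never shows up in LowTotal: mem_map and other boot-time reservations.
reserved_mb = lowmem_limit_mb - low_total_kb / KB_PER_MB

# Lowmem pinned by the two big slab caches (one page per slab in that listing).
slab_kb = (inode_cache_slabs + buffer_head_slabs) * PAGE_KB
slab_share = slab_kb / low_total_kb

print(f"reserved at boot (mem_map etc.): ~{reserved_mb:.0f} MB")
print(f"inode_cache + buffer_head slabs: ~{slab_kb / KB_PER_MB:.0f} MB "
      f"({slab_share:.0%} of LowTotal)")
print(f"lowmem still free:               ~{low_free_kb / KB_PER_MB:.1f} MB")
```

Both estimates land close to the ~300MB and ~500MB figures quoted above, with almost no lowmem left free.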
Comment 36 Yoshihide Sonoda 2003-07-08 00:57:54 EDT
Doesn't this problem relate to bug #98333?
Comment 37 RHEL Product and Program Management 2007-10-19 15:25:47 EDT
This bug is filed against RHEL2.1, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products.  Since
this bug does not meet that criteria, it is now being closed.

For more information of the RHEL errata support policy, please visit:

If you feel this bug is indeed mission critical, please contact your
support representative.  You may be asked to provide detailed
information on how this bug is affecting you.
