Bug 132639 - kswapd run amok after U3 upgrade
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
3.0
i386 Linux
medium Severity medium
Assigned To: Larry Woodman
Depends On:
Blocks: 123571
Reported: 2004-09-15 08:53 EDT by Göran Uddeborg
Modified: 2008-10-18 03:23 EDT (History)
32 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2004-12-20 15:56:37 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Probe while tar was still running (14.64 KB, text/plain)
2004-09-16 04:52 EDT, Göran Uddeborg
no flags Details
Probe after tar was killed (19.72 KB, text/plain)
2004-09-16 04:55 EDT, Göran Uddeborg
no flags Details
Complete messages file (451.29 KB, text/plain)
2004-09-16 04:58 EDT, Göran Uddeborg
no flags Details
Output from a number of SysRq p and w (31.88 KB, text/plain)
2004-09-16 11:46 EDT, Göran Uddeborg
no flags Details
Memory usage for one of the affected machines, by day (17.01 KB, image/png)
2004-09-16 14:47 EDT, Tore Anderson
no flags Details
CPU usage for one of the affected machines, by day (8.56 KB, image/png)
2004-09-16 14:49 EDT, Tore Anderson
no flags Details
Result in /var/log/messages from SysRq m, t, p, and w (160.84 KB, text/plain)
2004-09-23 07:58 EDT, Göran Uddeborg
no flags Details
Output of "head /proc/megaraid/hba[1-3]/raiddrives-0-9" (656 bytes, text/plain)
2004-09-23 08:01 EDT, Göran Uddeborg
no flags Details
Output of "pvdisplay /dev/sd[b-d]1" (1.30 KB, text/plain)
2004-09-23 08:03 EDT, Göran Uddeborg
no flags Details
Output of "vgdisplay /dev/vg00" (551 bytes, text/plain)
2004-09-23 08:04 EDT, Göran Uddeborg
no flags Details
Output of "lvdisplay /dev/vg00/lvol2" (440 bytes, text/plain)
2004-09-23 08:06 EDT, Göran Uddeborg
no flags Details
Output of "lspci -vv" (15.06 KB, text/plain)
2004-09-23 08:10 EDT, Göran Uddeborg
no flags Details
Contents of /proc/cpuinfo (1.80 KB, text/plain)
2004-09-23 08:13 EDT, Göran Uddeborg
no flags Details
Contents of /proc/meminfo (732 bytes, text/plain)
2004-09-23 08:14 EDT, Göran Uddeborg
no flags Details
Output of "lsmod" (1.04 KB, text/plain)
2004-09-23 11:30 EDT, Göran Uddeborg
no flags Details
Result in /var/log/messages from SysRq m, t, p, and w (101.79 KB, text/plain)
2004-09-23 12:23 EDT, Göran Uddeborg
no flags Details
Test program to trigger kswapd (7.87 KB, text/plain)
2004-09-24 08:05 EDT, Göran Uddeborg
no flags Details
Output of "vmstat 1 20", inactive_clean_percent=100 (1.74 KB, text/plain)
2004-09-24 08:07 EDT, Göran Uddeborg
no flags Details
Output of "top b n 3", inactive_clean_percent=100 (26.29 KB, text/plain)
2004-09-24 08:08 EDT, Göran Uddeborg
no flags Details
Result in /var/log/messages from SysRq m, t, and w, inactive_clean_percent=100 (110.35 KB, text/plain)
2004-09-24 08:09 EDT, Göran Uddeborg
no flags Details
/var/log/messages during restart and running test program (87.60 KB, text/plain)
2004-09-27 14:02 EDT, Göran Uddeborg
no flags Details
Last retrieved copy of slabinfo. (4.87 KB, text/plain)
2004-09-30 04:21 EDT, Göran Uddeborg
no flags Details
One of the first samples, for reference. (4.87 KB, text/plain)
2004-09-30 04:24 EDT, Göran Uddeborg
no flags Details
Result in /var/log/messages from SysRq t repeatedly (729.58 KB, text/plain)
2004-10-01 04:50 EDT, Göran Uddeborg
no flags Details
Details of the problem we are seeing (10.13 KB, text/plain)
2004-10-19 05:45 EDT, Ville Herva
no flags Details
top view as system froze on 21-15 kernel (3.05 KB, text/plain)
2004-10-27 16:14 EDT, Misao
no flags Details
My Complaint about RH Enterprise Service. (18.68 KB, text/plain)
2004-12-09 10:33 EST, Albert Graham
no flags Details
console output from kernel panic (1.42 KB, text/plain)
2004-12-13 15:16 EST, Olle Liljenzin
no flags Details
Kernel 2.4.21-23.ELsmp Panic under load (99.36 KB, image/jpeg)
2004-12-15 21:11 EST, Albert Graham
no flags Details
From Comment 152 (2.24 KB, text/plain)
2004-12-16 09:12 EST, Chris Evich
no flags Details
re: Comment 152 (731 bytes, text/plain)
2004-12-16 09:15 EST, Chris Evich
no flags Details

Description Göran Uddeborg 2004-09-15 08:53:48 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.4.3)
Gecko/20040803

Description of problem:
After the latest kernel upgrade in U3 we have seen the kswapd kernel
thread run amok several times on particular machines.  Kswapd runs and
eats CPU time, but does not actually swap much.  Right now, after
ca. 6 days of uptime, it has accumulated almost 8 hours of CPU time.

This happens occasionally, often on an unloaded machine.  We haven't
been able to figure out any particular pattern (yet).  Putting more
load on the machine could make kswapd calm down again.

We never saw this happen with the kernel from U2.

filadelfia> uname -a
Linux filadelfia 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:46:40 EDT 2004
i686 i686 i386 GNU/Linux
filadelfia> uptime
 13:55:11  up 6 days,  3:05,  0 users,  load average: 0.00, 0.00, 0.00
filadelfia> ps -C kswapd -fl
F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY          TIME CMD
1 S root        11     1  5  75   0    -     0 -      Sep09 ?        07:55:51 [kswapd]
filadelfia> vmstat
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy wa id
 0  0   4644 4612624 387184 848828    0    0    15     7   12     9  0  2  1 16
filadelfia> grep swap /proc/stat
swap 76596 212437


Version-Release number of selected component (if applicable):
kernel-2.4.21-20.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Spontaneous.

Additional info:
filadelfia> lsmod
Module                  Size  Used by    Not tainted
nls_iso8859-1           3516   0 (autoclean)
nfsd                   85168   8 (autoclean)
lp                      9124   0 (autoclean)
parport                38816   0 (autoclean) [lp]
autofs4                16952   3 (autoclean)
nfs                    99376  11 (autoclean)
lockd                  58992   1 (autoclean) [nfsd nfs]
sunrpc                 89212   1 (autoclean) [nfsd nfs lockd]
e1000                  76956   1
floppy                 57488   0 (autoclean)
sg                     37228   0 (autoclean)
sr_mod                 17784   0 (autoclean)
microcode               6848   0 (autoclean)
ide-scsi               12464   0
ide-cd                 34016   0
cdrom                  32544   0 [sr_mod ide-cd]
loop                   12696   0 (autoclean)
ext3                   89960   6
jbd                    55060   6 [ext3]
lvm-mod                64992   5
megaraid2              37928  11
sd_mod                 13712  22
scsi_mod              114344   5 [sg sr_mod ide-scsi megaraid2 sd_mod]
Comment 1 Larry Woodman 2004-09-15 13:25:08 EDT
Göran, this is the first I've heard of this problem.  If you do see
this happen, please try to collect some data so I can figure out what's
happening.  If possible please collect: 1.) "vmstat 1" output, 2.)
"top 1" output, 3.) "AltSysrq M" and "AltSysrq T", and attach the
/var/log/messages file.

Thanks, Larry Woodman
Comment 2 Göran Uddeborg 2004-09-15 16:42:43 EDT
We'll try to catch this info.  (We don't have a console on this
machine currently, so the SysRq info could be slightly harder.  I'll
see what we can do.)
Comment 3 Larry Woodman 2004-09-15 17:24:07 EDT
Just "echo m > /proc/sysrq-trigger" and "echo t > /proc/sysrq-trigger"

Larry
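The data Larry asks for in comments 1 and 3 can be gathered in one pass.  A minimal collection sketch (the script name, log directory and sample counts are arbitrary; run as root so the SysRq writes work, and the SysRq part is skipped quietly where /proc/sysrq-trigger is not writable):

```shell
#!/bin/sh
# Sketch of a one-shot diagnostic collector for the kswapd problem.
# Bundles the vmstat/top/ps samples with SysRq dumps into one directory.
LOG=$(mktemp -d /tmp/kswapd-probe.XXXXXX)

command -v vmstat >/dev/null && vmstat 1 5  > "$LOG/vmstat.txt"
command -v top    >/dev/null && top -b -n 1 > "$LOG/top.txt"
ps -C kswapd -fl > "$LOG/ps-kswapd.txt" 2>&1 || true

# 'm' dumps memory stats, 't' dumps task backtraces, both into
# /var/log/messages (the /proc/sysrq-trigger trick from comment 3).
if [ -w /proc/sysrq-trigger ]; then
    echo m > /proc/sysrq-trigger
    echo t > /proc/sysrq-trigger
    sleep 2
    tail -n 500 /var/log/messages > "$LOG/messages-tail.txt" 2>/dev/null
fi

echo "probe written to $LOG"
```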
Comment 4 Göran Uddeborg 2004-09-15 17:38:37 EDT
Ah, I didn't know about /proc/sysrq-trigger.  You learn something each
day! :-)  Thanks!
Comment 5 Göran Uddeborg 2004-09-16 04:35:10 EDT
The problem was triggered again this morning (Central European time).
We believe it started when we ran a "tar zxf".  I've made two
measurements, one while the tar was still running and one after we had
killed it.  I'll attach both.  (They are just raw outputs from the
script I made, except I've trimmed the extra lines from messages that
my sed command wasn't smart enough to skip.  I wasn't sure how to
format things, but I'm sure you'll manage! :-)
Comment 6 Göran Uddeborg 2004-09-16 04:52:51 EDT
Created attachment 103900 [details]
Probe while tar was still running
Comment 7 Göran Uddeborg 2004-09-16 04:55:29 EDT
Created attachment 103901 [details]
Probe after tar was killed
Comment 8 Göran Uddeborg 2004-09-16 04:58:12 EDT
Created attachment 103902 [details]
Complete messages file

There are several abort messages from the megaraid module in there.  Could this
be the problem maybe?
Comment 9 Larry Woodman 2004-09-16 10:12:24 EDT
Göran, I still can't see exactly what kswapd is doing; the memory stats
really don't show the problem.  Can you grab me a few "AltSysrq P" and
"AltSysrq W" outputs when kswapd is running amok?  This will give
us the backtrace of the running (kswapd) process.

Thanks, Larry Woodman

Comment 10 Göran Uddeborg 2004-09-16 11:46:41 EDT
Created attachment 103912 [details]
Output from a number of SysRq p and w

The "p" doesn't seem to write anything to messages, except for the header
"SysRq : Show Regs".  Is there something additional, apart from sysctl
kernel.sysrq I need to enable?
Comment 11 Göran Uddeborg 2004-09-16 11:49:59 EDT
When I do NOT want to change state from NEEDINFO to ASSIGNED, Bugzilla
seems to do it automatically for me.  But this time, when it would have
been appropriate, I have to do it myself!

Oh well. :-)
Comment 12 Göran Uddeborg 2004-09-16 12:13:04 EDT
We just discovered there has been a thread on taroon-list around this
problem.

http://www.redhat.com/archives/taroon-list/2004-September/msg00073.html

Bug 121434 is mentioned in there.  It has a very long thread, but it
doesn't seem to be exactly the same symptoms.  Also, that was filed
back in April, and we started seeing this in update 3.

Could of course still be related in some way.
Comment 13 Tore Anderson 2004-09-16 14:45:35 EDT
I've seen this as well, on a couple of AS3 U3 boxes running
Cyrus-IMAPd.  When I'm migrating users to them by copying their mail
via IMAP, kswapd seems to be spinning on one CPU, and although there
are more of them, interactivity is dog slow.  You'd think the link you
were SSH'ing over were a mobile GSM-data link or something.  For
instance, running "df" takes over seven seconds.

When I'm not migrating users to them, kswapd calms down.  On the other
hand, there's no real load on the servers then either.

This seems to be happening even though virtually no swap is used.
Indeed, it persists even if I disable swap entirely with "swapoff -a"!
I'll attach some system graphs you might find interesting.  If there's
any more information I can provide, please let me know.  I'd be
interested to hear of any workarounds too, even if these mean that I
cannot use swap.  As it stands, I have enough physical memory anyway.

The machines in question are HP ProLiant DL360 G3's.  HyperThreading
is enabled. The kernel I'm running is 2.4.21-20.ELsmp, sadly enough
tainted with HP's SecurePath binary modules (...please add support for
dm-multipath soon, so I can get rid of them!).

I cannot confirm the submitter's assertion that the problem was
introduced in U3, as this system was just recently put into production.

Tore Anderson
Comment 14 Tore Anderson 2004-09-16 14:47:47 EDT
Created attachment 103920 [details]
Memory usage for one of the affected machines, by day
Comment 15 Tore Anderson 2004-09-16 14:49:00 EDT
Created attachment 103921 [details]
CPU usage for one of the affected machines, by day
Comment 17 Tore Anderson 2004-09-17 03:10:23 EDT
This problem seemed to reach critical mass some hours after I wrote my
initial comment, pulling one machine down and leaving the other in a
seriously dysfunctional state: although user space was alive, lots of
processes were impossible to kill, every command (even the simplest
ones such as "cat file") took ages to finish, and the load average had
skyrocketed.

Cluster Manager didn't cope with the situation either, although that's
perhaps not to be expected as both members started getting real
problems almost simultaneously.

I rebooted both boxes, reverting to kernel 2.4.21-15.0.4.ELsmp, and up
until now I've not seen any sign of this bug being present in that
kernel.  So it seems Mr. Uddeborg is correct: this bug is a new, er,
feature of U3.

The fact that I had to revert to an older kernel also means, I'm
afraid, that I cannot easily provide more debug information upon
request.  I hope that won't be detrimental to pinpointing and
correcting the bug.

Tore Anderson
Comment 18 Per Steinar Iversen 2004-09-17 03:45:59 EDT
We see this too on our IMAP server. The problem only occurs during
heavy IO, in our case when running tar for backup and when squatter
(Cyrus indexer) runs. The only thing in common is that a lot of files
are opened and read by these programs. The Cyrus spool partition is
ext3 and mounted with the noatime option.

The hardware is a Dell 2650, 2x3GHz CPU, 4GB RAM and a PERC 4/DC RAID
controller. We use the megaraid2 driver.

Currently we run on the previous kernel that has no such problem.
Comment 19 Larry Woodman 2004-09-17 07:23:45 EDT
kswapd is spinning in prune_icache, I'll look into this.

Larry 
Comment 20 Larry Woodman 2004-09-22 17:13:48 EDT
Göran, basically kswapd is spending lots of time in prune_icache().
First of all, how can I reproduce this problem?  I have tried, and can
never catch the system in a state where kswapd is eating up any
significant CPU time.  Second, when you see the system in this state,
can you "cat /proc/sys/fs/inode-nr" as well as getting me AltSysrq M,
T, P and W outputs?

Larry Woodman
Comment 21 Guil Barros 2004-09-22 23:23:09 EDT
Just wanted to chime in that we are having a very similar problem.  I
can replicate it on a number of RHEL3 U3 boxes by piping a tar
through ssh to another box.  Not sure if it matters, but this works
particularly well when really taxing the filesystem (i.e. my tar is of
hundreds of thousands of small 0-10 B files).
Comment 22 Göran Uddeborg 2004-09-23 07:47:27 EDT
As you may have guessed, we do not quite know what is necessary to
reproduce the problem.  We have many machines running RHEL3U3, but see
this problem on only one.  (At least we see it very often on just one.)

The hardware and software configuration of this machine is a bit
different from most of our machines.  It is a Dell PE2600 with three
external PV220, and the filesystem we use most on it is an LVM RAID
partition.  It is hard for me to guess what difference is the relevant
one.  I'll attach a number of outputs from various informational
commands, in addition to the SysRq you asked for.

Our standard way to trigger the behaviour is to start extraction of a
large (compressed) tar archive.  It seems to always go into this mode
then.  If we break the tar process, kswapd still keeps running at 100%.
If we remove the extracted files, kswapd goes back to normal.

As a last note for now: one more way we have found to affect the
behaviour is changing the value of
/proc/sys/vm/inactive_clean_percent.  We normally have it at the
default 30.  If we set it to 100 we don't see the effect any more (but
the machine runs slower).
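The inactive_clean_percent workaround described above can be applied at runtime.  A sketch (the knob only exists on RHEL3-era 2.4 kernels, so the script bails out quietly elsewhere; the setting is lost on reboot):

```shell
#!/bin/sh
# Raise /proc/sys/vm/inactive_clean_percent from the default 30 to 100,
# the temporary workaround discussed in this bug.
KNOB=/proc/sys/vm/inactive_clean_percent

if [ -w "$KNOB" ]; then
    echo "old value: $(cat "$KNOB")"
    echo 100 > "$KNOB"
    echo "new value: $(cat "$KNOB")"
    # revert with: echo 30 > $KNOB
else
    echo "$KNOB not present or not writable (not a RHEL3 2.4 kernel?)"
fi
```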
Comment 23 Göran Uddeborg 2004-09-23 07:59:00 EDT
Created attachment 104173 [details]
Result in /var/log/messages from SysRq m, t, p, and w

SysRq p still doesn't write anything, for some reason.  The contents of
/proc/sys/fs/inode-nr was "124527  123958".
Comment 24 Göran Uddeborg 2004-09-23 08:01:49 EDT
Created attachment 104174 [details]
Output of "head /proc/megaraid/hba[1-3]/raiddrives-0-9"
Comment 25 Göran Uddeborg 2004-09-23 08:03:29 EDT
Created attachment 104175 [details]
Output of "pvdisplay /dev/sd[b-d]1"
Comment 26 Göran Uddeborg 2004-09-23 08:04:25 EDT
Created attachment 104176 [details]
Output of "vgdisplay /dev/vg00"
Comment 27 Göran Uddeborg 2004-09-23 08:06:56 EDT
Created attachment 104177 [details]
Output of "lvdisplay /dev/vg00/lvol2"
Comment 28 Göran Uddeborg 2004-09-23 08:10:05 EDT
Created attachment 104178 [details]
Output of "lspci -vv"
Comment 29 Göran Uddeborg 2004-09-23 08:13:08 EDT
Created attachment 104181 [details]
Contents of /proc/cpuinfo
Comment 30 Göran Uddeborg 2004-09-23 08:14:35 EDT
Created attachment 104182 [details]
Contents of /proc/meminfo
Comment 31 Tore Anderson 2004-09-23 08:27:15 EDT
For what it's worth, my hardware is very different from Göran's.  I have
a few HP ProLiant DL360 G3's, which use the cpqarray kernel module for
their internal storage.  The mail spool is stored on an HP HSV110 EVA,
driven by a QLogic QLA2342 HBA (qla2300 module).  HP SecurePath is
used.

I use LVM on the internal RAID, but plain ext3 directly on the SCSI
hard drive devices for the mail spool.

What is common seems to be that the system is doing lots and lots of
file operations on a large number of tiny files.  I was copying mail
via IMAP from one server when I got problems - an operation that
creates a massive amount of small files;  one per email.

Tore Anderson
Comment 32 Guil Barros 2004-09-23 09:56:31 EDT
Additional info: our systems that are showing this are Dell PE8450's,
HP DL380's and BL20p's.  They are all attached to either a PV640 or
HSV100's via the qla2300 drivers... which I guess is another
similarity.  If needed I can see if the problem also happens accessing
local disk.  Is there any specific information that I can provide that
would help?

I have played with inactive_clean_percent, which temporarily solves the
problem, but it comes back in a couple of hours.  I have also changed
pagecache, to no avail.

Are all of the systems affected by this using the qla2300 module by
any chance?
Comment 33 Larry Woodman 2004-09-23 11:13:08 EDT
Based on the AltSysrq-M output attached, I think the system might be in
a state where the free and inactive clean page lists are too low.  Can
you guys try setting /proc/sys/vm/inactive_clean_percent all the
way to 100, just to see if the problem disappears?

I suspect that a combination of the number of currently unused inodes
(closed files), the average number of pages on each inode (several
small files were in use), and the paging thresholds is causing kswapd
to make little or no progress when the system gets into this state.

Larry Woodman
Comment 34 Göran Uddeborg 2004-09-23 11:30:15 EDT
Created attachment 104195 [details]
Output of "lsmod"

No, we do not have qla2300 loaded.
Comment 35 Göran Uddeborg 2004-09-23 12:23:44 EDT
Created attachment 104198 [details]
Result in /var/log/messages from SysRq m, t, p, and w
Comment 36 Guil Barros 2004-09-23 12:45:49 EDT
Setting inactive_clean_percent to 100 seems to make kswapd go away for
a while, but it comes back in a couple of hours if the I/O load doesn't
stop in the meantime.
Comment 37 Larry Woodman 2004-09-23 13:56:05 EDT
OK, can you set it to 100, restart the test, and after it "comes back"
get me "vmstat 1", "top" and AltSysrq M, T and W outputs so I can see
what state the system is in at that point?  Sorry, I can't reproduce
this problem here for some reason, and getting access to your machines
wouldn't make solving this any quicker.

Thanks for your help on this, guys.

Larry Woodman

Comment 38 Göran Uddeborg 2004-09-24 08:05:13 EDT
Created attachment 104257 [details]
Test program to trigger kswapd

It was quite hard to trigger kswapd with inactive_clean_percent set to 100, but
finally we could by using this attached program.  It creates a lot of small C++
files.  Trying it elsewhere, we found we could trigger it on at least one more
machine.  So maybe this is a test case you could use to trigger the problem
yourself?

The procedure which has proven most reliable has been to first use up
all caches with a command like this

  dd if=/dev/zero of=/some/(local)/disk/file bs=1G count=6

and then run this test program.  Kswapd comes up after some time, sooner with
30 and later with 100.

I'll attach the probes from this test too.
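The two-step procedure above (fill the caches with one big write, then create many small files) can be sketched as a single script.  All sizes and names here are illustrative stand-ins, defaulting to values small enough for a dry run; the real reproducer used bs=1G count=6 and on the order of 2 million files:

```shell
#!/bin/sh
# Reproducer sketch for the kswapd trigger described in comment 38.
# Shrink/grow BS, BIGCOUNT and NFILES to match the machine's RAM.
DIR=${DIR:-$(mktemp -d /tmp/filefill.XXXXXX)}
BS=${BS:-1M}            # original procedure: 1G
BIGCOUNT=${BIGCOUNT:-6}
NFILES=${NFILES:-1000}  # original test created ~2 million files

# Step 1: use up the caches with one big sequential write.
dd if=/dev/zero of="$DIR/bigfile" bs="$BS" count="$BIGCOUNT" 2>/dev/null

# Step 2: create many small C++-like files, 1000 per directory.
i=0
while [ "$i" -lt "$NFILES" ]; do
    d="$DIR/d$((i / 1000))"
    mkdir -p "$d"
    echo "int f$i(void) { return $i; }" > "$d/f$i.cc"
    i=$((i + 1))
done
echo "created $NFILES files under $DIR"
```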
Comment 39 Göran Uddeborg 2004-09-24 08:07:19 EDT
Created attachment 104258 [details]
Output of "vmstat 1 20", inactive_clean_percent=100
Comment 40 Göran Uddeborg 2004-09-24 08:08:17 EDT
Created attachment 104259 [details]
Output of "top b n 3", inactive_clean_percent=100
Comment 41 Göran Uddeborg 2004-09-24 08:09:58 EDT
Created attachment 104260 [details]
Result in /var/log/messages from SysRq m, t, and w, inactive_clean_percent=100
Comment 42 Larry Woodman 2004-09-24 13:29:48 EDT
I created a test kernel that prevents kswapd from spending so much
time pruning the inode cache.  Can you guys give this a test ASAP and
let me know how it works for you?  The kernel is here:

http://people.redhat.com/~lwoodman/.RHEL3/

The patch I applied is this:
*****************************************************************************
--- linux-2.4.21/mm/vmscan.c.orig       2004-09-24 11:19:15.000000000 -0400
+++ linux-2.4.21/mm/vmscan.c    2004-09-24 11:27:52.000000000 -0400
@@ -1028,7 +1028,7 @@ static int do_try_to_free_pages(unsigned
        }
 
        ret += shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
-       ret += shrink_icache_memory(1, gfp_mask);
+       ret += shrink_icache_memory(10, gfp_mask);
 #ifdef CONFIG_QUOTA
        ret += shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
 #endif
@@ -1075,6 +1075,7 @@ static int do_try_to_free_pages_kswapd(u
        int ret = 0;
        struct zone_struct * zone;
        pg_data_t *pgdat;
+       static unsigned long last_inodecache_shrink;
 
        /*
         * Eat memory from filesystem page cache, buffer cache,
@@ -1096,7 +1097,11 @@ static int do_try_to_free_pages_kswapd(u
 
                if (free_low(zone + ZONE_DMA) > 0 || free_low(zone + ZONE_NORMAL) > 0) {
                        ret += shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
-                       ret += shrink_icache_memory(DEF_PRIORITY, gfp_mask);
+                       /* shrink the icache by no more than 1% each second from kswapd */
+                       if (time_after(jiffies, last_inodecache_shrink + HZ)) {
+                               ret += shrink_icache_memory(100, gfp_mask);
+                               last_inodecache_shrink = jiffies;
+                       }
                        try_to_reclaim_buffers(DEF_PRIORITY, gfp_mask);
 #ifdef CONFIG_QUOTA
                        ret += shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
Comment 43 Per Steinar Iversen 2004-09-26 04:08:11 EDT
This patch seems to have worked for me at least - the test kernel
behaves as expected during backup and a run of squatter - no problems
with kswapd.
Comment 44 Göran Uddeborg 2004-09-27 14:02:55 EDT
Created attachment 104388 [details]
/var/log/messages during restart and running test program

It did not work very well for us.  Our test program runs for a little
while, then kswapd starts going up, and we are thrown out.

Looking at the console or in messages reveals a lot of processes killed with
an "Out of Memory" message.  I enclose an excerpt from messages, starting with
the boot sequence and then the test program being started shortly
after.  This is repeatable.

Our original test case, file extraction with tar, does seem to work with this
(20.11.inodeshrink) kernel, though.
Comment 45 Larry Woodman 2004-09-29 11:35:59 EDT
Göran, you are having a different type of problem now.  Your
attachment indicates that lowmem is *totally* consumed by the slabcache!

*****************************************************
kernel:   aa:0 ac:25 id:112 il:102 ic:0 fr:638
191132 pages of slabcache
*****************************************************

Please get me a "cat /proc/slabinfo" so we can determine what is
allocating all of those kernel data structures and never freeing
them.  None of the lsmod entries is indicative of this type of problem.

Larry
Comment 46 Göran Uddeborg 2004-09-30 04:17:44 EDT
> Göran, you are having a different type of problem now.

If you say so. :-)

It is triggered by the same program; the program runs fine with
2.4.21-15.0.3 but not later, and kswapd goes high in top in both
cases.  So it looked related to me.

But I'm definitely not going to argue with you if it is the same bug
or not.  :-)

I'll attach slabinfo samples.
Comment 47 Göran Uddeborg 2004-09-30 04:21:34 EDT
Created attachment 104571 [details]
Last retrieved copy of slabinfo.

I can't copy slabinfo after my login sessions get killed, so I made a
loop copying /proc/slabinfo each second.  This is the last sample I
got.  Presumably the looping shell was killed after that.
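The sampling loop described here might look like the following sketch (the output directory, file naming and sample cap are assumptions; set MAX high, or run it under nohup, for a real capture that survives the login shell being killed):

```shell
#!/bin/sh
# Snapshot /proc/slabinfo once per second into numbered files, so the
# most recent copies survive even if the interactive shell is killed.
SRC=${SRC:-/proc/slabinfo}
OUT=${OUT:-$(mktemp -d /tmp/slabinfo.XXXXXX)}
MAX=${MAX:-3}   # illustrative cap; raise for a real capture

n=0
while [ "$n" -lt "$MAX" ]; do
    cp "$SRC" "$OUT/slabinfo.$(printf '%05d' "$n")" 2>/dev/null
    n=$((n + 1))
    sleep 1
done
echo "$MAX samples in $OUT"
```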
Comment 48 Göran Uddeborg 2004-09-30 04:24:21 EDT
Created attachment 104572 [details]
One of the first samples, for reference.

In case it could be of any help, I have one sample each second between these. 
That is, 198 of them.
Comment 49 Rik van Riel 2004-09-30 08:15:43 EDT
Most of your memory is consumed by inodes.

Larry, it would appear that inode reclaiming might need to be a bit
more aggressive, there don't seem to be anywhere near as many dentries
or open files as there are inodes being cached.
Comment 51 Larry Woodman 2004-09-30 13:29:35 EDT
 
Göran, the slabcache output you attached indicates that lowmem is
totally consumed by inodes and buffer_heads.  When this happens,
kswapd should call prune_icache (which it does), and that should wake
up bdflush to push out the dirty pages so the inodes and buffer_heads
can be reclaimed.  Obviously something isn't working as it should; can
you reproduce this problem and get a few AltSysrq-T outputs so I can
see what bdflush is doing?

Thanks, Larry

Comment 52 Göran Uddeborg 2004-10-01 04:50:59 EDT
Created attachment 104620 [details]
Result in /var/log/messages from SysRq t repeatedly

I reran the triggering program, and in parallel did a SysRq-T every 15th
second.  The behaviour was slightly different at first.  The machine still
started to throw out processes, but it didn't go down completely.  The
triggering program was apparently killed after a while, and the machine
recovered after that.

I tried once more, and then it crashed like before.  The attachment is from the
second run.  I don't know if it actually is the SysRqs which affect the
behaviour, or if it is a coincidence.
Comment 53 Göran Uddeborg 2004-10-01 04:52:39 EDT
By the way, there is a jump in the comments from comment 49 to comment
51.  Is that intentional?  What happened to comment 50?
Comment 54 Damian Menscher 2004-10-08 18:49:14 EDT
Seeing this with the RHEL3 U3 SMP kernel.  The trigger is when a user
tries to make a copy of his data.  Only 5 gig, but it's spread across
600,000 files.  ;)

My /proc/slabinfo confirms it's an inode_cache issue:
inode_cache       207606 354284    512 50612 50612    1 :  496  124

kswapd is taking anywhere from 70% to 399.9% of our cpu (it's a dual 
xeon with hyperthreading), which pushes the load up to about 12 
(normally we sit around 0.3).  Meanwhile vmstat shows *no* swapping 
actually taking place.

This is on a system with 4 gig ram and 4 gig swap.

Let me know if there's additional info I can provide (but keep in 
mind it's a production system, so I can only do "safe" operations).
Comment 55 Rainer Poehlmann 2004-10-13 04:38:10 EDT
We have seen exactly the same phenomenon since updating to EL3: kswapd
eats up CPU without doing any actual swapping, really slowing down the
system.
RH EL AS version 3, kernel 2.4.21-20.ELsmp
IBM x345 dual Xeon (HT on) server with 4GB RAM and 4GB swap space.
Unfortunately this is our central file server, which has 3.4 TB of
FibreChannel storage attached (IBM FAStT600 + FAStT700 expansion).
Users therefore really suffer from an unresponsive file server, up to
complete unresponsiveness.
Comment 56 Larry Woodman 2004-10-13 09:18:31 EDT
Is everyone that is experiencing this problem dealing with a huge
number of files (I see hundreds of thousands), or does it also show up
when a smaller number of files is involved?  Also, the original
description of this bug stated it was an x86_64 system.  Is this
happening on all architectures?

Larry
Comment 57 Damian Menscher 2004-10-13 09:54:56 EDT
I'm on a dual Xeon with hyperthreading (so Linux sees 4 processors).
So x86, but NOT 64-bit.  Yes to the huge number of files.  I think it
might be an issue when writing files only, not reading them, since
our nightly backups don't seem to trigger it.  But I could be wrong
about that.

Anyone seeing this on a NON-smp kernel?  Anyone have less than 3 gig 
of ram??

BTW, increasing inactive_clean_percent from the default of 30 to a 
value of 50 seems to help.

If nobody has the problem on an AMD system, I can attempt to 
reproduce it on our dual Opteron system (3.5 gig ram).
Comment 58 Per Steinar Iversen 2004-10-13 10:15:32 EDT
I have this problem on a dual Xeon server, 4GB RAM. kswapd sucks up
cpu when using tar (millions of files) and when squatter (cyrus
indexer) runs.
Comment 59 Göran Uddeborg 2004-10-13 10:30:57 EDT
For the original U3 kernel (kernel-smp-2.4.21-20.EL):
We have only seen this with a significant number of files, if not
hundreds of thousands.  It first happened when we tried to rsync large
source code CVS repository with some 70 000 files.  Our standard tar
extraction test case includes around 160 000 files.  Tar has only
started extracting those files when kswapd stops other activity, though.

With your test kernel from comment 42
(kernel-smp-2.4.21-20.11.inodeshrink.EL):
This one handles the tar file.  The test case we use here would create
around 2 million files, if it succeeded.

I don't believe my original description mentioned x86_64.  The machine
where this happens most often has dual Pentium 4 Xeon (Northwood)
CPUs (hyperthreaded to four virtual).  We do have some AMDs, but we
haven't seen this triggered there; we haven't specifically tried.
Comment 60 Tom Sightler 2004-10-13 10:42:49 EDT
We've seen this on our backup server, primarily during the large
backup run at night, so I do think it can happen during reads.  This
system is a 4-CPU Dell 6450 (PIII Xeon/700MHz).

We have been running a kernel that we pulled from comment 42 as well,
although it's a version with 2.4.21-20.12.  I don't know if it's just
luck or not, but so far it has made 7 days of uptime (the original
2.4.21-20 crashed anywhere from 24-72 hours every time).

I didn't see anything about this applying only to x86_64; although it
was originally posted from a machine that had x86_64 in its User-Agent
string, the actual description indicates an i686, and it was filed
against i386.

Later,
Tom
Comment 61 Rainer Poehlmann 2004-10-13 11:14:55 EDT
We also observe it on a dual Xeon with hyperthreading, 4GB of RAM and
4 GB of swap space (2 x 2GB partitions).
As previously mentioned, this machine acts as an NFS file server
(comment 55).
We saw this behaviour when we had local processes like backup or
tar/gunzip (which we sometimes cannot avoid) plus additional load
caused by NFS activity from the NFS clients.
Comment 62 Larry Woodman 2004-10-13 13:41:53 EDT
So it appears that this problem only occurs when a huge number of
files is involved and there are writes to those files.  I can see what
is going on in the system when this happens: kswapd calls prune_icache,
which calls invalidate_inode_pages for each file, and it can't do
anything because the pages caching the file data are dirty.  But I
can't see what changed between RHEL3-U2 and RHEL3-U3 that caused this
change in behavior.  RHEL3-U2 was OK, correct?

Larry
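Larry's diagnosis above can be cross-checked from user space: if prune_icache is stuck, the inode slab keeps growing while kswapd spins.  A minimal shell sketch of that check follows; the slabinfo sample in it is invented for illustration, and on a live RHEL3 box you would substitute the real output of `cat /proc/slabinfo`:

```shell
# Hypothetical /proc/slabinfo snapshot (RHEL3-era columns: name, active_objs,
# num_objs, ...).  On a real system: slab_sample=$(cat /proc/slabinfo)
slab_sample='dentry_cache 2211840 2211850 128
inode_cache 1843200 1843210 512'

# Pull out the number of active inode_cache objects.
active=$(printf '%s\n' "$slab_sample" | awk '$1 == "inode_cache" {print $2}')
echo "inode_cache active objects: $active"
# If this count climbs steadily while kswapd sits near 100% CPU, kswapd is
# failing to free unused inodes, consistent with the diagnosis above.
```

Sampling this once a minute alongside top output is enough to see whether the unused-inode list is growing without bound.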
Comment 63 Göran Uddeborg 2004-10-13 17:26:05 EDT
Yes, RHEL3-U2 was OK.
Comment 64 Per Steinar Iversen 2004-10-14 02:18:35 EDT
We see this problem when running backups and dumping the result across
the network to another machine - tar only reads, no local writes. The
imap partition is mounted with the noatime option. The last U2 kernel
is still ok.
Comment 65 Ville Herva 2004-10-19 03:06:14 EDT
We are seeing this too. Red Hat service request #368982, if anyone at
Red Hat is interested. I can dump the info here if anyone is
interested, too (as it seems the Red Hat support is babbling down at
"Dear Sir. Which kernel are you running. Thank You" level and not
making any progress.)

Summary: kswapd is spinning at the same place, few updatedb and rm
processes are unkillable. Test script is to continuously extract and
rm kernel source trees while running updatedb in parallel.
Comment 66 Ville Herva 2004-10-19 05:45:32 EDT
Created attachment 105430 [details]
Details of the problem we are seeing

Here are the details I sent to Red Hat support, in case anyone is
interested.
Comment 67 Ville Herva 2004-10-19 06:29:20 EDT
Basically, it seems this happens every time I try to access very many
files (updatedb, cp, rm, whatever); the dcache fills up and kswapd goes
into a busy loop. 

This seems pretty bad to me; a mere "cp -ax / /mnt/otherdisk" was
enough to reproduce it.  (Granted, there were a lot of files on / since I
forgot to exclude /tmp, where I had run the kernel extract torture
test, but "a lot of files" hardly qualifies as a "don't do that" for
an "enterprise" OS, does it?)

 
This is the script I used for testing: 
 
while true; do 
 for i in $(seq 1 500); do 
  mkdir $i 
  cd $i && tar xjf /tmp/linux-2.4.27.tar.bz2
  cd .. 
 done 
 date >> filefill.log 
 df -m . >> filefill.log 
 rm -rf ? ?? ??? 
done 
 
and on another terminal, I ran 
  "while true; do updatedb; done". 

Here is the kswapd stack trace for completeness: 
 
kswapd        R 00000000  4332    11      1            12    13 (L-TLB) 
Call Trace:   [<c011d500>] smp_apic_timer_interrupt [kernel] 0x0
(0xc496df10) 
[<c011d570>] smp_apic_timer_interrupt [kernel] 0x70 (0xc496df14) 
[<c029f967>] __br_write_lock [kernel] 0x27 (0xc496df5c) 
[<c029f9aa>] .text.lock.brlock [kernel] 0x5 (0xc496df64) 
[<c014724c>] invalidate_inode_pages [kernel] 0x6c (0xc496df68) 
[<c017f3ea>] prune_icache [kernel] 0x1ba (0xc496df7c) 
[<c017f484>] shrink_icache_memory [kernel] 0x24 (0xc496dfa0) 
[<c015636e>] do_try_to_free_pages_kswapd [kernel] 0x14e (0xc496dfac) 
[<c0156508>] kswapd [kernel] 0x68 (0xc496dfd0) 
[<c01564a0>] kswapd [kernel] 0x0 (0xc496dfe4) 
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc496dff0) 


I also get some unkillable (kill -KILL) processes occasionally, here
are a few:

[3]root@elbrus:/tmp/burn>kill 12690 
[3]root@elbrus:/tmp/burn>kill 12690 
[3]root@elbrus:/tmp/burn>kill 12690 
[3]root@elbrus:/tmp/burn>kill 12690 
[3]root@elbrus:/tmp/burn>kill 12690 
[3]root@elbrus:/tmp/burn>kill -KILL 12690 
[3]root@elbrus:/tmp/burn>kill -KILL 12690 
[3]root@elbrus:/tmp/burn>kill -KILL 12690 
[3]root@elbrus:/tmp/burn>kill -KILL 12690 
[3]root@elbrus:/tmp/burn>kill -KILL 12690 
[3]root@elbrus:/tmp/burn>lsof -p 12690 
COMMAND   PID USER   FD   TYPE DEVICE    SIZE     NODE NAME 
rm      12690 root  cwd    DIR    8,2    4096 14683708 /tmp/burn/yummi/1/linux-2.4.27/include/asm-sh 
rm      12690 root  rtd    DIR    8,2    4096        2 / 
rm      12690 root  txt    REG    8,2   26652   835635 /bin/rm 
rm      12690 root  mem    REG    8,2 1568924 14172162 /lib/tls/libc-2.3.2.so 
rm      12690 root  mem    REG    8,2  106912   524292 /lib/ld-2.3.2.so 
rm      12690 root    0u   CHR  136,1                3 /dev/pts/1 
rm      12690 root    1u   CHR  136,1                3 /dev/pts/1 
rm      12690 root    2u   CHR  136,1                3 /dev/pts/1 
rm      12690 root    3r   DIR    8,2    8192       21 /tmp/burn/yummi 
rm      12690 root    4r   DIR    8,2    4096 14683708 /tmp/burn/yummi/1/linux-2.4.27/include/asm-sh 
[3]root@elbrus:/tmp/burn>top -b -n1 | egrep "(kswapd|updatedb|rm)" 
   11 root      25   0     0    0     0 SW   100.5  0.0 485:35   0 kswapd 
 8638 root      15   0   712  712   476 D     2.4  0.0   0:52   2 updatedb 
12690 root      15   0   404  404   336 D     0.0  0.0   0:00   0 rm 

And on another boot:

updatedb      D 00000000  3776 10304   8875                     (NOTLB) 
Call Trace:   [<c0123e24>] schedule [kernel] 0x2f4 (0xca567e14) 
[<c017e50c>] __wait_on_inode [kernel] 0x7c (0xca567e58) 
[<c017f8b8>] iget4_locked [kernel] 0x98 (0xca567eac) 
[<f887076d>] ext3_lookup [ext3] 0x7d (0xca567ed4) 
[<c0171a3c>] real_lookup [kernel] 0xec (0xca567ef8) 
[<c017206c>] link_path_walk [kernel] 0x45c (0xca567f18) 
[<c01725b9>] path_lookup [kernel] 0x39 (0xca567f58) 
[<c0172909>] __user_walk [kernel] 0x49 (0xca567f68) 
[<c016d9ae>] sys_lstat64 [kernel] 0x2e (0xca567f84) 
[<c0161d6b>] sys_fchdir [kernel] 0x4b (0xca567fa0) 
Comment 68 Larry Woodman 2004-10-19 14:34:32 EDT
I think I found the problem here, this patch prevents prune_icache
from making progress:
********************************************************************
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -285,7 +285,8 @@ void refile_inode(struct inode *inode)
                BUG(); */
        if (!inode) return;
        spin_lock(&inode_lock);
-       __refile_inode(inode);
+       if (!(inode->i_state & I_LOCK))
+               __refile_inode(inode);
        spin_unlock(&inode_lock);
 }
*****************************************************************

I removed it and built a new kernel; can someone try it out ASAP?
It's located here:

>>>http://people.redhat.com/~lwoodman/.RHEL3/


Larry Woodman
Comment 69 Rik van Riel 2004-10-19 14:47:45 EDT
Nice find Larry.  I suspect that should do the trick ...
Comment 70 Dan Taracks 2004-10-19 16:19:30 EDT
Larry,
Does this patch work for i386 as well - saw the file name was i686.
We have the same issue with Update 3.
Comment 71 Dan Taracks 2004-10-19 16:26:43 EDT
Larry,
Does this patch work for i386 as well - saw the file name was i686.
We have the same issue with Update 3.
Comment 72 Göran Uddeborg 2004-10-20 03:12:18 EDT
Larry, when I download
http://people.redhat.com/~lwoodman/.RHEL3/kernel-smp-2.4.21-22.kswapdfix.EL.i686.rpm
I just get 9240086 null bytes.  Something is wrong, obviously.  I've
tried from two different sites with the same result, so I believe it
is on your side.
Comment 73 Ville Herva 2004-10-20 03:52:14 EDT
Yeah, it seems full of NULL for me too.
Comment 74 Larry Woodman 2004-10-20 09:40:42 EDT
OK guys, my disk quota was exceeded.  Grab the kernel again from:

>>>http://people.redhat.com/~lwoodman/.RHEL3/


Larry
Comment 75 Ville Herva 2004-10-20 10:16:09 EDT
Ok, I'm running it now. Will take some time to tell for sure if kswapd
hangs or not...
Comment 76 Göran Uddeborg 2004-10-20 11:07:11 EDT
We have tried our test cases now, and the machine is still running
fine.  We managed to get kswapd up to around 20% of the CPU as a
maximum.  But it didn't go out of control, and it came back down again.

From our point of view, this is solved now.  (I would have resolved
the bugzilla if it wasn't for this bug's large fan club, eh, cc list.
Maybe more people want to try the latest version first.  So I leave
the resolving to you, Larry. :-)
Comment 77 Larry Woodman 2004-10-20 11:12:50 EDT
OK, thanks.  Can I get everyone to try out a more complete version?

It's located at:

>>>>http://people.redhat.com/~lwoodman/.RHEL3/kernel-smp-2.4.21-22.prune_icachefix.EL.i686.rpm


Thanks, Larry
Comment 78 Ville Herva 2004-10-20 17:13:43 EDT
My test case (extract 500 kernel source trees and run updatedb in 
parallel, see comment #67) still fails (2.4.21-22.kswapdfix.ELsmp).
kswapd seems to behave, but I get a hung, unkillable updatedb process:

>pidof updatedb
31389
>kill -KILL 31389
>kill -KILL 31389
>kill -KILL 31389
>kill -KILL 31389
>kill -KILL 31389
>pidof updatedb
31389
>lsof -p 31389   
COMMAND    PID USER   FD   TYPE DEVICE    SIZE    NODE NAME
updatedb 31389 root  cwd    DIR   58,1    4096 7979042 /data/burn/3/linux-2.4.27/arch/ppc64/kernel
updatedb 31389 root  rtd    DIR   58,0    4096       2 /
updatedb 31389 root  txt    REG   58,0   27520  392649 /usr/bin/slocate
updatedb 31389 root  mem    REG   58,0 1568924  211818 /lib/tls/libc-2.3.2.so
updatedb 31389 root  mem    REG   58,0   51920  570174 /lib/libnss_files-2.3.2.so
updatedb 31389 root  mem    REG   58,0  106912  570141 /lib/ld-2.3.2.so
updatedb 31389 root    0u   CHR  136,3               5 /dev/pts/3
updatedb 31389 root    1u   CHR  136,3               5 /dev/pts/3
updatedb 31389 root    2u   CHR  136,3               5 /dev/pts/3
updatedb 31389 root    3w   REG   58,0  499712  749779 /var/lib/slocate/slocate.db.tmp
updatedb 31389 root    4r   DIR   58,1    8192   32769 /data/burn
updatedb 31389 root    5r   DIR   58,1    4096 7979042 /data/burn/3/linux-2.4.27/arch/ppc64/kernel

sysrq-t:
updatedb      D 00000000  3324 31389   2883                     (NOTLB)
Call Trace:   [<c0123f14>] schedule [kernel] 0x2f4 (0xe00f5e14)
[<c017f28c>] __wait_on_inode [kernel] 0x7c (0xe00f5e58)
[<c0180628>] iget4_locked [kernel] 0x98 (0xe00f5eac)
[<f889876d>] ext3_lookup [ext3] 0x7d (0xe00f5ed4)
[<c01727bc>] real_lookup [kernel] 0xec (0xe00f5ef8)
[<c0172dec>] link_path_walk [kernel] 0x45c (0xe00f5f18)
[<c0173339>] path_lookup [kernel] 0x39 (0xe00f5f58)
[<c0173689>] __user_walk [kernel] 0x49 (0xe00f5f68)
[<c016e51e>] sys_lstat64 [kernel] 0x2e (0xe00f5f84)
[<c016284b>] sys_fchdir [kernel] 0x4b (0xe00f5fa0)

Comment 79 Larry Woodman 2004-10-20 21:23:27 EDT
OK now please try the prune_icachefix kernel.

Larry
Comment 80 Ville Herva 2004-10-21 02:01:20 EDT
I've booted it now. It'll take at least a few hours to tell if it helps.

Out of interest, what's the difference between
2.4.21-22.prune_icachefix.ELsmp and
/vmlinuz-2.4.21-22.kswapdfix.ELsmp? Does it affect mainline?
Comment 81 Göran Uddeborg 2004-10-21 04:53:16 EDT
I've tried 2.4.21-22.prune_icachefix.EL and it behaves well in our tests.
Comment 82 Ville Herva 2004-10-21 14:19:11 EDT
I've been running the test case on one machine for eight hours and 
for a couple of hours on another machine, both running 
2.4.21-22.prune_icachefix.ELsmp.  No problems yet.  Before, the 
problems have always happened within a few hours.  I'll let the scripts 
run for some time still, but it looks like you nailed it, Larry!  Good job!
Comment 83 Rob ter Veer 2004-10-22 04:37:59 EDT
We've tried this patch also in our pre-operational environment:
850,000 files in less than 10 minutes.  Before the patched kernel this
took more than 30 minutes and resulted in an unusable system!  We'll do
some more testing, but I'm confident this has solved the problem.
Comment 84 Ville Herva 2004-10-22 06:07:44 EDT
Ummh, I'm seeing a problem after all.

test script:
__________________________________________________________________
 for i in $(seq 1 500); do
  mkdir $i
  cd $i
  tar xf /tmp/linux-2.4.27.tar
  cd ..
 done
 date >> filefill.log
 df -m . >> filefill.log
 rm -rf [1234567890] [1234567890][1234567890] [1234567890][1234567890][1234567890]
done &
while true; do updatedb; done
__________________________________________________________________

The test script (see above) has been running for some time on two
machines, and on one of them, I'm seeing some weird stuff:

mkdir: cannot create directory `286': File exists
tar: linux-2.4.27/include/linux/sunrpc/xprt.h: Cannot open: File exists
tar: linux-2.4.27/include/linux/sunrpc/msg_prot.h: Cannot open: File exists
tar: linux-2.4.27/Documentation/sound/VIBRA16: Cannot open: File exists
tar: linux-2.4.27/Documentation/sound/rme96xx: Cannot open: File exists
tar: linux-2.4.27/Documentation/sound/OPL3: Cannot open: File exists
tar: linux-2.4.27/Documentation/sound/AudioExcelDSP16: Cannot open: File exists
tar: linux-2.4.27/Documentation/sound/INSTALL.awe: Cannot open: File exists
tar: linux-2.4.27/Documentation/sound/PSS-updates: Cannot open: File exists
tar: Error exit delayed from previous errors
mkdir: cannot create directory `287': File exists
tar: linux-2.4.27/arch/i386/math-emu/fpu_system.h: Cannot open: File exists
tar: linux-2.4.27/arch/i386/math-emu/poly_atan.c: Cannot open: File exists
tar: linux-2.4.27/arch/i386/math-emu/poly_l2.c: Cannot open: File exists
tar: linux-2.4.27/arch/i386/math-emu/reg_u_add.S: Cannot open: File exists
tar: linux-2.4.27/arch/i386/math-emu/reg_u_div.S: Cannot open: File exists
tar: linux-2.4.27/arch/i386/math-emu/div_Xsig.S: Cannot open: File exists
tar: Error exit delayed from previous errors
mkdir: cannot create directory `288': File exists
tar: linux-2.4.27/include/asm-sparc64/unistd.h: Cannot open: File exists
tar: linux-2.4.27/include/asm-sparc64/perfctr.h: Cannot open: File exists

[3]root@elbrus:/mnt/3ware1>ls -ld 288/linux-2.4.27/include/asm-sparc64/unistd.h
drwxrwxr-x    5 573      573          4096 Oct 22 04:40 288/linux-2.4.27/include/asm-sparc64/unistd.h
[3]root@elbrus:/mnt/3ware1>ls 288/linux-2.4.27/include/asm-sparc64/unistd.h | wc
    117     117    1552

Some files have turned into directories that contain files.

The script went through a couple of times without a hitch, but now it
began giving those errors.  On the other box (where the script has been
running longer), there are no such problems.

The test partition is ext2 since I mistakenly did mkfs, not mkfs.ext3
(other machine uses ext3):
[3]root@elbrus:/mnt/3ware1>df -hT .
Filesystem    Type    Size  Used Avail Use% Mounted on
/dev/sda      ext2    230G   96G  122G  45% /mnt/3ware1

dmesg shows

EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 25480758
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 25480759
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 25480760
EXT2-fs error (device sd(8,0)): ext2_check_page: bad entry in directory #674668: directory entry across blocks - offset=0, inode=807414319, rec_len=13176, name_len=67
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944685
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944686
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944687
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944688
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944689
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944690
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944691
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944692
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944693
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944694
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944695
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944696
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944697
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944698
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944699
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944700
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944701
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944702
EXT2-fs error (device sd(8,0)): ext2_free_inode: bit already cleared for inode 16944703
EXT2-fs error (device sd(8,0)): ext2_free_blocks: bit already cleared for block 33884878
EXT2-fs error (device sd(8,0)): ext2_free_blocks: bit already cleared for block 33884879
EXT2-fs error (device sd(8,0)): ext2_free_blocks: bit already cleared for block 33884880
EXT2-fs error (device sd(8,0)): ext2_free_blocks: bit already cleared for block 33884881
EXT2-fs error (device sd(8,0)): ext2_free_blocks: bit already cleared for block 33884882
EXT2-fs error (device sd(8,0)): ext2_check_page: bad entry in directory #10831087: unaligned directory entry - offset=0, inode=168459530, rec_len=30218, name_len=111
EXT2-fs error (device sd(8,0)): ext2_free_blocks: bit already cleared for block 21664379
EXT2-fs error (device sd(8,0)): ext2_free_blocks: bit already cleared for block 21664382

I think this is the fifth bug (four of which are kernel bugs) I've
found in RHEL3U3 in my very brief burn-in tests.  I think I'll scratch
the ext2 partition and go with ext3, as it _seems_ stable-ish.  Anyone is
free to try to reproduce the problem with a >100GB ext2 partition and
the above test script.  Unfortunately I don't think I have time right now.
Comment 85 Ville Herva 2004-10-22 06:25:39 EDT
To be fair, there is a chance that the ext2 partition got corrupted
because of an unclean shutdown, but I'm not sure.  Since I really have
no time to dig into this, I'll recreate the scratch partition as ext3,
rerun the test, and if that's OK, I'll forget about this.
Comment 86 Misao 2004-10-22 13:05:11 EDT
I had the same problem using RHEL 3.0 ES U3, on a qmail server with a
1GB raided RAM disk.  The last kernel rpm,
kernel-smp-2.4.21-22.prune_icachefix.EL.i686.rpm, seems to have fixed it.
I used a modified version of the test script above (changing to a
different tar.gz file) and could not get it to freak out like it did
before.  Before, when qmail started to spam-bomb its emails as it
normally does, kswapd would suddenly go nuts and hit 100%+ CPU, near
max real memory, but not touch swap.  Now it doesn't misbehave, though
I still don't see it using any swap at all. 
Stats:
HP Proliant DL380 G3
2GB HP RAM
Dual Xeon 3.0Ghz
72GB+ total HD
Comment 87 Misao 2004-10-22 22:50:52 EDT
Hmm, my bug returned:

procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy wa id
 6  0      0 340588   1284 1500984    0    0    23    83   60    42  1  9  1 89
 7  0      0 341028   1320 1500904    0    0   396   860 1462 11664  7 85  0  8
 5  0      0 341312   1784 1500624    0    0   892   412  714 18509  1 99  0  0
 1  0      0 341900   1416 1500876    0    0   400   720 1444 11695  7 80  0 13
 2  0      0 341712   1424 1500940    0    0   212   584 1240  9039  8 82  0 10
15  0      0 342324   1384 1501036    0    0   256   740 1505  8509  8 70  0 22
 2  0      0 342524   1496 1500956    0    0   316  3312 1705 12072  9 76  0 15


Going to roll back to 20-15 kernel.
Comment 88 Göran Uddeborg 2004-10-23 07:37:12 EDT
Misao, could you explain what the problem is with the vmstat you show
in comment 87?  (Maybe it's obvious to everybody else.)  Did kswapd
take all that system time, and continue to do that?  Or is it
something else you want to say?
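One way to answer a question like the one above is to attribute CPU time to kswapd directly instead of reading it off vmstat's "sy" column: fields 14 and 15 of /proc/&lt;pid&gt;/stat are a process's user and system time in jiffies.  A sketch with invented sample values; on a live box the two readings would come from `awk '{print $14 + $15}' /proc/$(pidof kswapd)/stat`, taken 10 seconds apart:

```shell
# Two hypothetical utime+stime readings for kswapd, 10 seconds apart.
t0=485350
t1=486350
interval=10   # seconds between the two samples
hz=100        # RHEL3 i386 kernels tick at HZ=100 (jiffies per second)

# Jiffies consumed divided by jiffies elapsed gives the share of one CPU.
pct=$(( (t1 - t0) * 100 / (interval * hz) ))
echo "kswapd used ${pct}% of one CPU over ${interval}s"
```

A sustained reading near 100%, with no traffic in vmstat's si/so columns, matches the "kswapd spinning without making progress" pattern reported throughout this bug.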
Comment 89 Misao 2004-10-23 23:56:39 EDT
This is definitely kswapd going bonkers.
The only thing this server does is qmail smtp sends, and they don't
require this sort of overhead.
Comment 90 Christopher McCrory 2004-10-26 15:02:20 EDT
Is it possible to copy in the kernel-smp-unsupported rpm into
http://people.redhat.com/~lwoodman/.RHEL3/ ?

I'm seeing this same problem, but with reiserfs.

Comment 96 Misao 2004-10-27 15:58:35 EDT
Rollback to 21-15 did not help.
kswapd went nuts again, and caused the box to freeze up.
Same symptoms as last time.
Comment 97 Larry Woodman 2004-10-27 16:08:00 EDT
Misao, you must be experiencing a different problem than the other
folks are here.  Can you get your system into the "kswapd is going
nuts" state and get me several "AltSysrq-W" and "AltSysrq-M" outputs
so I can see whats going on?

Larry



Comment 98 Misao 2004-10-27 16:14:58 EDT
Created attachment 105864 [details]
top view as system froze on 21-15 kernel

top view as system froze on 21-15 kernel.
Comment 99 Larry Woodman 2004-10-27 17:10:10 EDT
But does it fail on this kernel?

>>>http://people.redhat.com/~lwoodman/.RHEL3/
Comment 100 Anders Nielsen 2004-10-28 05:13:31 EDT
anielsen@jobindex.dk
Comment 101 Ernie Petrides 2004-10-28 19:38:33 EDT
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-23.EL).
Comment 102 Ernie Petrides 2004-10-29 19:47:13 EDT
*** Bug 137513 has been marked as a duplicate of this bug. ***
Comment 103 Rob ter Veer 2004-11-02 04:32:07 EST
Ernie/Larry,

We've tested the patched kernel extensively and all seems to be 
working fine. 

Is the version mentioned by you officially supported and if so, where 
can we get it (2.4.21-23.EL) right now? We need to get our machine 
into production soon and would rather do it with a fully supported 
kernel.
Comment 104 Ed Voncken 2004-11-02 09:11:43 EST
I'm seeing a disturbing trend in our environment (approx. 40 RHEL
servers; production CVS, ATG, etc.).

Over the last 3 days, 4 servers have keeled over.  They've been running
the 2.4.21-20.ELsmp kernel since 2004-10-09 (our last maintenance
window).

On one of the servers I verified that kswapd is indeed consuming all
CPU (top output):

11 root      25   0     0    0     0 SW   98.0  0.0  2133m   3 kswapd

Is there an estimate when the patched kernel becomes available through
RHN?

Greetings,
  Ed.
Comment 105 Ernie Petrides 2004-11-02 15:49:01 EST
The U4 kernel is scheduled to be released during the first week
of December.  The 2.4.21-22.EL kernel is the latest one in the RHN
beta channel for RHEL3.  The 2.4.21-23.EL kernel is undergoing
internal Q/A at the moment, and I expect it to appear in the RHN
beta channel within a few days.  However, these kernels are not
yet released, and therefore they are unsupported and unsuitable
for production use.

There will be at least one more respin before U4 is released.
Comment 106 Tom Sightler 2004-11-02 18:30:23 EST
"..they are unsupported and unsuitable for production use."

I can't help myself; that is just too funny.  I would say that
2.4.21-20.EL, whether supported or not, is "unsuitable for production
use."  Ed has had 10% of his 40 servers crash in the last few days,
I've seen 8 out of 12 of my servers crash at least once due to this
bug, and a friend who works at a company with about 50 RHEL servers
has had at least 15 servers hang in what would appear to be the
same manner.  That's not very "suitable for production use" if you ask
me, although it may be "supported."

Seriously though, if you need this fix in a "supported" way now, you
can probably just open a real support case with Red Hat.  Support will
likely simply tell you to run the kernel available in this Bugzilla
(perhaps even provide you a copy), and report any problems you have. 
They will then continue to support you even though you are running the
unsupported kernel, because you are doing so at the direction of support.

Later,
Tom
Comment 108 John Caruso 2004-11-10 17:24:09 EST
The test kernels mentioned in this bug appear to be no longer 
available.  We've been seeing this problem as well, and older kernels 
have even more severe problems, so we're interested in getting a copy 
of the test kernel.  Can it be made available again?  And/or, when is 
it going to be released?

As far as I can tell there is currently *no* stable kernel that's 
generally available for RHEL3.
Comment 109 Ernie Petrides 2004-11-10 18:03:04 EST
John, please try the kernel in the RHN beta channels.  Right now,
2.4.21-23.EL is there.  Tonight we're building -24.EL, but it won't
be available in the RHN beta channel until about a week from now.
Comment 110 Tom Sightler 2004-11-11 09:30:15 EST
Just as a point of curiosity, is anyone running the -23.EL kernels
with success?  I tested this kernel on two of our lab systems so far
and both have experienced hard locks, appearing to be caused by the
system going OOM.  These machines were previously stable for weeks
running -22_prune_icachefix.EL but the exact same load with -23.EL
succeeded in hanging both of these systems in less than 24 hours.

Perhaps it was a mistake in our testing and if others are seeing
success I may try again and capture any information I can after a hang.

Later,
Tom

Comment 111 John Caruso 2004-11-11 12:21:14 EST
Tom: After being pointed at the beta channel yesterday, I installed 
2.4.21-23.EL (hugemem) on a backup database server at about 9pm, with 
the intention of seeing if it was suitable for use on the primary.  
And this morning at 8am it was already hung--despite the fact that 
that machine was doing next to nothing at all.  I don't know what 
specifically was causing the hang, since there was nothing in the 
logs and I wasn't even able to see the console messages, but I have 
to conclude that 2.4.21-23.EL is in fact worse than 2.4.21-20.EL.
Comment 112 Tom Sightler 2004-11-11 16:02:33 EST
Thanks for the input.  Shortly after posting that note this morning I
rebooted one of the two systems back to -23.EL but then had lots of
meetings so didn't get to do anything with it.  I came back just a few
minutes ago to find a hung system.  Before the hang the system  was
just sitting there idle at a login prompt.

I think the -23 kernels have some serious issues.  Has internal
testing at Red Hat turned up any such problems?  Do we need to file
different bugs against this issue?  I certainly don't want to see yet
another RHEL3 update kernel that is even more unusable than the
current one.

Later,
Tom
Comment 113 Larry Woodman 2004-11-11 17:01:58 EST
Can everyone who is having a problem with .23 get me top, vmstat and
AltSysrq M, W and T outputs so I can see where it is hanging? 

Larry Woodman
Comment 114 Tom Sightler 2004-11-11 17:52:48 EST
I'll be out of town until Wednesday but I will have physical access to
a couple of servers this weekend that are very similar to the systems
that are hanging with -23.EL in our lab.  These systems have been
running a slightly patched -20.EL (a patch for an unrelated problem -
Bugzilla 118839) for 8+ weeks without issues so we'll see how they
react with -23.EL and if I can reproduce the hang on those systems
I'll send you the output you requested.

Later,
Tom




Comment 115 Göran Uddeborg 2004-11-12 05:17:40 EST
Now we have tested 2.4.21-23.ELsmp, and we don't see any problem with
it.  I conclude that the bug I originally reported is solved.

I obviously can't answer for all the people who have been listening
and commenting here.  And, obviously again, there are some kind of
problems with this kernel too.  There isn't really anything which says
they are related, though.

So I close this report now, since the issue I reported is solved.  I
encourage you, Tom, to start a separate case for the problems you are
seeing.  After all, if you started seeing hangs in the current kernel
and did a bugzilla search, would you read all the way down to comment
110 to find that it was discussed there? :-)

But please put a note here what bug number you get, for those of us
who want to follow that case too.  Even if we haven't seen any locks
here (yet), I would like to follow that discussion.
Comment 116 Ernie Petrides 2004-11-12 17:48:22 EST
Göran, thanks for the verification.  I'm reverting the state of this
bug to MODIFIED, since U4 is not quite yet released.  The Errata System
will change it to CLOSED/ERRATA automatically when U4 becomes available
in the main RHN channel.
Comment 117 Göran Uddeborg 2004-11-13 04:29:01 EST
:-) I used the resolution "NEXTRELEASE" exactly because U4 isn't
available yet.  It will be the "next release"! :-)

Of course, it doesn't really matter to me.  This bug is fixed in this
kernel, that is what I really care about.  And maybe "next release" is
meant to refer to RHEL4.
Comment 118 Ernie Petrides 2004-11-15 14:58:01 EST
Your last statement is correct: "next release" refers to RHEL4.
Comment 119 Tom Sightler 2004-11-16 08:56:04 EST
At this time I would agree that the lockup I'm seeing with -23.EL may 
indeed be yet another, different issue.  I will continue to test and 
gather information and open a new Bugzilla if I can make the problem 
reproducible (I'm only 2 for 3 so far).

Later,
Tom
Comment 120 Larry Woodman 2004-11-16 11:50:05 EST
Tom, if you can get me some AltSysrq M, T and W outputs when the
lockup occurs I can help you.

Larry Woodman 
Comment 121 Ville Herva 2004-11-16 12:04:16 EST
If you create a new bug for the -23.EL lockup, please add the bug 
number to this case, so we can follow it :).
Comment 122 Tom Sightler 2004-11-16 22:26:57 EST
Larry,

I have no doubt you can help me once I gather the information you 
requested, however, I've been at another office this week and could 
not reproduce the issue on either of the two servers I was testing on 
this weekend, even though they are practically identical to the 
systems in our lab that both experienced a problem.

I am back to my normal office this week and I will try my best to 
find time to attempt reproducing this issue on the lab servers that 
experienced the problems last week.

Thanks for your help.

Later,
Tom


Comment 123 Ernie Petrides 2004-11-17 18:29:00 EST
Tom, the latest kernel in the RHN beta channel is now 2.4.21-25.EL.
Please use that version for any future testing.  Thanks.
Comment 124 John 2004-11-18 19:30:22 EST
When will an official release be available for this bug?
Comment 125 Ernie Petrides 2004-11-18 19:57:29 EST
John, 1st (full) week in December.
Comment 126 Tolga Tarhan 2004-11-30 13:17:25 EST
RedHat Team --

I'm having an issue that seems like it might be related to this, but 
I want to confirm that it is before I try the beta kernel.

Here's a simple explanation: If I run tar to extract a large tar 
archive (about 30 gigabytes) full of small (100k or so) files most of 
the system memory (6 GB) gets used up as buffers/cache (as it 
should), however, kswapd goes nuts using almost all of the CPU and 
slowing down everything, including the tar process.

It's worth noting that the used swap memory does not go up (it's at 
about 7 megs and just staying there), so kswapd is obviously being 
ineffective.  From what I can tell, kswapd is just sitting there 
churning up CPU for no good reason.

Does this sound like a match?
Comment 127 Tolga Tarhan 2004-11-30 13:29:54 EST
Interestingly enough, suspending the tar process does NOT help the 
problem.  The free physical memory went up to 121 megs.  Not counting 
buffers and cache, there are 3738 megs free.  However, kswapd is still 
taking 99% of the CPU.  There doesn't seem to be anything I can do to 
stop it, other than reboot.
Comment 128 Larry Woodman 2004-11-30 13:31:57 EST
Tolga, this sounds exactly like the problem that we fixed in this bug
report.  Please try out the latest RHEL3-U4 kernel.

Larry Woodman
Comment 129 Ville Herva 2004-11-30 14:46:22 EST
Larry,

our stress test:

AMOUNT=200
DIR=/tmp/foo
mkdir $DIR
while true; do
 cd $DIR
 for i in $(seq 1 $AMOUNT); do
  mkdir $i
  cd $i
  tar xf /tmp/linux-2.4.27.tar
  cd $DIR
 done
 for i in $(seq 1 $AMOUNT); do
  echo -n "$i: " >> md5sum.log
  find $i -type f -print0 | xargs -0 cat | md5sum >> md5sum.log
 done
 for i in $(seq 1 $AMOUNT); do
  echo -n "$i: " >> md5sum2.log
  find $i -type f -print0 | xargs -0 cat | md5sum >> md5sum2.log
 done
 date >> filefill.log
 df -m . >> filefill.log
 for i in $(seq 1 $AMOUNT); do
   rm -rf ./"$i"
 done
done &
while true; do updatedb; done &



runs pretty fine with 2.4.21-25.ELsmp (2x Xeon 3GHz HT) - it used to
trigger the kswapd problem every time - but with the UP kernel
(2.4.21-25.EL) it slows down to making almost no progress.  It doesn't
stop completely or lock up, though.

On a related note, even the SMP (2.4.21-25.ELsmp) kernel is orders of
magnitude slower than 2.6.10-rc2, probably due to the vastly better
anticipatory I/O scheduler 2.6 has...

Comment 130 Tolga Tarhan 2004-11-30 15:15:34 EST
Larry,

Thanks, installing 2.4.21-25.ELsmp from the beta channel _completely_ 
eliminated the problem for us.

Just out of curiosity, was this bug inherited from the upstream 
kernel code, or is it specific to the RH kernels?
Comment 131 Rik van Riel 2004-11-30 16:51:39 EST
Tolga,

I guess I can take most of the blame for these bugs, they were in part
inherited from the 2.4-rmap VM and in part RHEL3 specific.  The reason
for that is that the 2.4 upstream VM simply does not run at all on the
larger configurations supported by RHEL3, which meant we had to go
with a different VM.

Luckily the VM in the upstream 2.6 kernel does seem to scale well
enough, so we won't need to make as many changes for RHEL4 ;)
Comment 132 Larry Woodman 2004-11-30 17:23:15 EST
No, this bug was actually introduced by Red Hat in the RHEL3-U3
kernel.  The problem started when this patch was added to RHEL3-U3 to
prevent refiling inodes that are I_LOCK'd:
*****************************************************************
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -285,7 +285,8 @@ void refile_inode(struct inode *inode)
               BUG(); */
       if (!inode) return;
       spin_lock(&inode_lock);
-       __refile_inode(inode);
+       if (!(inode->i_state & I_LOCK))
+               __refile_inode(inode);
       spin_unlock(&inode_lock);
}
*********************************************************************

Since prune_icache sets I_LOCK before calling invalidate_inode_pages,
the inode never gets refiled and therefore never moved to the
inode_unused list, kswapd never makes progress freeing up unused
inodes and the list just keeps growing!


Larry
Comment 133 Ron N 2004-12-01 15:18:22 EST
Hello,

I am an escalation engineer for the ProLiant line at HP.  I have a
customer with this problem and have duplicated a similar issue.  The
issue shows up with NFS under heavy I/O.  Can I get a link to the
beta kernel fix you are discussing?
Comment 134 Brent Fox 2004-12-01 15:29:58 EST
Ron, do you have an RHN account?  That's where the RHEL beta packages
are made available.
Comment 135 Jason Smith 2004-12-01 16:40:26 EST
Can someone from Red Hat please post this 2.4.21-25.ELsmp U4 beta
kernel somewhere?  I do not have an RHN account since we are using a
satellite server here which currently does not support syncing the
beta channels.
Comment 136 Peter Martuccelli 2004-12-01 16:57:21 EST
What archs are you looking to test?
Comment 137 Jason Smith 2004-12-01 17:04:32 EST
I only need an rpm for i686 SMP.  - Thanks.
Comment 138 Peter Martuccelli 2004-12-01 18:00:33 EST
You will find the kernel you need here:

http://people.redhat.com/peterm/.132639/

We are rebuilding the U4 GA kernel again tonight, but the kernel you
will download has the fixes you need.
Comment 139 Simon Matter 2004-12-06 02:33:18 EST
Kernel 2.4.21-26.EL fixes the problem I had with RHEL3 U3. I'm setting
up a new fileserver which replaces our old kernel 2.2 based server.
The new highend box connected to a SAN (HP EVA3000) just didn't manage
to rsync the files from the old server to the SAN.
Comment 140 Albert Graham 2004-12-09 10:33:35 EST
Created attachment 108212 [details]
My Complaint about RH Enterprise Service.
Comment 141 Matt Heaton 2004-12-12 11:15:56 EST
So does 2.4.21-0.1 that was just released a week ago fix this issue,
or do I still need to install the "beta" kernel?  Please let me know.

Thanks,
Matt
Comment 142 Simon Matter 2004-12-13 02:38:34 EST
2.4.21-0.1 doesn't fix this issue! Only 2.4.21-26.EL beta does.
Comment 143 Ernie Petrides 2004-12-13 15:06:56 EST
What was released a week+ ago was 2.4.21-20.0.1.EL, which was a
security errata release based on the U3 kernel.  The fix for this
bug is in the U4 kernel (2.4.21-27.EL), which is scheduled for
release a week from today.
Comment 144 Olle Liljenzin 2004-12-13 15:12:54 EST
One of the machines running 2.4.21-23 got a kernel panic today.  It
looks related to kswapd and possibly to this bugzilla, but I don't
know if this is already fixed in later kernels.

I will attach the last console output.
Comment 145 Olle Liljenzin 2004-12-13 15:16:05 EST
Created attachment 108467 [details]
console output from kernel panic
Comment 146 Albert Graham 2004-12-15 21:07:25 EST
I have tested both 2.4.21-23 and 2.4.21-26 and can confirm once again
that this issue is not resolved.

I simply run two copies of tar (creating a backup of the whole OS) to
a second disk /backup, (excluding /proc /backup etc).

I let the first copy run for a few mins until most of the memory is
used by disk cache, then I start the second copy and BANG!

2.0G    sysback-test1.tar.gz  <-- 1st copy
208M    sysback-test2.tar.gz  <-- 2nd copy


I tested this on two different machines now, one with 4GB and one with
8GB, with the same result, i.e. a panic.

All of my tests work flawlessly using 2.4.21-15.0.4.ELsmp on both of
the above machines.  In fact I can run as many copies of tar as I like
without crashing this kernel.

Here is the kernel panic screenshot.
Comment 147 Albert Graham 2004-12-15 21:11:50 EST
Created attachment 108667 [details]
Kernel 2.4.21-23.ELsmp Panic under load
Comment 148 Albert Graham 2004-12-15 21:22:06 EST
Can someone please point me to the latest beta kernel > 2.4.21-26 so I
can test it under load. Thanks
Comment 149 Albert Graham 2004-12-15 21:37:25 EST
I've just realised that I also had a TOP view running on one of my
consoles at the time of the hang, if it helps:

362 processes: 360 sleeping, 2 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   95.6%    0.0%    9.6%   0.4%     4.0%  191.2%   97.6%
           cpu00   26.6%    0.0%    3.1%   0.7%     1.9%   62.4%    4.9%
           cpu01   22.3%    0.0%    2.1%   0.0%     0.1%   69.7%    5.5%
           cpu02   19.3%    0.0%    2.3%   0.0%     0.3%   32.0%   45.8%
           cpu03   27.6%    0.0%    1.9%   0.0%     1.7%   27.0%   41.5%
Mem:  4100856k av, 4082228k used,   18628k free,       0k shrd, 209688k buff
                   1307848k actv, 2043932k in_d,   63532k in_c
Swap: 1959920k av,       0k used, 1959920k free                 2732176k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 8596 root      25   0   688  672   268 S    60.4  0.0   9:12   2 gzip
 4369 tomcat    25   0  932M 932M 18204 S    33.4 23.2  10:54   2 java
12339 root      15   0  1128 1112   760 D     3.9  0.0   0:07   1 tar
 8594 root      15   0  1128 1112   740 D     2.1  0.0   1:29   2 tar
   11 root      15   0     0    0     0 SW    1.1  0.0   0:07   2 kswapd
12341 root      25   0   680  664   272 S     0.5  0.0   0:55   1 gzip
12874 root      16   0  1660 1648   912 R     0.5  0.0   0:00   0 top
 3468 root      16   0     0    0     0 SW    0.3  0.0   0:04   2
kjournald
 4285 named     25   0  4644 4644  2104 S     0.3  0.1   0:16   2 named
    1 root      15   0   888  880   468 S     0.0  0.0   0:05   2 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0
migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1
migration/1
    4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2
migration/2
    5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3
migration/3
    6 root      15   0     0    0     0 SW    0.0  0.0   0:00   3 keventd 

Comment 150 Albert Graham 2004-12-15 21:44:07 EST
While I'm at it, here is some SAR output; interestingly, it shows
>100% in some columns:
12:00:00 AM       CPU     %user     %nice   %system     %idle
12:10:00 AM       all      6.70      0.00      0.87     92.43
12:20:00 AM       all      4.16      0.00      0.63     95.21
12:40:00 AM       all    100.01      0.00      0.00      0.00
12:50:00 AM       all      5.42      0.00      0.75     93.83
01:00:01 AM       all     17.54      0.00      2.25     80.21
01:10:00 AM       all     22.94      0.00      4.47     72.58
01:50:00 AM       all    100.01      0.00    100.01      0.00
02:00:01 AM       all      1.77      0.00      0.34     97.89
02:10:00 AM       all      0.07      0.00      0.13     99.80
02:20:01 AM       all      0.08      0.00      0.12     99.81
02:30:00 AM       all      0.06      0.00      0.09     99.85
02:40:00 AM       all      0.07      0.00      0.12     99.81
Average:          all    363426.05      0.00      0.33    100.44


I thought M$ scandisk was the only system that could calculate
>100% :)
Comment 151 Ernie Petrides 2004-12-16 00:43:02 EST
Albert, the U4 beta kernel (which we expect to be released on Monday)
can be found here:

  ftp://partners.redhat.com/a61d109e2483b0bf579b0b5f90a5ea8c/2.4.21-27.EL/

If you have a problem with that kernel, please open a new bug.  This
one will be automatically closed when U4 is released.  Thanks in advance.
Comment 152 Chris Evich 2004-12-16 09:08:30 EST
Received report of behavior possibly relevant to this bug, pasting it
here and also attaching files...


One of our servers was again in "kswapd overload mode". It was running
the 2.4.21-20.0.1.ELsmp kernel, but we have seen the same / similar
behavior with the beta (2.4.21-25.ELsmp) as well.  I was able to run a
couple of commands while the system was having problems and will
attach them. Characteristics of servers: 
  
- Dell PowerEdge/redhat RHEL3 servers
- many file systems (25+ disk partitions, approx 35 - 50 GB each, ext3) 
- lots of files / directories have been opened / read (updatedb and
large Maildir mailboxes) 
- lots of files / directories monitored by FAM (large IMAP Maildir
mailboxes) 
- 4 GB of RAM in servers 
- NFS services enabled   
 
The problem can be reproduced by running a stress test.  It simply
tars all file systems to /dev/null in parallel and executes the
/etc/cron.daily/slocate.cron (updatedb) script as well.
 
One server (running 2.4.21-25.ELsmp beta kernel) wedged after about an
hour. In this state, the system refuses any remote connections e.g.
ssh (although it takes your login and password and then hangs) and the
console goes completely dead. 

I have seen similar behavior with the 2.4.21-15.ELsmp kernel as well.
I ran this same test on another of our servers with the 2.4.21-4.ELsmp
kernel.  It did not seem to be affected, though the hardware is
slightly different.
Comment 153 Chris Evich 2004-12-16 09:12:49 EST
Created attachment 108693 [details]
From Comment 152
Comment 154 Chris Evich 2004-12-16 09:15:04 EST
Created attachment 108694 [details]
re: Comment 152
Comment 155 Larry Woodman 2004-12-16 09:53:05 EST
Chris, can you get me a few AltSysrq-W outputs when the system is in
the state where kswapd is eating up the CPU time?  I did fix a problem
in U4 where inodes were not getting refiled and therefore were not
getting reclaimed by kswapd, but perhaps you are having a different
problem.

Larry Woodman
Comment 157 Need Real Name 2004-12-17 10:48:41 EST
OK, the system has been basically idle overnight and kswapd has
crept up to ~35% of CPU:

(top cmd output)--------------------------------------------

 07:45:50  up 19:09,  3 users,  load average: 1.49, 1.41, 1.33
140 processes: 138 sleeping, 2 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    1.2%    0.0%   38.0%   0.0%     0.0%   23.6%  336.0%
           cpu00    1.5%    0.0%    0.7%   0.0%     0.0%    0.1%   97.4%
           cpu01    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
           cpu02    0.0%    0.0%    1.3%   0.0%     0.0%   14.5%   84.0%
           cpu03    0.0%    0.0%   35.9%   0.0%     0.0%    9.1%   54.8%
Mem:  3598952k av, 1376124k used, 2222828k free,       0k shrd, 752392k buff
                    748496k actv,  309236k in_d,   76616k in_c
Swap: 4096532k av,       0k used, 4096532k free                 487216k cached
                                                                     
                          
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
   11 root      16   0     0    0     0 RW   37.3  0.0 135:35   2 kswapd
 2946 gdm       15   0 19328  18M  7512 S     0.9  0.5   0:54   0
gdmgreeter
 2593 root      15   0 21676  12M  2364 S     0.5  0.3   5:16   1 X
11250 root      15   0  1284 1276   892 R     0.3  0.0   0:00   0 top
 1581 root      25   0  8420 8392  3500 S     0.1  0.2   5:30   3
dcstor32d
 2022 root      19   0   572  572   428 S     0.1  0.0   0:00   1 racsrvc
    1 root      15   0   920  912   464 S     0.0  0.0   0:06   0 init
    
(alt-sysrq-w and alt-sysrq-m output)--------------------------------

Dec 17 07:42:28 platoon kernel: SysRq : Show CPUs
Dec 17 07:42:28 platoon kernel: CPU3:
Dec 17 07:42:28 platoon kernel: c6b7de8c 00000003 c01ce8ef 00000000
c03b2d34 00000077 00000006 c01ceb7a
Dec 17 07:42:28 platoon kernel:        00000077 c6b7df7c c04f74c4
f476d000 f476d000 c04f74c4 c6b7df7c c01ceadd
Dec 17 07:42:28 platoon kernel:        00000077 c6b7df7c c04f74c4
f476d000 00000011 00000002 00000003 c6b7df7c
Dec 17 07:42:28 platoon kernel: Call Trace:   [<c01ce8ef>]
sysrq_handle_showcpus [kernel] 0xf (0xc6b7de94)
Dec 17 07:42:28 platoon kernel: [<c01ceb7a>] __handle_sysrq_nolock
[kernel] 0x7a (0xc6b7dea8)
Dec 17 07:42:28 platoon kernel: [<c01ceadd>] handle_sysrq [kernel]
0x5d (0xc6b7dec8)
Dec 17 07:42:28 platoon kernel: [<c01cc61b>] handle_scancode [kernel]
0x2ab (0xc6b7deec)
Dec 17 07:42:28 platoon kernel: [<c01cd7fd>] handle_kbd_event [kernel]
0xbd (0xc6b7df14)
Dec 17 07:42:28 platoon kernel: [<c01cd81c>] keyboard_interrupt
[kernel] 0x1c (0xc6b7df2c)
Dec 17 07:42:28 platoon kernel: [<c010dd39>] handle_IRQ_event [kernel]
0x69 (0xc6b7df30)
Dec 17 07:42:28 platoon kernel: [<c010df79>] do_IRQ [kernel] 0xb9
(0xc6b7df50)
Dec 17 07:42:28 platoon kernel: [<c010dec0>] do_IRQ [kernel] 0x0
(0xc6b7df74)
Dec 17 07:42:28 platoon kernel: [<c0109100>] default_idle [kernel] 0x0
(0xc6b7df7c)
Dec 17 07:42:28 platoon kernel: [<c0109100>] default_idle [kernel] 0x0
(0xc6b7df90)
Dec 17 07:42:28 platoon kernel: [<c0109129>] default_idle [kernel]
0x29 (0xc6b7dfa4)
Dec 17 07:42:28 platoon kernel: [<c01091c2>] cpu_idle [kernel] 0x42
(0xc6b7dfb0)
Dec 17 07:42:28 platoon kernel: [<c01295e3>] printk [kernel] 0x153
(0xc6b7dfcc)
Dec 17 07:42:28 platoon kernel:
Dec 17 07:42:28 platoon kernel: CPU2:
Dec 17 07:42:28 platoon kernel: c6b7ff64 00000002 c011c91f 00000000
00001f7c c03f2caa c0109100 00000000
Dec 17 07:42:28 platoon kernel:        c6b7e000 c6b7e000 c6b7e000
c0109100 00000000 00000068 00000068 fffffffb
Dec 17 07:42:28 platoon kernel:        c0109129 00000060 00000246
c01091c2 0602080b 00000000 00000000 00000000
Dec 17 07:42:28 platoon kernel: Call Trace:   [<c011c91f>]
smp_call_function_interrupt [kernel] 0x2f (0xc6b7ff6c)
Dec 17 07:42:28 platoon kernel: [<c0109100>] default_idle [kernel] 0x0
(0xc6b7ff7c)
Dec 17 07:42:28 platoon kernel: [<c0109100>] default_idle [kernel] 0x0
(0xc6b7ff90)
Dec 17 07:42:28 platoon kernel: [<c0109129>] default_idle [kernel]
0x29 (0xc6b7ffa4)
Dec 17 07:42:28 platoon kernel: [<c01091c2>] cpu_idle [kernel] 0x42
(0xc6b7ffb0)
Dec 17 07:42:28 platoon kernel: [<c01295e3>] printk [kernel] 0x153
(0xc6b7ffcc)
Dec 17 07:42:28 platoon kernel:
Dec 17 07:42:28 platoon kernel: CPU0:
Dec 17 07:42:28 platoon kernel: c03f1f88 00000000 c011c91f 00000000
00001fa0 c03f2caa c0109100 c5266b00
Dec 17 07:42:28 platoon kernel:        c03f0000 c03f0000 c03f0000
c0109100 00000000 00000068 00000068 fffffffb
Dec 17 07:42:28 platoon kernel:        c0109129 00000060 00200246
c01091c2 0002080b 00099800 c0107000 0008e000
Dec 17 07:42:28 platoon kernel: Call Trace:   [<c011c91f>]
smp_call_function_interrupt [kernel] 0x2f (0xc03f1f90)
Dec 17 07:42:28 platoon kernel: [<c0109100>] default_idle [kernel] 0x0
(0xc03f1fa0)
Dec 17 07:42:28 platoon kernel: [<c0109100>] default_idle [kernel] 0x0
(0xc03f1fb4)
Dec 17 07:42:28 platoon kernel: [<c0109129>] default_idle [kernel]
0x29 (0xc03f1fc8)
Dec 17 07:42:28 platoon kernel: [<c01091c2>] cpu_idle [kernel] 0x42
(0xc03f1fd4)
Dec 17 07:42:28 platoon kernel: [<c0107000>] stext [kernel] 0x0
(0xc03f1fe0)
Dec 17 07:42:28 platoon kernel:
Dec 17 07:42:28 platoon kernel: CPU1:
Dec 17 07:42:28 platoon kernel: c6979f64 00000001 c011c91f 00000000
00001f7c c03f2caa c0109100 c043b280
Dec 17 07:42:28 platoon kernel:        c6978000 c6978000 c6978000
c0109100 00000000 00000068 00000068 fffffffb
Dec 17 07:42:28 platoon kernel:        c0109129 00000060 00200246
c01091c2 0102080b 00000000 00000000 00000000
Dec 17 07:42:28 platoon kernel: Call Trace:   [<c011c91f>]
smp_call_function_interrupt [kernel] 0x2f (0xc6979f6c)
Dec 17 07:42:28 platoon kernel: [<c0109100>] default_idle [kernel] 0x0
(0xc6979f7c)
Dec 17 07:42:28 platoon kernel: [<c0109100>] default_idle [kernel] 0x0
(0xc6979f90)
Dec 17 07:42:28 platoon kernel: [<c0109129>] default_idle [kernel]
0x29 (0xc6979fa4)
Dec 17 07:42:28 platoon kernel: [<c01091c2>] cpu_idle [kernel] 0x42
(0xc6979fb0)
Dec 17 07:42:28 platoon kernel: [<c01292b3>] call_console_drivers
[kernel] 0x63 (0xc6979fc4)
Dec 17 07:42:28 platoon kernel: [<c01295e3>] printk [kernel] 0x153
(0xc6979ffc)
Dec 17 07:42:28 platoon kernel:
Dec 17 07:42:35 platoon kernel: SysRq : Show Memory
Dec 17 07:42:35 platoon kernel: Mem-info:
Dec 17 07:42:35 platoon kernel: Zone:DMA freepages:  2495 min:     0
low:     0 high:     0
Dec 17 07:42:35 platoon kernel: Zone:Normal freepages:  1838 min: 
1279 low:  4544 high:  6304
Dec 17 07:42:35 platoon kernel: Zone:HighMem freepages:552040 min:  
255 low: 10750 high: 16125
Dec 17 07:42:35 platoon kernel: Free pages:      556373 (552040 HighMem)
Dec 17 07:42:35 platoon kernel: ( Active: 186416/76696,
inactive_laundry: 41103, inactive_clean: 19194, free: 556373 )
Dec 17 07:42:35 platoon kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2495
Dec 17 07:42:35 platoon kernel:   aa:0 ac:152947 id:0 il:34793 ic:2482
fr:1838
Dec 17 07:42:35 platoon kernel:   aa:14247 ac:19222 id:76696 il:6310
ic:16712 fr:552040
Dec 17 07:42:35 platoon kernel: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB
1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 2*4096kB = 9980kB)
Dec 17 07:42:35 platoon kernel: 536*4kB 23*8kB 6*16kB 0*32kB 1*64kB
0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 7352kB)
Dec 17 07:42:35 platoon kernel: 684*4kB 1220*8kB 801*16kB 516*32kB
359*64kB 241*128kB 126*256kB 61*512kB 17*1024kB 4*2048kB 494*4096kB =
2208160kB)
Dec 17 07:42:35 platoon kernel: Swap cache: add 0, delete 0, find 0/0,
race 0+0
Dec 17 07:42:35 platoon kernel: 14032 pages of slabcache
Dec 17 07:42:35 platoon kernel: 308 pages of kernel stacks
Dec 17 07:42:35 platoon kernel: 0 lowmem pagetables, 708 highmem
pagetables
Dec 17 07:42:35 platoon kernel: Free swap:       4096532kB
Dec 17 07:42:35 platoon kernel: 917472 pages of RAM
Dec 17 07:42:35 platoon kernel: 688096 pages of HIGHMEM
Dec 17 07:42:35 platoon kernel: 17734 reserved pages
Dec 17 07:42:35 platoon kernel: 261654 pages shared
Dec 17 07:42:35 platoon kernel: 0 pages swap cached
Comment 158 Need Real Name 2004-12-17 10:50:21 EST
p.s. forgot to mention, system is running beta kernel:

[root@platoon log]# uname -a
Linux platoon 2.4.21-25.ELsmp #1 SMP Fri Nov 12 21:34:51 EST 2004 i686
i686 i386 GNU/Linux
Comment 159 Simon Matter 2004-12-17 10:53:58 EST
Why don't you test with 2.4.21-26.EL or newer, which has fixed it for me?
Comment 160 Need Real Name 2004-12-17 11:08:33 EST
Wow! These kernels are coming out like hotcakes. Well, I have been
working with the Red Hat support folks and the 2.4.21-25.ELsmp beta
kernel was the one they pointed me to. I will check with them to get
the appropriate one. Thanks, John.
Comment 161 Larry Woodman 2004-12-17 11:41:53 EST
I think what's going on here is that updatedb runs overnight and
basically touches every file on your system's mounted disks.  This
results in the inode cache growing and consuming most of the system
memory.  You should be able to simulate this by running "du /".  You
should see the inode and dentry caches growing (in /proc/slabinfo and
/proc/sys/fs/inode-nr).  When these things run, kswapd chomps away at
the system memory in order to fill the inode and dentry cache requests.
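The growth Larry describes can be watched with something like the
following (a sketch: the sample numbers are invented, and the column
layout assumed is the 2.4-era /proc/slabinfo format, where field 2 is
the active-object count):

```shell
# On a live box you would sample the real files while "du /" runs,
# e.g.:
#   watch 'grep -E "^(inode_cache|dentry_cache) " /proc/slabinfo; \
#          cat /proc/sys/fs/inode-nr'
# Below, a hypothetical captured excerpt stands in for /proc/slabinfo.
cat > /tmp/slabinfo.sample <<'EOF'
inode_cache       441600 441600    512  ...
dentry_cache      913920 913920    128  ...
EOF
# Field 2 of the 2.4-era slabinfo format is the active-object count.
awk '{ printf "%s: %s objects\n", $1, $2 }' /tmp/slabinfo.sample
```

Watching those counts climb while pagecache and free memory shrink is
the signature of the cache growth described in this comment.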

I am working on providing upper limits to the size of the inode and
dentry caches, but it won't be complete in time for RHEL3-U4.  While
this is inconvenient, kswapd won't kill your system, since RHEL3-U4
contains the prune_icache fix that allows the inodes to be rapidly
reclaimed.


Larry Woodman
Comment 162 Ville Herva 2004-12-17 14:34:59 EST
In my stress test (original report:
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=105430&action=view
 - the symptoms were different for later RHEL3 beta kernels that
fixed the problem at least partly), I had updatedb running as well. It
was definitely the kernel dentry/inode cache filling up. When I
watched slabinfo during the stress test, it was the dentry cache, not
the inode cache, that filled up. (Does updatedb actually need to open
inodes?)

This, btw, has been a problem for the 2.4 kernel on large memory
machines for a long time: I recall a locking-up Itanium / 2GB box from
a couple of years ago, where the problem was indeed the kernel's
inability to reap the cache.
Comment 163 Larry Woodman 2004-12-17 14:52:46 EST
Ville, usually dentries and inodes are one to one.  The dentry code
keeps a reference to the inode, so it's not possible to free an inode
until the dentry has been freed.  Try running "du /" and watching the
dentry and inode caches (and what happens to the pagecache and
anonymous memory).

I am currently working on a patch that will limit the size of both the
dentry and inode caches, so that if you try to allocate above some
number you will be forced to free some up.

Larry
Comment 164 Need Real Name 2004-12-17 16:47:22 EST
Just to follow up: we first experienced the kswapd problem while
copying large file systems from one disk to another. The kswapd load
got so high that the server became unusable due to extreme
slowness. We also discovered that this bug appears when updatedb and
fam are running, along with high amounts of file-related activity,
e.g. file system tars. We have been using stress tests that involve
starting updatedb and running multiple parallel copies (tar) to
/dev/null. Not only do we get the kswapd problem, the system finally
gets to a point where it completely locks up and does not respond to
remote or console logins and has to be rebooted. I am able to issue
alt-sysrq commands but they are not logged to the messages file.
However, we do see many errors in the log files of this type:
--------------------------------------------------------------
Dec 17 11:59:56 platoon kernel: scsi : aborting command due to timeout
: pid 34682449, scsi0, channel 0, id 0, lun

Dec 17 13:07:19 platoon kernel: blk: queue e6c4d018, I/O limit 4095Mb
(mask 0xff ffffff)
----------------------------------------------------------------
The general configuration of our servers are:

- many file systems (25+ disk partitions, approx 35 - 50 GB each, ext3)
- lots of files / directories have been opened / read (updatedb and
large Maildir mailboxes)
- lots of files / directories monitored by FAM (large IMAP Maildir
mailboxes)
- 4 GB of RAM in servers
- NFS service enabled  
- smb (samba) service enabled
Comment 165 Matt Heaton 2004-12-18 15:08:31 EST
Is it just me, or is it pretty lame that this problem has existed
since mid September on an "enterprise class" OS and still isn't
fixed?  Obviously it affects a LOT of people.  I know they are slowly
working on it, but it isn't good enough for me.  If I pay $1499 a year
for AS I better have a functional product.  I don't want to run an
old kernel that has tons of security problems simply so that my
machines will stay up.  Just my 2 cents worth, but that is why all of
Red Hat's kernel "enhancements" are iffy at best.  Since the community
at large doesn't test them, the paying customer is the guinea
pig.  I know the engineers are doing their best and I REALLY do
appreciate it.  I just wish I had a solution after several months.

Matt
Comment 166 Larry Woodman 2004-12-20 11:58:47 EST
As I said earlier in this bug, there was a change made to RHEL3-U3
that caused kswapd to fail to reclaim inodes because they did not get
refiled due to I_LOCK being set.  It was this patch:

@@ -285,7 +285,8 @@ void refile_inode(struct inode *inode)
                BUG(); */
        if (!inode) return;
        spin_lock(&inode_lock);
-       __refile_inode(inode);
+       if (!(inode->i_state & I_LOCK))
+               __refile_inode(inode);
        spin_unlock(&inode_lock);
 }

So, in RHEL3-U4 I changed prune_icache to refile the inode after
clearing the I_LOCK state.  Several people tested that patch and said
that it fixed the problem of kswapd consuming the system in
prune_icache(), and that patch is in RHEL3-U4.  Evidently there is
some other problem that is preventing inodes from getting reclaimed in
this case.  First of all, is everyone running the latest RHEL3-U4
kernel, 2.4.21-27.EL?  And can someone please get me several
AltSysrq-M and AltSysrq-W outputs when this happens?

Larry Woodman
 
Comment 167 John Flanagan 2004-12-20 15:56:37 EST
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html
Comment 169 kal 2005-09-26 12:12:34 EDT
Larry,

We are experiencing the same problem with 2.4.21-20.ELsmp: kswapd
occupies 130% of the CPU and hangs the system. We applied
2.4.21-27.ELsmp and it did not fix the problem, so we reverted back to
2.4.21-20.ELsmp.

Where can we get the 2.4.21-26.ELsmp version for this problem?

Thanks,
kal
