Red Hat Bugzilla – Bug 67608
Random (but frequent) kernel crashes
Last modified: 2007-11-30 17:06:51 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020412
Description of problem:
Roughly once a day our RHAS2.1 crashes. It is pingable after crashing, but it
doesn't respond to keyboard input, and the screen is black. It is a DELL
Poweredge 2550 with two 1.4GHz PIII CPUs, 1GB memory and 2GB swap.
Version-Release number of selected component (if applicable):
Linux version 2.4.9-e.3smp (firstname.lastname@example.org) (gcc version
2.96 20000731 (Red Hat Linux 7.2 2.96-108.1)) #1 SMP Fri May 3 16:48:54 EDT 2002
Steps to Reproduce:
1. Get a DELL PE 2550
2. Install RHAS2.1
3. Don't know, but possibly load it heavily
Actual Results: After something like 0-24 hours, the machine goes dead, except
that it's still pingable.
Expected Results: It should just keep on running...
See also bug 67609.
Created attachment 63024 [details]
Attached the /var/log/messages from the machine. Those lines about waitpid()
failing with errno=512 look weird, as does the attempt to load the ^[J^S@ l\234G
module. Don't know if they are related to the crashing though.
I have installed your netdump 0.6.6-1 thingy on it, and successfully tested it
using your crash.o kernel module. However, after one of our "real" crashes, I
don't get any crash dump. Also, after a crash, it seems not to respond to
alt-sysrq 'k'ill, 's'ync or 'u'mount, but to 'b'oot.
A blind guess of mine is that maybe the scheduler freaks out in some way and
stops handing out time slices. Do you have any ideas about what I could
possibly do to (dis)prove that theory?
Regarding the load of this machine, it regularly goes up over 30, and I have
seen it reach 100.
I take back what I said about not responding to SysRq. Not only did the machine
respond to all SysRq requests, but it also sent the output (success) to our
netdump receiving server. So if we could just have a SysRq key combo for "you
have crashed, please dump core" (bug 68435) I could probably provide you with
lots more information, giving you a fighting chance of actually *doing* anything
about this bug.
Would be worth posting the sysrq-t, sysrq-m, and sysrq-p output that
got dumped via netdump to this bug report.
Will do that as soon as we get another crash. It's late Friday afternoon here,
so it may not happen before the weekend. Your concern is much appreciated btw.
I've been working with Johan via email for a while now. We have tried doubling
the values of /proc/sys/vm/freepages and this works well most of the time. With
two memory gobbling programs running they can still lock the system. Sysrq-M at
that point will hang the box hard. Attaching an nmi_watchdog oops and partial
netdump of the hard hang.
The system has 1GB RAM and 2GB swap.
Created attachment 65964 [details]
Created attachment 65965 [details]
partial netdump bzipped.
This does not appear to be a crash. It appears that kswapd consumes
the CPU when memory gets very low and memory gobbling programs continue
to run and consume memory. We have heard about this problem before and
have successfully reproduced it. Doubling the /proc/sys/vm/freepages
from 638 1279 1914 to 1276 2552 3814 appears to work around the problem
in all cases for a 256MB system. I will experiment with a much larger
system and determine the optimal freepages values.
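A minimal sketch of the workaround described above, assuming a 2.4 kernel that exposes the three min/low/high values in /proc/sys/vm/freepages. The helper name double_freepages is hypothetical and only does the arithmetic; the commented usage lines show how the result might be written back as root. Note that exact doubling of 638 1279 1914 gives 1276 2558 3828, which differs slightly from the figures quoted above.

```shell
#!/bin/sh
# double_freepages is a hypothetical helper, not part of any Red Hat tool.
# It takes one string holding three whitespace-separated page counts
# (the min/low/high values) and prints each value multiplied by two.
double_freepages() {
    # Split the single argument into $1 $2 $3 (local to this function call).
    set -- $1
    echo "$(( $1 * 2 )) $(( $2 * 2 )) $(( $3 * 2 ))"
}

# Possible usage as root (assumption: the file exists on the running kernel):
#   old=$(cat /proc/sys/vm/freepages)
#   new=$(double_freepages "$old")
#   echo "$new" > /proc/sys/vm/freepages
```

A value written this way does not survive a reboot, so it would have to be reapplied at boot if the workaround proves useful.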
Looks like you are using ClearCase binary-only modules.
Is that correct?
We are using Clearcase. I was not aware that Clearcase was loading any kernel
modules, and using lsmod while using XClearcase I cannot see which module should
have been loaded by Clearcase:
Module                  Size  Used by    Not tainted
i810_audio             25440   1  (autoclean)
ac97_codec             13664   0  (autoclean) [i810_audio]
soundcore               7940   2  (autoclean) [i810_audio]
radeon                105944   0
autofs                 13796   0  (autoclean) (unused)
nfs                    91936   7  (autoclean)
lockd                  61184   1  (autoclean) [nfs]
sunrpc                 86000   1  (autoclean) [nfs lockd]
3c59x                  32264   1
ide-scsi               10464   0
ide-cd                 35296   0
cdrom                  35520   0  [ide-cd]
mousedev                5824   1
hid                    22272   0  (unused)
input                   6560   0  [mousedev hid]
usb-uhci               26948   0  (unused)
usbcore                68864   1  [hid usb-uhci]
ext3                   73536   3
jbd                    55048   3  [ext3]
aic7xxx               127200   4
sd_mod                 13468   4
scsi_mod              124988   3  [ide-scsi aic7xxx sd_mod]
Created attachment 77304 [details]
Some logs caught with netdump and Alt-SysRq-TMPSUB
ClearCase loads the mvfs module when you use it.
Then how come it isn't loaded when I use xclearcase? And the kernel says "not
tainted" while using xclearcase (or am I mis-reading the output from lsmod?)?
Have you found any trace of an mvfs module, or do you just suspect one from
general knowledge about Clearcase? Here's the output from 'clearcase -version'
in case that helps:
ClearCase version 4.0 (Fri Mar 03 19:21:10 EST 2000)
MVFS is not installed.
cleartool V4.0 (Fri Mar 3 12:11:01 EST 2000)
db_server V4.0 (Fri Mar 3 12:08:52 EST 2000)
VOB database schema version: 53
We have the same symptoms on two Dell Poweredge 1650s with dual 1.4GHz
Pentium III CPUs, 2GB RAM, a PERC RAID adapter and a shared disk. The machines
are connected in a cluster via a serial cable. Normally the first system hangs
after about 24 hours (as described above), and two to three hours later the
second system hangs too.
The Kernel on all our machines is 2.4.9-e16smp.
Neither the system load nor the memory usage is very high, since the system is
not yet in production. With this problem we cannot use the machines in
3 other dual Pentium 3 without a raid (extern) are working just fine without
The cluster logs and syslogs are not giving any hints.
The next time the machines hang, I'll try to get some dumps. Netdump is not
possible since the network adapters are not supported.
Is this still a problem with the latest AS2.1 kernel errata, e.24?
We haven't upgraded the kernel because we are in the middle of a release cycle.
We'll do a kernel upgrade towards the end of July. Since I'll be on vacation
then, I probably won't be able to answer this question until the middle of August.
Jorit, if you can answer this before that, please do.
The 'problem' has been removed by no longer using the cluster software. I'm
sorry I cannot help.
The solution now used is a simple heartbeat between the two machines and it
I just tried my stress test overnight, and the machine was fine when
I got back to work this morning.
I'm running kernel 2.4.9-e.34smp.
Created attachment 102780 [details]
This is the stress test I used. Don't know what's necessary, but to bring my
system to its knees I ran two instances of it in parallel, thusly:
while true ; do date ; ./oom ; done
That doesn't bring my system down any more, but I'm attaching it for reference.
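For anyone wanting to rerun this, here is a rough sketch of starting several copies of the loop above from one shell. It is not the attached test: run_parallel is a hypothetical helper, and unlike the original endless loop it stops after a fixed number of rounds so it can be tried safely. The ./oom binary is assumed to be the attachment built locally.

```shell
#!/bin/sh
# run_parallel CMD INSTANCES ROUNDS
# Hypothetical helper: starts INSTANCES background loops, each of which
# prints the date and runs CMD, ROUNDS times, then waits for all of them.
run_parallel() {
    cmd=$1
    instances=$2
    rounds=$3
    i=0
    while [ "$i" -lt "$instances" ]; do
        (
            r=0
            while [ "$r" -lt "$rounds" ]; do
                date
                # A failing or killed CMD must not end the loop early.
                $cmd || true
                r=$((r + 1))
            done
        ) &
        i=$((i + 1))
    done
    # Block until every background loop has finished.
    wait
}

# Possible usage (assumption: ./oom is the stress test from the attachment):
#   run_parallel ./oom 2 1000
```

The bounded round count is a deliberate deviation from the original loop; use a large value to approximate the endless version.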