67608 – Random (but frequent) kernel crashes

Summary: Random (but frequent) kernel crashes

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Larry Woodman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:	68435
Blocks:

Reported:	2002-06-28 07:30 UTC by Johan Walles
Modified:	2007-11-30 22:06 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-08-17 06:36:06 UTC
Target Upstream Version:
Embargoed:

Attachments
/var/log/messages (78.70 KB, text/plain) 2002-06-28 11:06 UTC, Johan Walles
nmi_watchdog oops (2.88 KB, text/plain) 2002-07-19 14:42 UTC, Norm Murray
partial netdump bzipped. (6.38 MB, application/octet-stream) 2002-07-19 14:43 UTC, Norm Murray
Some logs caught with netdump and Alt-SysRq-TMPSUB (19.05 KB, application/octet-stream) 2002-09-26 11:40 UTC, Johan Walles
Stress test (1.01 KB, text/plain) 2004-08-17 06:40 UTC, Johan Walles

Description Johan Walles 2002-06-28 07:30:46 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020412
Debian/0.9.9-6

Description of problem:
Roughly once a day our RHAS2.1 machine crashes.  It is pingable after crashing, but it
doesn't respond to keyboard input, and the screen is black.  It is a Dell
PowerEdge 2550 with two 1.4GHz PIII CPUs, 1GB memory and 2GB swap.

Version-Release number of selected component (if applicable):
Linux version 2.4.9-e.3smp (bhcompile.redhat.com) (gcc version
2.96 20000731 (Red Hat Linux 7.2 2.96-108.1)) #1 SMP Fri May 3 16:48:54 EDT 2002

How reproducible:
Always

Steps to Reproduce:
1. Get a DELL PE 2550
2. Install RHAS2.1
3. Don't know, but possibly load it heavily


Actual Results:  After something like 0-24 hours, the machine goes dead, except
that it's still pingable.

Expected Results:  It should just keep on running...

Additional info:

Comment 1 Johan Walles 2002-06-28 07:48:33 UTC

See also bug 67609.

Comment 2 Johan Walles 2002-06-28 11:06:30 UTC

Created attachment 63024
/var/log/messages

Comment 3 Johan Walles 2002-06-28 11:10:41 UTC

Attached the /var/log/messages from the machine.  Those lines about waitpid()
failing with errno=512 look weird, as does the attempt to load the ^[J^S@ l\234G
module.  Don't know if they are related to the crashing though.

Comment 4 Johan Walles 2002-07-09 15:08:33 UTC

I have installed your netdump 0.6.6-1 package on it, and successfully tested it
using your crash.o kernel module.  However, after one of our "real" crashes, I
don't get any crash dump.  Also, after a crash, it does not seem to respond to
Alt-SysRq 'k'ill, 's'ync or 'u'mount, only to 'b'oot.

A blind guess of mine is that maybe the scheduler freaks out in some way and
stops handing out time slices.  Do you have any ideas about what I could
possibly do to (dis)prove that theory?

Comment 5 Johan Walles 2002-07-10 07:45:57 UTC

Regarding the load of this machine, it regularly goes up over 30, and I have
seen it reach 100.

Comment 6 Johan Walles 2002-07-11 07:54:14 UTC

I take back what I said about not responding to SysRq.  Not only did the machine
respond to all SysRq requests, but it also sent the output (success) to our
netdump receiving server.  So if we could just have a SysRq key combo for "you
have crashed, please dump core" (bug 68435) I could probably provide you with
lots more information, giving you a fighting chance of actually *doing* anything
about this bug.

Comment 7 Michael K. Johnson 2002-07-12 14:04:25 UTC

Would be worth posting the sysrq-t, sysrq-m, and sysrq-p output that
got dumped via netdump to this bug report.

Comment 8 Johan Walles 2002-07-12 14:43:39 UTC

Will do that as soon as we get another crash.  It's late Friday afternoon here,
so it may not happen before the weekend.  Your concern is much appreciated, btw.

Comment 9 Norm Murray 2002-07-19 14:41:26 UTC

I've been working with Johan via email for a while now. We have tried doubling
the values of /proc/sys/vm/freepages and this works well most of the time. With
two memory-gobbling programs running, the system can still lock up. Sysrq-M at
that point will hang the box hard. Attaching an nmi_watchdog oops and partial
netdump of the hard hang.

System is 1GB RAM, 2GB swap.

Comment 10 Norm Murray 2002-07-19 14:42:18 UTC

Created attachment 65964
nmi_watchdog oops

Comment 11 Norm Murray 2002-07-19 14:43:18 UTC

Created attachment 65965
partial netdump bzipped.

Comment 12 Larry Woodman 2002-08-05 17:00:43 UTC

This does not appear to be a crash.  It appears that kswapd consumes
the CPU when memory gets very low and memory gobbling programs continue
to run and consume memory.  We have heard about this problem before and 
have successfully reproduced it.  Doubling the /proc/sys/vm/freepages
from 638 1279 1914 to 1276 2552 3814 appears to work around the problem
in all cases for a 256MB system.  I will experiment with a much larger
system and determine the optimal freepages values.

Larry Woodman
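
The workaround described above can be sketched as a small shell helper. This is only an illustration, assuming the 2.4-era /proc/sys/vm/freepages file holding three space-separated integers; the function name double_freepages is hypothetical and not part of any Red Hat tooling. Note that strict doubling of 638 1279 1914 yields 1276 2558 3828, so the values quoted above were presumably rounded or hand-tuned.

```shell
#!/bin/sh
# Sketch: double the three freepages thresholds.
# Assumption: the input is three space-separated integers, as found in
# /proc/sys/vm/freepages on 2.4 kernels. double_freepages is a
# hypothetical helper name.

double_freepages() {
    # $1: current value string, e.g. "638 1279 1914"
    # Re-split the string into the positional parameters $1 $2 $3.
    set -- $1
    echo "$(( $1 * 2 )) $(( $2 * 2 )) $(( $3 * 2 ))"
}

# Possible usage (as root, on a kernel that still provides this file):
#   new=$(double_freepages "$(cat /proc/sys/vm/freepages)")
#   echo "$new" > /proc/sys/vm/freepages
```

The helper only computes the new values; writing them back is left as a separate, explicit step so the numbers can be reviewed first.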

Comment 13 Arjan van de Ven 2002-09-26 10:10:49 UTC

Looks like you are using ClearCase binary-only modules.
Is that correct?

Comment 14 Johan Walles 2002-09-26 11:23:07 UTC

We are using Clearcase.  I was not aware that Clearcase was loading any kernel
modules, and using lsmod while using XClearcase I cannot see which module should
have been loaded by Clearcase:

johan@transwarp:~/views/CR444_transwarp_johan$ /sbin/lsmod 
Module                  Size  Used by    Not tainted
i810_audio             25440   1 (autoclean)
ac97_codec             13664   0 (autoclean) [i810_audio]
soundcore               7940   2 (autoclean) [i810_audio]
radeon                105944   0
autofs                 13796   0 (autoclean) (unused)
nfs                    91936   7 (autoclean)
lockd                  61184   1 (autoclean) [nfs]
sunrpc                 86000   1 (autoclean) [nfs lockd]
3c59x                  32264   1
ide-scsi               10464   0
ide-cd                 35296   0
cdrom                  35520   0 [ide-cd]
mousedev                5824   1
hid                    22272   0 (unused)
input                   6560   0 [mousedev hid]
usb-uhci               26948   0 (unused)
usbcore                68864   1 [hid usb-uhci]
ext3                   73536   3
jbd                    55048   3 [ext3]
aic7xxx               127200   4
sd_mod                 13468   4
scsi_mod              124988   3 [ide-scsi aic7xxx sd_mod]

Comment 15 Johan Walles 2002-09-26 11:40:14 UTC

Created attachment 77304
Some logs caught with netdump and Alt-SysRq-TMPSUB

Comment 16 Arjan van de Ven 2002-10-01 13:06:40 UTC

ClearCase loads the mvfs module when you use it.

Comment 17 Johan Walles 2002-10-01 13:26:29 UTC

Then how come it isn't loaded when I use xclearcase?  And the kernel says "not
tainted" while using xclearcase (or am I misreading the output from lsmod?)?

Have you found any trace of an mvfs module, or do you just suspect one from
general knowledge about Clearcase?  Here's the output from 'clearcase -version'
in case that helps:

ClearCase version 4.0 (Fri Mar 03 19:21:10 EST 2000)
MVFS is not installed.
cleartool                         V4.0 (Fri Mar  3 12:11:01 EST 2000)
db_server                         V4.0 (Fri Mar  3 12:08:52 EST 2000)
VOB database schema version: 53
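
The two questions raised here (is the kernel tainted, and is mvfs loaded) can be checked mechanically against lsmod output like the listing in comment 14. The following is a rough sketch, assuming the 2.4-style lsmod header that ends in "Not tainted" or "Tainted: ..."; check_lsmod is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: inspect 2.4-style lsmod output read from stdin.
# Assumption: the first line is the header and carries the taint state.
# check_lsmod is a hypothetical helper name.

check_lsmod() {
    # $1: module name to look for, e.g. mvfs
    awk -v mod="$1" '
        NR == 1 {
            # The header line reports the taint state on 2.4 systems.
            if ($0 ~ /Not tainted/)      tainted = "no"
            else if ($0 ~ /Tainted/)     tainted = "yes"
            else                         tainted = "unknown"
            next
        }
        # The first column of every other line is the module name.
        $1 == mod { found = 1 }
        END {
            printf "tainted=%s module=%s\n", tainted, (found ? "loaded" : "absent")
        }
    '
}

# Possible usage:
#   /sbin/lsmod | check_lsmod mvfs
```

This only reflects the state at the moment lsmod runs; a module that was loaded and later unloaded would not show up in the list.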

Comment 18 Jorit Dorn 2003-04-16 06:37:09 UTC

We have the same symptoms on two Dell PowerEdge 1650 machines, each with two
1.4GHz Pentium III CPUs, 2GB RAM, a PERC RAID adapter and a shared disk. The
machines are connected as a cluster via a serial cable. Normally the first system
hangs after 24h (as described above), and two to three hours later the second
system hangs too.

The Kernel on all our machines is 2.4.9-e16smp.

Neither the system load nor the memory usage is very high, since the system is
not yet in production. With this problem we cannot use the machines in
production :-(

Three other dual Pentium III machines without an (external) RAID are working fine
without any problems.

The cluster logs and syslogs are not giving any hints.

Next time the machines hang I'll try to get some dumps. Netdump is not
possible since the network adapters are not supported.

Comment 19 Larry Woodman 2003-06-23 17:38:35 UTC

Is this still a problem with the latest AS2.1 kernel errata, e.24?

Larry Woodman

Comment 20 Johan Walles 2003-06-27 14:00:51 UTC

We haven't upgraded the kernel because we are in the middle of a release cycle.
We'll do a kernel upgrade towards the end of July.  Since I'll be on vacation
then, I probably won't be able to answer this question until the middle of August.

Jorit, if you can answer this before that, please do.

Comment 21 Jorit Dorn 2003-06-30 06:36:15 UTC

The 'problem' has been removed by no longer using the cluster software. I'm 
sorry I cannot help.

The solution now used is a simple heartbeat between the two machines and it 
works fine.

Sorry.
Jorit

Comment 23 Johan Walles 2004-08-17 06:36:06 UTC

I just tried my stress test overnight, and the machine was fine when
I got back to work this morning.

I'm running kernel 2.4.9-e.34smp.

Comment 24 Johan Walles 2004-08-17 06:40:06 UTC

Created attachment 102780
Stress test

This is the stress test I used.  Don't know what's necessary, but to bring my
system to its knees I ran two instances of it in parallel, thusly:

while true ; do date ; ./oom ; done

That doesn't bring my system down any more, but I'm attaching it for reference.
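
For anyone repeating this, the two parallel loops can be started from a single shell. The following is a minimal sketch, not the attached test itself: run_parallel is a hypothetical helper, ./oom stands for the attached stress program (whose contents are not shown here), and the loop is bounded by an iteration count rather than running forever as the original one-liner does.

```shell
#!/bin/sh
# Sketch: run several copies of a command in parallel, each repeated a
# fixed number of times, printing the date before every run.
# run_parallel is a hypothetical helper name.

run_parallel() {
    # $1: number of parallel instances
    # $2: iterations per instance
    # remaining arguments: the command to run
    instances=$1
    iterations=$2
    shift 2
    i=0
    while [ "$i" -lt "$instances" ]; do
        (
            n=0
            while [ "$n" -lt "$iterations" ]; do
                date
                "$@" || true   # ignore a non-zero exit status and keep looping
                n=$((n + 1))
            done
        ) &
        i=$((i + 1))
    done
    # Block until every background loop has finished.
    wait
}

# Possible usage, two instances of the attached test, 1000 runs each:
#   run_parallel 2 1000 ./oom
```

Bounding the iterations makes the run end on its own if the machine survives, which is convenient for an unattended overnight test.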
