124150 – Kernel dump on multiple systems

Bug 124150 - Kernel dump on multiple systems

Summary: Kernel dump on multiple systems

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-05-24 15:04 UTC by Need Real Name
Modified:	2016-11-22 08:58 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-08-31 21:09:22 UTC
Target Upstream Version:
Embargoed:

Attachments
netdump log on computer conman1 (7.92 KB, text/plain) 2004-05-24 15:05 UTC, Need Real Name	no flags	Details
netdump log on computer mesohigh (36.24 KB, text/plain) 2004-05-24 15:05 UTC, Need Real Name	no flags	Details
netdump log from shear (7.47 KB, text/plain) 2004-05-24 16:34 UTC, Need Real Name	no flags	Details
netdump log from zenith (1.26 KB, text/plain) 2004-05-24 16:35 UTC, Need Real Name	no flags	Details

Description Need Real Name 2004-05-24 15:04:08 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b)
Gecko/20030507

Description of problem:
I have several Enterprise 3 computers that lock up with a kernel
screen dump about once a week to once a month.  I am still able to
ping the computer but can't do anything else. I will attach the log
files that I got from a netdump.

Version-Release number of selected component (if applicable):
kernel = 2.4.21-15.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Leave the system running.
2. Sometimes it locks up within a week; sometimes it does not.

Actual Results:  Sometimes it will lock up within a week, sometimes it
will not.

Expected Results:  It should never lock up.

Additional info:  I will attach the log files that I got from a netdump.

Comment 1 Need Real Name 2004-05-24 15:05:03 UTC

Created attachment 100499 [details]
netdump log on computer conman1

Comment 2 Need Real Name 2004-05-24 15:05:41 UTC

Created attachment 100500 [details]
netdump log on computer mesohigh

Comment 3 Rik van Riel 2004-05-24 15:13:23 UTC

Since both bugs are unrelated, in list walking code and not seen by
anybody else, the obvious suspect is random memory corruption.

Am I right in guessing that the crash is never the same on the
systems, but always has a slightly different backtrace and different
addresses?

If this happens on 2 computers out of a larger cluster, could you
please run memtest86 on the crashing machines and verify that the
memory is correct?

Stephen, if the memory turns out to be fine on both systems we could
have an ext3 bug on our hands. The chances are probably small, but
I've taken the liberty of assigning the bug to you anyway ;))

Comment 4 Need Real Name 2004-05-24 15:27:05 UTC

This is happening on over 15 computers:
8 new dual-CPU Dells bought this year
4+ Dells 1-3 years old
1 Micron 3 years old, single CPU
2 Gateways around 1.5 years old, single CPU

Since you mention ext3, a side effect of this is that when the
computer comes back up, it checks the journal.  Most of the time it
comes up fine, but when a backup is done on the system, I get I/O
errors on files that did not change, like man pages.  To fix this, I
have to run a filesystem check, which most of the time moves the files
to lost+found.
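
One way to find such files before a backup trips over them is to read
every regular file under a directory and record the ones that fail.
The following is only a minimal sketch in Python, not something that
was run for this report; the function name find_unreadable_files and
the example path are illustrative assumptions.  Note that it reports
any read failure (including permission errors), not only I/O errors.

```python
import os


def find_unreadable_files(root, chunk_size=1 << 20):
    """Walk `root` and return (path, reason) pairs for files that fail to read.

    Hypothetical helper: any OSError raised while opening or reading a
    regular file is recorded, so permission problems show up as well.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only regular files are read; symlinks, sockets and devices are skipped.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    # Read the whole file in chunks so bad blocks are actually touched.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, exc.strerror or str(exc)))
    return failures


if __name__ == "__main__":
    # Example path only; point this at whatever tree the backup covers.
    for path, reason in find_unreadable_files("/usr/share/man"):
        print(path, reason)
```

Running it as root avoids most permission noise, so whatever remains
in the output is more likely to be a real read failure.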

Memtest86 and other memory and CPU burn-in tests have been run on a
few of the computers and they all came back fine.

4 of the Dells 1-3 years old, single CPU, ran fine with 7.3 (ext2)
for over a year with no failures but fail within a week of loading
Enterprise 3.

I have been dealing with this problem since Feb. of this year.

Comment 5 Stephen Tweedie 2004-05-24 16:23:13 UTC

In the first attachment, we see the checkpoint code trying to free a
journal_head which has a NULL buffer_head.  That's a pattern I've
never, ever seen reported on any version of ext3.

The second attachment is memory corruption of a dcache hash list. 
That case has always, pretty much without exception, been an indicator
of hardware problems in the past.

So the two logs we've got so far look exactly like the sorts of things
I'd expect to see if hardware was going bad (possibly in unpredictable
ways, such as environment overheating.)  If there's a software fault
causing this, then there's nothing in the logs to indicate _what_
software fault it might be.

So the next step is definitely to gather more info.  If you could
attach as many of the oops logs you've got as possible, that would
help in finding common patterns.  Even better would be a crash dump
(you can't attach those, you'd need to submit them via the RHEL
enterprise support channels or provide a URL to them.)  If there are
15 machines displaying the symptoms, we should have a fair number of
examples to work with.
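
A rough way to look for common patterns across many oops logs is to
pull out the faulting function from each one and count how often each
appears.  The sketch below is in Python and rests on an assumption:
that each oops contains a line beginning "EIP is at <function>", as
i386 oops reports commonly do.  The names faulting_functions and
summarize, and the sample function names in the usage example, are
made up for illustration and are not taken from the attached logs.

```python
import re
from collections import Counter

# Assumed format of the faulting-function line in an i386 oops report.
EIP_LINE = re.compile(r"^EIP is at (\S+)", re.MULTILINE)


def faulting_functions(log_text):
    """Return the function named on every 'EIP is at ...' line in one log."""
    return EIP_LINE.findall(log_text)


def summarize(logs):
    """Count faulting functions across logs.

    `logs` maps a host name to the text of its netdump/oops log.
    """
    counts = Counter()
    for text in logs.values():
        counts.update(faulting_functions(text))
    return counts


if __name__ == "__main__":
    # Invented sample text, only to show the call pattern.
    sample_logs = {
        "host-a": "Oops: 0000\nEIP is at example_func_a [examplemod] 0x10\n",
        "host-b": "Oops: 0002\nEIP is at example_func_b [kernel] 0x2c\n",
    }
    for function, count in summarize(sample_logs).most_common():
        print(count, function)
```

If nearly every crash names a different function, that points towards
random corruption; a function that keeps recurring would be a better
lead for a software fault.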

Comment 6 Need Real Name 2004-05-24 16:34:48 UTC

Created attachment 100510 [details]
netdump log from shear

Comment 7 Need Real Name 2004-05-24 16:35:53 UTC

Created attachment 100511 [details]
netdump log from zenith

Comment 8 Need Real Name 2004-05-24 16:44:28 UTC

These are all the logs I have so far.  About half the computers are in
the DMZ, which netdump will not work through due to firewall issues.
For the others, I did not get netdumps because I either have not set
up netdump on them yet or they failed before I knew about netdump.  I
sent a mail message to sct stating the locations of dump
files on an FTP site.  I did not want to include them here since I was
not sure about security issues.

Comment 9 Stephen Tweedie 2005-08-31 21:09:22 UTC

I'm closing this as CANTFIX for now as there is still insufficient information
here to identify, never mind get a proper resolution to, this problem.  The fact
that it has been routed around the proper support channels means it is not
getting attention right now.

There was one set of oopses that looked ext3-related; the others look like bad
hardware or other random corruption.  They really need to be looked at by the support
group for more extended diagnosis; it's not yet ready for an engineering fix, as
we haven't even identified whether it's a hardware, BIOS or OS fault yet.

For official Red Hat Enterprise Linux support, please log into the Red Hat
support website at http://www.redhat.com/support and file a support ticket,
or alternatively contact Red Hat Global Support Services at 1-888-RED-HAT1
to speak directly with a support associate and escalate an issue.
