Red Hat Bugzilla – Bug 124150
Kernel dump on multiple systems
Last modified: 2007-11-30 17:07:02 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b)
Description of problem:
I have several Enterprise 3 computers that lock up with a kernel
screen dump about once a week to once a month. I am still able to
ping the computer but can't do anything else. I will attach two log
files that I got from a netdump.
Version-Release number of selected component (if applicable):
kernel = 2.4.21-15.ELsmp
Steps to Reproduce:
1. Leave system running
2. Sometimes it will lock up within a week, sometimes it will not.
Actual Results: Sometimes it will lock up within a week, sometimes it will not.
Expected Results: It should never lock up.
Additional info: I will attach two log files that I got from a netdump.
Created attachment 100499 [details]
netdump log on computer conman1
Created attachment 100500 [details]
netdump log on computer mesohigh
Since the two crashes are unrelated, both occur in list-walking code, and
neither has been seen by anybody else, the obvious suspect is random memory
corruption.
Am I right in guessing that the crash is never the same on the
systems, but always has a slightly different backtrace and different
If this happens on 2 computers out of a larger cluster, could you
please run memtest86 on the crashing machines and verify that the
memory is correct?
Stephen, if the memory turns out to be fine on both systems we could
have an ext3 bug on our hands. The chances are probably small, but
I've taken the liberty of assigning the bug to you anyway ;))
This is happening on over 15 computers:
8 new dual-CPU Dells bought this year
4+ Dells 1-3 years old
1 Micron 3 years old, single CPU
2 Gateways around 1.5 years old, single CPU
Since you mention ext3, a side effect of this is what happens when the
computer comes back up and checks the journal. Most of the time it comes
up fine, but when a backup is done on the system I get I/O errors on
files that did not change, like man pages. To fix this, I have to run a
filesystem check, which most of the time moves the files to lost+found.
Memtest86 and other memory and CPU burn-in tests have been run on a few
of the computers, and they all came back fine.
4 of the Dells 1-3 years old, single CPU, ran fine with 7.3 (ext2) for
over a year with no failures but fail within a week of loading
I have been dealing with this problem since Feb. of this year.
In the first attachment, we see the checkpoint code trying to free a
journal_head which has a NULL buffer_head. That's a pattern I've
never, ever seen reported on any version of ext3.
The second attachment is memory corruption of a dcache hash list.
That case has always, pretty much without exception, been an indicator
of hardware problems in the past.
So the two logs we've got so far look exactly like the sorts of things
I'd expect to see if hardware was going bad (possibly in unpredictable
ways, such as environment overheating.) If there's a software fault
causing this, then there's nothing in the logs to indicate _what_
software fault it might be.
So the next step is definitely to gather more info. If you could
attach as many of the oops logs you've got as possible, that would
help in finding common patterns. Even better would be a crash dump
(you can't attach those, you'd need to submit them via the RHEL
enterprise support channels or provide a URL to them.) If there are
15 machines displaying the symptoms, we should have a fair number of
examples to work with.
Created attachment 100510 [details]
netdump log from shear
Created attachment 100511 [details]
netdump log from zenith
These are all the logs I have so far. About half of the computers are in
the DMZ, where netdump will not work because of firewall issues.
For some of the others I did not get netdumps because I either have not set
up netdump on them yet or they failed before I knew about netdump. I
sent a mail message to firstname.lastname@example.org stating the locations of dump
files on an FTP site. I did not want to include them here since I was
not sure about security issues.
I'm closing this as CANTFIX for now as there is still insufficient information
here to identify, never mind get a proper resolution to, this problem. The fact
that it has been routed around the proper support channels means it is not
getting attention right now.
There was one set of oopses that looked ext3-related; the others look like bad
hardware or other random corruption. They really need to be looked at by the
support group for more extended diagnosis; it's not yet ready for an engineering
fix, as we haven't even identified whether it's a hardware, BIOS or OS fault yet.
For official Red Hat Enterprise Linux support, please log into the Red Hat
support website at http://www.redhat.com/support and file a support ticket,
or alternatively contact Red Hat Global Support Services at 1-888-RED-HAT1
to speak directly with a support associate and escalate an issue.