Bug 140071 - Kernel panic, no obvious reason
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-11-19 16:31 UTC by Göran Uddeborg
Modified: 2007-11-30 22:07 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-10-12 12:29:50 UTC
Target Upstream Version:
Embargoed:


Attachments
Console output from the crash (7.67 KB, text/plain)
2004-11-19 16:32 UTC, Göran Uddeborg
Current mount status of the same machine. (2.70 KB, text/plain)
2005-01-19 13:03 UTC, Göran Uddeborg

Description Göran Uddeborg 2004-11-19 16:31:38 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.4.3)
Gecko/20040924

Description of problem:
One of our servers is occasionally crashing for no obvious reason.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-20.EL

How reproducible:
Sometimes

Additional info:

We have connected a serial console to it and have saved the panic
messages; see the attachment.  The program mentioned, "a.out", does not
reproducibly kill the machine; it just happened to be running at the
time of this crash.  It would produce a core dump, and, from what I
understand from the log, it was doing that when the crash occurred.
Does the log tell you anything more?

Comment 1 Göran Uddeborg 2004-11-19 16:32:50 UTC
Created attachment 107062
Console output from the crash

Comment 2 Ernie Petrides 2004-11-19 19:19:56 UTC
Steve, crash was in rpc_clnt_sigunmask().


Comment 3 Steve Dickson 2005-01-18 14:15:37 UTC
How were the filesystems mounted? With the 'intr' option?
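
For reference, one way to answer this directly on the machine is to read the mount table. This is only a minimal sketch: it assumes /proc/mounts lists 'intr' among the NFS options on the kernel in use (not confirmed anywhere in this report), and the function name nfs_intr_report is made up for illustration.

```shell
# Print each NFS mount point and whether 'intr' appears in its options.
# Input is /proc/mounts-style text: device mountpoint fstype options ...
# The optional first argument names a file to read instead of /proc/mounts.
nfs_intr_report() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        found = "no"
        n = split($4, opts, ",")          # options are comma separated
        for (i = 1; i <= n; i++)
            if (opts[i] == "intr") found = "yes"
        print $2, "intr=" found
    }' "${1:-/proc/mounts}"
}
```

Calling nfs_intr_report with no argument reads the live mount table; automounted directories only show up while they are actually mounted.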

Comment 4 Göran Uddeborg 2005-01-19 13:03:55 UTC
Created attachment 109964
Current mount status of the same machine.

We don't believe we have changed any mount options since the crash occurred,
so the attachment should give a fair view of what things looked like then.
Obviously, I don't know exactly which directories were automounted at the time.

But, although we don't believe so, we can't be COMPLETELY sure that no option has
changed in the meantime.

Comment 5 Steve Dickson 2005-01-19 18:23:53 UTC
Hmm... it does appear the process was in the
middle of dumping core. So could you cause
a core dump on both an NFS and an autofs filesystem
by doing the following:

sleep 20 &
kill -11 $!    # $! is the PID of the backgrounded sleep
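
The test above can be wrapped so that it runs from inside a chosen directory, which makes it easy to repeat on an NFS mount and on an autofs mount. This is a hedged sketch only: the function name force_core_in_dir is hypothetical, and whether a core file is actually written depends on the core size limit and the kernel's core settings on the machine.

```shell
# Run the sleep/kill test with DIR as the working directory.
# Returns 139 (128 + SIGSEGV) when the sleep process was killed as
# intended, or 2 when DIR could not be entered.
force_core_in_dir() {
    (
        cd "$1" || exit 2
        # Try to lift the core size limit; ignore failure if the hard
        # limit does not allow it.
        ulimit -c unlimited 2>/dev/null || true
        sleep 20 &
        kill -11 "$!"       # signal 11 is SIGSEGV
        wait "$!"           # reports how the sleep process ended
    )
}
```

It would be called once per filesystem under test, for example with a directory on the NFS mount and then one below an automount point (both paths depend on the local setup).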



Comment 6 Göran Uddeborg 2005-01-21 13:26:04 UTC
The process could very well have been in the middle of producing a core
dump.  The program did have that kind of error at the time.

Producing a core is not in itself enough to force a crash.  I've
dropped cores in a lot of places now, and nothing strange happens.

But I'm fully aware that this is very hard to debug, since we can't
reproduce the problem, and it does not happen often.  We upgraded to
the U4 kernel a few days ago, but the machine is still running the
same set of network servers, the LSF master server in particular.

A few days ago the machine had another crash, which also seems to be
core dump related.  We reported it in bug 145331 (which was closed as a
duplicate of bug 140083).

Comment 9 Steve Dickson 2005-01-27 21:43:37 UTC
Would it be possible to set up netdump so we can get
a crash dump to analyze? That's probably the only way
we'll be able to figure this out.

Comment 10 Göran Uddeborg 2005-01-28 10:04:01 UTC
We will set up a netdump server and client.  I'll be back when we have a dump.

Comment 11 Göran Uddeborg 2005-01-28 15:00:24 UTC
We can't start netdump; there is no netdump.o module in
kernel-smp-2.4.21-27.0.2.EL.x86_64!  Is that a bug I should file in Bugzilla?  Or
is there some reason for the omission?
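
A check of this kind can be scripted. The sketch below assumes modules live under /lib/modules/<kernel release> and are named NAME.o on 2.4 kernels (NAME.ko, possibly compressed, on later ones); the function name has_kernel_module is invented for this example.

```shell
# Succeed if a module file for NAME exists somewhere under MODDIR.
# MODDIR defaults to the module tree of the running kernel.
has_kernel_module() {
    name=$1
    moddir=${2:-/lib/modules/$(uname -r)}
    # Look for NAME.o (2.4 kernels) or NAME.ko* (later kernels).
    [ -n "$(find "$moddir" \( -name "$name.o" -o -name "$name.ko*" \) \
            -print 2>/dev/null | head -n 1)" ]
}
```

For example, "has_kernel_module netdump || echo missing" would print "missing" on a kernel package that ships no netdump module.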

Comment 12 Ernie Petrides 2005-02-01 01:23:01 UTC
Steve, netdump support on x86_64 was just added in U5, and thus is not
yet available on customer systems.  I think diskdump support is in U4
(on x86_64), so getting a crash dump might still be viable (if they're
using one of the supported disk drivers).  Please follow up with DaveA
or one of the diskdump developers.

Comment 16 Göran Uddeborg 2005-02-02 10:32:26 UTC
We have an adapter which identifies as "Adaptec 29320 Ultra320 SCSI adapter" and
uses the aic79xx driver, so diskdump should work.  We're attaching an extra disk
and will set things up.

Comment 17 David Juran 2005-05-10 08:01:28 UTC
We finally managed to capture a diskdump (-:
Is anyone still interested?
Since the dump may contain sensitive information, I don't want to
publish it here in the public Bugzilla. (It's rather large as well...) Maybe I
could make it available through a support ticket?

Comment 18 Steve Dickson 2005-09-06 11:21:08 UTC
Is this panic still happening with the U6 kernel?

Comment 19 Göran Uddeborg 2005-09-08 12:33:54 UTC
Our logs are not perfectly maintained, but I can't find a crash recently, no.

This could be because of the kernel update.  It might also be because the
machine is no longer running LSF servers as it used to.  But in either case,
it doesn't seem to crash any more.

Comment 20 Steve Dickson 2005-09-09 15:28:36 UTC
Fine... if this happens again, please update this bug report... 

Comment 21 Steve Dickson 2005-09-09 19:09:41 UTC
WRT comment #17, please make that dump available to
your customer support person so we can take a look at it...


Comment 22 Göran Uddeborg 2005-09-09 20:16:12 UTC
Um, David, where is that dump?  There is a directory
/var/crash/127.0.0.1-2005-05-09-18:00 on schellville, but it is empty.

Comment 23 David Juran 2005-09-12 14:40:06 UTC
I guess someone cleaned it away )-: Those things are around 6 GB in size...
Anyway, make sure there is enough room in /var/crash and run
"chkconfig diskdump on; service diskdump start", and you ought to get a new dump
the next time the computer dies.
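
The "enough room" check can be sketched as below. The helper name has_free_kb is hypothetical, and the 6 GB threshold in the usage comment is only the rough dump size mentioned in this comment, not a documented requirement.

```shell
# Succeed if the filesystem holding DIR has at least NEED_KB kilobytes free.
has_free_kb() {
    dir=$1
    need_kb=$2
    # df -P prints one line per filesystem; column 4 is the available space.
    avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$need_kb" ]
}

# Usage (not run here): only enable diskdump when one dump would fit.
#   has_free_kb /var/crash $((6 * 1024 * 1024)) &&
#       chkconfig diskdump on && service diskdump start
```

Checking before enabling the service avoids finding out about a full /var/crash only after the next panic.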

Comment 24 Ernie Petrides 2005-09-12 18:45:49 UTC
Reverting to NEEDINFO until we can get a dump to work with.

If you can't reproduce this crash again within a few weeks, please close this
bug report with the CANTFIX or WORKSFORME disposition.  Thanks in advance.


Comment 25 Göran Uddeborg 2005-09-12 19:43:59 UTC
Diskdump is already enabled.  As I mentioned in comment 19, the machine has been
more stable recently, for one reason or another.  I'll keep a closer eye on it for a
while, and if it seems OK I'll close the bug as you say.

Comment 26 Göran Uddeborg 2005-10-12 12:29:50 UTC
It has stayed up for the month since I filed comment 25.  Whether that is because of
updated kernels or because the machine isn't used in the same way now, I don't
know.  But in any case, it doesn't happen any more.

