From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.4.3) Gecko/20040924

Description of problem:
One of our servers is occasionally crashing for no obvious reason.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-20.EL

How reproducible:
Sometimes

Additional info:
We have connected a serial console to it and have the panic messages saved. See attachment. The program mentioned "a.out" does not reproducibly kill the machine, it was just running at this crash. It would produce a core dump, and, from what I understand from the log, it was doing that at the crash. Does the log tell you anything more?
Created attachment 107062 [details] Console output from the crash
Steve, crash was in rpc_clnt_sigunmask().
How were the filesystems mounted? With the 'intr' option?
Created attachment 109964 [details] Current mount status of the same machine. We don't believe we have changed any mount options since the crash occurred, so the attachment should give a fair view of what things looked like then. Obviously, I don't know exactly which directories were automounted. But, although we don't believe so, we can't be COMPLETELY sure no option has changed in the meantime.
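For anyone who wants to answer the 'intr' question from a mount table directly, here is a minimal sketch. It assumes the table is in /proc/mounts format (device, mount point, type, comma-separated options) and that the kernel in use lists 'intr' among the options when it is set, which may not hold everywhere. The function name nfs_intr_report is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: for every NFS entry in a /proc/mounts-style file,
# print the mount point followed by "intr" or "nointr".
# Assumption: fields are device, mount point, fs type, options.
nfs_intr_report() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        n = split($4, opts, ",")          # options are comma-separated
        state = "nointr"
        for (i = 1; i <= n; i++)
            if (opts[i] == "intr") state = "intr"
        print $2, state
    }' "${1:-/proc/mounts}"               # default to the live mount table
}
```

Called with no argument it reads /proc/mounts; called with a file name it reads a saved copy, such as a mount listing captured earlier.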
hmm... it does appear the process was in the middle of dumping core.... So could you cause a core dump on both an nfs and an autofs filesystem by doing the following: sleep 20 & kill -11 sleep_pid
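The test suggested above can be sketched as a small script so that it is easy to repeat in several directories. This is only an illustration: the function name force_core_in is made up, and whether a core file is actually written (and under which name) depends on the core size limit and the kernel's core settings on the machine.

```shell
#!/bin/sh
# Hypothetical helper: start a sleep inside DIR and kill it with signal 11
# (SIGSEGV), whose default action is to terminate and dump core.
force_core_in() {
    dir=$1
    (
        cd "$dir" || exit 1
        # Try to allow core files; a lower hard limit may prevent this.
        ulimit -c unlimited 2>/dev/null
        sleep 20 &
        pid=$!                 # PID of the background sleep
        kill -11 "$pid"
        wait "$pid"            # 128 + 11 = 139 when killed by SIGSEGV
    )
}
```

Running force_core_in once with a directory on a plain NFS mount and once with a directory under an automounted path would cover both cases asked about; an exit status of 139 confirms the process died from signal 11.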
The process could very well have been in the middle of producing a core dump. The program did have that kind of error at the time. Producing a core is not in itself enough to force a crash. I've dropped cores in a lot of places now, and nothing strange happens. But I'm fully aware that this is very hard to debug, since we can't reproduce the problem, and it does not happen often. We upgraded to the U4 kernel a few days ago, but the machine is still running the same set of network servers, the LSF master server in particular. The machine had another crash a few days ago which also seems to be core dump related. We reported it in bug 145331. (Which was closed as a duplicate of bug 140083.)
Would it be possible to set up netdump so we can get a crash dump to analyze? That's probably the only way we'll be able to figure this out.....
We will set up a netdump server and client. I'll be back when we have a dump.
We can't start netdump; there is no netdump.o module in kernel-smp-2.4.21-27.0.2.EL.x86_64! Is that a bug which I should bugzilla? Or is there some reason for the omission?
Steve, netdump support on x86_64 was just added in U5, and thus is not yet available on customer systems. I think diskdump support is in U4 (on x86_64), so getting a crash dump might still be viable (if they're using one of the supported disk drivers). Please follow up with DaveA or one of the diskdump developers.
We have an adapter which identifies as "Adaptec 29320 Ultra320 SCSI adapter" and uses the aic79xx driver, so diskdump should work. We're attaching an extra disk and will set things up.
We finally managed to capture a diskdump (-: Is anyone still interested? Since the dump may contain sensitive information I don't want to publish it here in the public bugzilla. (It's rather large as well...) Maybe I could make it available over a support ticket?
Is this panic still happening with the U6 kernel?
Our logs are not perfectly maintained, but I can't find a recent crash, no. This could be because of the kernel update. It might also be because the machine is no longer running LSF servers as it used to. But in either case, it doesn't seem to crash any more.
Fine... if this happens again, please update this bug report...
WRT comment #17, please make that dump available to your customer support person so we can take a look at it...
Um, David, where is that dump? There is a directory /var/crash/127.0.0.1-2005-05-09-18:00 on schellville, but it is empty.
I guess someone cleaned it away )-: Those things are around 6 GB in size... Anyway, make sure there is enough room in /var/crash and run "chkconfig diskdump on; service diskdump start" and you ought to get a new dump the next time the computer dies.
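The "enough room" check can be sketched as below. The function name has_room_for_dump is made up for illustration, and the threshold is an assumption based on the roughly 6 GB dump size mentioned above; the real requirement depends on the machine's memory and the dump settings.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEEDED_KB kilobytes available.
has_room_for_dump() {
    dir=$1
    needed_kb=$2
    # df -Pk prints a header line plus one data line per filesystem;
    # column 4 of the data line is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$needed_kb" ]
}
```

For example, "has_room_for_dump /var/crash 6291456 && chkconfig diskdump on && service diskdump start" would enable diskdump only when about 6 GB (6291456 KB) is free.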
Reverting to NEEDINFO until we can get a dump to work with. If you can't reproduce this crash again within a few weeks, please close this bug report with the CANTFIX or WORKSFORME disposition. Thanks in advance.
Diskdump is already enabled. As I mentioned in comment 19, the machine has been more stable recently. For one reason or another. I'll keep a closer eye for a while, and if it seems ok I'll close as you say.
It has stayed up for the month since I filed comment 25. Whether that is because of updated kernels or because the machine isn't used in the same way now, I don't know. But in any case, it doesn't happen any more.