Red Hat Bugzilla – Bug 158039
nfsd oopses on testing kernel update for FC3
Last modified: 2007-11-30 17:11:06 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.8) Gecko/20050512 Fedora/1.0.4-2 Firefox/1.0.4
Description of problem:
Got all of these oopses on the same box over the past few weeks, running various different kernels. It might be faulty hardware, so take it with a grain of salt, but I don't have any other boxes with an identical hardware configuration to tell whether it's something specific to the set of modules involved, nor easy local access to run hardware tests. There are two ext3 oopses and some nfsd oopses from the stable kernel as well. Could this all be caused by filesystem corruption? I'm thinking of bringing the system down for an fsck.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Boot up either the stable or the testing 2.6.11 FC3 kernel and let it run for days.
Actual Results: Oopses I'll attach.
Expected Results: No such oopses.
Created attachment 114493
fsck didn't find any inconsistencies, but a local user reported some recent
suspicion of overheating, and the failures appear to be related to peak use.
Oopses are never good for data integrity.
Why do you think this is faulty hardware?
That was the suspicion of another sysadmin. Apparently the box has never been
exactly rock solid, with some programs crashing every now and then, odd messages
on cron mail, and so on, but this had never (apparently) affected its ability to
serve out filesystems over nfs. The box was recently taken off to a computer
repair facility at the uni, and they suspected the goop that attaches the cooler
to the processor might be at fault, and replaced it, but that had no effect
whatsoever. If anything, crashes are now more frequent.
Besides, we have many other boxes running NFS servers with the very same
software, although not exactly the same hardware, so I found it unlikely that
things would crash so often for one box and not for others. This one isn't even
the most heavily used server. I figured that if such oopses were hitting
others, you'd know about it, so I thought I'd file it, but don't waste too much
time on it until we can get better assurance that it's not caused by hardware
problems. I downgraded to 2.6.10-1.670_FC3 yesterday, and now the box is
offline. I can't tell whether it crashed or was taken to the repair facility
again. Aah, the wonders of being a remote sysadmin :-)
The box failed again, and was taken to the repair office again. They ran a
memtest again, and found both memory modules to be defective. I'll probably
have to go on site and verify the testing, but we're now pretty sure it's
hardware failure. Sorry about the noise.
(s/1.670_FC3/1.770_FC3/ in the previous comment, BTW)