From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b) Gecko/20030507

Description of problem:
I have several Enterprise 3 computers that lock up with a kernel screen dump about once a week to once a month. I am still able to ping the computer but can't do anything else. I will attach two log files that I got from a netdump.

Version-Release number of selected component (if applicable):
kernel = 2.4.21-15.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Leave system running
2. Sometimes it will lock up in a week, sometimes it will not.
3.

Actual Results:
Sometimes it will lock up in a week, sometimes it will not.

Expected Results:
It should never lock up.

Additional info:
I will attach two log files that I got from a netdump.
Created attachment 100499 [details] netdump log on computer conman1
Created attachment 100500 [details] netdump log on computer mesohigh
Since both bugs are unrelated, in list walking code and not seen by anybody else, the obvious suspect is random memory corruption. Am I right in guessing that the crash is never the same on the systems, but always has a slightly different backtrace and different addresses? If this happens on 2 computers out of a larger cluster, could you please run memtest86 on the crashing machines and verify that the memory is correct? Stephen, if the memory turns out to be fine on both systems we could have an ext3 bug on our hands. The chances are probably small, but I've taken the liberty of assigning the bug to you anyway ;))
This is happening on over 15 computers:
8 new dual-CPU Dells bought this year
4+ Dells, 1-3 years old
1 Micron, 3 years old, single CPU
2 Gateways, around 1.5 years old, single CPU

Since you mention ext3, a side effect of this is that when the computer comes back up and checks the journal, most of the time it comes up fine, but when a backup is done on the system, I get I/O errors on files that did not change, like man pages. To fix this, I have to run a filesystem check, which most of the time moves the files to lost+found.

Memtest86 and other memory and CPU burn-in tests have been run on a few of the computers and all came back fine.

4 of the Dells 1-3 years old, single CPU, ran fine with 7.3 (ext2) for over a year with no failures but fail within a week of loading Enterprise 3. I have been dealing with this problem since Feb. of this year.
In the first attachment, we see the checkpoint code trying to free a journal_head which has a NULL buffer_head. That's a pattern I've never, ever seen reported on any version of ext3. The second attachment is memory corruption of a dcache hash list. That case has always, pretty much without exception, been an indicator of hardware problems in the past. So the two logs we've got so far look exactly like the sorts of things I'd expect to see if hardware was going bad (possibly in unpredictable ways, such as environment overheating.) If there's a software fault causing this, then there's nothing in the logs to indicate _what_ software fault it might be. So the next step is definitely to gather more info. If you could attach as many of the oops logs you've got as possible, that would help in finding common patterns. Even better would be a crash dump (you can't attach those, you'd need to submit them via the RHEL enterprise support channels or provide a URL to them.) If there are 15 machines displaying the symptoms, we should have a fair number of examples to work with.
Created attachment 100510 [details] netdump log from shear
Created attachment 100511 [details] netdump log from zenith
These are all the logs I have so far. About half the computers are in the DMZ, which netdump will not work through due to firewall issues. For some of the others I did not get netdumps because I either had not set up netdump on them yet or they failed before I knew about netdump. I sent a mail message to sct stating the locations of dump files on an FTP site. I did not want to include them here since I was not sure about security issues.
I'm closing this as CANTFIX for now as there is still insufficient information here to identify, never mind get a proper resolution to, this problem. The fact that it has been routed around the proper support channels means it is not getting attention right now. There was one set of oopses that looked ext3-related; the others look like bad hardware or other random corruption. They really need to be looked at by the support group for more extended diagnosis; it's not yet ready for an engineering fix, as we haven't even identified whether it's a hardware, BIOS or OS fault yet. For official Red Hat Enterprise Linux support, please log into the Red Hat support website at http://www.redhat.com/support and file a support ticket, or alternatively contact Red Hat Global Support Services at 1-888-RED-HAT1 to speak directly with a support associate and escalate an issue.