Bug 210161
| Summary: | Kernel panic when booting from SW RAID | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Frank Bures <fbures> |
| Component: | kernel | Assignee: | Doug Ledford <dledford> |
| Status: | CLOSED WORKSFORME | QA Contact: | Brian Brock <bbrock> |
| Severity: | high | Priority: | medium |
| Version: | 4.3 | CC: | jarod |
| Hardware: | x86_64 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2009-08-22 13:36:34 UTC |
Description
Frank Bures
2006-10-10 14:55:10 UTC
One more thing: I am using the 3w-9xxx.ko module supplied by 3Ware. I tried several versions of these modules, including updates to the 3Ware 9500S-12 controller BIOS, to no avail.

It doesn't help solve the problem here, but out of curiosity, why do software RAID when you have a pretty decent hardware RAID controller? As for trying to resolve the software RAID problem, we need more details to have a chance at diagnosing it. Can you attach a serial console to the machine in question and capture all output as the system boots on one of the kernels that panics?

Two reasons for SW RAID:
1. SW RAID is considerably faster.
2. I had trouble with RHEL4 recognizing an almost 4 TB HW RAID-5 volume. It was some time ago so I do not remember all the gory details, but basically I was not able to install RHEL4 on such a volume.

Therefore I decided to use SW RAID. I will post the console output as soon as I am able to obtain it.

Created attachment 142504 [details]
Kernel panic screen right after the Red Hat nash starting
I tried to log the console, but I could not. Logging stops after the BIOS exits and the boot process starts. The last message on the console is "Red Hat nash starting", immediately followed by the kernel panic. I photographed the panic console screen and am attaching the JPG file.

You can likely get some additional useful information if you edit your kernel params, removing the 'quiet' option. That suppresses lots of fun software RAID startup output that would otherwise get spewed after nash starts doing its thing.

I will do exactly that. However, next week I will be at the LISA conference, so there will be no reply from me till Dec. 11 at least. What makes things rather complicated is that each panic leaves the RAID in a dirty state, which triggers an automatic recovery upon the next boot. I cannot cause another panic before the recovery is finished, otherwise I run into serious trouble. Therefore I have to switch to the "good" kernel, let the recovery run its course, then shut down and test the other kernel again. This is my main backup system, so tinkering with it is a bit scary. I do not have any other 4 TB machine to use instead :-)

Thanks for your help so far.

Frank

Created attachment 143632 [details]
Console log up to the kernel panic
Sorry for my delay in replying.
I am attaching the complete console log from the boot start
up to the kernel panic.
/dev/md0 = /
/dev/md1 = /backup
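The recovery constraint described earlier in the thread (each panic dirties the array, and a resync must finish before the next test boot) can be checked from a script before rebooting. A minimal sketch; the mdstat text below is a fabricated sample, and on the live system you would read /proc/mdstat directly:

```shell
# Decide whether it is safe to reboot into a test kernel by looking for
# an in-progress resync/recovery. Fabricated sample; on the real machine
# you would use: mdstat=$(cat /proc/mdstat)
mdstat='md1 : active raid5 sdb1[0] sdc1[1] sdd1[2]
      [=>...................]  recovery =  8.3% (12345/148992)'

if printf '%s\n' "$mdstat" | grep -qE 'resync|recovery'; then
    echo "array still rebuilding - stay on the working kernel"
else
    echo "arrays clean - safe to test the panicking kernel"
fi
# prints "array still rebuilding - stay on the working kernel"
```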
Hrm... Looks like we get far enough that we see there's an ext3 file system on the RAID volume, but we're kicking the bucket somewhere in the raid5 reconstruction code. For the sake of clarity, is the output correct in that there really are only 11 drives in the array, not 12 as you'd previously indicated? Also, you mentioned using 3ware's kernel module: is that being used in all cases, or only with the older/working kernel? I'm wondering if that has anything to do with the problem...

There are 12 drives, but one of them is a hot spare, so there are only 11 active drives. As far as the 3ware module is concerned, I just noticed that the version in the working kernel is different from the one in the panicking kernels. Unfortunately I cannot do anything with it remotely. I'll get back to you around Jan. 9, 2007. Happy new year to all at Red Hat!

Created attachment 145344 [details]
Boot log up to the kernel panic
Sorry for the delayed response.
I am attaching the boot log up to the kernel panic.
This time the 3Ware module is exactly the same one as the one
that works OK with kernel 2.6.9-34.0.2.ELsmp
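The 12-drives-but-11-active point above can be cross-checked against the kind of summary `mdadm --detail` prints. A sketch with the counts taken from this report; the three summary lines are modeled on mdadm's output format, not captured from the machine:

```shell
# Fabricated mdadm-style summary for the array in this report
# (on the live system: mdadm --detail /dev/md1).
detail='Raid Devices : 11
Total Devices : 12
Spare Devices : 1'

raid=$(printf '%s\n' "$detail"  | awk '/Raid Devices/  {print $NF}')
total=$(printf '%s\n' "$detail" | awk '/Total Devices/ {print $NF}')
spare=$(printf '%s\n' "$detail" | awk '/Spare Devices/ {print $NF}')

# An array with one hot spare reports active + spare = total.
if [ $((raid + spare)) -eq "$total" ]; then
    echo "consistent: $raid active + $spare spare = $total drives"
fi
# prints "consistent: 11 active + 1 spare = 12 drives"
```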
Reassigning to Doug, who I believe is our software RAID specialist.

Weird. The CPU is pretty recent. Can you boot into the working kernel and attach the output of /proc/cpuinfo to this bug? For some reason, the machine is choking on the RAID subsystem's decision to use the SSE instruction set for parity calculations (it's giving an invalid opcode error, not the typical null pointer oops). This seems to imply that there is an issue on this particular CPU with possibly just some SSE instructions, or something like that.

Created attachment 146947 [details]
/proc/cpuinfo
Here's the cpuinfo. BTW, I think I've seen some settings in the BIOS
involving SSE. I will experiment with them as soon as I can get hold
of the machine.
Thanks
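Since the panic points at the raid5 SSE parity routines, one quick check is which SIMD extensions the CPU advertises in the flags line of /proc/cpuinfo. A sketch using a fabricated flags line; on the real machine you would run `grep '^flags' /proc/cpuinfo` instead:

```shell
# Fabricated cpuinfo flags line for illustration; the md raid5 code
# chooses among mmx/sse/sse2 xor routines based on these flags.
flags='fpu vme de pse tsc msr pae mce cx8 sep mtrr pge cmov pat mmx fxsr sse sse2 ht syscall nx lm'

for ext in mmx sse sse2; do
    if printf '%s\n' "$flags" | grep -qw "$ext"; then
        echo "$ext: present"
    else
        echo "$ext: absent"
    fi
done
# prints:
# mmx: present
# sse: present
# sse2: present
```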
I disabled the SSE instruction set in the BIOS and rebooted to the 42.0.3 kernel. It panicked in exactly the same way as it always does. Back to 34.0.2...

-ESTALEBUG. I was unable to reproduce this issue, and I'm guessing you've long since moved beyond this problem in one way or another. Closing this bug out.

The problem disappeared with the update to RHEL5.

Cheers
Frank