Description of problem:
Since kernel 2.6.9-42.ELsmp the kernel panics when booting from SW RAID.

Version-Release number of selected component (if applicable):
kernel 2.6.9-42.ELsmp and all subsequent releases

How reproducible:
Always

Steps to Reproduce:
1. Create a system with 12 SATA 370 GB disks connected to a 3Ware 9500S-12 SATA controller. Use a separate IDE disk (/dev/hda) for '/boot'.
2. Create 3 partitions on each of the 12 SATA disks: a small first partition for swap, a 10 GB second partition for '/', and a third partition covering the rest of the disk for a large 4 TB volume.
3. Configure two SW RAIDs: a RAID-5 over the 10 GB partitions for '/' and another RAID-5 for the large 4 TB volume. Install RHEL4 with kernel 2.6.9-34.0.2.ELsmp. Everything works OK.
4. After an upgrade to any higher kernel, the kernel panics during boot, unable to access /dev/md0, which holds the '/' file system.

Actual results:
Kernel 2.6.9-42 and higher is unable to access /dev/md0 holding the '/' file system.

Expected results:
The kernel should boot normally, as kernel 2.6.9-34.0.2 does.

Additional info:
I did not report the bug earlier because I thought it was a temporary problem. However, as it has persisted over three recent kernel versions, I decided to report it.
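For reference, a minimal sketch of how two such RAID-5 arrays could be created with mdadm, assuming the 3Ware-attached disks appear as /dev/sda through /dev/sdl with the partition layout described above (one disk configured as a hot spare, per a later comment). These commands are illustrative, not the exact ones used on this system:

    # Illustrative only; device names, counts and options are assumptions.
    # Partition 2 of each disk forms the RAID-5 for '/',
    # partition 3 forms the RAID-5 for the large 4 TB volume.
    mdadm --create /dev/md0 --level=5 --raid-devices=11 --spare-devices=1 /dev/sd[a-l]2
    mdadm --create /dev/md1 --level=5 --raid-devices=11 --spare-devices=1 /dev/sd[a-l]3
    cat /proc/mdstat    # confirm both arrays assemble and start resyncing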
One more thing: I am using the 3w-9xxx.ko module supplied by 3Ware. I have tried several versions of this module, as well as updating the 3Ware 9500S-12 controller BIOS, to no avail.
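When swapping in a vendor-supplied 3w-9xxx.ko, the module also has to end up in the initrd used at boot, since '/' itself lives behind the controller. A minimal sketch of the usual RHEL4 steps, assuming the module has been installed under /lib/modules for the target kernel (the kernel version string here is illustrative):

    # Assumes the vendor 3w-9xxx.ko has already been copied into
    # /lib/modules/2.6.9-42.ELsmp/... for the kernel being tested.
    depmod -a 2.6.9-42.ELsmp
    mkinitrd -f /boot/initrd-2.6.9-42.ELsmp.img 2.6.9-42.ELsmp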
It doesn't help solve the problem here, but out of curiosity, why do software RAID when you have a pretty decent hardware RAID controller? As for the software RAID issue itself, we need more details to have a chance at diagnosing it. Can you attach a serial console to the machine in question and capture all output as the system boots one of the kernels that panics?
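A minimal sketch of one way to capture boot output over a serial console on RHEL4, assuming the first serial port (ttyS0) at 115200 baud; the exact GRUB entry, port and baud rate on this machine are assumptions:

    # /boot/grub/grub.conf (fragment) -- send GRUB and kernel output to ttyS0
    serial --unit=0 --speed=115200
    terminal --timeout=5 serial console
    # and append to the kernel line of the entry being tested:
    #   console=ttyS0,115200 console=tty0

    # On the machine at the other end of the null-modem cable:
    minicom -D /dev/ttyS0        # or: screen /dev/ttyS0 115200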
Two reasons for SW RAID:
1. SW RAID is considerably faster.
2. I had trouble with RHEL4 recognizing an almost 4 TB HW RAID-5. It was some time ago, so I do not remember all the gory details, but basically I was not able to install RHEL4 on such a volume. Therefore I decided to use SW RAID.

I will post the console output as soon as I am able to obtain it.
Created attachment 142504
Kernel panic screen right after "Red Hat nash starting"
I tried to log the console, but I could not. Logging stops after the BIOS exits and the boot process starts. The last message on the console is "Red Hat nash starting" immediately followed by kernel panic. I photographed the panic console screen and I am attaching the JPG file.
You can likely get some additional useful information if you edit your kernel parameters and remove the 'quiet' option. That option suppresses a lot of the software RAID startup output that would otherwise get spewed after nash starts doing its thing.
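A minimal sketch of what that edit might look like in /boot/grub/grub.conf on RHEL4; the root device, partition and kernel version shown are assumptions:

    # /boot/grub/grub.conf (fragment) -- 'quiet' removed from the kernel line
    title Red Hat Enterprise Linux AS (2.6.9-42.0.3.ELsmp)
            root (hd0,0)
            kernel /vmlinuz-2.6.9-42.0.3.ELsmp ro root=/dev/md0
            initrd /initrd-2.6.9-42.0.3.ELsmp.img
    # For a one-off test, the same line can be edited interactively from the
    # GRUB menu by pressing 'e' on the entry and deleting 'quiet'.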
I will do exactly that. However, next week I will be at the LISA conference, so there will be no reply from me till Dec. 11 at least.

What makes things rather complicated is that each panic leaves the RAID in a dirty state, which triggers an automatic recovery upon the next boot. I cannot cause another panic before the recovery is finished, otherwise I run into serious trouble. Therefore I have to switch to the "good" kernel, let the recovery run its course, then shut down and test the other kernel again. This is my main backup system, so tinkering with it is a bit scary. I do not have any other 4 TB machine to use instead :-)

Thanks for your help so far.
Frank
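A minimal sketch of how the recovery progress can be watched before risking another test boot; the array names follow the layout above and are otherwise assumptions:

    # Watch the resync/recovery progress of all md arrays.
    watch -n 30 cat /proc/mdstat
    # Or inspect a single array in detail, e.g. the root array:
    mdadm --detail /dev/md0 | grep -Ei 'state|rebuild|spare'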
Created attachment 143632
Console log up to the kernel panic

Sorry for my delay in replying. I am attaching the complete console log from the start of boot up to the kernel panic.

/dev/md0 = /
/dev/md1 = /backup
Hrm... Looks like we get far enough that we see there's an ext3 file system on the raid volume, but we're kicking the bucket somewhere in the raid5 reconstruction code. For the sake of clarity, is the output correct in that there really are only 11 drives in the array, not 12 as you'd previously indicated? Also, you mentioned using 3ware's kernel module: is that being used in all cases, or only with the older/working kernel? I'm wondering if that has anything to do with the problem...
There are 12 drives, but one of them is a hot spare, so there are only 11 active drives. As far as the 3Ware module is concerned, I just noticed that the version in the working kernel is different from the one in the panicking kernels. Unfortunately I cannot do anything about it remotely. I'll get back to you around Jan. 9, 2007. Happy new year to all at Red Hat!
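A minimal sketch of how the 3w-9xxx driver versions in the two kernels could be compared; the kernel version strings and module paths are illustrative, and a vendor-supplied module may live under an updates/ directory instead:

    # Compare the driver versions shipped with the working and panicking kernels.
    modinfo /lib/modules/2.6.9-34.0.2.ELsmp/kernel/drivers/scsi/3w-9xxx.ko | grep -i version
    modinfo /lib/modules/2.6.9-42.0.3.ELsmp/kernel/drivers/scsi/3w-9xxx.ko | grep -i version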
Created attachment 145344
Boot log up to the kernel panic

Sorry for the delayed response. I am attaching the boot log up to the kernel panic. This time the 3Ware module is exactly the same as the one that works OK with kernel 2.6.9-34.0.2.ELsmp.
Reassigning to Doug, who I believe is our software raid specialist.
Weird. The CPU is pretty recent. Can you boot into the working kernel and attach the output of /proc/cpuinfo to this bug? For some reason, the machine is choking on the RAID subsystem's decision to use the SSE instruction set for parity calculations (it's giving an invalid opcode error, not the typical null pointer oops). This seems to imply that there is an issue on this particular CPU with possibly just some SSE instructions, or something like that.
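A quick, hedged way to see which SIMD feature flags the CPU advertises while booted into the working kernel (flag names are the standard /proc/cpuinfo ones; "pni" is how SSE3 is reported there):

    # List the MMX/SSE-related feature flags the kernel detected.
    grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(mmx|sse|sse2|pni)$'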
Created attachment 146947
/proc/cpuinfo

Here's the cpuinfo. BTW, I think I've seen some settings in the BIOS involving SSE. I will experiment with them as soon as I can get hold of the machine. Thanks.
I disabled the SSE instruction set in the BIOS and rebooted into the 42.0.3 kernel. It panicked in exactly the same way it always does. Back to 34.0.2...
-ESTALEBUG. I was unable to reproduce this issue, and I'm guessing you've long since moved beyond this problem in one way or another. Closing this bug out.
The problem disappeared with the update to RHEL5.

Cheers,
Frank