Red Hat Bugzilla – Bug 210161
Kernel panic when booting from SW RAID
Last modified: 2009-08-22 10:40:06 EDT
Description of problem:
Since kernel 2.6.9-42.ELsmp the kernel panics when booting from SW RAID.
Version-Release number of selected component (if applicable):
kernel 2.6.9-42.ELsmp and all subsequent releases
Steps to Reproduce:
1. Create a system with 12 SATA 370GB disks connected to 3Ware 9500S-12 SATA
controller. Use a separate IDE disk (/dev/hda) for '/boot'
2. Create 3 partitions on each of the 12 SATA disks: first - small partition
for swap, second - 10 GB partition for '/' and the third - rest of the disk for
a large 4 TB volume.
3. Configure two SW RAIDs: RAID-5 over the 10 GB partitions for '/' and another
RAID-5 for the large 4 TB volume.
4. Install RHEL4 with kernel 2.6.9-34.0.2.ELsmp. Everything works OK.
5. After upgrading to any later kernel, the kernel panics during boot, unable to
access /dev/md0, which holds the '/' file system.
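For reference, a layout like the one in the steps above could be assembled with commands along these lines. The device names, the array geometry, and the hot spare (mentioned later in this report) are assumptions for illustration, not the reporter's exact commands; the actual arrays were created by the RHEL4 installer.

```shell
# 12 SATA disks sda..sdl on the 3ware controller; on each disk,
# partition 1 = swap, partition 2 = 10 GB for /, partition 3 = rest
# of the disk for the large volume (partitioning itself not shown).

# RAID-5 for '/' -- 11 active members plus one hot spare
mdadm --create /dev/md0 --level=5 --raid-devices=11 --spare-devices=1 \
      /dev/sd[a-l]2

# RAID-5 for the large ~4 TB backup volume
mdadm --create /dev/md1 --level=5 --raid-devices=11 --spare-devices=1 \
      /dev/sd[a-l]3
```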
Actual results:
Kernel 2.6.9-42 and higher is unable to access /dev/md0 holding the '/' file
system.
Expected results:
The kernel should boot normally, as kernel 2.6.9-34.0.2 does.
I did not report the bug earlier because I thought it was a temporary problem.
However, as it has persisted over three recent kernel releases, I decided to
report it.
One more thing: I am using the 3w-9xxx.ko module supplied by 3Ware.
I tried several versions of these modules including updates to the 3Ware
9500S-12 controller BIOS to no avail.
It doesn't help solve the problem here, but out of curiosity, why do software
RAID when you have a pretty decent hardware RAID controller?
As for the software RAID problem itself, we need more details to have a chance
at diagnosing it. Can you attach a serial console to the machine in question
and capture all output as the system boots one of the kernels that panics?
Two reasons for SW RAID:
1. SW RAID is considerably faster.
2. I had trouble with RHEL4 recognizing an almost 4 TB HW RAID-5 volume. It was
some time ago, so I do not remember all the gory details, but basically I was
not able to install RHEL4 on such a volume. Therefore I decided to use SW RAID.
I will post the console output as soon as I am able to obtain it.
Created attachment 142504 [details]
Kernel panic screen right after the Red Hat nash starting
I tried to log the console, but I could not. Logging stops after the BIOS exits
and the boot process starts. The last message on the console is "Red Hat nash
starting" immediately followed by kernel panic. I photographed the panic
console screen and I am attaching the JPG file.
You can likely get some additional useful information if you edit your kernel
params, removing the 'quiet' option. That suppresses lots of fun software raid
startup stuff that would otherwise get spewed after nash starts doing its thing.
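For reference, dropping `quiet` means editing the kernel line in the GRUB config (on RHEL4 this is typically /boot/grub/grub.conf; the kernel path and root device below are illustrative, not taken from the reporter's system):

```
# before
kernel /vmlinuz-2.6.9-42.0.3.ELsmp ro root=/dev/md0 quiet
# after -- drop 'quiet' so md/raid5 startup messages reach the console
kernel /vmlinuz-2.6.9-42.0.3.ELsmp ro root=/dev/md0
```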
I will do exactly that. However, next week I will be at the LISA conference, so
there will be no reply from me till Dec. 11 at least. What makes things rather
complicated is that each panic leaves the RAID in a dirty state, which triggers
an automatic recovery upon the next boot. I cannot cause another panic before
the recovery is finished, otherwise I run into serious trouble. Therefore I
have to switch to the "good" kernel, let the recovery run its course, then
shut down and test the other kernel again. This is my main backup system, so
tinkering with it is a bit scary. I do not have any other 4 TB machine to use
for testing.
Thanks for your help so far.
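A quick sketch of one way to confirm the recovery described above has finished before rebooting again: check /proc/mdstat for an active resync/recovery line. The parsing function and the sample mdstat text are illustrative assumptions, not output from the reporter's machine.

```python
# Sketch: decide whether an md array is still resyncing before risking
# another reboot. Reads text in the usual /proc/mdstat format.

def is_recovering(mdstat_text: str) -> bool:
    """Return True if any array reports an active resync or recovery."""
    return any(word in mdstat_text for word in ("resync", "recovery"))

# Illustrative sample of a dirty array mid-resync:
sample = """\
Personalities : [raid5]
md0 : active raid5 sdb2[0] sdc2[1] sdd2[2]
      20964608 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      [==>..................]  resync = 12.5% (2620576/20964608)
"""

print(is_recovering(sample))  # True while the resync progress line is present
```

In practice one would pass `open("/proc/mdstat").read()` instead of the sample string.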
Created attachment 143632 [details]
Console log up to the kernel panic
Sorry for my delay in replying.
I am attaching the complete console log from the boot start
up to the kernel panic.
/dev/md0 = /
/dev/md1 = /backup
Hrm... Looks like we get far enough that we see there's an ext3 file system on
the raid volume, but we're kicking the bucket somewhere in the raid5 code.
For the sake of clarity, is the output correct in that there really are only 11
drives in the array, not 12 as you'd previously indicated? Also, you mentioned
using 3ware's kernel module: is that being used in all cases, or only with the
older/working kernel? I'm wondering if that has anything to do with the problem...
There are 12 drives, but one of them is hot spare. So there
are only 11 active drives.
As far as the 3ware module is concerned, I just noticed that the version in the
working kernel is different from the one in the panicking kernels.
Unfortunately I cannot do anything about it remotely.
I'll get back to you around Jan. 9, 2007.
Happy new year to all at RedHat!
Created attachment 145344 [details]
Boot log up to the kernel panic
Sorry for the delayed response.
I am attaching the boot log up to the kernel panic.
This time the 3Ware module is exactly the same one that works OK with
kernel 2.6.9-34.0.2.ELsmp.
Reassigning to Doug, who I believe is our software raid specialist.
Weird. The CPU is pretty recent. Can you boot into the working kernel and
attach the output of /proc/cpuinfo to this bug? For some reason, the machine is
choking on the RAID subsystem's decision to use the SSE instruction set for
parity calculations (it's giving an invalid opcode error, not the typical null
pointer oops). This seems to imply that there is an issue on this particular
CPU with possibly just some SSE instructions, or something like that.
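Since the suspicion is that the raid5 parity code picked an SSE routine the CPU mishandles, one simple cross-check is which SSE variants the kernel actually advertises in /proc/cpuinfo. A small sketch of that check follows; the helper function and the sample cpuinfo line are illustrative assumptions, not data from this bug.

```python
# Sketch: collect the SSE-family flags from /proc/cpuinfo text, to compare
# against what the raid5 parity code expects to be able to use.

def sse_flags(cpuinfo_text: str) -> set:
    """Return the set of sse* flags found on any 'flags' line."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(f for f in line.split(":", 1)[1].split()
                         if f.startswith("sse"))
    return flags

# Illustrative sample of a cpuinfo flags line:
sample = "flags\t\t: fpu vme de pse tsc msr sse sse2 ht\n"

print(sorted(sse_flags(sample)))  # ['sse', 'sse2']
```

Against a live system, one would pass `open("/proc/cpuinfo").read()` instead of the sample.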
Created attachment 146947 [details]
Here's the cpuinfo. BTW, I think I've seen some settings in the BIOS involving
SSE. I will experiment with them as soon as I can get hold of the machine.
I disabled the SSE instruction set in the BIOS and rebooted to the 42.0.3
kernel. It panicked in exactly the same way as it always does.
Back to 34.0.2...
-ESTALEBUG. I was unable to reproduce this issue, and I'm guessing you've long since moved beyond this problem in one way or another. Closing this bug out.
The problem disappeared with the update to RHEL5.