Bug 210161 - Kernel panic when booting from SW RAID
Summary: Kernel panic when booting from SW RAID
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Doug Ledford
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-10-10 14:55 UTC by Frank Bures
Modified: 2009-08-22 14:40 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-08-22 13:36:34 UTC
Target Upstream Version:
Embargoed:


Attachments
Kernel panic screen right after the Red Hat nash starting (41.97 KB, image/jpeg)
2006-11-30 18:55 UTC, Frank Bures
Console log up to the kernel panic (30.69 KB, application/octet-stream)
2006-12-14 14:37 UTC, Frank Bures
Boot log up to the kernel panic (28.78 KB, text/plain)
2007-01-11 13:27 UTC, Frank Bures
/proc/cpuinfo (1.19 KB, text/plain)
2007-01-30 18:43 UTC, Frank Bures

Description Frank Bures 2006-10-10 14:55:10 UTC
Description of problem:
Since kernel 2.6.9-42.ELsmp, the kernel panics when booting from SW RAID.

Version-Release number of selected component (if applicable):

kernel 2.6.9-42.ELsmp and all subsequent releases

How reproducible:

always

Steps to Reproduce:
1. Create a system with 12 SATA 370GB disks connected to a 3Ware 9500S-12 SATA
controller.  Use a separate IDE disk (/dev/hda) for '/boot'.
2. Create 3 partitions on each of the 12 SATA disks: a small first partition
for swap, a 10 GB second partition for '/', and the rest of the disk for a
large 4 TB volume.
3. Configure two SW RAIDs: RAID-5 over the 10 GB partitions for '/' and
another RAID-5 for the large 4 TB volume (a hedged mdadm sketch follows after
these steps).
4. Install RHEL4 with kernel 2.6.9-34.0.2.ELsmp.  Everything works OK.
5. After upgrading to any later kernel, the kernel panics during boot, unable
to access /dev/md0, which holds the '/' file system.
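
For reference, a rough sketch of how this layout could be assembled by hand
with mdadm.  The arrays were actually created by the installer, so the device
names (sda through sdl) and the exact options below are assumptions, not the
original commands; the 11-active-plus-one-spare split comes from a later
comment.

  # 12 disks sda..sdl, each partitioned: 1=swap, 2=10GB for '/', 3=the rest
  # RAID-5 for '/' over the 10GB partitions, 11 active members plus a hot spare
  mdadm --create /dev/md0 --level=5 --raid-devices=11 --spare-devices=1 \
        /dev/sd[a-l]2
  # RAID-5 over the remaining space for the large ~4TB volume
  mdadm --create /dev/md1 --level=5 --raid-devices=11 --spare-devices=1 \
        /dev/sd[a-l]3
  mkfs.ext3 /dev/md0
  mkfs.ext3 /dev/md1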
  
Actual results:
Kernel 2.6.9-42 and higher is unable to access /dev/md0, which holds the '/'
file system.

Expected results:
The kernel should boot normally, as kernel 2.6.9-34.0.2 does.

Additional info:
I did not report the bug earlier because I thought it was a temporary problem.
However, as it has persisted over the three most recent kernel versions, I
decided to report it.

Comment 1 Frank Bures 2006-10-10 14:58:37 UTC
One more thing:  I am using the 3w-9xxx.ko module supplied by 3Ware.  I tried
several versions of this module, as well as updates to the 3Ware 9500S-12
controller BIOS, to no avail.
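
For anyone retracing this, a quick way to confirm which 3w-9xxx build a given
kernel actually loads.  The module path below is the standard RHEL4 location
and is an assumption here:

  # Version of the module installed for the running kernel
  modinfo /lib/modules/$(uname -r)/kernel/drivers/scsi/3w-9xxx.ko | grep -i version
  # Confirm which driver the running kernel loaded and what it logged
  lsmod | grep 3w_9xxx
  dmesg | grep -i 3w-9xxx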

Comment 2 Jarod Wilson 2006-11-30 15:44:25 UTC
It doesn't help solve the problem here, but out of curiosity, why use software
RAID when you have a pretty decent hardware RAID controller?

As for trying to resolve the software RAID problem, we need more details to
have a chance at diagnosing it. Can you attach a serial console to the machine
in question and capture all output as the system boots on one of the kernels
that panic?
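
One common way to capture that output, sketched under the assumption of a
null-modem cable on the first serial port and RHEL4's usual GRUB setup:

  # Append to the kernel line in /boot/grub/grub.conf on the test machine:
  #   console=tty0 console=ttyS0,115200n8
  # Then log everything on a second machine attached to the serial cable:
  minicom -D /dev/ttyS0 -b 115200 -C boot.log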


Comment 3 Frank Bures 2006-11-30 15:56:48 UTC
Two reasons for SW RAID:
1. SW RAID is considerably faster (a crude way to measure this is sketched
after this list).
2. I had trouble with RHEL4 recognizing an almost-4TB HW RAID-5 volume.  It
was some time ago, so I do not remember all the gory details, but basically I
was not able to install RHEL4 on such a volume.  Therefore I decided to use
SW RAID.
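
A crude way to put a number on that speed claim; the device name below is an
assumption rather than this machine's actual setup:

  # Sequential read throughput off the SW RAID volume
  dd if=/dev/md1 of=/dev/null bs=1M count=4096
  # hdparm's buffered read timing gives a similar quick estimate
  hdparm -t /dev/md1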

I will post the console output as soon as I am able to obtain it.

Comment 4 Frank Bures 2006-11-30 18:55:15 UTC
Created attachment 142504 [details]
Kernel panic screen right after the Red Hat nash starting

Comment 5 Frank Bures 2006-11-30 18:57:06 UTC
I tried to log the console, but I could not.  Logging stops after the BIOS
exits and the boot process starts.  The last message on the console is "Red
Hat nash starting", immediately followed by the kernel panic.  I photographed
the panic screen and I am attaching the JPG file.

Comment 6 Jarod Wilson 2006-11-30 21:47:00 UTC
You can likely get some additional useful information if you edit your kernel
parameters, removing the 'quiet' option. That option suppresses lots of fun
software RAID startup output that would otherwise get spewed after nash starts
doing its thing.
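
Concretely, that means editing the kernel line of the panicking entry in
/boot/grub/grub.conf; the root= argument below is an assumption, and whatever
the existing entry has should stay:

  # Before (typical RHEL4 entry):
  #   kernel /vmlinuz-2.6.9-42.ELsmp ro root=/dev/md0 quiet
  # After, with 'quiet' dropped so the md/raid5 startup messages are shown:
  #   kernel /vmlinuz-2.6.9-42.ELsmp ro root=/dev/md0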

Comment 7 Frank Bures 2006-12-01 22:51:56 UTC
I will do exactly that.  However, next week I will be at the LISA conference,
so there will be no reply from me until Dec. 11 at the earliest.  What makes
things rather complicated is that each panic leaves the RAID in a dirty state,
which triggers an automatic recovery on the next boot.  I cannot cause another
panic before the recovery is finished, otherwise I run into serious trouble.
Therefore I have to switch to the "good" kernel, let the recovery run at its
own pace, then shut down and test the other kernel again.  This is my main
backup system, so tinkering with it is a bit scary.  I do not have any other
4TB machine to use instead :-)
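
A minimal way to watch that recovery and know when it is safe to reboot again,
assuming the arrays are md0 and md1 as described above:

  # Recovery shows up per array with a percentage and an ETA
  watch -n 30 cat /proc/mdstat
  # Or query one array's state directly (clean vs. recovering)
  mdadm --detail /dev/md0 | grep -E 'State|Rebuild'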
Thanks for your help so far.
Frank

Comment 8 Frank Bures 2006-12-14 14:37:12 UTC
Created attachment 143632 [details]
Console log up to the kernel panic

Sorry for my delay in replying.
I am attaching the complete console log, from the start of boot
up to the kernel panic.
/dev/md0 = /
/dev/md1 = /backup

Comment 9 Jarod Wilson 2006-12-21 19:50:41 UTC
Hrm... It looks like we get far enough to see there's an ext3 file system on
the RAID volume, but we're kicking the bucket somewhere in the raid5
reconstruction code.

For the sake of clarity, is the output correct in that there really are only
11 drives in the array, not 12 as you'd previously indicated? Also, you
mentioned using 3ware's kernel module: is it being used in all cases, or only
with the older/working kernel? I'm wondering if that has anything to do with
the problem...

Comment 10 Frank Bures 2006-12-31 14:52:49 UTC
There are 12 drives, but one of them is a hot spare, so there are only 11
active drives.
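
The active/spare split is easy to confirm from the array itself; a sketch,
again assuming md0 as above:

  # 'Raid Devices : 11' plus one member listed as 'spare' at the bottom
  mdadm --detail /dev/md0
  # /proc/mdstat marks the hot spare with an (S) after its device name
  cat /proc/mdstat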
As far as the 3ware module is concerned, I just noticed that the version in
the working kernel is different from the one in the panicking kernels.
Unfortunately, I cannot do anything about it remotely.
I'll get back to you around Jan. 9, 2007.
Happy new year to all at RedHat!



Comment 11 Frank Bures 2007-01-11 13:27:29 UTC
Created attachment 145344 [details]
Boot log up to the kernel panic

Sorry for the delayed response.
I am attaching the boot log up to the kernel panic.
This time the 3Ware module is exactly the same one that works OK with kernel
2.6.9-34.0.2.ELsmp.

Comment 12 Jarod Wilson 2007-01-11 20:08:09 UTC
Reassigning to Doug, who I believe is our software raid specialist.

Comment 13 Doug Ledford 2007-01-30 18:07:33 UTC
Weird.  The CPU is pretty recent.  Can you boot into the working kernel and
attach the output of /proc/cpuinfo to this bug?  For some reason, the machine
is choking on the RAID subsystem's decision to use the SSE instruction set for
parity calculations (it's giving an invalid opcode error, not the typical null
pointer oops).  This seems to imply that there is an issue on this particular
CPU with possibly just some SSE instructions, or something like that.
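
Two quick places to look for what the RAID code decided, sketched from memory
of the usual RHEL4 boot messages rather than this machine's log:

  # SSE-family flags the CPU advertises
  grep -o 'sse[^ ]*' /proc/cpuinfo | sort -u
  # Which XOR/parity routine the raid5 code benchmarked and picked at boot
  dmesg | grep -i 'raid5: using function'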

Comment 14 Frank Bures 2007-01-30 18:43:27 UTC
Created attachment 146947 [details]
/proc/cpuinfo

Here's the cpuinfo.  BTW, I think I've seen some settings in the BIOS
involving SSE.  I will experiment with them as soon as I can get hold of the
machine.
Thanks

Comment 15 Frank Bures 2007-02-02 16:48:20 UTC
I disabled the SSE instruction set in the BIOS and rebooted into the 42.0.3
kernel.  It panicked in exactly the same way it always does.
Back to 34.0.2...



Comment 16 Doug Ledford 2009-08-22 13:36:34 UTC
-ESTALEBUG.  I was unable to reproduce this issue, and I'm guessing you've long since moved beyond this problem in one way or another.  Closing this bug out.

Comment 17 Frank Bures 2009-08-22 14:40:06 UTC
The problem disappeared with the update to RHEL5.

Cheers
Frank

