Bug 87509 - Dual Xeon, Supermicro P4DC6+, adaptec SCSI RAID 5, installs ok, reboot, kernel panics (unable to mount root FS)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: anaconda
Version: 7.3
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jeremy Katz
QA Contact: Mike McLean
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2003-03-28 01:24 UTC by Kyle Simpson
Modified: 2007-04-18 16:52 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2003-07-29 13:34:31 UTC
Embargoed:


Attachments
JPG screen shot of the first page show during a normal (non-rescue) boot... (361.41 KB, image/jpeg)
2003-04-02 21:31 UTC, Kyle Simpson
JPG screen shot of the next page shown during a normal (non-rescue) boot... (329.90 KB, image/jpeg)
2003-04-02 21:34 UTC, Kyle Simpson

Description Kyle Simpson 2003-03-28 01:24:59 UTC
Description of problem:

I built a system with a Supermicro P4DC6+ motherboard, dual Xeon 2.2GHz 
processors, Adaptec 2005s ZCR RAID card, 4 Quantum SCA 36GB hds, and 1024MB ram.

I have attempted (about a hundred times) to install several different versions 
of Red Hat, including 7.3 (+ errata updates) and 8.0.  Every single time, no 
matter what I do during installation, or what boot parameters I give the 
kernel, I get the exact same behavior:

The installation process completes just fine, and the system goes for reboot... 
on boot, it goes through and does the hardware detection, finding the cdrom, 
the floppy, etc.  At this point, on this and all subsequent reboots, I get the 
exact same error message:

loading scsi_mod module
kmod failed: /sbin/kmod -k -s block-major-8, errno=2
VFS: cannot open root device on "sda2" or 08:02
please append a correct "root=" boot option
Kernel Panic: VFS: unable to mount root FS on 08:02
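
For readers decoding the message above: "08:02" is the root device written as a hexadecimal major:minor pair, and "block-major-8" in the kmod line refers to the same major number. The following is a minimal sketch, assuming the conventional Linux numbering in which major 8 is the first SCSI disk major with 16 minors per disk. The function name `decode_scsi_dev` is hypothetical and used only for illustration.

```python
import string

SCSI_DISK_MAJOR = 8    # first SCSI disk major ("block-major-8" in the kmod line)
MINORS_PER_DISK = 16   # minor 0 = whole disk, minors 1..15 = partitions


def decode_scsi_dev(pair: str) -> str:
    """Translate a 'MM:mm' hex pair from a VFS panic into a device name."""
    major_hex, minor_hex = pair.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != SCSI_DISK_MAJOR:
        raise ValueError(f"major {major} is not the first SCSI disk major")
    # Each disk owns a block of 16 minors; the remainder is the partition.
    disk_index, partition = divmod(minor, MINORS_PER_DISK)
    name = "sd" + string.ascii_lowercase[disk_index]
    return name if partition == 0 else f"{name}{partition}"


print(decode_scsi_dev("08:02"))  # sda2
```

So the kernel is looking for the right partition (sda2); the failure is that no driver has registered that block device by the time the root FS is mounted.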

However, if I boot to the "linux rescue" or to a boot disk, booting is fine, 
and it mounts my sysimage, and I can access all my mounted partitions, 
including / and /boot.  The failure ONLY occurs on a normal boot, and it occurs 
when choosing either the "2.4.18-3SMP" or the "2.4.18-3" kernel from LILO or 
GRUB.

Now, through the course of all my investigations, I have determined the 
following things:

1. the dpt_i2o driver (which takes the place of the generic aic7xxx scsi 
driver) is the appropriate driver (according to Adaptec) for the RAID SCSI 
subsystem.  I confirmed this by seeing that when the "rescue" or "boot disk" 
kernels load, they refer to loading the dpt_i2o driver, and all works fine. 
However, I do not see the "dpt_i2o" driver referred to during normal boot. I 
suspect that is because the boot is failing too early, though it would make 
sense that it would need to load earlier to mount the root FS.

2. in /etc/modules.conf, I do see:
alias scsi_hostadapter dpt_i2o

so, I know the installation is telling the kernel to load the correct scsi 
driver.  I do not know where the "loading scsi_mod module" line is coming from 
in the boot process, but it appears right after the line about "RedHat Nash" 
starting, so maybe it's in there?

3. I have found hundreds of newsgroup posts on this subject, most of 
which seem to hint that maybe the Red Hat installation is not 
correctly building the dpt_i2o driver into the kernel, and that the system 
is then trying to load the generic "block-major-8" scsi driver instead, which 
fails to give access to the root FS. Also, there have been hints that "loading 
scsi_mod module" followed by kmod means it's trying to load the scsi driver as a 
module from the not-yet-mounted FS, so it obviously can't access the driver 
from an FS that isn't mounted yet.

4. I have tried about every possible kernel command out there, 
like "apic", "noapic", "noapm", and a whole host of others, turning off power 
APM modes, forcing single proc, etc etc etc. None have any effect whatsoever.

5. I have however found several different listings from people who HAVE 
successfully installed Red Hat on nearly identical configurations (motherboard, 
etc), and I was very careful to check all the hardware compatibility lists for 
any known conflicts before venturing in this direction.

6. I also updated the motherboard's bios to the latest rev, 1.2c.

7. I found several newsgroup posts indicating that maybe "hyperthreading" was 
causing a problem, so I disabled that. No go. I have stripped down the system 
completely, taking out the NIC, swapping video cards, taking down the system to 
one processor, taking out the RAID card (and installing to just one SCSI hd), 
disabling the SCSI subsystem altogether (and trying to install to an IDE hd). 
None of this has made any difference in solving my problem!
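
As a rough illustration of the check in item 2, the sketch below pulls the SCSI host adapter drivers out of modules.conf text. It assumes only the `alias scsi_hostadapter <driver>` line format quoted above; the function name `scsi_hostadapters` and the `alias eth0 e1000` sample line are hypothetical and not taken from this system.

```python
import re
from typing import List

# Matches "alias scsi_hostadapter <driver>" and numbered variants such as
# "alias scsi_hostadapter1 <driver>".
ALIAS_RE = re.compile(r"^\s*alias\s+scsi_hostadapter\d*\s+(\S+)")


def scsi_hostadapters(modules_conf_text: str) -> List[str]:
    """Return the driver names from scsi_hostadapter alias lines, in file order."""
    drivers = []
    for line in modules_conf_text.splitlines():
        line = line.split("#", 1)[0]  # ignore comments
        match = ALIAS_RE.match(line)
        if match:
            drivers.append(match.group(1))
    return drivers


# Hypothetical sample; only the scsi_hostadapter line comes from this report.
sample = "alias eth0 e1000\nalias scsi_hostadapter dpt_i2o\n"
print(scsi_hostadapters(sample))  # ['dpt_i2o']
```

A result listing dpt_i2o only shows that the installed configuration names the right driver. It does not show that the driver was actually placed in the initrd used at boot.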


I have spent hours on the phone with Red Hat support, with Adaptec support, and 
now with Supermicro tech support. Everyone seems to be baffled as to what might 
be the cause.  A default install onto hardware which DOES otherwise work with 
other OS's like Microsoft's, and which gets this error persistently, seems to 
me like it has to be some bug or undocumented conflict.

I would certainly appreciate some guidance in this, I'm at the end of my rope 
and my neck is on the line with my employer to figure this out!


Version-Release number of selected component (if applicable):
kernel-2.4.18-3SMP  and  kernel-2.4.18-3

How reproducible:
Every time

Steps to Reproduce:
1. Install Red Hat 7.3 or 8.0
2. reboot
3. choose either SMP or -up Kernel (from either GRUB or LILO)
    
Actual results:
booting began as normal, detecting the cdrom, floppy, etc. It gets to the line 
that says RedHat Nash starting, then things go haywire.

Expected results:
the booting should have continued, mounting the root FS on /dev/sda2

Additional info:
This system has the Supermicro P4DC6+ motherboard, with dual Xeon 2.2GHz 
processors, 1024MB ram, the Adaptec 2005s ZCR RAID card, and 4 Quantum SCA 36GB 
hds.  The SCSI disks are set up as a RAID-5 array, with about 105 GB in the 
array. I am partitioning the array as follows:

/dev/sda1    /boot   128 MB
/dev/sda2    /       3000 MB
/dev/sda4    /usr    10000 MB
/dev/sda5    /tmp    1000 MB
/dev/sda6    /home   5000 MB
/dev/sda7    /swap   2048 MB
/dev/sda8    /var    85300 MB
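
As a quick sanity check on the layout above, the sketch below totals the listed partition sizes and computes the nominal RAID-5 capacity of four 36GB disks (one disk's worth of space goes to parity). The helper name `raid5_usable_gb` is hypothetical. The sizes are copied from the table, and the comparison is only approximate because the report does not say whether MB means 10^6 or 2^20 bytes.

```python
# Partition sizes in MB, copied from the table above.
partitions_mb = {
    "/dev/sda1": 128,
    "/dev/sda2": 3000,
    "/dev/sda4": 10000,
    "/dev/sda5": 1000,
    "/dev/sda6": 5000,
    "/dev/sda7": 2048,
    "/dev/sda8": 85300,
}


def raid5_usable_gb(disk_count: int, disk_size_gb: int) -> int:
    """Nominal RAID-5 capacity: one disk's worth of space holds parity."""
    return (disk_count - 1) * disk_size_gb


total_mb = sum(partitions_mb.values())
print(total_mb)               # 106476
print(raid5_usable_gb(4, 36))  # 108
```

The partitions add up to roughly the size of the array, so the layout itself does not overrun the available space.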

Comment 1 Jeremy Katz 2003-03-28 03:37:24 UTC
What's the output of `/sbin/mkinitrd -v -f /tmp/initrd.test 2.4.18-3smp` if you
boot into rescue mode and run it chrooted in /mnt/sysimage?

Comment 2 Kyle Simpson 2003-03-28 18:00:35 UTC
using modules: ./kernel/drivers/scsi/scsi_mod.o ./kernel/drivers/scsi/sd_mod.o 
./kernel/drivers/scsi/dpt_i2o.o

using loopback device /dev/loop1

/sbin/nash -> /tmp/initrd.ZQF4K3/bin/insmod

`/lib/modules/2.4.18-3smp/./kernel/drivers/scsi_mod.o' -> 
`/tmp/initrd.ZQF4K3/lib/scsi_mod.o'

`/lib/modules/2.4.18-3smp/./kernel/drivers/sd_mod.o' -> 
`/tmp/initrd.ZQF4K3/lib/sd_mod.o'

`/lib/modules/2.4.18-3smp/./kernel/drivers/dpt_i2o.o' -> 
`/tmp/initrd.ZQF4K3/lib/dpt_i2o.o'

Loading module scsi_mod
Loading module sd_mod
Loading module dpt_i2o
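
The output above can be checked mechanically for the three modules this root device needs. This is a minimal sketch that assumes the "Loading module <name>" line format shown in this comment; the function names and the `REQUIRED` tuple are hypothetical.

```python
import re
from typing import List, Sequence

# Modules named in the "using modules:" line of the output above.
REQUIRED = ("scsi_mod", "sd_mod", "dpt_i2o")


def loaded_modules(mkinitrd_output: str) -> List[str]:
    """Collect module names from 'Loading module <name>' lines."""
    return re.findall(r"^Loading module (\S+)", mkinitrd_output, flags=re.MULTILINE)


def missing_modules(mkinitrd_output: str, required: Sequence[str] = REQUIRED) -> List[str]:
    """Return the required modules that the output never reports loading."""
    present = set(loaded_modules(mkinitrd_output))
    return [name for name in required if name not in present]


sample = "Loading module scsi_mod\nLoading module sd_mod\nLoading module dpt_i2o\n"
print(missing_modules(sample))  # []
```

An empty result here only says that a freshly built initrd would contain the drivers. It says nothing about the initrd the boot loader is actually using.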

Comment 3 Jeremy Katz 2003-04-02 20:34:05 UTC
Do you see the dpt_i2o module loaded when you boot?  Does it find the drives
correctly?

Comment 4 Kyle Simpson 2003-04-02 21:31:08 UTC
Created attachment 90851
JPG screen shot of the first page show during a normal (non-rescue) boot...

Sorry this one is so blurry, something weird happened in the conversion... I
tried adjusting the colors to make it somewhat possible to make out the words
(don't stare too long, it'll hurt your eyes)... you can make out things like the
"kswapd" and "apm disabled - apm not SMP safe" and the "PIIX4" lines referring
to loading IDE stuff right before it recognizes the CDROM on the IDE bus.

Comment 5 Kyle Simpson 2003-04-02 21:34:44 UTC
Created attachment 90852
JPG screen shot of the next page shown during a normal (non-rescue) boot...

This one is quite a bit clearer (for some weird reason) and you can make out
most of what's happening, including the RAMdisk call and the "md" driver
loading and detecting the RAID arrays and disks, then farther down, RedHat Nash
starts, then VFS mounts root (as ext2), then "Loading scsi_mod module" then it
continues with the Kernel panic error I originally posted.

Comment 6 Kyle Simpson 2003-04-02 21:41:23 UTC
So, based on what I see in those screen shots, as well as what I see on the 
screen in person, I do not see any reference to the "dpt_i2o" driver being 
loaded, but do see references to "scsi_mod" and "md" (which I am wondering 
might be the same as sd_mod)... however, as I mentioned before, during the 
rescue boot and the boot-disk boot, the blue text-GUI screen pops up (shortly 
after the place where the kernel panic happens in the regular boot) and 
refers to "Loading dpt_i2o"...

I figured it would be easier to give you the screen shots (blurry or not) to 
show you where in the boot process the error occurs, because I am not familiar 
enough with it to understand all of what I'm seeing.

Comment 7 Kyle Simpson 2003-04-11 16:29:56 UTC
So it's been 9 days since my last post, and I still haven't heard back from you 
all.  Do you need more information?  What can I do?

Comment 8 Kyle Simpson 2003-05-12 19:54:21 UTC
It's been a month, and I still haven't heard from anyone... what's going on, 
have you all given up on this problem?  What can I do to get a resolution to it?

Comment 9 Jeremy Katz 2003-07-28 23:38:54 UTC
Does the new initrd (/ newer kernel erratas) work any better?

Comment 10 Kyle Simpson 2003-07-29 13:34:31 UTC
Well, about a month ago, we decided (after trying everything else and having no 
luck) to try to install Windows on the machine, and had similar problems with 
its install.  That led us to believe there was some hardware problem.  After 
swapping in and out everything else we could, we were left with trying 
different RAM. We bought 4 256MB chips of RAM (to replace 2 512MB chips and 2 
CRIMMs).  Supermicro indicated that this board would accept up to 512MB chips, 
but the memory vendor we purchased from said that the board should only be used 
with 256MB chips. So we bought those, installed them, and subsequent installs 
of Red Hat and Windows worked flawlessly.

To me this indicates either corrupt RAM chips (which I have no way to test) or 
a flaw in how Supermicro lists the specs for that motherboard. In either case, 
apparently the root of the problem was RAM related, and had nothing to do with 
the OS. I do apologize for mistakenly reporting it to you as a bug.

I wish there was some way for an OS installation (which all along was going 
just fine, and the problem occurred only on boot AFTER a successful install) to 
detect the hardware problem instead of it just waiting for the first boot 
before the problem occurs.  BUT, Microsoft's installations were just hanging in 
the middle, without even a boot, so I guess it's not something I could expect 
of either OS.  <shrugs> Oh well, thanks anyway.

