Bug 479602 - cciss driver broken in rawhide
Summary: cciss driver broken in rawhide
Keywords:
Status: CLOSED DUPLICATE of bug 487358
Alias: None
Product: Fedora
Classification: Fedora
Component: mkinitrd
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Peter Jones
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F11Beta, F11BetaBlocker
 
Reported: 2009-01-11 23:28 UTC by Aron Griffis
Modified: 2009-03-09 09:37 UTC
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-03-09 09:37:08 UTC
Type: ---
Embargoed:


Attachments
rawhide failed boot (32.04 KB, text/plain)
2009-01-11 23:28 UTC, Aron Griffis
f10 successful boot (33.02 KB, text/plain)
2009-01-11 23:28 UTC, Aron Griffis

Description Aron Griffis 2009-01-11 23:28:04 UTC
Created attachment 328685 [details]
rawhide failed boot

Description of problem:
Updated a DL380 and DL385 from F10 to Rawhide.  Now they fail to boot because they can't find the root partition.

Version-Release number of selected component (if applicable):
kernel-2.6.29-0.25.rc0.git14.fc11.x86_64
lvm2-2.02.43-1.fc11.x86_64
mkinitrd-6.0.73-7.fc11.x86_64

How reproducible:
every time

Steps to Reproduce:
1. install F10
2. "yum upgrade" to rawhide
3. reboot
  
Actual results:
see attached rawhide-2.6.29-0.25.rc0.git14.fc11.x86_64.txt

Expected results:
see attached fedora10-2.6.27.5-117.fc10.x86_64.txt

Additional info:
Some info from booting into F10:
$ parted /dev/cciss/c0d0 p
Model: Compaq Smart Array (cpqarray)
Disk /dev/cciss/c0d0: 293GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      32.3kB  1053MB  1053MB  primary   ext3         boot 
 2      1053MB  1579MB  526MB   primary   ext3              
 3      1579MB  2106MB  526MB   primary   ext3              
 4      2106MB  293GB   291GB   extended                    
 5      2106MB  2632MB  526MB   logical   ext3              
 6      2632MB  3159MB  526MB   logical   ext3              
 7      3159MB  3685MB  526MB   logical   ext3              
 8      3685MB  293GB   290GB   logical                lvm  

$ pvs
  PV               VG         Fmt  Attr PSize   PFree  
  /dev/block/104:8 VolGroup00 lvm2 a-   269.88G 240.59G

$ vgs
  VG         #PV #LV #SN Attr   VSize   VFree  
  VolGroup00   1   2   0 wz--n- 269.88G 240.59G

$ lvs
  LV       VG         Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  fedora10 VolGroup00 -wi-ao 19.53G                                      
  swap     VolGroup00 -wi-ao  9.75G

Comment 1 Aron Griffis 2009-01-11 23:28:41 UTC
Created attachment 328686 [details]
f10 successful boot

Comment 2 Aron Griffis 2009-01-27 02:02:40 UTC
still broken as of:
kernel-2.6.29-0.48.rc2.git1.fc11.x86_64
lvm2-2.02.43-1.fc11.x86_64
mkinitrd-6.0.75-1.fc11.x86_64

Comment 3 Aron Griffis 2009-01-27 18:15:01 UTC
I added the following commands to the initrd:

find /dev/cciss
find /dev/mapper
showlabels

Here's the resulting output:

/dev/cciss/c0d0p8
/dev/cciss/c0d0p7
/dev/cciss/c0d0p6
/dev/cciss/c0d0p5
/dev/cciss/c0d0p4
/dev/cciss/c0d0p3
/dev/cciss/c0d0p2
/dev/cciss/c0d0p1
/dev/cciss/c0d0
/dev/mapper/control
a09428de-ba1f-47d1-bf30-80abe6842bbf/dev/cciss/c0d0p2 /boot 6c2d6985-57f1-4f88-9e2a-28800d53fe62
/dev/cciss/c0d0p3 /mnt/c0d0p2 5410d8e7-85e8-4965-a934-2e59caa14c1c 
/dev/cciss/c0d0p6 /mnt/c0d0p6 ddf7782b-ae7b-429b-bf70-c6093b5a5c89
qknw7u-GfUf-2E9L-M3AO-xLYu-qcKi-A5e7kh/dev/cciss/c0d0p5 /boot1 d5ede95b-a337-4038-81d2-bf16c3202534
/dev/cciss/c0d0p7 /mnt/c0d0p5 57c6efcf-2b77-4033-90b4-f0d7b692bcae

So it appears that the cciss driver in the initrd is working correctly.  Next
suspect would be lvm's scanning.

Comment 4 Aron Griffis 2009-01-28 23:50:54 UTC
Okay, setting sysfs_scan = 0 in /etc/lvm/lvm.conf allows the system to boot.  The problem is that lvm scans /sys/class/block to determine the validity of devices.  Here's what it sees on rawhide:

$ /bin/ls -lF /sys/class/block/ 2>&1 | grep cciss
/bin/ls: cannot access /sys/class/block/cciss/c0d0: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p1: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p2: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p3: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p4: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p5: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p6: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p7: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p8: No such file or directory
l????????? ? ?    ?    ?                ? cciss/c0d0
l????????? ? ?    ?    ?                ? cciss/c0d0p1
l????????? ? ?    ?    ?                ? cciss/c0d0p2
l????????? ? ?    ?    ?                ? cciss/c0d0p3
l????????? ? ?    ?    ?                ? cciss/c0d0p4
l????????? ? ?    ?    ?                ? cciss/c0d0p5
l????????? ? ?    ?    ?                ? cciss/c0d0p6
l????????? ? ?    ?    ?                ? cciss/c0d0p7
l????????? ? ?    ?    ?                ? cciss/c0d0p8

and here is what it sees on f10:

$ /bin/ls -lF /sys/class/block/ 2>&1 | grep cciss
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p1 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p1/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p2 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p2/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p3 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p3/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p4 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p4/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p5 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p5/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p6 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p6/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p7 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p7/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p8 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p8/

It appears that the older cciss driver replaced the slashes in the device names with bangs to make the file names valid.  The newer driver gets it wrong and leaves the slashes in place.
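
The name mapping visible in the F10 listing can be sketched in shell. This is only an illustration of the slash-to-bang substitution, not the kernel's or lvm's actual code; `sysfs_block_name` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: map a block device name such as /dev/cciss/c0d0p8 to
# the entry name expected under /sys/class/block, replacing each "/" with
# "!" as seen in the F10 listing.
sysfs_block_name() {
    local dev=${1#/dev/}                # strip a leading /dev/ if present
    printf '%s\n' "$dev" | tr / '!'     # replace every slash with a bang
}

sysfs_block_name /dev/cciss/c0d0p8      # prints cciss!c0d0p8
```

An entry name that still contains a slash cannot be looked up as a single path component, which is consistent with the "No such file or directory" errors in the rawhide listing.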

Comment 5 Doug Chapman 2009-01-29 22:24:10 UTC
I am able to reproduce this even without LVM.  Also I don't think this is limited to cciss.  I think this is the same issue I was running into on big ia64 servers about 6 months ago.

Somehow these kernel config options got turned OFF in Fedora even though their default is 'y':

CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y


When I ran into this back then, we had them turned on in the ia64 config file.  It appears that something in nash requires this in some situations.  Since nash is nearly (ok, completely) impossible to debug, I am not sure where to go with this.

Can we get these options turned back on in the Fedora kernel configs?  They always used to be on and default to on upstream.
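
A quick way to see how these two options are set for a given kernel is sketched below. It assumes the kernel config is installed as /boot/config-<release>; `show_sysfs_deprecated` is a hypothetical helper name, and another config file can be passed as its first argument.

```shell
#!/bin/bash
# Hypothetical helper: print the CONFIG_SYSFS_DEPRECATED and
# CONFIG_SYSFS_DEPRECATED_V2 lines from a kernel config file.
# Assumption: the config for the running kernel lives at
# /boot/config-$(uname -r) unless a path is given.
show_sysfs_deprecated() {
    local config=${1:-/boot/config-$(uname -r)}
    # Matches both "CONFIG_X=y" and "# CONFIG_X is not set".
    grep -E '^(# )?CONFIG_SYSFS_DEPRECATED(_V2)?[= ]' "$config"
}
```

Running `show_sysfs_deprecated` with no arguments inspects the running kernel; a disabled option shows up as a "# ... is not set" line.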

Comment 6 Aron Griffis 2009-01-30 16:02:28 UTC
Doug, I don't think CONFIG_SYSFS_DEPRECATED is related to this bug.  The lvm tools prefer /sys/class/block if it exists, so they'll ignore /sys/block either way.

Rather the problem was corruption in the /sys/class/block directory, seemingly caused by the cciss driver, though it could have been in generic code that the cciss driver calls.  I'm using the past tense because it seems to be fixed now in kernel-2.6.29-0.66.rc3.fc11.x86_64.  Looking at the kernel rpm changelog, I think the fix came from upstream, because there doesn't seem to be anything relevant to this problem applied by Red Hat.

Either way, glad to see it fixed.

Comment 7 Michael Cutler 2009-02-25 17:26:28 UTC
I'm having the very same problem on an HP DL380 G4 server.  Upgrading from FC10 to Rawhide installs the 'kernel-PAE-2.6.29-0.145.rc6.fc11.i686' kernel; after rebooting, the machine cannot find the root partition.

/boot/grub/grub.conf:
#boot=/dev/cciss/c0d0
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Fedora (2.6.29-0.145.rc6.fc11.i686.PAE)
        root (hd0,0)
        kernel /vmlinuz-2.6.29-0.145.rc6.fc11.i686.PAE ro root=/dev/VolGroup00/LogVol00 rhgb quiet
        initrd /initrd-2.6.29-0.145.rc6.fc11.i686.PAE.img
title Fedora (2.6.29-0.145.rc6.fc11.i686.PAE) using UUID instead
        root (hd0,0)
        kernel /vmlinuz-2.6.29-0.145.rc6.fc11.i686.PAE ro root=UUID=099c4b4e-7655-47d7-90d4-7fc9a3bfcc9d rhgb quiet
        initrd /initrd-2.6.29-0.145.rc6.fc11.i686.PAE.img
title Fedora (2.6.27.15-170.2.24.fc10.i686.PAE)
        root (hd0,0)
        kernel /vmlinuz-2.6.27.15-170.2.24.fc10.i686.PAE ro root=UUID=099c4b4e-7655-47d7-90d4-7fc9a3bfcc9d rhgb quiet
        initrd /initrd-2.6.27.15-170.2.24.fc10.i686.PAE.img


I tried two grub configurations.  The first, created by the RPM installation, uses "root=/dev/VolGroup00/LogVol00", and it failed.  Out of curiosity I added a second boot entry that uses the UUID instead, "root=UUID=099c4b4e-7655-47d7-90d4-7fc9a3bfcc9d".  Neither boots the machine.

Comment 8 Michael Cutler 2009-02-25 17:44:25 UTC
I've just repeated this test using 'kernel-PAE-2.6.29-0.137.rc5.git4.fc11.i686' and got the same problem. The boot fails saying:

  Reading all physical volumes. This may take a while...
  Volume group "VolGroup00" not found
Unable to access resume device (/dev/VolGroup00/LogVol01)
mount: could not find filesystem '/dev/root'

Comment 9 Aron Griffis 2009-02-25 18:06:20 UTC
Michael: I suspect your problem is different.  The problem in this bug is that the /sys/class/block/ entries were broken for cciss.  That's fixed now, as mentioned in comment 6.

Rather, I suspect the problem you're facing is that the cciss driver is being omitted from the initrd.  I don't know if there's a bug open for that presently.  Doug?

Comment 10 Milan Broz 2009-02-25 18:07:44 UTC
Updated lvm tools (since 2.02.29) work correctly with both the old and the new
sysfs structure; the bug here is probably wrong device initialization in the
initrd.

I saw a similar problem when testing an upstream kernel with
CONFIG_SYSFS_DEPRECATED_V2 set to off, and I found that mkinitrd has the old
sysfs path hardcoded, which causes the wrong drivers to be chosen for the initrd
(the needed drivers were not put into the initrd at all).

For me this helps (but it is probably unrelated to your problem...):

--- mkinitrd.old        2009-02-10 20:22:35.000000000 +0100
+++ mkinitrd    2009-02-25 18:44:52.000000000 +0100
@@ -331,7 +331,8 @@
         sysfs=$(readlink ${sysfs%/*})
     fi

-    if [[ ! "$sysfs" =~ '^/sys/devices/.*/block/.*$' ]]; then
+#    if [[ ! "$sysfs" =~ '^/sys/devices/.*/block/.*$' ]]; then
+    if [[ ! "$sysfs" =~ '^/sys/block/.*$' ]]; then
         error "WARNING: $sysfs is a not a block sysfs path, skipping"
         return
     fi
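
As a variation on the patch above, a check could accept both path shapes that appear in it instead of only one. The following is a minimal sketch under that assumption, not mkinitrd's actual code; `is_block_sysfs_path` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the argument looks like a block device
# path in either shape seen in the patch: /sys/block/<dev> or
# /sys/devices/.../block/<dev>.
is_block_sysfs_path() {
    # The regex is kept in a variable and used unquoted, because bash 3.2
    # and later treat a quoted right-hand side of =~ as a literal string.
    local re='^/sys/(devices/.*/)?block/.+$'
    [[ $1 =~ $re ]]
}
```

With this helper, `is_block_sysfs_path '/sys/block/cciss!c0d0'` and `is_block_sysfs_path '/sys/devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0'` both succeed, while a path such as `/sys/class/net/eth0` is rejected.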

Comment 11 Michael Cutler 2009-02-25 18:26:57 UTC
Thanks guys, by recreating the initrd manually I have pushed the cciss driver in and the box is now working again on the latest rawhide kernel:

mkinitrd --with=cciss initrd-2.6.29-0.145.rc6.fc11.i686.PAE.img 2.6.29-0.145.rc6.fc11.i686.PAE

Regards,
MC

Comment 12 Hans de Goede 2009-03-09 09:37:08 UTC
I don't believe this is cciss related, but rather is an issue when upgrading from
F-10 to rawhide when using yum (rather than anaconda).

The problem is you are still running an older kernel when mkinitrd gets run. Yum upgrading is unsupported, but don't worry I agree this one is rather bad. So I'll fix it.

I'll dup this to the bug tracking the general issue of running mkinitrd while
an older kernel is running. Note that bug has a patch attached, so if you want you
can give that a shot.
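
The situation described here, building an initrd for one kernel while a different one is running, can be detected with a simple comparison. The sketch below is only an illustration and is not the patch attached to bug 487358; `warn_if_kernel_differs` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: warn when an initrd is about to be built for a kernel
# release other than the running one. Arguments: the target release, then
# (optionally) the running release, which defaults to the output of uname -r.
warn_if_kernel_differs() {
    local target=$1 running=${2:-$(uname -r)}
    if [[ $target != "$running" ]]; then
        echo "WARNING: building initrd for $target while running $running" >&2
        return 1
    fi
}
```

After a "yum upgrade" from F10, a call such as `warn_if_kernel_differs 2.6.29-0.145.rc6.fc11.i686.PAE` would print the warning until the machine is rebooted into the new kernel.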

*** This bug has been marked as a duplicate of bug 487358 ***

