Created attachment 328685 [details]
rawhide failed boot

Description of problem:
Updated a DL380 and DL385 from F10 to Rawhide. Now they fail to boot because they can't find the root partition.

Version-Release number of selected component (if applicable):
kernel-2.6.29-0.25.rc0.git14.fc11.x86_64
lvm2-2.02.43-1.fc11.x86_64
mkinitrd-6.0.73-7.fc11.x86_64

How reproducible:
every time

Steps to Reproduce:
1. install F10
2. "yum upgrade" to rawhide
3. reboot

Actual results:
see attached rawhide-2.6.29-0.25.rc0.git14.fc11.x86_64.txt

Expected results:
see attached fedora10-2.6.27.5-117.fc10.x86_64.txt

Additional info:
Some info from booting into F10:

$ parted /dev/cciss/c0d0 p
Model: Compaq Smart Array (cpqarray)
Disk /dev/cciss/c0d0: 293GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      32.3kB  1053MB  1053MB  primary   ext3         boot
 2      1053MB  1579MB  526MB   primary   ext3
 3      1579MB  2106MB  526MB   primary   ext3
 4      2106MB  293GB   291GB   extended
 5      2106MB  2632MB  526MB   logical   ext3
 6      2632MB  3159MB  526MB   logical   ext3
 7      3159MB  3685MB  526MB   logical   ext3
 8      3685MB  293GB   290GB   logical                lvm

$ pvs
  PV               VG         Fmt  Attr PSize   PFree
  /dev/block/104:8 VolGroup00 lvm2 a-   269.88G 240.59G
$ vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  VolGroup00   1   2   0 wz--n- 269.88G 240.59G
$ lvs
  LV       VG         Attr   LSize  Origin Snap% Move Log Copy% Convert
  fedora10 VolGroup00 -wi-ao 19.53G
  swap     VolGroup00 -wi-ao  9.75G
Created attachment 328686 [details] f10 successful boot
still broken as of: kernel-2.6.29-0.48.rc2.git1.fc11.x86_64 lvm2-2.02.43-1.fc11.x86_64 mkinitrd-6.0.75-1.fc11.x86_64
I added the following commands to the initrd:

find /dev/cciss
find /dev/mapper
showlabels

Here's the resulting output:

/dev/cciss/c0d0p8
/dev/cciss/c0d0p7
/dev/cciss/c0d0p6
/dev/cciss/c0d0p5
/dev/cciss/c0d0p4
/dev/cciss/c0d0p3
/dev/cciss/c0d0p2
/dev/cciss/c0d0p1
/dev/cciss/c0d0
/dev/mapper/control
a09428de-ba1f-47d1-bf30-80abe6842bbf
/dev/cciss/c0d0p2 /boot 6c2d6985-57f1-4f88-9e2a-28800d53fe62
/dev/cciss/c0d0p3 /mnt/c0d0p2 5410d8e7-85e8-4965-a934-2e59caa14c1c
/dev/cciss/c0d0p6 /mnt/c0d0p6 ddf7782b-ae7b-429b-bf70-c6093b5a5c89
qknw7u-GfUf-2E9L-M3AO-xLYu-qcKi-A5e7kh
/dev/cciss/c0d0p5 /boot1 d5ede95b-a337-4038-81d2-bf16c3202534
/dev/cciss/c0d0p7 /mnt/c0d0p5 57c6efcf-2b77-4033-90b4-f0d7b692bcae

So it appears that the cciss driver in the initrd is working correctly. The next suspect would be lvm's scanning.
Okay, setting sysfs_scan = 0 in /etc/lvm/lvm.conf allows the system to boot. The problem is that lvm scans /sys/class/block to determine the validity of devices. Here's what it sees on rawhide:

$ /bin/ls -lF /sys/class/block/ 2>&1 | grep cciss
/bin/ls: cannot access /sys/class/block/cciss/c0d0: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p1: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p2: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p3: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p4: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p5: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p6: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p7: No such file or directory
/bin/ls: cannot access /sys/class/block/cciss/c0d0p8: No such file or directory
l????????? ? ? ? ? ? cciss/c0d0
l????????? ? ? ? ? ? cciss/c0d0p1
l????????? ? ? ? ? ? cciss/c0d0p2
l????????? ? ? ? ? ? cciss/c0d0p3
l????????? ? ? ? ? ? cciss/c0d0p4
l????????? ? ? ? ? ? cciss/c0d0p5
l????????? ? ? ? ? ? cciss/c0d0p6
l????????? ? ? ? ? ? cciss/c0d0p7
l????????? ? ? ? ? ? cciss/c0d0p8

and here is what it sees on f10:

$ /bin/ls -lF /sys/class/block/ 2>&1 | grep cciss
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p1 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p1/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p2 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p2/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p3 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p3/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p4 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p4/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p5 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p5/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p6 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p6/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p7 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p7/
lrwxrwxrwx 1 root root 0 2009-01-13 18:39 cciss!c0d0p8 -> ../../devices/pci0000:40/0000:40:10.0/0000:46:00.0/block/cciss!c0d0/cciss!c0d0p8/

It appears that the older cciss driver replaced the slashes with bangs to make the file names valid. The newer driver gets it wrong.
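In case it helps anyone check another box, here is a minimal sketch of the naming convention visible in the F10 listing: a device name with a slash in it appears under /sys/class/block with the slash replaced by a bang. The function names and the optional directory argument are made up for illustration; this is not how lvm itself performs the lookup.

```shell
#!/bin/bash
# Illustrative helpers (hypothetical names, not taken from lvm or the kernel).

# Turn a name relative to /dev (e.g. "cciss/c0d0p1") into the name expected
# under /sys/class/block, where each "/" is replaced by "!".
dev_to_sysfs_name() {
    printf '%s\n' "$1" | tr '/' '!'
}

# Succeed if a sysfs entry for the given device name exists and resolves.
# The second argument defaults to /sys/class/block and only exists so the
# check can be pointed at a different directory.
sysfs_entry_exists() {
    local dir=${2:-/sys/class/block}
    [ -e "$dir/$(dev_to_sysfs_name "$1")" ]
}
```

For example, `sysfs_entry_exists cciss/c0d0p8 || echo "no usable sysfs entry for cciss/c0d0p8"` would be expected to print the message on a kernel showing the broken entries above.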
I am able to reproduce this even without LVM. Also, I don't think this is limited to cciss. I think this is the same issue I was running into on big ia64 servers about 6 months ago. Somehow these kernel config options got turned OFF in Fedora even though their default is 'y':

CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y

When I ran into this back then, we had them turned on in the ia64 config file. It appears that something in nash requires this in some situations. Since nash is nearly (ok, completely) impossible to debug, I am not sure where to go with this. Can we get these options turned back on in the Fedora kernel configs? They always used to be on, and they default to on upstream.
Doug, I don't think CONFIG_SYSFS_DEPRECATED is related to this bug. The lvm tools prefer /sys/class/block if it exists, so they'll ignore /sys/block either way. Rather the problem was corruption in the /sys/class/block directory, seemingly caused by the cciss driver, though it could have been in generic code that the cciss driver calls. I'm using the past tense because it seems to be fixed now in kernel-2.6.29-0.66.rc3.fc11.x86_64. Looking at the kernel rpm changelog, I think the fix came from upstream, because there doesn't seem to be anything relevant to this problem applied by Red Hat. Either way, glad to see it fixed.
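The preference described above can be sketched roughly as follows. This only illustrates "use /sys/class/block if it exists, otherwise /sys/block"; the function name and the optional mount-point argument are assumptions, and the real lvm code is C and may differ in detail.

```shell
#!/bin/bash
# Print the directory a tool following that preference would scan.
# $1 is the sysfs mount point (hypothetical parameter, defaults to /sys).
pick_sysfs_block_dir() {
    local sysfs=${1:-/sys}
    if [ -d "$sysfs/class/block" ]; then
        printf '%s\n' "$sysfs/class/block"
    else
        printf '%s\n' "$sysfs/block"
    fi
}
```

With this logic, the presence or absence of /sys/block (which is what CONFIG_SYSFS_DEPRECATED affects) makes no difference as long as /sys/class/block exists.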
I'm having the very same problem on an HP DL380 G4 server. Upgrading from FC10 to Rawhide, I get the 'kernel-PAE-2.6.29-0.145.rc6.fc11.i686' kernel, reboot the machine, and it cannot find the root partition.

/boot/grub/grub.conf:

#boot=/dev/cciss/c0d0
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Fedora (2.6.29-0.145.rc6.fc11.i686.PAE)
        root (hd0,0)
        kernel /vmlinuz-2.6.29-0.145.rc6.fc11.i686.PAE ro root=/dev/VolGroup00/LogVol00 rhgb quiet
        initrd /initrd-2.6.29-0.145.rc6.fc11.i686.PAE.img
title Fedora (2.6.29-0.145.rc6.fc11.i686.PAE) using UUID instead
        root (hd0,0)
        kernel /vmlinuz-2.6.29-0.145.rc6.fc11.i686.PAE ro root=UUID=099c4b4e-7655-47d7-90d4-7fc9a3bfcc9d rhgb quiet
        initrd /initrd-2.6.29-0.145.rc6.fc11.i686.PAE.img
title Fedora (2.6.27.15-170.2.24.fc10.i686.PAE)
        root (hd0,0)
        kernel /vmlinuz-2.6.27.15-170.2.24.fc10.i686.PAE ro root=UUID=099c4b4e-7655-47d7-90d4-7fc9a3bfcc9d rhgb quiet
        initrd /initrd-2.6.27.15-170.2.24.fc10.i686.PAE.img

I tried two grub entries for the new kernel. The first, created by the RPM installation, uses "root=/dev/VolGroup00/LogVol00", and it failed. Just out of curiosity, I created a second boot entry using the UUID instead, "root=UUID=099c4b4e-7655-47d7-90d4-7fc9a3bfcc9d". Neither boots the machine.
I've just repeated this test using 'kernel-PAE-2.6.29-0.137.rc5.git4.fc11.i686' and got the same problem. The boot fails saying:

Reading all physical volumes. This may take a while...
Volume group "VolGroup00" not found
Unable to access resume device (/dev/VolGroup00/LogVol01)
mount: could not find filesystem '/dev/root'
Michael: I suspect your problem is different. The problem in this bug is that the /sys/class/block/ entries were broken for cciss. That's fixed now, as mentioned in comment 6. Rather, I suspect the problem you're facing is that the cciss driver is being omitted from the initrd. I don't know if there's a bug open for that presently. Doug?
Updated lvm tools (since 2.02.29) work correctly with both the old and the new sysfs structure; the bug here is probably wrong device initialization in the initrd. I saw a similar problem when testing an upstream kernel with CONFIG_SYSFS_DEPRECATED_V2 set to off, and I found that mkinitrd has a hardcoded old sys path, which causes the wrong drivers to be selected for the initrd... (and the drivers were not put into the initrd at all). For me this helps (but it is probably unrelated to your problem...):

--- mkinitrd.old 2009-02-10 20:22:35.000000000 +0100
+++ mkinitrd 2009-02-25 18:44:52.000000000 +0100
@@ -331,7 +331,8 @@
         sysfs=$(readlink ${sysfs%/*})
     fi
 
-    if [[ ! "$sysfs" =~ '^/sys/devices/.*/block/.*$' ]]; then
+#    if [[ ! "$sysfs" =~ '^/sys/devices/.*/block/.*$' ]]; then
+    if [[ ! "$sysfs" =~ '^/sys/block/.*$' ]]; then
         error "WARNING: $sysfs is a not a block sysfs path, skipping"
         return
     fi
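Rather than swapping one pattern for the other, the check could in principle accept both forms. The following is only a sketch under that assumption, not a tested mkinitrd change; the function and variable names are hypothetical, and the two patterns are the ones from the diff above. The patterns are kept in variables and used unquoted on the right of =~ so that bash treats them as regular expressions.

```shell
#!/bin/bash
# Succeed if the path looks like a block device path in either sysfs layout.
is_block_sysfs_path() {
    local path=$1
    local devices_layout='^/sys/devices/.*/block/.*$'
    local flat_layout='^/sys/block/.*$'
    [[ $path =~ $devices_layout || $path =~ $flat_layout ]]
}
```

The mkinitrd test would then read something like `if ! is_block_sysfs_path "$sysfs"; then ... fi`.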
Thanks guys. By recreating the initrd manually I have pushed the cciss driver in, and the box is now working again on the latest rawhide kernel:

mkinitrd --with=cciss initrd-2.6.29-0.145.rc6.fc11.i686.PAE.img 2.6.29-0.145.rc6.fc11.i686.PAE

Regards, MC
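To confirm whether a driver actually made it into an initrd, one option is to look through the archive's file listing. A minimal sketch, assuming the listing is supplied on standard input with one path per line; the function name is hypothetical.

```shell
#!/bin/bash
# Read a file listing on stdin and succeed if it contains a kernel module
# file for the given module name (e.g. ".../cciss.ko", optionally with a
# compression suffix such as ".gz").
listing_has_module() {
    grep -Eq "(^|/)$1\.ko(\.[a-z0-9]+)?\$"
}
```

Assuming the image is a gzip-compressed cpio archive, the listing could come from something like `zcat initrd-2.6.29-0.145.rc6.fc11.i686.PAE.img | cpio -it | listing_has_module cciss && echo "cciss present"`.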
I don't believe this is cciss related, but rather is an issue when upgrading from F-10 to rawhide using yum (rather than anaconda). The problem is that you are still running an older kernel when mkinitrd gets run. Yum upgrading is unsupported, but don't worry, I agree this one is rather bad, so I'll fix it. I'll dup this to the bug tracking the general issue of running mkinitrd while running an older kernel. Note that bug has a patch attached, so if you want you can give that a shot.

*** This bug has been marked as a duplicate of bug 487358 ***
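Until that is fixed, a quick sanity check before rebuilding an initrd by hand is to compare the kernel the initrd is for with the one currently running. This is only a sketch of that comparison; the function name is hypothetical, and it is not taken from the patch on the other bug.

```shell
#!/bin/bash
# Succeed if the kernel version the initrd is being built for ($1) is the
# one currently running. The optional second argument overrides the running
# version (it defaults to the output of uname -r).
initrd_target_is_running() {
    local target=$1
    local running=${2:-$(uname -r)}
    [ "$target" = "$running" ]
}
```

For example: `initrd_target_is_running 2.6.29-0.145.rc6.fc11.i686.PAE || echo "warning: building initrd for a kernel that is not running" >&2`.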