Bug 466071
Summary: | VolGroup00 not found -- since 2.6.27-0.391-rc8.git7.fc10.i686 | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Charlie Moschel <fred99> |
Component: | mkinitrd | Assignee: | Peter Jones <pjones> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | rawhide | CC: | akataria, dcantrell, eparis, erik-fedora, gerwinkrist, jeff, jlaska, katzj, kernel-maint, liblit, mhw, mishu, pavel1r, pjones, wally, wtogami |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | mkinitrd-6.0.70 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2008-11-02 22:12:13 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 438944 | ||
Attachments: |
Me too.. 32-bit host (Fedora 9), 32-bit guest, VMware Workstation version 6.0.4, build 93057. Here is a transcription of my boot log (please excuse any typos; I transcribed this into a text document). 2.6.27-0.377.rc8.git1.fc10.i686 boots fine, but booting fails with the following kernels: 2.6.27-0.392.rc8.git7.fc10.i686 and 2.6.27-0.398.rc9:

Reading all physical volumes. This may take a while..
Activating logical volumes
Volume group "VolGroup00" not found
Unable to access resume device (UUID=xxxx)
Creating root device.
Mounting root filesystem.
mount: error mounting /dev/root on /sysroot as ext3: No such file or directory
Setting up other filesystems.
setuproot: moving /dev failed: No such file or directory
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: No such file or directory
Mount failed for selinuxfs on /selinux: No such file or directory
Switching to new root and running init.
switchroot: mount failed: No such file or directory
Booting has failed.

Can you upload your bootloader config file (/etc/grub.conf)?

Sure, just a few notes on it.. I have to put acpi=off or the guest clock runs at a crazy speed, about 6x real time, making it impossible to type or reliably do anything. I tried about everything, trust me.. ;) this was the best solution. Now it runs a little slow, but it works. I have of course removed it while investigating this problem, and the problem remains. I've also removed quiet and rhgb, otherwise I wouldn't see the errors at all. The UUID is representative of the boot text above; I just didn't want to type the whole thing into a text document. :)

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE: You have a /boot partition. This means that
# all kernel and initrd paths are relative to /boot/, eg.
# root (hd0,0)
# kernel /vmlinuz-version ro root=/dev/VolGroup00/LogVol00
# initrd /initrd-version.img
#boot=/dev/sda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Fedora (2.6.27-0.398.rc9.fc10.i686)
	root (hd0,0)
	kernel /vmlinuz-2.6.27-0.398.rc9.fc10.i686 ro root=UUID=20fff10d-354a-4804-beca-420e14862002 rhgb quiet acpi=off
	initrd /initrd-2.6.27-0.398.rc9.fc10.i686.img
title Fedora (2.6.27-0.392.rc8.git7.fc10.i686)
	root (hd0,0)
	kernel /vmlinuz-2.6.27-0.392.rc8.git7.fc10.i686 ro root=UUID=20fff10d-354a-4804-beca-420e14862002 rhgb quiet acpi=off
	initrd /initrd-2.6.27-0.392.rc8.git7.fc10.i686.img
title Fedora (2.6.27-0.377.rc8.git1.fc10.i686)
	root (hd0,0)
	kernel /vmlinuz-2.6.27-0.377.rc8.git1.fc10.i686 ro root=UUID=20fff10d-354a-4804-beca-420e14862002 rhgb quiet acpi=off
	initrd /initrd-2.6.27-0.377.rc8.git1.fc10.i686.img

Can you perhaps also include the output of the command blkid? That will help ensure that a partition uses the UUID noted above. From there ... we might want to inspect which packages were installed recently. The older kernel (377) still boots, right? So something installed later may have contributed to your system failing to boot. Are the initrd images drastically different between the booting initrd and the non-booting initrd? You can try ...

$ zcat /boot/initrd-2.6.27-0.398.rc9.fc10.i686.img | cpio -it | sed 's|modules/[^/]*|modules/VERSION|' > /tmp/new.lst
$ zcat /boot/initrd-2.6.27-0.377.rc8.git1.fc10.i686.img | cpio -it | sed 's|modules/[^/]*|modules/VERSION|' > /tmp/old.lst
$ diff -U0 /tmp/new.lst /tmp/old.lst

That should highlight any major differences between your working initrd and the non-working initrd. Also ... if you take the "quiet" boot option out ... is there more verbosity that might pinpoint the root cause?

Sure, no problem. I've taken out quiet and rhgb from the start, but I haven't tried the new kernel option (is it noisy? I heard it a few weeks ago but can't find it now..)
The following is output from the (working) 352.rc7 kernel:

# blkid
/dev/sda1: LABEL="/boot" UUID="419e9ac4-5d05-474b-992c-71a0eb256767" TYPE="ext3" SEC_TYPE="ext2"
/dev/sda2: UUID="6j9U4i-bh5G-kEZV-CiZX-c1aY-iPGD-fVsoWV" TYPE="lvm2pv"
/dev/mapper/VolGroup00-LogVol00: UUID="20fff10d-354a-4804-beca-420e14862002" TYPE="ext3" SEC_TYPE="ext2"
/dev/mapper/VolGroup00-LogVol01: TYPE="swap" UUID="94ae151e-4827-4929-b005-51e47622b25c"
/dev/VolGroup00/LogVol00: UUID="20fff10d-354a-4804-beca-420e14862002" TYPE="ext3"
/dev/VolGroup00/LogVol01: TYPE="swap" UUID="94ae151e-4827-4929-b005-51e47622b25c"

Diff of output from your suggested commands:

--- initrd.new	2008-10-07 12:21:50.000000000 -0400
+++ initrd.old	2008-10-07 12:21:53.000000000 -0400
@@ -14,0 +15 @@
+lib/modules/VERSION/dm-zero.ko
@@ -16,0 +18 @@
+lib/modules/VERSION/dm-log.ko
@@ -19,0 +22 @@
+lib/modules/VERSION/dm-snapshot.ko
@@ -25,0 +29 @@
+lib/modules/VERSION/dm-mirror.ko
@@ -66 +70 @@
-etc/ld.so.conf.d/kernel-2.6.27-0.398.rc9.fc10.i686.conf
+etc/ld.so.conf.d/kernel-2.6.27-0.372.rc8.fc10.i686.conf
@@ -68 +72 @@
-etc/ld.so.conf.d/kernel-2.6.27-0.392.rc8.git7.fc10.i686.conf
+etc/ld.so.conf.d/kernel-2.6.27-0.352.rc7.git1.fc10.i686.conf

I forgot to mention: yes, the old 352.rc7 kernel boots and everything seems fine. I just can't boot past the console logs in my first comment on the later kernels.

If I'm reading the output correctly, that means that somehow the dm-*.ko modules are present in your old initrd, but not your new initrd. Without those modules, I don't think LVM will work properly, as indicated by your boot failures. If you boot into the old kernel ... can you rebuild the initrd?

$ /sbin/mkinitrd -v -f /boot/initrd-2.6.27-0.398.rc9.fc10.i686.img 2.6.27-0.398.rc9.fc10.i686

Any warning/error messages then? If you repeat the previous diff'ing commands, do you still see the missing dm-*.ko modules?

No warnings..
Here's the new diff; slightly different, but it looks like the dm-* modules are the same, and the kernel still won't boot. I've tried the noisy option, which does seem to output more messages overall, but nothing more in relation to the mounting of the filesystems, etc.. Same output as comment #1.

--- initrd.new	2008-10-07 12:32:25.000000000 -0400
+++ initrd.old	2008-10-07 12:32:16.000000000 -0400
@@ -13 +12,0 @@
-lib/libply.so.1
@@ -15,0 +15 @@
+lib/modules/VERSION/dm-zero.ko
@@ -17,0 +18 @@
+lib/modules/VERSION/dm-log.ko
@@ -20,0 +22 @@
+lib/modules/VERSION/dm-snapshot.ko
@@ -26,0 +29 @@
+lib/modules/VERSION/dm-mirror.ko
@@ -45 +47,0 @@
-lib/libply.so.1.0.0
@@ -68 +70 @@
-etc/ld.so.conf.d/kernel-2.6.27-0.398.rc9.fc10.i686.conf
+etc/ld.so.conf.d/kernel-2.6.27-0.372.rc8.fc10.i686.conf
@@ -70 +72 @@
-etc/ld.so.conf.d/kernel-2.6.27-0.392.rc8.git7.fc10.i686.conf
+etc/ld.so.conf.d/kernel-2.6.27-0.352.rc7.git1.fc10.i686.conf
@@ -78,0 +81 @@
+usr/lib/libnash.so.6.0.64
@@ -80,0 +84 @@
+usr/lib/libply.so.1
@@ -82 +85,0 @@
-usr/lib/libnash.so.6.0.65
@@ -87,0 +91,2 @@
+usr/lib/libbdevid.so.6.0.64
+usr/lib/libply.so.1.0.0
@@ -96 +100,0 @@
-usr/lib/libbdevid.so.6.0.65
@@ -139,0 +144,2 @@
+usr/bin
+usr/bin/plymouth
@@ -172 +177,0 @@
-bin/plymouth

(In reply to comment #7)
> If I'm reading the output correctly, that means that somehow the dm-*.ko
> modules are present in your old initrd, but not your new initrd. Without
> those modules, I don't think LVM will work properly, as indicated by your
> boot failures.

Those are built in to the kernel now.

Yes, as I wrote in the 'Additional Info' of the original report, the dm-zero, dm-snapshot, and dm-mirror config options were changed from building as modules to building as built-ins. You can confirm this by diff'ing the /boot/*config files between a working version and a failing version. So, AFAIK, the modules are not expected to be in the initrd, since they are built in. But they don't work as built-ins.
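As an aside, the listing-normalization step used in the diffs above can be sanity-checked without touching real initrd images. This sketch runs only the sed-and-diff half of the recipe on canned listing data; the /tmp file names and module paths here are invented examples, not taken from the reporter's system:

```shell
# Demonstrate the listing-normalization trick from the comments above:
# collapse the per-kernel modules directory to the literal token VERSION
# so listings from two different kernel versions become comparable.
# The listing contents below are invented for illustration.
printf '%s\n' \
    'init' \
    'lib/modules/2.6.27-0.377.rc8.git1.fc10.i686/dm-zero.ko' \
    'lib/modules/2.6.27-0.377.rc8.git1.fc10.i686/dm-mirror.ko' \
    > /tmp/old.raw
printf '%s\n' \
    'init' \
    'lib/modules/2.6.27-0.398.rc9.fc10.i686/scsi_wait_scan.ko' \
    > /tmp/new.raw

# Same sed expression as in the comments above.
sed 's|modules/[^/]*|modules/VERSION|' /tmp/old.raw | sort > /tmp/old.lst
sed 's|modules/[^/]*|modules/VERSION|' /tmp/new.raw | sort > /tmp/new.lst

# Lines prefixed "+" are present only in the old (working) listing.
diff -U0 /tmp/new.lst /tmp/old.lst || true   # diff exits 1 when files differ
```

With real images, replace the two printf blocks with `zcat <image> | cpio -it` as shown in the comments.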
I don't remember seeing a dm-log module or built-in. Could the boot failure be due to the link order of the built-in dm-* drivers? Or perhaps to dm-log missing as a built-in? I will post the further requested info when I'm back at that machine, in about 5 hours. I can say that my UUIDs in /etc/grub.conf are the same for working and non-working kernels. Also, the original attachment (diff of working/failing console logs) was taken without rhgb and without quiet kernel command line options. https://bugzilla.redhat.com/attachment.cgi?id=319721

We used to carry a patch up until F9 that worked around a bug in VMware's emulation of MPT Fusion SCSI adaptors. I dropped that during F10 development in the hope that VMware had gotten around to fixing it. I bet if I re-add that workaround this will start working again. Should be in tomorrow's build.

Does it work if you switch VMware to use the BusLogic emulation? Also, is there an updated version of VMware you can try?

(In reply to comment #11)
> we used to carry a patch up until F9 that worked around a bug in VMware's
> emulation of MPT Fusion SCSI adaptors.

IIRC, that prevented the disk from being detected. In this case, sda is detected, but VolGroup00 is not.

> I bet if I re-add that workaround this will start working again.
> Should be in tomorrow's build.

I will try as soon as it appears, or sooner from koji.

(In reply to comment #12)
> does it work if you switch VMware to use the BusLogic emulation?

Will try that ..

> also, is there an updated version of VMware you can try?

No, this is the latest Workstation build, ver 6.5.0 (not a beta)

OOOPS! I just realized my VMware Workstation was VERY old... I have updated to the latest version (6.5.0, build 118166) and the booting problem goes away. And my crazy clock problem goes away too! Hooray!
(In reply to comment #14)
> (In reply to comment #12)
> > does it work if you switch VMware to use the BusLogic emulation?
>
> Will try that ..

OK, this is odd. I rebuilt the initrd for both 2.6.27-0.382 and 2.6.27-0.398, adding --with=BusLogic. No errors reported. Shut down the VM, changed to BusLogic SCSI emulation, rebooted. Results:

2.6.27-0.398 fails with buslogic
2.6.27-0.382 also fails with buslogic (!? with dm-* as modules!)
2.6.27-0.398 (still) fails with lsilogic
2.6.27-0.382 (still) *works* with lsilogic (with dm-* as modules)

/dev/sda is detected in all cases, but VolGroup00 is not found:

Loading BusLogic module
pci 0000:00:10.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
scsi: ***** BusLogic SCSI Driver Version 2.1.16 of 18 July 2002 *****
scsi: Copyright 1995-1998 by Leonard N. Zubkoff <lnz>
scsi2: Configuring BusLogic Model BT-958 PCI Wide Ultra SCSI Host Adapter
scsi2: Firmware Version: 5.07B, I/O Address: 0x10C0, IRQ Channel: 17/Level
scsi2: PCI Bus: 0, Device: 16, Address: 0xD8800000, Host Adapter SCSI ID: 7
scsi2: Parity Checking: Enabled, Extended Translation: Enabled
scsi2: Synchronous Negotiation: Ultra, Wide Negotiation: Enabled
scsi2: Disconnect/Reconnect: Enabled, Tagged Queuing: Enabled
scsi2: Scatter/Gather Limit: 128 of 8192 segments, Mailboxes: 211
scsi2: Driver Queue Depth: 211, Host Adapter Queue Depth: 192
scsi2: Tagged Queue Depth: Automatic, Untagged Queue Depth: 3
scsi2: *** BusLogic BT-958 Initialized Successfully ***
scsi2 : BusLogic BT-958
scsi 2:0:0:0: Direct-Access VMware, VMware Virtual S 1.0 PQ: 0 ANSI: 2
Making device-mapper control node
Scanning logical volumes
sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Cache data unavailable
sd 2:0:0:0: [sda] Assuming drive cache: write through
sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Cache data unavailable
sd 2:0:0:0: [sda] Assuming drive cache: write through
sda: sda1 sda2
sd 2:0:0:0: [sda] Attached SCSI disk
sd 2:0:0:0: Attached scsi generic sg1 type 0
Reading all physical volumes. This may take a while...
Activating logical volumes
Volume group "VolGroup00" not found
Creating root device.
Mounting root filesystem.
mount: error mounting /dev/root on /sysroot as ext3: No such file or directory

> > also, is there an updated version of VMware you can try?
>
> No, this is the latest Workstation build, ver 6.5.0 (not a beta)

I have console boot logs for each case ... let me know if I should attach them, or what other (emulated) hardware info you need.

(In reply to comment #11)
> we used to carry a patch up until F9 that worked around a bug in VMware's
> emulation of MPT Fusion SCSI adaptors. I dropped that during F10 development
> in the hope that VMware had gotten around to fixing it.

If you mean https://bugzilla.redhat.com/show_bug.cgi?id=230703 then this was fixed in VMware Workstation 6.0 in late 2007. Not sure when you dropped it in F10, but I haven't had any VMware SCSI problems during F10 development -- until now.

(In reply to comment #13)
> I bet if I re-add that workaround this will start working again.
> Should be in tomorrow's build.
>
> I will try as soon as it appears, or sooner from koji.

Sorry, I pulled kernel-2.6.27-0.408.rc9.git1.fc10.i686.rpm (and firmware) from koji, but it still fails with both lsilogic and buslogic scsi.

ok. thanks for testing that theory.
I guess we can drop that patch again, given your remarks in comment #17. Good to know VMware fixed that. As to what's causing your bug.. back to the drawing board. No idea right now.

I'll chime in to say that I'm seeing the same problem. My VMware virtual machine started as a fresh install of the Fedora 10 beta: <http://www.thoughtpolice.co.uk/vmware/#fedora10beta>. This initially came with kernel-2.6.27-0.352.rc7.git1.fc10.i686, and that kernel boots just fine. However, the updated kernel-2.6.27-0.398.rc9.fc10.i686 from rawhide fails to boot, with output like that given in comment #1.

(In reply to comment #20)
> I'll chime in to say that I'm seeing the same problem.

OK, thanks, you saved me a re-install to test the theory that this was related to the age of my initial install.

I rebuilt the 2.6.27-0.398 kernel, changing CONFIG_DM_SNAPSHOT, CONFIG_DM_MIRROR, and CONFIG_DM_ZERO from built-ins back to modules. Essentially this reverted:

* Fri Oct 03 2008 Dave Jones <davej>
- Demodularise some of the devicemapper modules that always get loaded.

My rebuilt kernel boots fine with these as modules, yet fails with them as built-ins. No idea why, and open for suggestions.

Any chance you can give us a more complete boot log?

I'm not sure who you mean by "you" in comment #22. Do you want a more complete boot log from me, regarding my configuration in comment #20? Or a more complete log from Charlie Moschel, regarding his configuration in comment #21?

Created attachment 320070 [details]
kernel 0.398 not booting; DM-* built in
Created attachment 320071 [details]
Same kernel, rebuilt with DM-* as modules, boots OK
Just out of curiosity, how did you (comment #24) capture the kernel messages when booting failed? I assumed I'd have to transcribe by hand, but I doubt you did that.

(In reply to comment #22)
> Any chance you can give us a more complete boot log?

Sure, two uploaded as attachments. Both are 2.6.27-0.398.rc9.fc10.i686, one that fails and one that works. (Sorry, looks like some ^M s slipped in there ...). I will try to reproduce this on real HW this weekend; so far it looks like it's only been reported by VMware users.

(In reply to comment #26)
> Just out of curiosity, how did you (comment #24) capture the kernel messages
> when booting failed?

In VMware, add a serial port to the VM if you haven't already. Then choose to connect that port to a file (other options are to use the real port on the host or to use a network socket). You have to name the file. When booting Linux, interrupt the normal boot and add "console=ttyS0,19200n81" to the command line. You won't see the normal stream of boot messages; they are going to the file instead. These logs are without "quiet" and without rhgb.

regards, ... Charlie

I gave this FC kernel, version "2.6.27-1.fc10.i686", a spin on my VM and saw a similar crash to what you guys have mentioned. I then compiled and booted mainline 2.6.27 on this VM without a problem. Yes, the DM modules were built into the kernel. FWIW, I have been regularly booting the 27-rc and Ingo's tip kernels on both 32 and 64 bit without any problem for some time (with the same config, dm built in). This is just to highlight that I too am utterly confused after seeing this crash. Just one question: has anybody seen a similar crash with 64-bit VMs, or is this limited to 32-bit? Anyway, I will try to debug this more over the weekend and get back with any info that I have.
Thanks, Alok

(In reply to comment #29)
> Just one question: has anybody seen a similar crash with 64-bit VMs, or is
> this limited to 32-bit?

I don't use any 64-bit VMs, so I can't say. But right now I'm grabbing both 32-bit and 64-bit F-10 snapshots announced a few minutes ago. I'll test a 64-bit VM and let you know.

I suspect it is somehow related to ordering. As a layman looking at a diff of the console logs, this is the rough order of messages.

Working:

scsi2 : ioc0: LSI53C1030 B0, FwRev=00000000h, Ports=1, MaxQ=128, IRQ=17
* Loading dm-mirror module
<scsi disk sda probing here>
* Loading dm-zero module
<scsi sda partitions listed here>
* Loading dm-snapshot module
Making device-mapper control node
Scanning logical volumes
Reading all physical volumes. This may take a while...
Found volume group "VolGroup00" using metadata type lvm2
Activating logical volumes
2 logical volume(s) in volume group "VolGroup00" now active

But failing gives a different order:

scsi2 : ioc0: LSI53C1030 B0, FwRev=00000000h, Ports=1, MaxQ=128, IRQ=17
* Making device-mapper control node
Scanning logical volumes
Reading all physical volumes. This may take a while...
<scsi disk sda probing here>
* Activating Logical Volumes
<scsi disk sda partitions listed here>
* Volume group "VolGroup00" not found

It seems to start scanning before probing of sda & the partition table is done, but I'm grasping at straws here.
> Just one question, has anybody seen a similar crash with 64bit VM's or is this
> only limited to 32bit.
64bit crashes exactly the same way.
Tested with F10-snap1-x86_64 (live cd installed to disk). Note that if you boot the live CD again after installing to the disk, the LVM partitions on disk can be mounted just fine with that kernel. It is just the detection during boot that fails.
The VM is a version 6.5 VM, if that matters.
Issues regarding VMware drivers are very interesting.

I've been looking into this, and I see where this has created a race condition. The changes from 2.6.26 to 2.6.27 have resulted in insmod returning from loading the SCSI drivers before the drivers have settled. This is with the VMware emulated devices, but this is NOT using VMware Tools or the VMware drivers in the guest. In early 2.6.27-rc kernels the dm-mirror, dm-zero, and dm-snapshot drivers were being loaded from the initrd and created enough of a delay that LVM started up after the SCSI drivers had settled. Now that they're no longer loaded, LVM is starting too quickly and is looking for the physical volumes before the SCSI drivers have had a chance to recognize the devices.

I'm not sure I would lay this purely at the feet of VMware. What prevents other drivers from returning quickly and settling down devices later?

I forced wait_for_scsi="yes" on line 1411 of mkinitrd, and now the initrd waits for the SCSI devices to settle before continuing on. That fixes the problem for me. Without this, we are at the mercy of the drivers to have fully settled before returning from insmod and starting LVM.

(In reply to comment #32)
> Issues regarding VMware drivers are very interesting.
>
> I've been looking into this, and I see where this has created a race
> condition. The changes from 2.6.26 to 2.6.27 have resulted in insmod
> returning from loading the SCSI drivers before the drivers have settled.
> This is with the VMware emulated devices, but this is NOT using VMware Tools
> or the VMware drivers in the guest. In early 2.6.27-rc kernels the dm-mirror,
> dm-zero, and dm-snapshot drivers were being loaded from the initrd and
> created enough of a delay that LVM started up after the SCSI drivers had
> settled. Now that they're no longer loaded, LVM is starting too quickly and
> is looking for the physical volumes before the SCSI drivers have had a
> chance to recognize the devices.
> I'm not sure I would lay this purely at the feet of VMware. What prevents
> other drivers from returning quickly and settling down devices later?

Michael, this is interesting, and I am glad that you have found a workaround for this. Though, what I am worried about is whether this behavior of the SCSI drivers is related to some config option or something. I have been trying to compile/boot the FC10 sources with a config file which has the DM modules compiled into the kernel. I have not been able to reproduce the problem with the compiled kernel, but I always hit this if I boot directly off the binary. Can any of you who has seen this with compiled sources upload the .config that you used?

> I forced wait_for_scsi="yes" on line 1411 of mkinitrd, and now the initrd
> waits for the SCSI devices to settle before continuing on. That fixes the
> problem for me. Without this, we are at the mercy of the drivers to have
> fully settled before returning from insmod and starting LVM.

Also, will this solution be accepted and included for FC10?

Thanks,
Alok

> I have not been able to reproduce the problem with the compiled kernel, but
> I always hit this if I boot directly off the binary. Can any of you who has
> seen this with compiled sources upload the .config that you used?

Alok, the config files for each Fedora kernel are in /boot, next to the kernel image and initrd files. I used the procedure detailed here: https://fedoraproject.org/wiki/Building_a_custom_kernel to locally rebuild a released kernel to confirm that reverting to modules for DM-* did solve the issue (comment #21). I did not try to locally build a kernel with _no_ changes, though.

> > I forced wait_for_scsi="yes" on line 1411 of mkinitrd, and now the initrd
> > waits for the SCSI devices to settle before continuing on. That fixes the
> > problem for me.

I will be able to confirm this in a few hours.
> > Also, will this solution be accepted and included for FC10?

There has been a lot of effort recently on speeding up the boot process (perhaps that is part of the problem? Hmm....). I'm sure forcing this SCSI wait on every configuration should be avoided if possible, but OTOH this race needs to be closed. Perhaps there is another way to avoid it.

Regards, ..... Charlie

(In reply to comment #32)
> I'm not sure I would lay this purely at the feet of VMware. What prevents
> other drivers from returning quickly and settling down devices later?
>
> I forced wait_for_scsi="yes" on line 1411 of mkinitrd, and now the initrd
> waits for the SCSI devices to settle before continuing on. That fixes the
> problem for me. Without this, we are at the mercy of the drivers to have
> fully settled before returning from insmod and starting LVM.

I can confirm that this fix works for me. You get the same result without editing /sbin/mkinitrd by adding "--with=scsi_wait_scan" to the mkinitrd command line. Obviously this kernel module was written just for this purpose, but what's the best way to trigger mkinitrd to include it? Thanks for your help!

Wow... I have no clue how I managed to miss a command-line option for it. I swear I searched through the script looking for "wait" or "sleep" and a couple of others, seeing what could be done. That's much better if it at least has a command-line option instead of editing the darn thing. I obviously didn't look deeply enough.

Interesting... "--with=scsi_wait_scan" actually does something slightly different, but it may well be completely equivalent. That actually loads a "scsi_wait_scan" module, while forcing "wait_for_scsi=yes" causes the command "stabilized --hash --interval 250 /proc/scsi/scsi" to be inserted into the init script. Not sure which is better; both probably work.
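For what it's worth, the poll-until-no-changes behavior attributed to the "stabilized" command above can be sketched in shell. This is a guess at the shape of the logic, not nash's actual code; the function name, the md5sum-based hashing, and the seconds-based interval (the real command is described as using a 250 ms interval) are all assumptions:

```shell
# Sketch of a stabilize-style wait: sample a file repeatedly and return
# once two consecutive samples hash the same. Illustrative only; the
# real "stabilized" command lives inside nash.
stabilized_sketch() {
    file=$1
    interval=${2:-1}    # seconds between samples (assumption; real one uses ms)
    prev=""
    while :; do
        cur=$(md5sum "$file" | cut -d' ' -f1)
        if [ "$cur" = "$prev" ]; then
            return 0    # two identical samples in a row: file is stable
        fi
        prev=$cur
        sleep "$interval"
    done
}

# Example: a file that is no longer changing stabilizes on the second sample.
tmpfile=$(mktemp)
echo "Attached devices:" > "$tmpfile"
stabilized_sketch "$tmpfile" 0 && echo "stable"
rm -f "$tmpfile"
```

Against /proc/scsi/scsi, a loop like this returns only after the SCSI probe has stopped adding devices, which is exactly the delay the dm-* module loads used to provide by accident.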
As far as triggering mkinitrd to include it, you can create a file /etc/sysconfig/mkinitrd with this:

MODULES="scsi_wait_scan"

That will do the same thing as --with=scsi_wait_scan on the command line. I don't see any way to trick it into the "wait_for_scsi" trick, so using "scsi_wait_scan" may well be the better answer all the way around.

Just confirming: yeah, undoing my modifications to mkinitrd, creating the /etc/sysconfig/mkinitrd file with MODULES="scsi_wait_scan", and recreating a new initrd worked like a charm for me as well. That's now my preferred workaround.

(In reply to comment #37)
> As far as triggering mkinitrd to include it, you can create a file
> /etc/sysconfig/mkinitrd with this:
>
> MODULES="scsi_wait_scan"

Yeah, locally, that's a workaround. But I was thinking of what changes have to be put into anaconda so that workaround can be triggered at installation. As it is now, a fresh install using device mapper is still broken on VMware, and maybe on other SCSI drivers that also return quickly. You have to boot a rescue disk, chroot, apply the workaround, etc. before your newly installed F10 will boot.

Created attachment 320345 [details]
A patch which should hopefully help...
Can you try applying this patch to /sbin/mkinitrd and regenerating your initrd, to see if it helps?
Created attachment 320381 [details]
Slightly modified patch
Well, the *idea* works :) But the patch had some issues, and threw a bunch of: /sbin/mkinitrd: line 1482: [: missing `]'
/sbin/mkinitrd: line 1487: -o: command not found
/sbin/mkinitrd: line 1488: -o: command not found
/sbin/mkinitrd: line 1489: -o: command not found
/sbin/mkinitrd: line 1490: -o: command not found
/sbin/mkinitrd: line 1491: -o: command not found
/sbin/mkinitrd: line 1492: -o: command not found
/sbin/mkinitrd: line 1493: -o: command not found
/sbin/mkinitrd: line 1494: ]: command not found
Adding module mptbase
/sbin/mkinitrd: line 1482: [: missing `]' .....
Slightly modified patch attached, which works for me.
BTW, based on your comment, I'm not sure I follow. What should the vmware code be doing that it isn't, or what is it not doing that it should?
Thanks a lot for your help on this!
Ok... The modified patch seems to work fine for me as well. That takes us back to using the "stabilized" command instead of the scsi_wait_scan module.

Advantages and disadvantages to each method? I noticed that the scsi_wait_scan module lingers in the running kernel after it's needed, but is it faster than the stabilized method? I could see where it would be faster if it's talking directly to the SCSI drivers and coming off a queue or a semaphore as soon as the scan is complete. The stabilized command is doing what? Polling /proc/scsi/scsi for changes periodically and returning if it sees no changes after a period of time? That would introduce more latency in the bootup but would not leave a lingering kernel module. Are there any other SCSI drivers we should be concerned about, or any possibility the behavior of another would change?

> Are there any other SCSI drivers we should be concerned about, or any
> possibility the behavior of another would change?

Maybe sym53c8xx. https://bugzilla.redhat.com/show_bug.cgi?id=466607

(In reply to comment #43)
> Maybe sym53c8xx. https://bugzilla.redhat.com/show_bug.cgi?id=466607

I was able to reproduce this under F10-snap2, using real HW: Symbios 53c875 SCSI on a Celeron 733, so it's not just a VMware issue. I don't have a serial console log, but on the remaining screen of boot text I can see that the "Booting has failed" message from init appears long before the sda partition tables are found. But note that the method used in the patch in comment #41 does *not* fix the issue for this HW: emitting the 'stabilized' call into init does not work (I verified that the rebuilt init had that call). The other approach, adding the "scsi_wait_scan" module, *did* fix the problem for this HW.
To summarize:

BusLogic under VMware: fixed with either scsi_wait_scan or 'stabilized'
LSI Logic under VMware: fixed with either scsi_wait_scan or 'stabilized'
Symbios 53c875 on real HW: fixed only with scsi_wait_scan
Symbios ?? SCSI on real HW: fixed only with scsi_wait_scan (from 466607)

I'll post under https://bugzilla.redhat.com/show_bug.cgi?id=466607 too, but I'm thinking that should be marked as a duplicate of this, and this should probably be moved from kernel to mkinitrd.

Regards, ... Charlie

I have the same problem with a 3ware RAID card (3w-xxxx module). Booting to an FC9 kernel is the only solution for me. I can confirm that when using /etc/sysconfig/mkinitrd with MODULES="scsi_wait_scan", the machine starts to boot again ...

I can confirm that mkinitrd-6.0.68 fixes the original issue of this bug: booting in VMware. So this bug can be closed. But (at least) the 3ware and Symbios 53c875 cases, and perhaps the Toshiba L10 laptop, still do not boot, as in https://bugzilla.redhat.com/show_bug.cgi?id=466607. These need the scsi_wait_scan module, or, if scsi_wait_scan is too heavy-handed, we need to dig further into why these devices fail with "stabilized --hash --interval 250 /proc/scsi/scsi".

Neither mkinitrd-6.0.68 nor .69 fixes the problem for me; the exact same problem persists for my VMware installation. Unless I put the scsi_wait_scan info into /etc/sysconfig/mkinitrd, my system will not boot a newly installed kernel. What other information can I provide in order to make sure this bug is fixed, as it is not for me?

I just updated one of my F10 VMware images to the latest rawhide. I got mkinitrd-6.0.69-1.fc10.i386, kernel-2.6.27.4-68.fc10.i686 and nash-6.0.69-1.fc10.i386. The system updated and then rebooted fine without scsi_wait_scan called out in /etc/sysconfig/mkinitrd. In fact, I didn't even have /etc/sysconfig/mkinitrd on the system this time for this test. Works fine for the first time in several updates.

What hard drive controller do you have configured?
ATAPI, BusLogic SCSI, or LSI Logic SCSI?

(In reply to comment #49)
> I just updated one of my F10 VMware images to the latest rawhide. I got
> mkinitrd-6.0.69-1.fc10.i386, kernel-2.6.27.4-68.fc10.i686 and
> nash-6.0.69-1.fc10.i386. System updated and then rebooted fine without
> scsi_wait_scan called out in /etc/sysconfig/mkinitrd. In fact, I didn't even
> have the /etc/sysconfig/mkinitrd on the system this time for this test. Works
> fine for the first time in several updates.
>
> What hard drive controller do you have configured? ATAPI, BusLogic SCSI, or
> LSI Logic SCSI?

Honestly I don't know, but I believe BusLogic. I don't see it in the config, at least not in a way that is obvious to me. I've always installed Linux in VMware using this install path: Advanced, but select ALL DEFAULTS. The default "Linux" path seemed to cause me more problems, and AFAIK this was related to LSI vs BusLogic.

I am attaching the config for that VM. There are "F9" mentions in it, but that's because it was an F9 -> F10 upgrade path that I was testing, and I've been upgrading it since.

Created attachment 322368 [details]
VMware Config for my guest that is having problems booting F10
I've hand-modified it slightly to take out any password-related items (VNC etc.); otherwise I haven't altered anything. The title says F9; that's because it was an F9 -> F10 upgrade (snap1) which has been continuously updated since it was upgraded a few weeks ago.
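As an aside, the workaround several commenters describe -- forcing scsi_wait_scan into the image via /etc/sysconfig/mkinitrd and rebuilding by hand -- spelled out looks roughly like this (the initrd path and kernel version are whatever your system has; adjust to the kernel you need to fix):

```shell
# Append the override; mkinitrd reads MODULES from this file and
# forces the listed modules into the generated image.
echo 'MODULES="scsi_wait_scan"' >> /etc/sysconfig/mkinitrd

# Rebuild the initrd for the running kernel; -f overwrites the
# existing image.
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
```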
(In reply to comment #50)
> > What hard drive controller do you have configured? ATAPI, BusLogic SCSI, or
> > LSI Logic SCSI?
>
> Honestly I don't know, but I believe BusLogic. I don't see it in the config,
> at least in a way which is obvious to me.

(Comment #15 seemed to say your boot problem was gone?) SCSI in recent VMware is lsilogic, unless you manually edit the vmx file.

Are you sure you are using the new mkinitrd? Have you installed an updated kernel since you updated mkinitrd? The new initrd won't be built until a kernel is installed, and if you install a new kernel and the new mkinitrd at the same time, I'm not sure which would be done first.

Can you attach a serial boot log for a failed boot, as in comment #28? And can you attach your init file? You can get it by:

mkdir temp; cd temp
zcat /boot/initrd-x.y.z.img | cpio -i

Your init file will be in the temp directory with other cruft.

(In reply to comment #52)
> (Comment #15 seemed to say your boot problem was gone?) SCSI in recent VMware
> is lsilogic, unless you manually edit the vmx file.

I installed this under VMware Server 5, updated to VMware Workstation 6.5. I've also done the "Upgrade" option under VMware, but I did that post-F10. And yes, I thought the problem was gone, but it re-appeared later, and since then the VM has been 100% unable to boot newer kernels without the scsi_wait mkinitrd modification.

> Are you sure you are using the new mkinitrd? Have you installed an updated
> kernel since you updated mkinitrd? The new initrd won't be built until a
> kernel is installed, and if you install a new kernel and the new mkinitrd at
> the same time, I'm not sure which would be done first.

Yes, sorry I wasn't clear there. One kernel was installed post mkinitrd .68 and another under mkinitrd .69, so two kernels after the mkinitrd which was stated to fix the problem.

> Can you attach a serial boot log for a failed boot, as in comment #28?

Neat trick, attaching.
> And can you attach your init file? You can get it by:
>
> mkdir temp; cd temp
> zcat /boot/initrd-x.y.z.img | cpio -i
>
> Your init file will be in the temp directory with other cruft.

Attaching it. I'm having to use -58, as I 'fixed' -68 yesterday by editing /etc/sysconfig/mkinitrd and manually running mkinitrd so I could boot the latest kernel.

Created attachment 322458 [details]
Serial debug output from vmware workstation 6.5 F10, 2.6.27.4-58
Serial debug output of failure to boot under latest mkinitrd, .68.
Created attachment 322462 [details]
init file from 2.6.27.4-58
Init file generated by mkinitrd 68 or 69, for kernel 2.6.26.4-58
(In reply to comment #54)
> Created an attachment (id=322458) [details]
> Serial debug output from vmware workstation 6.5 F10, 2.6.26.4-58
>
> Serial debug output of failure to boot under latest mkinitrd, .68.

Grrr ... friggin' case in module names ...

The patch in comment 41 checked for module 'buslogic', but the module name is 'BusLogic', so the 'stabilized()' call was never emitted. Edit your mkinitrd to change buslogic to BusLogic and you should be all set.

(In reply to comment #56)
> (In reply to comment #54)
> > Created an attachment (id=322458) [details]
> > Serial debug output from vmware workstation 6.5 F10, 2.6.26.4-58
> >
> > Serial debug output of failure to boot under latest mkinitrd, .68.
>
> Grrr ... friggin' case in module names ...
>
> The patch in comment 41 checked for module 'buslogic', but the module name is
> 'BusLogic', so the 'stabilized()' call was never emitted. Edit your mkinitrd
> to change buslogic to BusLogic and you should be all set.

Woot! That did it, verified directly using mkinitrd after modifying with s/buslogic/BusLogic/. Thanks!

(In reply to comment #57)
> Woot! That did it, verified directly using mkinitrd after modifying with
> s/buslogic/BusLogic/

I can confirm that this is fixed in mkinitrd-6.0.70
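For the record, the root cause above reduces to case-sensitive string matching; a minimal shell sketch of the failure mode (variable names and module list are illustrative, not mkinitrd's actual code):

```shell
# The loaded module is named 'BusLogic'; a check written against
# 'buslogic' silently never matches, so the broken patch never emitted
# the stabilized() call for this hardware.
MODULES="BusLogic sd_mod"

match="none"
case " $MODULES " in
    *" buslogic "*) match="lowercase" ;;  # what the broken patch tested for
    *" BusLogic "*) match="exact" ;;      # the real module name
esac
echo "$match"   # prints "exact": only the exact-case pattern matches
```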
Created attachment 319721 [details]
diff of serial console output, LVM working vs not working

Description of problem:
VolGroup00 is not found, so root is not found: the system doesn't boot.

Version-Release number of selected component (if applicable):
2.6.27-0.382 worked OK (as did everything back to FC7)
2.6.27-0.391 and all since do not work

How reproducible:
Every time one tries to boot a recent kernel on my test system with an LVM rootfs

Steps to Reproduce:
1. Install a kernel rpm newer than 2.6.27-0.382 on a system with an LVM rootfs
2. Try to boot

Actual results:
mount: error mounting /dev/root on /sysroot as ext3: No such file or directory

Expected results:
2 logical volume(s) in volume group "VolGroup00" now active
Creating root device.
Mounting root filesystem

Additional info:
Diffing the configs of the working and broken kernels shows that CONFIG_DM_SNAPSHOT, CONFIG_DM_MIRROR, and CONFIG_DM_ZERO were changed from modules (working kernel) to built-in (failing kernel). No idea why that should matter, but it appears to. The only other config change was adding VIDEO_HDPVR.

This is a vmware image that has tracked rawhide since about FC7. I'm filing it against the kernel; please change if that is wrong. I will attach a diff of console output.