466607 – initrd doesn't wait for (scsi) hba to scan the bus, causing boot failure


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mkinitrd
Sub Component:
Version:	10
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Peter Jones
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (7):	466534 470726 471093 472989 476337 476472 477596
Depends On:
Blocks:	F10Target

Reported:	2008-10-11 05:59 UTC by Chris Kloiber
Modified:	2013-01-09 00:50 UTC
CC List:	30 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-12-23 19:07:34 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
mkinitrd.log file as requested (12.27 KB, text/plain) 2008-10-12 15:52 UTC, Chris Kloiber
Time honored photos of the screen showing failure... (2.87 MB, application/x-gzip) 2008-10-12 16:11 UTC, Chris Kloiber
Requested init from failed initial ramdisk (1.77 KB, text/plain) 2008-10-16 22:50 UTC, Chris Kloiber
init file from Toshiba L10 which fails to boot (1.71 KB, application/octet-stream) 2008-10-22 07:42 UTC, Mike Thompson
stabilizedHash function from nash.c, with 'last' properly initialized (2.42 KB, text/plain) 2008-10-30 17:49 UTC, Charlie Moschel

Description Chris Kloiber 2008-10-11 05:59:22 UTC

Description of problem:

I tried installing F10 on an HP lp1000r server. The install seems to work OK, but booting the installed system fails: it appears to attempt to mount the root filesystem before the SCSI disks have been enumerated or the LVM volumes assembled.

Version-Release number of selected component (if applicable):

F10-Beta

How reproducible:

100%

Steps to Reproduce:
1. Install F10-Beta (my cd drive is bad, I did an http install)
2. Reboot

Comment 1 Hans de Goede 2008-10-11 20:22:56 UTC

Thanks for reporting this, but I'm afraid we need more info before we can fix this.

Can you please boot with a rescue CD and then chroot to the root of the F-10 install. Then as preparation do:

mount /boot
mount proc /proc -t proc
mount sysfs /sys -t sysfs

And then run
mkinitrd -v -f /boot/initrd-XXXXXX.img XXXXXX

Get the first XXXXXX by using tab-completion (there should be only one option) and use the same for the second XXXXXX, this should give a couple of screens of output, so better do:
mkinitrd -v -f /boot/initrd-XXXXXX.img XXXXXX &> /tmp/log

Then save /tmp/log somewhere and attach it here,

Thanks!

Comment 2 Chris Kloiber 2008-10-12 01:22:08 UTC

Will do Sunday or Monday, after I replace the busted CD drive. Thanks.

Comment 3 Hans de Goede 2008-10-12 07:31:37 UTC

(In reply to comment #2)
> Will do Sunday or Monday, after I replace the busted CD drive. Thanks.

Ok, also it would be interesting if you could see if the system will boot after you've issued that command (it regenerates the initrd, and the environment after install could be different from that during install).

Comment 4 Chris Kloiber 2008-10-12 15:52:11 UTC

Created attachment 320144 [details]
mkinitrd.log file as requested

Comment 5 Chris Kloiber 2008-10-12 16:11:03 UTC

Created attachment 320145 [details]
Time honored photos of the screen showing failure...

Rebuilding the initial ramdisk did not help, but these photos tell the tale.

The scsi drivers are loaded first, but there is a scsi bus reset. The LVM does not get assembled in time to change to the new root and booting fails. Afterwards, the scsi disks become available, but it's too late.

Is there a way to test if LVM is required, and if so test to see if it's up before attempting to pivot-root?

Comment 6 Hans de Goede 2008-10-13 11:51:39 UTC

Hmm, interesting. Changing the component to mkinitrd; hopefully Peter has an idea what's going on here and how to fix it.

Comment 7 Peter Jones 2008-10-14 15:20:19 UTC

Can you extract the "init" file from the initrd and attach it here?  The process is:

mkdir tmp
cd tmp
zcat /boot/initrd-$VERSION.img | cpio -di

that should leave a bunch of stuff, including a file named "init", in the "tmp" directory.
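
Once extracted, the init script can be searched for the two SCSI wait mechanisms that come up later in this bug: the nash "stabilized" command and the scsi_wait_scan module. The following is only a minimal sketch, not part of mkinitrd. The function name report_scsi_wait is made up for illustration, and a match only shows that a wait was emitted, not that it is long enough.

```shell
# report_scsi_wait INIT_FILE   (hypothetical helper, not part of mkinitrd)
# Prints "none", "stabilized", "scsi_wait_scan" or "scsi_wait_scan+stabilized"
# depending on which strings occur in the extracted init script.
report_scsi_wait() (
    init=$1
    found=
    # Module name as used with --with=scsi_wait_scan
    if grep -q 'scsi_wait_scan' "$init"; then
        found=scsi_wait_scan
    fi
    # nash command, e.g. "stabilized --hash --interval 250 /proc/scsi/scsi"
    if grep -q '^[[:space:]]*stabilized' "$init"; then
        found=${found:+$found+}stabilized
    fi
    echo "${found:-none}"
)
```

Usage, from the directory above "tmp": report_scsi_wait tmp/init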

Comment 8 Charlie Moschel 2008-10-15 14:50:03 UTC

(In reply to comment #5)
> Created an attachment (id=320145) [details]
> Time honored photos of the screen showing failure...
> 
> Rebuilding the initial ramdisk did not help, but these photos tell the tale.
> 
> The scsi drivers are loaded first, but there is a scsi bus reset. The LVM does
> not get assembled in time to change to the new root and booting fails.
> Afterwards, the scsi disks become available, but it's too late.

Chris, take a look at https://bugzilla.redhat.com/show_bug.cgi?id=466071
It looks similar to me, and was initially triggered (running under VMware) by moving dm-* from modules to kernel built-ins.  (That's done in a later kernel than what you are testing).  Try the mkinitrd patch near the end, but add your scsi driver if it's not there.

> 
> Is there a way to test if LVM is required, and if so test to see if it's up
> before attempting to pivot-root?

Comment 9 Michael H. Warfield 2008-10-15 15:54:24 UTC

I concur: these sound like two aspects of the same problem, a race condition between the SCSI drivers and starting LVM.  Retry the instructions for rebuilding the initrd, but add "--with=scsi_wait_scan" to the mkinitrd command line.

In comparing the kernel messages from F9 with a 2.6.26.6 kernel and the 2.6.27 kernels, I'm getting the impression that some of the SCSI drivers are returning from initialization earlier than they used to.  That speeds up the boot but opens up this race condition when lvm runs before all the devices have been scanned and are stable.  That was made worse by moving the dm-* modules into the kernel itself, but the race condition exists in either case.  With a driver that takes a long time to settle, the time spent loading the dm-* modules might not be long enough to win the race for the good guys.

Comment 10 Chris Kloiber 2008-10-16 03:52:25 UTC

I will do so as soon as possible. I installed another OS but am not using it yet. Will reinstall again then try your suggestions.

Comment 11 Chris Kloiber 2008-10-16 22:50:01 UTC

Created attachment 320619 [details]
Requested init from failed initial ramdisk

Here's the init from the failed initial ramdisk. Will try the mkinitrd patch next.

Comment 12 Chris Kloiber 2008-10-16 23:22:53 UTC

Booting still fails after rebuilding the initrd with the patched mkinitrd.

Comment 13 Charlie Moschel 2008-10-17 00:06:54 UTC

(In reply to comment #12)
> Booting still fails after rebuilding the initrd with the patched mkinitrd.

Did you add your scsi adapter to the list of modules that sets 'wait_for_scsi'?  It will be obvious looking at the patch; something like '-o "sym53c8xx" == "$module" \'

If that still doesn't work, can you try adding "--with=scsi_wait_scan" to your mkinitrd command line?

From what I can tell, the init file in the initrd used to insmod scsi_wait_scan, then rmmod it right away (at least on my F8 system), in addition to using the 'stabilized' command built into nash.  Now F10 tries to use neither in some cases.  But the emulated scsi adapters in VMware still need the stabilized call, hence the patch to mkinitrd to re-enable it for the two scsi modules used by VMware.  It may be that your sym53c8xx needs it too, by virtue of its bus reset.

regards, ...... Charlie

Comment 14 Chris Kloiber 2008-10-17 01:46:18 UTC

No I didn't. I'll try tomorrow as I'm at work from now till 8am.

Comment 15 Chris Kloiber 2008-10-20 01:26:03 UTC

Sorry for the delay, I was ill yesterday. I have had success with an unmodified mkinitrd and adding the --with=scsi_wait-scan option. I did try adding the sym53c8xx driver to the patch first, but that didn't seem to work.

Comment 16 Charlie Moschel 2008-10-20 03:13:00 UTC

(In reply to comment #15)
> Sorry for the delay, I was ill yesterday. I have had success with an unmodified
> mkinitrd and adding the --with=scsi_wait-scan option. I did try adding the
> sym53c8xx driver to the patch first, but that didn't seem to work.

Thanks, I was able to grab some scsi HW and confirm this today as well.  My comments are in https://bugzilla.redhat.com/show_bug.cgi?id=466071

Looks like the same bug; I'd suggest marking this a duplicate of 466071.

Comment 17 Chris Kloiber 2008-10-20 21:57:53 UTC

Fine with me, as long as it's fixed. Thank You very much.

Comment 18 Mike Thompson 2008-10-21 07:48:47 UTC

(In reply to comment #17)
> Fine with me, as long as it's fixed. Thank You very much.

Looks to me like it's not fixed....
I still get the issue on my Toshiba Satellite L10 (just like an Acer Aspire), even after running mkinitrd with the --with=scsi_wait-scan option.

Comment 19 Charlie Moschel 2008-10-21 10:29:03 UTC

(In reply to comment #18)
> Looks to me like its not fixed....
> I still get the issue on my Toshiba Satellite L10 (Just like an Acer Aspire),
> even after the mkinitrd with the --with=scsi_wait-scan option.

Typo?  There are two underscores in scsi_wait_scan ...

Can you attach a boot log by serial console, or a photo of the last screen of text?  Can you post your init (as shown in comment #7)?

Comment 20 Charlie Moschel 2008-10-21 10:30:54 UTC

-ENEED_COFFEE.  Why would scsi_wait_scan be expected to work on a laptop?

Comment 21 Michael H. Warfield 2008-10-21 13:44:20 UTC

Maybe it's a SATA drive (mine is)?  But even PATA subsystems are using the SCSI subsystem now, aren't they?  And mkinitrd was checking for ata_piix, sata_*, pata_* and then running the stabilized command against the scsi subsystem.  They are all coming up /dev/sd* now.

Comment 22 Mike Thompson 2008-10-22 07:42:41 UTC

Created attachment 321124 [details]
init file from Toshiba L10 which fails to boot

Here's a copy of my INIT file as requested

Comment 23 Mike Thompson 2008-10-22 07:43:42 UTC

(In reply to comment #19)
INIT file has now been attached.

> (In reply to comment #18)
> > Looks to me like its not fixed....
> > I still get the issue on my Toshiba Satellite L10 (Just like an Acer Aspire),
> > even after the mkinitrd with the --with=scsi_wait-scan option.
> 
> Typo?  There are two underscores in scsi_wait_scan ...
> 
> Can you attach a boot log by serial console, or a photo of the last screen of
> text?  Can you post your init (as shown in comment #7)?

Comment 24 Gerwin Krist 2008-10-27 19:42:42 UTC

I have the same problem with a 3ware RAID card (module: 3w-xxxx). Using an /etc/sysconfig/mkinitrd containing this line:

MODULES="scsi_wait_scan"

the machine booted again.

Comment 25 Michael H. Warfield 2008-10-27 21:20:46 UTC

This is now the second SCSI card this has been reported on, and it is not merely VMware.  I believe this now qualifies as an F10 blocker.

According to the "QA/ReleaseCriteria", https://fedoraproject.org/wiki/QA/ReleaseCriteria , failure to boot on any configuration qualifies as a blocker:

==
A note on MUST and SHOULD
Items described with the word "MUST" are required to work for all releases, including Test releases.
Additionally, the items described with the word "SHOULD" are required for all final releases.
==

==
  Boot

    * The installed system MUST boot and start up properly.
==

Comment 26 Charlie Moschel 2008-10-30 17:49:32 UTC

Created attachment 321966 [details]
stabilizedHash function from nash.c, with 'last' properly initialized

OK, I think I've found a bug in stabilizedHash() in nash.c, where it returns a stable condition when in fact nothing has happened yet.  [NB this is from code review only, not yet tested.]

Current mkinitrd emits "stabilized --hash --interval 250 /proc/scsi/scsi" if a delay for scsi settling is needed.  This will result in stabilizedHash() being called with file=/proc/scsi/scsi, iterations=-1, interval_ts=250 ms, and goal=10.

AFAICT, stabilizedHash() with these arguments should watch /proc/scsi/scsi for changes (by using an Adler checksum), and either:
 * return -1 on error, OR
 * return 0 if no changes have been observed in 10 consecutive checksums, each delayed by 250 ms, OR
 * return 1 if changes were found at some point, but no further changes have been seen for 10 consecutive checksums (again each delayed by 250 ms).

I think the problem is in the very first compare in the do{} loop, where the first checksum result is compared to 'last', which is initialized to -1.  They don't match, so this incorrectly shows that a change has been seen.  Now, if nothing happens in the next 9 tests, stabilizedHash() returns 1, indicating a stable scsi, when in fact *nothing* has happened yet.  The fix should be to initialize 'last' with a call to checksumFd *before* entering the do{} loop.

I guess adding this 'stabilized' call was enough to fix VMware scsi (as in BZ 466071) because the 2.25 s delay is long enough.  But for some real HW (Symbios 53c8xx, 3ware, and maybe others), the delay is not long enough, and this bug caused nash to keep going rather than detect that no settling happened (rc=0, which would trigger a sleep(10) in stabilizedCommand()).

Also, even if this fixes the stabilize call, mkinitrd still has to be modified to emit the stabilized call for the affected scsi modules.

I've attached the stabilizedHash function with proposed modification rather than putting it inline.  If this makes sense to any nash hackers, I'll attach a proper diff, and do some further testing.
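
The difference between the two initializations can be illustrated with a small shell model. This is a simplified sketch of the loop as described in this comment, not the nash source: the function name stabilized_sim, its MODE argument, and the use of literal sample strings in place of Adler checksums are all assumptions made for illustration, and the exact iteration count in nash may differ.

```shell
# stabilized_sim MODE GOAL SAMPLE...   (simplified model, not the nash code)
#   MODE "sentinel": 'last' starts at -1, as described for the current code
#   MODE "primed":   'last' starts as the first sample (the proposed fix)
# Prints 0 if GOAL consecutive samples matched with no change ever seen,
#        1 if a change was seen before GOAL consecutive matches,
#       -1 if the samples run out first (stand-in for error/timeout).
stabilized_sim() (
    mode=$1
    goal=$2
    shift 2
    last=-1
    changed=0
    count=0
    if [ "$mode" = primed ]; then
        last=$1
        shift
    fi
    for sum in "$@"; do
        if [ "$sum" = "$last" ]; then
            count=$((count + 1))
            if [ "$count" -ge "$goal" ]; then
                echo "$changed"
                exit 0
            fi
        else
            changed=1    # with last=-1 the first sample always lands here
            count=0
        fi
        last=$sum
    done
    printf '%s\n' -1
)
```

With a bus where nothing has happened yet (for example "stabilized_sim sentinel 3 a a a a"), the sentinel start reports 1, while the primed start reports 0 for the same samples, which is the case that should fall through to the sleep(10).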

Comment 27 Bryan L. Gay 2008-10-30 23:08:40 UTC

Just to chime in here:

I've duplicated this issue on my Dell Inspiron 6000 with a PATA hard drive (showing up as /dev/sda) on F9. The last two kernel releases have this bug. Let me know what testing would contribute to this thread.

I couldn't test F10 on my desktop using Intel due to X not loading the Intel Video drivers, even in anaconda, so I had to go back to F9. Once that problem is resolved, I can test with an Intel-based 32bit system (HP/Compaq dc5000 MT IDE) with on-board video. I'll be looking for the BZ# on this one to follow-up.

My 64-bit AMD system on nVidia chipset does not exhibit this problem on F10, all updates applied. I installed this system from the iso and have been applying the updates daily.

I also have a 32-bit AMD system on nVidia chipset installed using the development netboot which pulls the current packages from the dev tree, and it has been updated daily. Neither of the AMD nVidia-based systems exhibits this problem.

There are a LOT of Dell Inspiron systems out there, and I'm willing to bet this issue affects a significant number of them.

Comment 28 Mike Thompson 2008-10-31 03:11:00 UTC

You can get around the Intel video driver problem by forcing X to use the vesa driver.
Either modify your /etc/X11/xorg.conf to include:
Section "Device"
	Identifier "Videocard0"
	Driver "vesa"
EndSection

Or add the kernel boot parameter xdriver=vesa

I get the same problems (Intel video and FC10 boot failure) on my Toshiba L10.
To run the latest kernel(s) I have to overwrite the initrd with the one from kernel 2.6.27-0.392.rc8.git7.fc10.



(In reply to comment #27)
> Just to chime in here:
> 
> I've duplicated this issue on my Dell Inspiron 6000 with PATA hard drive
> (showing up as /dev/sda) on F9. The last two kernel releases have this bug. Let
> me know 
> what testing will contribute to this thread.
> 
> I couldn't test F10 on my desktop using Intel due to X not loading the Intel
> Video drivers, even in anaconda, so I had to go back to F9. Once that problem
> is resolved, I can test with an Intel-based 32bit system (HP/Compaq dc5000 MT
> IDE) with on-board video. I'll be looking for the BZ# on this one to follow-up.
> 
> My 64-bit AMD system on nVidia chipset does not exibit this problem on F10, all
> updates applied. I installed this system from the iso and have been applying
> the updates daily.
> 
> I also have a 32-bit AMD system on nVidia chipset installed using the
> development netboot which pulls the current packages from the dev tree, and it
> has been updated daily. Neither of the AMD nVidia-based systems exhibits this
> problem.
> 
> There are a LOT of Dell Inspiron systems out there, and I'm willing to bet this
> issue affects a significant number of them.

Comment 29 Hans de Goede 2008-10-31 10:32:11 UTC

(In reply to comment #26)
> Created an attachment (id=321966) [details]
> stabilizedHash function from nash.c, with 'last' properly initialized
> 
> OK, I think I've found a bug in stabilizedHash() in nash.c, where it returns a
> stable condition when in fact nothing has happened yet.  [NB this is from code
> review only, not yet tested.]
> 
> Current mkinitrd emits "stabilized --hash --interval 250 /proc/scsi/scsi" if a
> delay for scsi settling is needed.  This will result in stabilizedHash() being
> called with file=/proc/scsi/scsi, iterations=-1, interval_ts=250 mS, and
> goal=10.
> 
> AFAICT, stabilizedHash() with these arguments should watch /proc/scsi/scsi for
> changes (by using adler checksum), and either:
>  * return -1 if error, OR
>  * return 0 if no changes have been observed in 10 consecutive checksums, each
> delayed by 250 mS, OR
>  * return 1 if changes were found at some point, but no further changes have
> been seen for 10 consecutive checksums (again each delayed by 250 mS).
> 
> I think the problem is in the very first compare in the do{} loop, where the
> first checksum result is compared to 'last', which is initialized to -1.  They
> don't match, so this incorrectly shows that a change has been seen.  Now, if
> nothing happens in the next 9 tests, stabilizedHash() returns 1, indicating a
> stable scsi, when in fact *nothing* has happened yet.  The fix should be to
> initialize 'last' with a call to checksumFd *before* entering the do{} loop.
> 

This indeed seems to be a bug in the code. I've prepared a test release with your suggested fix in it:
http://koji.fedoraproject.org/koji/taskinfo?taskID=913443

Can you install nash from this one, regenerate your initrd using mkinitrd -f, and see if that fixes things?

Thanks.

Comment 30 Erik van Pienbroek 2008-10-31 21:19:43 UTC

(In reply to comment #29)
> This indeed seems to be a bug in the code. I've prepared a test release with
> your suggested fix in it:
> http://koji.fedoraproject.org/koji/taskinfo?taskID=913443
> 
> Can you install nash from this one, regenerate your initrd using mkinitrd -f,
> and see if that fixes things?

Booting still fails for me with this build of mkinitrd. I now see the message 'Could not detect stabilization, waiting 10 seconds'..then the computer sleeps for 10 seconds..and finally device-mapper bails out causing a segfault in nash (as has been the case for me for a large number of F-10 kernels)

Comment 31 Charlie Moschel 2008-11-01 01:58:34 UTC

(In reply to comment #29)
> (In reply to comment #26)
> > OK, I think I've found a bug in stabilizedHash() in nash.c, where it returns a
> > stable condition when in fact nothing has happened yet.  [NB this is from 
> This indeed seems to be a bug in the code. I've prepared a test release with
> your suggested fix in it:
> http://koji.fedoraproject.org/koji/taskinfo?taskID=913443
> 
> Can you install nash from this one, regenerate your initrd using mkinitrd -f,
> and see if that fixes things?

[Sorry for the length ..]
I tested using sym53c8xx scsi driver, so first I had to add that module to the list of modules in mkinitrd that trigger a "stabilized" call to be emitted.

The good news is that stabilizedHash() properly returns 0 if nothing is stable in the allotted time.

The bad news is that sym53c8xx is not stable with the standard "stabilized --hash --interval 250 /proc/scsi/scsi" call, nor even with an interval of 500.  In those cases, stabilizedCommand() in nash punts, echoing "Could not detect stabilization ..." and calls sleep(10).  That *is* enough time for sym53c8xx to stabilize, and boot continues OK after the sleep.

Changing to an interval of 1000 is also enough for sym53c8xx to stabilize, and boot OK *without* triggering the sleep(10).  But I suspect that there is an error lurking in the timeout values, because (by eye) the delay does not seem as long as it should be before punting (ie, what should be (10) 1000 mS intervals does not seem nearly that long).

But, I think for scsi drivers, loading the scsi_wait_scan module is the right thing to do.  BZ 454663 shows that it was automatically loaded if mkinitrd found 'scsi_mod'.  Now 'scsi_mod' is built-in, so nothing triggers the 'emit "modprobe scsi_wait_scan" ' and 'emit "rmmod scsi_wait_scan" ' pair any longer.  (Telling mkinitrd to use scsi_wait_scan will load the module, but not rmmod it).

To summarize (sorry for the length):
* stabilizedHash() is working as designed, but timeout lengths are suspect
* Some modules (3ware, sym53c8xx) still don't use 'stabilized' when they need to.
* Best approach may be to bring back the scsi_wait_scan module.  You don't want to add every scsi module to the test in mkinitrd to see if 'stabilized' is needed.

Regards, ... Charlie

Comment 32 Charlie Moschel 2008-11-01 02:05:20 UTC

(In reply to comment #30)

> Booting still fails for me with this build of mkinitrd. I now see the message
> 'Could not detect stabilization, waiting 10 seconds'..then the computer sleeps
> for 10 seconds..and finally device-mapper bails out causing a segfault in nash
> (as has been the case for me for a large number of F-10 kernels)

If nash segfaults for you, I think that's a different bug, worth a new BZ.  But the new sleeping message means stabilizedHash() is working as intended, returning 0 if no changes are detected.

In your new BZ entry, attaching a working and non-working init file would help.  You can extract them as described in comment #7.

Comment 33 Hans de Goede 2008-11-02 09:09:26 UTC

(In reply to comment #31)
> To summarize (sorry for the length):
> * stabilizeHash() is working as designed, but timeout lengths are suspect

Did you time this using a stopwatch or something like that? I've audited the
sleeping code and can find no errors.

> * Some modules (3ware, sym53c8xx) still don't use 'stabilized() when they need
> to.

Ack

> * Best approach may be to bring back scsi_wait_scan module.  You don't want to
> add every scsi module to the test in mkinitrd to see if stabilized() is needed.

I don't know why this was removed in the first place.

Anyways I'm out of ideas here, Peter?

Comment 34 Jeremy Katz 2008-11-03 19:34:59 UTC

(In reply to comment #33)
> (In reply to comment #31)
> > * Best approach may be to bring back scsi_wait_scan module.  You don't want to
> > add every scsi module to the test in mkinitrd to see if stabilized() is needed.
> 
> I don't know why this was removed in the first place.
> 
> Anyways I'm out of ideas here, Peter?

scsi_wait_scan stopped really being an option when we built in modules like sd_mod, etc., which was done to speed up boot.

Comment 35 Charlie Moschel 2008-11-04 05:28:27 UTC

(In reply to comment #34)
> (In reply to comment #33)
> > (In reply to comment #31)
> > > * Best approach may be to bring back scsi_wait_scan module.  You don't want to
> > > add every scsi module to the test in mkinitrd to see if stabilized() is needed.
> > 
> > I don't know why this was removed in the first place.
> > 
> > Anyways I'm out of ideas here, Peter?
> 
> scsi_wait_scan became not really an option when we built in modules like
> sd_mod, etc which was done to speed up boot

Then can't we include scsi_wait_scan unconditionally?  It seems the intent in the past was to use scsi_mod as a trigger to insmod scsi_wait_scan in init, so if scsi_mod is now built in unconditionally, shouldn't we unconditionally wait like we used to?

I've read that Arjan's long term goal is to build a kernel that doesn't need an initrd to boot for most cases, but will fall back if /sbin/init can't be found.  But that's a little more work, and not really doable before F10 :)

Comment 36 Peter Jones 2008-11-04 15:38:54 UTC

diff --git a/nash/nash.c b/nash/nash.c
index 5f8cc49..3a66f92 100644
--- a/nash/nash.c
+++ b/nash/nash.c
@@ -2110,8 +2110,7 @@ stabilizedPoll(char *path, int iterations, struct timespec
 }
 
 static int
-stabilizedHash(char *path, int iterations, struct timespec interval,
-    int goal)
+stabilizedHash(char *path, int iterations, struct timespec interval, int goal)
 {
     int fd;
     uint32_t last = -1;
@@ -2131,7 +2130,7 @@ stabilizedHash(char *path, int iterations, struct timespec
                 ret = changed;
                 break;
             }
-        } else {
+        } else if (last != -1) {
             changed = 1;
             count = 0;
         }

Comment 37 Patrick C. F. Ernzer 2008-11-04 16:44:21 UTC

Same problem on a machine equipped with a qla2xxx HBA.

Using the instructions from Comment #24 (and Comment #25 from Bug 377921)
seems to solve it; I had 10 consecutive successful boots (5 warm, 5 cold).

WRT Comment #31, the system I tested on had 2 qla2xxx HBAs but only a disk
on the first, so there was some delay while it was probing the second HBA. If
you think it is worthwhile testing on a machine where both HBAs have connected
cables I can do that (due to how the system is built that will be multipath),
just set NEEDINFO on me in that case. (The reason I ask is that I did experience
bug 467897 on 5.3 Alpha1 on these machines, both single and multipath.)

Comment 38 Peter Jones 2008-11-04 20:15:26 UTC

This should be fixed in mkinitrd-6.0.70-1 .

Comment 39 Charlie Moschel 2008-11-05 03:55:23 UTC

(In reply to comment #38)
> This should be fixed in mkinitrd-6.0.70-1 .

Sorry, 6.0.70 doesn't fix it with my sym53c8xx test system: there is no stabilized() call emitted into init.  I guess this part of the diff is looking for scsi storage devices to set wait_for_scsi and trigger stabilized(), but I have nothing like scsi:* under /sys/devices, or even under /sys.

@@ -355,7 +357,12 @@
     while [ "$PWD" != "/sys/devices" ]; do
         deps=
         if [ -f modalias ]; then
-            moduledep $(cat modalias)
+            MODALIAS=$(cat modalias)
+            if [ "${modalias::7}" == "scsi:t-" ]; then
+                wait_for_scsi=yes
+            fi
+            moduledep $MODALIAS
+            unset MODALIAS
         fi

What other info would help?

The good news is that your change to remove the plymouth screen when root isn't mounted worked great :)
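
One detail in the excerpt above may be worth checking: the value is stored in MODALIAS, but the prefix test reads ${modalias::7}. Shell variable names are case-sensitive, so unless a lower-case modalias is set elsewhere in mkinitrd (not visible in this excerpt), the test compares an empty string and never sets wait_for_scsi. A minimal sketch of the difference follows; both function names and the sample modalias value are made up for illustration.

```shell
# Both helpers are hypothetical; they only mimic the test in the excerpt.
# printf '%.7s' stands in for bash's ${var::7} so the sketch runs in plain sh.

# Stores the value as MODALIAS but tests the lower-case name, as in the excerpt.
prefix_test_as_written() (
    unset modalias        # assumption: no lower-case variable set elsewhere
    MODALIAS=$1
    wait_for_scsi=no
    if [ "$(printf '%.7s' "$modalias")" = "scsi:t-" ]; then
        wait_for_scsi=yes
    fi
    echo "$wait_for_scsi"
)

# Same test, reading the name that was actually assigned.
prefix_test_same_name() (
    MODALIAS=$1
    wait_for_scsi=no
    if [ "$(printf '%.7s' "$MODALIAS")" = "scsi:t-" ]; then
        wait_for_scsi=yes
    fi
    echo "$wait_for_scsi"
)
```

Calling both with the same illustrative value, such as 'scsi:t-0x00', shows whether the name mismatch alone is enough to suppress the wait.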

Comment 40 Gerwin Krist 2008-11-07 17:06:42 UTC

When installing the new kernel without the workaround from #24 the system will NOT boot. Installing it with the workaround does work. This is with a 3ware raid controller.

Comment 41 Bryan Bozwood-Davies 2008-11-10 13:43:12 UTC

I would like to add that this fault also occurs with my Adaptec 29160 (AIC-7xxx) based controller. Again using the workaround from #24 makes the system bootable again.

Comment 42 Patrick C. F. Ernzer 2008-11-10 18:22:32 UTC

FWIW, on my qla2xxx equipped machines mkinitrd-6.0.70-1 solves the problem. (10 successful boots on 2 different boxes)
This was a plain rawhide install with no changes apart from adjusting grub.conf so that I have a timeout and removing 'rhgb quiet', so that I could see in case of trouble.

But do keep in mind that there is nothing on the second of my 2 qla2xxx controllers, so this will introduce a delay of its own (as already said in Comment #37).
Sadly rawhide's multipath is giving me no joy today, so I cannot test with two controllers having disk access (the hardware is such that the second controller either has no access to anything, or it has access but always as a secondary path to the same disks).

Comment 43 Charlie Moschel 2008-11-10 21:18:29 UTC

(In reply to comment #42)
> FWIW, on my qla2xxx equipped machines mkinitrd-6.0.70-1 solves the problem. (10
> successful boots on 2 different boxes)

That's because qla2xxx is now special-cased in mkinitrd near line 1490, so that 'wait_for_scsi' is set, which will emit a call to stabilized() in your init file.

It seems that any scsi modules *not* special-cased there, and not detected by the modalias "scsi:t-" match, will not have *any* waiting at all (neither a call to stabilized() nor loading the scsi_wait_scan module).  This appears to include at least sym53c8xx, 3ware, AIC-7XXX and one or more Toshiba laptop models, from looking through bugzilla entries.

That pool of users will have a bad F10 experience: boot failure on a new install or upgrade.  I'd like to suggest again that F10 include the scsi_wait_scan module unconditionally (see comment #35).

Or, at least, expand the special-case module list in mkinitrd to include the known failures in BZ, so they too cause a stabilized() call to be emitted into init.  But that seems like a more fragile approach: how many more boot failures will be found after the F10 release?

Comment 44 Patrick C. F. Ernzer 2008-11-10 21:23:51 UTC

In addition to comment #42:

I decided to just live with mpath flakiness to wrap up this test. LVM is
intelligent enough to notice when PVs are duplicates, so I could determine that
the machine also boots fine when I have the somewhat shorter delay of probing both
qla2xxx HBAs (instead of getting a fail because the second is disconnected).

The machine still booted up fine 10 times in a row.

One thing of note is that all tests in this comment and in #42 printed "Could
not detect stabilization, waiting 10 seconds". I presume this is exactly as
intended.

Comment 45 Patrick C. F. Ernzer 2008-11-10 21:26:45 UTC

Crap, forgot one line:
There will be cases when my boxes will take more than 10 seconds. It's a corner case, as it requires most other boxes in the chassis to be doing a LIP UP/DOWN at the time the one box tries to boot, so I can live with Fedora not coming up, but it's not ideal. (As it's a case that in my experience happens once a year and I can work around it, I am not too fussed.)

Comment 46 Zak Peirce 2008-11-13 15:25:01 UTC

*** Bug 471093 has been marked as a duplicate of this bug. ***

Comment 47 Zak Peirce 2008-11-13 15:29:10 UTC

(In reply to comment #24)
> I have the same problem with a 3ware raid card (module: 3w-xxxx). Using the 
> 
> "
> /etc/sysconfig/mkinitrd with this:
> MODULES="scsi_wait_scan"
> "
> 
> The machine booted again

I do not have an /etc/sysconfig/mkinitrd

I changed line 69 in my /sbin/mkinitrd from 

wait_for_scsi="no"

to 

wait_for_scsi="yes"

and that took care of my problem

Comment 48 Gerwin Krist 2008-11-13 16:26:32 UTC

Zak: The file indeed does not exist by default. If you create it, it's working :)

Comment 49 Zak Peirce 2008-11-14 15:27:10 UTC

Gerwin,

Thanks for the info; the solution from #24 resolved the issue for me :)

Comment 50 Ugo Viti 2008-11-20 12:40:03 UTC

Hi,

This problem also happens when installing Fedora 10 (rawhide) onto a 3ware SATA array (installed 30 minutes ago using yesterday's rawhide image).

As reported in bug https://bugzilla.redhat.com/show_bug.cgi?id=462260,
adding MODULES="scsi_wait_scan" to the /etc/sysconfig/mkinitrd file and
rebuilding the initrd of the running kernel solved the problem.

Please make a patch upstream before the Fedora 10 release, or else many SCSI/SATA/RAID systems will be unbootable after a clean Fedora 10 install.

Best Regards

Comment 51 Bug Zapper 2008-11-26 03:46:50 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 52 Ed 2008-11-29 15:26:39 UTC

This bug still exists in the F10 release.  I suspect there is a sizable pool of systems unable to install.

Comment 53 Solomon Peachy 2008-12-01 13:00:15 UTC

This bug affected me during an F9-F10 upgrade.  The system was booting fine with the 2.6.27.5-37.fc9.i686 kernel but failed with the 2.6.27.5-117.fc10.i686 kernel.  The LVM scan was being kicked off before the SCSI busses had been enumerated.

Upon digging, it turns out that mkinitrd wasn't adding the scsi_wait_scan module at all because for some reason the $wait_for_scsi variable was never being set to 'yes' -- so by changing the default value of $wait_for_scsi to 'yes', I was able to get mkinitrd to generate a working initrd and now the system boots.  This is with mkinitrd-6.0.71-2.fc10.i386

/etc/modprobe.conf already has (among other things) these lines:

  alias scsi_hostadapter sym53c8xx
  alias scsi_hostadapter1 3w-xxxx
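
One way to confirm this kind of diagnosis is to look inside the generated initrd. The sketch below assumes the initrd is a gzip-compressed cpio archive whose top-level init script names the module when it loads it. init_waits_for_scsi is a hypothetical helper, not part of mkinitrd, and the paths in the usage notes are examples only.

```shell
# Hypothetical helper: succeeds if an unpacked initrd "init" script
# mentions scsi_wait_scan, i.e. the initrd waits for the SCSI bus scan.
init_waits_for_scsi() {
    grep -q 'scsi_wait_scan' "$1"
}

# Usage sketch (as root; the initrd name is an example only):
#   mkdir /tmp/initrd-check && cd /tmp/initrd-check
#   gzip -dc < /boot/initrd-2.6.27.5-117.fc10.i686.img | cpio -id
#   if init_waits_for_scsi ./init; then
#       echo "initrd waits for the SCSI scan"
#   else
#       echo "scsi_wait_scan is missing from this initrd"
#   fi
```

If the helper reports the module as missing, that matches the symptom described here: the LVM scan starts before the SCSI busses have been enumerated.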

Comment 54 Ed 2008-12-01 13:28:17 UTC

I was able to regen an initrd file by adding /etc/sysconfig/mkinitrd with MODULES="scsi_wait_scan"

My SCSI controller is an Adaptec SCSI/320 and this was a fresh install as opposed to an upgrade.

After regenerating the initrd file I was able to boot.
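
The workaround from comments #24, #50 and #54 can be sketched as a small script. This is only an illustration: add_wait_scan is a hypothetical helper name, and it assumes the config file either does not exist yet or does not already set MODULES to something else (an appended line would override an earlier MODULES assignment).

```shell
# Hypothetical helper: make sure the given mkinitrd config file asks for
# the scsi_wait_scan module. Does nothing if the module is already named.
add_wait_scan() {
    conf="$1"
    if [ -f "$conf" ] && grep -q 'scsi_wait_scan' "$conf"; then
        return 0
    fi
    echo 'MODULES="scsi_wait_scan"' >> "$conf"
}

# Usage sketch (as root), followed by a rebuild of the running kernel's
# initrd; -f lets mkinitrd overwrite the existing image:
#   add_wait_scan /etc/sysconfig/mkinitrd
#   mkinitrd -f /boot/initrd-"$(uname -r)".img "$(uname -r)"
```

Keeping the file path as a parameter makes it possible to try the helper on a scratch file before touching /etc/sysconfig/mkinitrd.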

Comment 55 Hans de Goede 2008-12-15 09:17:09 UTC

*** Bug 470726 has been marked as a duplicate of this bug. ***

Comment 56 Hans de Goede 2008-12-15 09:17:53 UTC

*** Bug 476472 has been marked as a duplicate of this bug. ***

Comment 57 Hans de Goede 2008-12-15 09:50:04 UTC

Hello world (aka all in the CC list)

First of all apologies for not responding to this bug for so long, we've been
swamped with other stuff. We understand this is a rather critical bug.

I've prepared a mkinitrd build, which we believe fixes this, you can find it
here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=73912

To give this version a try you need to download the:
mkinitrd-6.0.71-3.fc10.XXXX.rpm
nash-6.0.71-3.fc10.XXXX.rpm
libbdevid-python-6.0.71-3.fc10.XXXX.rpm
Files for your architecture from:
http://koji.fedoraproject.org/koji/buildinfo?buildID=73912

And then as root do:

KVER=######-####
rpm -Uvh mkinitrd-6.0.71-3.fc10.*.rpm nash-6.0.71-3.fc10.*.rpm \
  libbdevid-python-6.0.71-3.fc10.*.rpm
mv /boot/initrd-${KVER}.img /boot/initrd-${KVER}.img.old
mkinitrd /boot/initrd-${KVER}.img ${KVER}

Where KVER should be set to the kernel version-release for the non-bootable kernel you are trying to fix. This is the string between "initrd-" and ".img" when you do a:
ls /boot

You can find the version-release string for all installed kernels by doing:
rpm -q --qf "%{VERSION}-%{RELEASE}\n" kernel

This will print the version-release of each installed kernel, one per line.

If you do not have any bootable kernel at all, you can try to fix this by booting the install CD into rescue mode and then do:
chroot /mnt/sysimage
Before executing the commands above.

Please let us know if this fixes things for you (for those who have created an /etc/sysconfig/mkinitrd file, we want to know if it fixes things without needing that file).
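
The "string between initrd- and .img" rule for KVER can be sketched with plain shell parameter expansion. kver_from_initrd is a hypothetical helper name and the file name in the usage note is only an example. Note that initrd names may carry an architecture suffix (comment #53 mentions a 2.6.27.5-117.fc10.i686 kernel), so the value taken from /boot can be longer than the plain version-release that rpm prints.

```shell
# Hypothetical helper: print the kernel version string embedded in an
# initrd file name, e.g. /boot/initrd-<KVER>.img -> <KVER>.
kver_from_initrd() {
    name=${1##*/}          # drop any leading directory
    name=${name#initrd-}   # drop the "initrd-" prefix
    printf '%s\n' "${name%.img}"   # drop the ".img" suffix
}

# Usage sketch:
#   KVER=$(kver_from_initrd /boot/initrd-2.6.27.5-117.fc10.i686.img)
#   echo "$KVER"
```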

Comment 58 Mike Thompson 2008-12-15 11:30:48 UTC

Still no luck on my Toshiba L10 (See above)


(In reply to comment #57)

Comment 59 Hans de Goede 2008-12-15 12:13:22 UTC

(In reply to comment #58)
> Still no luck on my Toshiba L10 (See above)
> 

As adding scsi_wait_scan did not help for you, you are experiencing a different problem than the original bug reporter (and thus than what this bug is about).

Looking at your attached init file, you have noatime in the options field of your fstab line for /. This is known to cause issues with mkinitrd; try removing that option and rerunning mkinitrd.

Comment 60 Charlie Moschel 2008-12-15 12:45:29 UTC

(In reply to comment #57)

Thanks for working on this!

> If you do not have any bootable kernel at all, you can try to fix this by

Or you can add the kernel option "scsi_mod.scan=sync" to the command line

Comment 61 Julian G 2008-12-15 15:43:00 UTC

with the new builds it seems to boot without failures. :) tested with removed /etc/sysconfig/mkinitrd on 3ware 9650SE-LP2.

thanks

Comment 62 Michal Jaegermann 2008-12-15 18:31:32 UTC

Re comment #60:
> Or can add the kernel option "scsi_mod.scan=sync"

That may not help.  I was hit with this yesterday (see bug 476472 for details) in another setting, and in that setting there is no scsi_mod module around.  OTOH mkinitrd from updates-testing (mkinitrd-6.0.71-3.fc10) does produce an initrd with scsi_wait_scan insertions, and that is what is needed.

Comment 63 Charlie Moschel 2008-12-15 23:42:10 UTC

(In reply to comment #62)
> Re comment #60:
> > Or can add the kernel option "scsi_mod.scan=sync"
> 
> That may not help.  I was hit with this yesterday (see bug 476472 for details)
> in  another settings and in that setting there is no scsi_mod module around. 

scsi_mod is built in to the kernel now, and that was one of the problems: mkinitrd looked for that module to trigger using scsi_wait_scan.  When the module disappeared, so did the waiting mechanism, since the configs were using async scsi scan.  The line above is a kernel option, not a module option.

> OTOH mkinitrd from updates-testing (mkinitrd-6.0.71-3.fc10) does produce initrd
> with scsi_wait_scan insertions and that is what is needed.

That's what's important, I can confirm it works here too.
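
Since "scsi_mod.scan=sync" is a kernel command-line option, one way to check whether a boot actually used it is to inspect the command line. The following is a minimal sketch; cmdline_has_sync_scan is a hypothetical helper name, and it only looks at the string it is given.

```shell
# Hypothetical helper: succeeds if a kernel command line (passed as one
# string) requests synchronous SCSI scanning via scsi_mod.scan=sync.
cmdline_has_sync_scan() {
    case " $1 " in
        *" scsi_mod.scan=sync "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage sketch on a running system:
#   if cmdline_has_sync_scan "$(cat /proc/cmdline)"; then
#       echo "synchronous SCSI scan requested"
#   fi
```

Padding the string with spaces before matching keeps the check from accepting a longer option that merely starts with the same text.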

Comment 64 Mack Benz 2008-12-15 23:45:08 UTC

Quote (originally posted by Sauron):
> Boot with the install CD, go into rescue mode, click 'no' to network interfaces, click 'continue' to load the existing filesystem. When the # prompt comes up type:
> chroot /mnt/sysimage
> 
> Please provide us the output of the command 'blkid' and the contents of your current /etc/fstab file
> 
> Then we can look and see if that is in fact your problem. There are a number of things that can create similar looking symptoms...

Okay, a bit of background.

I solved the problem on a fresh install of FC10 without LVM by using the scsi wait command. But then I realized I want to be able to add hard drives later and make my LVM bigger without formatting or losing any files.

Since I created a fresh install with an LVM group, I haven't been able to get my OS to boot using the scsi wait command.


So, attached is a jpeg of my screen w/

# blkid


Again, I've been reading all of the threads here and on bugzilla, but for actually getting around the LVM part I haven't found anything specific enough to my situation.

Thanks for the continued help gang, it's gratefully appreciated 

http://forums.fedoraforum.org/attachment.php?attachmentid=17629&d=1229383080

Comment 65 Hans de Goede 2008-12-16 15:56:39 UTC

(In reply to comment #64)

I do not see what this comment has to do with this report. Please file a separate bug report and clearly and precisely explain:
1) what exactly you are trying to accomplish
2) what happens (which messages, etc)
3) what you would have wanted to happen

Comment 66 Hans de Goede 2008-12-16 16:06:16 UTC

*** Bug 466534 has been marked as a duplicate of this bug. ***

Comment 67 Patrick C. F. Ernzer 2008-12-16 16:49:32 UTC

note the below has nothing to do with previous comments of mine (these were QLA equipped machines, the below is on a sym53c8xx equipped machine)

(In reply to comment #57)
> I've prepared a mkinitrd build, which we believe fixes this, you can find it
> here: http://koji.fedoraproject.org/koji/buildinfo?buildID=73912
[...]
> Please let us know if this fixes things for you (for those who have created an
> /etc/sysconfig/mkinitrd file, we want to know if it fixes things without
> needing that file).

freshly kickstarting a sym53c8xx equipped machine with the instructions from comment #57 added to %post works. Thanks for fixing this.

[start quote from %post section]
# Bug 466607
mkdir /root/bz466607
cd /root/bz466607
wget ftp://server.testing.local/pub/pcfe/bz466607/*

rpm -Uvh mkinitrd-6.0.71-3.fc10.*.rpm nash-6.0.71-3.fc10.*.rpm \
  libbdevid-python-6.0.71-3.fc10.*.rpm

for KVER in `ls -1 /lib/modules/`
do
  mv /boot/initrd-${KVER}.img /boot/initrd-${KVER}.img.old
  mkinitrd /boot/initrd-${KVER}.img ${KVER}
done
[end]
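
A more cautious variant of the loop above can first print what it would do, and skip kernels that have no initrd to move. This is only a sketch: plan_initrd_rebuild is a hypothetical helper name, and it prints commands rather than running them.

```shell
# Hypothetical dry-run helper: print the commands that would rebuild the
# initrd for every kernel under MODULES_DIR that already has an initrd in
# BOOT_DIR. Nothing is moved or rebuilt.
plan_initrd_rebuild() {
    boot_dir="$1"
    modules_dir="$2"
    for dir in "$modules_dir"/*/; do
        [ -d "$dir" ] || continue
        kver=$(basename "$dir")
        img="$boot_dir/initrd-$kver.img"
        [ -f "$img" ] || continue     # no initrd for this kernel, skip it
        echo "mv $img $img.old"
        echo "mkinitrd $img $kver"
    done
}

# Usage sketch: review the output, then run the printed commands as root.
#   plan_initrd_rebuild /boot /lib/modules
```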

Comment 68 Bryan Bozwood-Davies 2008-12-17 11:53:00 UTC

With the new builds from comment #57 it seems to boot without failure.
Tested without /etc/sysconfig/mkinitrd on AIC7xxx.

thanks

Comment 69 Fedora Update System 2008-12-18 00:41:27 UTC

mkinitrd-6.0.71-3.fc10 has been pushed to the Fedora 10 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 70 Michal Jaegermann 2008-12-19 16:51:40 UTC

> mkinitrd-6.0.71-3.fc10 has been pushed to the Fedora 10 stable repository.

Maybe it was pushed; for sure it was announced (FEDORA-2008-11149) as something which "can be installed", but so far it does not look like it made it there.
Other packages from the same batch of announcements, and even some which were not
advertised at all (dhcp-4.0.0-33.fc10, iscsi-initiator-utils-6.2.0.870-1.0.fc10), did show up, but mkinitrd-6.0.71-3.fc10 and its nash are at this moment not in the update repos.

Comment 71 Mike Thompson 2008-12-21 00:40:41 UTC

noatime was NOT set in my fstab, so I don't know why my init file showed that it was (????)
However I rebuilt the machine using the released version of Fedora 10 and the problem has gone away.

Can noatime be set elsewhere??

  Thanks for your help.

  MikeT


(In reply to comment #59)
> (In reply to comment #58)
> > Still no luck on my Toshiba L10 (See above)
> > 
> 
> As adding scsi_wait_scan did not help for you, you are experiencing a different
> problem then the original bug reporter (and thus then what this bug is about).
> 
> Looking at your attached init file you have noatime in the options field of
> your fstab line for /, this is known to cause issues with mkinitrd, try
> removing that option and rerunning mkinitrd.

Comment 72 Braden McDaniel 2008-12-21 18:07:09 UTC

*** Bug 476337 has been marked as a duplicate of this bug. ***

Comment 73 Michal Jaegermann 2008-12-21 18:11:58 UTC

Re comment #69:
The next batch of updates showed up in repositories and mkinitrd-6.0.71-3.fc10, contrary to what "Fedora Update System" says, is still not there.

Comment 74 Hans de Goede 2008-12-23 19:07:34 UTC

(In reply to comment #73)
> Re comment #69:
> The next batch of updates showed up in repositories and mkinitrd-6.0.71-3.fc10,
> contrary to what "Fedora Update System" says, is still not there.


We had a hiccup in the updates system; we've fixed things and the update is now available. We believe we have tracked down the cause and this will not happen again.

Regards,

Hans

Comment 75 Dmitry Burstein 2008-12-27 18:54:02 UTC

I do have the latest version of mkinitrd/nash (6.0.71-3.fc10), but it didn't solve the problem. The latest kernel I can boot on my ICH5R system is still the 2.6.27.5-41.fc9.i686 one.

Comment 76 Hans de Goede 2008-12-30 20:56:07 UTC

*** Bug 472989 has been marked as a duplicate of this bug. ***

Comment 77 Andre Robatino 2009-01-18 12:19:41 UTC

Got this from a clean Rawhide install from the Jan. 17 x86_64 boot.iso as a guest OS in VirtualBox.  The install completes successfully (unlike the Jan. 16 image) but I get this error shortly after seeing the Plymouth progress bar.  Should this be reopened?

  Reading all physical volumes.  This may take a while...
  Found volume group "VolGroup" using metadata type lvm2
  2 logical volume(s) in volume group "VolGroup" now active
mount: error mounting /dev/root on /sysroot as ext3: No such file or directory

Comment 78 Hans de Goede 2009-01-19 13:59:44 UTC

(In reply to comment #77)
> Got this from a clean Rawhide install from the Jan. 17 x86_64 boot.iso as a
> guest OS in VirtualBox.  The install completes successfully (unlike the Jan. 16
> image) but get this error shortly after seeing the Plymouth progress bar. 
> Should this be reopened?
> 

No. Your disks are being found, since it finds your volume group; what you are seeing is a different bug.

Comment 79 Hans de Goede 2009-02-17 12:42:50 UTC

*** Bug 477596 has been marked as a duplicate of this bug. ***

Comment 80 Joachim Frieben 2009-02-20 16:23:33 UTC

- I have installed F11 Alpha from the live CD to a system with a SCSI
  controller AIC-7880, and it fails to find the volume group on the
  attached disk upon reboot as described above.
- 2nd attempt installing a minimum F9 with a later upgrade to current
  rawhide:
  * Latest F9 kernel 2.6.27.x succeeds installed from F9/with F9 updates.
  * After upgrading to "rawhide" and installing F10 and "rawhide" kernels
    no volume group is found.
  * On an IDE based system, F11 Alpha installed without similar problems.
  * Install from the F10 live CD also fails.

Comment 81 Hans de Goede 2009-02-22 14:18:37 UTC

(In reply to comment #80)
> - I have installed F11 Alpha from the live CD to a system with a SCSI
>   controller AIC-7880, and it fails to find the volume group on the
>   attached disk upon reboot as described above.

There is no comment from you above, so I assume you mean as with the other reporters. Given that this bug has been fixed in rawhide for a while now, your case clearly is different from the issue handled in this bug. Please file a new bug, with a complete description of what *exactly* you are seeing as messages, what kind of install you *exactly* did, what your system *exactly* looks like, etc.

Comment 82 Joachim Frieben 2009-02-24 14:25:09 UTC

(In reply to comment #81)
Device is an AIC-7880 onboard controller (INTEL PR440FX).
- Official F10 live CD fails.
- live CD spin from current "rawhide" tree succeeds.
- updating to mkinitrd-6.0.71-3.fc10 from a rescue environment and
  installing a recent update kernel in order to create a new working
  initrd also restores a working boot environment. Too bad
  there is no updated install image, not even an updates.img file.

Comment 83 Brent Rowse 2009-05-03 19:33:14 UTC

I just attempted to upgrade to FC10.  I am using a 3Ware RAID controller.  After the upgrade I experienced a problem that seems to match this bug.  I have tried to use mkinitrd to create a new image file, but it will not create a new file.  I have also directed the output to a log file and the file comes up blank.  Any suggestions?

Comment 84 Braden McDaniel 2009-05-03 23:02:15 UTC

Can you boot with the kernel option described in comment #60?

If not, you're probably observing a different problem.

Also, in my experience (using a 3ware 9550), this has been fixed.  Perhaps you've just been unable to update.  But if you *have* updated and things still aren't working, I suggest filing a new bug.

Comment 85 Brent Rowse 2009-05-04 23:43:37 UTC

Yes, I was able to add the kernel option described in comment #60 to the kernel command line and I am now up and running.  Good suggestion.
