Bug 466534

Summary: can't find /dev/root - scsi + smp ???
Product: [Fedora] Fedora
Reporter: John Ellson <john.ellson>
Component: mkinitrd
Assignee: Peter Jones <pjones>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: medium
Version: 10
CC: amann, bertrand.benoit, dcantrell, eharrison, fred99, hdegoede, jdunn, katzj, kernel-maint, mwc-250sav, pallas, pjones, tilmann, tjb, wtogami, yaneti
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-12-10 15:27:03 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 438944    
Attachments:
Description                                                     Flags
final screen of failed boot sequence                            none
Full text capture of failed boot                                none
Full text capture of failed boot - mkinitrd-6.0.70-1.fc10.i386  none
make scsi_scan_wait be used by default                          none
Fix to mkinitrd to wait for scsi                                none

Description John Ellson 2008-10-10 15:18:35 EDT
Description of problem:
I have two systems now that won't boot recent kernels.  They hang with:
  Can't find /dev/root

My two systems are:
  - A quad-cpu, x86_64, with 6 SCSI drives and / on /dev/sda1.
  - A dual-cpu, i686, with / on software raid over 2 SCSI drives.

I think there is an attempt to mount / before the SCSI initialization has completed?

There is also BZ #462233 which looks like the same problem, also SCSI + SMP.

BZ #459109 might be the same problem, but I can't tell if it's SMP.



Version-Release number of selected component (if applicable):
kernel-2.6.27-0.398.rc9.fc10.x86_64

The kernel on the F10-Beta-x86_64-Live_CD released Oct 8th, 2008.
 
Any kernel after about kernel-2.6.26.3-14.fc8

How reproducible:
100%

Steps to Reproduce:
1. reboot
2.
3.
  
Actual results:
Boot hangs looking for /dev/root

Expected results:


Additional info:
Comment 1 Peter Jones 2008-10-10 15:25:49 EDT
Can you add a more complete log, please?
Comment 2 John Ellson 2008-10-10 15:42:32 EDT
Yes.. give me a few hours.

BTW  This might be caused by:

   BZ #461850
Comment 3 John Ellson 2008-10-10 21:12:52 EDT
Well, that was a really frustrating exercise!  What's the trick to getting a serial connection running these days?  All the Howtos are useless. /etc/initrd has changed, /var/lock has changed, ....   I had minicom to minicom working briefly, then nothing!  I couldn't get them to do it again.  Is something else locking up /dev/ttyS0?
-----------------

Anyway, I took some pics instead.   The attached picture shows the boot failing, followed by sdb waking up.  /dev/sdb is the second drive in the striped pair forming /dev/md0 with the root filesystem on it.

Perhaps something is waiting only for the first scsi drive response?

My other system at work isn't raid, but it does have 6 scsi drives, so again if something is waiting for the first response only it would fail.
Comment 4 John Ellson 2008-10-10 21:13:52 EDT
Created attachment 320082 [details]
final screen of failed boot sequence
Comment 5 John Ellson 2008-10-11 14:57:52 EDT
Created attachment 320104 [details]
Full text capture of failed boot

[Isn't there a standard that requires ttyS0 on the top connector? !!!  And where is my DB9 breakout box? ]

Finally got the serial cable to work.  Here is the capture of the failed boot.
Comment 6 John Ellson 2008-10-11 20:23:10 EDT
bug #466607 might be the same problem
Comment 7 John Ellson 2008-10-11 20:42:46 EDT
[OK, that's how to get bugs to hyperlink...]

   Possibly the same problem:
        bug #466607
        bug #464636
        bug #462233
        bug #461850
        bug #459109
        bug #454663
Comment 8 John Ellson 2008-10-13 13:55:41 EDT
Going by <https://fedoraproject.org/wiki/QA/ReleaseCriteria>
I propose that this bug is an F10 blocker.
Comment 9 John Ellson 2008-10-17 15:33:43 EDT
Still doesn't boot:
    kernel-2.6.27.2-23.rc1.fc10.i686
    mkinitrd-6.0.67-1.fc10.i386
Comment 10 John Ellson 2008-10-29 22:33:34 EDT
Still doesn't boot:
    kernel-2.6.27.4-58.fc10.i686
    mkinitrd-6.0.68-1.fc10.i386
Comment 11 Thomas J. Baker 2008-10-31 10:01:26 EDT
I just did a clean install of rawhide on a Dell Precision 670 and had the same problem. Booting rescue and remaking the initrd worked around the problem.
Comment 12 Thomas J. Baker 2008-10-31 15:54:21 EDT
I should have said remaking the initrd with the --with=scsi_wait-scan worked around the problem.
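
For anyone trying this workaround, a minimal sketch of the rebuild command follows. It only prints the command rather than running it. The `-f` (overwrite) flag and the `/boot/initrd-<version>.img` path are assumptions about a typical Fedora setup, and `initrd_rebuild_cmd` is a hypothetical helper name. The module spelling is copied from this comment.

```shell
#!/bin/sh
# Sketch only: prints the rebuild command instead of running it.
# Assumption: the initrd image lives at /boot/initrd-<version>.img.

initrd_rebuild_cmd() {
    kver="$1"   # kernel version string, e.g. the output of `uname -r`
    # -f overwrites the existing image; --with= forces the extra module in.
    echo "mkinitrd -f --with=scsi_wait-scan /boot/initrd-${kver}.img ${kver}"
}

# Show the command for the currently running kernel.
initrd_rebuild_cmd "$(uname -r)"
```

From a rescue environment the version passed in would be that of the installed kernel, not the rescue kernel, and the printed command would be run as root inside the installed system.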
Comment 13 John Ellson 2008-10-31 21:38:33 EDT
Confirming that:
    mkinitrd --with=scsi_wait-scan ...
worked for me too.
    kernel-2.6.27.4-68.fc10.i686
    mkinitrd-6.0.69-1.fc10.i386


How does this help someone installing from a Fedora-10-Live DVD ?  Can this option be provided as a kernel option?
Comment 14 John Ellson 2008-11-06 12:10:05 EST
I was hoping that this would fix the problem:

    mkinitrd-6.0.70-1.fc10
    ----------------------
    * Tue Nov  4 17:00:00 2008 Peter Jones <pjones redhat com> - 6.0.70-1
      ...
    - Make scsi waiting happen on any device with a scsi modalias.

but no such luck.

Using:
    kernel-2.6.27.4-79.fc10.i686
    mkinitrd-6.0.70-1.fc10.i386

The "mkinitrd --with=scsi_wait-scan ..." still provides a workaround.

Capture of failed boot coming next...
Comment 15 John Ellson 2008-11-06 12:11:50 EST
Created attachment 322760 [details]
Full text capture of failed boot - mkinitrd-6.0.70-1.fc10.i386
Comment 16 John Ellson 2008-11-06 12:36:08 EST
I don't fully understand the implications of ".. with a scsi modalias" ?

In case it's relevant, my /etc/modprobe.conf contains:

    alias eth0 e100
    alias scsi_hostadapter aic7xxx
    install snd-emu10k1 /sbin/modprobe --ignore-install snd-emu10k1 && /usr/sbin/alsactl restore >/dev/null 2>&1 || :
    alias char-major-81 bttv
    alias usb-controller uhci-hcd
Comment 17 Jesse Keating 2008-11-10 17:51:39 EST
I'm kicking this one over to F10Target.
Comment 18 Thomas J. Baker 2008-11-11 16:00:19 EST
I just had the problem again with the latest kernel update. Is this not a blocker because it's not happening to everyone with scsi/lvm?
Comment 19 Jesse Keating 2008-11-11 16:48:34 EST
That's correct.  So far, very few people are reporting this.  We'd take a fix for it, but I don't believe we'd delay the release for it.
Comment 20 Charlie Moschel 2008-11-13 19:55:55 EST
Created attachment 323519 [details]
make scsi_scan_wait be used by default

> That's correct.  So far, very few people are reporting this.  We'd take a fix
> for it, but I don't believe we'd delay the release for it.

I respectfully disagree; turning on SCSI_SCAN_ASYNC and removing scsi_wait_scan from mkinitrd is a bad combination that has caused quite a few problems in Fedora:

bug #471903
bug #470726
bug #466607
bug #466071
bug #466534
bug #465225
bug #454663

And RHEL:
bug #464636
bug #461850
bug #459109

These are only the ones assigned to mkinitrd; there are probably others that didn't (yet) get correctly assigned.  Some bugs above have many 'me too's.

The attached patch adds scsi_wait_scan by default.  This is how it used to be done, before scsi_mod was built into the kernel (see bug #454663 for the first report).

Another way to fix this is to add more tests to trigger the use of nash's stabilized() call.  But LKML guidance suggests scsi_wait_scan is the right way (TM), and the current stabilized() call does not wait long enough in at least a couple of cases (bug #461850 and bug #466607).
Comment 21 Charlie Moschel 2008-11-13 19:59:31 EST
(In reply to comment #20)
> bug #471903

sorry, that should be bug #471093
Comment 22 Lubomir Bulej 2008-11-18 11:16:38 EST
Created attachment 323920 [details]
 Fix to mkinitrd to wait for scsi

Hello,

the attached patch fixes mkinitrd for me, but serves mainly to show the main
reason, which is twofold. First, there is a case typo which results in SCSI
devices not being recognized, and consequently the variable wait_for_scsi is not
set to "yes".

The second cause is due to emitmodules() being called twice (once for
GRAPHICSMODS modlist, and once for MODULES modlist -- the default). The problem
is that when the GRAPHICSMODS modlist is being processed, the wait_for_scsi
variable will get unset and will not be used later, when MODULES modlist is
being handled. 

The patch fixes the typo and prevents the wait_for_scsi variable from being unset
while the GRAPHICSMODS modlist is handled.

I consider it rather dirty, but well, so is the emitmodules() function, which has
side effects on global variables and is suddenly (I assume with the advent of
plymouth) called twice.
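
The second cause can be illustrated with a simplified model. This is a hypothetical sketch, not the real mkinitrd source: only the names emitmodules, wait_for_scsi, GRAPHICSMODS and MODULES come from the description above. The function bodies, the `_buggy`/`_fixed` suffixes, the `run_twice` driver, and the assumption that the wait is emitted only alongside the MODULES list are invented for illustration.

```shell
#!/bin/sh
# Hypothetical, simplified model of the double-call problem.
# Not the real mkinitrd code.

emitmodules_buggy() {
    echo "emit modules from $1"
    if [ "$1" = "MODULES" ] && [ "${wait_for_scsi:-}" = "yes" ]; then
        echo "emit wait for SCSI scan"
    fi
    unset wait_for_scsi      # side effect on a global, on every call
}

emitmodules_fixed() {
    echo "emit modules from $1"
    if [ "$1" = "GRAPHICSMODS" ]; then
        return 0             # leave the flag alone for this list
    fi
    if [ "${wait_for_scsi:-}" = "yes" ]; then
        echo "emit wait for SCSI scan"
    fi
    unset wait_for_scsi
}

run_twice() {
    wait_for_scsi="yes"      # assume SCSI detection set the flag earlier
    "$1" GRAPHICSMODS        # first call
    "$1" MODULES             # second call, where the flag is needed
}

echo "--- buggy ---"
run_twice emitmodules_buggy
echo "--- fixed ---"
run_twice emitmodules_fixed
```

In the buggy variant the first call clears the flag, so the second call never emits the wait. The fixed variant skips the unset while handling GRAPHICSMODS, which matches the approach the patch describes.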
Comment 23 Dr. Tilmann Bubeck 2008-11-19 13:07:11 EST
Please fix this problem. Keep in mind that an unfixed version of Fedora cannot be installed on a SCSI system, and probably other systems. Offering an updated version through an online repository is useless, because the system is not able to boot after the initial installation. I suffered from this problem too and found "scsi_wait-scan" to be a workaround that finally let me use the system.

This will affect quite a few people.
Comment 24 Andrew Mann 2008-11-23 14:25:07 EST
I just went through a several hour process of diagnosing and hacking around this on an install of the fedora 10 pre-release. Similar system configuration to what has already been reported:  dual core proc, scsi controller (3ware 9550SX), lvm.

I'd like to add in the factors that make this hard to diagnose:
- The new graphical boot that's enabled by default goes to 100% (or a full white progress bar) and stops there.  Hitting esc to bring up the text display just shows a blank screen through the whole process - I presume due to the 'quiet' kernel parameter.
- The default grub.conf has a timeout of 0, so the kernel boots immediately giving you no time to alter the kernel boot parameters (there might be a key you can hold during boot to stop this, but my grub-fu is weak).
- Once you do strip rhgb and quiet from the boot parameters, it's not very obvious that it's the delayed scsi device identification that causes lvm to fail. Typically the first checks are to make sure the lvm structures are ok on disk, and that the appropriate drivers are being loaded in the initrd.
- The real difficulty here will be that there is no obvious string to search on to find this problem. Only after diagnosing the problem was I able to relate it to this bug.  To most, this will be "fedora 10 doesn't boot after install," a rather vague problem.

On the up side, I now know more about the fedora boot process :)
Comment 25 Bug Zapper 2008-11-25 22:45:24 EST
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 26 Eric Harrison 2008-12-09 20:40:25 EST
This fixes this issue for Adaptec SCSI cards:

# rpm -q mkinitrd
mkinitrd-6.0.71-2.fc10.i386


--- /sbin/mkinitrd.orig	2008-12-09 17:00:49.000000000 -0800
+++ /sbin/mkinitrd	2008-12-09 17:30:54.000000000 -0800
@@ -1518,6 +1518,7 @@
                 -o "BusLogic" == "$module" \
                 -o "mptbase" == "$module" \
                 -o "pata_" == "${module::5}" \
+                -o "aic7" == "${module::4}" \
                 -o "qla" == "${module::3}" \
                 -o "sata_" == "${module::5}" \
                 ]; then
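
The added line compares the first four characters of the module name, so it would match aic7xxx and aic79xx alike. A rough sketch of the same prefix tests, written with `case` patterns instead of the `${module::4}` substring expansion, is below. `needs_scsi_wait` is a hypothetical name, and only the drivers visible in the hunk are listed; the real condition in mkinitrd is longer.

```shell
#!/bin/sh
# Sketch of the prefix matching in the hunk above.
# needs_scsi_wait is a hypothetical helper, not a mkinitrd function.

needs_scsi_wait() {
    case "$1" in
        BusLogic|mptbase) return 0 ;;            # exact names from the hunk
        pata_*|sata_*|qla*|aic7*) return 0 ;;    # prefixes; aic7* is the added line
        *) return 1 ;;
    esac
}

for m in aic7xxx aic79xx e100; do
    if needs_scsi_wait "$m"; then
        echo "$m: wait for SCSI scan"
    else
        echo "$m: no wait"
    fi
done
```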
Comment 27 Warren Togami 2008-12-09 20:49:10 EST
http://kojipkgs.fedoraproject.org/packages/mkinitrd/6.0.71/3.fc10/
This was fixed in a more generic way (without hard coding controller names) in this build.  Please test it and report back.
Comment 28 John Ellson 2008-12-10 02:02:24 EST
Works for me.
    kernel-2.6.28-0.121.rc7.git5.fc11.i686
    mkinitrd-6.0.73-5.fc11.i386


Do you want the console log?
Comment 29 Charlie Moschel 2008-12-10 08:19:10 EST
Easiest path forward I see for a new install affected by this:

* Symptom: F10 installs OK, but your SCSI system won't boot after install is done.

* Workaround: 
  - Hit ESC (?) early enough to interrupt the boot; 
  - Add "scsi_mod.scan=sync" to the kernel command line,
  - After boot and firstboot complete, update mkinitrd
  - After updating mkinitrd, you must rebuild your /boot/initrd (run /usr/libexec/plymouth/plymouth-update-initrd as root).

Could this be added to the common bugs page at fedoraproject.org?
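
To confirm that the boot parameter from the workaround actually took effect, one option is to look at the running kernel's command line. A minimal sketch follows; `has_sync_scan` is a hypothetical helper name, and reading /proc/cmdline assumes a normally mounted /proc.

```shell
#!/bin/sh
# Sketch: report whether a kernel command line requests synchronous SCSI scanning.

has_sync_scan() {
    # $1: kernel command line text, e.g. the contents of /proc/cmdline
    case " $1 " in
        *" scsi_mod.scan=sync "*) return 0 ;;
        *) return 1 ;;
    esac
}

if [ -r /proc/cmdline ] && has_sync_scan "$(cat /proc/cmdline)"; then
    echo "scsi_mod.scan=sync is active"
else
    echo "scsi_mod.scan=sync not found on the kernel command line"
fi
```

Once the updated mkinitrd is installed and the initrd has been rebuilt, the parameter should no longer be needed.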
Comment 30 Eric Harrison 2008-12-10 11:32:22 EST
(In reply to comment #27)
> http://kojipkgs.fedoraproject.org/packages/mkinitrd/6.0.71/3.fc10/
> This was fixed in a more generic way (without hard coding controller names) in
> this build.  Please test it and report back.


Confirmed that mkinitrd-6.0.71-3.fc10 works for me.
Comment 31 Dave Jones 2008-12-10 15:27:03 EST

*** This bug has been marked as a duplicate of bug 470628 ***
Comment 32 Hans de Goede 2008-12-16 11:06:16 EST

*** This bug has been marked as a duplicate of bug 466607 ***