Bug 473305

Summary: Booting from SCSI/RAID device fails.
Product: Fedora
Reporter: Phil Bayfield <phil>
Component: initscripts
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: high
Version: 10
Hardware: All
OS: Linux
CC: bobgus, csnook, davea, davej, humpf, kernel-maint, LaKing, marco.crosio, mark, mkoles2, notting, quintela, russ+bugzilla-redhat, schasj, schorschi, sergio.pasra, sirlight, tristan.santore
Last Closed: 2008-12-09 19:12:19 UTC

Description Phil Bayfield 2008-11-27 16:07:14 UTC
SCSI/RAID devices are unable to boot Fedora Core 10

Error given is: cannot mount /dev/root onto /sysroot

Output such as the following is also shown, then the system hangs:

sd 2:0:2:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
sd 2:0:2:0: [sdc] 143666192 512-byte hardware sectors (73557 MB)
sd 2:0:2:0: [sdc] Write Protect is off
sd 2:0:2:0: [sdc] Mode Sense: ed 00 10 08
sd 2:0:2:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
sdc: sdc1
sd 2:0:2:0: [sdc] Attached SCSI disk
sd 2:0:2:0: Attached scsi generic sg4 type 0

This seems to affect many different RAID controllers and both the i386 and x86_64 releases; this forum post has more details:

http://forums.fedoraforum.org/showthread.php?t=205115

Comment 1 Jon 2008-11-27 23:38:06 UTC
I can verify that this is a bug on my iSeries 345 server.

Had to run the mkinitrd --with=scsi_wait_scan command to add the module into the initrd image before I could boot the newly installed F10; spelled out, the rebuild looks roughly like the sketch below.
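A minimal sketch of that rebuild (the kernel version string is illustrative; -f forces overwriting the existing image):

  mkinitrd -f --with=scsi_wait_scan /boot/initrd-2.6.27.5-117.fc10.x86_64.img 2.6.27.5-117.fc10.x86_64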

Jon

Comment 2 Steve Schaeffer 2008-11-30 14:56:44 UTC
I'm seeing the same problem on a Dell Latitude C840 with IDE drive. The mkinitrd --with=scsi_wait_scan workaround provides no relief on this system.

Comment 3 Maxim Kolesnikov 2008-12-01 20:27:48 UTC
I can confirm that this bug manifests itself on my Dell Precision 530 with the SCSI drive. Upgrade from Fedora 9 to Fedora 10 was done using yum. It boots fine when I choose the kernel from fc9.

Comment 4 Matt Castelein 2008-12-04 00:06:20 UTC
I have the same issue as far as I can tell, on an Adaptec RAID 3805. Upgraded to 10, rebooted, nothing... Trudged down to the machine and hooked up a monitor to find this:

sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI removable disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:1:0: [sdb] 3900682240 512-byte hardware sectors (1997149 MB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
 sdb: sdb1
sd 0:0:1:0: [sdb] Attached SCSI removable disk
sd 0:0:1:0: Attached scsi generic sg1 type 0
scsi 0:1:0:0: Attached scsi generic sg2 type 0
scsi 0:1:1:0: Attached scsi generic sg3 type 0
scsi 0:1:3:0: Attached scsi generic sg4 type 0
scsi 0:1:4:0: Attached scsi generic sg5 type 0
scsi 0:1:5:0: Attached scsi generic sg6 type 0
scsi 0:1:6:0: Attached scsi generic sg7 type 0
scsi 0:1:7:0: Attached scsi generic sg8 type 0

[hangs here]

This is with 2.6.27.5-117.fc10.x86_64; the previous 2.6.27.5-41.fc9.x86_64 works fine.

Comment 5 marco crosio 2008-12-07 08:06:00 UTC
This bug has been triaged :)

Comment 6 Tristan Santore 2008-12-07 20:42:51 UTC
I can also confirm this bug on a system with an Adaptec 5805 card.
Same output as seen by Matt Castelein, booted without the "quiet" kernel arg.
I fixed the issue by rebuilding the initrd with the --with=scsi_wait_scan arg, as stated above.

To fix this issue, boot from live media:
su -                     (become root on the live system)
vgchange -ay             (activate the LVM volumes, if used)
mkdir /mnt/sysimage
mount -t ext3 /dev/VGhere/device /mnt/sysimage
If you are using multiple mount points/partitions, mount them all under /mnt/sysimage, e.g.:
mount -t ext3 /dev/sdXX /mnt/sysimage/boot   (where sdXX is the /boot partition)
mount --bind /dev /mnt/sysimage/dev
mount --bind /proc /mnt/sysimage/proc
mount --bind /sys /mnt/sysimage/sys
chroot /mnt/sysimage
cd /boot                 (the initrd images live here)
mv initrd-2.6.27.7-130.fc10.x86_64.img initrd-2.6.27.7-130.fc10.x86_64.img.old
mkinitrd --with=scsi_wait_scan initrd-2.6.27.7-130.fc10.x86_64.img 2.6.27.7-130.fc10.x86_64
exit                     (leave the chroot)
reboot
Pray!
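To check that the module actually landed in the new image, you can list the initrd contents; a quick sketch, assuming the default image path (F10 initrds are gzipped cpio archives):

  zcat /boot/initrd-2.6.27.7-130.fc10.x86_64.img | cpio -t | grep scsi_wait_scan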

Please note I had previously updated the kernel from the testing repo, and the same issue persisted, with the SCSI scan sitting there waiting. Please amend the initrd and kernel version strings above to match yours.

I hope this helps some of you out there. I had some issues with the live-usb/cd media, that is, it was unstable. Try a few times if necessary.

Thank you to Jon and also to Kevin Fenzi.

Comment 7 Dave Jones 2008-12-09 19:12:19 UTC

*** This bug has been marked as a duplicate of bug 470628 ***

Comment 8 Matt Castelein 2008-12-09 23:37:44 UTC
(In reply to comment #7)
> 
> *** This bug has been marked as a duplicate of 470628 ***

That bug doesn't look at all like what I'm seeing, but I'll be patient and wait for a fix.

Comment 9 Mark Mielke 2008-12-09 23:51:44 UTC
I agree - why is this a duplicate? The other bug looks like it is about the boot taking 10+ seconds longer, whereas this bug is that the boot is not waiting long enough. Dave: would you be able to provide an explanation that would comfort us?

I also experience this bug - with or without LVM - and the scsi_wait_scan fix worked for me immediately.

Comment 10 Matt Castelein 2008-12-09 23:56:38 UTC
The fix from the other bug (rebuild the initrd) did in fact fix this for me... I guess I was mistaken.

Comment 11 Jon 2008-12-10 00:04:41 UTC
@Mark Mielke

Aye Mark, this is NOT a duplicate of 470628 by the description of the bug, unless someone explains WHY it is a duplicate.

Jon

Comment 12 Tristan Santore 2008-12-10 00:24:22 UTC
I think I can shine a bit of light on the issue. If you look at post 36 in https://bugzilla.redhat.com/show_bug.cgi?id=470628
you can see that there is an IF block; the same block is near the top of mkinitrd as shipped in F10. I assume that because it doesn't find the right module,
it doesn't set wait_for_scsi and doesn't call emit "stabilized --hash --interval 250 /proc/scsi/scsi", hence the boot sits there after listing all the sg devices
but does not proceed, simply because it can't find the right device nodes; after that it says it can't switch root. So the mkinitrd script needs changing to rectify this issue. If you check the Fedora 8 mkinitrd, the IF block is different; a schematic paraphrase of the F10 behaviour follows below.
In the meantime I have left instructions above for anyone to fix this issue themselves. If you look at the kernel spec file, it calls /sbin/new-kernel-pkg, which in turn calls mkinitrd. At first I thought that was the issue, but seeing the other bug, I now see that mkinitrd has to be the problem.
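As a rough illustration of the logic described (a paraphrase, not the actual F10 mkinitrd source; the wait_for_scsi variable name is taken from the comment above):

  # If the root device was matched as SCSI, wait for the bus scan to
  # settle before mounting; if the module match fails, wait_for_scsi
  # is never set and the wait is skipped entirely.
  if [ -n "$wait_for_scsi" ]; then
      emit "stabilized --hash --interval 250 /proc/scsi/scsi"
  fi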

Comment 13 Charlie Moschel 2008-12-10 03:53:55 UTC
I think you can work around this *at boot* by adding "scsi_mod.scan=sync" to the kernel command line (untested). That would save you needing a rescue disk to rebuild the initrd on your new install. Can somebody confirm that this command line works around the problem? (I don't have the hardware handy anymore.) If that's confirmed, it's probably worth putting on the Fedora 'common problems' page.
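At the grub menu this means pressing 'a' (append) or 'e' (edit) on the boot entry and adding the option to the end of the kernel line; a grub.conf line with the workaround applied would look something like this (kernel version and root device are illustrative):

  kernel /vmlinuz-2.6.27.5-117.fc10.x86_64 ro root=/dev/VolGroup00/LogVol00 scsi_mod.scan=sync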

Also, mkinitrd has just been rebuilt (http://koji.fedoraproject.org/koji/buildinfo?buildID=73912):

* Mon Dec 08 2008 Hans de Goede <hdegoede> - 6.0.71-3
- Use scsi_wait_scan on scsi devices instead of stabilized (#470628)

Should be out for testing soon, I'd guess. Too bad it won't help new installs, though.

Comment 14 csklho 2008-12-10 07:23:31 UTC
Adding "scsi_mod.scan=sync" to kernel command line really works without rebuild initrd. Thanks

Comment 15 Phil Bayfield 2008-12-10 08:45:35 UTC
Dave Jones, why have you marked this as a duplicate? It is not a duplicate at all. The description and symptoms are different!!

Comment 16 Jon 2008-12-10 09:06:13 UTC
Peoples,

May I suggest something? 

Change the status to FIXED and put the fix info in. That way EVERYBODY will know it's fixed. Otherwise, there are going to be a LOT of people hitting this problem, looking at the closed status, screaming about why it is not going to be fixed, and adding more comments to this bug when there is a fix already out under a different bug number. This way, EVERYBODY WINS!!!

Jon

Comment 17 Bill Nottingham 2008-12-10 15:46:23 UTC
It's the same issue as 470628, in that we're not using scsi_wait_scan in the initrd. It manifests either as 'could not detect stabilization, waiting 10 seconds', or just a failure to boot.
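For context: the scsi_wait_scan kernel module exists purely so that loading it blocks until all outstanding asynchronous SCSI bus scans have completed, so the fixed initrd can simply do something like the following (a paraphrase, not the actual generated init script):

  modprobe scsi_wait_scan   # returns only once async SCSI scanning is done
  rmmod scsi_wait_scan      # the module has no other function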

*** This bug has been marked as a duplicate of bug 470628 ***

Comment 18 Dave Abbott 2008-12-19 06:16:31 UTC
Some of you have solved the SCSI RAID problem, but it is not clear to me exactly what the specific commands should be and where. The simplest seems to be adding "scsi_mod.scan=sync" to the kernel command line. However, I can't get a command line screen except in rescue mode. On my dual-Xeon server with hardware SCSI RAID I only get the graphical screen at initial install time, or by pressing the F8 key repeatedly immediately after the SCSI adapter finds the RAID. I can then select boot, after which I get this message:

Reading all physical volumes.  This may take a while...
Volume group "VolGroup00" not found
Unable to access resume device (/dev/VolGroup00/LogVol01)
mount: error mounting /dev/root on /sysroot as ext3: No such file or directory

When pressing the F8 key immediately after the SCSI adapter finds the RAID, I get the graphical screen where I can edit grub, but isn't it too late there to edit the kernel?

My partitions are defaulted as below on a RAID1 consisting of two 34875MB hard drives
VolGroup00           mount point    Type
    LogVol00              /                 ext3
    LogVol01                                swap
Hard Drives
    /dev/i2o/hda 
    /dev/i2o/hda1     /boot             ext3
    /dev/i2o/hda2     VolGroup00    LVM PV

Since I installed from an ISO image DVD, the Live CD route looks more complicated. Does someone have a suggestion?

Comment 19 Bill Nottingham 2008-12-19 17:09:08 UTC
Install the updated mkinitrd and remake your initrd; a sketch follows below.
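In practice that amounts to something like this, run from the chroot described in comment 6 if the system cannot boot on its own (version strings are illustrative):

  yum update mkinitrd
  cd /boot
  mv initrd-2.6.27.5-117.fc10.i686.img initrd-2.6.27.5-117.fc10.i686.img.old
  mkinitrd initrd-2.6.27.5-117.fc10.i686.img 2.6.27.5-117.fc10.i686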

Comment 20 Tristan Santore 2008-12-19 19:32:10 UTC
I left instructions on Common Bugs on how to fix the issue; please follow them.
With regard to grub not showing: repeatedly hit the up and down arrow keys and grub will show; the timeout is set very low. I hope this helps.

Comment 21 Dave Abbott 2008-12-25 03:32:36 UTC
I thought I followed Tristan Santore's instructions, but still no joy.
After booting with the live media, I issued the following commands:
1. su -
2. vgchange -ay
3. mkdir /mnt/sysimage
4. mount -t ext3 /dev/VolGroup00/LogVol00 /mnt/sysimage
5. mount -t ext3 /dev/i2o/hda1 /mnt/sysimage/boot
6. mount --bind /dev /mnt/sysimage/dev
7. mount --bind /proc /mnt/sysimage/proc
8. mount --bind /sys /mnt/sysimage/sys
9. cd /mnt/sysimage
10. chroot /mnt/sysimage
11. cd boot
12. mv initrd-2.6.27.5-117.fc10.i686.img initrd-2.6.27.5-117.fc10.i686.img.old
13. mkinitrd --with=scsi_wait_scan initrd2.6.27.5-117.fc10.i686.img initrd-2.6.27.5-117.fc10.i686
14. exit
15. reboot
"Pulled Fedora LiveCD as soon as the system ejected the disk"
 
Got the following message on reboot:
Reading all physical volumes.  This may take a while...
Volume group "VolGroup00" not found
Unable to access resume device (/dev/VolGroup00/LogVol01)
mount: error mounting /dev/root on /sysroot as ext3: No such file or directory

Then it hangs with a flashing cursor. The following is from my review of partitions:


My partitions are defaulted as below on a RAID1 consisting of two 34875MB hard drives:
VolGroup00           mount point    Type
    LogVol00              /         ext3
    LogVol01                        swap
Hard Drives
    /dev/i2o/hda 
    /dev/i2o/hda1     /boot         ext3
    /dev/i2o/hda2     VolGroup00    LVM PV

Help!
Dave Abbott

Comment 22 Bob Gustafson 2009-01-07 21:44:27 UTC
I had the same problem and, after some messing around with mkinitrd (it seemed like a problem I had with FC7 and FC8; I skipped FC9 because of other anaconda agony), I found your scsi_mod.scan=sync fix. This did the trick (so far).

After booting from the initial install, I put the fix into the grub.conf file and booted again. Worked well.

Then I did 'yum update' and things seemed to progress through downloading, and then a problem:

Running rpm_check_debug
ERROR with rpm_check_debug vs depsolve:
bind is needed by (installed) caching-nameserver-31:9.4.2-4.fc7.i386
Complete!
(1, [u'Please report this error in bugzilla'])

---

This may be unique to me, as I am using Bernstein's djbdns bind replacement.

I wonder if there is an override for rpm_check_debug.

Later

Comment 23 Tristan Santore 2009-01-13 12:50:45 UTC
Bob, the issue above has nothing to do with the mkinitrd issue. And that's a Fedora 7 package, and Fedora 7 is EOL. If you upgraded, then that is a stale leftover!

I have updated the common bugs section on the Fedora wiki to reflect the following change. As mkinitrd is now fixed and in the repository, please use the kernel argument specified to boot the system, update the system immediately, and then remove the added scsi_mod.scan=sync arg from your grub entry, if you set it in grub.conf.
If you edited the grub entry directly from grub during boot, then updating will suffice.

Regards,
Tristan Santore

Comment 24 Bob Gustafson 2009-01-13 16:13:21 UTC
I just now finished upgrading my second FC8 system to FC10. This system also has a RAID1 disk array. It also has dual Xeon processors and so the PAE kernel option showed up in my /boot directory.

After the install from DVD, I edited the grub.conf file to add the root=/dev/rootvg/root and scsi_mod.scan=sync kernel boot options.

On initial boot, I was surprised to see just

GRUB

at the top left of my screen. Waiting a while did not change the screen.

Going back to the install disk and entering rescue mode, I worked with grub:

grub> find /boot/grub/stage1

(hd0,0)
(hd1,0)

I figured that these were the MBRs of the two RAIDed disks.

Doing:

grub> setup (hd0)
grub> setup (hd1)

Booting the 'local' disk gave me the native grub prompt.

When I entered the kernel /vmlinuz .. string followed by a TAB, grub showed the possible completions (very useful feature). I saw then that I had the PAE kernel as well as other choices.

I entered initrd /initrd .. TAB and filled out the rest of the command appropriately.

Then the 'boot' command did the right thing.

When the new FC10 system came up, I did a few sanity checks, then 'yum update'.

There was only one package giving a dependency-check problem, so I did rpm -e on that package and retried the 'yum update': success.

But I still had the boot problem: grub would not find the grub.conf file. I had to do

grub> install ...stage1.. stage2..

(I don't have the web page in front of me at the moment; Google was helpful, though. Having multiple systems/screens around while searching for answers also helps.)
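For reference, the non-interactive equivalent of that stage1/stage2 install is the grub-install wrapper, run from the installed system or a chroot (device name illustrative):

  grub-install /dev/sda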

After this, everything seems to work fine (I am writing this on the newly renovated system).

It is good to know that I can now edit the scsi_mod.scan=sync option back out of the grub kernel line.

The only other weird thing is disk labeling and the strange UUID strings that pop into my grub kernel commands (and perhaps fstab). I don't understand them well enough to tell whether they are correct, so my policy has been to delete all LABEL= and UUID= strings from the grub kernel line. Seems to work OK without them.

----

I now have one last system to upgrade from FC8 to FC10. It also has RAID1, but hardware (motherboard) RAID. Should be quicker to recognize, yes?

The problem here is that the machine has no DVD drive and it is my firewall/gateway machine, which means that while I am installing, all the other machines here are off-line. A risk. It is also an x86_64 machine.

Wish me luck.

Comment 25 LaKing 2009-01-29 04:03:25 UTC
Hi folks. This is a big bug.

Clean install from scratch: FC10 64-bit OS on an SSD drive, 4 SATA drives in two software RAID stripes. Installation ends without errors.
After boot, none of the arrays came up. Also, an array created in FC6 didn't show up and couldn't be reassembled/mounted, although the drives seemed to be present.
For an unrelated reason I had to change the M3A motherboard to a P6T.
Clean install again and again, same results: none of the SATA software arrays showed up. An array of two IDE drives, and SATA drives without RAID, work as they are supposed to.
In the boot log there are 3 mdadm errors; one says something about not enough memory (the machine has 12 GB).
Googling for a solution, I saw these two bug threads.
Updated mkinitrd with yum to the latest version .. didn't help.
Added scsi_mod.scan=sync to the kernel arguments .. didn't help.

Some notes:
My system didn't hang at boot, but I wasn't booting from those RAIDs.
SAS controller or SATA controller .. doesn't make any difference.
I don't think this bug is a duplicate of 470628.
I can reproduce this anytime, and can give more info.
I need a stable solution, relatively urgently. Willing to help figure it out.

Comment 26 Bob Gustafson 2009-01-29 04:54:56 UTC
(In reply to comment #25)

> Hi folks. This is a big bug.
> 
 
> Some note's.:
> My system didnt hang at boot, but I didn't boot from those RAIDs.

I think the reason you got as far as you did is that you were NOT booting from the RAID disks.

Check out Bug #474399 and Bug #475024 - it may be closer to your problem

Comment 27 LaKing 2009-01-29 06:37:11 UTC
Yes, those bugs seem to come from the same family, or are at least relatives, but we need the mother of those bugs. It's dangerous; people could lose a lot of data.
Of course I would hang somewhere in the boot process if the filesystem started on a RAID, so my report makes the bug (in case they are the same family) more visible, especially as I still don't see my bug perfectly described.

The key things here:
- Install a clean system, partition 5 drives; the installation goes fine.
- At first boot, the fresh installation does NOT use 4 of the disks properly.
- No updates or workarounds fix the problem.

Considering the mass of affected hardware, this is a very general bug.
We can trace the bug in each scenario, or we can handle it as one bug, which already has several threads.
I have no clue what changed from FC9 to FC10; most people had no such issues in previous versions.
Someone who has this bug right now could post the mdadm error entries from the boot log; maybe that could be the fingerprint of our bug.
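One way to collect that evidence, for anyone hitting this right now (a sketch; device names are illustrative):

  cat /proc/mdstat                   # which arrays the kernel actually assembled
  dmesg | grep -i -e 'md:' -e mdadm  # md/mdadm messages from the boot log
  mdadm --examine /dev/sd[abcd]1     # per-member RAID superblock details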

Comment 28 Tristan Santore 2009-01-29 07:57:11 UTC
1. May I remind everybody that Bugzilla is not meant as a support forum, and this bug is marked as closed! So this is not the right place to post anyway.

2. LaKing, rebuild an initrd that --preloads the required SCSI modules (a sketch follows below) and make sure that is not the issue. Further, please join a support forum and ask there before filing a bug; you can join #fedora on Freenode for such help, or ask on a mailing list. I would also like to ask you to post more detailed bug information and what you have tried before posting here. Please go down the support avenue first, though.
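A minimal sketch of that rebuild (the aacraid module name is purely illustrative; substitute the driver for your controller and adjust the version strings):

  mkinitrd -f --preload=aacraid --with=scsi_wait_scan /boot/initrd-2.6.27.5-117.fc10.x86_64.img 2.6.27.5-117.fc10.x86_64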

Thank you.

Regards,
Tristan

Comment 29 schorschi 2009-06-12 04:42:08 UTC
This issue is still real and still happening; can we get some resolution? I have Dell 2950s with Fedora 10 or 11 installs pending; the install itself goes well, but after the post-install reboot, nothing. I am surprised this is closed when there are a lot of Google hits on this issue for many different SCSI controllers.