Bug 125808 (IT_47358)

Summary: (cciss) anaconda fails to install GRUB into MBR
Product: Red Hat Enterprise Linux 4 Reporter: Andrew Bond <andrew.bond>
Component: anaconda Assignee: Jeremy Katz <katzj>
Status: CLOSED CURRENTRELEASE QA Contact: Mike McLean <mikem>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0 CC: bpeck, brian.b, flanagan, jlaska, jturner, ltroan, mjseger, tao, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-12-13 17:44:19 UTC Type: ---
Bug Depends On:    
Bug Blocks: 135876    
Attachments:
Description: Screenshot showing panic on gandalf (Flags: none)

Description Andrew Bond 2004-06-11 18:11:32 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET 
CLR 1.0.3705; .NET CLR 1.1.4322)

Description of problem:
After a successful install of RHEL4 on a new boot drive, the system 
hangs on reboot.  It never makes it out of the BIOS because it does 
not find a valid MBR.  A second install was done to verify that all 
the grub settings were correctly listed, and the "Installing Grub" 
window pops up briefly at the end of the install.  However, the 
machine finds no kernel to boot after the install reboot.

This was seen on a HP DL760G2 and a HP DL580G2.  A test was run where 
the DL760G2 had all its other controllers disabled so that the 
install process would only see one controller and one disk on that 
controller.  Even in this configuration the same behaviour was 
observed.

Workaround:  Booting the install CD in "linux rescue" mode and 
running "grub-install /dev/cciss/c0d0" fixes the problem and the 
machine boots into the RHEL4 kernel fine after that.
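
To confirm from the rescue shell whether boot code actually reached the
MBR, one option is to inspect the first sector of the disk.  The sketch
below is an illustration only and is not from the original report: the
function name mbr_has_grub is hypothetical, and it assumes that GRUB
legacy's stage1 embeds the literal string "GRUB" within its 512 bytes,
which should be checked against the stage1 file on the install media.

```shell
# Return 0 if the first 512-byte sector of the given device or image
# contains the literal string "GRUB" (an ASSUMED marker of GRUB legacy
# stage1), non-zero otherwise.
mbr_has_grub() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -a -q 'GRUB'
}
```

In rescue mode this could be run as
mbr_has_grub /dev/cciss/c0d0 || echo "no GRUB stage1 in MBR"
before and after the grub-install workaround to see whether it changed
anything.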

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Install RHEL4 on a cciss-based drive
2. Reboot

Actual Results:  Does not find MBR to boot from

Additional info:

Comment 1 Jeremy Katz 2004-06-14 15:43:40 UTC
If you switch to tty5 after the "installing bootloader" window pops up
and goes away, are there any errors from the run of grub?

Comment 2 Andrew Bond 2004-06-14 19:20:52 UTC
Here's what it shows.

grub> root (hd0,0)
       Filesystem type is ext2fs, partition type 0x83
grub> install /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf
Error 15:  File not found
grub>


Comment 3 Jeremy Katz 2004-06-14 21:08:38 UTC
Interesting.. if you switch to tty2, does
/mnt/sysimage/boot/grub/grub.conf exist?

Comment 4 Andrew Bond 2004-06-29 20:32:18 UTC
Yes, the file /mnt/sysimage/boot/grub/grub.conf does exist in tty2 at
the end of the install after the "File not found" message shows up.

Comment 5 Andrew Bond 2004-07-30 18:59:31 UTC
Same problem happens in alpha4.  Same error message (File not found)
as before with the same grub commands listed in tty5.  There are two
sysimage mounts:

/dev/cciss/c0d0p3 is mounted as /mnt/sysimage
/dev/cciss/c0d0p1 is mounted as /mnt/sysimage/boot

Comment 7 Jeremy Katz 2004-08-20 18:46:59 UTC
I think this might be fixed with newer trees.  Between the changes in
parted, grub and the kernel, I think one of them has fixed things.  I
managed to reproduce the problem on an older tree with my cciss box
here, and I have just installed with a current RHEL4 tree without
problems.

Comment 8 Andrew Bond 2004-10-04 19:28:30 UTC
I tried installing beta1 on the DL760G2 I was having problems with
before and got the following grub errors on tty5 after the install.

grub> root (hd31,0)
Error 12: Invalid device requested
grub> install /grub/stage1 d (hd0) /grub/stage2 p (hd31,0)/grub/grub.conf
Error 12: Invalid device requested
grub>

The fact that it is indicating hd31 bothers me, so I'm redoing the
install to make sure the right devices are chosen.

I also tried the beta1 install on a DL580G2 and also got a grub error,
but a different message.

grub> root (hd0,0)
Filesystem type unknown, partition type 0x83
grub> install /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf
Error 17: Cannot mount selected partition
grub>

Both machines are in my lab in Raleigh (behind RH firewall) and are
hooked to an IP based KVM switch if access to them is needed.

Comment 9 Andrew Bond 2004-10-06 15:07:03 UTC
The DL760G2 installation error was related to having scsi devices
available on the system during install and the installer defaulting to
sda as the boot device for grub.

I reinstalled and moved /dev/cciss/c0d0 up to the top of the boot
order during the install process.  The grub install still failed, but
this time the error was exactly the same "Error 17:  Cannot mount
selected partition" that was seen during the DL580G2 install above.


Comment 11 Bill Peck 2004-10-21 20:45:19 UTC
RHEL4-re1021.nightly actually finished installing but upon reboot the
system hung with just GRUB on the screen.

RHEL4-Beta2-RC-re1021.0 installed but hung at the Post Install screen.
Virtual console 4 showed some kernel panic.  When I attempted to
reboot I received an additional panic.  I captured the screen and will
attach it here.


Comment 12 Bill Peck 2004-10-21 20:48:25 UTC
Created attachment 105609 [details]
Screenshot showing panic on gandalf

Comment 13 Jeremy Katz 2004-10-25 14:45:11 UTC
"RHEL4-Beta2-RC-re1021.0 installed but hung at the Post Install screen.
Virtual console 4 showed some kernel panic." -- This is concerning and
probably the root cause of the problem.  I just did an install on an
ML350 with cciss and grub installed and the machine rebooted fine. 
Need more information on the panic to be able to say anything more here.

Comment 14 Bill Peck 2004-10-25 16:46:55 UTC
Tried to reproduce here in Westford using a DL 380 G3.  Installed
fine.  I have a feeling it has to do with the number of controllers
and disks on the system in Raleigh.

Andy can you disconnect the external storage?  Update the bugzilla
here when you have and I'll kick off a grid install of rhel4 beta2.

Comment 15 Andrew Bond 2004-10-25 20:41:51 UTC
The system should no longer see any other cciss controller besides the
embedded one.  Try a grid install.

Comment 16 Bill Peck 2004-10-26 13:31:07 UTC
This made the difference.  Without the storage attached the system
installed fine.

Jeremy, I'm not sure what to do next.  Do you have any ideas?

Comment 17 Jeremy Katz 2004-10-26 13:41:55 UTC
It sounds to me like with the other controllers attached, the internal
one is no longer controller 0.  Which then means you need to go to the
advanced boot loader screen and reorder them into the proper BIOS order.

Comment 18 Bill Peck 2004-10-26 13:52:01 UTC
I might buy that if rhel3u4 didn't install perfectly with the storage
attached.

Comment 19 Jeremy Katz 2004-10-26 13:58:11 UTC
RHEL3 U4 uses a 2.4 kernel and thus doesn't use ACPI for device
enumeration. 

The 2.6 kernel can and will change the order your devices get detected in.

Comment 20 Bill Peck 2004-10-26 14:09:19 UTC
So if the BIOS shows the internal controller as the boot controller
and we still have this problem then I should add kernel to this
bugzilla?  Since it must be a kernel bug?

Andy, Can you confirm that the internal controller is the boot
controller? I'm pretty sure it is.

Comment 21 Jeremy Katz 2004-10-26 14:19:28 UTC
There is no generic, reliable way to get information on what the BIOS
order actually is.

EDD 3.0 is supposed to solve this someday, but implementation by BIOS
vendors is poor at best and thus there continues to be no solution.

Comment 22 Andrew Bond 2004-10-26 14:58:14 UTC
I have now done more than 10 installs where this problem happens.  In
all but one of them the install identified the embedded controller as
the boot controller (/dev/cciss/c0d0).  The one exception was with
fibre drives attached, when the install picked a different boot
device and manual intervention was needed to adjust the install boot
order.  This is what happened in my 10/4/2004 comment regarding the
DL760G2.  Reordering the controllers during install fixed that
problem, but the grub install still failed.

On all these systems the embedded controller is always the boot
controller and I've never seen Linux (2.4 or 2.6) identify it as
anything other than the first cciss device (/dev/cciss/c0d0).

It seems to me that since the install steps have identified the right
controller the grub install at the end of the install should be trying
to use the right controller.  But maybe that isn't happening.  Is the
logic between the grub installer screen and the actual grub install
different?

Comment 23 Larry Troan 2004-10-29 17:38:09 UTC
From HP........................
Hi Larry and Chris, 
For this bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=125808

Andy and I are not convinced that everything is working fine in RH's
code.  Can we get this reopened?

Andy is not doing anything abnormal.  He just has two controllers in
the system and he can't install without having to use a rescue CD.
Also, there is a bug on tty5 that is showing a failure with GRUB.

Comment 24 Warren Togami 2004-10-30 10:42:01 UTC
*** Bug 136989 has been marked as a duplicate of this bug. ***

Comment 25 Warren Togami 2004-10-30 11:12:39 UTC
The errors that Larry Troan mentions in Comment #23 on VT5 are like
this for LVM Auto-Partitioned:
=================================
  Creating directory "/var/lock/lvm"
  Wiping cache of LVM-capable devices
  Wiping internal cache
Reading all physical volumes.  This may take a while...
   Finding all volume groups.
   Finding volume group "VolGroup00"
Found volume group "VolGroup00" using metadata type lvm2


    GNU GRUB  version 0.95  (640K lower / 3072K upper memory)

[ Minimal BASH-like line editing is supported.  For the first word, TAB
  lists possible command completions.  Anywhere else TAB lists the possible
  completions of a device/filename.]
grub> root (hd0,0)
 Filesystem type unknown, partition type 0x83
grub> install /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf

Error 17: Cannot mount selected partition
grub> 


And this is after manual partition without LVM:
===============================================
 Filesystem type is ext2fs, partition type 0x83
grub> install /boot/grub/stage1 d (hd0) /boot/grub/stage2 p (hd0,0)/boot/grub/grub.conf

Error 15: File not found
grub>

Comment 26 Warren Togami 2004-10-30 11:14:26 UTC
I don't know if Bug #123249 is related to this in any way, or is no
longer an issue itself.

Comment 27 Jeremy Katz 2004-11-01 14:21:07 UTC
This is working for me 100% of the time on my cciss test box :-/

Not being able to mount the device sounds odd -- what does
/boot/grub/device.map show?

Comment 28 Andrew Bond 2004-11-01 16:15:30 UTC
device.map shows two entries.  One for fd0 and one for /dev/cciss/c0d0.
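
Given the earlier hd31 error, it can be useful to check which Linux
device the map assigns to a given GRUB drive.  A minimal sketch,
assuming the usual two-column device.map layout of "(drive) device" on
each line; map_lookup is a hypothetical helper name, not an existing
tool:

```shell
# Print the Linux device that a GRUB legacy device.map assigns to a drive.
# Usage: map_lookup <path to device.map> <GRUB drive name, e.g. hd0>
# Prints nothing if the drive has no entry in the map.
map_lookup() {
    awk -v drive="($2)" '$1 == drive { print $2 }' "$1"
}
```

From the tty2 shell at the end of an install this might be called as
map_lookup /mnt/sysimage/boot/grub/device.map hd0
and with the two entries described here it should print /dev/cciss/c0d0.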

It appears the problem is related to whether the drives you are
installing on were partitioned before or not.

I did a minimal install on a DL580G2 with 1 PCI controller in it where
I installed on top of the partitions from a previous install (although
formatting the ext3 partitions again).  Everything worked fine in this
setup.

However, I rebuilt the RAID device on the internal controller so that
the partitions and data on those drives would be destroyed.  I then
attempted another minimal install and this time it failed.

Jeremy, 
If you have a browser with java runtime in it you can access the
console for the machine that failed the install.  It is sitting at one
of the console windows right now and you can poke around.  Contact me
if you want the access details.

Comment 29 Warren Togami 2004-11-03 09:51:03 UTC
Jeremy Katz was able to reproduce the GRUB install failure on his
cciss test box earlier today.  He mentioned that the clue given in
Comment #28 about changing the partitions seems to trigger the
problem.  Hopefully this will lead to resolution.

Comment 30 Jeremy Katz 2004-11-03 22:04:58 UTC
I reproduced once yesterday and now it's not happening.  Although
maybe that's because my partitioning is ending up looking fairly
identical.  

Blurgh.

Comment 31 Andrew Bond 2004-11-04 17:14:14 UTC
When I was rebuilding the hardware RAID device in comment #28 I had
swapped drive locations to make sure no resident data made any sense.
I believe all the problems I have reported were with installs to new
drive sets, since I usually keep the old OS drives around in case I
need to boot them.

Although I don't think that was the case with Bill's failed install
using the testgrid.  I think the only difference there was the
external storage was removed.  Maybe it's related to both external
storage present and no old data present.



Comment 32 Warren Togami 2004-11-06 10:53:24 UTC
For me it is reproducible 100% if...
* Partition for the first time after creating the logical drive, which
wipes out the existing partition table.
* Re-partition from LVM to manual partition and back, using different
partition sizes.


Comment 34 Brian Baker 2004-11-18 15:03:36 UTC
This can also be reproduced by zeroing out the front of the drive:
	dd bs=1024 count=1024 < /dev/zero > /dev/cciss/c0d0

It can be fixed by booting from a grub floppy and running:
	root (hd0,0)
	setup (hd0)

You can make a grub floppy by:
	cd /usr/share/grub/*
	cat stage1 stage2 > /dev/fd0

Comment 35 Jeremy Katz 2004-11-30 22:54:19 UTC
Fixed in grub-0.95-3.1.  Tested on my cciss box here where I was seeing it (and
definitely reproduced), then logged in and ran my new binary as the test case
without a tree.

Testing once this makes it into a tree (hopefully tomorrow) would be much
appreciated.  

Comment 36 Warren Togami 2004-12-01 04:02:04 UTC
Will it be in RHEL4's tree (are there nightlies?), rawhide, or both?
Is the rawhide install even working?


Comment 37 Jeremy Katz 2004-12-01 04:07:08 UTC
It'll be in both (0.95-4 for rawhide).  Rawhide installs should work with some
of the changes I've made today, but it could be broken.

Comment 38 Brian Baker 2004-12-01 16:30:41 UTC
I have several engineers that have found this bug in one form or 
another.  I would like for them to be able to retest without having 
to wait until RC1.  Can you attach the grub-0.95-4 file to this 
bugzilla or put it on a people page and help me understand how they 
might be able to retest their issues?  

Why does the boot procedure work with the old grub package when 
running grub-install manually?  Does this imply that there is a 
problem with anaconda?  Just curious what the fix was.

Comment 39 Jeremy Katz 2004-12-01 17:14:05 UTC
grub-install works from rescue mode because you've rebooted and then
not changed /boot so there's a consistent view of the device being
accessed both as /dev/cciss/c0d0 and what's on the filesystem on
/dev/cciss/c0d0p1.  I've changed grub so that it uses O_DIRECT which,
with the syncs already occurring, is enough to ensure a consistent view.
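
The stale-view effect described above can be probed from a shell by
reading the same block twice, once through the page cache and once with
O_DIRECT.  The sketch below is not the grub change itself; it is an
illustration that assumes GNU dd with iflag=direct is available, and
the function name direct_matches_cached is hypothetical.

```shell
# Compare a buffered read and an O_DIRECT read of the first 4096 bytes
# of a device or file.
# Returns 0 if both reads agree, 1 if they differ (cached view is stale),
# and 2 if either read fails (for example O_DIRECT is not supported).
direct_matches_cached() {
    dev=$1
    a=$(mktemp) && b=$(mktemp) || return 2
    rc=2
    if dd if="$dev" of="$a" bs=4096 count=1 2>/dev/null &&
       dd if="$dev" of="$b" bs=4096 count=1 iflag=direct 2>/dev/null; then
        if cmp -s "$a" "$b"; then rc=0; else rc=1; fi
    fi
    rm -f "$a" "$b"
    return $rc
}
```

A return value of 1 for /dev/cciss/c0d0 right after the install would be
consistent with the inconsistent view described in this comment.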

Unfortunately, just putting a new grub package up isn't quite enough
as the invocation is a bit convoluted.  BUT, 
http://people.redhat.com/~katzj/grub/ has grub-0.95-3.1 (which is the
RHEL4 build with the fix).  You can grab it after the install is done
by switching to tty2, grabbing the package, installing it, and then
running (from chroot'd into /mnt/sysimage)
`echo -e "root (hd0,0)\n install /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf" | /sbin/grub --batch --device-map=/boot/grub/device.map`
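
Because the wrapped command is easy to mistype, the two batch lines can
also be generated by a small helper.  This is a sketch under
assumptions: grub_install_commands is a hypothetical name, and the
/grub/... paths assume a separate /boot partition as in comment 5.

```shell
# Emit the GRUB legacy batch commands for a boot partition and target disk.
# Both arguments use GRUB device syntax, e.g. "(hd0,0)" and "(hd0)".
# Paths are relative to the boot partition (separate /boot assumed).
grub_install_commands() {
    boot_part=$1
    target=$2
    printf 'root %s\n' "$boot_part"
    printf 'install /grub/stage1 d %s /grub/stage2 p %s/grub/grub.conf\n' \
        "$target" "$boot_part"
}
```

From the chroot this could be used as
grub_install_commands "(hd0,0)" "(hd0)" | /sbin/grub --batch --device-map=/boot/grub/device.map
When /boot is not a separate partition the paths would need the /boot
prefix, as in the manual-partition output in comment 25.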

Comment 40 Brian Baker 2004-12-02 17:17:43 UTC
These steps didn't work for us.  However, we set up an NFS share, 
swapped out the old package for the new one and the install worked.  
We then put the old package back on the share, reinstalled, and 
verified that it still failed.  Thanks for all of the help.

This is definitely in RC1, right?

Comment 41 Jay Turner 2004-12-04 10:12:42 UTC
HP, would be great to get your confirmation that all is working as expected with
the next drop of code.

Comment 42 Larry Troan 2004-12-09 19:15:13 UTC
<ootpa> jay, is the fix for bug 125808 (grub not properly installed in MBR)
        in the partner 12/06 images or is this 12/20?
<jkt>   ootpa, checking that out...
<jkt>   ootpa, the version that we think is correct is indeed in the 12/06 drop
        that the partners have

Comment 43 Andrew Bond 2004-12-09 19:25:01 UTC
I just verified this morning that the problem appears to be fixed in
the 12/6 code drop.  I did a CD install using the 12/6 ISO images on
a DL760G2 and a DL580G2 and both installs worked fine and grub
installed correctly.  Neither of these machines had ever had a
successful install of any prior RHEL4 release because of the grub
problem.

From my standpoint the problem has been fixed.  Brian can provide
feedback from any other HP install tests.

Comment 44 Brian Baker 2004-12-13 17:37:47 UTC
Agreed.  We have also verified that this is fixed.

Comment 45 James Laska 2004-12-13 17:44:19 UTC
According to comment#43 and comment#44 ... this issue has been resolved in the
re1206.0 drop.  Please reopen this defect if the problem resurfaces.