Bug 125808 (IT_47358)
| Field | Value |
|---|---|
| Summary | (cciss) anaconda fails to install GRUB into MBR |
| Product | Red Hat Enterprise Linux 4 |
| Component | anaconda |
| Version | 4.0 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | medium |
| Reporter | Andrew Bond <andrew.bond> |
| Assignee | Jeremy Katz <katzj> |
| QA Contact | Mike McLean <mikem> |
| CC | bpeck, brian.b, flanagan, jlaska, jturner, ltroan, mjseger, tao, wtogami |
| Doc Type | Bug Fix |
| Last Closed | 2004-12-13 17:44:19 UTC |
| Bug Blocks | 135876 |

Description
Andrew Bond
2004-06-11 18:11:32 UTC
If you switch to tty5 after the "installing bootloader" window pops up and goes away, are there any errors from the run of grub?

Here's what it shows.

    grub> root (hd0,0)
    Filesystem type is ext2fs, partition type 0x83
    grub> install /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf
    Error 15: File not found
    grub>

Interesting.. if you switch to tty2, does /mnt/sysimage/boot/grub/grub.conf exist?

Yes, the file /mnt/sysimage/boot/grub/grub.conf does exist in tty2 at the end of the install after the "File not found" message shows up.

Same problem happens in alpha4. Same error message (File not found) as before with the same grub commands listed in tty5. There are two sysimage mounts:

    /dev/cciss/c0d0p3 is mounted as /mnt/sysimage
    /dev/cciss/c0d0p1 is mounted as /mnt/sysimage/boot

I think this might be fixed with newer trees. Between changes in parted, grub and the kernel, one of the above I think has fixed things. I managed to reproduce the problem on an older tree with my cciss box here and I have just installed with a current RHEL4 tree without problems.

I tried installing beta1 on the DL760G2 I was having problems with before and got the following grub errors on tty5 after the install.

    grub> root(hd31,0)
    Error 12: Invalid device requested
    grub> install /grub/stage1 d (hd0) /grub/stage2 p (hd31,0)/grub/grub.conf
    Error 12: Invalid device requested
    grub>

The fact that it is indicating hd31 bothers me, so I'm redoing the install to make sure the right devices are chosen.

I also tried the beta1 install on a DL580G2 and also got a grub error, but a different message.

    grub> root (hd0,0)
    Filesystem type unknown, partition type 0x83
    grub> install /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf
    Error 17: Cannot mount selected partition
    grub>

Both machines are in my lab in Raleigh (behind RH firewall) and are hooked to an IP based KVM switch if access to them is needed.

The DL760G2 installation error was related to having SCSI devices available on the system during install and the installer defaulting to sda as the boot device for grub. I reinstalled and moved /dev/cciss/c0d0 up to the top of the boot order during the install process. The grub install still failed, but this time the error was exactly the same "Error 17: Cannot mount selected partition" that was seen during the DL580G2 install above.

RHEL4-re1021.nightly actually finished installing but upon reboot the system hung with just GRUB on the screen.

RHEL4-Beta2-RC-re1021.0 installed but hung at the Post Install screen. Virtual console 4 showed some kernel panic. When I attempted to reboot I received an additional panic. I captured the screen and will attach it here.

Created attachment 105609: Screenshot showing panic on gandalf

"RHEL4-Beta2-RC-re1021.0 installed but hung at the Post Install screen. Virtual console 4 showed some kernel panic." -- This is concerning and probably the root cause of the problem. I just did an install on an ML350 with cciss and grub installed and the machine rebooted fine. Need more information on the panic to be able to say anything more here.

Tried to reproduce here in Westford using a DL 380 G3. Installed fine. I have a feeling it has to do with the number of controllers and disks on the system in Raleigh.

Andy, can you disconnect the external storage? Update the bugzilla here when you have and I'll kick off a grid install of rhel4 beta2.

The system should no longer see any other cciss controller besides the embedded one. Try a grid install.

This made the difference. Without the storage attached the system installed fine.

Jeremy, I'm not sure what to do next. Do you have any ideas?

It sounds to me like with the other controllers attached, the internal one is no longer controller 0. Which then means you need to go to the advanced boot loader screen and reorder them into the proper BIOS order.

I might buy that if rhel3u4 didn't install perfectly with the storage attached.

RHEL3 U4 uses a 2.4 kernel and thus doesn't use ACPI for device enumeration. The 2.6 kernel can and will change the order your devices get detected in.

So if the BIOS shows the internal controller as the boot controller and we still have this problem then I should add kernel to this bugzilla? Since it must be a kernel bug?

Andy, can you confirm that the internal controller is the boot controller? I'm pretty sure it is.

There is no generic, reliable way to get information on what the BIOS order actually is. EDD 3.0 is supposed to solve this someday, but implementation by BIOS vendors is poor at best and thus there continues to be no solution.

I have done more than 10 installs where this problem happens now.
In all but one of them the embedded controller is always identified by the install as the boot controller (/dev/cciss/c0d0). The only time it is not is when there are fibre drives attached and then it identifies the boot device. This is the only time that the install has not identified the correct controller and manual intervention was needed to adjust the install boot order. This is what happened in my 10/4/2004 comment regarding the DL760G2. Reordering the controllers during install fixed that problem, but then the grub install still failed.

On all these systems the embedded controller is always the boot controller and I've never seen Linux (2.4 or 2.6) identify it as anything other than the first cciss device (/dev/cciss/c0d0). It seems to me that since the install steps have identified the right controller the grub install at the end of the install should be trying to use the right controller. But maybe that isn't happening. Is the logic between the grub installer screen and the actual grub install different?

From HP........................

Hi Larry and Chris,

For this bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=125808

Andy and I are not convinced that everything is working fine in RH's code. Can we get this reopened? Andy is not doing anything abnormal. He just has two controllers in the system and he can't install without having to use a rescue CD. Also, there is a bug on tty5 that is showing a failure with GRUB.

*** Bug 136989 has been marked as a duplicate of this bug. ***

The errors that Larry Troan mentions in Comment #23 on VT5 are like this for LVM Auto-Partitioned:
=================================
    Creating directory "/var/lock/lvm"
    Wiping cache of LVM-capable devices
    Wiping internal cache
    Reading all physical volumes. This may take a while...
    Finding all volume groups.
    Finding volume group "VolGroup00"
    Found volume group "VolGroup00" using metadata type lvm2

    GNU GRUB version 0.95 (640K lower / 3072K upper memory)

    [ Minimal BASH-like line editing is supported. For the first word, TAB
      lists possible command completions. Anywhere else TAB lists the possible
      completions of a device/filename.]

    grub> root (hd0,0)
    Filesystem type unknown, partition type 0x83
    grub> install /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf
    Error 17: Cannot mount selected partition
    grub>

And this is after manual partition without LVM:
===============================================
    Filesystem type is ext2fs, partition type 0x83
    grub> install /boot/grub/stage1 d (hd0) /boot/grub/stage2 p (hd0,0)/boot/grub/grub.conf
    Error 15: File not found
    grub>

I don't know if Bug #123249 is related to this in any way, or is no longer an issue itself.

This is working for me 100% of the time on my cciss test box :-/ Not being able to mount the device sounds odd -- what does /boot/grub/device.map show?

device.map shows two entries. One for fd0 and one for /dev/cciss/c0d0.

It appears the problem is related to whether the drives you are installing on were partitioned before or not. I did a minimal install on a DL580G2 with 1 PCI controller in it where I installed on top of the partitions from a previous install (although formatting the ext3 partitions again). Everything worked fine in this setup. However, I rebuilt the RAID device on the internal controller so that the partitions and data on those drives would be destroyed. I then attempted another minimal install and this time it failed.

Jeremy, if you have a browser with java runtime in it you can access the console for the machine that failed the install. It is sitting at one of the console windows right now and you can poke around. Contact me if you want the access details.

Jeremy Katz was able to reproduce the GRUB install failure on his cciss test box earlier today.
He mentioned that the clue given in Comment #28 about changing the partitions seems to trigger the problem. Hopefully this will lead to resolution.

I reproduced once yesterday and now it's not happening. Although maybe that's because my partitioning is ending up looking fairly identical. Blurgh.

When I was rebuilding the hardware RAID device in comment #28 I had swapped drive locations to make sure no resident data made any sense. I believe all the problems I have reported were with installs to new drive sets since I usually keep the old OS drives around in case I need to boot them. Although I don't think that was the case with Bill's failed install using the testgrid. I think the only difference there was the external storage was removed. Maybe it's related to both external storage present and no old data present.

For me it is reproducible 100% if...

* Partition for the first time after creating the logical drive, which wipes out the existing partition table.
* Re-partition from LVM to manual partition and back, using different partition sizes.

This can also be reproduced by zeroing out the front of the drive:

    dd bs=1024 count=1024 < /dev/zero > /dev/cciss/c0d0

It can be fixed by booting from a grub floppy and running:

    root (hd0,0)
    setup (hd0)

You can make a grub floppy by:

    cd /usr/share/grub/*
    cat stage1 stage2 > /dev/fd0

Fixed in grub-0.95-3.1. Tested on my cciss box here where I was seeing it (and definitely reproduced), then logged in and ran my new binary as the test case without a tree. Testing once this makes it into a tree (hopefully tomorrow) would be much appreciated.

Will it be in RHEL4's tree (are there nightlies?), rawhide, or both? Is rawhide install even working?

It'll be in both (0.95-4 for rawhide). Rawhide installs should work with some of the changes I've made today, but it could be broken.

I have several engineers that have found this bug in one form or another. I would like for them to be able to retest without having to wait until RC1.
Can you attach the grub-0.95-4 file to this bugzilla or put it on a people page and help me understand how they might be able to retest their issues?

Why does the boot procedure work with the old grub package when running grub-install manually? Does this imply that there is a problem with anaconda? Just curious what the fix was.

grub-install works from rescue mode because you've rebooted and then not changed /boot so there's a consistent view of the device being accessed both as /dev/cciss/c0d0 and what's on the filesystem on /dev/cciss/c0d0p1. I've changed grub so that it uses O_DIRECT which, with the syncs already occurring, is enough to ensure a consistent view.

Unfortunately, just putting a new grub package up isn't quite enough as the invocation is a bit convoluted. BUT, http://people.redhat.com/~katzj/grub/ has grub-0.95-3.1 (which is the RHEL4 build with the fix). You can grab it after the install is done by switching to tty2, grabbing the package, installing it, and then running (from chroot'd into /mnt/sysimage)

`echo -e "root (hd0,0)\n install /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf" | /sbin/grub --batch --device-map=/boot/grub/device.map`

These steps didn't work for us. However, we set up an NFS share, swapped out the old package for the new one and the install worked. We then put the old package back on the share, reinstalled, and verified that it still failed. Thanks for all of the help.

This is definitely in RC1, right? HP, would be great to get your confirmation that all is working as expected with the next drop of code.

<ootpa> jay, is the fix for bug 125808 (grub not properly installed in MBR) in the partner 12/06 images or is this 12/20?
<jkt> ootpa, checking that out...
<jkt> ootpa, the version that we think is correct is indeed in the 12/06 drop that the partners have

I just verified this morning that the problem appears to be fixed in the 12/6 code drop.
I did a CD install using the 12/6 ISO images on a DL760G2 and a DL580G2 and both installs worked fine and grub installed correctly. Neither of these machines had ever had a successful install of any prior RHEL4 release because of the grub problem. From my standpoint the problem has been fixed. Brian can provide feedback from any other HP install tests.

Agreed. We have also verified that this is fixed.

According to comment#43 and comment#44 ... this issue has been resolved in the re1206.0 drop. Please reopen this defect if the problem resurfaces.
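
For anyone retesting: the two `install` invocations quoted in this report differ only in where the GRUB files sit relative to the boot partition (`/grub` when /boot is its own partition, `/boot/grub` when it is not). The following is a minimal sketch, not anything shipped with grub or anaconda; the helper name `grub_batch_commands` is hypothetical. It only prints the command text so it can be reviewed before anything is written to a disk.

```shell
#!/bin/sh
# Hypothetical helper (not part of grub or anaconda): print the GRUB legacy
# batch commands that install stage1 into the MBR.
#   $1  GRUB disk name, e.g. hd0 (must match an entry in device.map)
#   $2  GRUB partition index of the partition holding the GRUB files
#   $3  grub directory as seen from that partition:
#       /grub when /boot is a separate partition, /boot/grub otherwise
# Assumption: the boot partition and the MBR target are on the same disk.
grub_batch_commands() {
    disk=$1
    part=$2
    dir=$3
    printf 'root (%s,%s)\n' "$disk" "$part"
    printf 'install %s/stage1 d (%s) %s/stage2 p (%s,%s)%s/grub.conf\n' \
        "$dir" "$disk" "$dir" "$disk" "$part" "$dir"
}

# Dry run: show the commands for a separate /boot on the first partition.
grub_batch_commands hd0 0 /grub
```

To apply the commands, the output can be piped to `/sbin/grub --batch --device-map=/boot/grub/device.map` from inside the chroot at /mnt/sysimage, as in the workaround quoted earlier in this report. The same-disk assumption does not hold for the hd31 output shown above, where the two device names differed.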