
Bug 1084852

Summary: grub stage1.5 hangs server
Product: Red Hat Enterprise Linux 6
Component: grub
Version: 6.5
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Target Milestone: rc
Target Release: ---
Keywords: FastFix
Reporter: jas
Assignee: David Kaspar // Dee'Kej <deekej>
QA Contact: Release Test Team <release-test-team-automation>
CC: jas, rvokal
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-10-13 10:34:11 UTC
Bug Blocks: 1356056
Attachments:
  first 64 bytes of disk after grub root setup procedure run (flags: none)

Description jas 2014-04-07 03:21:50 UTC
Created attachment 883446
first 64 bytes of disk after grub root setup procedure run

Description of problem:

I have a Dell R720xd system with an LSI 9207-8i HBA running IT (non-RAID) firmware.  The first two SAS disks in the chassis are recognized by the RHEL6 server as /dev/sda and /dev/sdv.  I kickstart the system with a mirrored MD /boot comprising /dev/sda1 and /dev/sdv1 (500 MB), and a mirrored MD root comprising /dev/sda2 and /dev/sdv2 (rest of disk).  During kickstart, the grub bootloader is installed.  According to /var/log/anaconda.program.log, grub is installed like this:

install --stage2=/boot/grub/stage2 /grub/stage1 d (hd0) /grub/stage2 p (hd0,0)/grub/grub.conf

After kickstart, I can boot the system from either disk.

If a disk fails, and I replace the disk, I re-copy the partition table from the remaining good disk:

eg. sfdisk -d /dev/sda | sfdisk --force /dev/sdv

I then need to re-install grub on the new disk so that it's bootable.  If I follow the common procedure of:

grub> root (hd0,0)
grub> setup (hd1)

... the grub boot loader is installed to the right disk.  However, when I try to boot using that disk, the system hangs.  There are no error messages.
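
The root/setup sequence above can also be fed to the GRUB legacy shell non-interactively, which makes the re-install step easier to repeat while testing. The following is a minimal sketch, not a fix: the function name grub_reinstall_script is hypothetical, and the device arguments simply reuse the values from this report.

```shell
#!/bin/sh
# Print the GRUB legacy shell commands used in this report.
#   $1: GRUB partition that holds /boot, e.g. "(hd0,0)"
#   $2: GRUB disk to install the boot loader on, e.g. "(hd1)"
# The function only prints text; nothing is written to any disk.
grub_reinstall_script() {
    printf 'root %s\nsetup %s\nquit\n' "$1" "$2"
}

# Usage (as root on the affected machine; not executed here):
#   grub_reinstall_script '(hd0,0)' '(hd1)' | grub --batch
```

Printing the commands first makes it possible to review them before piping them into grub --batch.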

The difference between the Anaconda installation and my secondary method (root/setup) is that Anaconda points stage1 directly at stage2 in /boot, whereas "setup (hd1)" embeds the stage1.5 loader.  I'm not a grub expert, so it's not obvious to me why "setup" can't simply install grub without embedding stage1.5, since stage1.5 is not needed here (/boot is in a location where stage2 can be accessed directly).

If I dd the original MBR that was installed by Anaconda to the replacement disk, then the system boots fine.
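
The report does not say how many bytes were copied with dd. One way to do it, sketched below under that caveat, is to copy only the first 446 bytes of the sector: in the classic MBR layout these hold the boot code (and the optional disk signature), followed by the 64-byte partition table and the 2-byte boot signature. The function name copy_boot_code is hypothetical.

```shell
#!/bin/sh
# Copy only the boot-code area (bytes 0-445) of one MBR onto another,
# leaving the destination's partition table and boot signature
# (bytes 446-511) untouched.
#   $1: source disk or image, $2: destination disk or image
copy_boot_code() {
    # conv=notrunc keeps the rest of the destination intact when it is
    # a regular file (for example a saved image).
    dd if="$1" of="$2" bs=446 count=1 conv=notrunc 2>/dev/null
}

# Usage (as root; not executed here):
#   copy_boot_code /dev/sda /dev/sdv
```

This presumably only helps when both disks have identical partition layouts, as they do here after the sfdisk copy, because stage1 refers to stage2 by its location on disk.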

NOTE: This server will be going into production within the next few weeks, after which I won't be able to test changes.  I know this is unlikely to be dealt with before then.  However, I'm reporting the bug now in the hopes that the problem will either be resolved in a future RHEL6 update, or it will help someone else to avoid the testing I've been through.  I'm quite positive it's a grub bug.

Steps to Reproduce:
1. kickstart system - I can boot from either disk
2. manually fail MD RAID1
3. install replacement disk
4. copy partition table from existing disk
5. allow MD to rebuild
6. re-install grub using "root/setup" procedure.
7. attempt to boot from the disk - the system locks up
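
Steps 2 to 5 above can be sketched as a dry run that prints the commands instead of executing them. This is an illustration only: the array names /dev/md0 and /dev/md1 and the function name print_replace_steps are assumptions, since the report does not name the MD devices.

```shell
#!/bin/sh
# Print (do not run) the commands for failing and replacing one mirror member.
#   $1: surviving disk, e.g. /dev/sda
#   $2: disk being replaced, e.g. /dev/sdv
# /dev/md0 (/boot) and /dev/md1 (root) are assumed array names.
print_replace_steps() {
    good=$1
    new=$2
    cat <<EOF
mdadm /dev/md0 --fail ${new}1 --remove ${new}1
mdadm /dev/md1 --fail ${new}2 --remove ${new}2
# (swap the physical disk here)
sfdisk -d ${good} | sfdisk --force ${new}
mdadm /dev/md0 --add ${new}1
mdadm /dev/md1 --add ${new}2
cat /proc/mdstat
EOF
}

# Usage: print_replace_steps /dev/sda /dev/sdv
```

The final cat /proc/mdstat line is where the rebuild progress from step 5 can be watched before grub is re-installed in step 6.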

Additional info:
I suspect this is related: http://bugs.centos.org/print_bug_page.php?bug_id=1940

I'm not sure if this is related: https://www.illumos.org/issues/4659

Attached is a copy of the first 64 bytes of the disk after root/setup procedure, and stage1.5 installed. (dd if=/dev/sdX of=/tmp/sdX count=64)
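
Note that dd uses 512-byte blocks by default, so the command above with count=64 captures the first 64 sectors (32768 bytes) rather than 64 bytes. That range covers the MBR plus the sectors directly after it, which is where "setup" embeds stage1.5. A minimal sketch with the block size spelled out follows; the function name dump_sectors is hypothetical.

```shell
#!/bin/sh
# Save the first N 512-byte sectors of a disk (or disk image) to a file.
#   $1: source disk or image, $2: output file, $3: number of sectors
dump_sectors() {
    dd if="$1" of="$2" bs=512 count="$3" 2>/dev/null
}

# Usage (as root; not executed here):
#   dump_sectors /dev/sdv /tmp/sdv.first64 64
```

Two such dumps, one from each disk, can then be compared with cmp to see where the Anaconda-installed and setup-installed boot areas differ.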

Comment 4 Jan Grulich 2015-04-02 10:59:59 UTC
I backported the fix from https://www.illumos.org/issues/4659. Could you test whether it solves your problem? Here [1] is a scratch build with the mentioned fix.

[1] - https://brewweb.devel.redhat.com/taskinfo?taskID=8938616

Comment 5 Jan Grulich 2015-04-07 12:40:02 UTC
I hadn't realized that you might not have access there, so you can also find the build here [1].

[1] - https://jgrulich.fedorapeople.org/grub/

Comment 6 jas 2015-05-11 19:06:46 UTC
Hi Jan.

I'm trying to find time to test this.  I finally had a few minutes, but unfortunately, the permission on the files is not correct so I get "forbidden" when trying to download.  

Jason.

Comment 7 Jan Grulich 2015-05-12 07:16:52 UTC
It should be fixed now.

Comment 8 jas 2015-05-25 20:34:44 UTC
Hi Jan,

I spent a full work day debugging this issue.  First, before trying the patched grub, I wanted to verify that I could replicate the issue without the original hardware.  I kickstarted a virtual system with VirtualBox, and shared md for /boot and /.  I tested various failover strategies, and each time I replaced the disk and re-installed grub, I was able to boot successfully.  However, this wasn't using the LSI 9207-8i card that was installed in the original server.  Since the Illumos bug report had also made reference to the problem occurring with disks connected to an LSI HBA, I got some physical server hardware close enough to the model where I had originally discovered the problem, and I installed an LSI 9207-8i there.  I then thoroughly tested failover once again.  Fortunately, I was not able to make the problem occur.  This means that somewhere along the line, the problem has been corrected.  Sorry for my delay in testing.

Comment 9 David Kaspar // Dee'Kej 2016-05-12 15:56:33 UTC
Hello,

I just got to this BZ to investigate it more. But before I do, could you please tell me if this bug is still a problem for you?

Thank you!

David

Comment 10 jas 2016-05-12 16:24:40 UTC
Hi.
As per comment 8, I was not able to replicate it.  Therefore, the problem has already been resolved.

Thanks!

Comment 12 David Kaspar // Dee'Kej 2016-10-13 10:34:11 UTC
As per comment 10, the reporter is no longer facing this issue, and we're not able to reproduce it.