Bug 1259784 - iSCSI boot Intel x540 server gets stuck in grub after RHEL-7.2 installation
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: anaconda
Version: 7.2
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Anaconda Maintenance Team
QA Contact: Martin Hoyer
Keywords: Regression, TestBlocker
Depends On:
Blocks:
Reported: 2015-09-03 10:17 EDT by Martin Hoyer
Modified: 2016-10-14 16:30 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-14 16:30:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
console.log (211.47 KB, text/plain), 2015-09-03 10:17 EDT, Martin Hoyer
anaconda.log (10.97 KB, text/plain), 2015-09-03 10:18 EDT, Martin Hoyer
storage.log (137.50 KB, text/plain), 2015-09-03 10:19 EDT, Martin Hoyer
sys.log (144.66 KB, text/plain), 2015-09-03 10:19 EDT, Martin Hoyer
program.log (76.33 KB, text/plain), 2015-09-03 10:20 EDT, Martin Hoyer
ifcfg.log (27.25 KB, text/plain), 2015-09-03 10:20 EDT, Martin Hoyer
packaging.log (146.20 KB, text/plain), 2015-09-03 10:21 EDT, Martin Hoyer
kickstart (18.43 KB, text/plain), 2015-09-16 13:34 EDT, Martin Hoyer
manual install console log (389.59 KB, text/plain), 2015-09-17 09:09 EDT, Martin Hoyer
updates.img with EDD patch from pjones (11.39 KB, application/x-gzip), 2015-09-23 16:46 EDT, Brian Lane
storage_manual (153.90 KB, text/plain), 2015-09-24 03:12 EDT, Martin Hoyer
program_manual (91.32 KB, text/plain), 2015-09-24 03:13 EDT, Martin Hoyer
Cleaned up console log. (115.64 KB, text/plain), 2015-09-25 15:46 EDT, Peter Jones
demangled "manual" log (245.93 KB, text/plain), 2015-09-25 17:51 EDT, Peter Jones
vnc_run_29.09.2015 (93.86 KB, application/x-gzip), 2015-09-29 09:32 EDT, Martin Hoyer
automatic_install_29.09.2015 (127.76 KB, application/x-gzip), 2015-09-29 09:33 EDT, Martin Hoyer

Description Martin Hoyer 2015-09-03 10:17:39 EDT
Created attachment 1069877 [details]
console.log

Description of problem:

After a successful RHEL-7.2 installation and the first reboot, the system gets stuck in grub and cannot find the boot device.

The problem occurs on an iSCSI boot server with an Intel x540 HBA.

RHEL-7.1 and RHEL-6.7 are OK.

Version-Release number of selected component (if applicable):
anaconda-21.48.22.29-1.el7
kernel-3.10.0-302.el7.x86_64.rpm

How reproducible:
100%

Steps to Reproduce:
Try to install RHEL-7.2 on the iSCSI boot machine with x540 HBA

Actual results:
stuck at grub>

Expected results:
boots successfully
Comment 1 Martin Hoyer 2015-09-03 10:18:33 EDT
Created attachment 1069878 [details]
anaconda.log
Comment 2 Martin Hoyer 2015-09-03 10:19:07 EDT
Created attachment 1069879 [details]
storage.log
Comment 3 Martin Hoyer 2015-09-03 10:19:47 EDT
Created attachment 1069880 [details]
sys.log
Comment 4 Martin Hoyer 2015-09-03 10:20:21 EDT
Created attachment 1069881 [details]
program.log
Comment 5 Martin Hoyer 2015-09-03 10:20:44 EDT
Created attachment 1069882 [details]
ifcfg.log
Comment 6 Martin Hoyer 2015-09-03 10:21:12 EDT
Created attachment 1069883 [details]
packaging.log
Comment 9 David Cantrell 2015-09-10 11:40:21 EDT
What's in the grub config file?
Comment 10 Martin Hoyer 2015-09-11 11:05:03 EDT
I can't read the .cfg file for some reason:
grub> ls /grub2      
themes/ device.map i386-pc/ locale/ fonts/ grubenv grub.cfg 
grub> ls -l -a /grub2
DIR          20150911075111 ./
DIR          20150911075047 ../
DIR          20150911074718 themes/
64           20150911075031 device.map
DIR          20150911075032 i386-pc/
DIR          20150911075031 locale/
DIR          20150911075031 fonts/
1024         20150911075111 grubenv

grub> cat /grub2/grub.cfg
error: file `/grub2/grub.cfg' not found.
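
The two listings above disagree: the short `ls` names grub.cfg, but the long `ls -l -a` listing has no entry for it, which is consistent with the `cat` failure. A minimal Python sketch of that comparison follows. The listing text is copied from the output above; the function names are hypothetical, and the parsing assumes file names without spaces.

```python
# Listings copied from the grub prompt output in this comment.
SHORT_LISTING = "themes/ device.map i386-pc/ locale/ fonts/ grubenv grub.cfg"
LONG_LISTING = """\
DIR          20150911075111 ./
DIR          20150911075047 ../
DIR          20150911074718 themes/
64           20150911075031 device.map
DIR          20150911075032 i386-pc/
DIR          20150911075031 locale/
DIR          20150911075031 fonts/
1024         20150911075111 grubenv
"""


def names_in_short(listing):
    """Names printed by plain `ls`: whitespace-separated on one line."""
    return set(listing.split())


def names_in_long(listing):
    """Names printed by `ls -l -a`: last field of each line, minus ./ and ../."""
    names = set()
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[-1] not in ("./", "../"):
            names.add(fields[-1])
    return names


def unreadable_entries(short_text, long_text):
    """Entries the directory names but the long listing could not describe."""
    return sorted(names_in_short(short_text) - names_in_long(long_text))


print(unreadable_entries(SHORT_LISTING, LONG_LISTING))  # ['grub.cfg']
```

Only grub.cfg is affected here; the other entries appear in both listings.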
Comment 11 David Cantrell 2015-09-16 11:08:31 EDT
Can you attach the kickstart file you are using?
Comment 12 Martin Hoyer 2015-09-16 13:34:07 EDT
Created attachment 1074139 [details]
kickstart
Comment 13 Brian Lane 2015-09-16 14:54:26 EDT
The bootloader is installed to /dev/sdb with /boot on /dev/sdb1. Are you sure the system is booting from this disk? If it isn't, that could explain comment 10.
Comment 14 Martin Hoyer 2015-09-17 09:09:43 EDT
Created attachment 1074461 [details]
manual install console log

I can see /boot on /dev/sdb1 in grub, but I cannot read anything on /dev/sdb.
When I tried to boot it with root=/dev/sdb, it dropped to emergency mode and there was no /dev/sdb.

When I run the installation manually via VNC and choose only the iSCSI disk, it boots successfully, although I had to reboot it manually after installation because of:
Reached target Shutdown.  
dracut Warning: Killing all remaining processes  
dracut Warning: Cannot umount /oldroot 
dracut Warning: Blocking umount of /oldroot [42092] /bin/umount/mnt/sysimage/boot-n 

I'm not sure what the difference is.
Comment 18 Peter Jones 2015-09-23 15:02:03 EDT
(In reply to Martin Hoyer from comment #14)
> Created attachment 1074461 [details]
> manual install console log
> 
> I can see /boot on /dev/sdb1 in grub, but I cannot read anything on /dev/sdb.
> When I tried to boot it with root=/dev/sdb, it dropped to emergency mode and
> there was no /dev/sdb.
> 
> When I run the installation manually via VNC and choose only the iSCSI disk,
> it boots successfully, although I had to reboot it manually after
> installation because of:
> Reached target Shutdown.  
> dracut Warning: Killing all remaining processes  
> dracut Warning: Cannot umount /oldroot 
> dracut Warning: Blocking umount of /oldroot [42092]
> /bin/umount/mnt/sysimage/boot-n 
> 
> I'm not sure what the difference is.

Can you show us storage.log and program.log from a manual install that works?
Comment 19 Brian Lane 2015-09-23 16:46 EDT
Created attachment 1076302 [details]
updates.img with EDD patch from pjones

Also, could you try an install with this updates.img and attach the resulting storage.log? It includes some EDD changes and more logging.
Comment 22 Martin Hoyer 2015-09-24 03:12 EDT
Created attachment 1076367 [details]
storage_manual
Comment 23 Martin Hoyer 2015-09-24 03:13 EDT
Created attachment 1076368 [details]
program_manual
Comment 24 Peter Jones 2015-09-25 15:46 EDT
Created attachment 1077225 [details]
Cleaned up console log.

Since the console log that's attached is pretty much the least readable log anybody has ever produced, here's a (somewhat) cleaned up version...
Comment 25 Peter Jones 2015-09-25 16:32:07 EDT
Okay, so this machine has the following PCI option ROMs loading things into the boot order:

SATA PCI option ROM (providing no disks)
HPSA PCI option ROM providing a local RAID device (sda)
Emulex PXE+UNDI PCI option ROM driving 2 controllers
Broadcom NetXtreme Option ROM driving at least one port
Intel iSCSI option ROM providing 2 paths to an EqualLogic iSCSI target (sdb)

It's booting through the Broadcom device, which is getting one pxelinux.cfg when booting the installer, and a different one when it's rebooted.  On the reboot, it is booting pxelinux and then chaining to a local disk.

First, I just want to say that this is fundamentally not a reasonable way to configure a computer.  There is basically *zero* chance that EDD will provide correct information: each option ROM that adds a disk adds its own EDD entry, and from the looks of it, it's probably EDD 2.0 (i.e. the only useful part is the MBR signature). Each option ROM also basically gets to determine where in the EDD list its entries go, whether that matches the boot order or not.  So there's a high chance that firmware bugs make this useless, and there's really no way to know in what order the firmware will attempt to boot the disks.  Additionally, there's no reason to believe that chaining through PXE to a hard disk in such a configuration will even be invariant, especially if there is variation in the steps that occur before doing so.

Personally, I still think this is booting the wrong drive after the reboot on the non-manual install, and there's a leftover grub boot sector there.

If there is any chance of figuring this out, we're going to need full logs for the successful manual install, including the console log for it and the ks.cfg that *is* being loaded, as well as an explanation of the manual test procedure in *excruciating* detail - which options you picked during booting, what menus you went through before anaconda started, etc.
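
To illustrate the point about the MBR signature being the only useful part of an EDD 2.0 entry, here is a minimal Python sketch of signature matching. It assumes the kernel's edd module exposes BIOS drives under /sys/firmware/edd/int13_dev*/mbr_signature and that each disk carries a 4-byte signature at offset 440 of its first sector. The function names are hypothetical, and this is not the code anaconda or the attached updates.img uses.

```python
import struct
from pathlib import Path

MBR_SIG_OFFSET = 440  # 4-byte disk signature in a classic MBR


def read_mbr_signature(device_path):
    """Return the 32-bit little-endian signature stored at offset 440."""
    with open(device_path, "rb") as f:
        f.seek(MBR_SIG_OFFSET)
        raw = f.read(4)
    if len(raw) != 4:
        raise ValueError("too short to contain an MBR signature")
    return struct.unpack("<I", raw)[0]


def read_edd_signatures(edd_root="/sys/firmware/edd"):
    """Map BIOS drive names (int13_dev80, ...) to their mbr_signature value."""
    sigs = {}
    for entry in sorted(Path(edd_root).glob("int13_dev*")):
        sig_file = entry / "mbr_signature"
        if sig_file.exists():
            # The file holds a hex string such as "0x000a1b2c".
            sigs[entry.name] = int(sig_file.read_text().strip(), 16)
    return sigs


def match_bios_drives(edd_sigs, disk_sigs):
    """Return {bios_drive: [disks whose signature matches]}.

    An empty list or more than one match means the BIOS drive cannot be
    mapped to a Linux disk by signature alone.
    """
    return {
        bios: sorted(d for d, s in disk_sigs.items() if s == sig)
        for bios, sig in edd_sigs.items()
    }
```

Even when every signature matches exactly one disk, this only says which BIOS drive number corresponds to which disk; it says nothing about which drive the firmware will actually boot after a PXE chain.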
Comment 26 Peter Jones 2015-09-25 16:35:24 EDT
So, as bcl asked in comment#19, please provide full logs of both a broken and working run with the updates.img that's attached, including ks.cfg and a console log for each of them.

It may also be helpful if you can test this without PXE chaining being involved at all.  There's a reasonably strong chance that's effectively a random number generator.
Comment 28 Peter Jones 2015-09-25 17:51 EDT
Created attachment 1077282 [details]
demangled "manual" log

Demangled manual console log...
Comment 29 Peter Jones 2015-09-25 17:54:25 EDT
The manual console log attached here does not appear to be from the same system as the manual storage.log. The console log shows sda as an unpartitioned SATA disk connected to an ata_piix device.
Comment 30 Martin Hoyer 2015-09-29 09:32 EDT
Created attachment 1078334 [details]
vnc_run_29.09.2015

So, here is my manual install in *excruciating* detail:
I used kernel_options="vnc" and ks_meta="manual" in Beaker.
In storage I selected the network disk only: 50GiB iscsi-iqn.2001-05.com.equallogic:0-af1ff6-57d5d9bd9-d087097399a53f34-storageqe-82-boot-lun-0 with automatic partitioning.
I used an NFS source; software selection: minimal install with debugging tools.
Kdump was enabled.

Logs attached
Comment 31 Martin Hoyer 2015-09-29 09:33 EDT
Created attachment 1078335 [details]
automatic_install_29.09.2015

automatic install with provided updates.img
Comment 33 Martin Hoyer 2015-11-13 10:03:39 EST
(In reply to Peter Jones from comment #25)
> Okay, so this machine has the following PCI option ROMs loading things into
> the boot order:
> 
> SATA PCI option ROM (providing no disks)
> HPSA PCI option ROM providing a local RAID device (sda)
> Emulex PXE+UNDI PCI option ROM driving 2 controllers
> Broadcom NetXtreme Option ROM driving at least one port
> Intel iSCSI option ROM providing 2 paths to an EqualLogic iSCSI target (sdb)

You're right, I should have familiarized myself with the machine config first.
I have disabled the SATA PCI and Emulex adapters, since they're not being used.
Boot order is now:
Broadcom NX (PXE)
x540 device
HPSA device

Firmware and BIOS have been updated, including the x540.

The problem still persists: the system ends up in grub after installation every time.

I even tried disabling the HP Smart Array controller, with the same result.
Comment 34 Tomasz Kepczynski 2016-03-24 07:53:15 EDT
We've seen similar issues in our lab.

There are two failing scenarios so far:
1. Grub is loaded but falls back to command line (as described here).
2. Kernel panic with the messages pointing to problems with initramfs.

We were able to examine the target disk in the second scenario above and found out that:
1. /boot file system was corrupted. Running fsck from other machine on the partition fixed a few issues including cleaning some inodes and recovering journal.
2. initramfs on /boot had file length of 0. Not a good sign. Rescue initramfs was usable for boot in rescue mode.
3. / filesystem was corrupted as well and in similar fashion as /boot filesystem.

I suspect that in the failing scenario the filesystems were not correctly unmounted, hence the corruption of the filesystems themselves and of the initial ramdisk.
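
The zero-length initramfs described in item 2 above can be checked for mechanically once the target's /boot is mounted on another machine. A minimal Python sketch follows; the function name and the `initramfs-` / `vmlinuz-` prefixes are assumptions based on the usual RHEL 7 /boot naming, and the check is no substitute for running fsck on the partition.

```python
from pathlib import Path


def find_empty_boot_images(boot_dir="/boot"):
    """Return names of initramfs/vmlinuz files in boot_dir that are 0 bytes long."""
    suspects = []
    for path in sorted(Path(boot_dir).iterdir()):
        # Only regular files with the assumed kernel/initramfs prefixes are checked.
        if path.is_file() and path.name.startswith(("initramfs-", "vmlinuz-")):
            if path.stat().st_size == 0:
                suspects.append(path.name)
    return suspects


if __name__ == "__main__":
    for name in find_empty_boot_images():
        print("zero-length boot image:", name)
```

Pass the mount point of the rescued /boot partition as `boot_dir` when inspecting the disk from another machine.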

Please note that we've seen these issues not only on Intel NICs, but when PXE booting and attaching to iSCSI target using Broadcom NIC as well.

The iSCSI target used was the standard Linux target as included in CentOS 7.2. Testing was performed with a plain PXE install image of RHEL Server 7.2 x86_64 using legacy boot. Platform: Dell R630 with all local hard disks detached.
