Red Hat Bugzilla – Bug 1259784
iSCSI boot Intel x540 server gets stuck in grub after RHEL-7.2 installation
Last modified: 2016-10-14 16:30:54 EDT
Created attachment 1069877 [details]
Description of problem:
System gets stuck in grub and cannot find boot device after successful installation of RHEL-7.2 and first reboot.
The problem occurs on an iSCSI boot server with an Intel x540 HBA.
RHEL-7.1 and RHEL-6.7 are OK
Steps to Reproduce:
Try to install RHEL-7.2 on the iSCSI boot machine with x540 HBA
stuck at grub>
Created attachment 1069878 [details]
Created attachment 1069879 [details]
Created attachment 1069880 [details]
Created attachment 1069881 [details]
Created attachment 1069882 [details]
Created attachment 1069883 [details]
What's in the grub config file?
I can't read the .cfg file for some reason:
grub> ls /grub2
themes/ device.map i386-pc/ locale/ fonts/ grubenv grub.cfg
grub> ls -l -a /grub2
DIR 20150911075111 ./
DIR 20150911075047 ../
DIR 20150911074718 themes/
64 20150911075031 device.map
DIR 20150911075032 i386-pc/
DIR 20150911075031 locale/
DIR 20150911075031 fonts/
1024 20150911075111 grubenv
grub> cat /grub2/grub.cfg
error: file `/grub2/grub.cfg' not found.
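A minimal sketch for comparing the two listings above, assuming they are pasted as plain text. It reports names that the short `ls` shows but the long `ls -l -a` listing does not. The function names are hypothetical helpers, not part of grub, and the pasted long listing may simply be truncated, so a mismatch is only a hint.

```python
def parse_short_listing(text):
    """Names from a plain grub `ls` line; trailing '/' marks a directory."""
    return {name.rstrip("/") for name in text.split()}


def parse_long_listing(text):
    """Map name -> size (None for directories) from grub `ls -l` lines."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # not a "<size|DIR> <timestamp> <name>" line
        kind, _stamp, name = parts
        name = name.strip().rstrip("/")
        if name in (".", ".."):
            continue
        if kind == "DIR":
            entries[name] = None
        elif kind.isdigit():
            entries[name] = int(kind)
    return entries


def missing_from_long_listing(short_text, long_text):
    """Names present in the short listing but absent from the long one."""
    return sorted(parse_short_listing(short_text) - set(parse_long_listing(long_text)))


if __name__ == "__main__":
    short = "themes/ device.map i386-pc/ locale/ fonts/ grubenv grub.cfg"
    long_text = """DIR 20150911075111 ./
DIR 20150911075047 ../
DIR 20150911074718 themes/
64 20150911075031 device.map
DIR 20150911075032 i386-pc/
DIR 20150911075031 locale/
DIR 20150911075031 fonts/
1024 20150911075111 grubenv"""
    print(missing_from_long_listing(short, long_text))  # ['grub.cfg']
```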
Can you attach the kickstart file you are using?
Created attachment 1074139 [details]
The bootloader is installed to /dev/sdb, with /boot on /dev/sdb1. Are you sure the system is booting from this disk? If it isn't, that could explain comment 10.
Created attachment 1074461 [details]
manual install console log
I can see /boot on /dev/sdb1 in grub, but I cannot read anything on /dev/sdb.
When I tried to boot it with root=/dev/sdb, it fell to emergency mode and there was no /dev/sdb.
When I run the installation manually via VNC and choose only the iSCSI disk, it boots successfully, although I had to reboot it manually after installation due to:
Reached target Shutdown.
dracut Warning: Killing all remaining processes
dracut Warning: Cannot umount /oldroot
dracut Warning: Blocking umount of /oldroot  /bin/umount/mnt/sysimage/boot-n
Not sure what the difference is.
(In reply to Martin Hoyer from comment #14)
Can you show us storage.log and program.log from a manual install that works?
Created attachment 1076302 [details]
updates.img with EDD patch from pjones
Also, could you give it a try with this updates.img and attach the storage.log from it? It includes some EDD changes and more logging.
Created attachment 1076367 [details]
Created attachment 1076368 [details]
Created attachment 1077225 [details]
Cleaned up console log.
Since the console log that's attached is pretty much the least readable log anybody has ever produced, here's a (somewhat) cleaned up version...
Okay, so this machine has the following PCI option ROMs loading things into the boot order:
SATA PCI option ROM (providing no disks)
HPSA PCI option ROM providing a local RAID device (sda)
Emulex PXE+UNDI PCI option ROM driving 2 controllers
Broadcom NetXtreme Option ROM driving at least one port
Intel iSCSI option ROM providing 2 paths to an EqualLogic iSCSI target (sdb)
It's booting through the Broadcom device, which is getting one pxelinux.cfg when booting the installer, and a different one when it's rebooted. On the reboot, it is booting pxelinux and then chaining to a local disk.
First, I just want to say that this is fundamentally not a reasonable way to configure a computer. There is basically *zero* chance that EDD will provide correct information: each option ROM that adds a disk adds its own EDD entry, and from the looks of it it's probably EDD 2.0 (i.e. the only useful part is the MBR signature). Each one also gets to decide where in the EDD list its entries go, whether or not that matches the boot order, so there's a high chance that firmware bugs make this useless, and there's really no way to know in what order the firmware will attempt to boot the disks. Additionally, there's no reason to believe that chaining through PXE to a hard disk in such a configuration will even be invariant, especially if the steps that occur before doing so vary.
Personally, I still think this is booting the wrong drive after the reboot on the non-manual install, and there's a leftover grub boot sector there.
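One way to test the "wrong drive with a leftover grub boot sector" theory is to look at the first sector of each candidate disk. The sketch below is an illustration only: it reads the 4-byte MBR disk signature at offset 440 (the value EDD reports as the MBR signature) and the 0x55AA marker at offset 510, and uses the presence of the ASCII string "GRUB" in the boot code area as a rough heuristic for a GRUB boot sector. The function names are hypothetical, and the "GRUB" string check is an assumption about what the installed boot code contains, not a guaranteed test.

```python
import struct

SECTOR_SIZE = 512
DISK_SIGNATURE_OFFSET = 440  # 4-byte little-endian MBR disk signature
BOOT_MARKER_OFFSET = 510     # 0x55 0xAA marks a bootable sector


def read_boot_sector(path):
    """Read the first sector of a disk or image (reading /dev/sdX needs root)."""
    with open(path, "rb") as f:
        return f.read(SECTOR_SIZE)


def inspect_boot_sector(sector):
    """Summarise the fields of an MBR sector that matter for this bug."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    (signature,) = struct.unpack_from("<I", sector, DISK_SIGNATURE_OFFSET)
    return {
        "has_boot_marker": sector[BOOT_MARKER_OFFSET:BOOT_MARKER_OFFSET + 2] == b"\x55\xaa",
        "disk_signature": f"0x{signature:08x}",
        # Heuristic (assumption): GRUB boot code embeds the string "GRUB".
        "mentions_grub": b"GRUB" in sector[:DISK_SIGNATURE_OFFSET],
    }
```

Running `inspect_boot_sector(read_boot_sector("/dev/sda"))` and the same for /dev/sdb would show which disks carry GRUB boot code. On Linux with the edd module loaded, the firmware-reported signatures are typically exposed under /sys/firmware/edd/int13_dev*/mbr_signature, so the two sets of values can be compared by hand.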
If there is any chance of figuring this out, we're going to need full logs for the successful manual install, including the console log for it and the ks.cfg that *is* being loaded, as well as an explanation of the manual test procedure in *excruciating* detail - which options you picked during booting, what menus you went through before anaconda started, etc.
So, as bcl asked in comment#19, please provide full logs of both a broken and working run with the updates.img that's attached, including ks.cfg and a console log for each of them.
It may also be helpful if you can test this without PXE chaining being involved at all. There's a reasonably strong chance that's effectively a random number generator.
Created attachment 1077282 [details]
demangled "manual" log
Demangled manual console log...
The manual console log attached here does not appear to be from the same system as the manual storage.log? The console log shows sda as an unpartitioned SATA disk connected to an ata_piix device.
Created attachment 1078334 [details]
So, with my manual install in *excruciating* detail:
I used kernel_options="vnc" and ks_meta="manual" in Beaker.
In storage I selected the network disk only: 50GiB iscsi-iqn.2001-05.com.equallogic:0-af1ff6-57d5d9bd9-d087097399a53f34-storageqe-82-boot-lun-0 with automatic partitioning.
I used an NFS source; software selection: minimal install with debugging tools.
Created attachment 1078335 [details]
automatic install with provided updates.img
(In reply to Peter Jones from comment #25)
> Okay, so this machine has the following PCI option ROMs loading things into
> the boot order:
> SATA PCI option ROM (providing no disks)
> HPSA PCI option ROM providing a local RAID device (sda)
> Emulex PXE+UNDI PCI option ROM driving 2 controllers
> Broadcom NetXtreme Option ROM driving at least one port
> Intel iSCSI option ROM providing 2 paths to an EqualLogic iSCSI target (sdb)
You're right, I should have familiarized myself with the machine config first.
I have disabled the SATA PCI and Emulex adapters, since they're not being used.
Boot order is now:
Broadcom NX (PXE)
Firmware and BIOS have been updated, including the x540.
The problem still persists. I end up in grub after installation every time.
I even tried disabling the HP Smart Array controller - same thing.
We've seen similar issues in our lab.
There are two failing scenarios so far:
1. Grub is loaded but falls back to command line (as described here).
2. Kernel panic with the messages pointing to problems with initramfs.
We were able to examine the target disk in the second scenario above and found out that:
1. The /boot filesystem was corrupted. Running fsck on the partition from another machine fixed a few issues, including cleaning some inodes and recovering the journal.
2. The initramfs on /boot had a file length of 0. Not a good sign. The rescue initramfs was usable for booting in rescue mode.
3. The / filesystem was corrupted as well, in a similar fashion to the /boot filesystem.
I suspect that in the failing scenario the filesystems were not correctly unmounted, hence the corruption of the filesystems themselves and of the initial ramdisk.
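For anyone checking a target disk for the zero-length initramfs symptom, here is a minimal sketch, assuming the target's /boot is mounted somewhere readable and uses the usual RHEL 7 file naming (initramfs-<version>.img, vmlinuz-<version>). The function name and the default patterns are illustrative assumptions.

```python
from pathlib import Path


def find_empty_boot_files(boot_dir, patterns=("initramfs-*.img", "vmlinuz-*")):
    """Return names of boot files under boot_dir whose size is 0 bytes."""
    boot = Path(boot_dir)
    empty = []
    for pattern in patterns:
        for path in sorted(boot.glob(pattern)):
            if path.is_file() and path.stat().st_size == 0:
                empty.append(path.name)
    return empty
```

Calling `find_empty_boot_files("/mnt/sysimage/boot")` right after installation, before the reboot, would show whether the file is already empty at that point or only loses its contents when the filesystem is not cleanly unmounted.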
Please note that we've seen these issues not only on Intel NICs, but also when PXE booting and attaching to the iSCSI target using a Broadcom NIC.
The iSCSI target used was the standard Linux target included in CentOS 7.2. Testing was performed with the plain PXE install image of RHEL Server 7.2 x86_64 using legacy boot. Platform: Dell R630 with all local hard disks detached.