Bug 1966129 - [4.9] Openshift Installer| UEFI mode | BM hosts have BIOS halted
Summary: [4.9] Openshift Installer| UEFI mode | BM hosts have BIOS halted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Bob Fournier
QA Contact: Lubov
Docs Contact: jfrye
URL:
Whiteboard:
Duplicates: 1970514 1976074
Depends On:
Blocks: 1970632 1971018 1972213 1973314 1976074 1976079
Reported: 2021-05-31 12:59 UTC by Nikita
Modified: 2021-10-18 17:32 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Release Note text: Previously, in UEFI mode, the `ironic-python-agent` created a UEFI bootloader entry after downloading the RHCOS image. When using an RHCOS image based on RHEL 8.4, the image could fail to boot using this entry. If the entry installed by Ironic was used when booting the image, the boot could fail and output a BIOS error screen. This is fixed by the `ironic-python-agent` configuring the boot entry based on a CSV file located in the image, instead of using a fixed boot entry. The image boots properly without error.
-------
Cause: In UEFI mode, after downloading the RHCOS image, the ironic-python-agent creates a UEFI bootloader entry. When using an RHCOS image based on RHEL 8.4, the image may fail to boot using this entry.
Consequence: Depending on the ordering of UEFI entries, if the entry installed by Ironic is used when booting the image the boot may fail and output a BIOS error screen.
Fix: The ironic-python-agent will configure the boot entry based on a CSV file located in the image instead of using a fixed boot entry.
Result: The image boots properly without an error.
Clone Of:
Clones: 1970632 1971014 1972213 1976074
Environment:
Last Closed: 2021-10-18 17:32:17 UTC
Target Upstream Version:
Embargoed:


Attachments
BIOS error (49.07 KB, image/png)
2021-05-31 12:59 UTC, Nikita
BIOS error second BM worker (55.55 KB, image/png)
2021-05-31 13:00 UTC, Nikita
Installer (97.27 KB, text/plain)
2021-06-02 08:07 UTC, Nikita
Ironic deploy ramdisk log from failed master deployment in Lubov's setup (12.35 MB, text/plain)
2021-06-02 14:38 UTC, Bob Fournier
BIOS halt console with RHEL 8.4 (147.92 KB, image/png)
2021-06-07 17:55 UTC, Bob Fournier
R740 BIOS settings when failure with RHEL 8.4 occurred (146.70 KB, image/png)
2021-06-07 17:56 UTC, Bob Fournier
boot video with BIOS halt from idrac (77.05 KB, application/octet-stream)
2021-06-08 00:21 UTC, Bob Fournier
loaderror.png, error after removing files from ESP (24.44 KB, image/png)
2021-06-10 16:14 UTC, Derek Higgins


Links
System ID Private Priority Status Summary Last Updated
Github openshift ironic-image pull 180 0 None closed Update python-ironic-lib 2021-06-16 09:06:01 UTC
Github openshift ironic-ipa-downloader pull 71 0 None closed Fix for UEFI bootloader entry 2021-06-16 09:06:04 UTC
OpenStack Storyboard 2008962 0 None None None 2021-06-10 18:35:12 UTC
OpenStack gerrit 795862 0 None MERGED Utilize CSV file for EFI loader selection 2021-06-11 11:22:35 UTC
OpenStack gerrit 795882 0 None MERGED Utilize CSV file for EFI loader selection 2021-06-16 09:05:56 UTC
OpenStack gerrit 795965 0 None MERGED Utilize CSV file for EFI loader selection 2021-06-16 09:05:59 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:32:52 UTC

Internal Links: 1976074

Description Nikita 2021-05-31 12:59:47 UTC
Created attachment 1788310 [details]
BIOS error

Version:

$ openshift-install version
4.8.0-0.nightly-2021-05-29-114625
built from commit 0629539ff309d5e2f0fe0d9253d92b0abfa36ddb
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:76dc7eb6793d6a1897abf573baa7b5dd4bab3a25d058cf4e5f0014815506cdfe

Platform:
baremetal


Please specify:
IPI

What happened?
After OCP installation:
The 3 virtual master nodes look OK and are up and running.
The 2 BM nodes show a BIOS crash. Screenshots are attached.


What did you expect to happen?

BM workers up and running. There is probably something wrong with the boot loader.

How to reproduce it (as minimally and precisely as possible)?

This issue is reproducible on BM workers with every 4.8 build after 4.8.0-fc.3. We first found it on the 4.8.0-fc.5 build. In our lab we use Dell PowerEdge R740 servers.

Comment 1 Nikita 2021-05-31 13:00:51 UTC
Created attachment 1788311 [details]
BIOS error second BM worker

Comment 2 Yuval Kashtan 2021-05-31 13:16:24 UTC
We're deploying on a mix of R640s and R740s,
and the issue shows up every time, but only on the R740s.

Comment 5 Bob Fournier 2021-06-01 15:56:24 UTC
I wonder if we're hitting an issue due to the RHEL 8.4 grub change - https://bugzilla.redhat.com/show_bug.cgi?id=1961784#c2.  It looks like Steve has an IPA patch up: https://review.opendev.org/c/openstack/ironic-python-agent/+/782885/

I'm not sure, but is the difference between 4.8.0-fc.3 (working) and 4.8.0-fc.5 (not working) that the latter is using RHEL 8.4?

Comment 6 Steve Baker 2021-06-01 20:31:39 UTC
I don't think you're hitting the grub2-install issue specifically. Can you supply the deploy logs for the affected nodes? Also is it possible secure boot has been switched on for these nodes?

Something else to consider: the NVRAM may need cleaning up. Can you wipe it and delete old UEFI boot entries via the BIOS?

Comment 7 Bob Fournier 2021-06-01 21:10:23 UTC
Can we get a must-gather for these failures with the ironic logs?

Comment 8 Steve Baker 2021-06-01 22:35:38 UTC
Recent RHEL 8.4 images have a change where the packages grub2-efi-x64 and shim-x64 are now preinstalled, which populates /boot/efi.

diskimage-builder creates an empty /boot/efi partition and then expects it to be populated by installing these packages, which won't happen if rpm believes they are already installed.

The fix for diskimage-builder is here[1]. If you're using another image building tool it may be hit by the same issue.

[1] https://review.opendev.org/c/openstack/diskimage-builder/+/786804

Comment 9 Steve Baker 2021-06-01 22:36:54 UTC
If you provide me a link to the image file you're deploying I can take a look at the /boot/efi layout

Comment 10 Nikita 2021-06-02 08:07:57 UTC
Created attachment 1788635 [details]
Installer

Comment 11 Nikita 2021-06-02 08:09:06 UTC
Hi Team,

I added openshift-installer logs to attachments

We use following images:
bootstrapOSImage: rhcos-48.84.202105190318-0-qemu.x86_64.qcow2.gz?sha256=84683a75c0e3d164c1d4a95448e142490a0bf91ff07076bff2b3bbc209c6c368#
clusterOSImage: rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz?sha256=37a156f9f2b0efded45cb3cd5688aa2d42c26873a534951484e96f546a6b2c84#

Comment 13 Bob Fournier 2021-06-02 14:38:36 UTC
Created attachment 1788709 [details]
Ironic deploy ramdisk log from failed master deployment in Lubov's setup

Comment 14 Bob Fournier 2021-06-02 14:55:04 UTC
I've added the ramdisk deploy logs from the bootstrap in Lubov's setup.  All 3 baremetal master nodes are halted at the BIOS screen.

Comment 15 Derek Higgins 2021-06-04 16:15:48 UTC
Investigating this on a machine that had the problem: the host had a lot of boot manager entries, most of them left behind by old RHCOS installations.
While removing them and rebooting each time, the machine sometimes booted fine and other times didn't.

There appear to be a few relevant points.
Whether the boot is successful or not appears to depend on which of these 3 boot manager entries was selected:

Boot0010* Red Hat Enterprise Linux      HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot001A* ironic1       HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI)
Boot0013* EFI RAID Disk PlaceHolder 2   PciRoot(0x1)/Pci(0x0,0x0)/Pci(0x0,0x0)/Ctrl(0x1)/SCSI(1,0)

Boot0013 gets added by the BIOS when "Hard-disk Drive Placeholder" is enabled.


I've seen Boot001A both fail and succeed to boot when selected. With Boot0010 placed at the start of the boot order, I haven't seen it fail.


Boot001A is the Ironic-created entry. The next step is to find out whether this is somehow different from older versions of RHCOS.

Comment 16 Bob Fournier 2021-06-04 16:36:09 UTC
Updated the summary, as this is only occurring on Dell R740s; the problem is not apparent on Dell R640s or other systems.  It can also occur on BM master nodes (not just workers).

The workaround of enabling "Hard Disk drive placeholder" in the iDRAC UI appears to work in the majority of cases, allowing the node and the cluster to boot successfully, but we have found times when it did not work.  Note that the "Hard Disk drive placeholder" field is specific to the R740; on the R640 it shows as permanently disabled and is not settable.  This "Hard Disk drive placeholder" setting is defined in [1] as:

"In certain instances, administrators may wish to reserve a boot entry for a fixed disk in the UEFI Boot Sequence before an OS is installed or before a physical or virtual drive has been formatted. When a HardDisk Drive Placeholder is set to Enabled, the BIOS will create a boot option for the PERC RAID (Integrated or in a PCIe slot) disk if a partition is found, even if there is no FAT filesystem present. When set to Disabled, BIOS will only add a boot option if a UEFI boot file is found. This allows the Integrated RAID controller to be moved in the UEFI Boot Sequence prior to the OS installation.

The Hard-disk Driver Placeholder default is disabled and can be changed with the following BIOS Attribute. It is available only in UEFI Boot Mode."

This problem has only been seen with 8.4 based rhcos starting with 4.8.0-fc.5.

[1] https://downloads.dell.com/manuals/all-products/esuprt_solutions_int/esuprt_solutions_int_solutions_resources/dell-management-solution-resources_white-papers12_en-us.pdf

Comment 17 Derek Higgins 2021-06-07 02:29:31 UTC
Investigating further, this looks like it could somehow be down to the filename of the bootloader.
I created several boot manager entries on a Dell R740:
Boot000C* ironic1       HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI) Fails
Boot000A* a test4       HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\shimx64.efi) OK
Boot000D* name t1       HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\t1.efi) OK
Boot000E* name t2       HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\T2.efi) OK
Boot000F* name t3       HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\T3.EFI) OK
Boot0011* name t4       HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI) Fails

The only entries that failed to boot were the ones that referenced the file BOOTX64.EFI.

All were using the exact same file:
[root@openshift-master-0 BOOT]# md5sum BOOTX64.EFI shimx64.efi t1.efi T2.efi T3.EFI
e149994b7a32b4a0b3f92a61b88d5eb8  BOOTX64.EFI
e149994b7a32b4a0b3f92a61b88d5eb8  shimx64.efi
e149994b7a32b4a0b3f92a61b88d5eb8  t1.efi
e149994b7a32b4a0b3f92a61b88d5eb8  T2.efi
e149994b7a32b4a0b3f92a61b88d5eb8  T3.EFI


I also tried changing the timestamps on BOOTX64.EFI (from 1979 to now), but this didn't help.
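
The identical-content check done with md5sum above can also be scripted. The following is only a minimal sketch: the helper names `group_by_digest` and `read_efi_files` are hypothetical, and SHA-256 is used here in place of md5sum purely for illustration. It groups file names by content digest, so that copies such as BOOTX64.EFI and shimx64.efi end up in the same group.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_digest(blobs):
    """Group names by the SHA-256 digest of their content.

    `blobs` maps a file name to its raw bytes. The result maps each
    digest to the sorted list of names that share that content.
    """
    groups = defaultdict(list)
    for name, data in blobs.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return {digest: sorted(names) for digest, names in groups.items()}


def read_efi_files(directory):
    """Read every *.efi / *.EFI file in `directory` (hypothetical helper)."""
    return {
        path.name: path.read_bytes()
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".efi"
    }


if __name__ == "__main__":
    # Stand-in contents, not the real loaders from the ESP.
    demo = {"BOOTX64.EFI": b"shim", "shimx64.efi": b"shim", "fbx64.efi": b"fallback"}
    for digest, names in group_by_digest(demo).items():
        print(digest[:12], names)
```

On a real host the input would come from something like `read_efi_files("/boot/efi/EFI/BOOT")`, where the mount point is an assumption about the system being inspected.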

Comment 18 Bob Fournier 2021-06-07 17:54:22 UTC
Providing a summary of the investigations we've done the last few days including duplicating the failure with a vanilla rhel 8.4 image:

- This problem was first observed in 4.8.0-fc.5 (4.8.0-fc.3 and earlier did not have this problem). The Dell R740s are widely used in our lab setups and no problems like the BIOS “red screen” have been seen before. We first investigated Ironic, assuming something may have changed with the bootloader that required changes to ironic-python-agent, perhaps something similar to https://bugzilla.redhat.com/show_bug.cgi?id=1961784. However, we were able to duplicate the same symptoms (passing in fc.3 and failing in fc.5) using the same version of Ironic and IPA. 1961784 also had different symptoms and was determined not to be related.

- We then focused on the Dell R740 configuration as this problem has not been seen on other systems including the Dell R640. The R740 has some different settings for UEFI, including “Hard-disk Drive Placeholder” which shows as a ReadOnly setting on the R640, permanently set to Disabled. We had some success setting this value to Enabled and were able to successfully boot dozens of times, while when it was Disabled, the failure with “red screen” would occur on the first few boots. However, we were finding that the R740 would inconsistently change the UEFI boot sequence (as seen in the iDRAC UI) and eventually the boot would again fail with the BIOS “red screen”.

- We tried to manipulate the order of the UEFI boot sequence both directly using efibootmgr on the host and also in the iDRAC UI to remove the entries in the UEFI boot sequence marked “Unavailable”. We again found some success with these configuration changes, but eventually it would fail with the “red screen” after a few reboots.

- We then tried a vanilla rhel image from http://download.eng.bos.redhat.com/released/rhel-8/RHEL-8/8.4.0/BaseOS/x86_64/images/rhel-guest-image-8.4-992.x86_64.qcow2 to rule out any rhcos issues.  We tried it on both worker-0 and worker-1.  After 4 successful deployments we got the “red screen” on the 5th try.  The BIOS configuration is all default settings, including Hard-disk Drive Placeholder set to Disabled. I’ve attached the screenshot of the console along with the BIOS settings when the failure occurred.

Comment 19 Bob Fournier 2021-06-07 17:55:28 UTC
Created attachment 1789272 [details]
BIOS halt console with RHEL 8.4

Comment 20 Bob Fournier 2021-06-07 17:56:31 UTC
Created attachment 1789273 [details]
R740 BIOS settings when failure with RHEL 8.4 occurred

Comment 21 Bob Fournier 2021-06-07 18:29:12 UTC
Moving this to the RHEL team to take a look, feel free to change the component if that's not correct. We can make an R740 available to test on.

Also note that secure boot is disabled.

Comment 25 Bob Fournier 2021-06-08 00:21:15 UTC
Created attachment 1789298 [details]
boot video with BIOS halt from idrac

Comment 27 Bob Fournier 2021-06-08 17:30:26 UTC
A couple updates:

- I tried using RHEL 8.3 - rhel-guest-image-8.3-401.x86_64.qcow2, on the same setup that was exhibiting the problem and I could not get the "red screen" problem to reproduce with multiple retries.
- Just yesterday (June 7) new versions of iDRAC (4.40.40.00) and BIOS (2.11.2) for the R740 were released. However, using RHEL 8.4, on the 2nd reboot the "red screen" occurred again. The error signature looks the same.

Comment 28 Michael Gourin 2021-06-09 14:11:02 UTC
Reproduced this issue on an HP ProLiant DL380 Gen10 baremetal worker (iLO - 10.19.28.23 - cnfdb4),
OCP version: 4.8.0-0.nightly-2021-06-09-023740.

Comment 29 Derek Higgins 2021-06-09 15:44:16 UTC
(In reply to Michael Gourin from comment #28)
> Reproduced this issue on an HP ProLiant DL380 Gen10 baremetal worker (iLO -
> 10.19.28.23 - cnfdb4),
> OCP version: 4.8.0-0.nightly-2021-06-09-023740.

We also see the same behaviour in this host with multiple bootmanager entries for the same partition

  "HD(2,GPT,3B6E914A-F943-4F30-9F2D-4CB92848D7EB,0x1000,0x3F800)/\\EFI\\redhat\\shimx64.efi",
  "HD(2,GPT,3B6E914A-F943-4F30-9F2D-4CB92848D7EB,0x1000,0x3F800)/\\EFI\\BOOT\\BOOTX64.EFI",

There are 2 entries in the boot manager for the same partition; one works and the other doesn't.

The entries point to two different files:
redhat\\shimx64.efi vs BOOT\\BOOTX64.EFI
Both files have the same contents (same md5sum); see below for more information on the EFI system partition.


[root@cnfdb4 mp]# find EFI/ -type f -exec ls -l {} \;
-rwxr-xr-x. 1 root root 86920 Jan  1  1980 EFI/BOOT/fbx64.efi
-rwxr-xr-x. 1 root root 924888 Jan  1  1980 EFI/BOOT/BOOTX64.EFI
-rwxr-xr-x. 1 root root 924888 Jan  1  1980 EFI/redhat/shimx64.efi
-rwxr-xr-x. 1 root root 2285512 Jan  1  1980 EFI/redhat/grubx64.efi
-rwxr-xr-x. 1 root root 182 Jan  1  1980 EFI/redhat/BOOTX64.CSV
-rwxr-xr-x. 1 root root 846856 Jan  1  1980 EFI/redhat/mmx64.efi
-rwxr-xr-x. 1 root root 918944 Jan  1  1980 EFI/redhat/shimx64-redhat.efi
-rwxr-xr-x. 1 root root 328 May 19 03:23 EFI/redhat/grub.cfg


[root@cnfdb4 mp]# find EFI/ -type f -exec md5sum {} \;
dc49e7cc629cf1e6be224378dd6db0f2  EFI/BOOT/fbx64.efi
e149994b7a32b4a0b3f92a61b88d5eb8  EFI/BOOT/BOOTX64.EFI
e149994b7a32b4a0b3f92a61b88d5eb8  EFI/redhat/shimx64.efi
070eec299df211fe9854f90afed097d7  EFI/redhat/grubx64.efi
b90ffff182e4b99380e6e4d2a9e33753  EFI/redhat/BOOTX64.CSV
76ea3d87e2c6df533e0b754f4baec768  EFI/redhat/mmx64.efi
22aae41e76d5feb2140e0df8b4653839  EFI/redhat/shimx64-redhat.efi
8f2892e2f05287773ace1f47f024f15f  EFI/redhat/grub.cfg


[root@cnfdb4 mp]# stat EFI/BOOT/BOOTX64.EFI
  File: EFI/BOOT/BOOTX64.EFI
  Size: 924888
Device: 812h/2066d
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:dosfs_t:s0
Access: 2021-06-09 00:00:00.000000000 +0000
Modify: 1980-01-01 00:00:00.000000000 +0000
Change: 2021-05-19 03:23:20.700000000 +0000
 Birth: -

[root@cnfdb4 mp]# stat EFI/redhat/shimx64.efi
  File: EFI/redhat/shimx64.efi
  Size: 924888
Device: 812h/2066d
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:dosfs_t:s0
Access: 2021-06-09 00:00:00.000000000 +0000
Modify: 1980-01-01 00:00:00.000000000 +0000
Change: 2021-05-19 03:23:20.700000000 +0000
 Birth: -

[root@cnfdb4 mp]# efibootmgr -v
BootCurrent: 0024
Timeout: 0 seconds
BootOrder: 0000,0024,0026,001E,0022,0021,0020,000A,000B,000E,0016,0010,0018,000F,0019,0017,0011,0012,0013,001A,001C,0015,001B,0014,001D,000C,000D,0001,0002,0003,0004,0005,0006,0007,0008,0009
Boot0000* System Utilities	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1fd631e5-44e0-2f91-10ab-f88f3568ef30)
Boot0001  Embedded UEFI Shell	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(c57ad6b7-0515-40a8-9d21-551652854e37)
Boot0002  Diagnose Error	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(0849279d-40d5-53ea-e764-2496766f9844)
Boot0003  Intelligent Provisioning	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(4a433501-ddaa-490b-96b2-04f42d8669b8)
Boot0004  Boot Menu	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(d3fd6286-43c5-bb8d-0793-07b70aa9de36)
Boot0005  Network Boot	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(0334f316-c276-49f6-9879-aaf1ecffa5d5)
Boot0006  View Integrated Management Log	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(93c92423-d1c6-4286-be67-b76b6671047e)
Boot0007  HTTP Boot	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3f770860-3d63-4803-9ea3-df37144ab546)
Boot0008  PXE Boot	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(ee8b26b0-37e9-11e1-b86c-0800200c9a66)
Boot0009  Embedded Diagnostics	FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(b57fe6f1-4f49-d46e-4bba-0a8add34d2f3)
Boot000A* Generic USB Boot	UsbClass(ffff,ffff,255,255)
Boot000B* Internal SD Card 1 : Generic USB3.0-CRW	PciRoot(0x0)/Pci(0x14,0x0)/USB(19,0)N.....YM....R,Y.
Boot000C* Embedded RAID 1 : HPE Smart Array P408i-a SR Gen10 - Size:3.4 TiB Port:2I Bay:7 Box:3	PciRoot(0x3)/Pci(0x0,0x0)/Pci(0x0,0x0)/SCSI(6,16384)N.....YM....R,Y.
Boot000D* Embedded RAID 1 : HPE Smart Array P408i-a SR Gen10 - Size:3.4 TiB Port:2I Bay:8 Box:3	PciRoot(0x3)/Pci(0x0,0x0)/Pci(0x0,0x0)/SCSI(7,16384)N.....YM....R,Y.
Boot000E* Embedded LOM 1 Port 1 : HPE Ethernet 1Gb 4-port 331i Adapter - NIC (HTTP(S) IPv4)	PciRoot(0x0)/Pci(0x1c,0x0)/Pci(0x0,0x0)/MAC(9440c9ed1f9a,1)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()N.....YM....R,Y.
Boot000F* Embedded LOM 1 Port 1 : HPE Ethernet 1Gb 4-port 331i Adapter - NIC (HTTP(S) IPv6)	PciRoot(0x0)/Pci(0x1c,0x0)/Pci(0x0,0x0)/MAC(9440c9ed1f9a,1)/IPv6([::]:<->[::]:,0,0)/Uri()N.....YM....R,Y.
Boot0010* Embedded LOM 1 Port 1 : HPE Ethernet 1Gb 4-port 331i Adapter - NIC (PXE IPv4)	PciRoot(0x0)/Pci(0x1c,0x0)/Pci(0x0,0x0)/MAC(9440c9ed1f9a,1)/IPv4(0.0.0.00.0.0.0,0,0)N.....YM....R,Y.
Boot0011* Embedded LOM 1 Port 1 : HPE Ethernet 1Gb 4-port 331i Adapter - NIC (PXE IPv6)	PciRoot(0x0)/Pci(0x1c,0x0)/Pci(0x0,0x0)/MAC(9440c9ed1f9a,1)/IPv6([::]:<->[::]:,0,0)N.....YM....R,Y.
Boot0012* Slot 1 Port 1 : Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 (HTTP(S) IPv4)	PciRoot(0x1)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(48df37bcf320,1)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()N.....YM....R,Y.
Boot0013* Slot 1 Port 1 : Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 (HTTP(S) IPv6)	PciRoot(0x1)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(48df37bcf320,1)/IPv6([::]:<->[::]:,0,0)/Uri()N.....YM....R,Y.
Boot0014* Slot 1 Port 1 : Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 (PXE IPv4)	PciRoot(0x1)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(48df37bcf320,1)/IPv4(0.0.0.00.0.0.0,0,0)N.....YM....R,Y.
Boot0015* Slot 1 Port 1 : Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 (PXE IPv6)	PciRoot(0x1)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(48df37bcf320,1)/IPv6([::]:<->[::]:,0,0)N.....YM....R,Y.
Boot0016* Embedded FlexibleLOM 1 Port 1 : HPE Ethernet 10Gb 2-port 562FLR-SFP+ Adapter - NIC (HTTP(S) IPv4)	PciRoot(0x3)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(48df37c2eb50,1)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()N.....YM....R,Y.
Boot0017* Embedded FlexibleLOM 1 Port 1 : HPE Ethernet 10Gb 2-port 562FLR-SFP+ Adapter - NIC (HTTP(S) IPv6)	PciRoot(0x3)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(48df37c2eb50,1)/IPv6([::]:<->[::]:,0,0)/Uri()N.....YM....R,Y.
Boot0018* Embedded FlexibleLOM 1 Port 1 : HPE Ethernet 10Gb 2-port 562FLR-SFP+ Adapter - NIC (PXE IPv4)	PciRoot(0x3)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(48df37c2eb50,1)/IPv4(0.0.0.00.0.0.0,0,0)N.....YM....R,Y.
Boot0019* Embedded FlexibleLOM 1 Port 1 : HPE Ethernet 10Gb 2-port 562FLR-SFP+ Adapter - NIC (PXE IPv6)	PciRoot(0x3)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(48df37c2eb50,1)/IPv6([::]:<->[::]:,0,0)N.....YM....R,Y.
Boot001A* Slot 4 Port 1 : Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 (HTTP(S) IPv4)	PciRoot(0x8)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(48df37bced54,1)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()N.....YM....R,Y.
Boot001B* Slot 4 Port 1 : Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 (HTTP(S) IPv6)	PciRoot(0x8)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(48df37bced54,1)/IPv6([::]:<->[::]:,0,0)/Uri()N.....YM....R,Y.
Boot001C* Slot 4 Port 1 : Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 (PXE IPv4)	PciRoot(0x8)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(48df37bced54,1)/IPv4(0.0.0.00.0.0.0,0,0)N.....YM....R,Y.
Boot001D* Slot 4 Port 1 : Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 (PXE IPv6)	PciRoot(0x8)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(48df37bced54,1)/IPv6([::]:<->[::]:,0,0)N.....YM....R,Y.
Boot001E* Red Hat Enterprise Linux	HD(2,GPT,07868584-8198-4878-a8e8-3dbe2ae2e582,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0020* Red Hat Enterprise Linux	HD(1,GPT,0a2b3da1-95e1-4d98-8d4d-fda3f514ace0,0x800,0xc0000)/File(\EFI\redhat\shimx64.efi)
Boot0021* Red Hat Enterprise Linux	HD(2,GPT,76dcc054-f4b7-4427-847b-b910f6d70444,0xc0800,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0022* Red Hat Enterprise Linux	HD(2,GPT,79b8235d-8dbe-4ea5-ad3d-109ec8bab7eb,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0023  Temporary Legacy Boot Option	BBS(255,Temporary Legacy Boot Option,0x0)
Boot0024* Red Hat Enterprise Linux	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0026* ironic1	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI)
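
With this many entries it is easy to miss that two of them target the same partition. The following is a minimal sketch, not part of any tool mentioned in this bug: it parses `efibootmgr -v` style text and reports partitions that are referenced by entries with different loader files. The function names and the regular expression are assumptions made for illustration, and only `HD(...)/File(...)` entries are considered.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Boot0026* ironic1  HD(2,GPT,<guid>,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI)
ENTRY_RE = re.compile(
    r"^Boot(?P<num>[0-9A-Fa-f]{4})(?P<active>\*?)\s+(?P<label>.+?)\s+"
    r"HD\((?P<part>\d+),GPT,(?P<guid>[0-9A-Fa-f-]+),[^)]*\)"
    r"/File\((?P<loader>[^)]*)\)"
)


def loaders_by_partition(efibootmgr_output):
    """Map partition GUID -> {boot number: loader path} for HD/File entries."""
    result = defaultdict(dict)
    for line in efibootmgr_output.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            result[match["guid"].lower()][match["num"].upper()] = match["loader"]
    return dict(result)


def partitions_with_multiple_loaders(efibootmgr_output):
    """Keep only partitions whose entries point at more than one loader file."""
    return {
        guid: entries
        for guid, entries in loaders_by_partition(efibootmgr_output).items()
        if len(set(entries.values())) > 1
    }


if __name__ == "__main__":
    # Three lines copied from the listing above (tabs replaced by spaces).
    sample = r"""
Boot001E* Red Hat Enterprise Linux HD(2,GPT,07868584-8198-4878-a8e8-3dbe2ae2e582,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0024* Red Hat Enterprise Linux HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0026* ironic1 HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI)
"""
    for guid, entries in partitions_with_multiple_loaders(sample).items():
        print(guid, entries)
```

For the listing above, the partition of interest is the one shared by Boot0024 and Boot0026.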

Comment 30 Derek Higgins 2021-06-09 15:46:36 UTC
(In reply to Derek Higgins from comment #29)
> [root@cnfdb4 mp]# efibootmgr -v
> BootCurrent: 0024
> Timeout: 0 seconds
> BootOrder:
> 0000,0024,0026,001E,0022,0021,0020,000A,000B,000E,0016,0010,0018,000F,0019,
> 0017,0011,0012,0013,001A,001C,0015,001B,0014,001D,000C,000D,0001,0002,0003,
> 0004,0005,0006,0007,0008,0009

Note: in order to boot and get this info, I switched boot entries 0024 and 0026 around,
i.e. 0026 is the one that doesn't work.

Comment 31 Derek Higgins 2021-06-09 16:42:41 UTC
Again, similar to the Dell hosts, copying the file and creating a new entry for it does work fine

[root@cnfdb4 BOOT]# cp BOOTX64.EFI BOOTX64_NEW.EFI
[root@cnfdb4 BOOT]# efibootmgr -c -d /dev/sdc -p 2 -w -L ironic2 -l '\EFI\BOOT\BOOTX64_NEW.EFI'
[root@cnfdb4 core]# init 6
                                                                                                                                                                         
[root@cnfdb4 core]# efibootmgr -v
BootCurrent: 001F
<snip/>
Boot001F* ironic2       HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI)

Comment 32 Steve Baker 2021-06-09 20:30:39 UTC
(In reply to Derek Higgins from comment #31)
> Again, similar to the Dell hosts, copying the file and creating a new entry
> for it does work fine
> 
> [root@cnfdb4 BOOT]# cp BOOTX64.EFI BOOTX64_NEW.EFI
> [root@cnfdb4 BOOT]# efibootmgr -c -d /dev/sdc -p 2 -w -L ironic2 -l
> '\EFI\BOOT\BOOTX64_NEW.EFI'
> [root@cnfdb4 core]# init 6

I have a couple of questions:
- what if you delete the ironic1 entry and create an ironic2 entry pointing at \EFI\BOOT\BOOTX64.EFI ?
- can you look through the deploy logs and find the 'efibootmgr -c' call for creating ironic1 and paste it here?

Comment 33 Steve Baker 2021-06-09 20:34:35 UTC
From the attached ironic-deploy-ramdisk-logs-cef9cd537130.log (a different machine):

efibootmgr -c -d /dev/sda -p 2 -w -L ironic1 -l \EFI\BOOT\BOOTX64.EFI

Comment 34 Steve Baker 2021-06-09 22:45:33 UTC
Another thought, how about modifying the image before deployment to delete /boot/efi/EFI/redhat/shimx64.efi, just to see what happens?

Comment 35 Derek Higgins 2021-06-10 13:12:35 UTC
(In reply to Steve Baker from comment #32)
> (In reply to Derek Higgins from comment #31)
> > Again, similar to the Dell hosts, copying the file and creating a new entry
> > for it does work fine
> > 
> > [root@cnfdb4 BOOT]# cp BOOTX64.EFI BOOTX64_NEW.EFI
> > [root@cnfdb4 BOOT]# efibootmgr -c -d /dev/sdc -p 2 -w -L ironic2 -l
> > '\EFI\BOOT\BOOTX64_NEW.EFI'
> > [root@cnfdb4 core]# init 6
> 
> I have a couple of questions:
> - what if you delete the ironic1 entry and create an ironic2 entry pointing
> at \EFI\BOOT\BOOTX64.EFI ?
> - can you look through the deploy logs and find the 'efibootmgr -c' call for
> creating ironic1 and paste it here?

On an HPE DL380:


I booted into IPA (where the ironic1 entry was originally created), deleted the ironic entry and recreated it.
The command was:
$ efibootmgr -c -d /dev/sdd -p 2 -w -L ironic1 -l '\EFI\BOOT\BOOTX64.EFI'
It booted fine from this entry.

At this stage I have tested several different entries (see OK/FAIL after each one). At one point I thought that only
entries created from IPA reproduced the problem, until some of them didn't.

Both IPA and RHCOS are using the same version of efibootmgr and kernel:
efibootmgr-16-1.el8.x86_64
Linux cnfdb4.clus2.t5g.lab.eng.bos.redhat.com 4.18.0-305.3.1.el8_4.x86_64 #1 SMP Mon May 17 10:08:25 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

I've also seen entries fail and then boot successfully later without being changed (other entries had been added).

One thing to note is that the device path of the HD where RHCOS was written sometimes changes between boots; I've seen it as
/dev/sdb, /dev/sdc and /dev/sdd between reboots.
This isn't new and has always happened with RHCOS on baremetal, so it may not be relevant, but I'm mentioning
it in case the boot entries are no longer able to tolerate this change.

Comment 36 Derek Higgins 2021-06-10 13:26:33 UTC
A summary of the entries I now have (HPE DL380) and whether they booted (OK) or not (FAIL). Some I booted multiple times; some that failed were OK when tried again later.

# This entry is created by IPA when it writes the image to disk; this is the entry that failed when we first noticed the problem
Boot0026* ironic1	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI) FAIL

# This entry is created when the OS first boots, I have never seen it fail (what creates it??)
Boot0024* Red Hat Enterprise Linux	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi) OK, OK

# These I created manually while running IPA, both failed
Boot001F* ironic_ipa_1	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI) FAIL
Boot0020* ironic_ipa_2	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) FAIL

# These I created manually while running RHCOS, all were created when RHCOS was on /dev/sdc, at least one of these (Boot0025)
# booted when the device name was /dev/sdb which rules out the device name changing being a problem
Boot001E* rhcos_1	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI) FAIL, FAIL, OK
Boot0021* rhcos_2	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) FAIL, OK OK OK
Boot0022* rhcos_3	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) OK
Boot0025* rhcos_4	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) OK (booted as /dev/sdb)
Boot0027* rhcos_5	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) OK
Boot0028* rhcos_6	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) OK
Boot0029* rhcos_7	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) OK
Boot002A* rhcos_8	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) OK
Boot002B* rhcos_9	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64_NEW.EFI) OK

# I then went back to IPA, deleted the ironic1 entry, recreated it and created an entry pointing to shimx64.efi;
# both booted OK
[root@cnfdb4 tmp]# efibootmgr -c -d /dev/sdd -p 2 -w -L ironic_ipa_3 -l '\EFI\redhat\shimx64.efi'
[root@cnfdb4 tmp]# efibootmgr -c -d /dev/sdd -p 2 -w -L ironic1 -l '\EFI\BOOT\BOOTX64.EFI'
Boot0026* ironic1	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI) OK
Boot002C* ironic_ipa_3	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi) OK

Comment 37 Lenny Szubowicz 2021-06-10 14:34:18 UTC
Just a point of information: The behavior of \EFI\BOOT\BOOTX64.EFI is different from \EFI\redhat\shimx64.efi

This is despite the fact that they are exactly the same image, as pointed out by comment 17 and others.

The RHEL shim boot loader looks at its name and file path and will do extra work if it's invoked as \EFI\BOOT\BOOTX64.EFI.[1]

The file path and file name \EFI\BOOT\BOOTX64.EFI are defined in the UEFI System Specification as the default boot loader file path that the firmware is to use if the UEFI boot variable omits them and the device is a block storage device.[2]

If the RHEL shim boot loader is invoked as \EFI\BOOT\BOOTX64.EFI, it assumes that the normal RHEL boot variable might have been damaged. Secondly, the boot loader normally expects to find grubx64.efi in the same directory as itself, but that's not the case with \EFI\BOOT\. Therefore, BOOTX64.EFI looks for \EFI\REDHAT\BOOT.CSV and a UEFI boot variable that conforms to the contents of the BOOT.CSV file. If necessary, a new UEFI boot variable is created that uses the info in \EFI\REDHAT\BOOT.CSV and it's placed first in the boot order.[3]

So, it looks like some bug, either in SHIMX64.EFI (aka BOOTX64.EFI) or in the UEFI firmware on the Dell R740, is being exposed when some of this extra work is being done.

                                                 -Lenny.



[1] I'm not 100% sure the different behavior requires a full match of \EFI\BOOT\BOOTX64.EFI or just \EFI\BOOT\

[2] X64 is the file name suffix for the x86_64 arch. Other strings are specified for different processor architectures.

[3] BOOTX64.EFI uses fbx64.efi to do some of this extra work. fbx64.efi is the "fall back" boot recovery image. I'm not sure exactly where the division of labor is. But that also means that BOOTX64.EFI has to look for and load fbx64.efi when it thinks it's necessary.
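
The path-dependent behavior described in this comment can be modeled roughly as follows. This is only an illustrative Python sketch, not shim's actual code: the function name should_use_fallback, the prefix check on \EFI\BOOT\, and the list-of-files representation of the ESP are assumptions made for the example (footnote [1] notes that the exact match rule is uncertain).

```python
# Illustrative model only; names and the matching rule are assumptions,
# not taken from shim's source.
FALLBACK_DIR = "\\EFI\\BOOT\\"   # default loader directory described above
FALLBACK_IMAGE = "fbx64.efi"     # "fall back" image named in footnote [3]


def should_use_fallback(invoked_path, esp_files):
    """Return True if the loader was started from the default boot
    directory and the fallback image sits next to it (assumed rule)."""
    # File names on a FAT-formatted ESP are case-insensitive,
    # so compare everything in upper case.
    if not invoked_path.upper().startswith(FALLBACK_DIR.upper()):
        return False
    present = {f.upper() for f in esp_files}
    return (FALLBACK_DIR + FALLBACK_IMAGE).upper() in present


esp = [
    "\\EFI\\BOOT\\BOOTX64.EFI",
    "\\EFI\\BOOT\\fbx64.efi",
    "\\EFI\\redhat\\shimx64.efi",
    "\\EFI\\redhat\\BOOTX64.CSV",
]
print(should_use_fallback("\\EFI\\BOOT\\BOOTX64.EFI", esp))
print(should_use_fallback("\\EFI\\redhat\\shimx64.efi", esp))
```

Under this model the same binary takes the extra recovery path only when it is started from \EFI\BOOT\, which is the distinction this comment draws between the two entries.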

Comment 38 Lenny Szubowicz 2021-06-10 14:42:10 UTC
Regarding comment 37, I should also add that \EFI\BOOT\BOOTX64.EFI is used if you are booting the installation image from a device with an ISO-9660 file system. But I'm assuming that's not the case here. Is that correct?

                         -Lenny.

Comment 39 Yuval Kashtan 2021-06-10 14:51:57 UTC
(In reply to Lenny Szubowicz from comment #38)
> Regarding comment 37, I should also add that \EFI\BOOT\BOOTX64.EFI is used
> if you are booting the installation image from a device with an ISO-9660
> file system. But I'm assuming that's not the case here. Is that correct?
when using virtual-media, that's exactly the case

Comment 40 Javier Martinez Canillas 2021-06-10 14:53:37 UTC
When booting \EFI\BOOT\BOOTX64.EFI, shim runs the fallback path so it seems the problem is there. This may be a duplicate of bug #1966973.

Comment 41 Javier Martinez Canillas 2021-06-10 15:04:07 UTC
(In reply to Javier Martinez Canillas from comment #40)
> Then booting \EFI\BOOT\BOOTX64.EFI, shim runs the fallback path so it seems
> the problem is there. This may be a duplicate of bug #1966973.

Scratch that. Peter mentioned that it's unlikely this is a duplicate of that bug.

Still, seems to be related to shim's fallback since it doesn't fail when using \EFI\redhat\shimx64.efi

Comment 42 Peter Jones 2021-06-10 15:04:49 UTC
Can someone tell me what version of the shim-x64 package is actually installed?

Comment 43 Peter Jones 2021-06-10 15:06:45 UTC
(In reply to Yuval Kashtan from comment #39)
> (In reply to Lenny Szubowicz from comment #38)
> > Regarding comment 37, I should also add that \EFI\BOOT\BOOTX64.EFI is used
> > if you are booting the installation image from a device with an ISO-9660
> > file system. But I'm assuming that's not the case here. Is that correct?
> when using virtual-media, that's exactly the case

So, given that, can you show me the full layout of the EFI System Partition?  i.e. the output of "find /boot/efi/ -type f" ?

Comment 44 Derek Higgins 2021-06-10 15:09:37 UTC
(In reply to Yuval Kashtan from comment #39)
> (In reply to Lenny Szubowicz from comment #38)
> > Regarding comment 37, I should also add that \EFI\BOOT\BOOTX64.EFI is used
> > if you are booting the installation image from a device with an ISO-9660
> > file system. But I'm assuming that's not the case here. Is that correct?
> when using virtual-media, that's exactly the case

Yes, IPA may have booted from virtmedia but iirc we've also seen this bug when PXE booting

and the entry we're having difficulty with is created by IPA;
it's the following boot we have problems with, i.e.

IPA runs (either by virtual media or PXE)
writes RHCOS to disk
runs "efibootmgr -c -d /dev/sdc -p 2 -w -L ironic1 -l '\EFI\BOOT\BOOTX64.EFI'"
reboot
ERROR
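
As a rough illustration of the efibootmgr step in the sequence above, the following Python sketch builds the same argument list. It is not IPA's actual code; the helper name efibootmgr_create_args is hypothetical, and only the flags that appear in the command quoted above are used.

```python
def efibootmgr_create_args(device, partition, label, loader):
    """Build the argv list for creating a UEFI boot entry.

    -c create an entry, -d disk holding the ESP, -p ESP partition number,
    -w write a disk signature if needed, -L label, -l loader path.
    """
    return [
        "efibootmgr", "-c",
        "-d", device,
        "-p", str(partition),
        "-w",
        "-L", label,
        "-l", loader,
    ]


# Values taken from the command quoted in this comment.
args = efibootmgr_create_args("/dev/sdc", 2, "ironic1", "\\EFI\\BOOT\\BOOTX64.EFI")
print(" ".join(args))
```

Passing the list to subprocess.run would execute it; the sketch only prints the command so it can be inspected without touching firmware variables.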

Comment 45 Derek Higgins 2021-06-10 15:14:26 UTC
(In reply to Peter Jones from comment #42)
> Can someone tell me what version of the shim-x64 package is actually
> installed?
shim-x64-15.4-2.el8_1.x86_64

(In reply to Peter Jones from comment #43)
> (In reply to Yuval Kashtan from comment #39)
> > (In reply to Lenny Szubowicz from comment #38)
> > > Regarding comment 37, I should also add that \EFI\BOOT\BOOTX64.EFI is used
> > > if you are booting the installation image from a device with an ISO-9660
> > > file system. But I'm assuming that's not the case here. Is that correct?
> > when using virtual-media, that's exactly the case
> 
> So, given that, can you show me the full layout of the EFI System Partition?
> i.e. the output of "find /boot/efi/ -type f" ?

[root@cnfdb4 tmp]# mount /dev/sdc2 /boot/efi 
[root@cnfdb4 tmp]# find /boot/efi -type f
/boot/efi/EFI/BOOT/fbx64.efi
/boot/efi/EFI/BOOT/BOOTX64.EFI
/boot/efi/EFI/redhat/shimx64.efi
/boot/efi/EFI/redhat/grubx64.efi
/boot/efi/EFI/redhat/BOOTX64.CSV
/boot/efi/EFI/redhat/mmx64.efi
/boot/efi/EFI/redhat/shimx64-redhat.efi
/boot/efi/EFI/redhat/grub.cfg

Comment 46 Derek Higgins 2021-06-10 15:26:42 UTC
(In reply to Derek Higgins from comment #45)

> [root@cnfdb4 tmp]# mount /dev/sdc2 /boot/efi 
> [root@cnfdb4 tmp]# find /boot/efi -type f
> /boot/efi/EFI/BOOT/fbx64.efi
> /boot/efi/EFI/BOOT/BOOTX64.EFI
> /boot/efi/EFI/redhat/shimx64.efi
> /boot/efi/EFI/redhat/grubx64.efi
> /boot/efi/EFI/redhat/BOOTX64.CSV
> /boot/efi/EFI/redhat/mmx64.efi
> /boot/efi/EFI/redhat/shimx64-redhat.efi
> /boot/efi/EFI/redhat/grub.cfg

Sorry, this is the contents of /boot/efi on the RHCOS image. Is it the IPA ISO you were looking for?

Comment 47 Javier Martinez Canillas 2021-06-10 15:30:53 UTC
Could you please test what happens when...

(In reply to Derek Higgins from comment #46)
> (In reply to Derek Higgins from comment #45)
> 
> > [root@cnfdb4 tmp]# mount /dev/sdc2 /boot/efi 
> > [root@cnfdb4 tmp]# find /boot/efi -type f
> > /boot/efi/EFI/BOOT/fbx64.efi

... you remove this file

> > /boot/efi/EFI/BOOT/BOOTX64.EFI
> > /boot/efi/EFI/redhat/shimx64.efi
> > /boot/efi/EFI/redhat/grubx64.efi
> > /boot/efi/EFI/redhat/BOOTX64.CSV

and this one.

> > /boot/efi/EFI/redhat/mmx64.efi
> > /boot/efi/EFI/redhat/shimx64-redhat.efi
> > /boot/efi/EFI/redhat/grub.cfg
> 
> sorry, This is the contents of /boot/efi on the RHCOS image, is it the IPA
> iso you were looking for?

I believe he asked for the files in the RHCOS image, so your answer was correct.

Comment 48 Derek Higgins 2021-06-10 16:13:28 UTC
(In reply to Javier Martinez Canillas from comment #47)
> Could you please test what happens when...


See the attached screenshot (loaderror.png): it threw an error and continued to the next entry in the boot list.

[root@cnfdb4 efi]# efibootmgr -v | grep -e 3b6e914a -e BootOrder -e BootCurr
BootCurrent: 001E
BootOrder: 0004,0006,0002,0009,0000,0026,000A,000B,000E,0016,0010,0018,000F,0019,0017,0011,0012,0013,001A,001C,0015,001B,0014,0003,0005,0007,0008,001D,000C,000D,0001
Boot0026* ironic1	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI)
[root@cnfdb4 efi]# find EFI/ -type f
EFI/BOOT/BOOTX64.EFI
EFI/redhat/shimx64.efi
EFI/redhat/grubx64.efi
EFI/redhat/mmx64.efi
EFI/redhat/shimx64-redhat.efi
EFI/redhat/grub.cfg
[root@cnfdb4 efi]# init 6


I removed the files above your comments; were these the correct files?
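
For anyone repeating this check, a small Python sketch for reading `efibootmgr -v` output like the one above may help. It is an illustrative parser written for this thread, not part of IPA or efibootmgr; the function name parse_efibootmgr is hypothetical and the sample text is a shortened copy of the output above (BootOrder truncated after 000A).

```python
import re


def parse_efibootmgr(output):
    """Split `efibootmgr -v` text into (boot_order, entries).

    boot_order is a list of 4-digit hex strings; entries maps each
    boot number to the rest of its line (label and device path).
    """
    boot_order = []
    entries = {}
    for line in output.splitlines():
        if line.startswith("BootOrder:"):
            numbers = line.split(":", 1)[1].split(",")
            boot_order = [n.strip().upper() for n in numbers if n.strip()]
            continue
        # Entry lines look like "Boot0026* label <device path>".
        match = re.match(r"Boot([0-9A-Fa-f]{4})\*?\s+(.*)", line)
        if match:
            entries[match.group(1).upper()] = match.group(2)
    return boot_order, entries


sample = r"""BootCurrent: 001E
BootOrder: 0004,0006,0002,0009,0000,0026,000A
Boot0026* ironic1 HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\BOOT\BOOTX64.EFI)"""

order, entries = parse_efibootmgr(sample)
position = order.index("0026")
print(position, entries["0026"])
```

The position in BootOrder shows how many entries the firmware tries before reaching ironic1.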

Comment 49 Derek Higgins 2021-06-10 16:14:14 UTC
Created attachment 1789954 [details]
loaderror.png, error after removing files from ESP

Comment 50 Dennis Gilmore 2021-06-10 16:32:21 UTC
Do the same issues occur when doing an install of 8.4 on the problematic hardware using anaconda? It looks like the issue is not related to OCP at all; it could possibly be a bug in or triggered by RHCOS if it is setting something up differently from anaconda.

Comment 51 Dmitry Tantsur 2021-06-10 17:04:19 UTC
This is what we're doing to set up EFI variables for booting: https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/extensions/image.py#L272-L274. This has worked for a couple of years already (and is also included in OSP Director, so it may be affected).

Comment 52 Bob Fournier 2021-06-10 18:35:12 UTC
Proposed fix to ironic-python-agent - https://review.opendev.org/c/openstack/ironic-python-agent/+/795862

Comment 53 Lenny Szubowicz 2021-06-10 18:43:12 UTC
(In reply to Dennis Gilmore from comment #50)
> Do the same issues occur when doing an install of 8.4 on the problematic
> hardware when doing an install using anaconda?  It looks like the issue is
> not related to OCP at all,  it could possibly be a bug in or triggered by
> RHCOS if it is setting something up differently to anaconda.

A normal RHEL8 install does not use efibootmgr to set \EFI\BOOT\BOOTX64.EFI as the boot loader, which is the case for the type of OpenShift install that runs into the issue in this BZ.

A normal RHEL8 install uses efibootmgr to set \EFI\redhat\shimx64.efi as the boot loader.

I think both cases should work. But the former is unusual and I don't understand why OpenShift needs to use \EFI\BOOT\BOOTX64.EFI vs. \EFI\redhat\shimx64.efi

I'm guessing that the same problem would likely occur if one did a normal RHEL8.4 install on a R740 and then manually used efibootmgr to specify \EFI\BOOT\BOOTX64.EFI.
However, I can't try that right now since there are no free Dell R740 systems in beaker which are set up to boot via UEFI.

                            -Lenny.

Comment 54 Peter Jones 2021-06-10 18:49:27 UTC
(In reply to Dmitry Tantsur from comment #51)
> This is what we're doing to set up EFI variables for booting:
> https://opendev.org/openstack/ironic-python-agent/src/branch/master/
> ironic_python_agent/extensions/image.py#L272-L274. This has worked for a
> couple of years already (and is also included in OSP Director, so it may be
> affected).

Okay, so let's make sure I understand the intent here.  If I understand correctly:

- this is a static, pre-configured image being installed onto a machine
- this code is then used to install the bootloader
- but this code has to first determine bios vs efi and what kind of image was installed, hence winload.efi and the arch options in your list

If those are all correct, that filesystem layout is not a good choice, and we should probably build different packaging to help avoid doing that.  The tl;dr is fbx64.efi shouldn't be installed, \EFI\redhat\shimx64* shouldn't be there, and everything else in \EFI\redhat should be in \EFI\BOOT.

That said, that's not the bug here, it's just how you're getting to it.  Is there any chance you can verify whether or not you see this problem with shim-15-16.el8 ?

Comment 55 Julia Kreger 2021-06-10 19:56:18 UTC
Following up on comment #51, I'm considering this an issue for OSP Director as well, which is why I paired up with Bob on a patch this morning to at least try to make IPA set an explicit filename and label matching the CSV file, so we don't end up in a situation where shim updates the list of EFI loaders again. Derek was apparently able to just update the label in the CSV file and reboot, and he encountered it updating the entry for the next reboot in the EFI bootloader list.

With regards to comment #53, I too think both cases *should* work, and I think it is generally fine that shim is updating the loader records. However, that is apparently impacting another team at this time, which is booting an RHCOS image without ironic or a formal deployment being involved, as their bootloader is reached via temporary virtual media usage. They attach via the BMC, they power on, and the OS discovers the virtual media. In the other team's case, they are apparently experiencing the EFI boot entry getting added when they boot from virtual media (I personally suspect they are booting a virtual USB drive as opposed to a virtual ISO, which is what ironic does for vmedia deploys). If what Derek indicated to me when I was talking with him earlier is correct, they then find the entry orphaned in the EFI boot records table when they reboot, and the machine attempts to boot to it. Derek also mentioned this was reproduced on an R640 now, and he was reproducing it on an HPE DL380 Gen ?10? just by using efibootmgr to change the default record. Adding NeedsInfo for Derek to confirm.

Peter, There are a couple different cases it seems, so it boils down to:

- Machine is network booted ramdisk or virtual media device attachment.
- An image is laid down on disk with the stock assets.
- In ironic's case, we attempt to find a bootloader and lean towards what hopefully would be the OS defaults that should be present if the EFI bootloader records table is corrupted or where the machine would otherwise try to auto-detect a bootloader. Ironic leans to being agnostic of what is there, hence winload being present. In the other case, I can't speak for that team, but they may just be rebooting and expecting the base firmware to discover the OS on the disk and boot it.

Comment 56 Benjamin Gilbert 2021-06-10 20:08:33 UTC
(In reply to Lenny Szubowicz from comment #53)
> I think both cases should work. But the former is unusual and I don't
> understand why OpenShift needs to use \EFI\BOOT\BOOTX64.EFI vs.
> \EFI\redhat\shimx64.efi

The RHCOS installer doesn't configure UEFI boot variables at all; we're relying on the UEFI default boot behavior.

Comment 57 Colin Walters 2021-06-10 20:15:36 UTC
Right, more generally for [F,RH]CoreOS we ship a dual BIOS+UEFI disk image that is booted directly without any massaging by anything external before that (in contrast to Anaconda).  There's a bit more on this here https://docs.fedoraproject.org/en-US/fedora-coreos/storage/#_disk_layout
It's just "dd to disk".

Comment 58 Bob Fournier 2021-06-10 20:30:46 UTC
Thanks for all the help with triage. Moving this back to the Ironic team to follow up on the patch to handle this in a general purpose way.

Comment 59 Julia Kreger 2021-06-10 20:40:20 UTC
Bob, it seems like there is clearly an issue inside of shim, I think if anything, this bug should get duplicated or new bugs created for any workarounds while the issue with shim is explored, since as others noted they are relying upon the default and that seems like a case where this is going to happen regardless of a pre-existing efibootmgr entry being added.

(For those unaware who are watching, we explicitly invoke efibootmgr in ironic as operators tend to install against machines with a number of storage controllers and devices, so they need the proper disk to boot, as they may have left operating systems on other disks which could get discovered OR the hardware doesn't explicitly support searching other disks by default.)

Comment 60 Bob Fournier 2021-06-10 21:04:49 UTC
Good point Julia. I've cloned this assigned to the shim team to keep all the info, changed the summary, and lowered the priority - https://bugzilla.redhat.com/show_bug.cgi?id=1970632.

Comment 61 Derek Higgins 2021-06-10 23:33:58 UTC
(In reply to Bob Fournier from comment #52)
> Proposed fix to ironic-python-agent -
> https://review.opendev.org/c/openstack/ironic-python-agent/+/795862

I've patched this into an IPA image on the HPE systems that reproduced the problem and it looks to be working as expected

The entry was created using the info in the csv file
Boot001F* Red Hat Enterprise Linux	HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)

I'll try it some more tomorrow to ensure there are no corner cases
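
To make the CSV-driven approach concrete, here is a minimal Python sketch of deriving a label and loader path from a boot CSV file. It is not the code from the proposed patch. Everything about the file format here is an assumption for illustration: UTF-16 encoding, a comma-separated line of loader file name, label, arguments and description, and the sample line itself, which was chosen to match the entry shown above. The function name entry_from_boot_csv is hypothetical.

```python
import csv
import io


def entry_from_boot_csv(raw, vendor_dir="\\EFI\\redhat"):
    """Return (label, loader_path) from the first usable CSV row.

    Assumes UTF-16 text (with or without a little-endian BOM) and the
    column order: loader file name, label, arguments, description.
    """
    encoding = "utf-16" if raw.startswith(b"\xff\xfe") else "utf-16-le"
    text = raw.decode(encoding)
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2 and row[0].strip():
            loader = vendor_dir + "\\" + row[0].strip()
            return row[1].strip(), loader
    raise ValueError("no usable entry found in boot CSV")


# Assumed example contents; a real file should be read from the ESP.
line = "shimx64.efi,Red Hat Enterprise Linux,,This is the boot entry for Red Hat Enterprise Linux\n"
sample = b"\xff\xfe" + line.encode("utf-16-le")

label, loader = entry_from_boot_csv(sample)
print(label, loader)
```

The resulting label and loader path could then be passed to efibootmgr with -L and -l instead of a fixed ironic1 / \EFI\BOOT\BOOTX64.EFI pair.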

(In reply to Julia Kreger from comment #55)
> With regards to comment #53, I too think both cases *should* work and I
> think it is generally fine that shim is updating the loader records, however
> that is apparently impacting another team at this time which is booting an
> RHCOS image separately from ironic or a formal deployment being involved as
> they their bootloader is reached via temporary virtual media usage. They
> attach via the BMC, they power on, OS discovers the virtual media. In the
> other team's case, they are apparently experiencing the EFI boot entry
> getting added when they boot from virtual media  (I personally suspect they
> are booting a virtual usb drive as opposed to a virtual iso (which is what
> ironic does for vmedia deploys)). If what Derek indicated to me when I was
> talking with him earlier to be correct, they then experience the entry
> orphaned in the EFI boot records table when they reboot and the machine
> attempts to boot to that.

The problem is that the shim-added boot entry in this case references a CDROM:
  "PciRoot(0x0)/Pci(0x1C,0x4)/Pci(0x0,0x4)/USB(0x1,0x0)/CDROM(0x1)/\\EFI\\redhat\\shimx64.efi",

Later, when a user wants to do a one-time boot from CDROM (not an RHCOS image),
this entry has higher priority than the generic CDROM entry, so the system
attempts to boot from it. When that fails (as the target no longer exists),
the system follows the regular BootOrder and boots to the HD.

I believe this is what's happening here
https://storyboard.openstack.org/#!/story/2008763

The only way we can get IPA to boot using the CD one-time boot option is to remove the
entry that was created by shim
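
A rough Python sketch of that cleanup, for illustration only: it picks out boot entries whose device path points at shimx64.efi on a CDROM and builds the matching efibootmgr delete commands (-b selects the boot number, -B deletes it). The helper names are hypothetical, and the boot number 0030 for the CDROM entry is made up for the example since the real number is not shown here.

```python
def stale_cdrom_shim_entries(entries):
    """Return boot numbers whose device path is shimx64.efi on a CDROM."""
    return sorted(
        number
        for number, path in entries.items()
        if "CDROM(" in path.upper() and "SHIMX64.EFI" in path.upper()
    )


def delete_commands(boot_numbers):
    """Build one `efibootmgr -b XXXX -B` argv list per boot number."""
    return [["efibootmgr", "-b", number, "-B"] for number in boot_numbers]


entries = {
    # Hypothetical boot number; device path as quoted above.
    "0030": r"PciRoot(0x0)/Pci(0x1C,0x4)/Pci(0x0,0x4)/USB(0x1,0x0)/CDROM(0x1)/\\EFI\\redhat\\shimx64.efi",
    # Hard-disk entry created from the CSV file, as shown earlier in this comment.
    "001F": r"HD(2,GPT,3b6e914a-f943-4f30-9f2d-4cb92848d7eb,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)",
}

stale = stale_cdrom_shim_entries(entries)
for command in delete_commands(stale):
    print(" ".join(command))
```

The hard-disk entry is left alone because its path has no CDROM node; only the virtual-media entry is selected for removal.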

> Derek also mentioned this was reproduced on an
> R640 now, and he was reproducing on an HPE DL380 Gen ?10? just by using
> efibootmgr to change the default record. Adding NeedsInfo for Derek to
> confirm.

Yes, we've now seen this same bug on a Dell R640 and an HPE ProLiant DL380 Gen10

(In reply to Bob Fournier from comment #60)
> Good point Julia. I've cloned this assigned to the shim team to keep all the
> info, changed the summary, and lowered the priority -
> https://bugzilla.redhat.com/show_bug.cgi?id=1970632.

I created bz#1970514 for the IPA workaround
https://bugzilla.redhat.com/show_bug.cgi?id=1970514

I thought this bug would remain assigned to shim to fix the root of the problem.
Either way, I guess we can close one of them

Comment 62 Tomas Sedovic 2021-06-11 11:21:06 UTC
*** Bug 1970514 has been marked as a duplicate of this bug. ***

Comment 63 Tomas Sedovic 2021-06-11 11:27:29 UTC
I've closed https://bugzilla.redhat.com/show_bug.cgi?id=1970514 as a duplicate of this bug.

We will use it to track the Ironic Python Agent workaround (https://review.opendev.org/795862).

The https://bugzilla.redhat.com/show_bug.cgi?id=1970632 clone is assigned to RHEL/shim to track the root cause of the issue.

Comment 65 Bob Fournier 2021-06-15 12:35:01 UTC
Changing target to 4.9, we'll clone another one for 4.8.

Comment 66 Bob Fournier 2021-06-15 22:36:12 UTC
Fix has merged to 4.9, moving to ON_QA.

Comment 69 Flavio Percoco 2021-06-25 07:04:24 UTC
*** Bug 1976074 has been marked as a duplicate of this bug. ***

Comment 72 errata-xmlrpc 2021-10-18 17:32:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

