Bug 1997805 - coreos install boot order
Summary: coreos install boot order
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Benjamin Gilbert
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1977983 (view as bug list)
Depends On:
Blocks:
 
Reported: 2021-08-25 20:09 UTC by Vincent Rouleau
Modified: 2021-09-08 14:11 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-31 17:19:11 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
BIOS boot order screenshot (56.31 KB, image/jpeg)
2021-08-25 20:09 UTC, Vincent Rouleau
journal -b of installation (396.73 KB, text/plain)
2021-08-25 20:10 UTC, Vincent Rouleau


Links
System ID Private Priority Status Summary Last Updated
Github coreos fedora-coreos-tracker issues 946 0 None None None 2021-08-31 19:53:29 UTC

Description Vincent Rouleau 2021-08-25 20:09:22 UTC
Created attachment 1817570 [details]
BIOS boot order screenshot

Thanks for reporting your issue!


---


OCP Version at Install Time: 4.6
RHCOS Version at Install Time: 4.6
OCP Version after Upgrade (if applicable): 4.8.2
RHCOS Version after Upgrade (if applicable): 4.8.2
Platform:  bare metal
Architecture: x86_64


What are you trying to do? What is your use case?
Re-install a host with a newer version of RHCOS

What happened? What went wrong or what did you expect?


When re-installing RHCOS via a PXE/Ignition install on Dell PowerEdge R620 servers, we have a problem with the boot order. After installing the OS via PXE, when the host reboots, instead of booting off the HDD partition it re-executes the PXE installation. Before the installation, the BIOS has the HDD first in the boot order, followed by the NIC. This is not an upgrade; it happens when we re-install, but only when we change the RHCOS version. In that case only, the installer puts the new installation's boot option last in the boot order, and the previous installation's entry is left in place, marked as 'Unavailable'. The only fix is to go into the BIOS manually and change the boot order.

This does not occur if we reinstall the same RHCOS version over and over, only when we re-install a different RHCOS version, e.g. 4.4 -> 4.6 or 4.6 -> 4.8. This affects us because we have automated re-install scripts at customer sites that fail when we need to uplift RHCOS/OCP. This sequence of events occurs when running our own installation scripts, which somewhat mimic the official manual installation procedure.


What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.
We PXE boot our nodes using the following as our grub.cfg in our tftp folder:

set default=0
set timeout=5
menuentry 'Install RHCOS 4.8.2' --class fedora --class gnu-linux --class gnu --class os {
   linuxefi rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=[2602:807:900e:105::170]::[2602:807:900e:105::2]:64:node-1.ts17.lab.kaloom.io:eno1:none nameserver=2602:807:900e:105::2 rd.neednet=1 vga=791 console=ttyS0 console=tty0 biosdevname=1 coreos.inst=yes coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://[2602:807:900e:105::2]/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://[2602:807:900e:105::2]/fabric.ign
   initrdefi rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}

Comment 1 Vincent Rouleau 2021-08-25 20:10:13 UTC
Created attachment 1817572 [details]
journal -b of installation

Comment 2 Timothée Ravier 2021-08-26 11:17:54 UTC
Please fill out the entire form from the new bug template.

coreos-installer is not used for RHCOS updates, so I'm not sure I understand your use case.

What are those reinstall scripts for?

Comment 3 Vincent Rouleau 2021-08-26 12:05:38 UTC
updated description

Comment 4 Timothée Ravier 2021-08-26 16:25:57 UTC
What's the problem you're trying to solve when re-installing RHCOS for updates?
RHCOS is able to update itself and you do not need to reinstall using a newer version during OCP updates.
It is recommended to keep using the boot image from the initial installation for node installs, as the nodes will auto-update to the latest version on first boot.

Comment 5 Vincent Rouleau 2021-08-26 16:54:30 UTC
Mind you, this is for our internal regression testing suite, where we re-install over and over via a Jenkins suite. On a daily basis we have about 6 clusters where we run installation regression. This was typically manageable, but in this case we have 2 branches with 2 different RHCOS baselines (4.6/4.8). So we would like to continue issuing our Jenkins installation builds, which can 'randomly' be a 4.6 or a 4.8 installation.

Comment 6 Micah Abbott 2021-08-27 19:12:50 UTC
Setting to low Priority; this will have to be evaluated/addressed for a future release, but is unlikely to make 4.9.  (There is no future release past 4.9 that can be targeted at this time)

Comment 7 Timothée Ravier 2021-08-30 14:32:16 UTC
I'm afraid this could be an EFI firmware bug where the firmware remembers an older boot entry.
RHCOS does not set up EFI boot entries and relies on the default paths for booting up.

Comment 8 Benjamin Gilbert 2021-08-31 05:38:20 UTC
RHCOS does create an EFI boot entry, but not in coreos-installer.  On the first boot, shim is invoked as the fallback bootloader and it creates the entry: https://github.com/rhboot/shim/blob/main/README.fallback

On first boot, we currently randomize the disk's GUID and the filesystem UUIDs of / and /boot, but we do not randomize partition GUIDs. The image build process does, however, so each distinct OS release will have a different partition GUID for the ESP, and that GUID is included in the EFI boot variable. That seems like an awkward middle ground: we can't randomize the partition GUID on first boot (because it would break the boot entry shim just created), and the current behavior (one GUID per OS revision) doesn't really make sense.

So one possible fix for this BZ is to hardcode a particular partition GUID for the ESP in all FCOS/RHCOS OS images, ensuring that reinstalls don't spawn additional boot entries. That will undoubtedly cause trouble if two disks have RHCOS installed, but there are already other reasons the OS won't boot successfully in that case.

Vincent, can you confirm that the firmware is not somehow configured to prioritize network boot over local boot?  Also, could you post the output of `efibootmgr -v`?
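
[Editor's note, not part of the original report: the partition GUIDs and boot entries Benjamin describes can be inspected with standard tools. A sketch, assuming the install disk is /dev/sda (adjust for your layout); both commands are read-only but need root on an EFI-booted system:]

```shell
# Show each partition's GUID (PARTUUID) and label; the ESP's partition
# GUID is what shim's fallback path embeds in the firmware boot entry.
lsblk -o NAME,PARTLABEL,PARTUUID /dev/sda

# Dump the firmware boot variables. After reinstalling a different RHCOS
# build, two OS entries may appear, each pointing at a different ESP
# partition GUID; the stale one shows up as unbootable in the BIOS menu.
efibootmgr -v
```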

Comment 9 Vincent Rouleau 2021-08-31 17:19:11 UTC
This is indeed a UEFI issue. We tried to automate as much as possible, but at some point there are cases where things have to be done manually, or handled per vendor. We have worked around this by using the Dell utility racadm to reset the boot order in those cases. Also, this is a lab issue on our side, not a product issue in the field. I will close this case. Thanks
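
[Editor's note, not part of the original report: the racadm workaround above can be scripted. A sketch only; attribute and device names vary by iDRAC generation and NIC layout, so the values below are assumptions to be checked against `racadm get BIOS.BiosBootSettings` on the target:]

```shell
# Inspect the current BIOS boot sequence via the iDRAC.
racadm get BIOS.BiosBootSettings.BootSeq

# Put the local disk ahead of the integrated NIC (device names are
# examples; use the names reported by the 'get' above).
racadm set BIOS.BiosBootSettings.BootSeq HardDisk.List.1-1,NIC.Integrated.1-1-1

# Stage a BIOS configuration job and power-cycle the host to apply it.
racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW
```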

Comment 10 Benjamin Gilbert 2021-08-31 19:53:29 UTC
It's currently not an issue in the field because clusters always provision nodes using their original boot images.  However, there are plans to change that, at which point reprovisioning of existing nodes will start spawning duplicate boot entries.  I'll leave this issue closed, but I've opened https://github.com/coreos/fedora-coreos-tracker/issues/946 to track the problem upstream.  Thanks for reporting.

Comment 11 Prashanth Sundararaman 2021-09-08 14:11:18 UTC
*** Bug 1977983 has been marked as a duplicate of this bug. ***

