Created attachment 1817570 [details]
BIOS boot order screenshot

Thanks for reporting your issue!

---

OCP Version at Install Time: 4.6
RHCOS Version at Install Time: 4.6
OCP Version after Upgrade (if applicable): 4.8.2
RHCOS Version after Upgrade (if applicable): 4.8.2
Platform: bare metal
Architecture: x86_64

What are you trying to do? What is your use case?

Re-install a host with a newer version of RHCOS.

What happened? What went wrong or what did you expect?

When re-installing RHCOS via a PXE/Ignition install on Dell PowerEdge R620 servers, we have a problem with the boot order. After installing the OS via PXE, when the host reboots, instead of booting from the HDD partition it re-executes the PXE installation. Before the installation, the BIOS has the HDD first in the boot order, followed by the NIC.

This is not an upgrade; it happens when we re-install, but only when we change the RHCOS version. In this case only, the installer puts the new installation's boot option last in the boot order, and the previous installation's entry is left in place, marked as 'Unavailable'. The only solution is to manually go into the BIOS and change the boot order. This does not occur if we reinstall the same RHCOS version over and over, only when we re-install a different RHCOS version, e.g. 4.4 -> 4.6 or 4.6 -> 4.8.

This affects us because we have automated re-install scripts at customer sites that fail when we need to uplift RHCOS/OCP. This sequence of events occurs when running our own installation scripts, which roughly mimic the official manual installation procedure.

What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.
We PXE boot our nodes using the following as our grub.cfg in our tftp folder:

set default=0
set timeout=5

menuentry 'Install RHCOS 4.8.2' --class fedora --class gnu-linux --class gnu --class os {
    linuxefi rhcos-4.8.2-x86_64-live-kernel-x86_64 ip=[2602:807:900e:105::170]::[2602:807:900e:105::2]:64:node-1.ts17.lab.kaloom.io:eno1:none nameserver=2602:807:900e:105::2 rd.neednet=1 vga=791 console=ttyS0 console=tty0 biosdevname=1 coreos.inst=yes coreos.inst.install_dev=/dev/sda coreos.live.rootfs_url=http://[2602:807:900e:105::2]/rhcos-4.8.2-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://[2602:807:900e:105::2]/fabric.ign
    initrdefi rhcos-4.8.2-x86_64-live-initramfs.x86_64.img
}
Created attachment 1817572 [details]
journalctl -b of the installation
Please fill out the entire form from the new bug template. coreos-installer is not used for RHCOS updates, so I'm not sure I understand your use case. What are those reinstall scripts for?
updated description
What's the problem you're trying to solve when re-installing RHCOS for updates? RHCOS is able to update itself and you do not need to reinstall using a newer version during OCP updates. It is recommended to only use the version used for the initial installation for node installation as the nodes will auto-update on first boot to the latest version.
Mind you, this is for our internal regression testing suite, where we re-install over and over via a Jenkins suite. On a daily basis, we have about six clusters where we run installation regression. This was typically manageable, but in this case we have two branches with two different RHCOS baselines (4.6/4.8). So we would like to be able to continue issuing our Jenkins installation builds, which can 'randomly' be 4.6 or 4.8 installations.
Setting to low Priority; this will have to be evaluated/addressed for a future release, but is unlikely to make 4.9. (There is no future release past 4.9 that can be targeted at this time)
I'm afraid that this could be an EFI firmware bug where the firmware remembers an older boot entry. RHCOS does not set up EFI boot entries and relies on the default paths for booting up.
RHCOS does create an EFI boot entry, but not in coreos-installer. On the first boot, shim is invoked as the fallback bootloader and it creates the entry: https://github.com/rhboot/shim/blob/main/README.fallback

On first boot, we currently randomize the disk's GUID and the filesystem UUIDs of / and /boot, but we do not randomize partition GUIDs. The image build process does, however, so each distinct OS release will have a different partition GUID for the ESP, and that GUID is included in the EFI boot variable. That seems like an awkward middle ground. We can't randomize the partition GUID on first boot (because it'll break the boot entry shim just created) and the current behavior (one GUID per OS revision) doesn't really make sense.

So one possible fix for this BZ is to hardcode a particular partition GUID for the ESP in all FCOS/RHCOS OS images, ensuring that reinstalls don't spawn additional boot entries. That will undoubtedly cause trouble if two disks have RHCOS installed, but there are already other reasons the OS won't boot successfully in that case.

Vincent, can you confirm that the firmware is not somehow configured to prioritize network boot over local boot? Also, could you post the output of `efibootmgr -v`?
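To illustrate what such duplicate entries look like, here is a small sketch (not taken from the affected host; the entry numbers, labels, and device paths below are made up) that flags repeated labels in `efibootmgr -v`-style output. On a real machine you would pipe the actual `efibootmgr -v` output into the awk filter instead of the sample text:

```shell
#!/bin/sh
# Illustrative sample of `efibootmgr -v` output; entry numbers, labels, and
# device paths are invented for this sketch. The inactive duplicate (no "*")
# is the entry the Dell BIOS shows as 'Unavailable'.
sample=$(printf 'BootOrder: 0003,0001\nBoot0001* Red Hat Enterprise Linux\tHD(2,GPT,...)\nBoot0002  Red Hat Enterprise Linux\tHD(2,GPT,...)\nBoot0003* Integrated NIC 1 Port 1\n')

# Entries whose label repeats are likely stale duplicates from a reinstall.
printf '%s\n' "$sample" | awk -F'\t' '
/^Boot[0-9]/ {
    num   = substr($1, 5, 4)            # e.g. "0002"
    label = $1
    sub(/^Boot....\*? */, "", label)    # strip the "BootNNNN* " prefix
    if (seen[label]++)
        print "duplicate boot entry Boot" num ": " label
}'
# Once identified, a stale entry can be deleted explicitly, e.g.:
#   efibootmgr -b 0002 -B    # double-check the entry number first
```

This only detects duplicates by label; comparing the partition GUIDs in the device paths would distinguish entries pointing at different ESPs, which is exactly the per-release GUID behavior described above.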
This is indeed a UEFI issue. We tried to automate as much as possible, but at some point there are cases where things have to be done manually, or handled per vendor. We have worked around this by using the Dell racadm utility to reset the boot order in those cases. Also, this is a lab issue on our side, not a product issue in the field. I will close this case. Thanks
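A rough sketch of that racadm workaround (assumptions, not from the report: the iDRAC address and the boot-device names "HardDisk.List.1-1" / "NIC.Integrated.1-1-1" are placeholders; confirm the real names with `racadm get BIOS.BiosBootSettings.BootSeq` on the actual hardware):

```shell
#!/bin/sh
# Dry-run sketch: restore the local disk ahead of the NIC in the BIOS boot
# sequence via iDRAC racadm. IDRAC address and device names are placeholders.
IDRAC="${IDRAC:-192.0.2.10}"
DRY_RUN="${DRY_RUN:-1}"     # default: print the commands instead of running

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "racadm -r $IDRAC $*"
    else
        racadm -r "$IDRAC" "$@"
    fi
}

# Put the local disk back ahead of the NIC, then queue a BIOS config job so
# the change is applied on the next power cycle.
run set BIOS.BiosBootSettings.BootSeq HardDisk.List.1-1,NIC.Integrated.1-1-1
run jobqueue create BIOS.Setup.1-1
```

Authentication options are omitted here; in practice racadm also needs `-u`/`-p` or a stored credential, and the queued BIOS job only takes effect after the server reboots.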
It's currently not an issue in the field because clusters always provision nodes using their original boot images. However, there are plans to change that, at which point reprovisioning of existing nodes will start spawning duplicate boot entries. I'll leave this issue closed, but I've opened https://github.com/coreos/fedora-coreos-tracker/issues/946 to track the problem upstream. Thanks for reporting.
*** Bug 1977983 has been marked as a duplicate of this bug. ***