Description of problem:
coreos-install with the fc.2 build for 4.7 fails silently. The installation completes, but on the next boot no bootable images are found in the petitboot menu.

Version-Release number of selected component (if applicable):
4.7
[ 2984.289336] coreos-installer-service[2762]: Installing Red Hat Enterprise Linux CoreOS 47.83.202101091312-0 (Ootpa) ppc64le (4096-byte sectors)

How reproducible:
Consistently

Steps to Reproduce:
1. PXE boot a bare-metal system from a server with the 4.7 fc.2 RHCOS images.
2. Wait for the CoreOS install to complete.
3. The next boot will not find a boot image on disk.

Actual results:
No boot images found on disk.

Expected results:
CoreOS image expected on disk.

Additional info:
@rravanel Would you be able to test PXE booting RHCOS 4.7 with a ppc64le system? @Manoj could you provide the journal from the system during the install? And from when the system attempts to boot?
There were some partition layout changes that might have caused this; it's also worth checking the bootloader configs. But we don't have local expertise on ppc64le booting, so it'd be helpful if a ppc64le SME could dig into this. It'd also be useful to know in which build this stopped working.
Note that, according to @Manoj, the 47.83.202012070110-0 build worked without issues. There are a lot of package changes after this build, plus a kernel change, which makes me think this could also be RHEL related. It would be good to also try booting with the latest RHEL 8.3.
Successfully installed /redhat/beta_cds/RHEL-8.4.0-Alpha-1.0/BaseOS/ppc64le/os/

[root@bootstrap-0 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 Beta (Ootpa)
[root@bootstrap-0 ~]# uname -a
Linux bootstrap-0.ocp-ppc64le-test-080078.aus.stglabs.ibm.com 4.18.0-259.el8.ppc64le #1 SMP Sat Dec 5 03:07:47 EST 2020 ppc64le ppc64le ppc64le GNU/Linux
Manoj, can you provide the system model and firmware version of this system?
Here is some debug info using rhcos-47.83.202101121312-0:

The server is using IBM-mihawk-OP9_v2.4-4.37, which provides Petitboot 1.11. This firmware version has the fix for the previous issue with grub using BLS.

In the PXE install case, the installer is able to write ostree to the disk in:
/var/petitboot/mnt/dev/nvme0n1p4/ostree/
and
/var/petitboot/mnt/dev/nvme0n1p3/ostree/

Nonetheless, there isn't any entry in petitboot for it. I was able to successfully boot the system by forcing the boot via kexec using the ostree path.

I also tried to install it only via kexec.

cat /proc/cmdline
powersave=off nomodeset rd.neednet=1 coreos.inst=yes coreos.live.rootfs_url=http://192.168.79.1:8080/assets/rhcos-47.83.202101121312-0-live-rootfs.ppc64le.img coreos.inst.install_dev=nvme0n1 ignition.firstboot=1 ignition.platform.id=metal coreos.inst.image_url=http://192.168.79.1:8080/assets/rhcos-47.83.202101121312-0-metal.ppc64le.raw.gz coreos.inst.ignition_url=http://http://192.168.79.1:8080/assets/config.ign ifname=net0:08:94:ef:80:c5:35 ip=192.168.79.253::192.168.79.1:255.255.255.0:renata_test:net0:none nameserver=192.168.79.1 ostree=/ostree/boot.1/rhcos/3334f496365ccf3421e6eac4f813c99c5408756987d6d838188bd87b61b7a449/0

I'm probably missing something here ....
In this case I can't see ostree in /

:/# ls /
bin              proc                                                  shutdown
dev              rhcos-47.83.202101121312-0-metal.ppc64le.raw.osmet    sys
dracut-state.sh  rhcos-47.83.202101121312-0-metal4k.ppc64le.raw.osmet  sysroot
etc              root                                                  tmp
init             root.squashfs                                         usr
lib              run                                                   var
lib64            sbin

systemctl start ostree-prepare-root.service
[ 221.254864] ostree-prepare-root[2803]: ostree-prepare-root: No OSTree target; expected ostree=/ostree/boot.N/..

lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0 255.6G  0 loop /run/ephemeral
loop1     7:1    0 769.5M  0 loop /sysroot
nvme0n1 259:0    0   1.5T  0 disk

[ 221.533368] ostree-prepare-root[2803]: ostree-prepare-root: No OSTree target; expected ostree=/ostree/boot.N/...
[ 221.533543] systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE
[ 221.533821] systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
[ 221.534464] systemd[1]: Failed to start OSTree Prepare OS/.
[ 221.534527] systemd[1]: ostree-prepare-root.service: Triggering OnFailure= dependencies.
[ 221.536438] systemd[1]: Starting Setup Virtual Console...
[ 221.782488] systemd-vconsole-setup[2805]: KD_FONT_OP_GET failed while trying to get the font metadata: Function not implemented
So I debugged the problem a bit and it's actually caused by this change:
https://github.com/coreos/coreos-assembler/commit/858360aa839a794c822440ac6cdc4d83eb1638e0#diff-0a3babc96e82f4f8af34b10551183337867caf95840fc23a878290a45ffc55d5

On petitboot, we do not support "test -e $file":
https://github.com/open-power/petitboot/blob/967cfa7e5c1bfb4d2cf78bb3de3dc6d36b78c440/discover%2Fgrub2%2Fbuiltins.c#L327-L328

Although that should be an easy addition, I don't think we support zero-length filenames either (it looks like in this case we are really testing for the presence of the device (md/md-boot) alone). So supporting these semantics requires some upstream work and, later on, backporting etc.

We can work around that by replacing "if [ -e (md/md-boot) ];" with "if [ -f (md/md-boot) ];", but note that this test will always fail, thus reverting to the old-style search for the (first?) device labeled as boot and assigning it as the boot device.

So a decision must be made on where the "fix" should go. Also, please mirror this bug to IBM, family ppc64 development, product OPAL, component Petitboot. Feel free to assign it to me.
Awesome, thanks for tracking that down! "test -d (md/md-boot)/grub2" should be equivalent for our purposes. We control the filesystem layout, and the exists branch already assumes /grub2, so this should be safe.
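As a rough sketch of what that substitution could look like in the GRUB config: only the two test expressions come from this thread. The variable name "boot", the body of each branch, and the exact "search" invocation are assumptions for illustration, not copied from the coreos-assembler grub.cfg.

```
# Form that Petitboot's GRUB2 parser could not evaluate (no "test -e"):
#   if [ -e (md/md-boot) ]; then

# Suggested form: test for a directory that the RAID branch assumes anyway.
if [ -d (md/md-boot)/grub2 ]; then
  # Assumed body: use the RAID device that holds /boot.
  set boot=md/md-boot
else
  # Assumed body: fall back to the filesystem labeled "boot".
  search --label boot --set boot
fi
```

On a non-RAID install the directory test is simply false and the label search runs, which is the same outcome as before the partition layout change.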
(In reply to Benjamin Gilbert from comment #8)
> Awesome, thanks for tracking that down! "test -d (md/md-boot)/grub2" should
> be equivalent for our purposes. We control the filesystem layout, and the
> exists branch already assumes /grub2, so this should be safe.

I was able to test the fix suggested by Benjamin. It worked: the entry now appears in the petitboot menu. The PR has already been merged. Klaus was also able to test a patch for Petitboot that worked.
Closing as verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1915540#c10
Successfully deployed 4.7.0-fc.4 on the previously failing baremetal cluster.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633