Bug 1915540 - Silent 4.7 RHCOS install failure on ppc64le
Summary: Silent 4.7 RHCOS install failure on ppc64le
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: ppc64le
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Benjamin Gilbert
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1915617
Reported: 2021-01-12 20:55 UTC by Manoj Kumar
Modified: 2022-09-30 21:00 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:52:28 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | coreos coreos-assembler pull 2005 | 0 | None | closed | grub: make boot RAID check Petitboot-compatible | 2021-02-19 13:56:27 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:52:45 UTC

Description Manoj Kumar 2021-01-12 20:55:25 UTC
Description of problem:

coreos-installer from the fc.2 build for 4.7 fails silently. The installation appears to complete, but on the next boot the petitboot menu lists no bootable images.


Version-Release number of selected component (if applicable):
4.7 
[ 2984.289336] coreos-installer-service[2762]: Installing Red Hat Enterprise Linux CoreOS 47.83.202101091312-0 (Ootpa) ppc64le (4096-byte sectors)


How reproducible:
Consistently

Steps to Reproduce:
1. PXE boot a bare-metal system from a server hosting the 4.7 fc.2 RHCOS images
2. Wait for the CoreOS install to complete
3. On the next boot, no boot image is found on disk.

Actual results:
No boot images found on disk.

Expected results:
CoreOS image expected on disk.

Additional info:

Comment 1 Micah Abbott 2021-01-12 22:21:12 UTC
@rravanel Would you be able to test PXE booting RHCOS 4.7 with a ppc64le system?

@Manoj could you provide the journal from the system during the install?  And from when the system attempts to boot?

Comment 2 Benjamin Gilbert 2021-01-12 22:31:19 UTC
There were some partition layout changes that might have caused this; it's also worth checking the bootloader configs.  But we don't have local expertise on ppc64le booting, so it'd be helpful if a ppc64le SME could dig into this.

It'd also be useful to know in which build this stopped working.

Comment 3 Prashanth Sundararaman 2021-01-12 23:02:35 UTC
Note that according to @Manoj, the 47.83.202012070110-0 build worked without issues. There are a lot of package changes after this build, plus a kernel change, which makes me think this could also be RHEL related.

It would also be good to try booting with the latest RHEL 8.3.

Comment 4 Mark Hamzy 2021-01-13 03:31:15 UTC
Successfully installed

/redhat/beta_cds/RHEL-8.4.0-Alpha-1.0/BaseOS/ppc64le/os/

[root@bootstrap-0 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 Beta (Ootpa)
[root@bootstrap-0 ~]# uname -a
Linux bootstrap-0.ocp-ppc64le-test-080078.aus.stglabs.ibm.com 4.18.0-259.el8.ppc64le #1 SMP Sat Dec 5 03:07:47 EST 2020 ppc64le ppc64le ppc64le GNU/Linux

Comment 5 Renata Ravanelli 2021-01-13 14:02:55 UTC
Manoj, can you provide the system model and firmware version of this system?

Comment 6 Renata Ravanelli 2021-01-13 23:47:12 UTC
Here is some debug info using rhcos-47.83.202101121312-0: 


The server is using IBM-mihawk-OP9_v2.4-4.37, which provides Petitboot 1.11; this firmware version has the fix for the previous issue with grub using BLS.

In the PXE install case, it is able to write ostree to the disk at:

/var/petitboot/mnt/dev/nvme0n1p4/ostree/ and /var/petitboot/mnt/dev/nvme0n1p3/ostree/
Nonetheless, there isn't any entry in petitboot for it.

I was able to boot the system successfully by forcing the boot via kexec using the ostree path.


I also tried to install it only via kexec.

cat /proc/cmdline
powersave=off
nomodeset rd.neednet=1 coreos.inst=yes
coreos.live.rootfs_url=http://192.168.79.1:8080/assets/rhcos-47.83.202101121312-0-live-rootfs.ppc64le.img
coreos.inst.install_dev=nvme0n1 ignition.firstboot=1 ignition.platform.id=metal
coreos.inst.image_url=http://192.168.79.1:8080/assets/rhcos-47.83.202101121312-0-metal.ppc64le.raw.gz
coreos.inst.ignition_url=http://http://192.168.79.1:8080/assets/config.ign
ifname=net0:08:94:ef:80:c5:35 ip=192.168.79.253::192.168.79.1:255.255.255.0:renata_test:
net0:none nameserver=192.168.79.1 ostree=/ostree/boot.1/rhcos/3334f496365ccf3421e6eac4f813c99c5408756987d6d838188bd87b61b7a449/0
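
When comparing command lines across attempts, it can help to split /proc/cmdline into key/value pairs before reading it. The following is a minimal Python sketch, not part of coreos-installer or dracut; the helper names `parse_cmdline` and `doubled_scheme_keys` are hypothetical, and the sample string is shortened from the command line pasted above.

```python
def parse_cmdline(text):
    """Split a kernel command line into a dict; bare flags map to None."""
    args = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        # Later occurrences of a key overwrite earlier ones in this sketch.
        args[key] = value if sep else None
    return args


def doubled_scheme_keys(args):
    """Return the keys whose value contains more than one '://' separator."""
    return [k for k, v in args.items() if v and v.count("://") > 1]


# Shortened sample based on the command line above (assumption: same syntax).
sample = (
    "nomodeset rd.neednet=1 coreos.inst=yes "
    "coreos.inst.install_dev=nvme0n1 "
    "coreos.inst.ignition_url=http://http://192.168.79.1:8080/assets/config.ign"
)
args = parse_cmdline(sample)
print(args["coreos.inst.install_dev"])
print(doubled_scheme_keys(args))
```

Run against the full command line, a check like this flags the repeated `http://` in coreos.inst.ignition_url as pasted; whether that is a paste artifact or was present on the real system is not established in this report.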


I'm probably missing something here ....

In this case I can't see ostree in /


:/# ls /
bin              proc                                                  shutdown
dev              rhcos-47.83.202101121312-0-metal.ppc64le.raw.osmet    sys
dracut-state.sh  rhcos-47.83.202101121312-0-metal4k.ppc64le.raw.osmet  sysroot
etc              root                                                  tmp
init             root.squashfs                                         usr
lib              run                                                   var
lib64            sbin




systemctl start ostree-prepare-root.service
[  221.254864] ostree-prepare-root[2803]: ostree-prepare-root: No OSTree target; expected ostree=/ostree/boot.N/..

lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0 255.6G  0 loop /run/ephemeral
loop1     7:1    0 769.5M  0 loop /sysroot
nvme0n1 259:0    0   1.5T  0 disk


[  221.533368] ostree-prepare-root[2803]: ostree-prepare-root: No OSTree target; expected ostree=/ostree/boot.N/...
[  221.533543] systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE
[  221.533821] systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
[  221.534464] systemd[1]: Failed to start OSTree Prepare OS/.
[  221.534527] systemd[1]: ostree-prepare-root.service: Triggering OnFailure= dependencies.
[  221.536438] systemd[1]: Starting Setup Virtual Console...
[  221.782488] systemd-vconsole-setup[2805]: KD_FONT_OP_GET failed while trying to get the font metadata: Function not implemented

Comment 7 Klaus Kiwi (Old account no longer used) 2021-01-14 18:09:39 UTC
So I debugged the problem a bit and it's actually caused by this change: https://github.com/coreos/coreos-assembler/commit/858360aa839a794c822440ac6cdc4d83eb1638e0#diff-0a3babc96e82f4f8af34b10551183337867caf95840fc23a878290a45ffc55d5

On petitboot, we do not support "test -e $file": https://github.com/open-power/petitboot/blob/967cfa7e5c1bfb4d2cf78bb3de3dc6d36b78c440/discover%2Fgrub2%2Fbuiltins.c#L327-L328 

Although that should be an easy addition, I don't think we support zero-length filenames either (in this case it looks like we are really testing for the presence of the device (md/md-boot) alone).

So supporting these semantics requires some upstream work and, later on, backporting etc. We can work around that by replacing "if [ -e (md/md-boot) ];" with "if [ -f (md/md-boot) ];", but note that this test will always fail (thus reverting to the old-style search for the (first?) device labeled as boot and assigning it as the boot device).
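
The situation described here can be illustrated with a toy model of a config parser whose `test` builtin implements only some operators. This is a Python sketch under stated assumptions, not Petitboot's code: the operator table, the `UnsupportedOperator` exception, and the mapping of `-f` and `-d` onto `os.path` calls are all hypothetical, and what a real parser does when it meets an unknown operator is not modeled.

```python
import os
import tempfile


class UnsupportedOperator(Exception):
    """Raised when the toy parser meets a test operator it lacks."""


# Assumption: a parser that implements -f and -d but not -e.
LIMITED_OPS = {
    "-f": os.path.isfile,  # true only for an existing regular file
    "-d": os.path.isdir,   # true only for an existing directory
}


def run_test(op, path, ops=LIMITED_OPS):
    """Evaluate `test OP PATH` using only the operators listed in `ops`."""
    if op not in ops:
        raise UnsupportedOperator(op)
    return ops[op](path)


with tempfile.TemporaryDirectory() as boot:
    # Stand-in for a boot filesystem that contains a grub2 directory.
    os.mkdir(os.path.join(boot, "grub2"))
    print(run_test("-f", boot))                         # a directory is not a file
    print(run_test("-d", os.path.join(boot, "grub2")))
    try:
        run_test("-e", boot)
    except UnsupportedOperator as exc:
        print("unsupported operator:", exc)
```

In this model `-f` applied to the filesystem root is always false, which mirrors the caveat above that the `-f` workaround would always fall through to the label-based search.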


So a decision must be made on where the "fix" should go. Also, please mirror this bug to IBM, family ppc64 development, product OPAL, component Petitboot. Feel free to assign it to me.

Comment 8 Benjamin Gilbert 2021-01-14 20:40:48 UTC
Awesome, thanks for tracking that down!  "test -d (md/md-boot)/grub2" should be equivalent for our purposes.  We control the filesystem layout, and the exists branch already assumes /grub2, so this should be safe.
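
The equivalence argument can be checked on a mock layout. This is a minimal Python sketch, assuming a temporary directory stands in for the (md/md-boot) filesystem; the function names are hypothetical, and `os.path` is only an approximation of GRUB's `test` semantics.

```python
import os
import tempfile


def check_exists(root):
    """Model of `test -e ROOT`: true if the filesystem root is present."""
    return os.path.exists(root)


def check_grub2_dir(root):
    """Model of `test -d ROOT/grub2`: true if ROOT holds a grub2 directory."""
    return os.path.isdir(os.path.join(root, "grub2"))


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "md-boot")   # layout we control: has grub2/
    os.makedirs(os.path.join(present, "grub2"))
    absent = os.path.join(tmp, "missing")    # device not present at all
    bare = os.path.join(tmp, "bare")         # exists, but without grub2/
    os.mkdir(bare)

    for root in (present, absent, bare):
        print(os.path.basename(root), check_exists(root), check_grub2_dir(root))
```

The two checks agree for the first two cases and differ only for the third, which is the layout ruled out by controlling what gets written to the boot filesystem.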

Comment 10 Renata Ravanelli 2021-01-18 17:23:06 UTC
(In reply to Benjamin Gilbert from comment #8)
> Awesome, thanks for tracking that down!  "test -d (md/md-boot)/grub2" should
> be equivalent for our purposes.  We control the filesystem layout, and the
> exists branch already assumes /grub2, so this should be safe.

I was able to test the fix suggested by Benjamin; it worked, and the entry now appears in the menu. The PR has already been merged.


Klaus was also able to test a patch for Petitboot that worked.

Comment 12 Michael Nguyen 2021-01-25 19:04:12 UTC
Closing as verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1915540#c10

Comment 13 Mark Hamzy 2021-01-26 23:41:37 UTC
Successfully deployed 4.7.0-fc.4 on the previously failing baremetal cluster.

Comment 16 errata-xmlrpc 2021-02-24 15:52:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

