Bug 2107674 - Unable to install RHCOS 4.11.0-rc2-ppc64le on Bare-metal Power system
Summary: Unable to install RHCOS 4.11.0-rc2-ppc64le on Bare-metal Power system
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.11
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Benjamin Gilbert
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-07-15 15:48 UTC by Thomas L Falcon
Modified: 2022-08-25 05:09 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-25 05:09:11 UTC
Target Upstream Version:
Embargoed:


Attachments
console-and-ign-files.tar.gz (137.36 KB, application/gzip), 2022-07-15 18:34 UTC, Thomas L Falcon
pxe-boot-console-logs (149.20 KB, text/plain), 2022-07-18 17:01 UTC, Thomas L Falcon
bootstrap-autologin.ign (272.21 KB, text/plain), 2022-07-18 22:12 UTC, Thomas L Falcon
pb-discover.log (16.81 KB, text/plain), 2022-08-17 18:52 UTC, Thomas L Falcon


Links
System | ID | Private | Priority | Status | Summary | Last Updated
GitHub | coreos/coreos-assembler/commit/45fc1e7518 | 0 | None | None | None | 2022-08-25 05:09:11 UTC
IBM Linux Technology Center | 199033 | 0 | None | None | None | 2022-07-19 23:30:26 UTC
Red Hat Issue Tracker | OCPBUGS-565 | 0 | None | None | None | 2022-08-25 05:09:11 UTC

Description Thomas L Falcon 2022-07-15 15:48:31 UTC
OCP Version at Install Time: 4.11.0-rc.2-ppc64le
RHCOS Version at Install Time: 4.11.0-rc.2-ppc64le
OCP Version after Upgrade (if applicable):
RHCOS Version after Upgrade (if applicable):
Platform: bare-metal
Architecture: ppc64le


What are you trying to do? What is your use case?

Install OpenShift and RHCOS 4.11.0-rc.2

What happened? What went wrong or what did you expect?

Installation did not complete. I was able to install with a 4.11.0-fc.2 image.

Will work on getting console logs, debug output etc. and provide in follow-up comments

Comment 1 Micah Abbott 2022-07-15 16:09:23 UTC
In the future, please collect all required information as noted in our template before opening a BZ.  The promise of more information doesn't allow us to act on this BZ in the current state and increases the chance that it could be overlooked in the future.

Comment 2 Thomas L Falcon 2022-07-15 18:34:38 UTC
Created attachment 1897494 [details]
console-and-ign-files.tar.gz

Sorry for the trouble. I am attaching console output from one of the nodes and ignition files. I can not gather journalctl output. After I pxeboot the nodes to kick off the installation, petitboot no longer detects any disks. I cannot progress further.

Comment 3 Benjamin Gilbert 2022-07-15 18:44:08 UTC
I don't know petitboot very well; are you sure it's completely failing to boot?  The log suggests that it has started the kernel, but since the console is not directed to ttyS1, you aren't seeing any output.  Is it possible to intercept the boot and manually add "console=ttyS1" to the kernel arguments?

Comment 4 Thomas L Falcon 2022-07-15 19:57:44 UTC
I don't know about intercepting the boot, but the "console=ttyS1" parameter was already present in the nodes' configuration file.

LABEL pxeboot
    KERNEL http://192.168.79.1:8080/assets/rhcos-live-kernel-ppc64le
    APPEND initrd=http://192.168.79.1:8080/assets/rhcos-live-initramfs.ppc64le.img console=tty0 console=ttyS1 <------ Here
ip=dhcp rd.neednet=1 coreos.inst.install_dev=/dev/nvme0n1 coreos.live.rootfs_url=http://192.168.79.1:8080/assets/rhcos-live-rootfs.ppc64le.img  coreos.inst.ignition_url=http://192.168.79.1:8080/ignition?mac=08:94:ef:80:c5:51

I don't know how to grab the output from it though.

Comment 5 Benjamin Gilbert 2022-07-15 20:10:08 UTC
That only affects the PXE boot, not the installed system.

If you can't intercept the boot, the next easiest approach is to remove "coreos.inst.install_dev=/dev/nvme0n1" from the APPEND line, boot from PXE, connect to the console, and manually run coreos-installer at the prompt:

    sudo coreos-installer install /dev/nvme0n1 --ignition-url http://192.168.79.1:8080/ignition?mac=08:94:ef:80:c5:51 --insecure-ignition --append-karg console=ttyS1

Then run "sudo reboot" and see what output you get when booting from the installed system.
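
The APPEND-line edit described above can be sketched as a small helper. This is only an illustration: `drop_karg` is a hypothetical name, the sample line is the APPEND line from comment 4 without the Ignition URL, and the helper normalizes whitespace as a side effect.

```python
def drop_karg(append_line: str, key: str) -> str:
    """Return append_line without any argument named `key` (bare or key=value)."""
    head, *args = append_line.split()
    # Compare only the part before the first '=' so 'key=value' and bare 'key' both match.
    kept = [a for a in args if a.split("=", 1)[0] != key]
    return " ".join([head, *kept])


line = (
    "APPEND initrd=http://192.168.79.1:8080/assets/rhcos-live-initramfs.ppc64le.img "
    "console=tty0 console=ttyS1 ip=dhcp rd.neednet=1 "
    "coreos.inst.install_dev=/dev/nvme0n1 "
    "coreos.live.rootfs_url=http://192.168.79.1:8080/assets/rhcos-live-rootfs.ppc64le.img"
)
edited = drop_karg(line, "coreos.inst.install_dev")
print(edited)
```

With the install device removed, the live system should boot to a prompt instead of installing automatically, which is what the manual coreos-installer step relies on.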

Comment 6 Thomas L Falcon 2022-07-15 21:23:25 UTC
Thanks, I was able to boot that way but I don't know how to proceed. We're using ssh keys, so I don't know the password. However, ssh also is not working.

Red Hat Enterprise Linux CoreOS 411.86.202207090936-0 (Ootpa) 4.11
SSH host key: SHA256:3RTAHY3cLSrqVdiR5q/79etki64sFwVO41+eB6pebpw (ECDSA)
SSH host key: SHA256:CrnZwblQXJR7uv7UtE9qP0ESX8s9A/ebvyc8yp2dE3g (ED25519)
SSH host key: SHA256:qZZb5gZb8nxp/eE4NWc/f2dVE3Geq1ElCRjcOhEQz/A (RSA)
enP49p3s0f1: 192.168.79.25 fe80::a94:efff:fe80:c551
master-1 login:
Red Hat Enterprise Linux CoreOS 411.86.202207090936-0 (Ootpa) 4.11
SSH host key: SHA256:3RTAHY3cLSrqVdiR5q/79etki64sFwVO41+eB6pebpw (ECDSA)
SSH host key: SHA256:CrnZwblQXJR7uv7UtE9qP0ESX8s9A/ebvyc8yp2dE3g (ED25519)
SSH host key: SHA256:qZZb5gZb8nxp/eE4NWc/f2dVE3Geq1ElCRjcOhEQz/A (RSA)
enP49p3s0f1: 192.168.79.25 fe80::a94:efff:fe80:c551
master-1 login:

Comment 7 Benjamin Gilbert 2022-07-15 21:35:32 UTC
Please post the console log from the successful boot.

Comment 8 Manoj Kumar 2022-07-17 22:42:03 UTC
I tried this on another system (to rule out NVMe devices and 4k block sizes as the cause of the installation failure).

coreos-installer seems to complete with some warnings.

Installing Red Hat Enterprise Linux CoreOS 411.86.202207090936-0 (Ootpa) ppc64le (512-byte sectors)
[ 4610.624181] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4610.624181] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4610.674346] GPT:8454143 != 1953525167
[ 4610.674346] GPT:8454143 != 1953525167
[ 4610.714445] GPT:Alternate GPT header not at the end of the disk.
[ 4610.714445] GPT:Alternate GPT header not at the end of the disk.
[ 4610.764567] GPT:8454143 != 1953525167
[ 4610.764567] GPT:8454143 != 1953525167
[ 4610.804659] GPT: Use GNU Parted to correct GPT errors.
[ 4610.804659] GPT: Use GNU Parted to correct GPT errors.
[ 4610.844768]  sda: sda1 sda2 sda3 sda4
[ 4610.844768]  sda: sda1 sda2 sda3 sda4
> Read disk 4.0 GiB/4.0 GiB (100%)
[ 4690.651453] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4690.651453] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4690.701629] GPT:8454143 != 1953525167
[ 4690.701629] GPT:8454143 != 1953525167
[ 4690.741712] GPT:Alternate GPT header not at the end of the disk.
[ 4690.741712] GPT:Alternate GPT header not at the end of the disk.
[ 4690.801836] GPT:8454143 != 1953525167
[ 4690.801836] GPT:8454143 != 1953525167
[ 4690.841922] GPT: Use GNU Parted to correct GPT errors.
[ 4690.841922] GPT: Use GNU Parted to correct GPT errors.
[ 4690.892035]  sda: sda1 sda2 sda3 sda4
[ 4690.892035]  sda: sda1 sda2 sda3 sda4
[ 4691.271404] EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
[ 4691.271404] EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
Writing Ignition config
Modifying kernel arguments
[ 4691.502798] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4691.502798] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4691.552950] GPT:8454143 != 1953525167
[ 4691.552950] GPT:8454143 != 1953525167
[ 4691.583048] GPT:Alternate GPT header not at the end of the disk.
[ 4691.583048] GPT:Alternate GPT header not at the end of the disk.
[ 4691.633162] GPT:8454143 != 1953525167
[ 4691.633162] GPT:8454143 != 1953525167
[ 4691.663248] GPT: Use GNU Parted to correct GPT errors.
[ 4691.663248] GPT: Use GNU Parted to correct GPT errors.
[ 4691.713364]  sda: sda1 sda2 sda3 sda4
[ 4691.713364]  sda: sda1 sda2 sda3 sda4
Install complete.
[ 4692.130617] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4692.130617] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4692.181393] GPT:8454143 != 1953525167
[ 4692.181393] GPT:8454143 != 1953525167
[ 4692.222040] GPT:Alternate GPT header not at the end of the disk.
[ 4692.222040] GPT:Alternate GPT header not at the end of the disk.
[ 4692.272734] GPT:8454143 != 1953525167
[ 4692.272734] GPT:8454143 != 1953525167
[ 4692.313398] GPT: Use GNU Parted to correct GPT errors.
[ 4692.313398] GPT: Use GNU Parted to correct GPT errors.
[ 4692.354094]  sda: sda1 sda2 sda3 sda4
[ 4692.354094]  sda: sda1 sda2 sda3 sda4
bash-4.4# [ 4692.411483] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4692.411483] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 4692.462238] GPT:8454143 != 1953525167
[ 4692.462238] GPT:8454143 != 1953525167
[ 4692.502961] GPT:Alternate GPT header not at the end of the disk.
[ 4692.502961] GPT:Alternate GPT header not at the end of the disk.
[ 4692.553717] GPT:8454143 != 1953525167
[ 4692.553717] GPT:8454143 != 1953525167
[ 4692.594448] GPT: Use GNU Parted to correct GPT errors.
[ 4692.594448] GPT: Use GNU Parted to correct GPT errors.
[ 4692.635195]  sda: sda1 sda2 sda3 sda4
[ 4692.635195]  sda: sda1 sda2 sda3 sda4

Comment 9 Manoj Kumar 2022-07-17 23:00:28 UTC
bash-4.4# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0    7:0    0 127.7G  0 loop /run/ephemeral
loop1    7:1    0 937.8M  0 loop /sysroot
sda      8:0    1 931.5G  0 disk 
|-sda1   8:1    1     4M  0 part 
|-sda2   8:2    1     1M  0 part 
|-sda3   8:3    1   384M  0 part 
`-sda4   8:4    1   3.7G  0 part 
sdb      8:16   1 931.5G  0 disk 
bash-4.4# fdisk /dev/sda

Welcome to fdisk (util-linux 2.32.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

GPT PMBR size mismatch (8454143 != 1953525167) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
[  980.509508] GPT:Primary header thinks Alt. header is not at the end of the disk.
[  980.509508] GPT:Primary header thinks Alt. header is not at the end of the disk.
[  980.570297] GPT:8454143 != 1953525167
[  980.570297] GPT:8454143 != 1953525167
[  980.611009] GPT:Alternate GPT header not at the end of the disk.
[  980.611009] GPT:Alternate GPT header not at the end of the disk.
[  980.661746] GPT:8454143 != 1953525167
[  980.661746] GPT:8454143 != 1953525167
[  980.692452] GPT: Use GNU Parted to correct GPT errors.
[  980.692452] GPT: Use GNU Parted to correct GPT errors.
[  980.743197]  sda: sda1 sda2 sda3 sda4
[  980.743197]  sda: sda1 sda2 sda3 sda4

Comment 10 Manoj Kumar 2022-07-18 15:45:18 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=2107674#c8 seems to indicate that the installation succeeded.  However, even though coreos-installer appears to complete, I could not boot into the installed image.

Comment 11 Benjamin Gilbert 2022-07-18 15:53:23 UTC
Those GPT warnings are normal, and are corrected automatically on first boot.

I agree that the installation appears to have succeeded.  At this point we'd need to see logs from the failed boot.
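
For reference, the two numbers in the GPT warning can be converted to sizes. This is a rough sketch, assuming the first number is the backup-header LBA recorded in the primary header and the second is the last LBA of the disk, with the 512-byte sectors reported by the installer. `lba_to_gib` is a hypothetical helper.

```python
SECTOR = 512  # bytes; the installer output reports 512-byte sectors


def lba_to_gib(last_lba: int, sector: int = SECTOR) -> float:
    """Size in GiB of a region whose last addressable sector is last_lba."""
    return (last_lba + 1) * sector / 2**30


# Assumed meaning: where the primary header expects the backup header.
image_gib = lba_to_gib(8454143)
# Assumed meaning: the actual last sector of the disk.
disk_gib = lba_to_gib(1953525167)
print(f"image: {image_gib:.2f} GiB, disk: {disk_gib:.1f} GiB")
```

This prints about 4.03 GiB for the image and 931.5 GiB for the disk, which lines up with the "Read disk 4.0 GiB/4.0 GiB" progress line and the 931.5G size that lsblk reports for sda. Under that reading, the warning only says that a roughly 4 GiB image was written to a much larger disk.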

Comment 13 Thomas L Falcon 2022-07-18 17:01:54 UTC
Created attachment 1897962 [details]
pxe-boot-console-logs

Logs after removing "coreos.inst.install_dev=/dev/nvme0n1" from kernel arguments.

Comment 14 Benjamin Gilbert 2022-07-18 17:09:43 UTC
@tfalcon That's the log from the PXE boot.  We need the log from the boot of the installed system.  Did you follow the full instructions in comment 5?

Comment 15 Thomas L Falcon 2022-07-18 17:36:49 UTC
I'm still trying to figure out how to connect to the console and log in to run the command.

Red Hat Enterprise Linux CoreOS 411.86.202207090936-0 (Ootpa) 4.11
SSH host key: SHA256:fc3NlioetXjw0i3C2B1U+drAlVpBtQVH2H+JK7dMbZs (ECDSA)
SSH host key: SHA256:z2kQhaMpcxjgYInABwQbEaEZqpC1jN3ahDgJuDDC0Mc (ED25519)
SSH host key: SHA256:feM2Q+9mxFS3IwQJMK/6OL8G7UQBR72Ue8VS7AtwjC8 (RSA)
enP49p3s0f1: 192.168.79.25 fe80::a94:efff:fe80:c551
master-1 login: 
Red Hat Enterprise Linux CoreOS 411.86.202207090936-0 (Ootpa) 4.11
SSH host key: SHA256:fc3NlioetXjw0i3C2B1U+drAlVpBtQVH2H+JK7dMbZs (ECDSA)
SSH host key: SHA256:z2kQhaMpcxjgYInABwQbEaEZqpC1jN3ahDgJuDDC0Mc (ED25519)
SSH host key: SHA256:feM2Q+9mxFS3IwQJMK/6OL8G7UQBR72Ue8VS7AtwjC8 (RSA)
enP49p3s0f1: 192.168.79.25 fe80::a94:efff:fe80:c551
master-1 login:

Comment 16 Thomas L Falcon 2022-07-18 22:12:58 UTC
Created attachment 1898018 [details]
bootstrap-autologin.ign

I tried to add an autologin unit to the bootstrap ignition file, but something isn't working. I'm still blocked on accessing the console after booting a node in PXE.

Comment 17 Benjamin Gilbert 2022-07-19 07:08:40 UTC
Whoops, yeah, sorry about that.  In ISO boots autologin happens automatically if an Ignition config is omitted, but that doesn't happen in PXE boots.

This config, based on the one in https://docs.fedoraproject.org/en-US/fedora-coreos/tutorial-autologin/#_first_ignition_config_via_butane, should enable autologin on ttyS1:

{
  "ignition": {
    "version": "3.3.0"
  },
  "systemd": {
    "units": [
      {
        "dropins": [
          {
            "contents": "[Service]\nExecStart=\nExecStart=-/usr/sbin/agetty --autologin core --noclear %I $TERM\n",
            "name": "autologin-core.conf"
          }
        ],
        "name": "serial-getty"
      }
    ]
  }
}

You can use it by hosting it on a web server, then passing "ignition.platform.id=metal ignition.firstboot ignition.config.url=https://example.com/ignition/config" in the kernel arguments.
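
The linked tutorial applies its drop-in to an instantiated unit with a full name (serial-getty@ttyS0.service there). Below is a minimal sketch that generates the same config with a full unit name. Using serial-getty@ttyS1.service for this machine is an assumption, not something confirmed in this report, and `autologin_config` is a hypothetical helper.

```python
import json


def autologin_config(tty: str = "ttyS1", user: str = "core") -> dict:
    """Build an Ignition 3.3.0 config with an autologin drop-in for one serial getty."""
    dropin = (
        "[Service]\n"
        "ExecStart=\n"  # clear the ExecStart inherited from the main unit
        f"ExecStart=-/usr/sbin/agetty --autologin {user} --noclear %I $TERM\n"
    )
    return {
        "ignition": {"version": "3.3.0"},
        "systemd": {
            "units": [
                {
                    # Assumption: full instantiated unit name, as in the tutorial.
                    "name": f"serial-getty@{tty}.service",
                    "dropins": [{"name": "autologin-core.conf", "contents": dropin}],
                }
            ]
        },
    }


config = autologin_config()
print(json.dumps(config, indent=2))
```

The output can be saved to a file and served from the same web server as the other PXE assets.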

Comment 18 Thomas L Falcon 2022-07-19 17:32:47 UTC
Thanks! I booted a node with that file, but I am still being prompted to log in. I am trying to confirm that I am connected to ttyS1.

Comment 19 Mark Hamzy 2022-07-19 19:37:09 UTC
Manoj said he did the installation on infnod-1, and this is what we noticed there.  Petitboot did mount sda.  And we could see the following:

# ls -l /var/petitboot/mnt/dev/sda3
total 23
lrwxrwxrwx    1 root     root             1 Jul  9 10:28 boot -> .
drwxr-xr-x    5 root     root          1024 Jul  9 10:28 grub2
drwx------    2 root     root          1024 Jul 17 22:56 ignition
-rw-r--r--    1 root     root             0 Jul  9 10:28 ignition.firstboot
lrwxrwxrwx    1 root     root             8 Jul  9 10:28 loader -> loader.1
drwxr-xr-x    3 root     root          1024 Jul  9 10:28 loader.1
drwx------    2 root     root         12288 Jul  9 10:27 lost+found
drwxr-xr-x    3 root     root          1024 Jul  9 10:28 ostree
# ls -l /var/petitboot/mnt/dev/sda3/ostree/
total 2
drwxr-xr-x    2 root     root          1024 Jul  9 10:28 rhcos-797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e
# ls -l /var/petitboot/mnt/dev/sda3/ostree/rhcos-797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/
total 122614
-rw-r--r--    1 root     root      89918534 Jul  9 10:28 initramfs-4.18.0-372.13.1.el8_6.ppc64le.img
-rwxr-xr-x    1 root     root      35634157 Jul  9 10:28 vmlinuz-4.18.0-372.13.1.el8_6.ppc64le

# ls -l /var/petitboot/mnt/dev/sda4
total 0
drwxr-xr-x    2 root     root             6 Jul  9 10:27 boot
drwxr-xr-x    5 root     root            62 Jul  9 10:28 ostree

# cat /var/petitboot/mnt/dev/sda3/loader/entries/ostree-1-rhcos.conf
title Red Hat Enterprise Linux CoreOS 411.86.202207090936-0 (Ootpa) (ostree:0)
version 1
options random.trust_cpu=on console=tty0 console=hvc0,115200n8 ignition.platform.id=metal $ignition_firstboot ostree=/ostree/boot.1/rhcos/797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/0 console=ttyS1
linux /ostree/rhcos-797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/vmlinuz-4.18.0-372.13.1.el8_6.ppc64le
initrd /ostree/rhcos-797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/initramfs-4.18.0-372.13.1.el8_6.ppc64le.img

We tried to kexec manually into the image, but encountered an error:

# cd /var/petitboot/mnt/dev/sda3/ostree/rhcos-797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/

# kexec -l vmlinuz-4.18.0-372.13.1.el8_6.ppc64le -i initramfs-4.18.0-372.13.1.el8_6.ppc64le.img -c "random.trust_cpu=on console=tty0 console=hvc0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/0 console=ttyS1"

kexec syscall failed: Operation not permitted

Unfortunately, petitboot does not have the "file" command, so we could not verify if the kernel was valid.

Comment 20 Thomas L Falcon 2022-07-20 01:01:42 UTC
We've been able to kexec manually into the image from the petitboot menu. 

# cd /var/petitboot/mnt/dev/*1p3/ostree/rhcos-*/
# kexec -l vmlinuz-*.ppc64le -i initramfs-*.img -c "$(grep options /var/petitboot/mnt/dev/*1p3/loader/entries/ostree-1-rhcos.conf | sed 's,^options ,,')"
# kexec -e

Comment 21 IBM Bug Proxy 2022-07-22 21:50:33 UTC
------- Comment From chavez.com 2022-07-22 17:45 EDT-------
Hi Tom and Mark,

Trying to figure out if you need some help from the IBM side for petitboot or if y'all are still doing some more investigation on your own.

Comment 22 Timothée Ravier 2022-07-27 11:24:41 UTC
Can we get a summary of the investigation and the issues that you are facing?

Comment 23 Thomas L Falcon 2022-07-27 21:24:25 UTC
Luciano, sorry for the late response. We were hoping to get some help with figuring out why the petitboot entries are not being populated. Unfortunately the cluster is being used for a different issue at the moment.

Timothée, we were able to work around the petitboot issue by running the following commands in the shell:

# cd /var/petitboot/mnt/dev/nvme0n1p3/ostree/rhcos-*/
# kexec -l vmlinuz-*.ppc64le -i initramfs-*.img -c "ignition.firstboot rd.neednet=1 ip=dhcp $(grep options /var/petitboot/mnt/dev/nvme0n1p3/loader/entries/ostree-1-rhcos.conf | sed 's,^options ,,')" && kexec -e
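
The command above can be sketched in Python to make the kernel command line explicit. It appears that `$ignition_firstboot` is a variable that only GRUB expands, so the grep/sed pipeline would pass it through as a literal string. The sketch replaces it by hand, and that reading is an assumption. The helper names are hypothetical, the sample entry is the one quoted in comment 19, and the `-l`, `-i` and `-c` flags mirror the commands in this report.

```python
import shlex

ENTRY = """title Red Hat Enterprise Linux CoreOS 411.86.202207090936-0 (Ootpa) (ostree:0)
version 1
options random.trust_cpu=on console=tty0 console=hvc0,115200n8 ignition.platform.id=metal $ignition_firstboot ostree=/ostree/boot.1/rhcos/797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/0 console=ttyS1
"""


def bls_options(entry_text: str) -> str:
    """Return the value of the 'options' line of a BLS entry, or '' if absent."""
    for line in entry_text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key == "options":
            return value.strip()
    return ""


def kexec_load_argv(kernel: str, initrd: str, entry_text: str, firstboot_args: str = "") -> list:
    """Build an argv for `kexec -l`, substituting the GRUB variable explicitly."""
    cmdline = bls_options(entry_text).replace("$ignition_firstboot", firstboot_args)
    cmdline = " ".join(cmdline.split())  # collapse the gap left by an empty substitution
    return ["kexec", "-l", kernel, "-i", initrd, "-c", cmdline]


argv = kexec_load_argv(
    "vmlinuz-4.18.0-372.13.1.el8_6.ppc64le",
    "initramfs-4.18.0-372.13.1.el8_6.ppc64le.img",
    ENTRY,
    firstboot_args="ignition.firstboot rd.neednet=1 ip=dhcp",
)
print(shlex.join(argv))
```

Passing an empty `firstboot_args` gives the command line for a later boot, after Ignition has already run.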

However, for some reason the bootkube service did not run on the bootstrap node and needed to be run manually. I wasn't able to get the installation to work before we had to repurpose the cluster for another issue. I don't know if the bootstrap issue is related to this one.

Comment 24 Timothée Ravier 2022-08-11 10:30:09 UTC
Is this still an issue that you are facing or have you found a solution?
If this is still an issue then we need a clearer summary as we do not understand how we could help here.

Comment 25 Thomas L Falcon 2022-08-11 15:04:32 UTC
Hi, we were able to work around this by installing 4.10 and upgrading to 4.11 using the openshift client. The system is being used by other teams for their work, so I have not had the opportunity to attempt any installations with newer versions.

The issue seems to be that petitboot is unable to parse the GRUB configuration file when RHCOS 4.11.0-rc.2-ppc64le live images are used for installation on our bare-metal cluster. As a result, no entries are populated in the boot menu. We were forced to boot the installed image manually with kexec from the petitboot shell.

Comment 26 IBM Bug Proxy 2022-08-11 19:40:36 UTC
------- Comment From kumarmn.com 2022-08-11 15:37 EDT-------
chavez.com:  Any update on this?

Comment 27 IBM Bug Proxy 2022-08-12 17:00:58 UTC
------- Comment From arbab.com 2022-08-12 12:55 EDT-------
(In reply to comment #3)
> # cat /var/petitboot/mnt/dev/sda3/loader/entries/ostree-1-rhcos.conf
> title Red Hat Enterprise Linux CoreOS 411.86.202207090936-0 (Ootpa) (ostree:0)
> version 1
> options random.trust_cpu=on console=tty0 console=hvc0,115200n8 ignition.platform.id=metal $ignition_firstboot ostree=/ostree/boot.1/rhcos/797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/0 console=ttyS1
> linux /ostree/rhcos-797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/vmlinuz-4.18.0-372.13.1.el8_6.ppc64le
> initrd /ostree/rhcos-797b8f8e1f9f0b6a5118d56f126a1fd518cfbadfaf7cf28deaa744924e38c87e/initramfs-4.18.0-372.13.1.el8_6.ppc64le.img

Could someone please grab /var/log/petitboot/pb-discover.log when this happens? It may tell us why the above is not appearing in the menu.

Comment 28 Thomas L Falcon 2022-08-17 18:52:35 UTC
Created attachment 1906052 [details]
pb-discover.log

Hi, I pulled this from one of the nodes. I see this output a few times in the logs.

[18:38:51] Registering new progress struct
[18:38:51] boot option enP49p3s0f1@0x118d2128 is resolved, sending to clients
[18:38:51] process_read_stdout_once: read failed: Bad file descriptor

Comment 29 Thomas L Falcon 2022-08-23 15:21:30 UTC
Have you been able to review the attachment? I'll provide the logs here in case there is an issue reading it.

# cat /var/petitboot/mnt/dev/nvme0n1p3/loader/entries/ostree-1-rhcos.conf
title Red Hat Enterprise Linux CoreOS 411.86.202208022159-0 (Ootpa) (ostree:0)
version 1
options random.trust_cpu=on console=tty0 console=hvc0,115200n8 ignition.platform.id=metal $ignition_firstboot ostree=/ostree/boot.1/rhcos/69ef35526678094d9ab331a87f97458d144ef220af08e8eaf683bf78d8cfcb32/0
linux /ostree/rhcos-69ef35526678094d9ab331a87f97458d144ef220af08e8eaf683bf78d8cfcb32/vmlinuz-4.18.0-372.19.1.el8_6.ppc64le
initrd /ostree/rhcos-69ef35526678094d9ab331a87f97458d144ef220af08e8eaf683bf78d8cfcb32/initramfs-4.18.0-372.19.1.el8_6.ppc64le.img
# cat /var/log/petitboot/pb-discover.log
[18:28:49] --- pb-discover ---
[18:28:49] config_set_defaults: lang: en_US.utf8
[18:28:49] Detected platform type: powerpc
[18:28:49] Running command:
 exe:  nvram
 argv: 'nvram' '--print-config' '--partition' 'common'
[18:28:49] platform: non-zero completion code 128 from IPMI req
[18:28:49] platform: non-zero completion code 255 from IPMI network req
[18:28:49] configuration:
[18:28:49]  autoboot: disabled
[18:28:49]   boot device 0: network
[18:28:49]   boot device 1: any
[18:28:49]   IPMI boot device 0x00
[18:28:49]   Modifications allowed to disks: yes
[18:28:49]   Default UI to boot on: /dev/hvc0 [IPMI / Serial]
[18:28:49]  language:
[18:28:49] Interface enP49p3s0f0 ready
[18:28:49] Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'enP49p3s0f0' 'up'
[18:28:50] network: bringing up interface enP49p3s0f0
[18:28:50] Interface enP49p3s0f1 ready
[18:28:50] Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'enP49p3s0f1' 'up'
[18:28:50] network: bringing up interface enP49p3s0f1
[18:28:50] SKIP: loop0: ignored (path=/devices/virtual/block/loop0)
[18:28:50] SKIP: loop1: ignored (path=/devices/virtual/block/loop1)
[18:28:50] SKIP: loop2: ignored (path=/devices/virtual/block/loop2)
[18:28:50] SKIP: loop3: ignored (path=/devices/virtual/block/loop3)
[18:28:50] SKIP: loop4: ignored (path=/devices/virtual/block/loop4)
[18:28:50] SKIP: loop5: ignored (path=/devices/virtual/block/loop5)
[18:28:50] SKIP: loop6: ignored (path=/devices/virtual/block/loop6)
[18:28:50] SKIP: loop7: ignored (path=/devices/virtual/block/loop7)
[18:28:50] Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'enP49p3s0f0' 'up'
[18:28:50] network: bringing up interface enP49p3s0f0
[18:28:50] Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'lo' 'up'
[18:28:50] sit0 not marked ready yet
[18:28:50] Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'enP49p3s0f0' 'up'
[18:28:50] network: bringing up interface enP49p3s0f0
[18:28:50] Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'enP49p3s0f1' 'up'
[18:28:50] network: bringing up interface enP49p3s0f1
[18:28:50] network: configuring interface enP49p3s0f0
[18:28:50] Running DHCPv4 client
[18:28:50] Running command:
 exe:  /sbin/udhcpc
 argv: '/sbin/udhcpc' '-R' '-f' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-O' 'reboottime' '-p' '/var/petitboot//udhcpc-enP49p3s0f0.pid' '-i' 'enP49p3s0f0' '-x' '0x5d:000e'
[18:28:50] Running DHCPv6 client
[18:28:50] Running command:
 exe:  /usr/bin/udhcpc6
 argv: '/usr/bin/udhcpc6' '-R' '-f' '-O' 'bootfile_url' '-O' 'bootfile_param' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-p' '/var/petitboot//udhcpc-enP49p3s0f0.pid' '-i' 'enP49p3s0f0' '-x' '0x3d:000e'
[18:28:50] network: configuring interface enP49p3s0f1
[18:28:50] Running DHCPv4 client
[18:28:50] Running command:
 exe:  /sbin/udhcpc
 argv: '/sbin/udhcpc' '-R' '-f' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-O' 'reboottime' '-p' '/var/petitboot//udhcpc-enP49p3s0f1.pid' '-i' 'enP49p3s0f1' '-x' '0x5d:000e'
[18:28:50] Running DHCPv6 client
[18:28:50] Running command:
 exe:  /usr/bin/udhcpc6
 argv: '/usr/bin/udhcpc6' '-R' '-f' '-O' 'bootfile_url' '-O' 'bootfile_param' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-p' '/var/petitboot//udhcpc-enP49p3s0f1.pid' '-i' 'enP49p3s0f1' '-x' '0x3d:000e'
udhcpc: started, v1.33.0
udhcpc: started, v1.33.0
udhcpc6: can't get link-local IPv6 address
udhcpc6: started, v1.33.0
udhcpc: sending discover
udhcpc: sending discover
udhcpc6: sending discover
udhcpc: sending select for 192.168.79.26
udhcpc: lease of 192.168.79.26 obtained, lease time 900
deleting routers
adding dns 192.168.79.1
[18:28:50] trying parsers for enP49p3s0f1
[18:28:50] Running command:
 exe:  /usr/bin/tftp
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-SpYSa8' '-r' '/pxelinux/pxelinux.cfg/01-08-94-ef-80-c2-e8.cfg' '192.168.79.1' '69'
[18:28:50] Registering new progress struct
[18:28:50] boot option enP49p3s0f1@0x118cfbd8 is resolved, sending to clients
[18:28:50] process_read_stdout_once: read failed: Bad file descriptor
[18:28:51] eth0 not marked ready yet
[18:28:52] SKIP: nvme0n1: no ID_FS_TYPE property
[18:28:52] Snapshot successfully created for nvme0n1p4
[18:28:52] mounting device /dev/nvme0n1p4 read-only
[18:28:52] trying parsers for nvme0n1p4
[18:28:52] Running command:
 exe:  /usr/sbin/pb-plugin
 argv: '/usr/sbin/pb-plugin' 'scan' '/var/petitboot/mnt/dev/nvme0n1p4'
Scanning device /var/petitboot/mnt/dev/nvme0n1p4
No plugins found
[18:28:52] SKIP: nvme0n1p2: no ID_FS_TYPE property
[18:28:52] SKIP: nvme0n1p1: no ID_FS_TYPE property
[18:28:52] SKIP: dm-0: no ID_FS_TYPE property
[18:28:52] SKIP: dm-1: no ID_FS_TYPE property
[18:28:52] Snapshot successfully created for nvme0n1p3
[18:28:52] mounting device /dev/nvme0n1p3 read-only
[18:28:52] trying parsers for nvme0n1p3
[18:28:52] grub2: undefined function 'serial'
[18:28:52] grub2: undefined function 'terminal_input'
[18:28:52] grub2: undefined function 'terminal_output'
[18:28:52] grub2: undefined function 'blscf'
[18:28:52] Running command:
 exe:  /usr/sbin/pb-plugin
 argv: '/usr/sbin/pb-plugin' 'scan' '/var/petitboot/mnt/dev/nvme0n1p3'
Scanning device /var/petitboot/mnt/dev/nvme0n1p3
No plugins found
[18:28:52] eth1 not marked ready yet
[18:28:52] SKIP: dm-2: no ID_FS_TYPE property
[18:28:52] SKIP: dm-3: no ID_FS_TYPE property
[18:28:52] SKIP: dm-4: no ID_FS_TYPE property
[18:28:52] SKIP: dm-5: no ID_FS_TYPE property
[18:28:52] enp1s0f1 not marked ready yet
[18:28:52] Interface enp1s0f1 ready
[18:28:52] Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'enp1s0f1' 'up'
udhcpc: sending discover
[18:28:54] network: bringing up interface enp1s0f1
[18:28:54] enp1s0f0 not marked ready yet
[18:28:54] Interface enp1s0f0 ready
[18:28:54] Running command:
 exe:  /sbin/ip
 argv: '/sbin/ip' 'link' 'set' 'enp1s0f0' 'up'
udhcpc6: sending discover
[18:28:56] network: bringing up interface enp1s0f0
[18:28:56] network: configuring interface enp1s0f0
[18:28:56] Running DHCPv4 client
[18:28:56] Running command:
 exe:  /sbin/udhcpc
 argv: '/sbin/udhcpc' '-R' '-f' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-O' 'reboottime' '-p' '/var/petitboot//udhcpc-enp1s0f0.pid' '-i' 'enp1s0f0' '-x' '0x5d:000e'
[18:28:56] Running DHCPv6 client
[18:28:56] Running command:
 exe:  /usr/bin/udhcpc6
 argv: '/usr/bin/udhcpc6' '-R' '-f' '-O' 'bootfile_url' '-O' 'bootfile_param' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-p' '/var/petitboot//udhcpc-enp1s0f0.pid' '-i' 'enp1s0f0' '-x' '0x3d:000e'
udhcpc: started, v1.33.0
udhcpc6: started, v1.33.0
udhcpc: sending discover
udhcpc: sending discover
udhcpc6: sending discover
udhcpc6: sending discover
...
[18:33:50] Running DHCPv4 client
[18:33:50] Running command:
 exe:  /sbin/udhcpc
 argv: '/sbin/udhcpc' '-R' '-f' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-O' 'reboottime' '-p' '/var/petitboot//udhcpc-enP49p3s0f1.pid' '-i' 'enP49p3s0f1' '-x' '0x5d:000e'
udhcpc: received SIGTERM
udhcpc: unicasting a release of 192.168.79.26 to 192.168.79.1
udhcpc: sending release
udhcpc: entering released state
[18:33:50] Running DHCPv6 client
udhcpc: started, v1.33.0
[18:33:50] Running command:
 exe:  /usr/bin/udhcpc6
 argv: '/usr/bin/udhcpc6' '-R' '-f' '-O' 'bootfile_url' '-O' 'bootfile_param' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-p' '/var/petitboot//udhcpc-enP49p3s0f1.pid' '-i' 'enP49p3s0f1' '-x' '0x3d:000e'
udhcpc6: started, v1.33.0
udhcpc: sending discover
udhcpc6: sending discover
udhcpc: sending select for 192.168.79.26
udhcpc: lease of 192.168.79.26 obtained, lease time 900
deleting routers
adding dns 192.168.79.1
[18:33:50] Couldn't find interface matching 08:94:ef:80:c2:e8
[18:33:50] trying parsers for enP49p3s0f1
[18:33:50] Running command:
 exe:  /usr/bin/tftp
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-IyS8G7' '-r' '/pxelinux/pxelinux.cfg/01-08-94-ef-80-c2-e8.cfg' '192.168.79.1' '69'
[18:33:50] Registering new progress struct
[18:33:50] boot option enP49p3s0f1@0x118e72f8 is resolved, sending to clients
[18:33:50] process_read_stdout_once: read failed: Bad file descriptor
udhcpc: sending discover
udhcpc6: sending discover
...
[18:38:50] Running DHCPv4 client
udhcpc: received SIGTERM
udhcpc: unicasting a release of 192.168.79.26 to 192.168.79.1
udhcpc: sending release
[18:38:50] Running command:
 exe:  /sbin/udhcpc
 argv: '/sbin/udhcpc' '-R' '-f' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-O' 'reboottime' '-p' '/var/petitboot//udhcpc-enP49p3s0f1.pid' '-i' 'enP49p3s0f1' '-x' '0x5d:000e'
udhcpc: entering released state
udhcpc6: received SIGTERM
udhcpc6: entering released state
[18:38:50] Running DHCPv6 client
[18:38:50] Running command:
 exe:  /usr/bin/udhcpc6
 argv: '/usr/bin/udhcpc6' '-R' '-f' '-O' 'bootfile_url' '-O' 'bootfile_param' '-O' 'pxeconffile' '-O' 'pxepathprefix' '-p' '/var/petitboot//udhcpc-enP49p3s0f1.pid' '-i' 'enP49p3s0f1' '-x' '0x3d:000e'
udhcpc: started, v1.33.0
udhcpc6: started, v1.33.0
udhcpc: sending discover
udhcpc6: sending discover
udhcpc: sending select for 192.168.79.26
udhcpc: lease of 192.168.79.26 obtained, lease time 900
deleting routers
adding dns 192.168.79.1
[18:38:51] Couldn't find interface matching 08:94:ef:80:c2:e8
[18:38:51] trying parsers for enP49p3s0f1
[18:38:51] Running command:
 exe:  /usr/bin/tftp
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-mWf4c6' '-r' '/pxelinux/pxelinux.cfg/01-08-94-ef-80-c2-e8.cfg' '192.168.79.1' '69'
[18:38:51] Registering new progress struct
[18:38:51] boot option enP49p3s0f1@0x118d2128 is resolved, sending to clients
[18:38:51] process_read_stdout_once: read failed: Bad file descriptor

Comment 30 IBM Bug Proxy 2022-08-23 18:20:55 UTC
------- Comment From arbab.com 2022-08-23 14:16 EDT-------
(In reply to comment #17)
> Have you been able to review the attachment? I'll provide the logs here in
> case there is an issue reading it.

Here's the relevant part of pb-discover.log:

I think the problem is this:

[18:28:52] grub2: undefined function 'blscf'

In the grub.cfg file, the command should be "blscfg" (with a 'g' at the end), not "blscf". I mocked up a recreation environment, and sure enough, without "blscfg" petitboot does not parse the files under /loader/entries, which is why the menu entry is not showing up.

Could you please verify what's in your grub.cfg?
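
A quick way to pull these parser complaints out of a longer pb-discover.log is sketched below. The message format is taken from the log excerpt in this report, and the function name is hypothetical. The "truncated blscfg" check is only a heuristic for this specific symptom.

```python
import re

UNDEFINED = re.compile(r"grub2: undefined function '([^']+)'")


def undefined_grub_functions(log_text: str) -> list:
    """Return the distinct function names petitboot's GRUB parser rejected, in order."""
    seen = []
    for name in UNDEFINED.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


LOG = """\
[18:28:52] trying parsers for nvme0n1p3
[18:28:52] grub2: undefined function 'serial'
[18:28:52] grub2: undefined function 'terminal_input'
[18:28:52] grub2: undefined function 'terminal_output'
[18:28:52] grub2: undefined function 'blscf'
"""

names = undefined_grub_functions(LOG)
# Heuristic: a name that is a proper prefix of 'blscfg' suggests the command was cut short.
suspect = [n for n in names if n != "blscfg" and "blscfg".startswith(n)]
print(names)
print(suspect)
```

A non-empty `suspect` list points at the same symptom described above: the parser saw something shorter than the full `blscfg` command.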

Comment 31 Thomas L Falcon 2022-08-23 21:09:31 UTC
Hey, good catch! I looked at grub.cfg and blscfg is there, but there seems to be some kind of corruption in the file. Viewing the file with cat produces strange results.

# cat /var/petitboot/mnt/dev/nvme0n1p3/grub2/grub.cfg
set pager=1
# petitboot doesn't support -e and doesn't support an empty path part
if [ -d (md/md-boot)/grub2 ]; then
  # fcct currently creates /boot RAID with superblock 1.0, which allows
  # component partitions to be read directly as filesystems.  This is
  # necessary because transposefs doesn't yet rerun grub2-install on BIOS,
  # so GRUB still expects /boot to be a partition on the first disk.
  #
  # There are two consequences:
  # 1. On BIOS and UEFI, the search command might pick an individual RAID
  #    component, but we want it to use the full RAID in case there are bad
  #    sectors etc.  The undocumented --hint option is supposed to support
  #    this sort of override, but it doesn't seem to work, so we set $boot
  #    directly.
  # 2. On BIOS, the "normal" module has already been loaded from an
  #    individual RAID component, and $prefix still points there.  We want
  #    future module loads to come from the RAID, so we reset $prefix.
  #    (On UEFI, the stub grub.cfg has already set $prefix properly.)
  set boot=md/md-boot
  set prefix=($boot)/grub2
else
  if [ -f ${config_directory}/bootuuid.cfg ]; then
    source ${config_directory}/bootuuid.cfg
  fi
  if [ -n "${BOOT_UUID}" ]; then
    search --fs-uuid "${BOOT_UUID}" --set boot --no-floppy
  else
    search --label boot --set boot --no-floppy
  fi
fi
set root=$boot

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
  load_env
fi

if [ x"${feature_menuentry_id}" = xy ]; then
  menuentry_id_option="--id"
else
  menuentry_id_option=""
fi

function load_video {
  if [ x$feature_all_video_module = xy ]; then
    insmod all_video
  else
    insmod efi_gop
    insmod efi_uga
    insmod ieee1275_fb
    insmod vbe
    insmod vga
    insmod video_bochs
    insmod video_cirrus
  fi
}

serial --speed=115200
terminal_input serial console
terminal_output serial console
if [ x$feature_timeout_style = xy ] ; then
  set timeout_style=menu
  set timeout=1
# Fallback normal timeout code in case the timeout_style feature is
# unavailable.
else
  set timeout=1
fi

# Determine if this is a first boot and set the ${ignition_firstboot} variable
# which is used in the kernel command line.
set ignition_firstboot=""
if [ -f "/ignition.firstboot" ]; then
    # Default networking parameters to be used with ignition.
    set ignition_network_kcmdline=''

    # Source in the `ignition.firstboot` file which could override the
    # above $ignition_network_kcmdline with static networking config.
    # This override feature is also by coreos-installer to persist static
    # networking config provided during install to the first boot of the machine.
    source "/ignition.firstboot"

    set ignition_firstboot="ignition.firstboot ${ignition_network_kcmdline}"
fi

# Import user defined configuration
# tracker: https://github.com/coreos/fedora-coreos-tracker/issues/805
if [ -f $prefix/user.cfg ]; then
  source $prefix/user.cfg
fi

blscfg#

For comparison:

# cat /var/petitboot/mnt/dev/nvme0n1p3/loader/entries/ostree-1-rhcos.conf
title Red Hat Enterprise Linux CoreOS 411.86.202208022159-0 (Ootpa) (ostree:0)
version 1
options random.trust_cpu=on console=tty0 console=hvc0,115200n8 ignition.platform.id=metal $ignition_firstboot ostree=/ostree/boot.1/rhcos/69ef35526678094d9ab331a87f97458d144ef220af08e8eaf683bf78d8cfcb32/0
linux /ostree/rhcos-69ef35526678094d9ab331a87f97458d144ef220af08e8eaf683bf78d8cfcb32/vmlinuz-4.18.0-372.19.1.el8_6.ppc64le
initrd /ostree/rhcos-69ef35526678094d9ab331a87f97458d144ef220af08e8eaf683bf78d8cfcb32/initramfs-4.18.0-372.19.1.el8_6.ppc64le.img
#

Comment 32 IBM Bug Proxy 2022-08-23 21:40:29 UTC
------- Comment From arbab.com 2022-08-23 17:37 EDT-------
(In reply to comment #19)
> Hey, good catch! I looked at grub.cfg and blscfg is there, but there is some
> kind of corruption? in the file. Viewing the file with cat produces strange
> results.
>
> # cat /var/petitboot/mnt/dev/nvme0n1p3/grub2/grub.cfg
[snip]
> blscfg#

The file may be truncated, but more likely it's just missing a new-line character at the end of that last line.

Red Hat, can you see why this might be?
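
One way to tell a missing final newline apart from other corruption is to inspect the last byte of the file. The following is a minimal sketch in POSIX shell; the function name ends_with_newline is hypothetical and not part of petitboot or GRUB, and it assumes a tail that supports -c.

```shell
# Hypothetical helper: succeeds only if the file is non-empty and its last
# byte is a newline. Command substitution strips a trailing newline, so the
# captured text is empty exactly when the final byte is "\n".
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Example usage with the path from this report:
# ends_with_newline /var/petitboot/mnt/dev/nvme0n1p3/grub2/grub.cfg \
#     || echo "last line has no trailing newline"
```

A shell prompt printed directly after the last line, as in "blscfg#" above, is consistent with this check failing.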

Comment 33 Benjamin Gilbert 2022-08-24 05:55:26 UTC
Nice find!  The missing trailing newline was introduced by https://github.com/coreos/coreos-assembler/commit/c9036faecb, which is in 4.11, and fixed by https://github.com/coreos/coreos-assembler/commit/45fc1e7518, which isn't.

Could you try adding the missing newline (something like "echo >> /var/petitboot/mnt/dev/nvme0n1p3/grub2/grub.cfg") and confirm that the system then boots correctly?
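
A plain "echo >>" appends a newline unconditionally, so running it twice leaves an extra blank line. As a sketch only, the append can be made conditional; ensure_trailing_newline is a hypothetical name, and the snippet assumes a POSIX shell with a tail that supports -c.

```shell
# Hypothetical helper: append a newline only when the file's last byte is
# not already a newline. Missing or empty files are left untouched.
ensure_trailing_newline() {
    [ -s "$1" ] || return 0
    if [ -n "$(tail -c 1 "$1")" ]; then
        echo >> "$1"
    fi
}

# Example usage with the path from this report:
# ensure_trailing_newline /var/petitboot/mnt/dev/nvme0n1p3/grub2/grub.cfg
```

Running the helper a second time is a no-op, which makes it safe to repeat if it is unclear whether the first attempt succeeded.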

Comment 34 Thomas L Falcon 2022-08-24 14:52:52 UTC
Sorry, I am unable to edit the file, at least not through the petitboot shell.

# echo >> /var/petitboot/mnt/dev/nvme0n1p3/grub2/grub.cfg
-sh: can't create /var/petitboot/mnt/dev/nvme0n1p3/grub2/grub.cfg: Read-only file system

Comment 35 Benjamin Gilbert 2022-08-24 16:40:37 UTC
Okay, try "mount -o remount,rw /var/petitboot/mnt/dev/nvme0n1p3/grub2" beforehand and the same command with "remount,ro" afterward?
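
The whole workaround (remount read-write, append the newline, remount read-only) can be sketched as one function. This is an illustration under assumptions, not a tested procedure for petitboot: the names run and fix_grub_cfg and the DRY_RUN variable are hypothetical, and the first argument is whatever path the remount should be applied to (the comment above passes the grub2 directory).

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 is the path to remount, $2 is the grub.cfg path.
fix_grub_cfg() {
    mnt=$1
    cfg=$2
    run mount -o remount,rw "$mnt" || return 1
    # Append the missing newline; remember the result so the filesystem is
    # returned to read-only even if the append fails.
    run sh -c 'echo >> "$1"' sh "$cfg"
    status=$?
    run mount -o remount,ro "$mnt" || return 1
    return "$status"
}

# Example usage (dry run first, then for real as root):
# DRY_RUN=1; fix_grub_cfg /var/petitboot/mnt/dev/nvme0n1p3/grub2 \
#     /var/petitboot/mnt/dev/nvme0n1p3/grub2/grub.cfg
```

The dry-run mode only prints the three commands, which lets the paths be checked before anything on the disk is modified.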

Comment 36 Thomas L Falcon 2022-08-24 17:29:36 UTC
Alright, thanks, yes, the image is visible now with that change.

 Petitboot (v1.11)                                             9183-22X 13001DA
 ──────────────────────────────────────────────────────────────────────────────
  [Disk: nvme0n1p3 / 96d15588-3596-4b3c-adca-a2ff7279ea63]
    Red Hat Enterprise Linux CoreOS 411.86.202208022159-0 (Ootpa) (ostree:0)
  [Network: enP49p3s0f1 / 08:94:ef:80:c2:e8]
    pxeboot

  System information
  System configuration
  System status log
  Language
  Rescan devices
  Retrieve config from URL
  Plugins (0)
 *Exit to shell

Comment 37 Benjamin Gilbert 2022-08-24 17:42:35 UTC
Great, thanks for checking.

Comment 38 Benjamin Gilbert 2022-08-25 05:09:11 UTC
The code fix has landed in coreos-assembler for RHCOS 4.11.  However, the paperwork on this one is a bit weird, because a) the problem was already fixed in the 4.12 branch but needs a backport, b) the fix won't be visible to users until we bump the RHCOS bootimage, and c) the Bugzilla OCP product is closed to new bugs.  Net result, I have to close this bug CURRENTRELEASE and file a Jira bug for 4.11.  You can track the progress of the 4.11 fix in https://issues.redhat.com/browse/OCPBUGS-565.

Thanks for reporting this.