Created attachment 1871930
boot error

OCP Version at Install Time: 4.11.0-0.nightly-2022-04-08-205307
Platform: bare metal (Dell machines)
Architecture: x86_64

What are you trying to do? What is your use case?
Install a worker node (with IPI).

What happened? What went wrong or what did you expect?
coreos-installer boot fails with "Error: Expected one vendor dir"
https://github.com/coreos/fedora-coreos-tracker/issues/1116
which I believe was fixed here:
https://github.com/coreos/coreos-installer/pull/802
(but not yet included in an RHCOS release).

Screenshot attached.
Since this is fixed upstream, we will need to make a new release of `coreos-installer` with the fix as part of 4.11. Assigning this to Jonathan, who did the fix upstream.
This bug has been reported fixed in a new RHCOS build and is ready for QE verification. To mark the bug verified, set the Verified field to Tested. This bug will automatically move to MODIFIED once the fix has landed in a new bootimage.
Hi, when can we expect a build? Or is there one already?
The bug listed as "Depends On" on this bug links to the installer PR which includes the fixed RHCOS build: https://github.com/openshift/installer/pull/5887
The fix for this bug has landed in a bootimage bump, as tracked in bug 2065893 (now in status MODIFIED). Moving this bug to MODIFIED.
Note, this is still not fixed in OCP.
machine-os-images still includes an RHCOS image from March:
```
tar tvf 451362729673f5483e7d55cbeab6c5e44212a30a598cfd553bb01f7daa22845a.tar
drwxr-xr-x 0/0               0 2022-05-11 07:05 coreos/
-rwxr-xr-x 0/0               0 1970-01-01 02:00 coreos/.wh..wh..opq
-rw-r--r-- 0/0           34791 2022-05-11 07:05 coreos/coreos-stream.json
-rw-r--r-- 0/0      1098907648 2022-03-18 20:57 coreos/coreos-x86_64.iso
```
That image still uses the old coreos-installer, so we still can't deploy on Dell machines.
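For anyone repeating this check, here is a minimal sketch that scans `tar tvf` output for `.iso` entries and flags the ones dated before a chosen cutoff. The function names and the cutoff date are assumptions made for illustration, and the parsing assumes the GNU tar verbose layout shown in the listing above (mode, owner/group, size, date, time, name).

```python
from datetime import date, datetime


def iso_entries(listing):
    """Return (path, modification date) for each .iso entry in `tar tvf` output."""
    entries = []
    for line in listing.splitlines():
        fields = line.split()
        # Assumed layout: mode owner/group size YYYY-MM-DD HH:MM name
        if len(fields) < 6:
            continue
        name = " ".join(fields[5:])
        if not name.endswith(".iso"):
            continue
        mtime = datetime.strptime(fields[3], "%Y-%m-%d").date()
        entries.append((name, mtime))
    return entries


def stale_isos(listing, cutoff):
    """Return the .iso entries whose date is strictly before `cutoff`."""
    return [(name, mtime) for name, mtime in iso_entries(listing) if mtime < cutoff]


# Listing copied from the comment above.
LISTING = """\
drwxr-xr-x 0/0               0 2022-05-11 07:05 coreos/
-rwxr-xr-x 0/0               0 1970-01-01 02:00 coreos/.wh..wh..opq
-rw-r--r-- 0/0           34791 2022-05-11 07:05 coreos/coreos-stream.json
-rw-r--r-- 0/0      1098907648 2022-03-18 20:57 coreos/coreos-x86_64.iso
"""

if __name__ == "__main__":
    # Hypothetical cutoff; pick the date of the first build known to carry the fix.
    for name, mtime in stale_isos(LISTING, date(2022, 5, 1)):
        print(f"{name} is dated {mtime}, older than the cutoff")
```

The file date only hints at which RHCOS build is embedded; `coreos/coreos-stream.json` in the same layer is the more reliable place to confirm the build.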
(In reply to Yuval Kashtan from comment #10)
> note, this is still not fixed in OCP.
> machine-os-images still include an image of rhcos from March:
> ```
> tar tvf 451362729673f5483e7d55cbeab6c5e44212a30a598cfd553bb01f7daa22845a.tar
> drwxr-xr-x 0/0 0 2022-05-11 07:05 coreos/
> -rwxr-xr-x 0/0 0 1970-01-01 02:00 coreos/.wh..wh..opq
> -rw-r--r-- 0/0 34791 2022-05-11 07:05 coreos/coreos-stream.json
> -rw-r--r-- 0/0 1098907648 2022-03-18 20:57 coreos/coreos-x86_64.iso
> ```
> which still uses the old coreos-installer
> thus we still cant deploy on Dell machines..

Maybe confusingly, the CoreOS team is not responsible for the machine-os-images container. I think that is the Bare Metal IPI team?
https://github.com/openshift/machine-os-images

Additionally, it looks like there hasn't been a successful 4.11 nightly payload for 6 days, which probably is a factor, too.
https://amd64.ocp.releases.ci.openshift.org/#4.11.0-0.nightly

However, there are some accepted 4.11 CI payloads within the last 24 hours:
https://amd64.ocp.releases.ci.openshift.org/#4.11.0-0.ci
which is why I've created https://bugzilla.redhat.com/show_bug.cgi?id=2087127
I can verify that the latest 4.11 nightly works on Dell :yay:
Thanks for the update, Yuval. Changing status to VERIFIED based on comment #13.
My 2 cents: Tested also on UPI BM dual-stack (Packet) with disk and etcd encryption, using the latest RHCOS image (411.85.202205101201-0) along with the latest 4.11 nightly (4.11.0-0.nightly-2022-05-18-171831). The machines are all PowerEdge R6515 with different BIOS versions, and the coreos-installer UEFI boot error is no longer present. So far so good.

NOTE: FWIW, there was an unexpected "failed to fetch config" Ignition message in all master/worker consoles that wasn't present in other RHCOS 4.11 versions. Despite this message, installation was successful. Example:
~~~
Red Hat Enterprise Linux CoreOS 411.85.202205170333-0 (Ootpa) 4.11
Ignition: ran on 2022/05/19 15:30:40 UTC (at least 1 boot ago)
Ignition: user-provided config was applied
Ignition: failed to fetch config: resource requires networking <-----------------------
SSH host key: SHA256:QaVBFC5dzcW31IeYlSty1NLztaJo+2B6ZU6diYk/Uw4 (ED25519)
SSH host key: SHA256:4zAJNk67hSFr7CwGCTOk/ELdZ35ar8lWewvEW1TQPcQ (ECDSA)
SSH host key: SHA256:wF/md0RzcgdBQYKqG3ySwIS4zWlLpMCODEIGO2MZe2U (RSA)
bond0:
ens3f0:
ens3f1:
master-02 login:
~~~
Best Regards.
The "failed to fetch config" message is harmless and will be fixed in 4.11: https://github.com/coreos/fedora-coreos-tracker/issues/1159
ACK Benjamin, thanks for the confirmation and the tracker link.
*** Bug 2093486 has been marked as a duplicate of this bug. ***
I'm facing a similar issue when trying to install a 4.11 spoke cluster with 4.11.0-0.nightly-2022-07-01-065600, where two agents are stuck at Rebooting and the cluster is in "installing-pending-user-action" state. We're using VMs to simulate a BM cluster, and rebooting manually didn't help. Can it be related to this bug?
@epassaro Not likely, but try to find out what the reboot is stuck on. For example, virt-manager can be used if those are KVM machines. If you're seeing "Error: Expected one vendor dir on /dev/sda2, got 2", then it's probably a very similar case.
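To check for that condition by hand, a rough diagnostic sketch follows. It assumes the EFI system partition is already mounted somewhere and that a "vendor dir" means any subdirectory of `EFI/` other than `BOOT`. That reading is inferred from the wording of the error message and is not taken from the coreos-installer source; the helper name and the `dell` directory in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def vendor_dirs(esp_root):
    """List subdirectories of <esp_root>/EFI, excluding the fallback BOOT dir."""
    efi = Path(esp_root) / "EFI"
    if not efi.is_dir():
        return []
    return sorted(
        entry.name
        for entry in efi.iterdir()
        # Assumption: BOOT (any case) is the removable-media fallback, not a vendor dir.
        if entry.is_dir() and entry.name.upper() != "BOOT"
    )


if __name__ == "__main__":
    # Simulate an ESP where something besides the OS added its own directory.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("BOOT", "redhat", "dell"):
            (Path(tmp) / "EFI" / name).mkdir(parents=True)
        found = vendor_dirs(tmp)
        print(f"vendor dirs: {found}")
        if len(found) != 1:
            print(f"more or fewer than one vendor dir ({len(found)}); may match this bug")
```

On a real machine, pass the mount point of the ESP (the partition named in the error, `/dev/sda2` in the example message) instead of the temporary directory.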
Thanks Osher. Actually when rebooting the VM, it reboots without errors. I checked both the libvirt logs for the VM and rebooting from the graphical interface. It's the agents and installation that are still stuck, so it's likely due to another issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069