Bug 2074483 - coreos-installer doesn't work on Dell machines
Summary: coreos-installer doesn't work on Dell machines
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jonathan Lebon
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 2093486
Depends On: 2065893
Blocks: 2052124 2057502 2078998
Reported: 2022-04-12 10:28 UTC by Yuval Kashtan
Modified: 2022-08-10 11:06 UTC (History)
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2078998
Environment:
Last Closed: 2022-08-10 11:06:30 UTC
Target Upstream Version:
Embargoed:


Attachments
boot error (30.07 KB, image/png)
2022-04-12 10:28 UTC, Yuval Kashtan


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | coreos/coreos-installer pull 802 | 0 | None | Merged | blockdev: rework EFI vendor dir checking | 2022-04-12 15:15:16 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:06:50 UTC

Description Yuval Kashtan 2022-04-12 10:28:16 UTC
Created attachment 1871930
boot error

OCP Version at Install Time: 4.11.0-0.nightly-2022-04-08-205307
Platform: bare metal (Dell machines)
Architecture: x86_64


What are you trying to do? What is your use case?
Install a worker node (with IPI).


What happened? What went wrong or what did you expect?
coreos-installer fails at boot with "Error: Expected one vendor dir".
This looks like https://github.com/coreos/fedora-coreos-tracker/issues/1116,
which I believe was fixed in https://github.com/coreos/coreos-installer/pull/802 (but that fix is not yet included in an RHCOS release).

Screenshot attached.
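
For anyone trying to confirm the same condition on their own hardware, here is a minimal diagnostic sketch in Python. It is not the installer's actual logic. It assumes, based only on the wording of the error, that the check counts directories under `EFI/` on the EFI System Partition other than `BOOT`. The function name `list_vendor_dirs` and the mount point passed to it are hypothetical.

```python
from pathlib import Path
from typing import List


def list_vendor_dirs(esp_mount: str) -> List[str]:
    """Return the names of directories under <esp_mount>/EFI, excluding BOOT.

    Assumption: a "vendor dir" is any directory in EFI/ that is not the
    generic BOOT directory. This is inferred from the error text only.
    """
    efi = Path(esp_mount) / "EFI"
    return sorted(
        entry.name
        for entry in efi.iterdir()
        # FAT file systems are case-insensitive, so compare in upper case.
        if entry.is_dir() and entry.name.upper() != "BOOT"
    )
```

If this returns more than one name for the ESP of an affected machine, that would be consistent with the "Expected one vendor dir" error reported here.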

Comment 1 Micah Abbott 2022-04-12 12:49:47 UTC
Since this is fixed upstream, we will need to make a new release of `coreos-installer` with the fix as part of 4.11.

Assigning this to Jonathan, who did the fix upstream.

Comment 2 RHCOS Bug Bot 2022-05-03 21:00:37 UTC
This bug has been reported fixed in a new RHCOS build and is ready for QE verification.  To mark the bug verified, set the Verified field to Tested.  This bug will automatically move to MODIFIED once the fix has landed in a new bootimage.

Comment 6 Gowrishankar Rajaiyan 2022-05-12 11:41:15 UTC
Hi,

When can we expect a build? Or is there one already?

Comment 7 Jonathan Lebon 2022-05-12 16:48:31 UTC
The bug listed as "Depends On" on this bug links to the installer PR which includes the fixed RHCOS build: https://github.com/openshift/installer/pull/5887

Comment 8 RHCOS Bug Bot 2022-05-14 05:26:58 UTC
The fix for this bug has landed in a bootimage bump, as tracked in bug 2065893 (now in status MODIFIED).  Moving this bug to MODIFIED.

Comment 10 Yuval Kashtan 2022-05-17 12:05:40 UTC
Note that this is still not fixed in OCP.
machine-os-images still includes an RHCOS image from March:
```
tar tvf 451362729673f5483e7d55cbeab6c5e44212a30a598cfd553bb01f7daa22845a.tar 
drwxr-xr-x 0/0               0 2022-05-11 07:05 coreos/
-rwxr-xr-x 0/0               0 1970-01-01 02:00 coreos/.wh..wh..opq
-rw-r--r-- 0/0           34791 2022-05-11 07:05 coreos/coreos-stream.json
-rw-r--r-- 0/0      1098907648 2022-03-18 20:57 coreos/coreos-x86_64.iso
```
That image still uses the old coreos-installer, so we still can't deploy on Dell machines.
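
The same check can be scripted. The following is a minimal sketch using Python's standard `tarfile` module; the helper name `member_mtimes` and the default `coreos/` prefix are illustrative assumptions, and the layer tarball path is whatever was extracted from the machine-os-images container. Timestamps are returned in UTC, whereas `tar tvf` prints local time.

```python
import tarfile
from datetime import datetime, timezone
from typing import Dict


def member_mtimes(tar_path: str, prefix: str = "coreos/") -> Dict[str, datetime]:
    """Map each regular file under `prefix` in the tarball to its mtime (UTC)."""
    with tarfile.open(tar_path) as tar:
        return {
            member.name: datetime.fromtimestamp(member.mtime, tz=timezone.utc)
            for member in tar.getmembers()
            if member.isfile() and member.name.startswith(prefix)
        }
```

An ISO whose modification time is older than the date the fix landed would suggest the image still carries the old coreos-installer.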

Comment 11 Micah Abbott 2022-05-17 13:56:47 UTC
(In reply to Yuval Kashtan from comment #10)
> Note that this is still not fixed in OCP.
> machine-os-images still includes an RHCOS image from March:
> ```
> tar tvf 451362729673f5483e7d55cbeab6c5e44212a30a598cfd553bb01f7daa22845a.tar 
> drwxr-xr-x 0/0               0 2022-05-11 07:05 coreos/
> -rwxr-xr-x 0/0               0 1970-01-01 02:00 coreos/.wh..wh..opq
> -rw-r--r-- 0/0           34791 2022-05-11 07:05 coreos/coreos-stream.json
> -rw-r--r-- 0/0      1098907648 2022-03-18 20:57 coreos/coreos-x86_64.iso
> ```
> That image still uses the old coreos-installer, so we still can't deploy on Dell machines.

Perhaps confusingly, the CoreOS team is not responsible for the machine-os-images container. I think that belongs to the Bare Metal IPI team?

https://github.com/openshift/machine-os-images

Additionally, it looks like there hasn't been a successful 4.11 nightly payload for 6 days, which is probably a factor too.

https://amd64.ocp.releases.ci.openshift.org/#4.11.0-0.nightly

However, some 4.11 CI payloads have been accepted within the last 24 hours:

https://amd64.ocp.releases.ci.openshift.org/#4.11.0-0.ci

Comment 12 Yuval Kashtan 2022-05-17 18:31:42 UTC
That is why I've created https://bugzilla.redhat.com/show_bug.cgi?id=2087127

Comment 13 Yuval Kashtan 2022-05-19 12:55:22 UTC
I can verify that the latest 4.11 nightly works on Dell.
:yay:

Comment 14 HuijingHei 2022-05-19 13:38:19 UTC
Thanks for the update, Yuval. Changing status to VERIFIED based on comment #13.

Comment 15 Pedro Amoedo 2022-05-19 17:26:30 UTC
My 2 cents:

Also tested on UPI bare metal, dual-stack (Packet), with disk and etcd encryption, using the latest RHCOS image (411.85.202205101201-0) along with the latest 4.11 nightly (4.11.0-0.nightly-2022-05-18-171831). The machines are all PowerEdge R6515 with different BIOS versions, and the coreos-installer UEFI boot error is no longer present. So far so good.

NOTE: FWIW, there was an unexpected "failed to fetch config" Ignition message in all master/worker consoles that wasn't present in other RHCOS 4.11 versions. Despite this message, installation was successful. Example:

~~~
Red Hat Enterprise Linux CoreOS 411.85.202205170333-0 (Ootpa) 4.11
Ignition: ran on 2022/05/19 15:30:40 UTC (at least 1 boot ago)
Ignition: user-provided config was applied
Ignition: failed to fetch config: resource requires networking     <-----------------------
SSH host key: SHA256:QaVBFC5dzcW31IeYlSty1NLztaJo+2B6ZU6diYk/Uw4 (ED25519)
SSH host key: SHA256:4zAJNk67hSFr7CwGCTOk/ELdZ35ar8lWewvEW1TQPcQ (ECDSA)
SSH host key: SHA256:wF/md0RzcgdBQYKqG3ySwIS4zWlLpMCODEIGO2MZe2U (RSA)
bond0:  
ens3f0:  
ens3f1:  
master-02 login:
~~~
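
The RHCOS version strings in this comment (411.85.202205101201-0 and 411.85.202205170333-0) differ only in their third field, which looks like a build timestamp. Below is a minimal Python sketch for extracting it, for example to compare a bootimage against the date a fix landed. It assumes the third field has the form YYYYMMDDHHMM; that reading is inferred from the strings in this thread, and the helper name `rhcos_build_time` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: <stream>.<rhel>.<YYYYMMDDHHMM>-<respin>
_RHCOS_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d{12})-(\d+)$")


def rhcos_build_time(version: str) -> datetime:
    """Parse the timestamp field out of an RHCOS version string."""
    match = _RHCOS_VERSION.match(version)
    if match is None:
        raise ValueError(f"unrecognized RHCOS version: {version!r}")
    return datetime.strptime(match.group(3), "%Y%m%d%H%M")


# The console banner above shows a newer build than the image named in the text.
print(rhcos_build_time("411.85.202205170333-0") > rhcos_build_time("411.85.202205101201-0"))
# prints True
```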

Best Regards.

Comment 16 Benjamin Gilbert 2022-05-19 18:36:08 UTC
The "failed to fetch config" message is harmless and will be fixed in 4.11: https://github.com/coreos/fedora-coreos-tracker/issues/1159

Comment 17 Pedro Amoedo 2022-05-23 08:29:04 UTC
ACK Benjamin, thanks for the confirmation and the tracker link.

Comment 18 Jonathan Lebon 2022-06-15 13:17:08 UTC
*** Bug 2093486 has been marked as a duplicate of this bug. ***

Comment 20 Jonathan Lebon 2022-06-16 15:43:31 UTC
*** Bug 2093486 has been marked as a duplicate of this bug. ***

Comment 21 epassaro 2022-07-04 15:54:38 UTC
I'm facing a similar issue when trying to install a 4.11 spoke cluster with 4.11.0-0.nightly-2022-07-01-065600, where two agents are stuck at Rebooting and the cluster is in "installing-pending-user-action" state.
We're using VMs to simulate a BM cluster, and rebooting manually didn't help. 
Can it be related to this bug?

Comment 22 Osher De Paz 2022-07-04 16:02:07 UTC
@epassaro Not likely, but try to find out where the reboot is getting stuck. For example, virt-manager can be used if those are KVM machines.
If you're seeing "Error: Expected one vendor dir on /dev/sda2, got 2" then it's probably a very similar case.
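
When a console log has been captured as text, a search for that message can be scripted. This is a minimal Python sketch; the pattern is built from the error string quoted above, and the helper name `find_vendor_dir_error` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "Error: Expected one vendor dir on /dev/sda2, got 2"
_VENDOR_DIR_ERROR = re.compile(
    r"Expected one vendor dir on (?P<device>[^\s,]+), got (?P<count>\d+)"
)


def find_vendor_dir_error(log_text: str) -> Optional[Tuple[str, int]]:
    """Return (device, count) for the first vendor dir error found, else None."""
    match = _VENDOR_DIR_ERROR.search(log_text)
    if match is None:
        return None
    return match.group("device"), int(match.group("count"))
```

A `None` result, as in the follow-up below where the VM reboots without errors, points to a different cause.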

Comment 23 epassaro 2022-07-04 16:21:08 UTC
Thanks Osher.

Actually when rebooting the VM, it reboots without errors. I checked both the libvirt logs for the VM and rebooting from the graphical interface. It's the agents and installation that are still stuck, so it's likely due to another issue.

Comment 24 errata-xmlrpc 2022-08-10 11:06:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

