Description of problem:
When installing a new baremetal IPI cluster, the worker nodes are not joining the cluster.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-12-230035

How reproducible:
It happens consistently with this version on different environments.

Steps to Reproduce:
1. Run the installer
2. Wait for worker nodes to join the cluster

Actual results:
[root@cnfdb4-installer ~]# oc get node
NAME                                            STATUS   ROLES            AGE   VERSION
dhcp19-17-23.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   18h   v1.19.0+4336ff4
dhcp19-17-24.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   18h   v1.19.0+4336ff4
dhcp19-17-7.clus2.t5g.lab.eng.bos.redhat.com    Ready    master,virtual   18h   v1.19.0+4336ff4

[root@cnfdb4-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                       HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb4-master-0   OK       externally provisioned   cnfdb4-z5nw2-master-0         ipmi://10.19.17.62:6230                      true
openshift-machine-api   cnfdb4-master-1   OK       externally provisioned   cnfdb4-z5nw2-master-1         ipmi://10.19.17.62:6231                      true
openshift-machine-api   cnfdb4-master-2   OK       externally provisioned   cnfdb4-z5nw2-master-2         ipmi://10.19.17.62:6232                      true
openshift-machine-api   cnfdb4-worker-0   OK       provisioned              cnfdb4-z5nw2-worker-0-mx2g8   ipmi://10.19.17.62:6240   unknown            true
openshift-machine-api   cnfdb4-worker-1   OK       provisioned              cnfdb4-z5nw2-worker-0-xwfxt   ipmi://10.19.28.23        unknown            true

[root@cnfdb4-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb4-z5nw2-master-0         Running                              18h
openshift-machine-api   cnfdb4-z5nw2-master-1         Running                              18h
openshift-machine-api   cnfdb4-z5nw2-master-2         Running                              18h
openshift-machine-api   cnfdb4-z5nw2-worker-0-mx2g8   Provisioned                          18h
openshift-machine-api   cnfdb4-z5nw2-worker-0-xwfxt   Provisioned                          18h

Expected results:
All the workers should join the cluster.

Additional info:
Logs from one of the worker nodes:

[core@localhost ~]$ cat /etc/resolv.conf
# Generated by NetworkManager
search local clus2.t5g.lab.eng.bos.redhat.com

[core@localhost ~]$ host quay.io
;; connection timed out; no servers could be reached

journalctl logs from the worker node:

Sep 15 08:26:11 localhost sh[3725]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02af38047ceaf2ca00f0870ae60b750d5d8c4ee6fd5ac2af9aa9ba763012e266": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02af38047ceaf2ca00f0870ae60b750d5d8c4ee6fd5ac2af9aa9ba763012e266: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02af38047ceaf2ca00f0870ae60b750d5d8c4ee6fd5ac2af9aa9ba763012e266: (Mirrors also failed: [dhcp19-17-62.clus2.t5g.lab.eng.bos.redhat.com:5000/ocp4/openshift4@sha256:02af38047ceaf2ca00f0870ae60b750d5d8c4ee6fd5ac2af9aa9ba763012e266: error pinging docker registry dhcp19-17-62.clus2.t5g.lab.eng.bos.redhat.com:5000: Get "https://dhcp19-17-62.clus2.t5g.lab.eng.bos.redhat.com:5000/v2/": dial tcp: lookup dhcp19-17-62.clus2.t5g.lab.eng.bos.redhat.com on [::1]:53: read udp [::1]:43164->[::1]:53: read: connection refused]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02af38047ceaf2ca00f0870ae60b750d5d8c4ee6fd5ac2af9aa9ba763012e266: error pinging docker registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:43398->[::1]:53: read: connection refused
Sep 15 10:45:00 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1582]: <info> [1600166700.8954] dhcp4 (ens1f0): option domain_name_servers => '10.19.42.41 10.11.5.19 10.5.30.160'
Sep 15 10:45:51 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com ignition[1872]: INFO : files: createFilesystemsFiles: createFiles: op(19): [started] writing file "/sysroot/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
Sep 15 10:45:51 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com ignition[1872]: INFO : files: createFilesystemsFiles: createFiles: op(19): [finished] writing file "/sysroot/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
Sep 15 10:46:09 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[3116]: NM resolv-prepender triggered by enp1s0f4u4 up.
Sep 15 10:46:10 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[3116]: req:4 'up' [enp1s0f4u4], "/etc/NetworkManager/dispatcher.d/30-resolv-prepender": complete: failed with Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 125.
Sep 15 10:46:10 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[3102]: <warn> [1600166770.5469] dispatcher: (4) /etc/NetworkManager/dispatcher.d/30-resolv-prepender failed (failed): Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 125.
Sep 15 10:46:12 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[3102]: <info> [1600166772.7926] dhcp4 (ens1f0): option domain_name_servers => '10.19.42.41 10.11.5.19 10.5.30.160'
Sep 15 10:46:12 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[3116]: NM resolv-prepender triggered by ens1f0 up.
Sep 15 10:46:12 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[3116]: req:7 'up' [ens1f0], "/etc/NetworkManager/dispatcher.d/30-resolv-prepender": complete: failed with Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 125.
Sep 15 10:46:12 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[3102]: <warn> [1600166772.9297] dispatcher: (7) /etc/NetworkManager/dispatcher.d/30-resolv-prepender failed (failed): Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 125.
Sep 15 10:46:13 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[3116]: NM resolv-prepender triggered by eno1 up.
Sep 15 10:46:13 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[3116]: req:10 'up' [eno1], "/etc/NetworkManager/dispatcher.d/30-resolv-prepender": complete: failed with Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 125.
Sep 15 10:46:13 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[3102]: <warn> [1600166773.5118] dispatcher: (10) /etc/NetworkManager/dispatcher.d/30-resolv-prepender failed (failed): Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 125.
Sep 15 10:46:10 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com root[3237]: NM mdns-hostname triggered by up.
Sep 15 10:46:10 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[3116]: <13>Sep 15 10:46:10 root: NM mdns-hostname triggered by up.
Sep 15 10:46:10 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com root[3240]: Hostname changed: cnfdb4.clus2.t5g.lab.eng.bos.redhat.com
Sep 15 10:46:10 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[3116]: <13>Sep 15 10:46:10 root: Hostname changed: cnfdb4.clus2.t5g.lab.eng.bos.redhat.com
Sep 15 10:46:10 cnfdb4.clus2.t5g.lab.eng.bos.redhat.com network-manager/90-long-hostname[3244]: hostname is already set
I think this is a regression caused by https://github.com/openshift/machine-config-operator/pull/2062 - we copy a resolv.conf early, before it contains any actual DNS entries, so the podman run then fails. It seems to be timing-dependent, so it doesn't reproduce in all environments, but we probably need a better test than file-exists to decide whether the existing file should be overwritten. I'm also not completely clear on the motivation for that change from an OpenStack perspective; if it's not actually needed for baremetal, we could perhaps consider a partial revert to restore the previous behavior for baremetal.
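For illustration, a guard along these lines (the function names and call sites are hypothetical, not the actual MCO dispatcher script) would treat an existing resolv.conf as authoritative only when it already carries nameserver entries, rather than merely existing:

```shell
# Sketch of a "has usable DNS entries" check, instead of a bare file-exists test.
has_nameservers() {
    grep -q '^nameserver ' "$1" 2>/dev/null
}

# Overwrite the destination only when it is missing or has no usable entries,
# so an early copy of an empty resolv.conf doesn't block a later real one.
maybe_overwrite_resolv() {
    src="$1"   # freshly generated resolv.conf candidate
    dst="$2"   # e.g. /etc/resolv.conf
    if ! has_nameservers "$dst"; then
        cp "$src" "$dst"
    fi
}
```

The point is only that the overwrite decision keys on content, not existence; the real fix would live inside the resolv-prepender flow.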
*** Bug 1879499 has been marked as a duplicate of this bug. ***
@Jainlin would your team be able to help verify this BZ?
Recently we also hit a similar issue on the vSphere platform - https://bugzilla.redhat.com/show_bug.cgi?id=1879322. But our team did not test IPI on baremetal; I will ask the edge QE team for help with this verification.
*** Bug 1879322 has been marked as a duplicate of this bug. ***
Verified on 4.6.0-0.nightly-2020-09-21-030155. @Johnny Liu, it works on IPI BM. If it doesn't work on vSphere, please open a separate BZ.
*** Bug 1874869 has been marked as a duplicate of this bug. ***
Happened again in 4.6.0-0.nightly-2020-09-22-011738.

[root@cnfdb4-installer ~]# oc get node
NAME                                            STATUS   ROLES            AGE     VERSION
dhcp19-17-23.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h4m    v1.19.0+7e8389f
dhcp19-17-24.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h16m   v1.19.0+7e8389f
dhcp19-17-7.clus2.t5g.lab.eng.bos.redhat.com    Ready    master,virtual   3h16m   v1.19.0+7e8389f

[root@cnfdb4-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                        HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb4-master-0   OK       externally provisioned   cnfdb4-sx7tj-master-0         ipmi://10.19.17.137:6230                      true
openshift-machine-api   cnfdb4-master-1   OK       externally provisioned   cnfdb4-sx7tj-master-1         ipmi://10.19.17.137:6231                      true
openshift-machine-api   cnfdb4-master-2   OK       externally provisioned   cnfdb4-sx7tj-master-2         ipmi://10.19.17.137:6232                      true
openshift-machine-api   cnfdb4-worker-0   OK       provisioned              cnfdb4-sx7tj-worker-0-cdscz   ipmi://10.19.17.137:6240   unknown            true
openshift-machine-api   cnfdb4-worker-1   OK       provisioned              cnfdb4-sx7tj-worker-0-brhw9   ipmi://10.19.28.23         unknown            true

[root@cnfdb4-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb4-sx7tj-master-0         Running                              3h27m
openshift-machine-api   cnfdb4-sx7tj-master-1         Running                              3h27m
openshift-machine-api   cnfdb4-sx7tj-master-2         Running                              3h27m
openshift-machine-api   cnfdb4-sx7tj-worker-0-brhw9   Provisioned                          174m
openshift-machine-api   cnfdb4-sx7tj-worker-0-cdscz   Provisioned                          174m

[root@cnfdb4-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-09-22-011738
Server Version: 4.6.0-0.nightly-2020-09-22-011738
Kubernetes Version: v1.19.0+f5121a6
Is it an IPI BM environment? We verified with redfish.
[kni@ocp-edge06 ~]$ oc version
Client Version: 4.6.0-0.nightly-2020-09-22-130743
Server Version: 4.6.0-0.nightly-2020-09-22-130743
Kubernetes Version: v1.19.0+f5121a6

[kni@ocp-edge06 ~]$ oc get nodes
NAME                 STATUS   ROLES    AGE   VERSION
openshift-master-0   Ready    master   11h   v1.19.0+7e8389f
openshift-master-1   Ready    master   11h   v1.19.0+7e8389f
openshift-master-2   Ready    master   11h   v1.19.0+7e8389f
openshift-worker-0   Ready    worker   11h   v1.19.0+7e8389f
openshift-worker-1   Ready    worker   11h   v1.19.0+7e8389f

[kni@ocp-edge06 ~]$ oc get bmh -A
NAMESPACE               NAME                 STATUS   PROVISIONING STATUS      CONSUMER                        BMC                                          HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0   OK       externally provisioned   ocp-edge-k26sz-master-0         redfish://10.46.2.220/redfish/v1/Systems/1                      true
openshift-machine-api   openshift-master-1   OK       externally provisioned   ocp-edge-k26sz-master-1         redfish://10.46.2.221/redfish/v1/Systems/1                      true
openshift-machine-api   openshift-master-2   OK       externally provisioned   ocp-edge-k26sz-master-2         redfish://10.46.2.222/redfish/v1/Systems/1                      true
openshift-machine-api   openshift-worker-0   OK       provisioned              ocp-edge-k26sz-worker-0-d4dqz   redfish://10.46.2.223/redfish/v1/Systems/1   unknown            true
openshift-machine-api   openshift-worker-1   OK       provisioned              ocp-edge-k26sz-worker-0-hf6cf   redfish://10.46.2.224/redfish/v1/Systems/1   unknown            true
Yes, it is an IPI BM environment. The problem is inconsistent; today I was able to deploy the same environment with no issue on 4.6.0-0.nightly-2020-09-23-022756.
(In reply to Sabina Aledort from comment #13)
> Yes, it is IPI BM environment.
> The problem is inconsistent, today i was able to deploy the same environment
> with no issue with 4.6.0-0.nightly-2020-09-23-022756.

In that case I moved it back to verified.
We don't see this problem.
Feel free to re-open it if it happens again.
(In reply to Lubov from comment #14)
> (In reply to Sabina Aledort from comment #13)
> > Yes, it is IPI BM environment.
> > The problem is inconsistent, today i was able to deploy the same environment
> > with no issue with 4.6.0-0.nightly-2020-09-23-022756.
>
> In that case I moved it back to verified
> We don't see this problem
>
> Feel free to re-open it if it happens again

As far as I understand, the fix has been included since 4.6.0-0.nightly-2020-09-21-081745 (https://github.com/openshift/machine-config-operator/pull/2094), but this issue happened again in a later version, 4.6.0-0.nightly-2020-09-22-130743.
Hi, this happened again today in two different clusters, with 4.6.0-0.nightly-2020-09-29-170625 and 4.6.0-0.nightly-2020-09-30-052433; redeploying with 4.6.0-0.nightly-2020-09-28-212756 worked.

[root@cnfdb3-installer ~]# oc get node
NAME                                            STATUS   ROLES            AGE     VERSION
dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h13m   v1.19.0+bafba66
dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h37m   v1.19.0+bafba66
dhcp19-17-14.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h37m   v1.19.0+bafba66

[root@cnfdb3-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                        HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb3-master-0   OK       externally provisioned   cnfdb3-bdnbp-master-0         ipmi://10.19.17.183:6230                      true
openshift-machine-api   cnfdb3-master-1   OK       externally provisioned   cnfdb3-bdnbp-master-1         ipmi://10.19.17.183:6231                      true
openshift-machine-api   cnfdb3-master-2   OK       externally provisioned   cnfdb3-bdnbp-master-2         ipmi://10.19.17.183:6232                      true
openshift-machine-api   cnfdb3-worker-0   OK       provisioned              cnfdb3-bdnbp-worker-0-g4gn4   ipmi://10.19.28.20         unknown            true
openshift-machine-api   cnfdb3-worker-1   OK       provisioned              cnfdb3-bdnbp-worker-0-h8fdh   ipmi://10.19.28.21         unknown            true
openshift-machine-api   cnfdb3-worker-2   OK       provisioned              cnfdb3-bdnbp-worker-0-5t78f   ipmi://10.19.28.22         unknown            true

[root@cnfdb3-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb3-bdnbp-master-0         Running                              3h48m
openshift-machine-api   cnfdb3-bdnbp-master-1         Running                              3h48m
openshift-machine-api   cnfdb3-bdnbp-master-2         Running                              3h48m
openshift-machine-api   cnfdb3-bdnbp-worker-0-5t78f   Provisioned                          3h5m
openshift-machine-api   cnfdb3-bdnbp-worker-0-g4gn4   Provisioned                          3h5m
openshift-machine-api   cnfdb3-bdnbp-worker-0-h8fdh   Provisioned                          3h5m

[root@cnfdb3-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-09-29-170625
Server Version: 4.6.0-0.nightly-2020-09-29-170625
Kubernetes Version: v1.19.0+6ef2098
[root@cnfdb5-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE     VERSION
dhcp19-17-0.clus2.t5g.lab.eng.bos.redhat.com     Ready    master,virtual   128m    v1.19.0+beb741b
dhcp19-17-1.clus2.t5g.lab.eng.bos.redhat.com     Ready    master,virtual   3h23m   v1.19.0+beb741b
dhcp19-17-193.clus2.t5g.lab.eng.bos.redhat.com   Ready    worker           99m     v1.19.0+beb741b
dhcp19-17-2.clus2.t5g.lab.eng.bos.redhat.com     Ready    master,virtual   128m    v1.19.0+beb741b

[root@cnfdb5-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                        HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb5-master-0   OK       externally provisioned   cnfdb5-dqphp-master-0         ipmi://10.19.17.188:6230                      true
openshift-machine-api   cnfdb5-master-1   OK       externally provisioned   cnfdb5-dqphp-master-1         ipmi://10.19.17.188:6231                      true
openshift-machine-api   cnfdb5-master-2   OK       externally provisioned   cnfdb5-dqphp-master-2         ipmi://10.19.17.188:6232                      true
openshift-machine-api   cnfdb5-worker-0   OK       provisioned              cnfdb5-dqphp-worker-0-2cvrr   ipmi://10.19.17.188:6240   unknown            true
openshift-machine-api   cnfdb5-worker-1   OK       provisioned              cnfdb5-dqphp-worker-0-hzqzt   ipmi://10.19.28.24         unknown            true

[root@cnfdb5-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb5-dqphp-master-0         Running                              3h33m
openshift-machine-api   cnfdb5-dqphp-master-1         Running                              3h33m
openshift-machine-api   cnfdb5-dqphp-master-2         Running                              3h33m
openshift-machine-api   cnfdb5-dqphp-worker-0-2cvrr   Running                              119m
openshift-machine-api   cnfdb5-dqphp-worker-0-hzqzt   Provisioned                          119m

[root@cnfdb5-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-09-30-052433
Server Version: 4.6.0-0.nightly-2020-09-30-052433
Kubernetes Version: v1.19.0+beb741b
Please provide a must-gather when it happens again; with no logs it is impossible to investigate the issue.
The issue doesn't happen in our environment; we use redfish, Sabina uses ipmi. I'll re-open the bug and re-assign it to the original QA contact.
Hi, I will provide a must-gather if it happens again; we already recreated those clusters with 4.6.0-0.nightly-2020-09-28-212756.
Moving this to 4.7 unless a consistent reproducer is found. (And must-gather is provided)
Hi, it happened again with 4.6.0-0.nightly-2020-10-03-051134. I got the must-gather.

[root@cnfdb3-installer ~]# oc get node
NAME                                            STATUS   ROLES            AGE   VERSION
dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   8h    v1.19.0+db1fc96
dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   8h    v1.19.0+db1fc96
dhcp19-17-14.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   8h    v1.19.0+db1fc96

[root@cnfdb3-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                        HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb3-master-0   OK       externally provisioned   cnfdb3-pxs2p-master-0         ipmi://10.19.17.221:6230                      true
openshift-machine-api   cnfdb3-master-1   OK       externally provisioned   cnfdb3-pxs2p-master-1         ipmi://10.19.17.221:6231                      true
openshift-machine-api   cnfdb3-master-2   OK       externally provisioned   cnfdb3-pxs2p-master-2         ipmi://10.19.17.221:6232                      true
openshift-machine-api   cnfdb3-worker-0   OK       provisioned              cnfdb3-pxs2p-worker-0-h7dl2   ipmi://10.19.28.20         unknown            true
openshift-machine-api   cnfdb3-worker-1   OK       provisioned              cnfdb3-pxs2p-worker-0-zp9wt   ipmi://10.19.28.21         unknown            true
openshift-machine-api   cnfdb3-worker-2   OK       provisioned              cnfdb3-pxs2p-worker-0-jtqsf   ipmi://10.19.28.22         unknown            true

[root@cnfdb3-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb3-pxs2p-master-0         Running                              8h
openshift-machine-api   cnfdb3-pxs2p-master-1         Running                              8h
openshift-machine-api   cnfdb3-pxs2p-master-2         Running                              8h
openshift-machine-api   cnfdb3-pxs2p-worker-0-h7dl2   Provisioned                          8h
openshift-machine-api   cnfdb3-pxs2p-worker-0-jtqsf   Provisioned                          8h
openshift-machine-api   cnfdb3-pxs2p-worker-0-zp9wt   Provisioned                          8h
must-gather can be downloaded from: https://drive.google.com/file/d/1bNOWwQojhLO_jgM64FV-vK6BNFj9aJva/view?usp=sharing
Hi, this is now a blocking issue for us, as it keeps happening in our CI environment over the last few days with different OCP versions.

must-gather can be downloaded from: https://drive.google.com/file/d/1sEIUOFJS9tdCMsf9NeS-7t8Bgxowrsr1/view?usp=sharing

[root@cnfdb3-installer ~]# oc get node
NAME                                            STATUS   ROLES    AGE   VERSION
dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   87m   v1.19.0+db1fc96
dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   87m   v1.19.0+db1fc96
dhcp19-17-14.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   87m   v1.19.0+db1fc96

[root@cnfdb3-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                       HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb3-master-0   OK       externally provisioned   cnfdb3-jb47d-master-0         ipmi://10.19.17.10:6230                      true
openshift-machine-api   cnfdb3-master-1   OK       externally provisioned   cnfdb3-jb47d-master-1         ipmi://10.19.17.10:6231                      true
openshift-machine-api   cnfdb3-master-2   OK       externally provisioned   cnfdb3-jb47d-master-2         ipmi://10.19.17.10:6232                      true
openshift-machine-api   cnfdb3-worker-0   OK       provisioned              cnfdb3-jb47d-worker-0-c6tc9   ipmi://10.19.28.20        unknown            true
openshift-machine-api   cnfdb3-worker-1   OK       provisioned              cnfdb3-jb47d-worker-0-xt6sg   ipmi://10.19.28.21        unknown            true
openshift-machine-api   cnfdb3-worker-2   OK       provisioned              cnfdb3-jb47d-worker-0-cg82x   ipmi://10.19.28.22        unknown            true

[root@cnfdb3-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb3-jb47d-master-0         Running                              99m
openshift-machine-api   cnfdb3-jb47d-master-1         Running                              99m
openshift-machine-api   cnfdb3-jb47d-master-2         Running                              99m
openshift-machine-api   cnfdb3-jb47d-worker-0-c6tc9   Provisioned                          80m
openshift-machine-api   cnfdb3-jb47d-worker-0-cg82x   Provisioned                          80m
openshift-machine-api   cnfdb3-jb47d-worker-0-xt6sg   Provisioned                          80m

[root@cnfdb3-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-10-05-234751
Server Version: 4.6.0-0.nightly-2020-10-05-234751
Kubernetes Version: v1.19.0+db1fc96
After setting the root device hint to sdb for the workers in the install config, nodes were able to properly join. For the record, here's a sample snippet:

  - name: cnfdb3-worker-0
    role: worker
    bmc:
      address: ipmi://10.0.0.1
      username: eddie
      password: vanhalen
    bootMACAddress: 95:40:c7:26:44:62
    hardwareProfile: unknown
    rootDeviceHints:
      deviceName: "/dev/sdb"

rhcos46 detects the following as sda, causing the issue, which seems to be a different behaviour than with rhcos45:

[   14.561921] scsi 0:0:0:0: Direct-Access Generic- SD/MMC CRW 1.00 PQ: 0 ANSI: 6
[   14.602220] scsi 0:0:0:0: Attached scsi generic sg0 type 0
[   14.629936] sd 0:0:0:0: [sda] Attached SCSI removable disk
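The kernel log above shows why the hint matters: the SD/MMC card reader is enumerated as sda and flagged as a removable disk, so any "first disk" default can land on it. As a rough sketch of the distinction (canned lsblk-style input; the helper is hypothetical, not installer code), filtering on the removable flag picks the intended install disk:

```shell
# Read "NAME RM" rows (RM=1 marks removable media, as `lsblk -dno NAME,RM`
# would print) and emit the first non-removable disk.
first_fixed_disk() {
    while read -r name rm; do
        if [ "$rm" = "0" ]; then
            printf '/dev/%s\n' "$name"
            return 0
        fi
    done
    return 1
}

# Usage against live data would be: lsblk -dno NAME,RM | first_fixed_disk
```

With the log above, sda (removable) would be skipped and sdb chosen, which matches the working rootDeviceHints configuration.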
This looked to be an environmental difference between 4.5 and 4.6. A workaround has been identified. Closing.