Bug 1879156
| Summary: | [baremetal] Worker nodes are not joining the cluster | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sabina Aledort <saledort> |
| Component: | Machine Config Operator | Assignee: | Brad P. Crochet <brad> |
| Status: | CLOSED NOTABUG | QA Contact: | Johnny Liu <jialiu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | amurdaca, brad, dblack, jialiu, jima, kboumedh, kgarriso, lshilin, mcornea, miabbott, mkrejci, omichael, shardy, yjoseph, ykashtan |
| Target Milestone: | --- | Keywords: | Regression, Triaged |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-26 15:50:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sabina Aledort
2020-09-15 14:40:48 UTC
I think this is a regression caused by https://github.com/openshift/machine-config-operator/pull/2062 - we copy a resolv.conf early which doesn't yet contain any actual DNS entries, so the podman run then fails. It seems to be timing-dependent, so it doesn't reproduce in all environments, but we probably need a better test than file-exists to decide whether the existing file should be overwritten (a sketch of such a check is included further below). I'm also not completely clear on the motivation for that change from an OpenStack perspective; if it's not actually needed for baremetal, we could perhaps consider a partial revert to restore the previous behavior for baremetal.

*** Bug 1879499 has been marked as a duplicate of this bug. ***

@Jainlin would your team be able to help verify this BZ?

Recently we also hit a similar issue on the vSphere platform - https://bugzilla.redhat.com/show_bug.cgi?id=1879322. But our team did not test IPI on baremetal, so I will ask the edge QE team for help with this verification.

*** Bug 1879322 has been marked as a duplicate of this bug. ***

Verified on 4.6.0-0.nightly-2020-09-21-030155

@Johnny Liu: It works on IPI BM. If it doesn't work on vSphere, please open a separate BZ.

*** Bug 1874869 has been marked as a duplicate of this bug. ***

Happened again in 4.6.0-0.nightly-2020-09-22-011738

```
[root@cnfdb4-installer ~]# oc get node
NAME                                            STATUS   ROLES            AGE     VERSION
dhcp19-17-23.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h4m    v1.19.0+7e8389f
dhcp19-17-24.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h16m   v1.19.0+7e8389f
dhcp19-17-7.clus2.t5g.lab.eng.bos.redhat.com    Ready    master,virtual   3h16m   v1.19.0+7e8389f

[root@cnfdb4-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                        HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb4-master-0   OK       externally provisioned   cnfdb4-sx7tj-master-0         ipmi://10.19.17.137:6230                      true
openshift-machine-api   cnfdb4-master-1   OK       externally provisioned   cnfdb4-sx7tj-master-1         ipmi://10.19.17.137:6231                      true
openshift-machine-api   cnfdb4-master-2   OK       externally provisioned   cnfdb4-sx7tj-master-2         ipmi://10.19.17.137:6232                      true
openshift-machine-api   cnfdb4-worker-0   OK       provisioned              cnfdb4-sx7tj-worker-0-cdscz   ipmi://10.19.17.137:6240   unknown            true
openshift-machine-api   cnfdb4-worker-1   OK       provisioned              cnfdb4-sx7tj-worker-0-brhw9   ipmi://10.19.28.23         unknown            true

[root@cnfdb4-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb4-sx7tj-master-0         Running                              3h27m
openshift-machine-api   cnfdb4-sx7tj-master-1         Running                              3h27m
openshift-machine-api   cnfdb4-sx7tj-master-2         Running                              3h27m
openshift-machine-api   cnfdb4-sx7tj-worker-0-brhw9   Provisioned                          174m
openshift-machine-api   cnfdb4-sx7tj-worker-0-cdscz   Provisioned                          174m

[root@cnfdb4-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-09-22-011738
Server Version: 4.6.0-0.nightly-2020-09-22-011738
Kubernetes Version: v1.19.0+f5121a6
```

Is it an IPI BM environment?
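Regarding the file-exists check discussed in the description: a minimal sketch of a nameserver-aware guard, assuming an early-boot script copies /etc/resolv.conf to a staging path before the podman run. The destination path, script, and function names here are illustrative assumptions, not the actual MCO code.

```bash
#!/bin/bash
# Hypothetical guard: instead of "skip if the file already exists", only keep
# an existing resolv.conf copy when it actually contains a nameserver entry.
# RESOLV_COPY is an assumed staging path, not the real MCO location.
RESOLV_SRC="/etc/resolv.conf"
RESOLV_COPY="/run/resolv.conf.mco-copy"

has_nameserver() {
    # true only if the file is non-empty and has at least one real nameserver line
    [ -s "$1" ] && grep -Eq '^[[:space:]]*nameserver[[:space:]]+[^[:space:]]+' "$1"
}

if has_nameserver "$RESOLV_COPY"; then
    echo "existing copy already has DNS entries; leaving it in place"
else
    echo "copy is missing or has no nameservers; refreshing from $RESOLV_SRC"
    cp "$RESOLV_SRC" "$RESOLV_COPY"
fi
```

The point is simply that the decision to overwrite keys off the content (presence of nameservers) rather than the mere existence of the file, which is what makes the early, empty copy harmless.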
We verified for redfish:

```
[kni@ocp-edge06 ~]$ oc version
Client Version: 4.6.0-0.nightly-2020-09-22-130743
Server Version: 4.6.0-0.nightly-2020-09-22-130743
Kubernetes Version: v1.19.0+f5121a6

[kni@ocp-edge06 ~]$ oc get nodes
NAME                 STATUS   ROLES    AGE   VERSION
openshift-master-0   Ready    master   11h   v1.19.0+7e8389f
openshift-master-1   Ready    master   11h   v1.19.0+7e8389f
openshift-master-2   Ready    master   11h   v1.19.0+7e8389f
openshift-worker-0   Ready    worker   11h   v1.19.0+7e8389f
openshift-worker-1   Ready    worker   11h   v1.19.0+7e8389f

[kni@ocp-edge06 ~]$ oc get bmh -A
NAMESPACE               NAME                 STATUS   PROVISIONING STATUS      CONSUMER                        BMC                                          HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0   OK       externally provisioned   ocp-edge-k26sz-master-0         redfish://10.46.2.220/redfish/v1/Systems/1                      true
openshift-machine-api   openshift-master-1   OK       externally provisioned   ocp-edge-k26sz-master-1         redfish://10.46.2.221/redfish/v1/Systems/1                      true
openshift-machine-api   openshift-master-2   OK       externally provisioned   ocp-edge-k26sz-master-2         redfish://10.46.2.222/redfish/v1/Systems/1                      true
openshift-machine-api   openshift-worker-0   OK       provisioned              ocp-edge-k26sz-worker-0-d4dqz   redfish://10.46.2.223/redfish/v1/Systems/1   unknown            true
openshift-machine-api   openshift-worker-1   OK       provisioned              ocp-edge-k26sz-worker-0-hf6cf   redfish://10.46.2.224/redfish/v1/Systems/1   unknown            true
```

Yes, it is an IPI BM environment. The problem is inconsistent; today I was able to deploy the same environment with no issue with 4.6.0-0.nightly-2020-09-23-022756.

(In reply to Sabina Aledort from comment #13)
> Yes, it is IPI BM environment.
> The problem is inconsistent, today i was able to deploy the same environment
> with no issue with 4.6.0-0.nightly-2020-09-23-022756.

In that case I moved it back to verified. We don't see this problem. Feel free to re-open it if it happens again.

(In reply to Lubov from comment #14)
> In that case I moved it back to verified
> We don't see this problem
>
> Feel free to re-open it if it happens again

As far as I understand, the fix is already included since 4.6.0-0.nightly-2020-09-21-081745 (https://github.com/openshift/machine-config-operator/pull/2094), but this issue happened again in a later version, 4.6.0-0.nightly-2020-09-22-130743.

Hi, this happened again today in two different clusters with 4.6.0-0.nightly-2020-09-29-170625 and 4.6.0-0.nightly-2020-09-30-052433; redeploying with 4.6.0-0.nightly-2020-09-28-212756 worked.
```
[root@cnfdb3-installer ~]# oc get node
NAME                                            STATUS   ROLES            AGE     VERSION
dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h13m   v1.19.0+bafba66
dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h37m   v1.19.0+bafba66
dhcp19-17-14.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   3h37m   v1.19.0+bafba66

[root@cnfdb3-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                        HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb3-master-0   OK       externally provisioned   cnfdb3-bdnbp-master-0         ipmi://10.19.17.183:6230                      true
openshift-machine-api   cnfdb3-master-1   OK       externally provisioned   cnfdb3-bdnbp-master-1         ipmi://10.19.17.183:6231                      true
openshift-machine-api   cnfdb3-master-2   OK       externally provisioned   cnfdb3-bdnbp-master-2         ipmi://10.19.17.183:6232                      true
openshift-machine-api   cnfdb3-worker-0   OK       provisioned              cnfdb3-bdnbp-worker-0-g4gn4   ipmi://10.19.28.20         unknown            true
openshift-machine-api   cnfdb3-worker-1   OK       provisioned              cnfdb3-bdnbp-worker-0-h8fdh   ipmi://10.19.28.21         unknown            true
openshift-machine-api   cnfdb3-worker-2   OK       provisioned              cnfdb3-bdnbp-worker-0-5t78f   ipmi://10.19.28.22         unknown            true

[root@cnfdb3-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb3-bdnbp-master-0         Running                              3h48m
openshift-machine-api   cnfdb3-bdnbp-master-1         Running                              3h48m
openshift-machine-api   cnfdb3-bdnbp-master-2         Running                              3h48m
openshift-machine-api   cnfdb3-bdnbp-worker-0-5t78f   Provisioned                          3h5m
openshift-machine-api   cnfdb3-bdnbp-worker-0-g4gn4   Provisioned                          3h5m
openshift-machine-api   cnfdb3-bdnbp-worker-0-h8fdh   Provisioned                          3h5m

[root@cnfdb3-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-09-29-170625
Server Version: 4.6.0-0.nightly-2020-09-29-170625
Kubernetes Version: v1.19.0+6ef2098
```

```
[root@cnfdb5-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE     VERSION
dhcp19-17-0.clus2.t5g.lab.eng.bos.redhat.com     Ready    master,virtual   128m    v1.19.0+beb741b
dhcp19-17-1.clus2.t5g.lab.eng.bos.redhat.com     Ready    master,virtual   3h23m   v1.19.0+beb741b
dhcp19-17-193.clus2.t5g.lab.eng.bos.redhat.com   Ready    worker           99m     v1.19.0+beb741b
dhcp19-17-2.clus2.t5g.lab.eng.bos.redhat.com     Ready    master,virtual   128m    v1.19.0+beb741b

[root@cnfdb5-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                        HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb5-master-0   OK       externally provisioned   cnfdb5-dqphp-master-0         ipmi://10.19.17.188:6230                      true
openshift-machine-api   cnfdb5-master-1   OK       externally provisioned   cnfdb5-dqphp-master-1         ipmi://10.19.17.188:6231                      true
openshift-machine-api   cnfdb5-master-2   OK       externally provisioned   cnfdb5-dqphp-master-2         ipmi://10.19.17.188:6232                      true
openshift-machine-api   cnfdb5-worker-0   OK       provisioned              cnfdb5-dqphp-worker-0-2cvrr   ipmi://10.19.17.188:6240   unknown            true
openshift-machine-api   cnfdb5-worker-1   OK       provisioned              cnfdb5-dqphp-worker-0-hzqzt   ipmi://10.19.28.24         unknown            true

[root@cnfdb5-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb5-dqphp-master-0         Running                              3h33m
openshift-machine-api   cnfdb5-dqphp-master-1         Running                              3h33m
openshift-machine-api   cnfdb5-dqphp-master-2         Running                              3h33m
openshift-machine-api   cnfdb5-dqphp-worker-0-2cvrr   Running                              119m
openshift-machine-api   cnfdb5-dqphp-worker-0-hzqzt   Provisioned                          119m

[root@cnfdb5-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-09-30-052433
Server Version: 4.6.0-0.nightly-2020-09-30-052433
Kubernetes Version: v1.19.0+beb741b
```

Please, provide must-gather when it happens again.
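For reference, a must-gather archive is typically collected against the affected cluster with something like the following; the destination directory and archive names are just examples:

```bash
# run from a host with a working kubeconfig for the affected cluster
oc adm must-gather --dest-dir=./must-gather-cnfdb
tar czf must-gather-cnfdb.tar.gz ./must-gather-cnfdb
```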
With no logs it is impossible to investigate the issue. The issue doesn't happen in our environment; we use redfish, Sabina uses ipmi. I'll re-open the bug and re-assign to the original QA contact.

Hi, I will provide must-gather if it happens again, as we already recreated those clusters with 4.6.0-0.nightly-2020-09-28-212756.

Moving this to 4.7 unless a consistent reproducer is found (and must-gather is provided).

Hi, it happened again with 4.6.0-0.nightly-2020-10-03-051134. I got the must-gather.

```
[root@cnfdb3-installer ~]# oc get node
NAME                                            STATUS   ROLES            AGE   VERSION
dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   8h    v1.19.0+db1fc96
dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   8h    v1.19.0+db1fc96
dhcp19-17-14.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   8h    v1.19.0+db1fc96

[root@cnfdb3-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                        HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb3-master-0   OK       externally provisioned   cnfdb3-pxs2p-master-0         ipmi://10.19.17.221:6230                      true
openshift-machine-api   cnfdb3-master-1   OK       externally provisioned   cnfdb3-pxs2p-master-1         ipmi://10.19.17.221:6231                      true
openshift-machine-api   cnfdb3-master-2   OK       externally provisioned   cnfdb3-pxs2p-master-2         ipmi://10.19.17.221:6232                      true
openshift-machine-api   cnfdb3-worker-0   OK       provisioned              cnfdb3-pxs2p-worker-0-h7dl2   ipmi://10.19.28.20         unknown            true
openshift-machine-api   cnfdb3-worker-1   OK       provisioned              cnfdb3-pxs2p-worker-0-zp9wt   ipmi://10.19.28.21         unknown            true
openshift-machine-api   cnfdb3-worker-2   OK       provisioned              cnfdb3-pxs2p-worker-0-jtqsf   ipmi://10.19.28.22         unknown            true

[root@cnfdb3-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb3-pxs2p-master-0         Running                              8h
openshift-machine-api   cnfdb3-pxs2p-master-1         Running                              8h
openshift-machine-api   cnfdb3-pxs2p-master-2         Running                              8h
openshift-machine-api   cnfdb3-pxs2p-worker-0-h7dl2   Provisioned                          8h
openshift-machine-api   cnfdb3-pxs2p-worker-0-jtqsf   Provisioned                          8h
openshift-machine-api   cnfdb3-pxs2p-worker-0-zp9wt   Provisioned                          8h
```

must-gather can be downloaded from: https://drive.google.com/file/d/1bNOWwQojhLO_jgM64FV-vK6BNFj9aJva/view?usp=sharing

Hi, this is now a blocking issue for us as it keeps happening in our CI environment in the last days, with different OCP versions.
must-gather can be downloaded from: https://drive.google.com/file/d/1sEIUOFJS9tdCMsf9NeS-7t8Bgxowrsr1/view?usp=sharing

```
[root@cnfdb3-installer ~]# oc get node
NAME                                            STATUS   ROLES    AGE   VERSION
dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   87m   v1.19.0+db1fc96
dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   87m   v1.19.0+db1fc96
dhcp19-17-14.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   87m   v1.19.0+db1fc96

[root@cnfdb3-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATUS   PROVISIONING STATUS      CONSUMER                      BMC                       HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdb3-master-0   OK       externally provisioned   cnfdb3-jb47d-master-0         ipmi://10.19.17.10:6230                      true
openshift-machine-api   cnfdb3-master-1   OK       externally provisioned   cnfdb3-jb47d-master-1         ipmi://10.19.17.10:6231                      true
openshift-machine-api   cnfdb3-master-2   OK       externally provisioned   cnfdb3-jb47d-master-2         ipmi://10.19.17.10:6232                      true
openshift-machine-api   cnfdb3-worker-0   OK       provisioned              cnfdb3-jb47d-worker-0-c6tc9   ipmi://10.19.28.20        unknown            true
openshift-machine-api   cnfdb3-worker-1   OK       provisioned              cnfdb3-jb47d-worker-0-xt6sg   ipmi://10.19.28.21        unknown            true
openshift-machine-api   cnfdb3-worker-2   OK       provisioned              cnfdb3-jb47d-worker-0-cg82x   ipmi://10.19.28.22        unknown            true

[root@cnfdb3-installer ~]# oc get machine -A
NAMESPACE               NAME                          PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   cnfdb3-jb47d-master-0         Running                              99m
openshift-machine-api   cnfdb3-jb47d-master-1         Running                              99m
openshift-machine-api   cnfdb3-jb47d-master-2         Running                              99m
openshift-machine-api   cnfdb3-jb47d-worker-0-c6tc9   Provisioned                          80m
openshift-machine-api   cnfdb3-jb47d-worker-0-cg82x   Provisioned                          80m
openshift-machine-api   cnfdb3-jb47d-worker-0-xt6sg   Provisioned                          80m

[root@cnfdb3-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-10-05-234751
Server Version: 4.6.0-0.nightly-2020-10-05-234751
Kubernetes Version: v1.19.0+db1fc96
```

After setting the root device hint to sdb for the workers in the install config, the nodes were able to join properly. For the record, here's a sample snippet:

```yaml
- name: cnfdb3-worker-0
  role: worker
  bmc:
    address: ipmi://10.0.0.1
    username: eddie
    password: vanhalen
  bootMACAddress: 95:40:c7:26:44:62
  hardwareProfile: unknown
  rootDeviceHints:
    deviceName: "/dev/sdb"
```

rhcos46 detects the following as sda, causing the issue, which seems to be a different behaviour than with rhcos45:

```
[   14.561921] scsi 0:0:0:0: Direct-Access     Generic- SD/MMC CRW       1.00 PQ: 0 ANSI: 6
[   14.602220] scsi 0:0:0:0: Attached scsi generic sg0 type 0
[   14.629936] sd 0:0:0:0: [sda] Attached SCSI removable disk
```

This looked to be an environmental difference between 4.5 and 4.6. Workaround has been identified. Closing.
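As a side note, when choosing a rootDeviceHints value it can help to confirm which disk RHCOS enumerated as sda on the affected worker. The commands below are a generic diagnostic sketch, not taken from this bug's logs:

```bash
# on the worker host (via the BMC console or ssh core@<worker>):
# list block devices with transport/model so removable SD/MMC readers stand out
lsblk -o NAME,SIZE,TYPE,TRAN,MODEL,RM

# see what the kernel attached as sda, sdb, ...
dmesg | grep -E 'sd[a-z]\] Attached'
```

The device that should actually hold the OS then goes into install-config.yaml under the worker's rootDeviceHints, as in the snippet above.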