Version:
./openshift-baremetal-install 4.8.0-0.nightly-2021-06-19-005119
built from commit a5ddd2dd6c72d8a5ea0a5f17acd8b964b6a3d1be
release image registry.ci.openshift.org/ocp/release@sha256:e61b7db574443b47d33f9ca429c7aea726958eebe0de8a2e876004948e0e4d93

Platform: IPI
Reproduced the same problem on both real BM and a virtual simulation.

What happened?
Running the procedure https://docs.openshift.com/container-platform/4.7/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member on the master that holds the metal3 and cluster-baremetal-operator pods:

1. Deletion of the machine object is stuck (see the finalizer check appended below):

$ oc get machine -A; oc get nodes
NAMESPACE               NAME                             PHASE      TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge3-qjkpv-master-0         Running                           134m
openshift-machine-api   ocp-edge3-qjkpv-master-1         Running                           134m
openshift-machine-api   ocp-edge3-qjkpv-master-2         Deleting                          134m
openshift-machine-api   ocp-edge3-qjkpv-worker-0-wk5m2   Running                           107m
openshift-machine-api   ocp-edge3-qjkpv-worker-0-x2hnx   Running                           107m
NAME                 STATUS                        ROLES    AGE    VERSION
openshift-master-0   Ready                         master   111m   v1.21.0-rc.0+120883f
openshift-master-1   Ready                         master   111m   v1.21.0-rc.0+120883f
openshift-master-2   NotReady,SchedulingDisabled   master   111m   v1.21.0-rc.0+120883f
openshift-worker-0   Ready                         worker   75m    v1.21.0-rc.0+120883f
openshift-worker-1   Ready                         worker   73m    v1.21.0-rc.0+120883f

2. An error is reported on the bmh object:

Normal  ProvisionedRegistrationError  21m  metal3-baremetal-controller  Host adoption failed: Error while attempting to adopt node 44880afd-fcec-454c-a13a-e5778233f0ca: Cannot validate image information for node 44880afd-fcec-454c-a13a-e5778233f0ca because one or more parameters are missing from its instance_info and insufficent information is present to boot from a remote volume. Missing are: ['image_source', 'kernel', 'ramdisk'].

3. The cbo and metal3 pods on the master being deleted are stuck in Terminating, while their replacement pods have already been created (see the force-delete note appended below):

NAME                                          READY   STATUS        RESTARTS   AGE    IP            NODE                 NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-7589bfb57-p4tll   2/2     Running       0          130m   10.128.0.6    openshift-master-1   <none>           <none>
cluster-baremetal-operator-58976889b8-97p2c   2/2     Terminating   2          130m   10.130.0.6    openshift-master-2   <none>           <none>
cluster-baremetal-operator-58976889b8-hwcjp   2/2     Running       0          21m    10.128.0.60   openshift-master-1   <none>           <none>
machine-api-controllers-d7b79b59-8m4ll        7/7     Running       0          104m   10.128.0.20   openshift-master-1   <none>           <none>
machine-api-operator-55c89fbbfb-n6j4k         2/2     Running       0          130m   10.129.0.6    openshift-master-0   <none>           <none>
metal3-7c98d5b44c-dhgrb                       10/10   Running       0          21m    10.46.59.21   openshift-master-1   <none>           <none>
metal3-7c98d5b44c-k6nbx                       10/10   Terminating   0          100m   10.46.59.22   openshift-master-2   <none>           <none>
metal3-image-cache-clpbv                      1/1     Running       0          100m   10.46.59.20   openshift-master-0   <none>           <none>
metal3-image-cache-cqwj8                      1/1     Running       0          100m   10.46.59.21   openshift-master-1   <none>           <none>
metal3-image-cache-hq9rv                      1/1     Running       0          100m   10.46.59.22   openshift-master-2   <none>           <none>

What did you expect to happen?
The node is drained successfully and the machine is deleted.

How to reproduce it (as minimally and precisely as possible)?
1. Check which master runs the cbo and metal3 pods.
2. Set online to false on the bmh dedicated to that master to simulate the master becoming NotReady (see the patch sketch appended below).
3. Run all steps from the procedure.

Anything else we need to know?
Adding must-gather.
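For reference, roughly how reproduce steps 1-2 were done; this is a sketch, and the bmh name openshift-master-2 is simply the master that happened to host the pods in this cluster:

# Find which master currently hosts the cbo and metal3 pods:
$ oc get pods -n openshift-machine-api -o wide | grep -E 'cluster-baremetal-operator|^metal3-'

# Take the BareMetalHost backing that master offline so the node eventually goes NotReady:
$ oc patch bmh openshift-master-2 -n openshift-machine-api --type merge -p '{"spec":{"online":false}}'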
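While the machine deletion was hanging (step 1 above), these commands help show what is blocking it; names are taken from the outputs above:

# The machine stays in Deleting as long as its finalizer is still present:
$ oc get machine ocp-edge3-qjkpv-master-2 -n openshift-machine-api -o jsonpath='{.metadata.finalizers}{"\n"}'

# The adoption failure from step 2 shows up in the BareMetalHost events:
$ oc describe bmh openshift-master-2 -n openshift-machine-api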
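The Terminating cbo/metal3 pods in step 3 never finish on their own because their node is NotReady; if they have to be cleared manually while working through the replacement procedure, the standard force delete can be used (with care, as it only removes the API objects):

$ oc delete pod cluster-baremetal-operator-58976889b8-97p2c metal3-7c98d5b44c-k6nbx -n openshift-machine-api --force --grace-period=0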
Created attachment 1792512: must-gather errors
This looks like bug 1972572. We should retest with that fix once it's backported.
Actually, I missed that it has already been backported as bug 1973018, and the 4.8 fix has now been verified, so this can be retested now.
Verified on 4.8.0-0.nightly-2021-06-25-182927 - passed
*** This bug has been marked as a duplicate of bug 1973018 ***