Description of problem:
Remediation on a worker does not complete:
- the machine does not return to "Provisioned as node"; it stays "Provisioned"
- remediation.metal3.io/powered-off-for-remediation is not removed
- the machine is not associated with a Node
- the machine remains unhealthy

Version-Release number of selected component (if applicable):
4.6.0-0.ci-2020-07-21-114552

How reproducible:
100%

Steps to Reproduce:
1. Install OCP 4.6
2. Create a MachineHealthCheck (like below)

    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: workers
      namespace: openshift-machine-api
      annotations:
        machine.openshift.io/remediation-strategy: external-baremetal
    spec:
      maxUnhealthy: 1
      selector:
        matchLabels:
          machine.openshift.io/cluster-api-machine-role: worker
      unhealthyConditions:
      - type: Ready
        status: Unknown
        timeout: 60s

3. Create an Unknown Ready condition by suspending the virtual node

    virsh suspend worker-0-0

Actual results:

[root@sealusa6 ~]# oc get mhc -n openshift-machine-api -w
NAME      MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
workers   1              4                  3

oc describe machine ocp-edge-cluster-rdu1-0-worker-0-tm62r -n openshift-machine-api |less
Name:         ocp-edge-cluster-rdu1-0-worker-0-tm62r
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=ocp-edge-cluster-rdu1-0
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
              machine.openshift.io/cluster-api-machineset=ocp-edge-cluster-rdu1-0-worker-0
Annotations:  host.metal3.io/external-remediation:
              metal3.io/BareMetalHost: openshift-machine-api/openshift-worker-0-0
              remediation.metal3.io/powered-off-for-remediation:
API Version:  machine.openshift.io/v1beta1

[root@sealusa6 ~]# oc get machines -n openshift-machine-api
NAME                                     PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-rdu1-0-master-0         Running                              20h
ocp-edge-cluster-rdu1-0-master-1         Running                              20h
ocp-edge-cluster-rdu1-0-master-2         Running                              20h
ocp-edge-cluster-rdu1-0-worker-0-44ldb   Running                              20h
ocp-edge-cluster-rdu1-0-worker-0-qfc2t   Running                              20h
ocp-edge-cluster-rdu1-0-worker-0-tm62r   Provisioned                          20h
ocp-edge-cluster-rdu1-0-worker-0-z4ptg   Running                              20h

Expected results:
Full remediation and healthy node restored:
1. Power off the host
2. Add poweredOffForRemediation annotation to the unhealthy Machine
3. Delete the node
4. Power on the host
5. Wait for the node to come up (by waiting for the node to be registered in the cluster)
6. Remove poweredOffForRemediation annotation and the MAO's machine unhealthy annotation

Events:

LAST SEEN   TYPE     REASON                    OBJECT                                           MESSAGE
28m         Normal   DetectedUnhealthy         machine/ocp-edge-cluster-rdu1-0-worker-0-tm62r   Machine openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/worker-0-0 has unhealthy node worker-0-0
27m         Normal   ExternalAnnotationAdded   machine/ocp-edge-cluster-rdu1-0-worker-0-tm62r   Requesting external remediation of node associated with machine openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/worker-0-0
3m2s        Normal   DetectedUnhealthy         machine/ocp-edge-cluster-rdu1-0-worker-0-tm62r   Machine openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/ has unhealthy node

MachineHealthCheck controller log:

I0730 16:10:25.824985 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/workers
I0730 16:10:25.825167 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/workers: finding targets
I0730 16:10:25.825453 1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/: health checking
I0730 16:10:25.825559 1 machinehealthcheck_controller.go:292] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-tm62r/: is likely to go unhealthy in 10m0.174451529s
I0730 16:10:25.825656 1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-44ldb/worker-0-2: health checking
I0730 16:10:25.825741 1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-z4ptg/worker-0-3: health checking
I0730 16:10:25.825800 1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu1-0-worker-0-qfc2t/worker-0-1: health checking
I0730 16:10:25.833595 1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 4, maxUnhealthy: 1, unhealthy: 1. Remediations are allowed
I0730 16:10:25.833752 1 machinehealthcheck_controller.go:229] Reconciling openshift-machine-api/workers: some targets might go unhealthy. Ensuring a requeue happens in 10m0.174451529s

Additional info:
Manually removing the poweroff remediation annotation from the machine did not resolve the problem.

I was unable to pull must-gather while filing the bug, but the issue is 100% reproducible.

[root@sealusa6 ~]# oc adm must-gather
[must-gather ] OUT Using must-gather plugin-in image: registry.svc.ci.openshift.org/ocp/4.6-2020-07-21-114552@sha256:ae1ab459d16df6d104b05b5bc9d751d7dbf92a99d6c88238083fe3f6a2ee7c19
[must-gather ] OUT namespace/openshift-must-gather-6qpf4 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-xnppw created
[must-gather ] OUT pod for plug-in image registry.svc.ci.openshift.org/ocp/4.6-2020-07-21-114552@sha256:ae1ab459d16df6d104b05b5bc9d751d7dbf92a99d6c88238083fe3f6a2ee7c19 created
[must-gather-pqtg5] OUT gather did not start: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = Error reading manifest sha256:ae1ab459d16df6d104b05b5bc9d751d7dbf92a99d6c88238083fe3f6a2ee7c19 in registry.svc.ci.openshift.org/ocp/4.6-2020-07-21-114552: manifest unknown: manifest unknown
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-xnppw deleted
[must-gather ] OUT namespace/openshift-must-gather-6qpf4 deleted
error: gather did not start for pod must-gather-pqtg5: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = Error reading manifest sha256:ae1ab459d16df6d104b05b5bc9d751d7dbf92a99d6c88238083fe3f6a2ee7c19 in registry.svc.ci.openshift.org/ocp/4.6-2020-07-21-114552: manifest unknown: manifest unknown
oc get machine -n openshift-machine-api
NAME                                     PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-rdu1-0-master-0         Running                              21h
ocp-edge-cluster-rdu1-0-master-1         Running                              21h
ocp-edge-cluster-rdu1-0-master-2         Running                              21h
ocp-edge-cluster-rdu1-0-worker-0-44ldb   Running                              21h
ocp-edge-cluster-rdu1-0-worker-0-qfc2t   Running                              21h
ocp-edge-cluster-rdu1-0-worker-0-tm62r   Provisioned                          21h
ocp-edge-cluster-rdu1-0-worker-0-z4ptg   Running                              21h

oc get no -owide
NAME         STATUS   ROLES    AGE   VERSION           INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION          CONTAINER-RUNTIME
master-0-0   Ready    master   21h   v1.18.3+a34fde4   192.168.123.129   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
master-0-1   Ready    master   21h   v1.18.3+a34fde4   192.168.123.142   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
master-0-2   Ready    master   21h   v1.18.3+a34fde4   192.168.123.127   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
worker-0-0   Ready    worker   21h   v1.18.3+a34fde4   192.168.123.114   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
worker-0-1   Ready    worker   21h   v1.18.3+a34fde4   192.168.123.137   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
worker-0-2   Ready    worker   21h   v1.18.3+a34fde4   192.168.123.138   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev
worker-0-3   Ready    worker   21h   v1.18.3+a34fde4   192.168.123.135   <none>        Red Hat Enterprise Linux CoreOS 46.82.202007171740-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-33.rhaos4.6.git8adc682.el8-dev

oc get bmh -n openshift-machine-api
NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                 BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-rdu1-0-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/2c84feb5-8bc8-4fac-81cc-c835def78f31                      true
openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-rdu1-0-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/dc7385b7-ba20-4528-803b-3ea4fbc97412                      true
openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-rdu1-0-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/dca26a1c-8405-4970-bf53-85eb7707e341                      true
openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-rdu1-0-worker-0-tm62r   redfish://192.168.123.1:8000/redfish/v1/Systems/0697c5e3-6bf0-4c2f-8cc3-4b5ea9abccba   unknown            true
openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-rdu1-0-worker-0-qfc2t   redfish://192.168.123.1:8000/redfish/v1/Systems/4029924f-2aad-4630-bf68-2046e84c4879   unknown            true
openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-rdu1-0-worker-0-44ldb   redfish://192.168.123.1:8000/redfish/v1/Systems/276cb58c-649c-4552-8f0e-c6183f98ce26   unknown            true
openshift-worker-0-3   OK       provisioned              ocp-edge-cluster-rdu1-0-worker-0-z4ptg   redfish://192.168.123.1:8000/redfish/v1/Systems/ca12cfa2-0b5f-42c2-9ba6-1555cf58eab3   unknown            true

oc get po -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-8464b98f6b-n6p2k   2/2     Running   0          21h
machine-api-controllers-644db4b6c5-8nxng       7/7     Running   0          21h
machine-api-operator-67488f57b-q5hlq           2/2     Running   0          21h
metal3-565d56c698-wwjhz                        8/8     Running   0          21h
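
The stuck state reported here (phase "Provisioned", powered-off annotation still present, no associated node) could be checked mechanically when re-testing. The following is only a minimal sketch, not part of any OpenShift tooling: the function names are hypothetical, and it assumes the Machine JSON printed by `oc get machine <name> -n openshift-machine-api -o json` exposes `.status.phase`, `.status.nodeRef.name` and `.metadata.annotations`.

```python
# Annotation keys as they appear in the `oc describe machine` output above.
POWERED_OFF_ANNOTATION = "remediation.metal3.io/powered-off-for-remediation"
EXTERNAL_REMEDIATION_ANNOTATION = "host.metal3.io/external-remediation"


def remediation_status(machine: dict) -> dict:
    """Summarize the fields relevant to external remediation.

    `machine` is the parsed JSON of one Machine object. The field layout
    (status.phase, status.nodeRef.name) is an assumption of this sketch.
    """
    metadata = machine.get("metadata") or {}
    status = machine.get("status") or {}
    annotations = metadata.get("annotations") or {}
    node_ref = status.get("nodeRef") or {}
    return {
        "name": metadata.get("name"),
        "phase": status.get("phase"),
        "node": node_ref.get("name"),  # None when no node is associated
        "powered_off_annotation": POWERED_OFF_ANNOTATION in annotations,
        "external_remediation_annotation": EXTERNAL_REMEDIATION_ANNOTATION in annotations,
    }


def is_stuck(machine: dict) -> bool:
    """True when the machine matches the symptom described in this bug:
    phase "Provisioned", no node reference, powered-off annotation present."""
    summary = remediation_status(machine)
    return bool(
        summary["phase"] == "Provisioned"
        and summary["node"] is None
        and summary["powered_off_annotation"]
    )
```

To use it against a live cluster, the output of the `oc get machine ... -o json` command can be parsed with `json.load(sys.stdin)` and passed to `is_stuck`.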
@mlammon - can you try to reproduce again with an up-to-date build? I believe this was resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1866719
No longer seeing the issue. I assume it was fixed along with the other bug: https://bugzilla.redhat.com/show_bug.cgi?id=1866719