Description of problem:
With the machine health check remediation strategy 'external-baremetal', the remediation logic is not working.

Version: 4.6.0-0.nightly-2020-07-07-083718

How reproducible: 100%

Steps to Reproduce:
1. Deploy OCP 4.6 (3 master, 2 worker)
2. Create machine health check -> mHc_Ready.yaml for worker nodes

cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: workers
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 60s
EOF

3. Create failure using virsh suspend (VIRTUAL ENVIRONMENT)

virsh suspend worker-0-0

Actual results:
We see the node switch to Not Ready and

Expected results:
We expect the remediation flow as described in the doc:
https://github.com/openshift/cluster-api-provider-baremetal/blob/master/docs/remediation.md

Additional info:

[root@sealusa9 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-07-083718

[root@sealusa9 ~]# oc get nodes
NAME         STATUS     ROLES    AGE   VERSION
master-0-0   Ready      master   22h   v1.18.3+1a1d81c
master-0-1   Ready      master   22h   v1.18.3+1a1d81c
master-0-2   Ready      master   22h   v1.18.3+1a1d81c
worker-0-0   NotReady   worker   22h   v1.18.3+1a1d81c   <------
worker-0-1   Ready      worker   22h   v1.18.3+1a1d81c

[root@sealusa9 ~]# oc get mhc -n openshift-machine-api -w
NAME      MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
workers                  2                  1

oc -n openshift-machine-api logs $(oc -n openshift-machine-api get pods | awk '/controllers/{ print$1 }') -c machine-healthcheck-controller
I0708 18:36:54.487304 1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/workers: finding targets
I0708 18:36:54.487476 1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-5s7qr/worker-0-1: health checking
I0708 18:36:54.487495 1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: health checking
I0708 18:36:54.487507 1 machinehealthcheck_controller.go:601] openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: unhealthy: condition Ready in state Unknown longer than 60s
I0708 18:36:54.513166 1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2, maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
I0708 18:36:54.513228 1 machinehealthcheck_controller.go:214] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: meet unhealthy criteria, triggers remediation
I0708 18:36:54.513236 1 machinehealthcheck_controller.go:438] openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: start remediation logic
I0708 18:36:54.513245 1 machinehealthcheck_controller.go:233] Reconciling openshift-machine-api/workers: no more targets meet unhealthy criteria
I0708 18:31:54.127115 1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2, maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed

[root@sealusa9 ~]# oc describe node worker-0-0 | grep -A1 Taints
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
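
When re-running the reproduction steps, it can help to pull out just the affected nodes instead of reading the whole `oc get nodes` table. The following is only a minimal sketch: the helper name `not_ready_nodes` is hypothetical (it is not part of `oc`), and it assumes the default column layout shown in the output above, with NAME in the first column and STATUS in the second.

```shell
# Hypothetical helper (not part of oc): reads `oc get nodes` output on stdin
# and prints the name of every node whose STATUS column is not exactly "Ready".
# Assumes the default table layout: header row first, NAME in column 1,
# STATUS in column 2. A combined status such as "Ready,SchedulingDisabled"
# is also reported, because it is not an exact match for "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage against a live cluster (needs a logged-in oc client):
#   oc get nodes | not_ready_nodes
```

Run against the node listing in this report, the helper would print only worker-0-0, the node that was suspended with virsh.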
Fixed in https://github.com/openshift/cluster-api-provider-baremetal/pull/84
This has been re-tested using 4.6.0-0.nightly-2020-07-14-195500; we can move it to VERIFIED once it changes to ON_QA.
*** This bug has been marked as a duplicate of bug 1857224 ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196