Bug 1855049

Summary: [4.6] The remediation health check strategy 'external-baremetal' remediation logic not working
Product: OpenShift Container Platform Reporter: mlammon
Component: Cloud Compute Assignee: Nir <nyehia>
Cloud Compute sub component: MachineHealthCheck QA Contact: mlammon
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: abeekhof, aos-bugs, gharden, yprokule
Version: 4.6 Keywords: Reopened, TestBlocker, Triaged
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:13:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description mlammon 2020-07-08 18:40:49 UTC
Description of problem:
The remediation logic for the machine health check remediation strategy 'external-baremetal' is not working.

- version   4.6.0-0.nightly-2020-07-07-083718

How reproducible:

Steps to Reproduce:
1. Deploy OCP 4.6 (3 master, 2 worker)
2. Create machine health check -> mHc_Ready.yaml for worker nodes

cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: workers
 namespace: openshift-machine-api
 annotations:
   machine.openshift.io/remediation-strategy: external-baremetal
spec:
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s
EOF

3. Create failure using virsh suspend (VIRTUAL ENVIRONMENT)
virsh suspend worker-0-0

Actual results:
We see the node switch to NotReady, but no remediation follows.

Expected results:
We expect the remediation flow described in the documentation.

Additional info:

[root@sealusa9 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-07-083718

[root@sealusa9 ~]# oc get nodes
master-0-0   Ready      master   22h   v1.18.3+1a1d81c
master-0-1   Ready      master   22h   v1.18.3+1a1d81c
master-0-2   Ready      master   22h   v1.18.3+1a1d81c
worker-0-0   NotReady   worker   22h   v1.18.3+1a1d81c  <------
worker-0-1   Ready      worker   22h   v1.18.3+1a1d81c
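
To pick the affected node out of output like the above without reading the table by eye, something along these lines may help. This is only a minimal sketch: it assumes the default `oc get nodes` column layout (name in column 1, status in column 2), and the function name `not_ready_nodes` is made up for illustration.

```shell
# Print the name of every node whose STATUS column is exactly "NotReady".
# Reads `oc get nodes` style output on stdin. Assumes NAME is column 1 and
# STATUS is column 2 (the default layout). not_ready_nodes is a hypothetical
# helper, not part of oc.
not_ready_nodes() {
  awk '$2 == "NotReady" { print $1 }'
}
```

Usage would be `oc get nodes | not_ready_nodes`. The exact-match comparison skips combined statuses such as `NotReady,SchedulingDisabled`, so a cordoned node would need a looser pattern.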

[root@sealusa9 ~]# oc get mhc -n openshift-machine-api  -w
workers                  2                  1

oc -n openshift-machine-api logs $(oc -n openshift-machine-api get pods | awk '/controllers/{ print$1 }') -c machine-healthcheck-controller  

I0708 18:36:54.487304       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/workers: finding targets
I0708 18:36:54.487476       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-5s7qr/worker-0-1: health checking
I0708 18:36:54.487495       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: health checking
I0708 18:36:54.487507       1 machinehealthcheck_controller.go:601] openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: unhealthy: condition Ready in state Unknown longer than 60s
I0708 18:36:54.513166       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2,  maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
I0708 18:36:54.513228       1 machinehealthcheck_controller.go:214] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: meet unhealthy criteria, triggers remediation
I0708 18:36:54.513236       1 machinehealthcheck_controller.go:438]  openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: start remediation logic
I0708 18:36:54.513245       1 machinehealthcheck_controller.go:233] Reconciling openshift-machine-api/workers: no more targets meet unhealthy criteria

I0708 18:31:54.127115       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2,  maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
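
The controller log repeats on every reconcile, so it can be easier to reduce it to the targets the controller actually flagged. The sketch below assumes the log line layout shown above (target path as the fifth whitespace-separated field, followed by `: unhealthy: condition ...`). The function name `unhealthy_targets` is hypothetical.

```shell
# Print each distinct target (namespace/mhc/machine/node) that the
# machine-healthcheck-controller reported as unhealthy.
# Reads controller log text on stdin. Assumes the line layout seen in this
# report, e.g.:
#   I0708 18:36:54.487507  1 machinehealthcheck_controller.go:601] <target>: unhealthy: condition ...
# unhealthy_targets is a hypothetical helper name.
unhealthy_targets() {
  awk '/: unhealthy: condition / {
         sub(/:$/, "", $5)   # drop the trailing colon after the target path
         print $5
       }' | sort -u
}
```

It can be fed from the same `oc ... logs ... -c machine-healthcheck-controller` command used above by piping its output into `unhealthy_targets`. For the log in this report it would list only the worker-0-0 target, which matches the suspended VM.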

[root@sealusa9 ~]# oc describe node worker-0-0 | grep -A1 Taints
Taints:             node.kubernetes.io/unreachable:NoExecute

Comment 3 mlammon 2020-07-15 12:32:53 UTC
This has been re-tested using 4.6.0-0.nightly-2020-07-14-195500 and we can move it to Verified when it changes to ON_QA

Comment 4 Nir 2020-07-15 13:45:57 UTC

*** This bug has been marked as a duplicate of bug 1857224 ***

Comment 7 errata-xmlrpc 2020-10-27 16:13:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.