Bug 1855049

Summary:	[4.6] Remediation logic for the 'external-baremetal' MachineHealthCheck remediation strategy is not working
Product:	OpenShift Container Platform	Reporter:	mlammon
Component:	Cloud Compute	Assignee:	Nir <nyehia>
Cloud Compute sub component:	MachineHealthCheck	QA Contact:	mlammon
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	abeekhof, aos-bugs, gharden, yprokule
Version:	4.6	Keywords:	Reopened, TestBlocker, Triaged
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-10-27 16:13:10 UTC	Type:	Bug

Description mlammon 2020-07-08 18:40:49 UTC

Description of problem:
The remediation logic for the MachineHealthCheck remediation strategy 'external-baremetal' is not working.

- version   4.6.0-0.nightly-2020-07-07-083718


How reproducible:
100% 

Steps to Reproduce:
1. Deploy OCP 4.6 (3 masters, 2 workers)
2. Create a machine health check manifest, mHc_Ready.yaml, for the worker nodes:

cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: workers
 namespace: openshift-machine-api
 annotations:
   machine.openshift.io/remediation-strategy: external-baremetal
spec:
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s
EOF

3. Simulate a node failure by suspending the worker VM with virsh (virtual environment):
virsh suspend worker-0-0

Actual results:
We see the node switch to NotReady, but the node is not remediated.

Expected results:
We expect the remediation flow described in the documentation:
https://github.com/openshift/cluster-api-provider-baremetal/blob/master/docs/remediation.md

Additional info:


[root@sealusa9 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-07-083718

[root@sealusa9 ~]# oc get nodes
NAME         STATUS     ROLES    AGE   VERSION
master-0-0   Ready      master   22h   v1.18.3+1a1d81c
master-0-1   Ready      master   22h   v1.18.3+1a1d81c
master-0-2   Ready      master   22h   v1.18.3+1a1d81c
worker-0-0   NotReady   worker   22h   v1.18.3+1a1d81c  <------
worker-0-1   Ready      worker   22h   v1.18.3+1a1d81c

[root@sealusa9 ~]# oc get mhc -n openshift-machine-api  -w
NAME      MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
workers                  2                  1
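
The gap between EXPECTEDMACHINES and CURRENTHEALTHY above can be checked mechanically. The following is a minimal sketch, assuming the column layout shown in this output; the function name mhc_unhealthy_count is hypothetical. It reads the last two columns so that an empty MAXUNHEALTHY column does not shift the fields.

```shell
# Hypothetical helper: reads `oc get mhc` table output on stdin and prints
# "<name> <unhealthy>" per MachineHealthCheck, where <unhealthy> is
# EXPECTEDMACHINES minus CURRENTHEALTHY (the last two columns).
mhc_unhealthy_count() {
  # NR > 1 skips the header row; NF >= 3 skips blank or incomplete rows.
  awk 'NR > 1 && NF >= 3 { print $1, $(NF-1) - $NF }'
}

# Usage against a live cluster (not run here):
#   oc get mhc -n openshift-machine-api | mhc_unhealthy_count
```

For the output above this should report one unhealthy machine for the workers MachineHealthCheck.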


oc -n openshift-machine-api logs $(oc -n openshift-machine-api get pods | awk '/controllers/{ print$1 }') -c machine-healthcheck-controller  

I0708 18:36:54.487304       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/workers: finding targets
I0708 18:36:54.487476       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-5s7qr/worker-0-1: health checking
I0708 18:36:54.487495       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: health checking
I0708 18:36:54.487507       1 machinehealthcheck_controller.go:601] openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: unhealthy: condition Ready in state Unknown longer than 60s
I0708 18:36:54.513166       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2,  maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
I0708 18:36:54.513228       1 machinehealthcheck_controller.go:214] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: meet unhealthy criteria, triggers remediation
I0708 18:36:54.513236       1 machinehealthcheck_controller.go:438]  openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: start remediation logic
I0708 18:36:54.513245       1 machinehealthcheck_controller.go:233] Reconciling openshift-machine-api/workers: no more targets meet unhealthy criteria



I0708 18:31:54.127115       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2,  maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
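
To pull only the targets that the controller flagged as unhealthy out of a log like the one above, a small filter helps. This is a sketch that assumes the log line format shown in this report; the function name mhc_unhealthy_targets is hypothetical and is not part of any OpenShift tooling.

```shell
# Hypothetical helper: reads machine-healthcheck-controller log lines on stdin
# and prints each distinct <namespace>/<mhc>/<machine>/<node> target that was
# reported with ": unhealthy: ". Summary lines such as
# "maxUnhealthy: <nil>, unhealthy: 1" are not matched, because the pattern
# requires a colon directly after the target path.
mhc_unhealthy_targets() {
  sed -n 's/^.*\] *\([^ ]*\): unhealthy: .*$/\1/p' | sort -u
}

# Usage against a live cluster (not run here), reusing the command above:
#   oc -n openshift-machine-api logs <controllers-pod> \
#     -c machine-healthcheck-controller | mhc_unhealthy_targets
```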

[root@sealusa9 ~]# oc describe node worker-0-0 | grep -A1 Taints
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
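
The taint check above can be turned into a pass/fail test when scripting the reproduction. This is a minimal sketch, assuming the `oc describe node` output format shown here; the function name node_is_unreachable is hypothetical.

```shell
# Hypothetical helper: reads `oc describe node` output on stdin and succeeds
# (exit status 0) if any node.kubernetes.io/unreachable taint is present.
node_is_unreachable() {
  # -F treats the pattern as a fixed string so the dots are not wildcards.
  grep -qF 'node.kubernetes.io/unreachable:'
}

# Usage against a live cluster (not run here):
#   oc describe node worker-0-0 | node_is_unreachable && echo "worker-0-0 unreachable"
```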

Comment 2 Nir 2020-07-12 06:43:01 UTC

Fixed in https://github.com/openshift/cluster-api-provider-baremetal/pull/84

Comment 3 mlammon 2020-07-15 12:32:53 UTC

This has been re-tested using 4.6.0-0.nightly-2020-07-14-195500 and we can move it to Verified when it changes to ON_QA

Comment 4 Nir 2020-07-15 13:45:57 UTC


*** This bug has been marked as a duplicate of bug 1857224 ***

Comment 7 errata-xmlrpc 2020-10-27 16:13:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196