1855049 – [4.6] The remediation health check strategy 'external-baremetal' remediation logic not working

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Nir
QA Contact:	mlammon
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-07-08 18:40 UTC by mlammon
Modified:	2021-02-06 07:10 UTC
CC List:	4 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:13:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-baremetal pull 84	0	None	closed	Change node finalizer name to match api requirements	2021-02-06 07:04:56 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:13:30 UTC

Description mlammon 2020-07-08 18:40:49 UTC

Description of problem:
The remediation logic of the machine health check remediation strategy 'external-baremetal' is not working.

- version   4.6.0-0.nightly-2020-07-07-083718


How reproducible:
100% 

Steps to Reproduce:
1. Deploy OCP 4.6 (3 master, 2 worker)
2. Create a machine health check for the worker nodes (mHc_Ready.yaml)

cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: workers
 namespace: openshift-machine-api
 annotations:
   machine.openshift.io/remediation-strategy: external-baremetal
spec:
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s
EOF

3. Simulate a node failure with virsh suspend (virtual environment)
virsh suspend worker-0-0

Actual results:
We see the node switch to NotReady, but the expected remediation flow does not follow.

Expected results:
We expect the remediation flow described in the documentation:
https://github.com/openshift/cluster-api-provider-baremetal/blob/master/docs/remediation.md

Additional info:


[root@sealusa9 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-07-083718

[root@sealusa9 ~]# oc get nodes
NAME         STATUS     ROLES    AGE   VERSION
master-0-0   Ready      master   22h   v1.18.3+1a1d81c
master-0-1   Ready      master   22h   v1.18.3+1a1d81c
master-0-2   Ready      master   22h   v1.18.3+1a1d81c
worker-0-0   NotReady   worker   22h   v1.18.3+1a1d81c  <------
worker-0-1   Ready      worker   22h   v1.18.3+1a1d81c
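
The NotReady node in the listing above can also be picked out with a small filter, which is handy when re-testing on larger clusters. This is a minimal sketch, not part of the original report: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of `oc get nodes` (NAME first, STATUS second, one header row).

```shell
# Hypothetical helper: read `oc get nodes` output on stdin and print
# the names of nodes whose STATUS column contains "NotReady".
not_ready_nodes() {
  # NR > 1 skips the header row; $1 is NAME and $2 is STATUS.
  awk 'NR > 1 && $2 ~ /NotReady/ { print $1 }'
}
```

Assumed usage: `oc get nodes | not_ready_nodes`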

[root@sealusa9 ~]# oc get mhc -n openshift-machine-api  -w
NAME      MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
workers                  2                  1


oc -n openshift-machine-api logs $(oc -n openshift-machine-api get pods | awk '/controllers/{ print$1 }') -c machine-healthcheck-controller  

I0708 18:36:54.487304       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/workers: finding targets
I0708 18:36:54.487476       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-5s7qr/worker-0-1: health checking
I0708 18:36:54.487495       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: health checking
I0708 18:36:54.487507       1 machinehealthcheck_controller.go:601] openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: unhealthy: condition Ready in state Unknown longer than 60s
I0708 18:36:54.513166       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2,  maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
I0708 18:36:54.513228       1 machinehealthcheck_controller.go:214] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: meet unhealthy criteria, triggers remediation
I0708 18:36:54.513236       1 machinehealthcheck_controller.go:438]  openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: start remediation logic
I0708 18:36:54.513245       1 machinehealthcheck_controller.go:233] Reconciling openshift-machine-api/workers: no more targets meet unhealthy criteria



I0708 18:31:54.127115       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2,  maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
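
To confirm from the controller log which targets actually reached the remediation step, the log lines above can be filtered. This is a rough sketch under assumptions: the function name remediation_started is hypothetical, and it relies on the message text "start remediation logic" and the line layout shown in the log excerpt above, which may differ in other releases.

```shell
# Hypothetical helper: read machine-healthcheck-controller log lines on
# stdin and print each target for which "start remediation logic" was logged.
remediation_started() {
  awk '/machinehealthcheck_controller\.go/ && /start remediation logic/ {
    # The target is the field just before the word "start";
    # strip its trailing colon before printing.
    for (i = 2; i <= NF; i++) {
      if ($i == "start") { t = $(i - 1); sub(/:$/, "", t); print t }
    }
  }' | sort -u
}
```

Assumed usage: pipe the output of the `oc ... logs ... -c machine-healthcheck-controller` command shown earlier into `remediation_started`. An empty result would suggest the controller never reached the remediation step. In this report the step was reached for worker-0-0, so the problem lies after that point.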

[root@sealusa9 ~]# oc describe node worker-0-0 | grep -A1 Taints
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

Comment 2 Nir 2020-07-12 06:43:01 UTC

Fixed in https://github.com/openshift/cluster-api-provider-baremetal/pull/84

Comment 3 mlammon 2020-07-15 12:32:53 UTC

This has been re-tested using 4.6.0-0.nightly-2020-07-14-195500. We can move it to Verified once the status changes to ON_QA.

Comment 4 Nir 2020-07-15 13:45:57 UTC


*** This bug has been marked as a duplicate of bug 1857224 ***

Comment 7 errata-xmlrpc 2020-10-27 16:13:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196