Bug 1855049 - [4.6] The remediation health check strategy 'external-baremetal' remediation logic not working
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Nir
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-08 18:40 UTC by mlammon
Modified: 2021-02-06 07:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:13:10 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-api-provider-baremetal pull 84 | 0 | None | closed | Change node finalizer name to match api requirements | 2021-02-06 07:04:56 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:13:30 UTC

Description mlammon 2020-07-08 18:40:49 UTC
Description of problem:
The remediation logic of the 'external-baremetal' machine health check remediation strategy is not working.

- version   4.6.0-0.nightly-2020-07-07-083718


How reproducible:
100% 

Steps to Reproduce:
1. Deploy OCP 4.6 (3 master, 2 worker)
2. Create a machine health check for worker nodes (mHc_Ready.yaml)

cat > mHc_Ready.yaml << EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
 name: workers
 namespace: openshift-machine-api
 annotations:
   machine.openshift.io/remediation-strategy: external-baremetal
spec:
 selector:
   matchLabels:
     machine.openshift.io/cluster-api-machine-role: worker
 unhealthyConditions:
 - type: Ready
   status: Unknown
   timeout: 60s
EOF

3. Simulate a failure by suspending a worker VM with virsh (virtual environment)
virsh suspend worker-0-0

Actual results:
We see the node switch to NotReady, but the remediation flow does not proceed.

Expected results:
We expect the remediation flow described in the doc:
https://github.com/openshift/cluster-api-provider-baremetal/blob/master/docs/remediation.md

Additional info:


[root@sealusa9 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-07-083718

[root@sealusa9 ~]# oc get nodes
NAME         STATUS     ROLES    AGE   VERSION
master-0-0   Ready      master   22h   v1.18.3+1a1d81c
master-0-1   Ready      master   22h   v1.18.3+1a1d81c
master-0-2   Ready      master   22h   v1.18.3+1a1d81c
worker-0-0   NotReady   worker   22h   v1.18.3+1a1d81c  <------
worker-0-1   Ready      worker   22h   v1.18.3+1a1d81c
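
To pick the affected node out of output like the above without scanning it by eye, a small filter can help. This is a minimal sketch: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of `oc get nodes` (NAME first, STATUS second).

```shell
# Hypothetical helper: read `oc get nodes` output on stdin and print the
# names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  # NR > 1 skips the header row; $1 is NAME, $2 is STATUS.
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (requires cluster access):
#   oc get nodes | not_ready_nodes
```

For the node list above this prints worker-0-0.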

[root@sealusa9 ~]# oc get mhc -n openshift-machine-api  -w
NAME      MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
workers                  2                  1


oc -n openshift-machine-api logs $(oc -n openshift-machine-api get pods | awk '/controllers/{ print$1 }') -c machine-healthcheck-controller  

I0708 18:36:54.487304       1 machinehealthcheck_controller.go:166] Reconciling openshift-machine-api/workers: finding targets
I0708 18:36:54.487476       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-5s7qr/worker-0-1: health checking
I0708 18:36:54.487495       1 machinehealthcheck_controller.go:278] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: health checking
I0708 18:36:54.487507       1 machinehealthcheck_controller.go:601] openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: unhealthy: condition Ready in state Unknown longer than 60s
I0708 18:36:54.513166       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2,  maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
I0708 18:36:54.513228       1 machinehealthcheck_controller.go:214] Reconciling openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: meet unhealthy criteria, triggers remediation
I0708 18:36:54.513236       1 machinehealthcheck_controller.go:438]  openshift-machine-api/workers/ocp-edge-cluster-rdu2-0-worker-0-p722p/worker-0-0: start remediation logic
I0708 18:36:54.513245       1 machinehealthcheck_controller.go:233] Reconciling openshift-machine-api/workers: no more targets meet unhealthy criteria



I0708 18:31:54.127115       1 machinehealthcheck_controller.go:205] Reconciling openshift-machine-api/workers: monitoring MHC: total targets: 2,  maxUnhealthy: <nil>, unhealthy: 1. Remediations are allowed
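
The controller log above can be condensed per target to see how far the health check got. The following is a minimal sketch, not part of oc or the controller: the function name mhc_target_summary is hypothetical, and the match strings are taken from the log lines quoted in this report, so they may differ in other versions.

```shell
# Hypothetical helper: read machine-healthcheck-controller log lines on
# stdin and count, for one target node, how often it was found unhealthy,
# how often remediation was triggered, and how often it was started.
mhc_target_summary() {
  target="$1"   # node name as it appears at the end of the log path, e.g. worker-0-0
  awk -v t="/$target:" '
    # Only consider lines whose object path ends in "/<target>:".
    index($0, t) && /unhealthy: condition/    { unhealthy++ }
    index($0, t) && /triggers remediation/    { triggered++ }
    index($0, t) && /start remediation logic/ { started++ }
    END {
      printf "unhealthy=%d triggered=%d started=%d\n", unhealthy, triggered, started
    }'
}

# Usage (requires cluster access), reusing the pod lookup from this report:
#   oc -n openshift-machine-api logs $(oc -n openshift-machine-api get pods | awk '/controllers/{ print$1 }') \
#     -c machine-healthcheck-controller | mhc_target_summary worker-0-0
```

Run against the 18:36:54 lines quoted above, this prints unhealthy=1 triggered=1 started=1: the controller logged the start of remediation for worker-0-0, and the quoted excerpt shows nothing for that target afterwards.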

[root@sealusa9 ~]# oc describe node worker-0-0 | grep -A1 Taints
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

Comment 3 mlammon 2020-07-15 12:32:53 UTC
This has been re-tested using 4.6.0-0.nightly-2020-07-14-195500 and we can move it to Verified when it changes to ON_QA

Comment 4 Nir 2020-07-15 13:45:57 UTC

*** This bug has been marked as a duplicate of bug 1857224 ***

Comment 7 errata-xmlrpc 2020-10-27 16:13:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

