Bug 1809049

Summary: Delete is triggered when MHC has "healthchecking.openshift.io/strategy: reboot" annotation [Azure Set up]
Product: OpenShift Container Platform
Reporter: Milind Yadav <miyadav>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Docs Contact:
Severity: unspecified
Priority: unspecified
Version: 4.4
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-04 18:03:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Milind Yadav 2020-03-02 10:23:59 UTC
Description of problem: Delete is triggered when the MHC has the "healthchecking.openshift.io/strategy: reboot" annotation [Azure setup]

Version-Release number of selected component (if applicable):
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-01-215047   True        False         8h      Cluster version is 4.4.0-0.nightly-2020-03-01-215047

Steps to Reproduce:
1. Create an MHC
--- 
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata: 
  creationTimestamp: "2020-02-14T09:47:08Z"
  generation: 1
  name: "<User defined Name>"
  namespace: openshift-machine-api
  resourceVersion: "71059"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc-miyadav-1402-drlvf-worker-us-east-2c
  uid: ef74b735-e58e-4c24-aa69-015d90998b77
spec: 
  maxUnhealthy: 3
  selector: 
    matchLabels: 
      machine.openshift.io/cluster-api-cluster: "<Your Cluster Name>"
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: "<Your Machine Set>"
  unhealthyConditions: 
    - 
      status: "False"
      timeout: 300s
      type: Ready
    - 
      status: Unknown
      timeout: 300s
      type: Ready
Result: MHC created successfully
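
The manifest above includes metadata fields (creationTimestamp, generation, resourceVersion, selfLink, uid) that the API server populates, which suggests it was exported from an existing object. A minimal sketch, assuming the manifest has already been loaded into a Python dict, of dropping those fields before reusing it. The helper name strip_server_fields and the placeholder values are hypothetical.

```python
import copy

# Metadata fields populated by the API server. They appear in the manifest
# above but are not needed when creating a new object.
SERVER_FIELDS = ("creationTimestamp", "generation", "resourceVersion", "selfLink", "uid")


def strip_server_fields(manifest: dict) -> dict:
    """Return a copy of the manifest without server-populated metadata."""
    cleaned = copy.deepcopy(manifest)
    metadata = cleaned.get("metadata", {})
    for field in SERVER_FIELDS:
        metadata.pop(field, None)
    return cleaned


# Trimmed-down stand-in for the exported MachineHealthCheck (illustrative values).
exported = {
    "apiVersion": "machine.openshift.io/v1beta1",
    "kind": "MachineHealthCheck",
    "metadata": {
        "creationTimestamp": "2020-02-14T09:47:08Z",
        "generation": 1,
        "name": "example-mhc",
        "namespace": "openshift-machine-api",
        "resourceVersion": "71059",
        "uid": "ef74b735-e58e-4c24-aa69-015d90998b77",
    },
    "spec": {"maxUnhealthy": 3},
}

cleaned = strip_server_fields(exported)
print(cleaned["metadata"])
```

The original dict is left untouched, so the exported object can still be compared against the cleaned one.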

2. Annotate the MHC with the 'reboot' remediation strategy
oc annotate mhc NAME healthchecking.openshift.io/strategy=reboot
Result: annotation applied successfully

3. In the cloud provider console, stop the instance backing the node
Result: instance stopped successfully

4. Check the machine annotations:
oc get machine <machine-name> -o=jsonpath="{.metadata.annotations}"

Actual results: The annotations are map[machine.openshift.io/instance-state:Updating], and the machine is deleted.

Expected results: The instance state should indicate a reboot rather than Updating, and the machine should not be deleted.
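
When scripting this check, the map[...] text printed by the jsonpath query can be turned into a dictionary. The sketch below is a rough helper, not part of oc; it assumes annotation values contain no spaces or colons, which holds for the output shown in this report but not in general. The function name parse_go_map is hypothetical.

```python
def parse_go_map(text: str) -> dict:
    """Parse Go-style 'map[key:value key:value]' text into a dict.

    Assumes values contain neither spaces nor colons. Keys may contain
    dots and slashes, so each pair is split on its last colon.
    """
    text = text.strip()
    if not (text.startswith("map[") and text.endswith("]")):
        raise ValueError("expected text of the form map[...]")
    body = text[len("map["):-1]
    result = {}
    for pair in body.split():
        key, separator, value = pair.rpartition(":")
        if not separator:
            raise ValueError(f"no key/value separator in {pair!r}")
        result[key] = value
    return result


annotations = parse_go_map("map[machine.openshift.io/instance-state:Updating]")
print(annotations["machine.openshift.io/instance-state"])
```

For anything beyond a quick check, requesting the object with -o json and reading .metadata.annotations avoids these parsing assumptions.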



Additional info:
oc describe mhc mhc1
Name:         mhc1
Namespace:    openshift-machine-api
Labels:       <none>
Annotations:  healthchecking.openshift.io/strategy: reboot
API Version:  machine.openshift.io/v1beta1
Kind:         MachineHealthCheck
Metadata:
  Creation Timestamp:  2020-03-02T09:43:22Z
  Generation:          1
  Resource Version:    159714
  Self Link:           /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc1
  UID:                 29ed6db5-bf78-4349-a8f1-536029b9a394
Spec:
  Max Unhealthy:  1
  Selector:
    Match Labels:
      machine.openshift.io/cluster-api-cluster:       zhsun-b6sbk
      machine.openshift.io/cluster-api-machine-role:  worker
      machine.openshift.io/cluster-api-machine-type:  worker
      machine.openshift.io/cluster-api-machineset:    zhsun-b6sbk-worker-centralus2
  Unhealthy Conditions:
    Status:   False
    Timeout:  300s
    Type:     Ready
    Status:   Unknown
    Timeout:  300s
    Type:     Ready
Status:
  Current Healthy:    1
  Expected Machines:  1
Events:
  Type     Reason                 Age                  From                           Message
  ----     ------                 ----                 ----                           -------
  Warning  RemediationRestricted  9m4s (x26 over 15m)  machinehealthcheck-controller  Remediation restricted due to exceeded number of unhealthy machines (total: 2, unhealthy: 2, maxUnhealthy: 1)
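
The RemediationRestricted event reports the counts behind the decision. The sketch below is a guess at the gate inferred only from that message (remediation proceeds while the unhealthy count does not exceed maxUnhealthy); it is not the controller's code, it handles integer values only, and the function names are hypothetical.

```python
import re

# Matches the counts in the event message shown above.
EVENT_COUNTS = re.compile(r"total: (\d+), unhealthy: (\d+), maxUnhealthy: (\d+)")


def parse_event_counts(message: str) -> dict:
    """Extract the machine counts from a RemediationRestricted message."""
    match = EVENT_COUNTS.search(message)
    if match is None:
        raise ValueError("message does not contain machine counts")
    total, unhealthy, max_unhealthy = (int(group) for group in match.groups())
    return {"total": total, "unhealthy": unhealthy, "max_unhealthy": max_unhealthy}


def remediation_allowed(unhealthy: int, max_unhealthy: int) -> bool:
    """Assumed gate: remediate only while unhealthy <= maxUnhealthy."""
    return unhealthy <= max_unhealthy


message = (
    "Remediation restricted due to exceeded number of unhealthy machines "
    "(total: 2, unhealthy: 2, maxUnhealthy: 1)"
)
counts = parse_event_counts(message)
allowed = remediation_allowed(counts["unhealthy"], counts["max_unhealthy"])
print(counts, allowed)
```

With the values from the event (2 unhealthy, maxUnhealthy 1) the gate is closed, which is consistent with the warning being emitted.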



Logs:
I0302 09:57:23.000322       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:24.657182       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:24.664357       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:34.675665       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:34.699337       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: meet unhealthy criteria, triggers remediation
I0302 09:57:34.699372       1 machinehealthcheck_controller.go:428]  openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: start remediation logic
I0302 09:57:34.699383       1 machinehealthcheck_controller.go:452] openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: deleting
I0302 09:57:34.716910       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:34.736539       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: meet unhealthy criteria, triggers remediation
I0302 09:57:34.736545       1 machinehealthcheck_controller.go:428]  openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: start remediation logic
I0302 09:57:34.736551       1 machinehealthcheck_controller.go:452] openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: deleting
I0302 09:57:38.963250       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:38.974295       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: meet unhealthy criteria, triggers remediation
I0302 09:57:38.974301       1 machinehealthcheck_controller.go:428]  openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: start remediation logic
I0302 09:57:38.974309       1 machinehealthcheck_controller.go:452] openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: deleting
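
To confirm how often deletion was triggered, log lines like the ones above can be tallied per machine. This is a minimal sketch; the regular expression is written against the line layout seen in this report and is an assumption about the format, not a general log parser. The names LOG_LINE and count_deletions are hypothetical.

```python
import re
from collections import Counter

# Layout assumed from the lines above:
# <level><MMDD> <time> <pid> <file>:<line>] [Reconciling ]<target>: <message>
LOG_LINE = re.compile(
    r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ "
    r"(?P<source>[\w.]+:\d+)\]\s+(?:Reconciling )?"
    r"(?P<target>\S+): (?P<message>.*)$"
)


def count_deletions(lines):
    """Count 'deleting' messages per namespace/mhc/machine/node target."""
    deletions = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("message") == "deleting":
            deletions[match.group("target")] += 1
    return deletions


TARGET = (
    "openshift-machine-api/mhc1/"
    "zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n"
)
sample = [
    f"I0302 09:57:34.699337       1 machinehealthcheck_controller.go:204] Reconciling {TARGET}: meet unhealthy criteria, triggers remediation",
    f"I0302 09:57:34.699372       1 machinehealthcheck_controller.go:428]  {TARGET}: start remediation logic",
    f"I0302 09:57:34.699383       1 machinehealthcheck_controller.go:452] {TARGET}: deleting",
]
deletions = count_deletions(sample)
print(deletions)
```

Run over the full excerpt instead of the three sample lines, the same function would count every "deleting" entry for the machine.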

Comment 1 Alberto 2020-03-02 10:49:13 UTC
The annotation was renamed to `host.metal3.io/external-remediation`; see https://github.com/openshift/machine-api-operator/pull/476/files#diff-614d58186947ca2e4e215d42c496d72eR31

Comment 4 Milind Yadav 2020-03-12 10:29:11 UTC
Description of problem: Delete is triggered when the MHC has the "healthchecking.openshift.io/strategy: reboot" annotation [Azure setup]

Version-Release number of selected component (if applicable):
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-03-12-041748   True        False         153m    Cluster version is 4.5.0-0.nightly-2020-03-12-041748


Steps to Reproduce:
1. Create an MHC
--- 
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata: 
  creationTimestamp: "2020-02-14T09:47:08Z"
  generation: 1
  name: "<User defined Name>"
  namespace: openshift-machine-api
  resourceVersion: "71059"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc-miyadav-1402-drlvf-worker-us-east-2c
  uid: ef74b735-e58e-4c24-aa69-015d90998b77
spec: 
  maxUnhealthy: 3
  selector: 
    matchLabels: 
      machine.openshift.io/cluster-api-cluster: "<Your Cluster Name>"
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: "<Your Machine Set>"
  unhealthyConditions: 
    - 
      status: "False"
      timeout: 300s
      type: Ready
    - 
      status: Unknown
      timeout: 300s
      type: Ready
Result: MHC created successfully

2. Annotate the MHC with the 'external-baremetal' remediation strategy
oc annotate mhc NAME machine.openshift.io/remediation-strategy=external-baremetal
Result: annotation applied successfully

3. In the cloud provider console, stop the instance backing the node
Result: instance stopped successfully

4. Check the machine annotations:
oc get machine <machine-name> -o jsonpath="{.metadata.annotations}"

Actual results: map[host.metal3.io/external-remediation: machine.openshift.io/instance-state:Running]

Expected results: Remediation should trigger but should not delete the machine. Instead, it should add the annotation "host.metal3.io/external-remediation" to the machine.
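
The expected behavior can be modeled in a few lines. The sketch below is a toy model of what this comment describes, not the controller's implementation: the annotation keys and the external-baremetal value come from this report, while the function name, the dict-based machine, and the "deleted" flag are hypothetical.

```python
# Keys and value taken from the commands and results in this report.
REMEDIATION_STRATEGY_ANNOTATION = "machine.openshift.io/remediation-strategy"
EXTERNAL_BAREMETAL = "external-baremetal"
EXTERNAL_REMEDIATION_ANNOTATION = "host.metal3.io/external-remediation"


def remediate(mhc_annotations: dict, machine: dict) -> str:
    """Annotate the machine for external remediation, or mark it deleted.

    Toy model: 'machine' is a plain dict and deletion is a boolean flag.
    """
    if mhc_annotations.get(REMEDIATION_STRATEGY_ANNOTATION) == EXTERNAL_BAREMETAL:
        annotations = machine.setdefault("annotations", {})
        annotations[EXTERNAL_REMEDIATION_ANNOTATION] = ""
        return "annotated"
    machine["deleted"] = True
    return "deleted"


machine = {"annotations": {"machine.openshift.io/instance-state": "Running"}}
outcome = remediate({REMEDIATION_STRATEGY_ANNOTATION: EXTERNAL_BAREMETAL}, machine)
print(outcome, machine["annotations"])

# An MHC carrying only the older annotation falls through to deletion here.
old_machine = {"annotations": {}}
old_outcome = remediate({"healthchecking.openshift.io/strategy": "reboot"}, old_machine)
print(old_outcome)
```

In this model an MHC that carries only the older healthchecking.openshift.io/strategy annotation is not recognized and falls through to deletion, which would be consistent with the behavior in the original description.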

Comment 5 errata-xmlrpc 2020-08-04 18:03:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409