1809049 – Delete is triggered when MHC has "healthchecking.openshift.io/strategy: reboot" annotation [Azure Set up]

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Alberto
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-03-02 10:23 UTC by Milind Yadav
Modified:	2020-08-04 18:03 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-08-04 18:03:04 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-08-04 18:03:07 UTC

Description Milind Yadav 2020-03-02 10:23:59 UTC

Description of problem: Machine deletion is triggered when the MHC has the "healthchecking.openshift.io/strategy: reboot" annotation [Azure setup]

Version-Release number of selected component (if applicable):
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-01-215047   True        False         8h      Cluster version is 4.4.0-0.nightly-2020-03-01-215047

Steps to Reproduce:
1. Create an MHC:
--- 
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata: 
  creationTimestamp: "2020-02-14T09:47:08Z"
  generation: 1
  name: "<User defined Name>"
  namespace: openshift-machine-api
  resourceVersion: "71059"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc-miyadav-1402-drlvf-worker-us-east-2c
  uid: ef74b735-e58e-4c24-aa69-015d90998b77
spec: 
  maxUnhealthy: 3
  selector: 
    matchLabels: 
      machine.openshift.io/cluster-api-cluster: "<Your Cluster Name>"
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: "<Your Machine Set>"
  unhealthyConditions:
    - status: "False"
      timeout: 300s
      type: Ready
    - status: Unknown
      timeout: 300s
      type: Ready
Result: MHC created successfully.

2. Annotate the MHC with the 'reboot' remediation strategy:
oc annotate mhc NAME healthchecking.openshift.io/strategy=reboot
Result: annotation applied successfully.

3. In the cloud provider console, stop the instance backing the node.
Result: instance stopped successfully.

4. Inspect the machine's annotations:
oc get machine <machine-name> -o=jsonpath="{.metadata.annotations}"

Actual results: The annotations show map[machine.openshift.io/instance-state:Updating], and deletion of the machine is triggered.

Expected results: The annotations should show reboot instead of Updating, and the machine should not be deleted.



Additional info:
oc describe mhc mhc1
Name:         mhc1
Namespace:    openshift-machine-api
Labels:       <none>
Annotations:  healthchecking.openshift.io/strategy: reboot
API Version:  machine.openshift.io/v1beta1
Kind:         MachineHealthCheck
Metadata:
  Creation Timestamp:  2020-03-02T09:43:22Z
  Generation:          1
  Resource Version:    159714
  Self Link:           /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc1
  UID:                 29ed6db5-bf78-4349-a8f1-536029b9a394
Spec:
  Max Unhealthy:  1
  Selector:
    Match Labels:
      machine.openshift.io/cluster-api-cluster:       zhsun-b6sbk
      machine.openshift.io/cluster-api-machine-role:  worker
      machine.openshift.io/cluster-api-machine-type:  worker
      machine.openshift.io/cluster-api-machineset:    zhsun-b6sbk-worker-centralus2
  Unhealthy Conditions:
    Status:   False
    Timeout:  300s
    Type:     Ready
    Status:   Unknown
    Timeout:  300s
    Type:     Ready
Status:
  Current Healthy:    1
  Expected Machines:  1
Events:
  Type     Reason                 Age                  From                           Message
  ----     ------                 ----                 ----                           -------
  Warning  RemediationRestricted  9m4s (x26 over 15m)  machinehealthcheck-controller  Remediation restricted due to exceeded number of unhealthy machines (total: 2, unhealthy: 2, maxUnhealthy: 1)
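
For readers unfamiliar with the RemediationRestricted event above, here is a minimal sketch of the check it appears to reflect. The rule `unhealthy <= maxUnhealthy` is an assumption inferred from the event wording ("exceeded number of unhealthy machines"), not taken from the controller source, and the function names are hypothetical.

```python
import re

# Matches the counts embedded in the RemediationRestricted event message.
EVENT_RE = re.compile(
    r"total: (?P<total>\d+), unhealthy: (?P<unhealthy>\d+), "
    r"maxUnhealthy: (?P<max_unhealthy>\d+)"
)


def parse_restricted_event(message: str) -> dict:
    """Extract the integer counts from a RemediationRestricted message."""
    match = EVENT_RE.search(message)
    if match is None:
        raise ValueError("message does not contain the expected counts")
    return {name: int(value) for name, value in match.groupdict().items()}


def remediation_allowed(unhealthy: int, max_unhealthy: int) -> bool:
    """Assumed rule: remediate only while unhealthy <= maxUnhealthy."""
    return unhealthy <= max_unhealthy


# Event text copied from the `oc describe mhc mhc1` output above.
EVENT = (
    "Remediation restricted due to exceeded number of unhealthy machines "
    "(total: 2, unhealthy: 2, maxUnhealthy: 1)"
)
counts = parse_restricted_event(EVENT)
allowed = remediation_allowed(counts["unhealthy"], counts["max_unhealthy"])
```

With the counts from the event (2 unhealthy against a cap of 1), the assumed rule reports that remediation is not allowed, which matches the warning.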



Logs:
I0302 09:57:23.000322       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:24.657182       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:24.664357       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:34.675665       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:34.699337       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: meet unhealthy criteria, triggers remediation
I0302 09:57:34.699372       1 machinehealthcheck_controller.go:428]  openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: start remediation logic
I0302 09:57:34.699383       1 machinehealthcheck_controller.go:452] openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: deleting
I0302 09:57:34.716910       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:34.736539       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: meet unhealthy criteria, triggers remediation
I0302 09:57:34.736545       1 machinehealthcheck_controller.go:428]  openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: start remediation logic
I0302 09:57:34.736551       1 machinehealthcheck_controller.go:452] openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: deleting
I0302 09:57:38.963250       1 machinehealthcheck_controller.go:268] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: health checking
I0302 09:57:38.974295       1 machinehealthcheck_controller.go:204] Reconciling openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: meet unhealthy criteria, triggers remediation
I0302 09:57:38.974301       1 machinehealthcheck_controller.go:428]  openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: start remediation logic
I0302 09:57:38.974309       1 machinehealthcheck_controller.go:452] openshift-machine-api/mhc1/zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n: deleting
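
To make the repeated "deleting" entries easier to spot, the log above can be summarized with a small script. This is only a sketch: the regular expression assumes the klog-style header seen in these lines (severity and date, time, thread id, file:line]), and the helper name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Header layout seen in the log: severity+MMDD, time, thread id, file:line],
# then an optional "Reconciling " prefix, the target, and the action.
KLOG_RE = re.compile(
    r"^[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<file>[\w.]+):(?P<line>\d+)\]\s+"
    r"(?:Reconciling )?(?P<target>\S+): (?P<action>.+)$"
)


def summarize(lines):
    """Count how often each (target, action) pair appears in the log."""
    counts = Counter()
    for raw in lines:
        match = KLOG_RE.match(raw.strip())
        if match:
            counts[(match["target"], match["action"])] += 1
    return counts


TARGET = (
    "openshift-machine-api/mhc1/"
    "zhsun-b6sbk-worker-centralus2-6mm5n/zhsun-b6sbk-worker-centralus2-6mm5n"
)
# Four consecutive lines copied from the log above.
SAMPLE = [
    "I0302 09:57:34.675665       1 machinehealthcheck_controller.go:268] "
    "Reconciling " + TARGET + ": health checking",
    "I0302 09:57:34.699337       1 machinehealthcheck_controller.go:204] "
    "Reconciling " + TARGET + ": meet unhealthy criteria, triggers remediation",
    "I0302 09:57:34.699372       1 machinehealthcheck_controller.go:428]  "
    + TARGET + ": start remediation logic",
    "I0302 09:57:34.699383       1 machinehealthcheck_controller.go:452] "
    + TARGET + ": deleting",
]
summary = summarize(SAMPLE)
```

Run against the full log, the same function would show the "deleting" action being logged on every remediation pass for the same machine.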

Comment 1 Alberto 2020-03-02 10:49:13 UTC

The annotation was renamed to `host.metal3.io/external-remediation`: https://github.com/openshift/machine-api-operator/pull/476/files#diff-614d58186947ca2e4e215d42c496d72eR31

Comment 4 Milind Yadav 2020-03-12 10:29:11 UTC

Description of problem: Machine deletion is triggered when the MHC has the "healthchecking.openshift.io/strategy: reboot" annotation [Azure setup]

Version-Release number of selected component (if applicable):
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-03-12-041748   True        False         153m    Cluster version is 4.5.0-0.nightly-2020-03-12-041748


Steps to Reproduce:
1. Create an MHC:
--- 
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata: 
  creationTimestamp: "2020-02-14T09:47:08Z"
  generation: 1
  name: "<User defined Name>"
  namespace: openshift-machine-api
  resourceVersion: "71059"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinehealthchecks/mhc-miyadav-1402-drlvf-worker-us-east-2c
  uid: ef74b735-e58e-4c24-aa69-015d90998b77
spec: 
  maxUnhealthy: 3
  selector: 
    matchLabels: 
      machine.openshift.io/cluster-api-cluster: "<Your Cluster Name>"
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: "<Your Machine Set>"
  unhealthyConditions:
    - status: "False"
      timeout: 300s
      type: Ready
    - status: Unknown
      timeout: 300s
      type: Ready
Result: MHC created successfully.

2. Annotate the MHC with the external-baremetal remediation strategy (previously 'reboot'):
oc annotate mhc NAME machine.openshift.io/remediation-strategy=external-baremetal
Result: annotation applied successfully.

3. In the cloud provider console, stop the instance backing the node.
Result: instance stopped successfully.

4. Inspect the machine's annotations:
oc get machine <machine-name> -o jsonpath="{.metadata.annotations}"

Actual results: map[host.metal3.io/external-remediation: machine.openshift.io/instance-state:Running]


Expected results: Remediation should trigger but should not delete the machine.
Instead, it should add the annotation "host.metal3.io/external-remediation" to the machine.
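
The expected behaviour can be sketched as a small decision function. This is an illustration only, not the controller's implementation: the annotation keys and the `external-baremetal` value are taken from this report, while the function name, the return shape, and the empty annotation value are assumptions.

```python
# Keys and value as they appear in this bug report.
REMEDIATION_STRATEGY_KEY = "machine.openshift.io/remediation-strategy"
EXTERNAL_BAREMETAL = "external-baremetal"
EXTERNAL_REMEDIATION_KEY = "host.metal3.io/external-remediation"


def remediate(mhc_annotations: dict, machine_annotations: dict):
    """Return (action, machine annotations after remediation).

    Assumed behaviour: with the external-baremetal strategy the machine is
    annotated and kept; with anything else the machine is deleted.
    """
    updated = dict(machine_annotations)
    if mhc_annotations.get(REMEDIATION_STRATEGY_KEY) == EXTERNAL_BAREMETAL:
        # Empty value is an assumption based on the output in step 4.
        updated[EXTERNAL_REMEDIATION_KEY] = ""
        return "annotate", updated
    return "delete", updated


# The renamed annotation leads to annotating the machine.
new_style = remediate(
    {REMEDIATION_STRATEGY_KEY: EXTERNAL_BAREMETAL},
    {"machine.openshift.io/instance-state": "Running"},
)

# The old annotation from the original description is not recognized in this
# sketch, so the default path (deletion) is taken.
old_style = remediate({"healthchecking.openshift.io/strategy": "reboot"}, {})
```

Under these assumptions, an MHC that still carries the old annotation key falls through to the delete path, which would explain the behaviour in the original description.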

Comment 5 errata-xmlrpc 2020-08-04 18:03:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
