1800425 – Choose more appropriate annotation for external remediation

Summary: Choose more appropriate annotation for external remediation

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Steven Hardy
QA Contact:	Amit Ugol
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-02-07 03:01 UTC by Andrew Beekhof
Modified:	2020-05-17 16:56 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-15 16:04:29 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-operator pull 476	0	None	closed	Bug 1800425: More appropriate annotation name	2020-05-15 14:35:30 UTC

Description Andrew Beekhof 2020-02-07 03:01:27 UTC

Description of problem:

Since MHC has no control over how the machine is remediated, it would be
better not to imply that it will (only) be via reboot.

Update the annotation, variables, functions, and logging as appropriate.


Version-Release number of selected component (if applicable): 4.3

Comment 3 Milind Yadav 2020-02-21 03:46:50 UTC

The steps for testing are:

create an MHC -> annotate it with the strategy -> stop the instance from the provider console -> monitor the MHC

Expected: if annotated with machine.openshift.io/remediation-strategy=external-baremetal, the machine will not be deleted or remediated by the healthcheck controller itself.

I need more info on whether the above steps suffice.

Comment 4 Milind Yadav 2020-03-09 07:25:37 UTC

-- Expecting the below steps to cover the testing for the change --

version :
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-08-213224   True        False         6h45m   Cluster version is 4.4.0-0.nightly-2020-03-08-213224


Steps :

1. Create an MHC using the YAML below:
---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: mh1
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <Your cluster>
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: <Your machineset>
  unhealthyConditions:
    - status: "False"
      timeout: 300s
      type: Ready
    - status: "Unknown"
      timeout: 300s
      type: Ready

2. Annotate the MHC:
 oc annotate mhc <mhc name> healthchecking.openshift.io/strategy=machine.openshift.io/remediation-strategy=external-baremetal

3. Terminate the machine of the machineset being monitored by the MHC, using the IaaS console (AWS in this case).

Actual: Machine remediation did not happen and the machine stays in the Failed state
Expected: No remediation should take place

Comment 5 Andrew Beekhof 2020-03-09 23:54:39 UTC

(In reply to Milind Yadav from comment #4)
> -- Expecting the below steps to cover the testing for the change --
> 
> version :
> NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.4.0-0.nightly-2020-03-08-213224   True        False         6h45m   Cluster version is 4.4.0-0.nightly-2020-03-08-213224
> 
> 
> Steps :
> 
> 1.Create mhc use below yaml :
> ---
> apiVersion: machine.openshift.io/v1beta1
> kind: MachineHealthCheck
> metadata:
>   name: mh1
>   namespace: openshift-machine-api
> spec:
>   maxUnhealthy: 3
>   selector:
>     matchLabels:
>       machine.openshift.io/cluster-api-cluster: <Your cluster>
>       machine.openshift.io/cluster-api-machine-role: worker
>       machine.openshift.io/cluster-api-machine-type: worker
>       machine.openshift.io/cluster-api-machineset: <Your machineset>
>   unhealthyConditions:
>     - 
>       status: "False"
>       timeout: 300s
>       type: Ready
>     - 
>       status: Unknown
>       timeout: 300s
>       type: Ready
> 
> 2.Annotate mhc : 
>  oc annotate mhc <mhc name> healthchecking.openshift.io/strategy=machine.openshift.io/remediation-strategy=external-baremetal
> 

This looks wrong.

I think you want: oc annotate mhc <mhc name> machine.openshift.io/remediation-strategy=external-baremetal

> 3.Terminate the machine of the machineset being monitored by mhc using the
> IAAS console (AWS in this)
> 
> Actual : Machine remediation did not happen and it stays in Failed state
> Expected : No remediation should take place

Was the 'host.metal3.io/external-remediation' annotation added to the machine associated with the failed node?
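
One way to answer this question is to look at the Machine's annotations directly rather than at the Node. The following is a minimal sketch, not taken from this bug: the helper name `has_external_remediation` is hypothetical, and it does a plain text match on the JSON that `oc get machine ... -o json` prints, so treat it as a quick check rather than a strict parser.

```shell
#!/bin/sh
# Hypothetical helper (not part of oc or the machine-api-operator):
# reads Machine JSON on stdin and succeeds if the quoted annotation key
# "host.metal3.io/external-remediation" appears in it.
# -F matches a fixed string, -q suppresses output.
has_external_remediation() {
    grep -qF '"host.metal3.io/external-remediation"'
}

# Intended use against a live cluster (commented out so the snippet has no side effects):
#   oc get machine <machine name> -n openshift-machine-api -o json \
#     | has_external_remediation && echo "annotation present" || echo "annotation missing"
```

The check is done on the Machine object in the openshift-machine-api namespace because that object still exists after the instance is terminated, whereas the Node may be gone.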

Comment 6 Milind Yadav 2020-03-11 08:01:21 UTC

I cannot check the annotation on the node, as the node died after the instance that contained it was terminated.

Do you mean whether the annotation 'host.metal3.io/external-remediation' was added to the machine that is showing Failed status?

If so, then no, it wasn't. The annotation was:

  annotations:
    machine.openshift.io/instance-state: running

Comment 7 Andrew Beekhof 2020-03-11 11:47:28 UTC

(In reply to Milind Yadav from comment #6)
> I cannot  check annotation at the node as , node died after the Instance
> that was containing it got terminated .

It should be on the Machine, not the node.
If the Node got deleted, then you've tested the default remediation strategy (deletion) not the baremetal one. 

> 
> Do you mean the annotation 'host.metal3.io/external-remediation' was added or
> not on the machine that is showing failed status ? 
> 
> Then , no , it wasnt , the annotation was 
> 
>   annotations:
>     machine.openshift.io/instance-state: running

I would recommend retesting with 'oc annotate mhc <mhc name> machine.openshift.io/remediation-strategy=external-baremetal'
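
The difference between the two annotate arguments in this thread can be illustrated with a small sketch. It assumes that `oc annotate` splits each KEY=VALUE argument on the first "=" only; that is an assumption about the client, not something stated in this bug, and the helper names `annotation_key` and `annotation_value` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: split a KEY=VALUE argument on the first "=" only
# (assumed here to mirror how the annotate argument is interpreted).
annotation_key() {
    # Remove the longest suffix starting at the first "=".
    printf '%s\n' "${1%%=*}"
}
annotation_value() {
    # Remove the shortest prefix ending at the first "=".
    printf '%s\n' "${1#*=}"
}

# Example (commented out): compare the two arguments from the thread.
#   annotation_key 'healthchecking.openshift.io/strategy=machine.openshift.io/remediation-strategy=external-baremetal'
#   annotation_key 'machine.openshift.io/remediation-strategy=external-baremetal'
```

Under that assumption, the original step 2 argument sets the key healthchecking.openshift.io/strategy instead of machine.openshift.io/remediation-strategy, which would be consistent with the default deletion strategy having been tested.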

Comment 8 Milind Yadav 2020-03-12 03:59:43 UTC

@Andrew, I think this is what you expected and it is correct. I will update the annotation value as you suggested, thanks. The case is still VERIFIED.

In the validation steps, step 2 ("Annotate mhc") was updated from

> oc annotate mhc <mhc name> healthchecking.openshift.io/strategy=machine.openshift.io/remediation-strategy=external-baremetal

to

'oc annotate mhc <mhc name> machine.openshift.io/remediation-strategy=external-baremetal'

[miyadav@miyadav bug1800425]$ oc describe machine aiyengar-1103-6nfzf-worker-us-east-2c-q8p6j 
Name:         aiyengar-1103-6nfzf-worker-us-east-2c-q8p6j
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=aiyengar-1103-6nfzf
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
              machine.openshift.io/cluster-api-machineset=aiyengar-1103-6nfzf-worker-us-east-2c
              machine.openshift.io/instance-type=m4.large
              machine.openshift.io/region=us-east-2
              machine.openshift.io/zone=us-east-2c
Annotations:  host.metal3.io/external-remediation: 
              machine.openshift.io/instance-state: running
