Bug 1800425

Summary: Choose more appropriate annotation for external remediation
Product: OpenShift Container Platform
Component: Cloud Compute
Cloud Compute sub component: BareMetal Provider
Reporter: Andrew Beekhof <abeekhof>
Assignee: Steven Hardy <shardy>
QA Contact: Amit Ugol <augol>
CC: stbenjam, vlaad
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Version: 4.3.0
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-05-15 16:04:29 UTC
Type: Bug

Description Andrew Beekhof 2020-02-07 03:01:27 UTC
Description of problem:

Since MHC has no control over how the machine is remediated, it would be
better not to imply that it will (only) be via reboot.

Update the annotation, variables, functions, and logging as appropriate.


Version-Release number of selected component (if applicable): 4.3

Comment 3 Milind Yadav 2020-02-21 03:46:50 UTC
The test steps are:

create an mhc -> annotate it with the strategy -> stop the instance from the provider console -> monitor the mhc

Expected: if annotated with machine.openshift.io/remediation-strategy=external-baremetal, the machine will not be deleted and remediated by the healthcheck controller.

Please confirm whether the above steps suffice.

Comment 4 Milind Yadav 2020-03-09 07:25:37 UTC
-- Expecting the below steps to cover the testing for the change --

version :
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-08-213224   True        False         6h45m   Cluster version is 4.4.0-0.nightly-2020-03-08-213224


Steps :

1. Create an mhc using the YAML below:
---
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: mh1
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <Your cluster>
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: <Your machineset>
  unhealthyConditions:
    - 
      status: "False"
      timeout: 300s
      type: Ready
    - 
      status: Unknown
      timeout: 300s
      type: Ready

2. Annotate the mhc:
 oc annotate mhc <mhc name> healthchecking.openshift.io/strategy=machine.openshift.io/remediation-strategy=external-baremetal

3. Terminate the machine of the machineset being monitored by the mhc using the IaaS console (AWS in this case)

Actual: Machine remediation did not happen and the machine stays in the Failed state
Expected: No remediation should take place

Comment 5 Andrew Beekhof 2020-03-09 23:54:39 UTC
(In reply to Milind Yadav from comment #4)
> -- Expecting the below steps to cover the testing for the change --
> 
> version :
> NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.4.0-0.nightly-2020-03-08-213224   True        False         6h45m   Cluster version is 4.4.0-0.nightly-2020-03-08-213224
> 
> 
> Steps :
> 
> 1. Create an mhc using the YAML below:
> ---
> apiVersion: machine.openshift.io/v1beta1
> kind: MachineHealthCheck
> metadata:
>   name: mh1
>   namespace: openshift-machine-api
> spec:
>   maxUnhealthy: 3
>   selector:
>     matchLabels:
>       machine.openshift.io/cluster-api-cluster: <Your cluster>
>       machine.openshift.io/cluster-api-machine-role: worker
>       machine.openshift.io/cluster-api-machine-type: worker
>       machine.openshift.io/cluster-api-machineset: <Your machineset>
>   unhealthyConditions:
>     - 
>       status: "False"
>       timeout: 300s
>       type: Ready
>     - 
>       status: Unknown
>       timeout: 300s
>       type: Ready
> 
> 2. Annotate the mhc:
>  oc annotate mhc <mhc name> healthchecking.openshift.io/strategy=machine.openshift.io/remediation-strategy=external-baremetal
> 

This looks wrong.

I think you want: oc annotate mhc <mhc name> machine.openshift.io/remediation-strategy=external-baremetal

> 3. Terminate the machine of the machineset being monitored by the mhc using the
> IaaS console (AWS in this case)
> 
> Actual: Machine remediation did not happen and the machine stays in the Failed state
> Expected: No remediation should take place

Was the 'host.metal3.io/external-remediation' annotation added to the machine associated with the failed node?
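
One way to answer this from the command line is to dump the Machine manifest and look for the annotation key. The following is only a rough sketch: the helper name has_annotation is hypothetical, and it does a plain text match on the manifest rather than parsing it, so treat a match as a hint and confirm by reading the annotations block.

```shell
# Hypothetical helper (not part of oc): succeeds if the given annotation key
# appears as "key:" anywhere in the manifest text read from stdin.
has_annotation() {
  grep -q -F -- "$1:"
}

# Usage against a live cluster (the machine name is a placeholder):
#   oc get machine <machine name> -n openshift-machine-api -o yaml \
#     | has_annotation 'host.metal3.io/external-remediation' \
#     && echo "annotation present" || echo "annotation missing"
```

The check is run against the Machine object, not the Node, since the Node may already be gone once the instance is terminated.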

Comment 6 Milind Yadav 2020-03-11 08:01:21 UTC
I cannot check the annotation on the node, because the node died after the instance hosting it was terminated.

Do you mean whether the annotation 'host.metal3.io/external-remediation' was added to the machine that is showing Failed status?

If so, then no, it wasn't. The only annotation was:

  annotations:
    machine.openshift.io/instance-state: running

Comment 7 Andrew Beekhof 2020-03-11 11:47:28 UTC
(In reply to Milind Yadav from comment #6)
> I cannot check the annotation on the node, because the node died after the
> instance hosting it was terminated.

It should be on the Machine, not the node.
If the Node got deleted, then you've tested the default remediation strategy (deletion), not the baremetal one.

> 
> Do you mean whether the annotation 'host.metal3.io/external-remediation' was
> added to the machine that is showing Failed status?
> 
> If so, then no, it wasn't. The only annotation was:
> 
>   annotations:
>     machine.openshift.io/instance-state: running

I would recommend retesting with 'oc annotate mhc <mhc name> machine.openshift.io/remediation-strategy=external-baremetal'

Comment 8 Milind Yadav 2020-03-12 03:59:43 UTC
@Andrew, I think this is what you expected and is correct. I will update the annotation value as you suggested, thanks. The case is still VERIFIED.

In the validation steps, I updated:

2. Annotate the mhc:
>  oc annotate mhc <mhc name> healthchecking.openshift.io/strategy=machine.openshift.io/remediation-strategy=external-baremetal

to

'oc annotate mhc <mhc name> machine.openshift.io/remediation-strategy=external-baremetal'
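
For repeated test runs, the corrected step could be wrapped in a small shell function. This is a minimal sketch under stated assumptions: the function name annotate_mhc_external is hypothetical, and the default namespace is taken from the MachineHealthCheck YAML in comment 4.

```shell
# Hypothetical wrapper around the corrected command from this bug.
# $1: MachineHealthCheck name; $2: namespace (defaults to openshift-machine-api).
annotate_mhc_external() {
  mhc_name="$1"
  namespace="${2:-openshift-machine-api}"
  oc annotate mhc "$mhc_name" -n "$namespace" \
    machine.openshift.io/remediation-strategy=external-baremetal
}

# Example (mh1 is the name used in the YAML from comment 4):
#   annotate_mhc_external mh1
```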

[miyadav@miyadav bug1800425]$ oc describe machine aiyengar-1103-6nfzf-worker-us-east-2c-q8p6j 
Name:         aiyengar-1103-6nfzf-worker-us-east-2c-q8p6j
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=aiyengar-1103-6nfzf
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
              machine.openshift.io/cluster-api-machineset=aiyengar-1103-6nfzf-worker-us-east-2c
              machine.openshift.io/instance-type=m4.large
              machine.openshift.io/region=us-east-2
              machine.openshift.io/zone=us-east-2c
Annotations:  host.metal3.io/external-remediation: 
              machine.openshift.io/instance-state: running