Bug 1914837

Summary: Machine API Termination Handlers should be tested
Product: OpenShift Container Platform
Reporter: Joel Speed <jspeed>
Component: Cloud Compute
Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: high
CC: zhsun
Version: 4.7
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1907286
Environment:
Last Closed: 2021-10-18 17:29:03 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1907286
Bug Blocks:

Description Joel Speed 2021-01-11 10:15:50 UTC
+++ This bug was initially created as a clone of Bug #1907286 +++

Description of problem:
The default MHC machine-api-termination-handler does not pick up spot instances: after creating some spot instances, the MHC still reports 0 total targets.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-09-112139   True        False         3d1h    Cluster version is 4.7.0-0.nightly-2020-12-09-112139

How reproducible:
always

Steps to Reproduce:
1. Create a spot instance with "preemptible: true"
2. Check mhc machine-api-termination-handler

Actual results:
The default MHC machine-api-termination-handler reports 0 total targets.

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         1       1            1           machine.openshift.io/interruptible-instance=   3d3h

$ oc get machine
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp11-bjhl5-master-0         Running   n1-standard-4   us-central1   us-central1-a   3d3h
zhsungcp11-bjhl5-master-1         Running   n1-standard-4   us-central1   us-central1-b   3d3h
zhsungcp11-bjhl5-master-2         Running   n1-standard-4   us-central1   us-central1-c   3d3h
zhsungcp11-bjhl5-worker-a-5k5cd   Running   n1-standard-4   us-central1   us-central1-a   3d3h
zhsungcp11-bjhl5-worker-b-vwv2r   Running   n1-standard-4   us-central1   us-central1-b   3d3h
zhsungcp11-bjhl5-worker-c-54m7p   Running   n1-standard-4   us-central1   us-central1-c   163m


$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0

$ oc logs -f machine-api-controllers-6dddcb4fff-fjsxv -c machine-healthcheck-controller
I1211 14:11:05.851755       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/machine-api-termination-handler
I1211 14:11:05.851803       1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/machine-api-termination-handler: finding targets
I1211 14:11:05.851880       1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/machine-api-termination-handler: total targets: 0,  max unhealthy: 100%, unhealthy targets: 0
I1211 14:11:05.859989       1 machinehealthcheck_controller.go:263] Reconciling openshift-machine-api/machine-api-termination-handler: no more targets meet unhealthy criteria

$ oc edit machine zhsungcp11-bjhl5-worker-c-54m7p
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: RUNNING
  creationTimestamp: "2020-12-14T02:56:43Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: zhsungcp11-bjhl5-worker-c-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: zhsungcp11-bjhl5
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: zhsungcp11-bjhl5-worker-c
    machine.openshift.io/instance-type: n1-standard-4
    machine.openshift.io/region: us-central1
    machine.openshift.io/zone: us-central1-c
  name: zhsungcp11-bjhl5-worker-c-54m7p
...
spec:
  metadata:
    labels:
      machine.openshift.io/interruptible-instance: ""
  providerID: gce://openshift-qe/us-central1-c/zhsungcp11-bjhl5-worker-c-54m7p
  
    
Expected results:
The MHC's total target count equals the number of spot instances.

Additional info:
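One observation from the output above: the `machine.openshift.io/interruptible-instance` label appears under `spec.metadata.labels` of the Machine but not under its own `metadata.labels`. The MHC spec is not shown in this report, so the following is only a sketch under the assumption that the MHC selects targets with an equality match on that label against the Machine's own `metadata.labels`. If that assumption holds, a target count of 0 would follow. The helper names (`matches_selector`, `count_targets`) are hypothetical and are not part of any OpenShift API.

```python
# Sketch only: models an equality-based label selector, not the real controller.

def matches_selector(labels, match_labels):
    """Return True when every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def count_targets(machines, match_labels, label_path=("metadata", "labels")):
    """Count machines whose labels at label_path satisfy match_labels."""
    count = 0
    for machine in machines:
        node = machine
        for key in label_path:
            node = node.get(key, {})
        if matches_selector(node, match_labels):
            count += 1
    return count


# Assumed selector (not shown in the report); mirrors the DaemonSet node selector.
SELECTOR = {"machine.openshift.io/interruptible-instance": ""}

# Trimmed copy of the Machine shown in the report.
MACHINE = {
    "metadata": {
        "name": "zhsungcp11-bjhl5-worker-c-54m7p",
        "labels": {
            "machine.openshift.io/cluster-api-machine-role": "worker",
            "machine.openshift.io/cluster-api-machineset": "zhsungcp11-bjhl5-worker-c",
        },
    },
    "spec": {
        "metadata": {
            "labels": {"machine.openshift.io/interruptible-instance": ""},
        },
    },
}

if __name__ == "__main__":
    # Matching against the Machine's own labels versus the labels passed to the node.
    print(count_targets([MACHINE], SELECTOR))
    print(count_targets([MACHINE], SELECTOR, ("spec", "metadata", "labels")))
```

Whether this mismatch is the actual cause was not confirmed in this report; see Bug #1907286 for the original investigation.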

Comment 1 Joel Speed 2021-02-01 11:58:13 UTC
We are still having issues with the GCP version of the termination handler during testing. Needs more investigation.

Comment 2 Joel Speed 2021-02-25 14:05:35 UTC
Still not sure how to resolve this. Our testing breaks the GCP handler because of how DNS works on GCP, so we can't re-enable the tests until we work out a way around it.

Comment 3 Joel Speed 2021-03-19 14:29:48 UTC
Still haven't had time to get into this; hopefully I will next sprint.

Comment 4 Joel Speed 2021-04-19 16:14:56 UTC
Still no time to investigate how to test GCP effectively. Our current approach relies on overriding the localhost binding of the metadata API, but all DNS traffic also takes this route.
An alternative approach may be to configure a proxy that intercepts the traffic and to configure the termination handlers to use that proxy instead, though I'm not sure that will work either.
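
As a rough illustration of the stand-in idea, here is a minimal sketch of a fake metadata endpoint that a test could point a termination handler at. It is not the approach used in the e2e suite. It assumes the handler's metadata URL can be overridden, which this report does not confirm, and it only models the GCP `computeMetadata/v1/instance/preempted` path together with the `Metadata-Flavor: Google` header check. The names `FakeMetadataHandler`, `start_fake_metadata_server` and `query_preempted` are hypothetical.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Path that GCP's metadata server uses to report preemption.
PREEMPTED_PATH = "/computeMetadata/v1/instance/preempted"


class FakeMetadataHandler(BaseHTTPRequestHandler):
    """Serves only the preempted endpoint; flip `preempted` to simulate a notice."""

    preempted = False

    def do_GET(self):
        if self.headers.get("Metadata-Flavor") != "Google":
            self._reply(403, "missing Metadata-Flavor: Google header")
        elif self.path.split("?", 1)[0] == PREEMPTED_PATH:
            self._reply(200, "TRUE" if type(self).preempted else "FALSE")
        else:
            self._reply(404, "not found")

    def _reply(self, status, body):
        data = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/text")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, fmt, *args):
        # Keep test output quiet.
        pass


def start_fake_metadata_server():
    """Start the fake server on an ephemeral localhost port in a daemon thread."""
    server = HTTPServer(("127.0.0.1", 0), FakeMetadataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_preempted(base_url):
    """Query the preempted endpoint the way a handler might, bypassing any env proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(
        base_url + PREEMPTED_PATH, headers={"Metadata-Flavor": "Google"}
    )
    with opener.open(request, timeout=5) as response:
        return response.read().decode()


if __name__ == "__main__":
    server = start_fake_metadata_server()
    base_url = "http://127.0.0.1:%d" % server.server_address[1]
    print(query_preempted(base_url))
    FakeMetadataHandler.preempted = True
    print(query_preempted(base_url))
    server.shutdown()
```

Because the server listens on its own port rather than taking over the metadata address, DNS traffic would be unaffected. The cost is that the handler has to accept a configurable endpoint.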

Comment 8 sunzhaohua 2021-07-15 09:36:59 UTC
Moving to VERIFIED since this is an e2e test change and does not affect functionality.

Comment 11 errata-xmlrpc 2021-10-18 17:29:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
