+++ This bug was initially created as a clone of Bug #1907286 +++

Description of problem:
The default mhc machine-api-termination-handler does not watch spot instances. After creating some spot instances, the mhc total targets is still 0.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-09-112139   True        False         3d1h    Cluster version is 4.7.0-0.nightly-2020-12-09-112139

How reproducible:
always

Steps to Reproduce:
1. Create a spot instance with "preemptible: true"
2. Check mhc machine-api-termination-handler
3.

Actual results:
The default mhc machine-api-termination-handler total targets is 0.

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         1       1            1           machine.openshift.io/interruptible-instance=   3d3h

$ oc get machine
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp11-bjhl5-master-0         Running   n1-standard-4   us-central1   us-central1-a   3d3h
zhsungcp11-bjhl5-master-1         Running   n1-standard-4   us-central1   us-central1-b   3d3h
zhsungcp11-bjhl5-master-2         Running   n1-standard-4   us-central1   us-central1-c   3d3h
zhsungcp11-bjhl5-worker-a-5k5cd   Running   n1-standard-4   us-central1   us-central1-a   3d3h
zhsungcp11-bjhl5-worker-b-vwv2r   Running   n1-standard-4   us-central1   us-central1-b   3d3h
zhsungcp11-bjhl5-worker-c-54m7p   Running   n1-standard-4   us-central1   us-central1-c   163m

$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0

$ oc logs -f machine-api-controllers-6dddcb4fff-fjsxv -c machine-healthcheck-controller
I1211 14:11:05.851755 1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/machine-api-termination-handler
I1211 14:11:05.851803 1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/machine-api-termination-handler: finding targets
I1211 14:11:05.851880 1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/machine-api-termination-handler: total targets: 0, max unhealthy: 100%, unhealthy targets: 0
I1211 14:11:05.859989 1 machinehealthcheck_controller.go:263] Reconciling openshift-machine-api/machine-api-termination-handler: no more targets meet unhealthy criteria

$ oc edit machine zhsungcp11-bjhl5-worker-c-54m7p
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: RUNNING
  creationTimestamp: "2020-12-14T02:56:43Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: zhsungcp11-bjhl5-worker-c-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: zhsungcp11-bjhl5
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: zhsungcp11-bjhl5-worker-c
    machine.openshift.io/instance-type: n1-standard-4
    machine.openshift.io/region: us-central1
    machine.openshift.io/zone: us-central1-c
  name: zhsungcp11-bjhl5-worker-c-54m7p
...
spec:
  metadata:
    labels:
      machine.openshift.io/interruptible-instance: ""
  providerID: gce://openshift-qe/us-central1-c/zhsungcp11-bjhl5-worker-c-54m7p

Expected results:
The total target number is equal to the number of spot instances.

Additional info:
We are still having issues with the GCP version of the termination handler during testing. Needs more investigation.
Still not sure how to resolve this. Our testing breaks the GCP handler because of how DNS works on GCP, so we can't re-enable the tests until we work out a way around this.
I still haven't had time to look into this. Hopefully I will have time next sprint.
Still no time to investigate how to test GCP effectively. Our current approach relies on overriding the localhost binding of the metadata API, but all DNS traffic also takes this route. An alternative approach may be to configure a proxy that intercepts the traffic and configure the termination handlers to observe that proxy, though I'm not sure that will work either.
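For illustration only, here is a rough Go sketch of the proxy idea from the comment above: a fake metadata endpoint that the handler is pointed at through a configurable base URL, so nothing else on the node (DNS included) has to be rerouted. The helper names `isPreempted` and `newFakeMetadataServer` and the configurable base URL are assumptions made for this sketch, not the termination handler's actual code. The `/computeMetadata/v1/instance/preempted` path and the `Metadata-Flavor: Google` header are the ones the GCP metadata server uses.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// preemptedPath is the GCP metadata path that reports preemption ("TRUE"/"FALSE").
const preemptedPath = "/computeMetadata/v1/instance/preempted"

// isPreempted asks the metadata endpoint at baseURL whether the instance has
// been preempted. baseURL is a hypothetical knob: in production it would be
// the real metadata server, in a test it is the fake server below.
func isPreempted(client *http.Client, baseURL string) (bool, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+preemptedPath, nil)
	if err != nil {
		return false, err
	}
	// The GCP metadata server rejects requests without this header.
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(body)) == "TRUE", nil
}

// newFakeMetadataServer starts a local HTTP server (hypothetical test helper)
// that answers only the preemption path, standing in for the metadata API.
func newFakeMetadataServer(preempted bool) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc(preemptedPath, func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Metadata-Flavor") != "Google" {
			http.Error(w, "missing Metadata-Flavor header", http.StatusForbidden)
			return
		}
		if preempted {
			fmt.Fprint(w, "TRUE")
		} else {
			fmt.Fprint(w, "FALSE")
		}
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeMetadataServer(true)
	defer srv.Close()

	preempted, err := isPreempted(srv.Client(), srv.URL)
	fmt.Println(preempted, err)
}
```

Whether this fits the e2e setup depends on the handler accepting a metadata URL override, which is not confirmed in this bug.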
Moving to VERIFIED, as this only concerns the e2e test and does not affect functionality.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759