Bug 1907286 - The default mhc machine-api-termination-handler couldn't watch spot instance
Summary: The default mhc machine-api-termination-handler couldn't watch spot instance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 1914837
Reported: 2020-12-14 06:16 UTC by sunzhaohua
Modified: 2021-02-24 15:43 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1908350 1914837
Environment:
Last Closed: 2021-02-24 15:43:03 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID                                             Private  Priority  Status  Last Updated             Summary
Github                  openshift cluster-api-provider-aws pull 379    0        None      closed  2021-01-11 10:13:49 UTC  Bug 1907286: Ensure Machine is marked interruptible as well as Node
Github                  openshift cluster-api-provider-azure pull 188  0        None      closed  2021-01-13 01:35:29 UTC  Bug 1907286: Ensure Machine is marked interruptible as well as Node
Github                  openshift cluster-api-provider-gcp pull 140    0        None      closed  2021-01-12 07:14:09 UTC  Bug 1907286: Ensure Machine is marked interruptible as well as Node
Red Hat Product Errata  RHSA-2020:5633                                 0        None      None    2021-02-24 15:43:24 UTC  None

Description sunzhaohua 2020-12-14 06:16:25 UTC
Description of problem:
The default MachineHealthCheck (MHC) machine-api-termination-handler does not watch spot instances: after creating spot instances, the MHC still reports 0 total targets.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-09-112139   True        False         3d1h    Cluster version is 4.7.0-0.nightly-2020-12-09-112139

How reproducible:
always

Steps to Reproduce:
1. Create a spot instance (on GCP, set "preemptible: true" in the providerSpec).
2. Check the machine-api-termination-handler MHC.

Actual results:
The default MHC machine-api-termination-handler reports 0 total targets, although the termination handler DaemonSet is running on one node labeled as interruptible.

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         1       1            1           machine.openshift.io/interruptible-instance=   3d3h

$ oc get machine
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp11-bjhl5-master-0         Running   n1-standard-4   us-central1   us-central1-a   3d3h
zhsungcp11-bjhl5-master-1         Running   n1-standard-4   us-central1   us-central1-b   3d3h
zhsungcp11-bjhl5-master-2         Running   n1-standard-4   us-central1   us-central1-c   3d3h
zhsungcp11-bjhl5-worker-a-5k5cd   Running   n1-standard-4   us-central1   us-central1-a   3d3h
zhsungcp11-bjhl5-worker-b-vwv2r   Running   n1-standard-4   us-central1   us-central1-b   3d3h
zhsungcp11-bjhl5-worker-c-54m7p   Running   n1-standard-4   us-central1   us-central1-c   163m


$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0

$ oc logs -f machine-api-controllers-6dddcb4fff-fjsxv -c machine-healthcheck-controller
I1211 14:11:05.851755       1 machinehealthcheck_controller.go:153] Reconciling openshift-machine-api/machine-api-termination-handler
I1211 14:11:05.851803       1 machinehealthcheck_controller.go:171] Reconciling openshift-machine-api/machine-api-termination-handler: finding targets
I1211 14:11:05.851880       1 machinehealthcheck_controller.go:228] Remediations are allowed for openshift-machine-api/machine-api-termination-handler: total targets: 0,  max unhealthy: 100%, unhealthy targets: 0
I1211 14:11:05.859989       1 machinehealthcheck_controller.go:263] Reconciling openshift-machine-api/machine-api-termination-handler: no more targets meet unhealthy criteria
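The log reports that remediations are allowed even though there are no targets, so the MHC reconciles without error but has nothing to watch. The Python sketch below is a rough model of the maxUnhealthy gate, assuming that a percentage is resolved against the total target count with rounding down and that remediation is allowed while the unhealthy count does not exceed that limit. The function names are hypothetical and this is not the controller's code.

```python
# Hypothetical model of the maxUnhealthy gate; the rounding and the
# comparison are assumptions, not taken from the controller source.
def resolve_max_unhealthy(max_unhealthy, total_targets):
    """Resolve an int or "N%" value against the number of targets, rounding down."""
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        return total_targets * int(max_unhealthy[:-1]) // 100
    return int(max_unhealthy)


def remediation_allowed(max_unhealthy, total_targets, unhealthy_targets):
    """Allowed while unhealthy targets do not exceed the resolved limit."""
    return unhealthy_targets <= resolve_max_unhealthy(max_unhealthy, total_targets)


# Values from the log: total targets 0, max unhealthy 100%, unhealthy 0.
print(remediation_allowed("100%", total_targets=0, unhealthy_targets=0))  # True
# With real targets the gate matters: 40% of 5 targets resolves to 2.
print(remediation_allowed("40%", total_targets=5, unhealthy_targets=3))   # False
```

With 0 targets the check passes trivially, which is why the bug shows up only as "total targets: 0" rather than as an error.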

$ oc edit machine zhsungcp11-bjhl5-worker-c-54m7p
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: RUNNING
  creationTimestamp: "2020-12-14T02:56:43Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: zhsungcp11-bjhl5-worker-c-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: zhsungcp11-bjhl5
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: zhsungcp11-bjhl5-worker-c
    machine.openshift.io/instance-type: n1-standard-4
    machine.openshift.io/region: us-central1
    machine.openshift.io/zone: us-central1-c
  name: zhsungcp11-bjhl5-worker-c-54m7p
...
spec:
  metadata:
    labels:
      machine.openshift.io/interruptible-instance: ""
  providerID: gce://openshift-qe/us-central1-c/zhsungcp11-bjhl5-worker-c-54m7p
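The Machine above carries machine.openshift.io/interruptible-instance only under spec.metadata.labels, which the Machine API applies to the Node; that is why the DaemonSet's node selector matches one node. The Machine's own metadata.labels lack the key. Assuming the default MHC selects Machines with matchLabels {machine.openshift.io/interruptible-instance: ""} (the MHC spec is not shown in this report), the Python sketch below models why the target count is 0 and why labeling the Machine itself, as the linked PR titles describe, fixes it. The dict shapes and helper names are hypothetical.

```python
# Hypothetical model of matchLabels selection; not the actual
# machine-healthcheck-controller implementation.
INTERRUPTIBLE = "machine.openshift.io/interruptible-instance"


def matches(selector, labels):
    """matchLabels semantics: every selector key present with an equal value."""
    return all(key in labels and labels[key] == value for key, value in selector.items())


def count_targets(selector, machines):
    """Count Machines whose own metadata.labels match the selector."""
    return sum(1 for m in machines if matches(selector, m["metadata"]["labels"]))


# Assumed selector of the default machine-api-termination-handler MHC.
mhc_selector = {INTERRUPTIBLE: ""}

# Shape of zhsungcp11-bjhl5-worker-c-54m7p from the report (labels abbreviated).
machine = {
    "metadata": {"labels": {
        "machine.openshift.io/cluster-api-machineset": "zhsungcp11-bjhl5-worker-c",
        "machine.openshift.io/zone": "us-central1-c",
    }},
    "spec": {"metadata": {"labels": {INTERRUPTIBLE: ""}}},
}

# Node labels come from spec.metadata.labels, so the DaemonSet nodeSelector matches...
print(matches(mhc_selector, machine["spec"]["metadata"]["labels"]))  # True
# ...but the MHC only looks at the Machine's own labels.
print(count_targets(mhc_selector, [machine]))  # 0

# The fix: mark the Machine itself as interruptible, not only the Node.
machine["metadata"]["labels"][INTERRUPTIBLE] = ""
print(count_targets(mhc_selector, [machine]))  # 1
```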
  
    
Expected results:
The MHC's total targets equal the number of spot instances.

Additional info:

Comment 2 sunzhaohua 2021-01-13 03:14:56 UTC
Verified on AWS, GCP and Azure.
clusterversion: 4.7.0-0.nightly-2021-01-12-203716
$ oc get machine zhsunaws113-9ksps-worker-us-east-2c-2gv7h --show-labels
NAME                                        PHASE     TYPE       REGION      ZONE         AGE   LABELS
zhsunaws113-9ksps-worker-us-east-2c-2gv7h   Running   m5.large   us-east-2   us-east-2c   22m   machine.openshift.io/cluster-api-cluster=zhsunaws113-9ksps,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=zhsunaws113-9ksps-worker-us-east-2c,machine.openshift.io/instance-type=m5.large,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=us-east-2,machine.openshift.io/zone=us-east-2c
[szh@bogon aws]$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           1                  1

$ oc get machine zhsungcp113-zgxg5-worker-c-928pf --show-labels
NAME                               PHASE     TYPE            REGION        ZONE            AGE   LABELS
zhsungcp113-zgxg5-worker-c-928pf   Running   n1-standard-4   us-central1   us-central1-c   22m   machine.openshift.io/cluster-api-cluster=zhsungcp113-zgxg5,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=zhsungcp113-zgxg5-worker-c,machine.openshift.io/instance-type=n1-standard-4,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=us-central1,machine.openshift.io/zone=us-central1-c
[szh@bogon gcp]$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           1                  1

$ oc get machine zhsunazure-tpj2g-worker-northcentralus1-2q8kv --show-labels
NAME                                            PHASE     TYPE              REGION           ZONE   AGE     LABELS
zhsunazure-tpj2g-worker-northcentralus1-2q8kv   Running   Standard_D2s_v3   northcentralus          9m35s   machine.openshift.io/cluster-api-cluster=zhsunazure-tpj2g,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=zhsunazure-tpj2g-worker-northcentralus1,machine.openshift.io/instance-type=Standard_D2s_v3,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=northcentralus
[szh@bogon azure]$ oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           1                  1
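In each verification the Machine's own LABELS column now contains machine.openshift.io/interruptible-instance= (the key with an empty value), and EXPECTEDMACHINES becomes 1. The Python sketch below parses that column and checks it against a selector, assuming the default MHC matches that key with an empty value (the MHC spec is not shown in this report). The helper names are hypothetical and not part of oc.

```python
# Hypothetical check of `oc get machine --show-labels` output.
INTERRUPTIBLE = "machine.openshift.io/interruptible-instance"


def parse_show_labels(column):
    """Turn the LABELS column (comma-separated key=value pairs) into a dict."""
    if column == "<none>":  # oc prints <none> for objects without labels
        return {}
    labels = {}
    for pair in column.split(","):
        key, _, value = pair.partition("=")
        labels[key] = value
    return labels


def matches(selector, labels):
    """matchLabels semantics: every selector key present with an equal value."""
    return all(key in labels and labels[key] == value for key, value in selector.items())


# LABELS column of zhsunaws113-9ksps-worker-us-east-2c-2gv7h from Comment 2.
aws_labels = parse_show_labels(
    "machine.openshift.io/cluster-api-cluster=zhsunaws113-9ksps,"
    "machine.openshift.io/cluster-api-machine-role=worker,"
    "machine.openshift.io/cluster-api-machine-type=worker,"
    "machine.openshift.io/cluster-api-machineset=zhsunaws113-9ksps-worker-us-east-2c,"
    "machine.openshift.io/instance-type=m5.large,"
    "machine.openshift.io/interruptible-instance=,"
    "machine.openshift.io/region=us-east-2,"
    "machine.openshift.io/zone=us-east-2c"
)

mhc_selector = {INTERRUPTIBLE: ""}  # assumed default MHC selector
print(aws_labels[INTERRUPTIBLE] == "")    # True: key present with an empty value
print(matches(mhc_selector, aws_labels))  # True: counted in EXPECTEDMACHINES
```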

Comment 6 errata-xmlrpc 2021-02-24 15:43:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

