Description of problem:
Sometimes a machine is the only machine in its MachineSet, but the autoscaler decides the machine should be scaled down and annotates it with "machine.openshift.io/cluster-api-delete-machine". Because the MachineAutoscaler minimum is 1, the autoscaler cannot scale down to 0, so the machine keeps the annotation while its status is still Running.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-01-213929

How reproducible:
Sometimes

Steps to Reproduce:
1. Enable the cluster autoscaler and a MachineAutoscaler with min=1
2. Create a workload to scale up the cluster
3. Delete the workload to scale down the cluster
4. Check the machine annotations

Actual results:
The machine was annotated with "machine.openshift.io/cluster-api-delete-machine" but is still Running.

$ oc get machine machineset-clone-20108-2-xnlwn -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    instance-status: |
      {"kind":"Machine","apiVersion":"machine.openshift.io/v1beta1","metadata":{"name":"machineset-clone-20108-2-xnlwn","generateName":"machineset-clone-20108-2-","namespace":"openshift-machine-api","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/machineset-clone-20108-2-xnlwn","uid":"ecfff356-1812-4682-8efa-149eef01286f","resourceVersion":"139371","generation":1,"creationTimestamp":"2020-03-31T14:20:53Z","labels":{"machine.openshift.io/cluster-api-cluster":"zhsunosp3312-rgcqr","machine.openshift.io/cluster-api-machine-role":"worker","machine.openshift.io/cluster-api-machine-type":"worker","machine.openshift.io/cluster-api-machineset":"machineset-clone-20108-2","machine.openshift.io/instance-type":"m1.xlarge","machine.openshift.io/region":"regionOne","machine.openshift.io/zone":"nova"},"annotations":{"instance-status":"","machine.openshift.io/instance-state":"ACTIVE","openstack-ip-address":"192.168.0.13","openstack-resourceId":"97d43089-c4a4-4500-bdf2-4e56acfcb5dc"},"ownerReferences":[{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"machineset-clone-20108-2","uid":"79390991-fd7a-4cba-aa6c-f601cedfda9e","controller":true,"blockOwnerDeletion":true}],"finalizers":["machine.machine.openshift.io"]},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{"value":{"apiVersion":"openstackproviderconfig.openshift.io/v1alpha1","cloudName":"openstack","cloudsSecret":{"name":"openstack-cloud-credentials","namespace":"openshift-machine-api"},"flavor":"m1.xlarge","image":"zhsunosp3312-rgcqr-rhcos","kind":"OpenstackProviderSpec","metadata":{"creationTimestamp":null},"networks":[{"filter":{},"subnets":[{"filter":{"name":"zhsunosp3312-rgcqr-nodes","tags":"openshiftClusterID=zhsunosp3312-rgcqr"}}]}],"securityGroups":[{"filter":{},"name":"zhsunosp3312-rgcqr-worker"}],"serverMetadata":{"Name":"zhsunosp3312-rgcqr-worker","openshiftClusterID":"zhsunosp3312-rgcqr"},"tags":["openshiftClusterID=zhsunosp3312-rgcqr"],"trunk":true,"userDataSecret":{"name":"worker-user-data"}}}},"status":{"lastUpdated":"2020-03-31T14:23:17Z","phase":"Provisioning"}}
    machine.openshift.io/cluster-api-delete-machine: 2020-03-31 14:33:16.763166269 +0000 UTC m=+770.866538766
    machine.openshift.io/instance-state: ACTIVE
...
status:
  addresses:
  - address: 192.168.0.13
    type: InternalIP
  - address: machineset-clone-20108-2-xnlwn
    type: Hostname
  - address: machineset-clone-20108-2-xnlwn
    type: InternalDNS
  lastUpdated: "2020-03-31T14:27:48Z"
  nodeRef:
    kind: Node
    name: machineset-clone-20108-2-xnlwn
    uid: fb09e8ea-ba23-40eb-b97a-9596a9a2ef4e
  phase: Running

Expected results:
If this machine is the last machine in the MachineSet, the "machine.openshift.io/cluster-api-delete-machine" annotation should not be added.

Additional info:
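The expected behavior amounts to a simple guard before annotating a scale-down candidate. The sketch below is a hypothetical illustration only, not the autoscaler's actual code; shouldMarkForDeletion and its parameters are made-up names. It assumes the decision reduces to comparing the MachineSet's current replica count against the MachineAutoscaler minimum.

```go
package main

import "fmt"

// shouldMarkForDeletion is a hypothetical guard: a machine may be marked
// with the cluster-api-delete-machine annotation only if removing it would
// not take its MachineSet below the configured minimum replica count.
func shouldMarkForDeletion(currentReplicas, minReplicas int) bool {
	return currentReplicas > minReplicas
}

func main() {
	// The buggy case above: one machine left, min=1 — must NOT be marked.
	fmt.Println(shouldMarkForDeletion(1, 1)) // false
	// Two machines with min=1: one of them may still be marked.
	fmt.Println(shouldMarkForDeletion(2, 1)) // true
}
```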
Hi sunzhaohua, I am trying to reproduce this condition but am not having any luck. Would you mind sharing the ClusterAutoscaler and MachineAutoscaler manifests, as well as any information about the workload you are running? Thank you!
Tested on AWS, clusterversion: 4.5.0-0.nightly-2020-04-06-212952

1. Create 2 new machinesets
$ oc get machineset
NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws047-8vmvr-worker-us-east-2a     1         1         1       1           71m
zhsunaws047-8vmvr-worker-us-east-2a-a   1         1         1       1           8m23s
zhsunaws047-8vmvr-worker-us-east-2a-b   1         1         1       1           7m12s
zhsunaws047-8vmvr-worker-us-east-2b     1         1         1       1           71m
zhsunaws047-8vmvr-worker-us-east-2c     1         1         1       1           71m

2. Create a clusterautoscaler
---
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

3. Create 2 machineautoscalers
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-a"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsunaws047-8vmvr-worker-us-east-2a-a
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-b"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsunaws047-8vmvr-worker-us-east-2a-b

$ oc get machineautoscaler
NAME       REF KIND     REF NAME                                MIN   MAX   AGE
worker-a   MachineSet   zhsunaws047-8vmvr-worker-us-east-2a-a   1     3     17m
worker-b   MachineSet   zhsunaws047-8vmvr-worker-us-east-2a-b   1     3     17m

4. Create a workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 15
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: docker.io/library/busybox
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0

5. Wait for the cluster to scale up until all nodes become Ready, then delete the workload
$ oc delete deploy scale-up
deployment.apps "scale-up" deleted

6. Check machines zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk and zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg
$ oc get machine
NAME                                          PHASE     TYPE        REGION      ZONE         AGE
zhsunaws047-8vmvr-master-0                    Running   m4.xlarge   us-east-2   us-east-2a   93m
zhsunaws047-8vmvr-master-1                    Running   m4.xlarge   us-east-2   us-east-2b   93m
zhsunaws047-8vmvr-master-2                    Running   m4.xlarge   us-east-2   us-east-2c   93m
zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk   Running   m4.large    us-east-2   us-east-2a   31m
zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg   Running   m4.large    us-east-2   us-east-2a   17m
zhsunaws047-8vmvr-worker-us-east-2a-fmmq5     Running   m4.large    us-east-2   us-east-2a   81m
zhsunaws047-8vmvr-worker-us-east-2b-qchpf     Running   m4.large    us-east-2   us-east-2b   81m
zhsunaws047-8vmvr-worker-us-east-2c-b6glk     Running   m4.large    us-east-2   us-east-2c   81m

$ oc get machine zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-07 02:15:20.314564427 +0000 UTC m=+702.255500233
    machine.openshift.io/instance-state: running

$ oc get machine zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-07 02:15:20.357407058 +0000 UTC m=+702.298343095
    machine.openshift.io/instance-state: running
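The manual check in step 6 can be sketched programmatically. This is a hypothetical helper for illustration — machine and staleDeleteCandidates are made-up names, and the struct models only the two fields visible in the output above (phase and annotations).

```go
package main

import "fmt"

const deleteAnnotation = "machine.openshift.io/cluster-api-delete-machine"

// machine is a minimal stand-in for the Machine objects shown above;
// only the fields needed for this check are modeled.
type machine struct {
	name        string
	phase       string
	annotations map[string]string
}

// staleDeleteCandidates returns the names of machines that are still
// Running but carry the scale-down delete annotation — the buggy state
// this report describes.
func staleDeleteCandidates(machines []machine) []string {
	var out []string
	for _, m := range machines {
		if m.phase == "Running" && m.annotations[deleteAnnotation] != "" {
			out = append(out, m.name)
		}
	}
	return out
}

func main() {
	machines := []machine{
		{"worker-us-east-2a-a-rqdnk", "Running",
			map[string]string{deleteAnnotation: "2020-04-07 02:15:20"}},
		{"worker-us-east-2a-fmmq5", "Running", map[string]string{}},
	}
	fmt.Println(staleDeleteCandidates(machines)) // [worker-us-east-2a-a-rqdnk]
}
```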
that's perfect, thank you again =)
@sunzhaohua have you experienced this behavior with the default machinesets? I am curious whether this problem is limited to newly added machinesets. I am testing both ways, but I am interested to know if you've seen any patterns.
One more thing, for completeness' sake: would you mind sharing the machinesets you added as well? I want to make sure I am copying your setup as closely as possible.
Just wanted to confirm that I have now been able to reproduce this bug. I have an idea about how to mitigate it and am working through that process now.
I have opened a pull request against the autoscaler repo that seems to alleviate the issue in the tests I ran. https://github.com/openshift/kubernetes-autoscaler/pull/141
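One possible shape for such a mitigation — a hedged sketch only, not necessarily what the linked PR actually does — is to clear the stale annotation whenever a planned scale-down is aborted, so a machine is never left marked while Running. clearStaleDeleteAnnotation is a made-up name for illustration.

```go
package main

import "fmt"

const deleteAnnotation = "machine.openshift.io/cluster-api-delete-machine"

// clearStaleDeleteAnnotation is a hypothetical cleanup step: when a
// scale-down attempt cannot proceed (e.g. the node group is already at
// its minimum size), drop the delete annotation from the machine so it
// is not left marked for deletion indefinitely.
func clearStaleDeleteAnnotation(annotations map[string]string) {
	delete(annotations, deleteAnnotation)
}

func main() {
	ann := map[string]string{
		deleteAnnotation:                     "2020-04-07 02:15:20",
		"machine.openshift.io/instance-state": "running",
	}
	clearStaleDeleteAnnotation(ann)
	_, marked := ann[deleteAnnotation]
	fmt.Println(marked) // false
}
```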
@Michael McCune I have tested this behavior with the default machinesets and hit the same issue.

$ oc get machine zhsunaws047-8vmvr-worker-us-east-2b-gl86s -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-08 07:58:14.943919184 +0000 UTC m=+107676.884855186
    machine.openshift.io/instance-state: running
@sunzhaohua thanks for the update. I think the patch I am proposing will fix the issue; I was not able to reproduce the error after introducing the patched autoscaler.
Verified. Tested with the steps below; I was not able to reproduce the error.
Clusterversion: 4.5.0-0.nightly-2020-05-05-205255

1. Create 2 new machinesets
2. Enable the cluster autoscaler and machineautoscalers with min=1
3. Create a workload to scale up the cluster
4. Delete the workload to scale down the cluster
5. Check the machine annotations
Thanks for reporting back, sunzhaohua! I'm glad to hear it is working =) I have also proposed an automated test for this condition: https://github.com/openshift/cluster-api-actuator-pkg/pull/150
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409