Bug 1820410
| Summary: | [autoscaler] Sometimes the only machine of a machineset gets a delete annotation by autoscaler | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | sunzhaohua <zhsun> | |
| Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> | |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | medium | |||
| Priority: | medium | CC: | agarcial | |
| Version: | 4.4 | |||
| Target Milestone: | --- | |||
| Target Release: | 4.5.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | | Doc Type: | Bug Fix | |
| Doc Text: | Cause: Occasionally during scale-down operations, the last Machine in a MachineSet is marked with the deletion annotation. Consequence: The deletion annotation remains on that Machine after the scale down; the autoscaler does not remove the node because the minimum size of the node group has been reached. Fix: The autoscaler now clears the deletion annotation from Machines after a scale down in which the minimum size has been reached. Result: The annotation no longer persists on the final Machine of the MachineSet. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1835873 | Environment: | |
| Last Closed: | 2020-07-13 17:25:20 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1835873 | |||
hi sunzhaohua, i am trying to reproduce this condition but i am not having luck. would you mind sharing the ClusterAutoscaler and MachineAutoscaler manifests, as well as any information about the workload you are running? thank you!

Tested on AWS, clusterversion: 4.5.0-0.nightly-2020-04-06-212952
1. Create 2 new machinesets (one possible way to clone an existing machineset is sketched after the listing below)
$ oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
zhsunaws047-8vmvr-worker-us-east-2a 1 1 1 1 71m
zhsunaws047-8vmvr-worker-us-east-2a-a 1 1 1 1 8m23s
zhsunaws047-8vmvr-worker-us-east-2a-b 1 1 1 1 7m12s
zhsunaws047-8vmvr-worker-us-east-2b 1 1 1 1 71m
zhsunaws047-8vmvr-worker-us-east-2c 1 1 1 1 71m
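For reference, a minimal sketch of how the two extra machinesets could have been created, assuming they are plain copies of the existing us-east-2a machineset with only the name and machineset label changed (the file name below is illustrative):

$ oc get machineset zhsunaws047-8vmvr-worker-us-east-2a -n openshift-machine-api -o yaml > machineset-2a-a.yaml
# edit machineset-2a-a.yaml: change metadata.name and the
# machine.openshift.io/cluster-api-machineset label in the selector and
# template to zhsunaws047-8vmvr-worker-us-east-2a-a, and drop status,
# uid, resourceVersion and creationTimestamp before creating it
$ oc create -f machineset-2a-a.yaml
# repeat with the "-b" suffix for the second machineset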
2. Create a clusterautoscaler (it can be applied as shown after the manifest below)
---
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
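To apply the resource and confirm it exists (the file name is illustrative):

$ oc create -f clusterautoscaler.yaml
$ oc get clusterautoscaler default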
3. Create 2 machineautoscalers (a quick check on the targeted machinesets follows the listing below)
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-a"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsunaws047-8vmvr-worker-us-east-2a-a
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-b"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsunaws047-8vmvr-worker-us-east-2a-b
$ oc get machineautoscaler
NAME REF KIND REF NAME MIN MAX AGE
worker-a MachineSet zhsunaws047-8vmvr-worker-us-east-2a-a 1 3 17m
worker-b MachineSet zhsunaws047-8vmvr-worker-us-east-2a-b 1 3 17m
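Once the MachineAutoscalers are reconciled, the cluster-autoscaler-operator is expected to annotate the targeted machinesets with the node group's minimum and maximum size. A non-authoritative spot check (the exact annotation keys may vary by release):

$ oc get machineset zhsunaws047-8vmvr-worker-us-east-2a-a -n openshift-machine-api -o yaml | grep -i autoscaler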
4. Create workload (applied and watched as shown after the manifest below)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 15
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: docker.io/library/busybox
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0
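To drive the scale up, the deployment above can be applied and the pods watched until the autoscaler provisions new machines (the file name is illustrative):

$ oc create -f scale-up.yaml
$ oc get pods -l app=scale-up --watch
$ oc get machinesets -n openshift-machine-api --watch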
5. Wait for the cluster to scale up until all nodes become Ready, then delete the workload
$ oc delete deploy scale-up
deployment.apps "scale-up" deleted
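While waiting for the scale down to start (unneededTime above is 10s, so it should begin shortly after the delete), the machines can be watched shrinking back toward the minimum with:

$ oc get machines -n openshift-machine-api --watch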
6. Check machines zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk and zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg (a one-line annotation check is sketched after the output below)
$ oc get machine
NAME PHASE TYPE REGION ZONE AGE
zhsunaws047-8vmvr-master-0 Running m4.xlarge us-east-2 us-east-2a 93m
zhsunaws047-8vmvr-master-1 Running m4.xlarge us-east-2 us-east-2b 93m
zhsunaws047-8vmvr-master-2 Running m4.xlarge us-east-2 us-east-2c 93m
zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk Running m4.large us-east-2 us-east-2a 31m
zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg Running m4.large us-east-2 us-east-2a 17m
zhsunaws047-8vmvr-worker-us-east-2a-fmmq5 Running m4.large us-east-2 us-east-2a 81m
zhsunaws047-8vmvr-worker-us-east-2b-qchpf Running m4.large us-east-2 us-east-2b 81m
zhsunaws047-8vmvr-worker-us-east-2c-b6glk Running m4.large us-east-2 us-east-2c 81m
$ oc get machine zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-07 02:15:20.314564427 +0000 UTC m=+702.255500233
    machine.openshift.io/instance-state: running
$ oc get machine zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-07 02:15:20.357407058 +0000 UTC m=+702.298343095
    machine.openshift.io/instance-state: running
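A compact way to check every machine for the leftover annotation instead of dumping each one (just a convenience; the per-machine output above is what matters):

$ oc get machines -n openshift-machine-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.machine\.openshift\.io/cluster-api-delete-machine}{"\n"}{end}'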
that's perfect, thank you again =)

@sunzhaohua have you experienced this behavior with the default machinesets? i am curious if this problem is limited to new machinesets that we add, i am testing both ways, just interested to know if you've seen any patterns. one more thing, for completeness sake, would you mind sharing the machinesets you added as well? i want to make sure i am copying your setup as much as possible.

just wanted to confirm that i have now been able to reproduce this bug. i have an idea about how to mitigate it, but i am working through that process now.

i have added a pull request to the autoscaler repo that seems to alleviate the issue for the tests i ran. https://github.com/openshift/kubernetes-autoscaler/pull/141

@Michael McCune I have tested this behavior with the default machinesets and hit the same issue as well:
$ oc get machine zhsunaws047-8vmvr-worker-us-east-2b-gl86s -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-08 07:58:14.943919184 +0000 UTC m=+107676.884855186
    machine.openshift.io/instance-state: running
@sunzhaohua thanks for the update. i think the patch i am proposing will fix the issue, i was not able to reproduce the error after introducing the patched autoscaler.

Verified. Tested with the steps below and was not able to reproduce the error.
Clusterversion: 4.5.0-0.nightly-2020-05-05-205255
1. Create 2 new machinesets
2. Enable autoscaler, machineautoscaler with min=1
3. Create workload to scale up the cluster
4. Delete workload to scale down the cluster
5. Check machine annotations

thanks for reporting back sunzhaohua! i'm glad to hear it is working =) i have also proposed an automated test for this condition, https://github.com/openshift/cluster-api-actuator-pkg/pull/150

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
Description of problem:
Sometimes a machine is the only machine in a machineset, but the autoscaler decides it should be scaled down and annotates it with "machine.openshift.io/cluster-api-delete-machine". The autoscaler cannot scale down to 0, so the machine stays in the Running phase with the annotation left in place.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-01-213929

How reproducible:
Sometimes

Steps to Reproduce:
1. Enable autoscaler, machineautoscaler with min=1
2. Create workload to scale up the cluster
3. Delete workload to scale down the cluster
4. Check machine annotations

Actual results:
Machine was annotated with "machine.openshift.io/cluster-api-delete-machine", but is still Running.

$ oc get machine machineset-clone-20108-2-xnlwn -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    instance-status: |
      {"kind":"Machine","apiVersion":"machine.openshift.io/v1beta1","metadata":{"name":"machineset-clone-20108-2-xnlwn","generateName":"machineset-clone-20108-2-","namespace":"openshift-machine-api","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/machineset-clone-20108-2-xnlwn","uid":"ecfff356-1812-4682-8efa-149eef01286f","resourceVersion":"139371","generation":1,"creationTimestamp":"2020-03-31T14:20:53Z","labels":{"machine.openshift.io/cluster-api-cluster":"zhsunosp3312-rgcqr","machine.openshift.io/cluster-api-machine-role":"worker","machine.openshift.io/cluster-api-machine-type":"worker","machine.openshift.io/cluster-api-machineset":"machineset-clone-20108-2","machine.openshift.io/instance-type":"m1.xlarge","machine.openshift.io/region":"regionOne","machine.openshift.io/zone":"nova"},"annotations":{"instance-status":"","machine.openshift.io/instance-state":"ACTIVE","openstack-ip-address":"192.168.0.13","openstack-resourceId":"97d43089-c4a4-4500-bdf2-4e56acfcb5dc"},"ownerReferences":[{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"machineset-clone-20108-2","uid":"79390991-fd7a-4cba-aa6c-f601cedfda9e","controller":true,"blockOwnerDeletion":true}],"finalizers":["machine.machine.openshift.io"]},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{"value":{"apiVersion":"openstackproviderconfig.openshift.io/v1alpha1","cloudName":"openstack","cloudsSecret":{"name":"openstack-cloud-credentials","namespace":"openshift-machine-api"},"flavor":"m1.xlarge","image":"zhsunosp3312-rgcqr-rhcos","kind":"OpenstackProviderSpec","metadata":{"creationTimestamp":null},"networks":[{"filter":{},"subnets":[{"filter":{"name":"zhsunosp3312-rgcqr-nodes","tags":"openshiftClusterID=zhsunosp3312-rgcqr"}}]}],"securityGroups":[{"filter":{},"name":"zhsunosp3312-rgcqr-worker"}],"serverMetadata":{"Name":"zhsunosp3312-rgcqr-worker","openshiftClusterID":"zhsunosp3312-rgcqr"},"tags":["openshiftClusterID=zhsunosp3312-rgcqr"],"trunk":true,"userDataSecret":{"name":"worker-user-data"}}}},"status":{"lastUpdated":"2020-03-31T14:23:17Z","phase":"Provisioning"}}
    machine.openshift.io/cluster-api-delete-machine: 2020-03-31 14:33:16.763166269 +0000 UTC m=+770.866538766
    machine.openshift.io/instance-state: ACTIVE
...
status:
  addresses:
  - address: 192.168.0.13
    type: InternalIP
  - address: machineset-clone-20108-2-xnlwn
    type: Hostname
  - address: machineset-clone-20108-2-xnlwn
    type: InternalDNS
  lastUpdated: "2020-03-31T14:27:48Z"
  nodeRef:
    kind: Node
    name: machineset-clone-20108-2-xnlwn
    uid: fb09e8ea-ba23-40eb-b97a-9596a9a2ef4e
  phase: Running

Expected results:
If this machine is the last machine in the machineset, the "machine.openshift.io/cluster-api-delete-machine" annotation should not be added.

Additional info: