Bug 1820410
Summary: [autoscaler] Sometimes the only Machine of a MachineSet gets a delete annotation from the autoscaler

Product: OpenShift Container Platform
Component: Cloud Compute
Cloud Compute sub component: Other Providers
Reporter: sunzhaohua <zhsun>
Assignee: Michael McCune <mimccune>
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: agarcial
Version: 4.4
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: Occasionally during scale-down operations, a condition arises in which the last Machine in a MachineSet carries the annotation marking it for deletion.
Consequence: The last Machine in the MachineSet keeps the deletion annotation after the scale down. The autoscaler does not delete this node, though, because the minimum group size has been reached.
Fix: The fix changes the way Machine annotations are cleared after a scale down that stops at the minimum size.
Result: The annotation no longer persists on the final Machine of the MachineSet.
Clones: 1835873 (view as bug list)
Last Closed: 2020-07-13 17:25:20 UTC
Type: Bug
Bug Blocks: 1835873
Description
sunzhaohua, 2020-04-03 02:08:35 UTC
Michael McCune:

hi sunzhaohua, i am trying to reproduce this condition but i am not having luck. would you mind sharing the ClusterAutoscaler and MachineAutoscaler manifests, as well as any information about the workload you are running? thank you!

sunzhaohua:

Test on aws, clusterversion: 4.5.0-0.nightly-2020-04-06-212952

1. Create 2 new machinesets

```
$ oc get machineset
NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws047-8vmvr-worker-us-east-2a     1         1         1       1           71m
zhsunaws047-8vmvr-worker-us-east-2a-a   1         1         1       1           8m23s
zhsunaws047-8vmvr-worker-us-east-2a-b   1         1         1       1           7m12s
zhsunaws047-8vmvr-worker-us-east-2b     1         1         1       1           71m
zhsunaws047-8vmvr-worker-us-east-2c     1         1         1       1           71m
```

2. Create a clusterautoscaler

```yaml
---
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
```

3. Create 2 machineautoscalers

```yaml
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-a"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsunaws047-8vmvr-worker-us-east-2a-a
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-b"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsunaws047-8vmvr-worker-us-east-2a-b
```

```
$ oc get machineautoscaler
NAME       REF KIND     REF NAME                                MIN   MAX   AGE
worker-a   MachineSet   zhsunaws047-8vmvr-worker-us-east-2a-a   1     3     17m
worker-b   MachineSet   zhsunaws047-8vmvr-worker-us-east-2a-b   1     3     17m
```

4.
Create a workload

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 15
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: docker.io/library/busybox
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0
```

5. Wait for the cluster to scale up until all nodes become ready, then delete the workload

```
$ oc delete deploy scale-up
deployment.apps "scale-up" deleted
```

6. Check machines zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk and zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg

```
$ oc get machine
NAME                                          PHASE     TYPE        REGION      ZONE         AGE
zhsunaws047-8vmvr-master-0                    Running   m4.xlarge   us-east-2   us-east-2a   93m
zhsunaws047-8vmvr-master-1                    Running   m4.xlarge   us-east-2   us-east-2b   93m
zhsunaws047-8vmvr-master-2                    Running   m4.xlarge   us-east-2   us-east-2c   93m
zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk   Running   m4.large    us-east-2   us-east-2a   31m
zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg   Running   m4.large    us-east-2   us-east-2a   17m
zhsunaws047-8vmvr-worker-us-east-2a-fmmq5     Running   m4.large    us-east-2   us-east-2a   81m
zhsunaws047-8vmvr-worker-us-east-2b-qchpf     Running   m4.large    us-east-2   us-east-2b   81m
zhsunaws047-8vmvr-worker-us-east-2c-b6glk     Running   m4.large    us-east-2   us-east-2c   81m
```

```
$ oc get machine zhsunaws047-8vmvr-worker-us-east-2a-a-rqdnk -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-07 02:15:20.314564427 +0000 UTC m=+702.255500233
    machine.openshift.io/instance-state: running
```

```
$ oc get machine zhsunaws047-8vmvr-worker-us-east-2a-b-nfrdg -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-07 02:15:20.357407058 +0000 UTC m=+702.298343095
    machine.openshift.io/instance-state: running
```

Michael McCune:

that's perfect, thank you again =)

@sunzhaohua have you experienced this behavior with the default machinesets? i am curious if this problem is limited to new machinesets that we add. i am testing both ways, just interested to know if you've seen any patterns. one more thing, for completeness' sake, would you mind sharing the machinesets you added as well? i want to make sure i am copying your setup as much as possible.

just wanted to confirm that i have now been able to reproduce this bug. i have an idea about how to mitigate it, but i am working through that process now.

i have added a pull request to the autoscaler repo that seems to alleviate the issue for the tests i ran. https://github.com/openshift/kubernetes-autoscaler/pull/141

sunzhaohua:

@Michael McCune I have tested this behavior with the default machinesets, also met this issue.

```
$ oc get machine zhsunaws047-8vmvr-worker-us-east-2b-gl86s -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/cluster-api-delete-machine: 2020-04-08 07:58:14.943919184 +0000 UTC m=+107676.884855186
    machine.openshift.io/instance-state: running
```

Michael McCune:

@sunzhaohua thanks for the update. i think the patch i am proposing will fix the issue; i was not able to reproduce the error after introducing the patched autoscaler.

sunzhaohua:

Verified. Tested with the steps below, i was not able to reproduce the error.

Clusterversion: 4.5.0-0.nightly-2020-05-05-205255

1. Create 2 new machinesets
2. Enable autoscaler, machineautoscaler with min=1
3. Create workload to scale up the cluster
4. Delete workload to scale down the cluster
5. Check machine annotations

Michael McCune:

thanks for reporting back sunzhaohua! i'm glad to hear it is working =) i have also proposed an automated test for this condition, https://github.com/openshift/cluster-api-actuator-pkg/pull/150

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
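The behavior the fix targets, as described in the Doc Text and addressed by the pull request above, is that a scale down can stop at the MachineSet's minimum size while a surviving Machine still carries the delete annotation. The actual patch lives in openshift/kubernetes-autoscaler#141; the sketch below is only an illustration of that idea with hypothetical names (`scale_down`, plain dicts standing in for Machine objects), not the real autoscaler code:

```python
DELETE_ANNOTATION = "machine.openshift.io/cluster-api-delete-machine"

def scale_down(machines, requested_delta, min_size):
    """Remove up to `requested_delta` machines that carry the delete
    annotation, but never shrink the group below `min_size`. Any marked
    machine that must stay gets its annotation cleared, which is the
    behavior the bug fix restores."""
    allowed = max(len(machines) - min_size, 0)
    to_delete = min(requested_delta, allowed)
    survivors = []
    deleted = 0
    for machine in machines:
        marked = DELETE_ANNOTATION in machine["annotations"]
        if marked and deleted < to_delete:
            deleted += 1  # this machine is actually removed
            continue
        if marked:
            # minimum size reached: unmark the machine that has to stay
            del machine["annotations"][DELETE_ANNOTATION]
        survivors.append(machine)
    return survivors

# Two marked machines, minimum size 1: one is deleted, the other is unmarked.
group = [
    {"name": "worker-a", "annotations": {DELETE_ANNOTATION: "2020-04-07"}},
    {"name": "worker-b", "annotations": {DELETE_ANNOTATION: "2020-04-07"}},
]
remaining = scale_down(group, requested_delta=2, min_size=1)
print([m["name"] for m in remaining])  # ['worker-b']
print(remaining[0]["annotations"])     # {}
```

Without the unmark step, `worker-b` would keep its delete annotation indefinitely, which is exactly the symptom the reporter observed on the last Machine of each MachineSet.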
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
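As a closing note, the verification step used throughout this report (checking whether any Machine still carries the delete annotation) is easy to automate. A minimal sketch, assuming you have saved `oc get machines -n openshift-machine-api -o json` to parse, with a stubbed Machine list standing in for a live cluster:

```python
import json

DELETE_ANNOTATION = "machine.openshift.io/cluster-api-delete-machine"

def machines_marked_for_deletion(machine_list):
    """Return the names of Machines that still carry the autoscaler's
    delete annotation. `machine_list` is the parsed JSON output of
    `oc get machines -n openshift-machine-api -o json`."""
    marked = []
    for machine in machine_list.get("items", []):
        annotations = machine.get("metadata", {}).get("annotations") or {}
        if DELETE_ANNOTATION in annotations:
            marked.append(machine["metadata"]["name"])
    return marked

# Stubbed example instead of a live cluster:
sample = json.loads("""
{"items": [
  {"metadata": {"name": "worker-a-rqdnk",
                "annotations": {"machine.openshift.io/cluster-api-delete-machine": "2020-04-07 02:15:20"}}},
  {"metadata": {"name": "worker-c-b6glk", "annotations": {}}}
]}
""")
print(machines_marked_for_deletion(sample))  # ['worker-a-rqdnk']
```

On a fixed cluster, this list should be empty once a scale down has settled at the MachineSet's minimum size.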