Bug 1656270 - [cloud-CA] ClusterAutoscaler maxNodesTotal does not work
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.1.0
Assignee: Andrew McDermott
QA Contact: sunzhaohua
Depends On:
Reported: 2018-12-05 06:30 UTC by sunzhaohua
Modified: 2019-06-04 10:41 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2019-06-04 10:41:04 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:41:10 UTC

Description sunzhaohua 2018-12-05 06:30:26 UTC
Description of problem:
The cluster autoscaler can scale the cluster up past the node count configured in maxNodesTotal.

Version-Release number of selected component (if applicable):
$ bin/openshift-install version
bin/openshift-install v0.5.0-master-2-g78e2c8b144352b1bef854501d3760a9daaaa2eb0
Terraform v0.11.8

How reproducible:

Steps to Reproduce:
1. Create a ClusterAutoscaler resource with maxNodesTotal set to 10
2. Create a pod workload that forces the cluster to scale up
3. Check the number of nodes

Actual results:
The node count exceeds the configured value: 12 nodes (9 workers and 3 masters) with maxNodesTotal=10.

$ oc edit clusterautoscaler default
apiVersion: autoscaling.openshift.io/v1alpha1
kind: ClusterAutoscaler
metadata:
  creationTimestamp: 2018-12-04T04:47:54Z
  generation: 1
  name: default
  resourceVersion: "85156"
  selfLink: /apis/autoscaling.openshift.io/v1alpha1/clusterautoscalers/default
  uid: c3263c80-f77f-11e8-ba7f-0644519597a8
spec:
  resourceLimits:
    maxNodesTotal: 10
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true

$ oc logs -f cluster-autoscaler-default-77f666c784-t5svt
I1204 06:57:18.921091       1 leaderelection.go:187] attempting to acquire leader lease  openshift-cluster-api/cluster-autoscaler...
I1204 06:57:35.138441       1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler
I1204 06:57:45.479845       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3
I1204 06:57:56.302130       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2b size to 3
I1204 06:58:06.400558       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2c size to 3
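
The three scale-up lines above already account for the overshoot. A rough Python sketch of the arithmetic, assuming the log format shown here and the three master nodes that appear in the `oc get node` output (the regex, function, and constant names are illustrative and not part of the autoscaler):

```python
import re

# Log lines copied from the autoscaler output in this report.
LOG = """\
I1204 06:57:45.479845       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3
I1204 06:57:56.302130       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2b size to 3
I1204 06:58:06.400558       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2c size to 3
"""

# Illustrative pattern for the "Scale-up" message; an assumption based on the lines above.
SCALE_UP = re.compile(r"Scale-up: setting group (\S+) size to (\d+)")


def group_sizes(log_text):
    """Return the last requested size for each node group found in the log."""
    sizes = {}
    for match in SCALE_UP.finditer(log_text):
        sizes[match.group(1)] = int(match.group(2))
    return sizes


MAX_NODES_TOTAL = 10  # from the ClusterAutoscaler resource in this report
MASTERS = 3           # master nodes, which are not in any autoscaled group here

sizes = group_sizes(LOG)
total = sum(sizes.values()) + MASTERS
print(sizes)
print(f"total nodes: {total}, limit: {MAX_NODES_TOTAL}, over by: {total - MAX_NODES_TOTAL}")
```

Each group is scaled to 3 independently, so the per-group requests are all accepted even though their sum, together with the masters, passes the limit.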

$ oc get node
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-129-142.us-east-2.compute.internal   Ready     worker    6m        v1.11.0+b74cbdf
ip-10-0-135-191.us-east-2.compute.internal   Ready     worker    2h        v1.11.0+b74cbdf
ip-10-0-139-100.us-east-2.compute.internal   Ready     worker    6m        v1.11.0+b74cbdf
ip-10-0-146-243.us-east-2.compute.internal   Ready     worker    27m       v1.11.0+b74cbdf
ip-10-0-148-83.us-east-2.compute.internal    Ready     worker    6m        v1.11.0+b74cbdf
ip-10-0-15-241.us-east-2.compute.internal    Ready     master    2h        v1.11.0+b74cbdf
ip-10-0-150-26.us-east-2.compute.internal    Ready     worker    6m        v1.11.0+b74cbdf
ip-10-0-160-98.us-east-2.compute.internal    Ready     worker    27m       v1.11.0+b74cbdf
ip-10-0-161-210.us-east-2.compute.internal   Ready     worker    6m        v1.11.0+b74cbdf
ip-10-0-166-156.us-east-2.compute.internal   Ready     worker    6m        v1.11.0+b74cbdf
ip-10-0-21-79.us-east-2.compute.internal     Ready     master    2h        v1.11.0+b74cbdf
ip-10-0-40-58.us-east-2.compute.internal     Ready     master    2h        v1.11.0+b74cbdf

$ oc get machine
NAME                               AGE
qe-zhsun-master-0                  2h
qe-zhsun-master-1                  2h
qe-zhsun-master-2                  2h
qe-zhsun-worker-us-east-2a-mv5t9   9m
qe-zhsun-worker-us-east-2a-rcqf6   9m
qe-zhsun-worker-us-east-2a-xc49n   2h
qe-zhsun-worker-us-east-2b-9l2bs   9m
qe-zhsun-worker-us-east-2b-m5fc9   9m
qe-zhsun-worker-us-east-2b-n4jzn   30m
qe-zhsun-worker-us-east-2c-2xp5f   9m
qe-zhsun-worker-us-east-2c-b5vvw   9m
qe-zhsun-worker-us-east-2c-tx4jp   30m

Expected results:
The node count never exceeds the configured maxNodesTotal.
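
The expected behaviour amounts to clamping each scale-up request to the remaining headroom. A minimal sketch, assuming masters count toward the total (which matches the seven-node result in Comment 2); `allowed_new_nodes` is a hypothetical helper, not the autoscaler's actual code:

```python
def allowed_new_nodes(current_total, requested, max_nodes_total):
    """Clamp a scale-up request so the cluster never exceeds max_nodes_total.

    current_total counts every node in the cluster, masters included.
    Returns how many of the requested nodes may be added (possibly 0).
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    headroom = max(max_nodes_total - current_total, 0)
    return min(requested, headroom)


# Values from this report: 3 masters + 3 workers before scale-up,
# 6 additional workers requested, limit of 10.
print(allowed_new_nodes(current_total=6, requested=6, max_nodes_total=10))  # 4
```

With this clamp, only 4 of the 6 requested workers would be created and the cluster would stop at 10 nodes.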

Additional info:

Comment 1 Andrew McDermott 2018-12-18 16:25:23 UTC
PR - https://github.com/openshift/kubernetes-autoscaler/pull/16

Comment 2 sunzhaohua 2018-12-19 03:06:48 UTC

$ bin/openshift-install version
bin/openshift-install v0.7.0-master-35-gead9f4b779a20dc32d51c3b2429d8d71d48ea043

$ oc version
oc v4.0.0-alpha.0+a2218fc-788
kubernetes v1.11.0+a2218fc
features: Basic-Auth GSSAPI Kerberos SPNEGO

1. Create clusterautoscaler
apiVersion: "autoscaling.openshift.io/v1alpha1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    maxNodesTotal: 7
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s

2. Create machineautoscaler
apiVersion: autoscaling.openshift.io/v1alpha1
kind: MachineAutoscaler
metadata:
  finalizers:
  - machinetarget.autoscaling.openshift.io
  name: autoscale-us-east-2a
  namespace: openshift-cluster-api
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: cluster.k8s.io/v1alpha1
    kind: MachineSet
    name: qe-zhsun-1-worker-us-east-2a
status: {}

3. Create a pod workload to force a scale-up

4. Check the logs and the nodes
$ oc logs -f cluster-autoscaler-default-5777b87c56-kg6sh
I1219 02:27:27.268286       1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/qe-zhsun-1-worker-us-east-2a size to 2
E1219 02:27:37.416081       1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E1219 02:27:47.504674       1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E1219 02:27:57.574213       1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E1219 02:28:07.642231       1 static_autoscaler.go:275] Failed to scale up: max node total count already reached

$ oc get node
NAME                                         STATUS     ROLES     AGE       VERSION
ip-10-0-1-191.us-east-2.compute.internal     Ready      master    23m       v1.11.0+a2218fc
ip-10-0-141-194.us-east-2.compute.internal   Ready      worker    19m       v1.11.0+a2218fc
ip-10-0-148-215.us-east-2.compute.internal   Ready      worker    10m       v1.11.0+a2218fc
ip-10-0-150-181.us-east-2.compute.internal   Ready      worker    19m       v1.11.0+a2218fc
ip-10-0-164-244.us-east-2.compute.internal   Ready      worker    19m       v1.11.0+a2218fc
ip-10-0-27-8.us-east-2.compute.internal      Ready      master    23m       v1.11.0+a2218fc
ip-10-0-32-233.us-east-2.compute.internal    Ready      master    23m       v1.11.0+a2218fc
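
The verification result can be checked mechanically against the limit. A small sketch, assuming the `oc get node` column layout shown above (NAME, STATUS, ROLES, ...); `count_roles` is an illustrative helper name:

```python
# Output copied from the `oc get node` run in this comment.
NODES = """\
NAME                                         STATUS     ROLES     AGE       VERSION
ip-10-0-1-191.us-east-2.compute.internal     Ready      master    23m       v1.11.0+a2218fc
ip-10-0-141-194.us-east-2.compute.internal   Ready      worker    19m       v1.11.0+a2218fc
ip-10-0-148-215.us-east-2.compute.internal   Ready      worker    10m       v1.11.0+a2218fc
ip-10-0-150-181.us-east-2.compute.internal   Ready      worker    19m       v1.11.0+a2218fc
ip-10-0-164-244.us-east-2.compute.internal   Ready      worker    19m       v1.11.0+a2218fc
ip-10-0-27-8.us-east-2.compute.internal      Ready      master    23m       v1.11.0+a2218fc
ip-10-0-32-233.us-east-2.compute.internal    Ready      master    23m       v1.11.0+a2218fc
"""


def count_roles(output):
    """Count nodes per role from `oc get node` text, skipping the header row."""
    rows = [line for line in output.splitlines() if line.strip()][1:]
    roles = {}
    for row in rows:
        role = row.split()[2]  # third column is ROLES
        roles[role] = roles.get(role, 0) + 1
    return roles


MAX_NODES_TOTAL = 7  # from the ClusterAutoscaler created in step 1

roles = count_roles(NODES)
node_total = sum(roles.values())
print(roles, node_total, node_total <= MAX_NODES_TOTAL)
```

Here the cluster holds exactly at the limit of 7 (3 masters and 4 workers), which agrees with the "max node total count already reached" errors in the log.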

Comment 5 errata-xmlrpc 2019-06-04 10:41:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

https://access.redhat.com/errata/RHBA-2019:0758

If the solution does not work for you, open a new bug report.

