Bug 1656270
| Summary: | [cloud-CA] ClusterAutoscaler maxNodesTotal does not work | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | sunzhaohua <zhsun> |
| Component: | Cloud Compute | Assignee: | Andrew McDermott <amcdermo> |
| Status: | CLOSED ERRATA | QA Contact: | sunzhaohua <zhsun> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | amcdermo, jhou |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:41:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Verified
$ bin/openshift-install version
bin/openshift-install v0.7.0-master-35-gead9f4b779a20dc32d51c3b2429d8d71d48ea043
$ oc version
oc v4.0.0-alpha.0+a2218fc-788
kubernetes v1.11.0+a2218fc
features: Basic-Auth GSSAPI Kerberos SPNEGO
1. Create clusterautoscaler
apiVersion: "autoscaling.openshift.io/v1alpha1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  resourceLimits:
    maxNodesTotal: 7
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
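The "max node total count already reached" errors in the log output below come from the autoscaler's node-count guard. A minimal sketch of that guard (a simplified illustration, not the actual cluster-autoscaler code; the function name is hypothetical — upstream cluster-autoscaler reportedly treats a limit of 0 as "no limit"):

```python
def can_scale_up(current_node_count: int, max_nodes_total: int) -> bool:
    """Sketch of the maxNodesTotal guard: scale-up is refused once the
    total node count (masters included) reaches the configured limit."""
    # Assumption: 0 disables the limit, mirroring upstream flag semantics.
    if max_nodes_total == 0:
        return True
    return current_node_count < max_nodes_total

# With maxNodesTotal: 7 and the 7 nodes (3 masters + 4 workers) listed
# below, any further scale-up attempt is rejected.
print(can_scale_up(7, 7))  # False
print(can_scale_up(6, 7))  # True
```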
2. Create machineautoscaler
apiVersion: autoscaling.openshift.io/v1alpha1
kind: MachineAutoscaler
metadata:
  finalizers:
  - machinetarget.autoscaling.openshift.io
  name: autoscale-us-east-2a
  namespace: openshift-cluster-api
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: cluster.k8s.io/v1alpha1
    kind: MachineSet
    name: qe-zhsun-1-worker-us-east-2a
status: {}
3. Create a pod to trigger scale-up
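The pod used in this step is not shown; a typical way to force a scale-up is a workload whose aggregate resource requests exceed current capacity. A hypothetical example (the name, replica count, and request sizes are illustrative, not from the report):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up-test
spec:
  replicas: 20
  selector:
    matchLabels:
      app: scale-up-test
  template:
    metadata:
      labels:
        app: scale-up-test
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
```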
4. Check the autoscaler logs and the nodes
$ oc logs -f cluster-autoscaler-default-5777b87c56-kg6sh
I1219 02:27:27.268286 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/qe-zhsun-1-worker-us-east-2a size to 2
E1219 02:27:37.416081 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E1219 02:27:47.504674 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E1219 02:27:57.574213 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E1219 02:28:07.642231 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-1-191.us-east-2.compute.internal     Ready    master   23m   v1.11.0+a2218fc
ip-10-0-141-194.us-east-2.compute.internal   Ready    worker   19m   v1.11.0+a2218fc
ip-10-0-148-215.us-east-2.compute.internal   Ready    worker   10m   v1.11.0+a2218fc
ip-10-0-150-181.us-east-2.compute.internal   Ready    worker   19m   v1.11.0+a2218fc
ip-10-0-164-244.us-east-2.compute.internal   Ready    worker   19m   v1.11.0+a2218fc
ip-10-0-27-8.us-east-2.compute.internal      Ready    master   23m   v1.11.0+a2218fc
ip-10-0-32-233.us-east-2.compute.internal    Ready    master   23m   v1.11.0+a2218fc
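The node list is consistent with the limit: 3 masters plus 4 workers is 7 nodes, exactly the configured maxNodesTotal of 7, which is why every further scale-up attempt in the log is rejected. A quick check of the arithmetic (counts taken from the output above):

```python
# Counts from the `oc get node` output above.
nodes = {"master": 3, "worker": 4}
max_nodes_total = 7  # from the ClusterAutoscaler spec

total = sum(nodes.values())
# The limit appears to count all nodes, masters included, so once
# total == max_nodes_total no further scale-up is allowed.
print(total)  # 7
print(total >= max_nodes_total)  # True
```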
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758
Description of problem:
The cluster autoscaler can scale up to a node count greater than the configured maximum.

Version-Release number of selected component (if applicable):
$ bin/openshift-install version
bin/openshift-install v0.5.0-master-2-g78e2c8b144352b1bef854501d3760a9daaaa2eb0
Terraform v0.11.8

How reproducible:
Always

Steps to Reproduce:
1. Create a clusterautoscaler resource with maxNodesTotal=10
2. Create a pod to scale up the cluster
3. Check the node count

Actual results:
The node count exceeds the configured value.

$ oc edit clusterautoscaler default
apiVersion: autoscaling.openshift.io/v1alpha1
kind: ClusterAutoscaler
metadata:
  creationTimestamp: 2018-12-04T04:47:54Z
  generation: 1
  name: default
  resourceVersion: "85156"
  selfLink: /apis/autoscaling.openshift.io/v1alpha1/clusterautoscalers/default
  uid: c3263c80-f77f-11e8-ba7f-0644519597a8
spec:
  resourceLimits:
    maxNodesTotal: 10
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true

$ oc logs -f cluster-autoscaler-default-77f666c784-t5svt
I1204 06:57:18.921091 1 leaderelection.go:187] attempting to acquire leader lease openshift-cluster-api/cluster-autoscaler...
I1204 06:57:35.138441 1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler
I1204 06:57:45.479845 1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3
I1204 06:57:56.302130 1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2b size to 3
I1204 06:58:06.400558 1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2c size to 3

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-142.us-east-2.compute.internal   Ready    worker   6m    v1.11.0+b74cbdf
ip-10-0-135-191.us-east-2.compute.internal   Ready    worker   2h    v1.11.0+b74cbdf
ip-10-0-139-100.us-east-2.compute.internal   Ready    worker   6m    v1.11.0+b74cbdf
ip-10-0-146-243.us-east-2.compute.internal   Ready    worker   27m   v1.11.0+b74cbdf
ip-10-0-148-83.us-east-2.compute.internal    Ready    worker   6m    v1.11.0+b74cbdf
ip-10-0-15-241.us-east-2.compute.internal    Ready    master   2h    v1.11.0+b74cbdf
ip-10-0-150-26.us-east-2.compute.internal    Ready    worker   6m    v1.11.0+b74cbdf
ip-10-0-160-98.us-east-2.compute.internal    Ready    worker   27m   v1.11.0+b74cbdf
ip-10-0-161-210.us-east-2.compute.internal   Ready    worker   6m    v1.11.0+b74cbdf
ip-10-0-166-156.us-east-2.compute.internal   Ready    worker   6m    v1.11.0+b74cbdf
ip-10-0-21-79.us-east-2.compute.internal     Ready    master   2h    v1.11.0+b74cbdf
ip-10-0-40-58.us-east-2.compute.internal     Ready    master   2h    v1.11.0+b74cbdf

$ oc get machine
NAME                               AGE
qe-zhsun-master-0                  2h
qe-zhsun-master-1                  2h
qe-zhsun-master-2                  2h
qe-zhsun-worker-us-east-2a-mv5t9   9m
qe-zhsun-worker-us-east-2a-rcqf6   9m
qe-zhsun-worker-us-east-2a-xc49n   2h
qe-zhsun-worker-us-east-2b-9l2bs   9m
qe-zhsun-worker-us-east-2b-m5fc9   9m
qe-zhsun-worker-us-east-2b-n4jzn   30m
qe-zhsun-worker-us-east-2c-2xp5f   9m
qe-zhsun-worker-us-east-2c-b5vvw   9m
qe-zhsun-worker-us-east-2c-tx4jp   30m

Expected results:
The node count should not exceed the configured value.

Additional info:
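The log above shows three node groups each being set to size 3 in consecutive loop iterations, ending with 12 nodes against a limit of 10. One plausible way such an overshoot arises (a simplified sketch, not the actual autoscaler code; the function and variable names are hypothetical) is checking the limit only against the *observed* node count, while nodes requested in earlier iterations are still booting and not yet counted:

```python
# Hypothetical sketch of how a per-iteration limit check can overshoot
# maxNodesTotal when previously requested nodes have not yet registered.
def run_iterations(observed_nodes, max_nodes_total, groups, grow_by):
    """Each iteration re-checks only the observed node count, so nodes
    requested in earlier iterations (still booting) are not included."""
    requested = 0
    for group in groups:
        # Buggy check: ignores 'requested' nodes that are not Ready yet.
        if observed_nodes < max_nodes_total:
            requested += grow_by
    return observed_nodes + requested

# 6 nodes observed (3 masters + 3 workers), limit 10, three zonal
# machinesets each grown by 2: every check sees 6 < 10 and passes,
# so the cluster ends up with 12 nodes, as in the output above.
print(run_iterations(6, 10, ["us-east-2a", "us-east-2b", "us-east-2c"], 2))  # 12
```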