Description of problem:
After updating the ClusterAutoscaler maxNodesTotal value, the autoscaler can scale the cluster up to more nodes than the configured limit.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-29-025207   True        False         2h      Cluster version is 4.0.0-0.nightly-2019-01-29-025207

How reproducible:
Always

Steps to Reproduce:
1. Create the ClusterAutoscaler resource with maxNodesTotal=7.
2. Create pods to scale up the cluster; check the autoscaler logs and the node count.
3. Edit the ClusterAutoscaler resource and set maxNodesTotal=9.

$ oc get clusterautoscaler default -o yaml
apiVersion: autoscaling.openshift.io/v1alpha1
kind: ClusterAutoscaler
metadata:
  generation: 1
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 9
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true

4. Check the autoscaler logs and the node count.

Actual results:
After updating the ClusterAutoscaler maxNodesTotal value, the node count grows beyond the configured limit.

Autoscaler logs before updating maxNodesTotal:
$ oc logs -f cluster-autoscaler-default-686c6d5459-h8dt7
I0130 07:22:05.935702 1 leaderelection.go:187] attempting to acquire leader lease openshift-cluster-api/cluster-autoscaler...
I0130 07:22:21.752204 1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler
I0130 07:23:52.088862 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2c size to 2
E0130 07:24:02.171334 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0130 07:24:12.232822 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached

Autoscaler logs after updating maxNodesTotal:
$ oc logs -f cluster-autoscaler-default-6765bb8dc7-dvgj7
I0130 07:26:43.101080 1 leaderelection.go:187] attempting to acquire leader lease openshift-cluster-api/cluster-autoscaler...
I0130 07:27:04.966954 1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler
I0130 07:27:15.204164 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2c size to 3
I0130 07:27:26.011054 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2b size to 3

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-11-8.us-east-2.compute.internal      Ready    master   3h45m   v1.11.0+dde478551e
ip-10-0-134-88.us-east-2.compute.internal    Ready    worker   3h37m   v1.11.0+dde478551e
ip-10-0-151-165.us-east-2.compute.internal   Ready    worker   2m2s    v1.11.0+dde478551e
ip-10-0-154-152.us-east-2.compute.internal   Ready    worker   3h37m   v1.11.0+dde478551e
ip-10-0-157-185.us-east-2.compute.internal   Ready    worker   2m2s    v1.11.0+dde478551e
ip-10-0-165-148.us-east-2.compute.internal   Ready    worker   5m17s   v1.11.0+dde478551e
ip-10-0-166-152.us-east-2.compute.internal   Ready    worker   2m2s    v1.11.0+dde478551e
ip-10-0-166-24.us-east-2.compute.internal    Ready    worker   3h37m   v1.11.0+dde478551e
ip-10-0-26-144.us-east-2.compute.internal    Ready    master   3h45m   v1.11.0+dde478551e
ip-10-0-46-25.us-east-2.compute.internal     Ready    master   3h45m   v1.11.0+dde478551e

$ oc get machine
NAME                            INSTANCE              STATE     TYPE        REGION      ZONE         AGE
zhsun-master-0                  i-03bdc74b8dd712763   running   m4.xlarge   us-east-2   us-east-2a   3h
zhsun-master-1                  i-040e67812dda04da4   running   m4.xlarge   us-east-2   us-east-2b   3h
zhsun-master-2                  i-08932384bb572f448   running   m4.xlarge   us-east-2   us-east-2c   3h
zhsun-worker-us-east-2a-rzlfv   i-01ab8e6d6007624fd   running   m4.large    us-east-2   us-east-2a   3h
zhsun-worker-us-east-2b-7lqg7   i-0531bd2808d1ddbe2   running   m4.large    us-east-2   us-east-2b   3h
zhsun-worker-us-east-2b-rlng5   i-0608c403d0a98493f   running   m4.large    us-east-2   us-east-2b   7m
zhsun-worker-us-east-2b-xr4l7   i-0eecf58bed7a134e7   running   m4.large    us-east-2   us-east-2b   7m
zhsun-worker-us-east-2c-7b4lh   i-0690a898daee5f27f   running   m4.large    us-east-2   us-east-2c   7m
zhsun-worker-us-east-2c-ct67b   i-04bf06175e22da477   running   m4.large    us-east-2   us-east-2c   3h
zhsun-worker-us-east-2c-r7tfd   i-0fb7c85636caf6b6b   running   m4.large    us-east-2   us-east-2c   11m

Expected results:
After updating the ClusterAutoscaler maxNodesTotal value, the node count should not exceed the configured limit.

Additional info:
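A quick way to compare the configured limit with the live node count while reproducing this (a minimal sketch using standard oc commands; the jsonpath follows the ClusterAutoscaler spec shown above):

$ oc get clusterautoscaler default -o jsonpath='{.spec.resourceLimits.maxNodesTotal}'
$ oc get nodes --no-headers | wc -l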
Trying to figure out what exactly is going on here. It looks like the cluster was scaling up, the maxNodesTotal value was increased while that was happening, and eventually the cluster exceeded the maximum size. Is that right? I suspect this may only happen if the autoscaler restarts before the new nodes are ready. Can you confirm whether, at the time the autoscaler restarted to pick up the new maxNodesTotal value, there were any nodes that were not yet in a "Ready" state?
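For reference, a hedged sketch of how that could be checked (namespace and pod-name prefix taken from the logs above; the second command sorts nodes by creation time so the newest, possibly not-yet-Ready ones appear last):

$ oc get pods -n openshift-cluster-api -o wide | grep cluster-autoscaler-default
$ oc get nodes --sort-by=.metadata.creationTimestamp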
Brad, sometimes this also happens after the new nodes are ready. These are the steps I tested:

1. Set maxNodesTotal=7; the autoscaler works as expected.
2. After the new nodes are ready, update maxNodesTotal=9; the autoscaler works as expected.
3. After the new nodes are ready, update maxNodesTotal=11; eventually the cluster exceeds the max size.

Step 2:
$ oc logs -f cluster-autoscaler-default-6789dcfb79-wg42n
I0131 02:58:35.357410 1 leaderelection.go:187] attempting to acquire leader lease openshift-cluster-api/cluster-autoscaler...
I0131 02:58:52.795632 1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler
I0131 02:59:03.026393 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2b size to 4
E0131 02:59:13.794678 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 02:59:23.877088 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 02:59:33.950753 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 02:59:44.022705 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 02:59:54.107962 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:00:04.198743 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:00:14.279314 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:00:24.358894 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:00:34.434187 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:00:44.513815 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
rpc error: code = Unknown desc = container with ID starting with 96b95ae1e743401de54f9cf990482c9146b392a7272b2e71997b7ef6b2137ed0 not found: ID does not exist

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-129-37.us-east-2.compute.internal    Ready    worker   40m     v1.11.0+dde478551e
ip-10-0-148-154.us-east-2.compute.internal   Ready    worker   4m17s   v1.11.0+dde478551e
ip-10-0-149-135.us-east-2.compute.internal   Ready    worker   40m     v1.11.0+dde478551e
ip-10-0-153-105.us-east-2.compute.internal   Ready    worker   9m44s   v1.11.0+dde478551e
ip-10-0-157-86.us-east-2.compute.internal    Ready    worker   4m17s   v1.11.0+dde478551e
ip-10-0-165-193.us-east-2.compute.internal   Ready    worker   40m     v1.11.0+dde478551e
ip-10-0-26-123.us-east-2.compute.internal    Ready    master   49m     v1.11.0+dde478551e
ip-10-0-4-37.us-east-2.compute.internal      Ready    master   49m     v1.11.0+dde478551e
ip-10-0-45-63.us-east-2.compute.internal     Ready    master   49m     v1.11.0+dde478551e

Step 3:
$ oc edit clusterautoscaler default
clusterautoscaler.autoscaling.openshift.io/default edited

apiVersion: autoscaling.openshift.io/v1alpha1
kind: ClusterAutoscaler
metadata:
  creationTimestamp: 2019-01-31T02:50:34Z
  generation: 1
  name: default
  resourceVersion: "36731"
  selfLink: /apis/autoscaling.openshift.io/v1alpha1/clusterautoscalers/default
  uid: fb192df3-2502-11e9-86b8-024f56e29114
spec:
  resourceLimits:
    maxNodesTotal: 11
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true

$ oc logs -f cluster-autoscaler-default-8745d955d-lj6jb
I0131 03:05:48.073916 1 leaderelection.go:187] attempting to acquire leader lease openshift-cluster-api/cluster-autoscaler...
I0131 03:06:03.234709 1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler
I0131 03:06:13.833075 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2b size to 5
I0131 03:06:24.797856 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2a size to 2
E0131 03:06:34.898959 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:06:44.971573 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:06:55.047775 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:07:05.120855 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:07:15.207749 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:07:25.278675 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
E0131 03:07:35.360769 1 static_autoscaler.go:275] Failed to scale up: max node total count already reached
I0131 03:07:45.430281 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2a size to 4
I0131 03:08:45.993273 1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2c size to 2

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-129-37.us-east-2.compute.internal    Ready    worker   48m     v1.11.0+dde478551e
ip-10-0-131-31.us-east-2.compute.internal    Ready    worker   4m29s   v1.11.0+dde478551e
ip-10-0-132-184.us-east-2.compute.internal   Ready    worker   5m51s   v1.11.0+dde478551e
ip-10-0-134-247.us-east-2.compute.internal   Ready    worker   4m28s   v1.11.0+dde478551e
ip-10-0-148-154.us-east-2.compute.internal   Ready    worker   13m     v1.11.0+dde478551e
ip-10-0-149-135.us-east-2.compute.internal   Ready    worker   48m     v1.11.0+dde478551e
ip-10-0-153-105.us-east-2.compute.internal   Ready    worker   18m     v1.11.0+dde478551e
ip-10-0-155-143.us-east-2.compute.internal   Ready    worker   5m39s   v1.11.0+dde478551e
ip-10-0-157-86.us-east-2.compute.internal    Ready    worker   13m     v1.11.0+dde478551e
ip-10-0-165-193.us-east-2.compute.internal   Ready    worker   49m     v1.11.0+dde478551e
ip-10-0-173-245.us-east-2.compute.internal   Ready    worker   3m34s   v1.11.0+dde478551e
ip-10-0-26-123.us-east-2.compute.internal    Ready    master   58m     v1.11.0+dde478551e
ip-10-0-4-37.us-east-2.compute.internal      Ready    master   58m     v1.11.0+dde478551e
ip-10-0-45-63.us-east-2.compute.internal     Ready    master   58m     v1.11.0+dde478551e

$ oc get deploy scale-up -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: 2019-01-31T02:53:13Z
  generation: 1
  labels:
    app: scale-up
  name: scale-up
  namespace: openshift-cluster-api
  resourceVersion: "41805"
  selfLink: /apis/extensions/v1beta1/namespaces/openshift-cluster-api/deployments/scale-up
  uid: 5a148ff9-2503-11e9-99ef-0aa1875edafe
spec:
  progressDeadlineSeconds: 2147483647
  replicas: 35
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: scale-up
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: scale-up
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - echo 'this should be in the logs' && sleep 86400
        image: docker.io/library/busybox
        imagePullPolicy: Always
        name: busybox
        resources:
          requests:
            memory: 2Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 0
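One way to see the overshoot as it happens is to watch the MachineSet replica counts and the total machine count while the scale-up deployment is pending (a minimal sketch; the openshift-cluster-api namespace is taken from the deployment above and may be openshift-machine-api in later releases):

$ oc get machinesets -n openshift-cluster-api -w
$ oc get machines -n openshift-cluster-api --no-headers | wc -l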
I spent a significant amount of time trying to reproduce this, both on AWS and using the kubemark [actuator], but was not able to reproduce it. If you run the steps again with the latest installer version, does this still reproduce for you?
It can still be reproduced.

Reproduce steps:

1. Create the ClusterAutoscaler with maxNodesTotal set to 7. (The YAML below was captured after the later edits in steps 4 and 5, which is why it already shows maxNodesTotal: 11.)

apiVersion: autoscaling.openshift.io/v1alpha1
kind: ClusterAutoscaler
metadata:
  creationTimestamp: 2019-02-28T10:02:10Z
  generation: 4
  name: default
  resourceVersion: "33039"
  selfLink: /apis/autoscaling.openshift.io/v1alpha1/clusterautoscalers/default
  uid: e9dbc0ba-3b3f-11e9-8a15-0add85e6ca2e
spec:
  resourceLimits:
    maxNodesTotal: 11
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true
    unneededTime: 10s

2. Create the MachineAutoscalers.

$ oc get machineautoscaler
NAME                   REF KIND     REF NAME                         MIN   MAX   AGE
autoscale-us-east-2a   MachineSet   zhsun5-pmx48-worker-us-east-2a   1     5     26m
autoscale-us-east-2b   MachineSet   zhsun5-pmx48-worker-us-east-2b   1     5     25m
autoscale-us-east-2c   MachineSet   zhsun5-pmx48-worker-us-east-2c   1     5     25m

3. Create pods to scale up the cluster.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 35
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: docker.io/library/busybox
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0

4. After the new nodes are ready, update maxNodesTotal=9; the autoscaler works as expected.
5. After the new nodes are ready, update maxNodesTotal=11; eventually the cluster exceeds the max size.

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-132-2.us-east-2.compute.internal     Ready    worker   34m     v1.12.4+4dd65df23d
ip-10-0-137-81.us-east-2.compute.internal    Ready    master   51m     v1.12.4+4dd65df23d
ip-10-0-141-44.us-east-2.compute.internal    Ready    worker   102s    v1.12.4+4dd65df23d
ip-10-0-151-141.us-east-2.compute.internal   Ready    worker   6m49s   v1.12.4+4dd65df23d
ip-10-0-153-33.us-east-2.compute.internal    Ready    master   51m     v1.12.4+4dd65df23d
ip-10-0-153-48.us-east-2.compute.internal    Ready    worker   3m15s   v1.12.4+4dd65df23d
ip-10-0-154-206.us-east-2.compute.internal   Ready    worker   34m     v1.12.4+4dd65df23d
ip-10-0-156-140.us-east-2.compute.internal   Ready    worker   6m49s   v1.12.4+4dd65df23d
ip-10-0-157-6.us-east-2.compute.internal     Ready    worker   3m15s   v1.12.4+4dd65df23d
ip-10-0-160-240.us-east-2.compute.internal   Ready    worker   22m     v1.12.4+4dd65df23d
ip-10-0-167-174.us-east-2.compute.internal   Ready    worker   34m     v1.12.4+4dd65df23d
ip-10-0-168-151.us-east-2.compute.internal   Ready    worker   14m     v1.12.4+4dd65df23d
ip-10-0-172-109.us-east-2.compute.internal   Ready    master   51m     v1.12.4+4dd65df23d
ip-10-0-174-103.us-east-2.compute.internal   Ready    worker   14m     v1.12.4+4dd65df23d
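For completeness, a sketch of what one of the MachineAutoscaler manifests behind the `oc get machineautoscaler` output above might look like; the apiVersion values and the namespace are assumptions and may differ in this build:

apiVersion: autoscaling.openshift.io/v1beta1   # assumption; may be a different version in this build
kind: MachineAutoscaler
metadata:
  name: autoscale-us-east-2a
  namespace: openshift-cluster-api   # assumption; later releases use openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1   # assumption; match the MachineSet apiVersion in the cluster
    kind: MachineSet
    name: zhsun5-pmx48-worker-us-east-2a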
Created attachment 1539422 [details] maxNodesTotal=7
Created attachment 1539423 [details] maxNodesTotal=9
Created attachment 1539424 [details] maxNodesTotal=11
(In reply to sunzhaohua from comment #4)
> It can still be reproduced.
> [reproduction steps, manifests, and node listing snipped]

Can you try again, but this time using only scaleDown: enabled: true in the CA config?
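That is, a minimal ClusterAutoscaler along these lines (a sketch based on the resource shown above, with the delayAfter* and unneededTime fields dropped):

apiVersion: autoscaling.openshift.io/v1alpha1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 11
  scaleDown:
    enabled: true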
I have been able to reproduce this twice today. Will continue to investigate as it doesn't happen every time. Thanks for the logs.
PR - https://github.com/openshift/kubernetes-autoscaler/pull/46
I'm not sure where to go with this; I simply cannot reproduce it on the two clusters you have shared with me, on my own cluster, or via kubemark. I have tried on and off for over a week now. I will look at extending our e2e tests to do something similar (if not identical) so that we validate this per commit.
Made additional progress here; I will either update the existing PR or close it and raise a new one, as the fix looks to be simpler than the one proposed in https://github.com/openshift/kubernetes-autoscaler/pull/46.
New PR: https://github.com/openshift/kubernetes-autoscaler/pull/47
Verified; it worked as expected. Thanks, Andrew McDermott. clusterversion: 4.0.0-0.nightly-2019-03-06-074438
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758