Bug 1656330

Summary: [cloud-CA] The node hosting cluster-autoscaler can be scaled down
Product: OpenShift Container Platform Reporter: sunzhaohua <zhsun>
Component: Cloud Compute    Assignee: Andrew McDermott <amcdermo>
Status: CLOSED CURRENTRELEASE QA Contact: sunzhaohua <zhsun>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.0    CC: jhou
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-25 11:28:53 UTC Type: Bug

Description sunzhaohua 2018-12-05 09:20:41 UTC
Description of problem:
When scaling up, scaling down, and then scaling up again, the node on which the cluster-autoscaler pod is running can sometimes itself be scaled down. This results in the following issue:

oc logs -f cluster-autoscaler-default-77f884d74b-527bq
Error from server: Get https://ip-10-0-133-103:10250/containerLogs/openshift-cluster-api/cluster-autoscaler-default-77f884d74b-527bq/cluster-autoscaler?follow=true: dial tcp 10.0.133.103:10250: i/o timeout


Version-Release number of selected component (if applicable):
$ bin/openshift-install version
bin/openshift-install v0.5.0-master-2-g78e2c8b144352b1bef854501d3760a9daaaa2eb0
Terraform v0.11.8

How reproducible:
Sometimes

Steps to Reproduce:
1. Create ClusterAutoscaler and MachineAutoscaler resources.
2. Scale the cluster up, down, and up again by creating and deleting workload resources with a ReplicationController (RC).
3. Check the cluster-autoscaler logs.
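A minimal sketch of the resources from step 1. The MachineSet name and namespace are taken from the logs below; the replica bounds and the `scaleTargetRef` apiVersion are illustrative assumptions, not the exact manifests used by the reporter:

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-2a            # illustrative name
  namespace: openshift-cluster-api   # namespace used in this report
spec:
  minReplicas: 1                     # assumed bounds
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: cluster.k8s.io/v1alpha1   # assumed API group for this release
    kind: MachineSet
    name: qe-zhsun-worker-us-east-2a      # MachineSet seen in the logs below
```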

Actual results:
The node that hosted the cluster-autoscaler pod was deleted and a new node was created.

[szh@dhcp-140-12 installer]$ oc get pod -o wide
NAME                                              READY     STATUS    RESTARTS   AGE       IP            NODE                                         NOMINATED NODE
cluster-autoscaler-default-77f884d74b-527bq       1/1       Running   0          1h        10.131.0.9    ip-10-0-133-103.us-east-2.compute.internal   <none>

$ oc logs -f cluster-autoscaler-default-77f884d74b-527bq

I1205 05:29:32.574965       1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
W1205 05:29:32.575171       1 utils.go:404] Failed to remove node ip-10-0-133-103.us-east-2.compute.internal: node group min size reached, skipping unregistered node removal
I1205 05:29:32.638606       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3
I1205 05:29:42.661291       1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
I1205 05:29:42.696036       1 static_autoscaler.go:166] Some unregistered nodes were removed, skipping iteration
I1205 05:29:52.706364       1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
I1205 05:29:52.739285       1 static_autoscaler.go:166] Some unregistered nodes were removed, skipping iteration
I1205 05:30:02.749076       1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
W1205 05:30:02.749129       1 utils.go:404] Failed to remove node ip-10-0-133-103.us-east-2.compute.internal: node group min size reached, skipping unregistered node removal
I1205 05:30:02.795779       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3
I1205 05:30:12.825752       1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
I1205 05:30:12.857847       1 static_autoscaler.go:166] Some unregistered nodes were removed, skipping iteration
I1205 05:30:22.868613       1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
I1205 05:30:22.897762       1 static_autoscaler.go:166] Some unregistered nodes were removed, skipping iteration


The node where the cluster-autoscaler is running is scaled down:
$ oc logs -f cluster-autoscaler-default-77f884d74b-527bq
Error from server: Get https://ip-10-0-133-103:10250/containerLogs/openshift-cluster-api/cluster-autoscaler-default-77f884d74b-527bq/cluster-autoscaler?follow=true: dial tcp 10.0.133.103:10250: i/o timeout



The cluster-autoscaler pod is then rescheduled to a new node:
$ oc get pod -o wide
NAME                                              READY     STATUS    RESTARTS   AGE       IP            NODE                                         NOMINATED NODE
cluster-autoscaler-default-77f884d74b-glxr7       1/1       Running   0          8m        10.128.2.11   ip-10-0-156-4.us-east-2.compute.internal     <none>
cluster-autoscaler-operator-66f5778477-2lj26      1/1       Running   0          2h        10.128.0.2    ip-10-0-26-85.us-east-2.compute.internal     <none>
clusterapi-manager-controllers-7d7c546f98-8pm8q   4/4       Running   0          2h        10.129.0.4    ip-10-0-3-61.us-east-2.compute.internal      <none>
machine-api-operator-65d5f5dd99-jgl7l             1/1       Running   0          2h        10.128.0.3    ip-10-0-26-85.us-east-2.compute.internal     <none>


$ oc logs -f cluster-autoscaler-default-77f884d74b-glxr7
I1205 05:35:37.183234       1 leaderelection.go:187] attempting to acquire leader lease  openshift-cluster-api/cluster-autoscaler...
I1205 05:35:55.734324       1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler
I1205 05:36:06.187294       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2c size to 3
I1205 05:36:17.043336       1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3

Expected results:
It is unclear whether the cluster-autoscaler should be deployed on master nodes to prevent the pod from being scheduled to worker nodes that the autoscaler itself can scale down.
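One way to express the suggestion above is a nodeSelector plus a master toleration in the autoscaler's pod spec. This is a hypothetical sketch of the general Kubernetes mechanism, not necessarily the exact change made by the eventual fix:

```yaml
# Hypothetical pod-spec fragment: pin the autoscaler pod to master nodes
spec:
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
```

With this in place the pod can only land on master nodes, which the autoscaler never scales down, avoiding the self-eviction seen above.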

Additional info:

Comment 1 sunzhaohua 2018-12-26 02:56:13 UTC
The cluster-autoscaler pod is now deployed on a master node.

$ bin/openshift-install version
bin/openshift-install v0.8.0-master-2-g5e7b36d6351c9cc773f1dadc64abf9d7041151b1

$ oc get pod -o wide
NAME                                              READY     STATUS    RESTARTS   AGE       IP            NODE                                        NOMINATED NODE
cluster-autoscaler-default-7c88c947bc-mnnrj       1/1       Running   0          49s       10.129.0.20   ip-10-0-40-163.us-east-2.compute.internal   <none>
cluster-autoscaler-operator-5c467664fb-zb4hk      1/1       Running   0          19m       10.130.0.11   ip-10-0-31-81.us-east-2.compute.internal    <none>
clusterapi-manager-controllers-7d86667974-hvx4r   4/4       Running   0          16m       10.128.0.4    ip-10-0-8-232.us-east-2.compute.internal    <none>
machine-api-operator-6c8f7f459c-b8hbk             1/1       Running   0          19m       10.130.0.8    ip-10-0-31-81.us-east-2.compute.internal    <none>

$ oc get node
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-132-7.us-east-2.compute.internal     Ready     worker    15m       v1.11.0+85a0623
ip-10-0-146-110.us-east-2.compute.internal   Ready     worker    15m       v1.11.0+85a0623
ip-10-0-175-107.us-east-2.compute.internal   Ready     worker    15m       v1.11.0+85a0623
ip-10-0-31-81.us-east-2.compute.internal     Ready     master    20m       v1.11.0+85a0623
ip-10-0-40-163.us-east-2.compute.internal    Ready     master    20m       v1.11.0+85a0623
ip-10-0-8-232.us-east-2.compute.internal     Ready     master    20m       v1.11.0+85a0623

Comment 2 sunzhaohua 2019-01-14 07:48:55 UTC
Verified.

According to comment 1, the autoscaler pod is now deployed on a master node.

Comment 3 Andrew McDermott 2019-02-25 11:28:53 UTC
Closing as this was fixed by: https://github.com/openshift/cluster-autoscaler-operator/pull/19