Description of problem:
After scaling up, scaling down, and then scaling up again, the node on which the cluster-autoscaler pod is running can itself be scaled down. This results in the following issue:

$ oc logs -f cluster-autoscaler-default-77f884d74b-527bq
Error from server: Get https://ip-10-0-133-103:10250/containerLogs/openshift-cluster-api/cluster-autoscaler-default-77f884d74b-527bq/cluster-autoscaler?follow=true: dial tcp 10.0.133.103:10250: i/o timeout

Version-Release number of selected component (if applicable):
$ bin/openshift-install version
bin/openshift-install v0.5.0-master-2-g78e2c8b144352b1bef854501d3760a9daaaa2eb0
Terraform v0.11.8

How reproducible:
Sometimes

Steps to Reproduce:
1. Create clusterautoscaler and machineautoscaler resources.
2. Scale the cluster up, down, and up again by creating/deleting resources using an RC.
3. Check the logs.

Actual results:
The node which hosted the cluster-autoscaler pod was deleted and a new node was created.

$ oc get pod -o wide
NAME                                          READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE
cluster-autoscaler-default-77f884d74b-527bq   1/1     Running   0          1h    10.131.0.9   ip-10-0-133-103.us-east-2.compute.internal   <none>

$ oc logs -f cluster-autoscaler-default-77f884d74b-527bq
I1205 05:29:32.574965  1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
W1205 05:29:32.575171  1 utils.go:404] Failed to remove node ip-10-0-133-103.us-east-2.compute.internal: node group min size reached, skipping unregistered node removal
I1205 05:29:32.638606  1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3
I1205 05:29:42.661291  1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
I1205 05:29:42.696036  1 static_autoscaler.go:166] Some unregistered nodes were removed, skipping iteration
I1205 05:29:52.706364  1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
I1205 05:29:52.739285  1 static_autoscaler.go:166] Some unregistered nodes were removed, skipping iteration
I1205 05:30:02.749076  1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
W1205 05:30:02.749129  1 utils.go:404] Failed to remove node ip-10-0-133-103.us-east-2.compute.internal: node group min size reached, skipping unregistered node removal
I1205 05:30:02.795779  1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3
I1205 05:30:12.825752  1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
I1205 05:30:12.857847  1 static_autoscaler.go:166] Some unregistered nodes were removed, skipping iteration
I1205 05:30:22.868613  1 utils.go:388] Removing unregistered node ip-10-0-133-103.us-east-2.compute.internal
I1205 05:30:22.897762  1 static_autoscaler.go:166] Some unregistered nodes were removed, skipping iteration

The node where cluster-autoscaler was running is scaled down:

$ oc logs -f cluster-autoscaler-default-77f884d74b-527bq
Error from server: Get https://ip-10-0-133-103:10250/containerLogs/openshift-cluster-api/cluster-autoscaler-default-77f884d74b-527bq/cluster-autoscaler?follow=true: dial tcp 10.0.133.103:10250: i/o timeout

And cluster-autoscaler is rescheduled to a new node:

$ oc get pod -o wide
NAME                                              READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE
cluster-autoscaler-default-77f884d74b-glxr7       1/1     Running   0          8m    10.128.2.11   ip-10-0-156-4.us-east-2.compute.internal   <none>
cluster-autoscaler-operator-66f5778477-2lj26      1/1     Running   0          2h    10.128.0.2    ip-10-0-26-85.us-east-2.compute.internal   <none>
clusterapi-manager-controllers-7d7c546f98-8pm8q   4/4     Running   0          2h    10.129.0.4    ip-10-0-3-61.us-east-2.compute.internal    <none>
machine-api-operator-65d5f5dd99-jgl7l             1/1     Running   0          2h    10.128.0.3    ip-10-0-26-85.us-east-2.compute.internal   <none>

$ oc logs -f cluster-autoscaler-default-77f884d74b-glxr7
I1205 05:35:37.183234  1 leaderelection.go:187] attempting to acquire leader lease openshift-cluster-api/cluster-autoscaler...
I1205 05:35:55.734324  1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler
I1205 05:36:06.187294  1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2c size to 3
I1205 05:36:17.043336  1 scale_up.go:584] Scale-up: setting group qe-zhsun-worker-us-east-2a size to 3

Expected results:
Not sure whether we should deploy cluster-autoscaler on master nodes to prevent the pod from being scheduled to worker nodes.

Additional info:
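For reference, step 1 of the reproduction creates resources along the lines of the sketch below. This is only an illustration: the API versions, namespace, and the min/max replica values are assumptions (the MachineSet name qe-zhsun-worker-us-east-2a is taken from the logs above), and field names may differ between releases.

```yaml
# Sketch of the step-1 resources; API versions and replica bounds are assumed,
# the MachineSet name comes from the scale-up log lines above.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec: {}
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: qe-zhsun-worker-us-east-2a
  namespace: openshift-cluster-api
spec:
  minReplicas: 1          # assumed minimum for this node group
  maxReplicas: 3          # matches the "size to 3" scale-up in the logs
  scaleTargetRef:
    apiVersion: cluster.k8s.io/v1alpha1
    kind: MachineSet
    name: qe-zhsun-worker-us-east-2a
```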
Cluster-autoscaler pod is deployed on a master node now.

$ bin/openshift-install version
bin/openshift-install v0.8.0-master-2-g5e7b36d6351c9cc773f1dadc64abf9d7041151b1

$ oc get pod -o wide
NAME                                              READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE
cluster-autoscaler-default-7c88c947bc-mnnrj       1/1     Running   0          49s   10.129.0.20   ip-10-0-40-163.us-east-2.compute.internal    <none>
cluster-autoscaler-operator-5c467664fb-zb4hk      1/1     Running   0          19m   10.130.0.11   ip-10-0-31-81.us-east-2.compute.internal     <none>
clusterapi-manager-controllers-7d86667974-hvx4r   4/4     Running   0          16m   10.128.0.4    ip-10-0-8-232.us-east-2.compute.internal     <none>
machine-api-operator-6c8f7f459c-b8hbk             1/1     Running   0          19m   10.130.0.8    ip-10-0-31-81.us-east-2.compute.internal     <none>

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-132-7.us-east-2.compute.internal     Ready    worker   15m   v1.11.0+85a0623
ip-10-0-146-110.us-east-2.compute.internal   Ready    worker   15m   v1.11.0+85a0623
ip-10-0-175-107.us-east-2.compute.internal   Ready    worker   15m   v1.11.0+85a0623
ip-10-0-31-81.us-east-2.compute.internal     Ready    master   20m   v1.11.0+85a0623
ip-10-0-40-163.us-east-2.compute.internal    Ready    master   20m   v1.11.0+85a0623
ip-10-0-8-232.us-east-2.compute.internal     Ready    master   20m   v1.11.0+85a0623
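For context, pinning a deployment to master nodes is typically done with a nodeSelector plus a toleration for the master taint, roughly as in the fragment below. This is an illustration of the general approach, not necessarily the exact change made in cluster-autoscaler-operator; label and taint keys are the conventional ones and may differ by release.

```yaml
# Illustrative pod-spec fragment: schedule only onto masters and tolerate
# the master NoSchedule taint. Keys shown are the conventional ones.
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
```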
Verified. As shown in comment 1, the autoscaler pod is now deployed on a master node.
Closing as this was fixed by: https://github.com/openshift/cluster-autoscaler-operator/pull/19