Bug 1660745

Summary: [cloud-CA] scale down node name does not match the autoscaler log output info
Product: OpenShift Container Platform
Reporter: sunzhaohua <zhsun>
Component: Cloud Compute
Assignee: Andrew McDermott <amcdermo>
Status: CLOSED WORKSFORME
QA Contact: sunzhaohua <zhsun>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.1.0
CC: jhou
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-14 07:54:48 UTC
Type: Bug

Description sunzhaohua 2018-12-19 06:38:55 UTC
Description of problem:
The names of the nodes removed during scale-down do not match the node names reported in the autoscaler log output.

Version-Release number of selected component (if applicable):
$ bin/openshift-install version
bin/openshift-install v0.7.0-master-35-gead9f4b779a20dc32d51c3b2429d8d71d48ea043

How reproducible:
Sometimes

Steps to Reproduce:
1. Deploy a clusterautoscaler and a machineautoscaler (sketched example manifests follow the autoscaler log output below)
2. Create pods to trigger a scale-up and check the nodes
$ oc get node
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-1-191.us-east-2.compute.internal     Ready     master    3h        v1.11.0+a2218fc
ip-10-0-128-25.us-east-2.compute.internal    Ready     worker    8m        v1.11.0+a2218fc
ip-10-0-139-245.us-east-2.compute.internal   Ready     worker    8m        v1.11.0+a2218fc
ip-10-0-139-69.us-east-2.compute.internal    Ready     worker    3h        v1.11.0+a2218fc
ip-10-0-150-181.us-east-2.compute.internal   Ready     worker    3h        v1.11.0+a2218fc
ip-10-0-162-109.us-east-2.compute.internal   Ready     worker    8m        v1.11.0+a2218fc
ip-10-0-164-244.us-east-2.compute.internal   Ready     worker    3h        v1.11.0+a2218fc
ip-10-0-169-114.us-east-2.compute.internal   Ready     worker    8m        v1.11.0+a2218fc
ip-10-0-27-8.us-east-2.compute.internal      Ready     master    3h        v1.11.0+a2218fc
ip-10-0-32-233.us-east-2.compute.internal    Ready     master    3h        v1.11.0+a2218fc

3. Delete the pods and wait for the cluster to scale down
4. Compare the node names with the autoscaler log output
$ oc get node
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-1-191.us-east-2.compute.internal     Ready     master    4h        v1.11.0+a2218fc
ip-10-0-139-69.us-east-2.compute.internal    Ready     worker    3h        v1.11.0+a2218fc
ip-10-0-150-181.us-east-2.compute.internal   Ready     worker    3h        v1.11.0+a2218fc
ip-10-0-162-109.us-east-2.compute.internal   Ready     worker    9m        v1.11.0+a2218fc
ip-10-0-27-8.us-east-2.compute.internal      Ready     master    4h        v1.11.0+a2218fc
ip-10-0-32-233.us-east-2.compute.internal    Ready     master    4h        v1.11.0+a2218fc

$ oc logs -f cluster-autoscaler-default-7c88c947bc-7vp7d
I1219 05:59:17.414597       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-139-245.us-east-2.compute.internal
I1219 05:59:17.415418       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-162-109.us-east-2.compute.internal
I1219 05:59:17.416561       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-128-25.us-east-2.compute.internal
E1219 05:59:17.564286       1 scale_down.go:841] Problem with empty node deletion: failed to delete ip-10-0-139-245.us-east-2.compute.internal: unable to update number of replicas of machineset "openshift-cluster-api/qe-zhsun-1-worker-us-east-2a": Operation cannot be fulfilled on machinesets.cluster.k8s.io "qe-zhsun-1-worker-us-east-2a": the object has been modified; please apply your changes to the latest version and try again
E1219 05:59:17.569363       1 static_autoscaler.go:341] Failed to scale down: <nil>
I1219 05:59:38.746554       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-162-109.us-east-2.compute.internal
I1219 05:59:38.746620       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-128-25.us-east-2.compute.internal
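
For reference, a minimal sketch of the resources used in steps 1 and 2. The autoscaling.openshift.io API versions are assumptions for this pre-release build, and the openshift-cluster-api namespace and machineset name are taken from the machineset reference in the error above; adjust all of them to the environment under test:

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-2a
  namespace: openshift-cluster-api          # from the machineset reference in the log above
spec:
  minReplicas: 1
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: cluster.k8s.io/v1alpha1     # assumed; matches machinesets.cluster.k8s.io in the error above
    kind: MachineSet
    name: qe-zhsun-1-worker-us-east-2a

For step 2, a deployment with enough replicas and CPU requests to exceed the current worker capacity is one way to trigger the scale-up; the image and request values below are illustrative only:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up-test
spec:
  replicas: 20
  selector:
    matchLabels:
      app: scale-up-test
  template:
    metadata:
      labels:
        app: scale-up-test
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"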

Actual results:
The node names in the autoscaler log do not match the nodes that were actually removed: the log only mentions ip-10-0-139-245, ip-10-0-162-109 and ip-10-0-128-25, yet ip-10-0-162-109 is still present after the scale-down, while ip-10-0-164-244 and ip-10-0-169-114 were removed without appearing in the log. In fact, these nodes were removed:
ip-10-0-128-25.us-east-2.compute.internal    Ready     worker    8m        v1.11.0+a2218fc
ip-10-0-139-245.us-east-2.compute.internal   Ready     worker    8m        v1.11.0+a2218fc
ip-10-0-164-244.us-east-2.compute.internal   Ready     worker    3h        v1.11.0+a2218fc
ip-10-0-169-114.us-east-2.compute.internal   Ready     worker    8m        v1.11.0+a2218fc
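
One way to cross-check which machines/nodes were actually deleted is to map the machine objects to their nodes before and after the scale-down; a sketch, assuming the openshift-cluster-api namespace from the log above and that the Machine status exposes nodeRef:

$ oc get machines -n openshift-cluster-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeRef.name}{"\n"}{end}'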


Expected results:
The node names removed during scale-down match the node names reported in the autoscaler log output.

Additional info:

Comment 2 Andrew McDermott 2019-03-13 09:58:52 UTC
I wasn't able to reproduce this:

I started out with:

ip-10-0-131-4.us-east-2.compute.internal     Ready   master   69m   v1.12.4+2a194a0f02
ip-10-0-153-24.us-east-2.compute.internal    Ready   master   69m   v1.12.4+2a194a0f02
ip-10-0-162-238.us-east-2.compute.internal   Ready   master   69m   v1.12.4+2a194a0f02
ip-10-0-130-106.us-east-2.compute.internal   Ready   worker   56m   v1.12.4+2a194a0f02
ip-10-0-170-45.us-east-2.compute.internal    Ready   worker   56m   v1.12.4+2a194a0f02
ip-10-0-157-225.us-east-2.compute.internal   Ready   worker   56m   v1.12.4+2a194a0f02

I scaled out to:

ip-10-0-153-24.us-east-2.compute.internal    Ready   master   78m     v1.12.4+2a194a0f02
ip-10-0-131-4.us-east-2.compute.internal     Ready   master   78m     v1.12.4+2a194a0f02
ip-10-0-162-238.us-east-2.compute.internal   Ready   master   78m     v1.12.4+2a194a0f02
ip-10-0-130-106.us-east-2.compute.internal   Ready   worker   65m     v1.12.4+2a194a0f02
ip-10-0-170-45.us-east-2.compute.internal    Ready   worker   65m     v1.12.4+2a194a0f02
ip-10-0-157-225.us-east-2.compute.internal   Ready   worker   65m     v1.12.4+2a194a0f02
ip-10-0-140-62.us-east-2.compute.internal    Ready   worker   6m58s   v1.12.4+2a194a0f02
ip-10-0-140-54.us-east-2.compute.internal    Ready   worker   6m58s   v1.12.4+2a194a0f02
ip-10-0-129-55.us-east-2.compute.internal    Ready   worker   6m58s   v1.12.4+2a194a0f02
ip-10-0-128-202.us-east-2.compute.internal   Ready   worker   6m57s   v1.12.4+2a194a0f02
ip-10-0-139-109.us-east-2.compute.internal   Ready   worker   6m57s   v1.12.4+2a194a0f02
ip-10-0-136-66.us-east-2.compute.internal    Ready   worker   6m57s   v1.12.4+2a194a0f02
ip-10-0-133-132.us-east-2.compute.internal   Ready   worker   6m57s   v1.12.4+2a194a0f02
ip-10-0-138-177.us-east-2.compute.internal   Ready   worker   6m56s   v1.12.4+2a194a0f02
ip-10-0-139-100.us-east-2.compute.internal   Ready   worker   6m56s   v1.12.4+2a194a0f02
ip-10-0-137-149.us-east-2.compute.internal   Ready   worker   6m55s   v1.12.4+2a194a0f02
ip-10-0-139-77.us-east-2.compute.internal    Ready   worker   6m47s   v1.12.4+2a194a0f02

and, after scale down, I ended up with all the new nodes being deleted:

ip-10-0-131-4.us-east-2.compute.internal     Ready   master   96m   v1.12.4+2a194a0f02
ip-10-0-153-24.us-east-2.compute.internal    Ready   master   96m   v1.12.4+2a194a0f02
ip-10-0-162-238.us-east-2.compute.internal   Ready   master   96m   v1.12.4+2a194a0f02
ip-10-0-130-106.us-east-2.compute.internal   Ready   worker   83m   v1.12.4+2a194a0f02
ip-10-0-170-45.us-east-2.compute.internal    Ready   worker   83m   v1.12.4+2a194a0f02
ip-10-0-157-225.us-east-2.compute.internal   Ready   worker   83m   v1.12.4+2a194a0f02
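
While reproducing, the autoscaler's scale-down decisions can also be followed live; a sketch, assuming the deployment name behind the pod in the original report and the openshift-cluster-api namespace:

$ oc logs -f deployment/cluster-autoscaler-default -n openshift-cluster-api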

Comment 3 sunzhaohua 2019-03-14 07:54:48 UTC
I was not able to reproduce this in the new version either, so I am closing this bug.

Clusterversion: 4.0.0-0.nightly-2019-03-13-233958