Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1812328

Summary: Cluster Autoscaling in AWS creating multiple EC2 instances.
Product: OpenShift Container Platform Reporter: Vedanti Jaypurkar <vjaypurk>
Component: Cloud Compute Assignee: Joel Speed <jspeed>
Status: CLOSED DUPLICATE QA Contact: Jianwei Hou <jhou>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.2.0 CC: aos-bugs, cshulman, eparis, jokerman, mfojtik, mpatel, nagrawal, zyu
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-03 11:16:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Vedanti Jaypurkar 2020-03-11 02:10:20 UTC
Description of problem:
Issue: The customer is using autoscaling on OCP 4.2 deployed on AWS. Autoscaling sometimes creates duplicate EC2 instances with the same name but different IPs, and only one of them gets added to the cluster. This happens only in one particular Availability Zone.

Findings: 

In the logs below, the MachineAutoscaler "core-us-east-1c" is reconciled repeatedly (three times within the same second), while each of the others is reconciled only once:
---------
I1212 19:01:26.720384       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/core-us-east-1c
I1212 19:01:26.733670       1 validator.go:58] Validation webhook called for MachineAutoscaler: core-us-east-1c
I1212 19:01:26.746764       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/core-us-east-1c
I1212 19:01:26.751367       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/core-us-east-1c
I1212 19:04:18.599744       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/dalsim-us-east-1b
I1212 19:04:33.142341       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/dalsim-us-east-1c
I1212 19:04:38.449609       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/dalsim-us-east-1a
---------
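
To check whether one MachineAutoscaler is reconciled more often than the others across a longer log, the "Reconciling MachineAutoscaler" lines can be counted per object. The following is a minimal sketch that assumes the log format shown above; the function name `count_reconciles` and the `SAMPLE_LOG` constant are made up for this illustration, and the sample text is the excerpt from this report.

```python
import re
from collections import Counter

# Matches "Reconciling MachineAutoscaler <namespace>/<name>" as it appears
# in the excerpt above (assumed to be stable across the whole log).
RECONCILE_RE = re.compile(r"Reconciling MachineAutoscaler (\S+)")


def count_reconciles(log_text):
    """Return a Counter mapping each MachineAutoscaler to its reconcile count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RECONCILE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The log excerpt quoted in this report.
SAMPLE_LOG = """\
I1212 19:01:26.720384       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/core-us-east-1c
I1212 19:01:26.733670       1 validator.go:58] Validation webhook called for MachineAutoscaler: core-us-east-1c
I1212 19:01:26.746764       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/core-us-east-1c
I1212 19:01:26.751367       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/core-us-east-1c
I1212 19:04:18.599744       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/dalsim-us-east-1b
I1212 19:04:33.142341       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/dalsim-us-east-1c
I1212 19:04:38.449609       1 machineautoscaler_controller.go:158] Reconciling MachineAutoscaler openshift-machine-api/dalsim-us-east-1a
"""

if __name__ == "__main__":
    # Print the most frequently reconciled objects first.
    for name, count in count_reconciles(SAMPLE_LOG).most_common():
        print(f"{count:3d}  {name}")
```

Repeated reconciles alone are not necessarily a fault, so a count like this is only a pointer to which object to look at more closely.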

The modifications the customer made to the ClusterAutoscaler were to the min and max specs, and they added balanceSimilarNodeGroups because they are autoscaling across AZs.


Version-Release number of selected component (if applicable):


How reproducible:
Not reproduced on a test cluster: no duplicate EC2 instances are created by autoscaling there.

Actual results:
Duplicate EC2 instances are created with the same name, and only one of them gets added to the cluster.
Expected results:
Duplicate EC2 instances with the same name should not be created. Each instance name should be unique, and each instance should be added to the cluster automatically.


Additional info:

Comment 6 Joel Speed 2020-04-03 11:10:50 UTC
I've had a look at this and compared the logs to the code to try and work out which specific version of the code they might be on.

Based on the log lines, they must be using one of these three commits: https://github.com/openshift/cluster-api-provider-aws/commit/3661b213fae0f9e9b1c887df94d0a4e3981c4d6c, https://github.com/openshift/cluster-api-provider-aws/commit/e91ddd88ef516c62f6de2b5ee0c27d74fa2714eb or https://github.com/openshift/cluster-api-provider-aws/commit/5d8d336d2f415faba1553afd6d9763f9113b56e8

Importantly, this means that the version that caused this bug does not include https://github.com/openshift/cluster-api-provider-aws/commit/16d9c184cf29f0f5e48abcf529865b61cb71e811, which fixed a bug with a similar description (https://bugzilla.redhat.com/show_bug.cgi?id=1796595: `Exists` also wasn't detecting that the machine had already created an instance).
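
To illustrate why a failing existence check can lead to duplicate instances with the same name, here is a minimal sketch. It is not the provider's code: `FakeCloud`, `reconcile`, `exists_by_name` and `broken_exists` are hypothetical names, and the in-memory list only stands in for the EC2 API. It assumes that a reconcile creates an instance whenever the existence check reports that none is present.

```python
import itertools


class FakeCloud:
    """Hypothetical in-memory stand-in for the EC2 API; not a real client."""

    def __init__(self):
        self.instances = []              # each entry: {"id": ..., "name": ...}
        self._ids = itertools.count(1)

    def find_by_name(self, name):
        """Return all instances carrying the given name."""
        return [inst for inst in self.instances if inst["name"] == name]

    def run_instance(self, name):
        """Launch a new instance with a fresh ID and the given name."""
        instance = {"id": f"i-{next(self._ids):04d}", "name": name}
        self.instances.append(instance)
        return instance


def reconcile(cloud, machine_name, exists):
    """Create an instance for the machine unless `exists` reports one."""
    if exists(cloud, machine_name):
        return None                      # nothing to do, instance is present
    return cloud.run_instance(machine_name)


def exists_by_name(cloud, machine_name):
    """Existence check that sees instances created by earlier reconciles."""
    return bool(cloud.find_by_name(machine_name))


def broken_exists(cloud, machine_name):
    """Simulates a check that misses an already-created instance."""
    return False


if __name__ == "__main__":
    for check in (exists_by_name, broken_exists):
        cloud = FakeCloud()
        reconcile(cloud, "core-us-east-1c-abcde", check)
        reconcile(cloud, "core-us-east-1c-abcde", check)   # second reconcile
        print(check.__name__, [inst["id"] for inst in cloud.instances])
```

With the working check the second reconcile is a no-op; with the simulated failure, each reconcile launches another instance under the same name, which matches the symptom described in this report.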

Since this is already fixed and backported to 4.2, I'd encourage the customer to update to a later version in the 4.2 stream (I believe this should be in 4.2.20+)

I believe this issue to be a direct duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1796595

Comment 7 Joel Speed 2020-04-03 11:16:55 UTC

*** This bug has been marked as a duplicate of bug 1796595 ***