Bug 1796595

Summary: Scaling Machine Sets up launches extra instances
Product: OpenShift Container Platform Reporter: Tejas Parikh <tparikh>
Component: Cloud ComputeAssignee: Alberto <agarcial>
Status: CLOSED ERRATA QA Contact: sunzhaohua <zhsun>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.2.zCC: aos-bugs, eparis, jokerman, rphillips, vjaypurk
Target Milestone: ---   
Target Release: 4.2.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-04 04:51:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1775874    
Bug Blocks:    
Attachments:
Description Flags
aws screenshot
none
openshift console screenshot none

Description Tejas Parikh 2020-01-30 18:08:29 UTC
Created attachment 1656550 [details]
aws screenshot

Description of problem:

I have an OCP 4.2.12 cluster running on AWS. Cluster is setup with two machine sets. First machine set has three m5.xlarge compute nodes and the second machine set has one m5.metal compute node. When I increased the desired count of the 2nd machine set to two nodes from one node, OpenShift launched two new m5.metal nodes for a total of three m5.metal nodes. Third node never got added to the cluster. Increasing the desired count by one should have only launched a single new instance. I was able to terminate the extra instance from AWS console without impacting the cluster.

Version-Release number of selected component (if applicable):

4.2.12

How reproducible:

I haven’t tried this again as this is our production cluster and m5.metal instances are expensive. 

Steps to Reproduce:
1. Create a second machine set with one m5.metal instance. Wait for the node to be part of the cluster.
2. Increase the desired count to two (2).
3. Wait for the new node to join the cluster.

Actual results:

OpenShift will launch 2 new additional m5.metal instances bringing the total to 3.

Expected results:

You should have two m5.metal instances running.

Additional info:

Your AWS account will need a quota for m5.metal nodes. I have attached screenshots of OpenShift console and AWS console.

Comment 1 Tejas Parikh 2020-01-30 18:09:34 UTC
Created attachment 1656551 [details]
openshift console screenshot

Comment 3 Ryan Phillips 2020-02-04 22:39:04 UTC
Tejas: Could you provide the must gather logs for the cluster?

Comment 4 Alberto 2020-02-05 10:39:25 UTC
can we please share must gather logs where this happened?

Comment 5 Tejas Parikh 2020-02-05 15:44:02 UTC
Alberto,

You can download the must-gather logs from https://drive.google.com/file/d/1vYOBTRSeu_iy2bNFvAJbMuNBkr43wfWj/view?usp=sharing

The file is too big for Bugzilla attachment.

--Tejas

Comment 6 Alberto 2020-02-07 11:47:28 UTC
Tejas Parikh thanks for sharing the logs.
I can't find in the logs when the machineSet was scaled out to 2. Also I can see that the replicas field for c1-gjlfl-builder-worker-us-east-1d is set to 1.

I can see though that after "c1-gjlfl-builder-worker-us-east-1d-t7k4h" have been created, at some point the actuator communication with AWS (wrongly) reports that the instance does not exist resulting in the creation of a new one. 

Reporting instance does not exist twice and attempting creation twice:
2020-01-30T16:38:25.953649643Z I0130 16:38:25.953619       1 actuator.go:487] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Checking if machine exists
2020-01-30T16:38:26.018127422Z I0130 16:38:26.018079       1 actuator.go:495] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Instance does not exist
2020-01-30T16:38:26.018127422Z I0130 16:38:26.018101       1 controller.go:253] Reconciling machine object c1-gjlfl-builder-worker-us-east-1d-t7k4h triggers idempotent create.
2020-01-30T16:38:26.018127422Z I0130 16:38:26.018110       1 actuator.go:113] c1-gjlfl-builder-worker-us-east-1d-t7k4h: creating machine
2020-01-30T16:38:26.018340829Z E0130 16:38:26.018277       1 utils.go:191] NodeRef not found in machine c1-gjlfl-builder-worker-us-east-1d-t7k4h
2020-01-30T16:38:26.060755832Z I0130 16:38:26.060722       1 instances.go:44] No stopped instances found for machine c1-gjlfl-builder-worker-us-east-1d-t7k4h
2020-01-30T16:38:26.060812442Z I0130 16:38:26.060755       1 instances.go:142] Using AMI ami-046fe691f52a953f9
2020-01-30T16:38:26.060812442Z I0130 16:38:26.060760       1 instances.go:74] Describing security groups based on filters
2020-01-30T16:38:26.249271021Z I0130 16:38:26.249226       1 instances.go:119] Describing subnets based on filters
2020-01-30T16:38:27.557196859Z I0130 16:38:27.557158       1 actuator.go:199] c1-gjlfl-builder-worker-us-east-1d-t7k4h: ProviderID updated at machine spec: aws:///us-east-1d/i-0b14bc0828bb6854c
2020-01-30T16:38:27.562648728Z I0130 16:38:27.562619       1 actuator.go:585] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Updating status
2020-01-30T16:38:27.562648728Z I0130 16:38:27.562634       1 actuator.go:629] c1-gjlfl-builder-worker-us-east-1d-t7k4h: finished calculating AWS status
2020-01-30T16:38:27.562686074Z I0130 16:38:27.562677       1 actuator.go:231] c1-gjlfl-builder-worker-us-east-1d-t7k4h: machine status has changed, updating
2020-01-30T16:38:27.568446114Z I0130 16:38:27.568405       1 actuator.go:647] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Instance state still pending, returning an error to requeue
2020-01-30T16:38:27.568446114Z W0130 16:38:27.568428       1 controller.go:255] Failed to create machine "c1-gjlfl-builder-worker-us-east-1d-t7k4h": requeue in: 20s
2020-01-30T16:38:27.568485133Z I0130 16:38:27.568445       1 controller.go:337] Actuator returned requeue-after error: requeue in: 20s
2020-01-30T16:38:27.568514292Z I0130 16:38:27.568497       1 controller.go:133] Reconciling Machine "c1-gjlfl-builder-worker-us-east-1d-t7k4h"
2020-01-30T16:38:27.568620221Z I0130 16:38:27.568535       1 controller.go:304] Machine "c1-gjlfl-builder-worker-us-east-1d-t7k4h" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2020-01-30T16:38:27.568620221Z I0130 16:38:27.568553       1 actuator.go:487] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Checking if machine exists
2020-01-30T16:38:27.638735752Z I0130 16:38:27.638686       1 actuator.go:495] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Instance does not exist
2020-01-30T16:38:27.638735752Z I0130 16:38:27.638710       1 controller.go:253] Reconciling machine object c1-gjlfl-builder-worker-us-east-1d-t7k4h triggers idempotent create.
2020-01-30T16:38:27.638735752Z I0130 16:38:27.638719       1 actuator.go:113] c1-gjlfl-builder-worker-us-east-1d-t7k4h: creating machine
2020-01-30T16:38:27.638921083Z E0130 16:38:27.638906       1 utils.go:191] NodeRef not found in machine c1-gjlfl-builder-worker-us-east-1d-t7k4h
2020-01-30T16:38:27.732112017Z I0130 16:38:27.732071       1 instances.go:44] No stopped instances found for machine c1-gjlfl-builder-worker-us-east-1d-t7k4h
2020-01-30T16:38:27.732112017Z I0130 16:38:27.732104       1 instances.go:142] Using AMI ami-046fe691f52a953f9
2020-01-30T16:38:27.732186413Z I0130 16:38:27.732109       1 instances.go:74] Describing security groups based on filters
2020-01-30T16:38:27.91975636Z I0130 16:38:27.919713       1 instances.go:119] Describing subnets based on filters
2020-01-30T16:38:29.175089882Z I0130 16:38:29.175053       1 actuator.go:199] c1-gjlfl-builder-worker-us-east-1d-t7k4h: ProviderID updated at machine spec: aws:///us-east-1d/i-068439830512b459d
2020-01-30T16:38:29.179544638Z I0130 16:38:29.179511       1 actuator.go:585] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Updating status
2020-01-30T16:38:29.1795829Z I0130 16:38:29.179560       1 actuator.go:629] c1-gjlfl-builder-worker-us-east-1d-t7k4h: finished calculating AWS status
2020-01-30T16:38:29.179658202Z I0130 16:38:29.179642       1 actuator.go:231] c1-gjlfl-builder-worker-us-east-1d-t7k4h: machine status has changed, updating
2020-01-30T16:38:29.185409455Z I0130 16:38:29.185377       1 actuator.go:647] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Instance state still pending, returning an error to requeue
2020-01-30T16:38:29.185436009Z W0130 16:38:29.185402       1 controller.go:255] Failed to create machine "c1-gjlfl-builder-worker-us-east-1d-t7k4h": requeue in: 20s
2020-01-30T16:38:29.185457262Z I0130 16:38:29.185419       1 controller.go:337] Actuator returned requeue-after error: requeue in: 20s
2020-01-30T16:38:29.185609864Z I0130 16:38:29.185502       1 controller.go:133] Reconciling Machine "c1-gjlfl-builder-worker-us-east-1d-t7k4h"
2020-01-30T16:38:29.185609864Z I0130 16:38:29.185516       1 controller.go:304] Machine "c1-gjlfl-builder-worker-us-east-1d-t7k4h" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster


Reporting two instances for machine:

2020-01-30T16:38:29.185609864Z I0130 16:38:29.185525       1 actuator.go:487] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Checking if machine exists
2020-01-30T16:38:29.357941192Z I0130 16:38:29.357900       1 actuator.go:500] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Instance exists as "i-068439830512b459d"
2020-01-30T16:38:29.357941192Z I0130 16:38:29.357925       1 controller.go:244] Reconciling machine "c1-gjlfl-builder-worker-us-east-1d-t7k4h" triggers idempotent update
2020-01-30T16:38:29.357941192Z I0130 16:38:29.357932       1 actuator.go:402] c1-gjlfl-builder-worker-us-east-1d-t7k4h: updating machine
2020-01-30T16:38:29.358005846Z I0130 16:38:29.357996       1 actuator.go:410] c1-gjlfl-builder-worker-us-east-1d-t7k4h: obtaining EC2 client for region
2020-01-30T16:38:29.422392946Z I0130 16:38:29.422322       1 actuator.go:428] c1-gjlfl-builder-worker-us-east-1d-t7k4h: found 2 existing instances for machine
2020-01-30T16:38:29.427385276Z I0130 16:38:29.427345       1 actuator.go:192] c1-gjlfl-builder-worker-us-east-1d-t7k4h: ProviderID already set in the machine Spec with value:aws:///us-east-1d/i-068439830512b459d
2020-01-30T16:38:29.427385276Z I0130 16:38:29.427363       1 actuator.go:585] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Updating status
2020-01-30T16:38:29.427446901Z I0130 16:38:29.427403       1 actuator.go:629] c1-gjlfl-builder-worker-us-east-1d-t7k4h: finished calculating AWS status
2020-01-30T16:38:29.427517631Z I0130 16:38:29.427499       1 actuator.go:240] c1-gjlfl-builder-worker-us-east-1d-t7k4h: status unchanged
2020-01-30T16:38:29.427517631Z I0130 16:38:29.427511       1 actuator.go:647] c1-gjlfl-builder-worker-us-east-1d-t7k4h: Instance state still pending, returning an error to requeue
2020-01-30T16:38:29.427552258Z E0130 16:38:29.427522       1 controller.go:246] Error updating machine "openshift-machine-api/c1-gjlfl-builder-worker-us-east-1d-t7k4h": requeue in: 20s
2020-01-30T16:38:29.427552258Z I0130 16:38:29.427536       1 controller.go:337] Actuator returned requeue-after error: requeue in: 20s
2020-01-30T16:38:47.568686146Z I0130 16:38:47.568635       1 controller.go:133] Reconciling Machine "c1-gjlfl-builder-worker-us-east-1d-t7k4h"


This is due most likely to aws eventual consistency issues https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency as analysed here https://bugzilla.redhat.com/show_bug.cgi?id=1772163 and https://bugzilla.redhat.com/show_bug.cgi?id=1761882.

Comment 9 sunzhaohua 2020-02-25 08:51:53 UTC
Verified.
clusterversion: 4.2.0-0.nightly-2020-02-24-065701


1. Create a second machine set with one m5.metal instance. Wait for the node to be part of the cluster.
2. Increase the desired count to two (2).
3. Wait for the new node to join the cluster.
4. only one new m5.metal instances join the cluster.

$ oc get machineset
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun42-2n9xw-worker-us-east-2a   1         1         1       1           83m
zhsun42-2n9xw-worker-us-east-2b   0         0                             83m
zhsun42-2n9xw-worker-us-east-2c   2         2         2       2           83m

Comment 11 errata-xmlrpc 2020-03-04 04:51:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0614

Comment 12 Joel Speed 2020-04-03 11:16:55 UTC
*** Bug 1812328 has been marked as a duplicate of this bug. ***