Description of problem:
An OSD cluster was provisioned in an account whose AWS limits were insufficient to fulfill all requested EC2 instances, resulting in an initial cluster with 3 masters and 2 workers. When the limits were raised, the additional EC2 instances (+2) were created and associated with Machines in the cluster, but no corresponding Nodes were created.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.7     True        False         10h     Cluster version is 4.1.7
Steps to Reproduce:
1. Provision cluster in account with EC2 limit for workers lower than desired number of machines.
2. Observe failure to provision nodes.
$ oc logs machine-api-controllers-5d957c6fd-qzxml -c controller-manager
E0725 08:39:47.357292 1 instances.go:309] Error creating EC2 instance: InstanceLimitExceeded: You have requested more instances (7) than your current instance limit of 5 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
3. Increase the EC2 limit for the worker instance type.

Actual results:
EC2 instances are created and referenced by Machines, but no Nodes are created for the Machines.

Expected results:
Nodes are created for all Machines that have EC2 instances.
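To spot which Machines are still missing Nodes, the two listings can be diffed. A minimal sketch: the worker names below are made-up sample data standing in for the real output of `oc get machines -n openshift-machine-api -o name` and `oc get nodes -o name`.

```shell
#!/bin/bash
# Sample stand-ins for cluster output (hypothetical names):
machines='worker-a
worker-b
worker-c'   # would come from: oc get machines -n openshift-machine-api -o name
nodes='worker-a
worker-b'   # would come from: oc get nodes -o name

# Names present in the Machine list but absent from the Node list.
comm -23 <(printf '%s\n' "$machines" | sort) <(printf '%s\n' "$nodes" | sort)
# → worker-c
```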
Created attachment 1593429 [details]
Created attachment 1593430 [details]
Created attachment 1593431 [details]
This is probably similar to https://bugzilla.redhat.com/show_bug.cgi?id=1723955.
Look at `oc get csr`.
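For context: pending node CSRs appear in `oc get csr` with CONDITION `Pending` and are approved with `oc adm certificate approve <name>`. A sketch of the selection logic, using a made-up captured listing in place of live `oc` output:

```shell
#!/bin/sh
# Hypothetical capture of `oc get csr` (NAME AGE REQUESTOR CONDITION):
csr_list='csr-abc12   10m   system:node:ip-10-0-1-23.ec2.internal   Pending
csr-def34   9m    system:node:ip-10-0-1-24.ec2.internal   Approved,Issued'

# Pick out the names of CSRs whose last column is exactly "Pending".
pending=$(printf '%s\n' "$csr_list" | awk '$NF=="Pending"{print $1}')
printf '%s\n' "$pending"
# → csr-abc12
# On a live cluster you would then run:
#   echo "$pending" | xargs oc adm certificate approve
```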
This is expected when the AWS limit is reached. You'll get a Prometheus alert due to the mismatch between Nodes and Machines, and once the instance is created you'll need to approve the CSR manually, by design. If anything, we could try to store the timestamp of when the instance was actually created so the machine approver will consider it legitimate; bumping to 4.3 for further consideration.
Since we introduced machine phases, this should be reflected in the machine phase as provisioning/failed, giving more meaningful output in addition to the alerts. Also, multiple fixes were merged for the machine approver, which now tolerates a longer timeout. I'm closing this; please reopen if still relevant.