In the must-gather from comment 0 I see that all the machines were created at the time of capture, but there's a node missing. I didn't dig into the kube-apiserver logs to see why the node hadn't been created at the time of capture. However, given that waiting additional time yields the 6th node showing up, the most likely cause here is simply the vSphere infrastructure not meeting quality of service requirements.

You can work around this performance issue by having your CI jobs wait some additional time, by running `openshift-install wait-for install-complete` after the current invocation. In general, though, if the infrastructure cannot be provisioned in the allotted time it's not performant enough to meet minimum quality of service, and it's likely you'll experience other problems associated with infrastructure performance even after a successful installation.

This should likely be marked as a dupe of bug 1994820, where the Cloud team is working on some bootstrap time improvements which would allow the installer to more accurately convey the state of the requested machines.
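For illustration, a rough sketch of what that CI workaround could look like (the asset directory name and the choice to tolerate the first command's timeout are assumptions for the example, not prescriptions):

    # Initial attempt; on slow infrastructure this can time out while worker
    # Machines are still provisioning, so don't abort the CI job on failure here.
    openshift-install create cluster --dir=assets || true

    # Give the infrastructure additional time to finish bringing nodes up
    # before declaring the install failed.
    openshift-install wait-for install-complete --dir=assets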
(In reply to Scott Dodson from comment #2)
> In the must-gather from comment 0 I see that all the machines were created
> at the time of capture but there's a node missing. I didn't dig into kube
> apiserver logs to see why the node hadn't been created at the time of
> capture. However, given that waiting additional time yields the 6th node
> showing up the most likely cause here is simply the vsphere infrastructure
> not meeting quality of service requirements. You can work around this
> performance issue by having your CI jobs wait some additional time by
> running `openshift-install wait-for install-complete` after the current
> invocation. However, in general, if the infrastructure cannot be provisioned
> in the allotted time it's not performant enough to meet minimum quality of
> service and it's likely you'll experience other problems associated with
> infrastructure performance even after a successful installation.
>
> This should likely be marked as a dupe of 1994820 where the Cloud team is
> working on some bootstrap time improvements which would allow the installer
> to more accurately convey the state of the requested machines.

If all nodes are not created in the OCP cluster, shouldn't the `openshift-install create cluster` command fail?
So we had a discussion about this yesterday on the cluster lifecycle architecture call. We were talking through the different options and what we are actually trying to achieve with bug 1994820. We want, in that bug, to signal to the user when "there are insufficient worker machines to complete the installation process". That is, currently, when there are insufficient worker nodes, several operators fail to provision and therefore the cluster install fails, the common cause normally being not enough workers. We want to make this scenario easier to diagnose by having an error saying something along the lines of "insufficient workers, go check machine api".

> If all nodes are not created in OCP cluster, shouldn't "openshift-install create cluster" command fail?

We also discussed this idea. The answer to your question varies depending on who you ask. Some customers don't care if 2 out of 100 machines fail to provision: their cluster is still up and running, and the existing Machine API alerting will highlight the failed machines to them. Others argue that the installer is a declarative process, and therefore if you ask for x machines, you should get x machines. One frustration we know customers have is that installs can often fail, only for the cluster to later become fully functional without any end user interaction.

For this reason, it was suggested on the call to focus for now on improving the reporting, but not changing the behaviour of the install. If we do not have enough machines to create a functional cluster, we will degrade and fail the install. An addition to the installer that gives the end user an option to wait for all machines to be ready was also discussed, but, as this needs a larger discussion, I think a spike should be done by the cloud/installer teams to identify the best way to achieve this.
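(As a side note on the "go check machine api" point above: one quick way to spot Machines that never became Nodes is to compare the two resource lists; the commands below are just a sketch of that check.)

    # Machines as the Machine API sees them; with -o wide the NODE column shows
    # which Machines have (or have not yet) registered as Nodes.
    oc get machines -n openshift-machine-api -o wide

    # Nodes as the cluster sees them, for comparison.
    oc get nodes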
Marking as a duplicate now that the other bug is merged; please let us know if you'd like to track future work on making sure all Nodes come up.

*** This bug has been marked as a duplicate of bug 1994820 ***