In the must-gather from comment 0 I see that all the machines were created at the time of capture, but there's a node missing. I didn't dig into the kube-apiserver logs to see why the node hadn't been created at the time of capture. However, given that waiting additional time yields the 6th node showing up, the most likely cause here is simply the vSphere infrastructure not meeting quality of service requirements.

You can work around this performance issue by having your CI jobs wait some additional time, by running `openshift-install wait-for install-complete` after the current invocation. In general, though, if the infrastructure cannot be provisioned in the allotted time it's not performant enough to meet minimum quality of service, and it's likely you'll experience other problems associated with infrastructure performance even after a successful installation.

This should likely be marked as a dupe of bug 1994820, where the Cloud team is working on some bootstrap time improvements which would allow the installer to more accurately convey the state of the requested machines.
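For illustration, a rough sketch of what that CI workaround could look like (the asset directory name and the choice to tolerate the first command's timeout are assumptions for the example, not prescriptions):

    # Initial attempt; on slow infrastructure this can time out while worker
    # Machines are still provisioning, so don't abort the CI job on failure here.
    openshift-install create cluster --dir=assets || true

    # Give the infrastructure additional time to finish bringing nodes up
    # before declaring the install failed.
    openshift-install wait-for install-complete --dir=assets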
(In reply to Scott Dodson from comment #2)
> In the must-gather from comment 0 I see that all the machines were created
> at the time of capture but there's a node missing. I didn't dig into kube
> apiserver logs to see why the node hadn't been created at the time of
> capture. However, given that waiting additional time yields the 6th node
> showing up the most likely cause here is simply the vsphere infrastructure
> not meeting quality of service requirements. You can work around this
> performance issue by having your CI jobs wait some additional time by
> running `openshift-install wait-for install-complete` after the current
> invocation. However, in general, if the infrastructure cannot be provisioned
> in the allotted time it's not performant enough to meet minimum quality of
> service and it's likely you'll experience other problems associated with
> infrastructure performance even after a successful installation.
>
> This should likely be marked as a dupe of 1994820 where the Cloud team is
> working on some bootstrap time improvements which would allow the installer
> to more accurately convey the state of the requested machines.

If all nodes are not created in the OCP cluster, shouldn't the `openshift-install create cluster` command fail?
So we had a discussion about this yesterday on the cluster lifecycle architecture call. We were talking through the different options and what we are actually trying to achieve with bug 1994820. We want, in that bug, to signal to the user when "there are insufficient worker machines to complete the installation process". That is, currently, when there are insufficient worker nodes, several operators fail to provision and therefore the cluster install fails, the common cause normally being not enough workers. We want to make this scenario easier to diagnose by having an error saying something along the lines of "insufficient workers, go check machine api".

> If all nodes are not created in OCP cluster, shouldn't "openshift-install create cluster" command fail?

We also discussed this idea. The answer to your question varies depending on who you ask. Some customers don't care if 2 out of 100 machines fail to provision: their cluster is still up and running, and the existing Machine API alerting will highlight the failed machines to them. Others argue that the installer is a declarative process, and therefore if you ask for x machines, you should get x machines. One frustration we know customers have is that installs can often fail, only for the cluster to later become fully functional without any end user interaction.

For this reason, it was suggested on the call to focus for now on improving the reporting, but not changing the behaviour of the install. If we do not have enough machines to create a functional cluster, we will degrade and fail the install. An addition to the installer that gives the end user an option to wait for all machines to be ready was also discussed, but, as this needs a larger discussion, I think a spike should be done by the cloud/installer teams to identify the best way to achieve this.
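(As a side note on the "go check machine api" point above: one quick way to spot Machines that never became Nodes is to compare the two resource lists; the commands below are just a sketch of that check.)

    # Machines as the Machine API sees them; with -o wide the NODE column shows
    # which Machines have (or have not yet) registered as Nodes.
    oc get machines -n openshift-machine-api -o wide

    # Nodes as the cluster sees them, for comparison.
    oc get nodes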
Marking as a duplicate now that the other bug is merged; please let us know if you'd like to track future work on making sure all Nodes come up.

*** This bug has been marked as a duplicate of bug 1994820 ***