Bug 1852545 - VSphere IPI fails to bootstrap workers: default resource pool resolves to multiple instances, please specify
Summary: VSphere IPI fails to bootstrap workers: default resource pool resolves to multiple instances, please specify
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: aos-install
QA Contact: jima
URL:
Whiteboard:
Duplicates: 1861954 (view as bug list)
Depends On:
Blocks: 1866539
 
Reported: 2020-06-30 16:41 UTC by Brian Ward
Modified: 2023-12-15 18:21 UTC (History)
CC: 21 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Because no resource pool path is specified for the machines, the workers try to resolve a default resource pool; when multiple resource pools exist, the lookup is ambiguous and machine creation fails. A resourcePoolPath parameter is now populated so the machines are created in the correct resource pool.
Clone Of:
Clones: 1866539 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:10:31 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
  Github openshift installer pull 3863 (closed): Bug 1852545: Add ResourcePoolPath to machines in vsphere (last updated 2021-02-13 21:34:12 UTC)
  Red Hat Knowledge Base (Solution) 5234861 (last updated 2020-07-20 08:53:00 UTC)
  Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:11:08 UTC)

Description Brian Ward 2020-06-30 16:41:52 UTC
Description of problem:

Using nightly https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/latest-4.5/openshift-install-linux-4.5.0-0.nightly-2020-06-05-214616.tar.gz

Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1833256

Worker Machines stuck:

status:
  lastUpdated: "2020-06-29T20:57:26Z"
  phase: Provisioning
  providerStatus:
    conditions:
    - lastProbeTime: "2020-06-29T20:57:27Z"
      lastTransitionTime: "2020-06-29T20:57:27Z"
      message: 'unable to get resource pool for <nil>: default resource pool resolves
        to multiple instances, please specify'
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreation


Version-Release number of the following components:


How reproducible:
Only one install was performed; the error suggests it is likely to reproduce every time.

Steps to Reproduce:
Run the OpenShift IPI installer on vSphere with inputs from install-config.yaml.

Actual results:
Masters come up, but workers fail to provision.

Expected results:
Working cluster.

Comment 3 Patrick Dillon 2020-06-30 19:28:17 UTC
Moving this to machine-api operator

This vCenter has multiple datacenters/clusters. It looks like the machine-api operator is failing to find a default resource pool because there are multiple default resource pools. The issue could probably be resolved by finding the default resource pool for the datacenter/cluster in the provider spec.

The error "'unable to get resource pool for <nil>: default resource pool resolves to multiple instances, please specify'" seems to come from the call to this function: https://github.com/openshift/machine-api-operator/blob/release-4.5/pkg/controller/vsphere/reconciler.go#L473

The resourcePoolPath argument is read from the provider spec, which is generated by the installer, which omits the optional resourcePool field:
https://github.com/openshift/machine-api-operator/blob/release-4.5/pkg/controller/vsphere/reconciler.go#L454
https://github.com/openshift/installer/blob/master/pkg/asset/machines/vsphere/machines.go#L87-L91

So the value for resourcePoolPath is empty. MAO passes that empty value through to Finder.ResourcePoolOrDefault, which seems to fail with the above error when there are multiple datacenters/clusters.

A workaround was successfully achieved by specifying the path /SDDC-Datacenter/host/Cluster-1/Resources in the machineset:

$ oc get machinesets.machine.openshift.io -n openshift-machine-api jcallen-d2q7j-worker --template '{{.spec.template.spec.providerSpec.value.workspace.resourcePool}}'
/SDDC-Datacenter/host/Cluster-1/Resources
$ oc get machinesets.machine.openshift.io                                                                
NAME                   DESIRED   CURRENT   READY   AVAILABLE   AGE
jcallen-d2q7j-worker   3         3                             9m45s

MAO could potentially construct a similar path from the provider spec.
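
As a rough sketch of applying this workaround (assuming the machineset name and resource pool path shown above; both are environment-specific and must be adjusted), the field can be set with a merge patch. Machines that have already failed to provision may need to be deleted so the machineset recreates them with the updated providerSpec:

# Hypothetical example; adjust the machineset name and resource pool path for your environment.
$ oc -n openshift-machine-api patch machinesets.machine.openshift.io jcallen-d2q7j-worker --type merge -p '
spec:
  template:
    spec:
      providerSpec:
        value:
          workspace:
            resourcePool: /SDDC-Datacenter/host/Cluster-1/Resources
'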

Comment 5 Jianwei Hou 2020-07-01 01:43:11 UTC
A related bug is https://bugzilla.redhat.com/show_bug.cgi?id=1833256; there is a workaround, but it is not user friendly.

As a user experience improvement, I suggest there should be a way to specify a resource pool at install time, since we cannot assume the customer has only one resource pool in their vSphere configuration.

Comment 6 Alberto 2020-07-01 07:46:28 UTC
Thanks for reporting. This is currently expected behaviour.

>Issue could probably be resolved by finding the default resource pool for the datacenter/cluster in the provider spec?

We'll explore doing this.

Comment 7 Patrick Dillon 2020-07-01 13:27:24 UTC
On closer inspection, MAO would still be very limited in determining the default resource pool because no cluster is provided in the machine provider spec. Therefore, if there are multiple clusters in a datacenter, MAO will not be able to resolve the issue.

Moving back to the installer, which should populate the resource pool with the root resource pool of the cluster given in the install-config, which the installer has access to.

Providing a non-root resource pool would be a new feature and would require changes to terraform and vsphere provider.
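
For illustration, a vSphere cluster's root resource pool lives at the inventory path /<datacenter>/host/<cluster>/Resources, so the machineset workspace rendered by the installer after such a fix would look roughly like the following (the datacenter, cluster, datastore, folder, and server values here are hypothetical):

# Illustrative values only; resourcePool follows /<datacenter>/host/<cluster>/Resources.
workspace:
  datacenter: dc1
  datastore: datastore1
  folder: /dc1/vm/<cluster-id>
  resourcePool: /dc1/host/devel/Resources
  server: vcenter.example.com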

Comment 9 Patrick Dillon 2020-07-10 14:04:31 UTC
This is under active code review but possibly will not merge today, so we are adding UpcomingSprint.

Comment 12 David Barreda 2020-07-24 16:34:23 UTC
Hi - back in September (the last time I tried the Terraform-based install available in the openshift-installer project, rather than the openshift-install binary), Terraform would create the resource pool and put all of the VMs inside it. I think this would also be the behavior now that it is done through openshift-install.

This would also imply that documentation on creating resource pools needs to be added.

Comment 13 jima 2020-07-29 03:35:22 UTC
Verified on 4.6.0-0.nightly-2020-07-25-091217 and passed.

$ oc get machinesets.machine.openshift.io -n openshift-machine-api wduan0729a-5zjt4-worker --template '{{.spec.template.spec.providerSpec.value.workspace.resourcePool}}'
/dc1/host/devel/Resources

Comment 14 Abhinav Dahiya 2020-07-30 03:04:38 UTC
*** Bug 1861954 has been marked as a duplicate of this bug. ***

Comment 15 Patrick Dillon 2020-07-30 03:06:02 UTC
(In reply to David Barreda from comment #12)
> Hi - back in September (the last time I tried the Terraform-based install
> available in the openshift-installer project, rather than the
> openshift-install binary), Terraform would create the resource pool and put
> all of the VMs inside it. I think this would also be the behavior now that
> it is done through openshift-install.
> 
> This would also imply that documentation on creating resource pools needs to
> be added.

openshift-install uses the root resource pool for the cluster designated in the install-config.
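
For reference, a rough sketch of the relevant install-config.yaml section (field names as in the 4.5/4.6-era schema; all values are placeholders, and the installer documentation should be consulted for the exact schema). The cluster named here is the one whose root resource pool openshift-install uses:

# Placeholder values; the cluster field designates the vSphere cluster whose root resource pool is used.
platform:
  vsphere:
    vCenter: vcenter.example.com
    username: administrator@vsphere.local
    password: <password>
    datacenter: dc1
    defaultDatastore: datastore1
    cluster: devel
    network: VM Network
    apiVIP: 192.168.100.5
    ingressVIP: 192.168.100.6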

Comment 16 Luciano R 2020-07-30 12:41:23 UTC
We have the same problem in an installation on vSphere using installer 4.5.4. We solved it by placing the relative path in the resourcepool setting on the machineset. The complete DC path did not work.


=== This did not work ===
(...)
workspace:
            datacenter: DC
            datastore: DATASTORE05
            folder: /DC/vm/prd-47q4m
            resourcepool: /DC/Cluster/Resources
            server: customervcenter.com.br
(...)

=== This worked ===
(...)
workspace:
            datacenter: DC
            datastore: DATASTORE05
            folder: /DC/vm/prd-47q4m
            resourcepool: Cluster/Resources
            server: customervcenter.com.br
(...)

Comment 18 Phillip Kramp 2020-08-28 13:39:57 UTC
FYI, we also ran into this bug when no resource pool had been explicitly created and the customer was just using the default resource pool. We tried:
resourcepool:
  Resources
  /<full path>/Resources
  vmware id: ResourcePool-resgroup-194
  root

Finally, we created a resource pool named OpenShift-RP, which then worked.
The working config looked like:

workspace:
            datacenter: <DC>
            datastore: <DATASTORE>
            folder: ocppoc-ld5s7
            resourcepool: OpenShift-RP
            server: <SERVER>

Comment 21 errata-xmlrpc 2020-10-27 16:10:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

