Bug 1918005

Summary: [vsphere] If there are multiple port groups with the same name installation fails
Product: OpenShift Container Platform
Reporter: Joseph Callen <jcallen>
Component: Installer
Assignee: Nobody <nobody>
Installer sub component: openshift-installer
QA Contact: jima
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: asadawar, bleanhar, morgan.peterman, nstielau, padillon, rbost, snetting, zhsun
Version: 4.8
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The installer passed an ambiguous name for networks to the Terraform provider.
Consequence: If the Terraform provider found more than one network with that name, it could not decide which was correct and would fail.
Fix: The installer now passes the ID of the network.
Result: The Terraform provider knows exactly which network to use and the install succeeds. There is no difference in behavior for users (they still provide the same information as before).
Story Points: ---
Clone Of:
: 1955697
Environment:
Last Closed: 2022-08-10 10:35:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Bug Depends On: 1981941    
Bug Blocks:    

Description Joseph Callen 2021-01-19 19:52:58 UTC
If two port groups with the same name exist on the same vCenter instance but in two different datacenters, Terraform will fail with:

error fetching network: path '3214-pxe' resolves to multiple networks, Please specify

See https://registry.terraform.io/providers/hashicorp/vsphere/latest/docs/data-sources/network#distributed_virtual_switch_uuid for the requirements when multiple port groups exist with the same name.

Will need to:
- determine the distributed virtual switch name (if a DVS is being used)
- provide that name to Terraform
- add the distributed_virtual_switch_uuid argument to the vsphere_network data source (see the sketch below)
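
A minimal sketch of that disambiguated lookup, following the provider docs linked above (the datacenter, switch, and port group names here are hypothetical):

data "vsphere_datacenter" "dc" {
  name = "dc-01"
}

# Look up the distributed virtual switch that owns the port group so its
# UUID can scope the network lookup below.
data "vsphere_distributed_virtual_switch" "dvs" {
  name          = "dvs-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

# With distributed_virtual_switch_uuid set, a name such as '3214-pxe' no
# longer resolves to multiple networks.
data "vsphere_network" "network" {
  name                            = "3214-pxe"
  datacenter_id                   = data.vsphere_datacenter.dc.id
  distributed_virtual_switch_uuid = data.vsphere_distributed_virtual_switch.dvs.id
}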

Comment 2 Joseph Callen 2021-01-19 19:55:25 UTC
It might be easier to fix the Terraform provider. There is no reason why you should need to provide the distributed switch UUID if you already provide the datacenter that the port group belongs to.
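
For context, a minimal sketch of the datacenter-scoped lookup the comment refers to (names hypothetical); per this report, the bare name can still resolve to multiple networks even with datacenter_id set:

data "vsphere_network" "network" {
  # Scoped to a datacenter, but the name alone can still be ambiguous.
  name          = "3214-pxe"
  datacenter_id = data.vsphere_datacenter.dc.id
}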

Comment 3 Nick Stielau 2021-02-03 19:16:25 UTC
Seems edge-casey enough that I'm confident that it isn't a release blocker.

Comment 4 Matthew Staebler 2021-02-03 19:21:59 UTC
There is a workaround for this, so lowering the severity to medium. Nick beat me to setting "blocker-".

Comment 5 Brenton Leanhardt 2021-02-04 18:45:07 UTC
We're still planning to fix this.

Comment 6 Jeremiah Stuever 2021-04-30 16:58:34 UTC
I split out the datacenter portion of this bug because it will be significantly easier to fix. As for the port groups (networks), our vsphereprivate Terraform provider complicates the fix because we are matching the name of the network without regard to path. Still working to find a solution here.

Comment 7 Jeremiah Stuever 2021-05-19 17:21:21 UTC
I have looked at this from various angles. The difficulty is that the vsphereprivate provider selects a host that has both the datastore and the network, and during this process we only have the network name (excluding path). At this point, I think the best path forward is to proceed with CORS-1476 and deprecate vsphereprivate in favor of importing the OVA with the upstream provider (see the sketch after the links below). I believe that is in turn blocked by the work in CORS-1511 to upgrade Terraform in general.

https://issues.redhat.com/browse/CORS-1476
https://issues.redhat.com/browse/CORS-1511
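
For reference, a rough sketch of an OVA import using the upstream provider's ovf_deploy support, i.e. the direction CORS-1476 points at (the resource names and URL are hypothetical, not the installer's actual configuration):

resource "vsphere_virtual_machine" "rhcos" {
  name             = "rhcos-template"
  resource_pool_id = data.vsphere_resource_pool.pool.id
  datastore_id     = data.vsphere_datastore.ds.id
  datacenter_id    = data.vsphere_datacenter.dc.id
  host_system_id   = data.vsphere_host.host.id

  ovf_deploy {
    remote_ovf_url = "https://example.com/rhcos.ova"  # hypothetical URL
    # Map the OVA's network name to an unambiguous network ID instead of
    # relying on a bare name lookup.
    ovf_network_map = {
      "VM Network" = data.vsphere_network.network.id
    }
  }
}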

Comment 8 Russell Teague 2021-07-12 17:32:20 UTC
Still waiting for the Terraform upgrade.

Comment 9 Russell Teague 2021-08-02 17:21:58 UTC
Still waiting for the Terraform upgrade.

Comment 19 Patrick Dillon 2022-03-11 19:31:17 UTC
Fix is still in progress.

Comment 22 jima 2022-04-08 01:07:16 UTC
Verified on a QE local vSphere env that has a standard port group and a distributed port group with the same name (VM Network).

Reproduced the issue on 4.11.0-0.nightly-2022-03-26-130745.

$ ./openshift-install create cluster --dir ipi --log-level debug
...
INFO Creating infrastructure resources...         
...           
DEBUG [INFO] running Terraform command: /tmp/openshift-install-pre-bootstrap-909263706/bin/terraform init -no-color -force-copy -input=false -backend=true -get=true -upgrade=false -plugin-dir=/tmp/openshift-install-pre-bootstrap-909263706/plugins 
..
DEBUG                                              
DEBUG Terraform has been successfully initialized! 
DEBUG [INFO] running Terraform command: /tmp/openshift-install-pre-bootstrap-909263706/bin/terraform apply -no-color -auto-approve -input=false -var-file=/tmp/openshift-install-pre-bootstrap-909263706/terraform.tfvars.json -var-file=/tmp/openshift-install-pre-bootstrap-909263706/terraform.platform.auto.tfvars.json -lock=true -parallelism=10 -refresh=true 
ERROR                                              
ERROR Error: error fetching network: path 'VM Network' resolves to multiple networks, Please specify 
ERROR                                              
ERROR   with data.vsphere_network.network,         
ERROR   on main.tf line 38, in data "vsphere_network" "network": 
ERROR   38: data "vsphere_network" "network" {     
ERROR                                              
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: exit status 1 
FATAL                                              
FATAL Error: error fetching network: path 'VM Network' resolves to multiple networks, Please specify 
FATAL                                              
FATAL   with data.vsphere_network.network,         
FATAL   on main.tf line 38, in data "vsphere_network" "network": 
FATAL   38: data "vsphere_network" "network" {     
FATAL                                              
FATAL                                              


Verified on 4.11.0-0.nightly-2022-04-06-213816; the installer part passed.
$ ./openshift-install create cluster --dir ipi1 --log-level debug
INFO Creating infrastructure resources...         
...                       
DEBUG [INFO] running Terraform command: /tmp/openshift-install-pre-bootstrap-3262909223/bin/terraform init -no-color -force-copy -input=false -backend=true -get=true -upgrade=false -plugin-dir=/tmp/openshift-install-pre-bootstrap-3262909223/plugins 
...
DEBUG Terraform has been successfully initialized! 
DEBUG [INFO] running Terraform command: /tmp/openshift-install-pre-bootstrap-3262909223/bin/terraform apply -no-color -auto-approve -input=false -var-file=/tmp/openshift-install-pre-bootstrap-3262909223/terraform.tfvars.json -var-file=/tmp/openshift-install-pre-bootstrap-3262909223/terraform.platform.auto.tfvars.json -lock=true -parallelism=10 -refresh=true 
DEBUG                                              
DEBUG Terraform used the selected providers to generate the following execution 
DEBUG plan. Resource actions are indicated with the following symbols: 
DEBUG   + create                                   
DEBUG  <= read (data resources)                    
DEBUG                                              
DEBUG Terraform will perform the following actions: 
DEBUG                                              

This BZ also contains a machine-api PR (machine-api-operator#961); the machine-api side needs to be verified before moving the bug to "VERIFIED" status.
Installation on the local env is currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=2073021; will check machine-api function when BZ#2073021 is fixed.

Comment 23 Patrick Dillon 2022-04-11 19:54:24 UTC
The PR for this BZ also fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2063829

Comment 24 sunzhaohua 2022-04-19 04:15:25 UTC
Verified the machine-api part. Ran some regression testing for machine scale up/down; all works well.
Add workload:
$ oc get machine
NAME                                PHASE     TYPE   REGION   ZONE   AGE
reliability01-vb569-master-0        Running                          109m
reliability01-vb569-master-1        Running                          109m
reliability01-vb569-master-2        Running                          109m
reliability01-vb569-worker-8sthp    Running                          96m
reliability01-vb569-worker-b2nxs    Running                          96m
reliability01-vb569-worker-x4h5x    Running                          96m
reliability01-vb569-worker1-255bx   Running                          12m
reliability01-vb569-worker1-c9fn4   Running                          12m
reliability01-vb569-worker1-snwc4   Running                          24m

Remove workload:
$ oc get machine
NAME                                PHASE     TYPE   REGION   ZONE   AGE
reliability01-vb569-master-0        Running                          116m
reliability01-vb569-master-1        Running                          116m
reliability01-vb569-master-2        Running                          116m
reliability01-vb569-worker-8sthp    Running                          103m
reliability01-vb569-worker-b2nxs    Running                          103m
reliability01-vb569-worker-x4h5x    Running                          103m
reliability01-vb569-worker1-255bx   Running                          19m

Comment 25 jima 2022-04-19 05:20:40 UTC
Thanks Zhaohua. Based on comment #22 and comment #24, moving the bug to VERIFIED.

Comment 34 errata-xmlrpc 2022-08-10 10:35:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069