Bug 1868755 - [vsphere] terraform provider vsphereprivate crashes when network is unavailable on host
Summary: [vsphere] terraform provider vsphereprivate crashes when network is unavailab...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jeremiah Stuever
QA Contact: jima
URL:
Whiteboard:
Depends On:
Blocks: 1874240
 
Reported: 2020-08-13 18:01 UTC by Joseph Callen
Modified: 2021-07-27 22:33 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1874240 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:32:47 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift_install.log (13.91 KB, application/gzip)
2020-08-13 18:01 UTC, Joseph Callen


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 4678 0 None open Bug 1868755: vsphereprivate: tf plugin to no longer error if no network found. 2021-02-22 23:25:13 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:33:31 UTC

Description Joseph Callen 2020-08-13 18:01:03 UTC
Created attachment 1711354 [details]
openshift_install.log

Description of problem:

This is certainly an issue, but I am not yet sure which vSphere cluster configuration causes it. While I was working with Chris (cahl), he hit a Terraform crash when selecting "VM Network". This network is valid, but it had no uplinks on the vSwitch.

Log snippet:

time="2020-08-12T14:26:53-04:00" level=debug msg="2020/08/12 14:26:53 [DEBUG] vsphereprivate_import_ova.import: applying the planned Create change"
time="2020-08-12T14:26:53-04:00" level=debug msg="2020-08-12T14:26:53.949-0400 [DEBUG] plugin.terraform-provider-vsphereprivate: 2020/08/12 14:26:53 [DEBUG] /Users/cahl/Library/Caches/openshift-installer/image_cache/abc7fccbe43d10b0fa665c80e3865ac7: Beginning import ova create"
time="2020-08-12T14:26:54-04:00" level=debug msg="2020-08-12T14:26:54.847-0400 [DEBUG] plugin.terraform-provider-vsphereprivate: panic: runtime error: invalid memory address or nil pointer dereference"
time="2020-08-12T14:26:54-04:00" level=debug msg="2020-08-12T14:26:54.847-0400 [DEBUG] plugin.terraform-provider-vsphereprivate: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa92f677]"
time="2020-08-12T14:26:54-04:00" level=debug msg="2020-08-12T14:26:54.847-0400 [DEBUG] plugin.terraform-provider-vsphereprivate: "
time="2020-08-12T14:26:54-04:00" level=debug msg="2020-08-12T14:26:54.847-0400 [DEBUG] plugin.terraform-provider-vsphereprivate: goroutine 61 [running]:"
time="2020-08-12T14:26:54-04:00" level=debug msg="2020-08-12T14:26:54.847-0400 [DEBUG] plugin.terraform-provider-vsphereprivate: github.com/openshift/installer/pkg/terraform/exec/plugins/vsphereprivate.findImportOvaParams(0xc000257080, 0xc000b94210, 0xb, 0xc000b941f0, 0xe, 0xc000b94230, 0xd, 0xc000b94260, 0xa, 0xc000b96240, ...)"
time="2020-08-12T14:26:54-04:00" level=debug msg="2020-08-12T14:26:54.847-0400 [DEBUG] plugin.terraform-provider-vsphereprivate: \t/Users/cahl/go/src/github.com/openshift/installer/pkg/terraform/exec/plugins/vsphereprivate/resource_vsphereprivate_import_ova.go:199 +0xb47"


Port Groups for cluster:

govc  ls -L -t DistributedVirtualPortgroup '*'  
/CICD-Plot01/network/cluster-vlan-canary-i2e-lv
/CICD-Plot01/network/Plot01 Cluster N-DVUplinks-376

govc ls -L -t Network '*' 
/CICD-Plot01/network/VM Network
/CICD-Plot01/network/Public Network


And where the crash happens:

https://github.com/openshift/installer/blob/master/pkg/terraform/exec/plugins/vsphereprivate/resource_vsphereprivate_import_ova.go#L198-L203
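
For readers unfamiliar with this failure mode, here is a minimal Go sketch of how an unchecked lookup can produce the panic text seen in the log above, and how a nil check turns it into an ordinary error. This is not the installer's code: `Host`, `lookupHost`, `hostNameUnchecked`, and `hostNameChecked` are hypothetical names, the host data is made up, and which value is actually nil at the linked line has not been confirmed here.

```go
package main

import (
	"errors"
	"fmt"
)

// Host is a hypothetical stand-in for a vSphere host object.
type Host struct {
	Name string
}

// lookupHost returns the host attached to the given network, or nil when
// no host is attached to it. The map is illustrative data only.
func lookupHost(hostsByNetwork map[string]*Host, network string) *Host {
	return hostsByNetwork[network]
}

// hostNameUnchecked dereferences the lookup result without a nil check.
// The deferred recover turns the resulting panic into an error so the
// example can print it instead of crashing the process.
func hostNameUnchecked(hostsByNetwork map[string]*Host, network string) (name string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return lookupHost(hostsByNetwork, network).Name, nil
}

// hostNameChecked guards the dereference and returns an ordinary error.
func hostNameChecked(hostsByNetwork map[string]*Host, network string) (string, error) {
	h := lookupHost(hostsByNetwork, network)
	if h == nil {
		return "", errors.New("no host found for network " + network)
	}
	return h.Name, nil
}

func main() {
	hosts := map[string]*Host{"Public Network": {Name: "esxi-1"}}

	_, err := hostNameUnchecked(hosts, "VM Network")
	fmt.Println(err)

	_, err = hostNameChecked(hosts, "VM Network")
	fmt.Println(err)
}
```

In a Terraform plugin there is no such recover around the resource code, so the unchecked variant would take the whole plugin process down, which matches the SIGSEGV in the log.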

Comment 1 Joseph Callen 2020-08-13 18:25:19 UTC
Chris built the installer from the release-4.5 branch; when I checked, it was at the most recent commit.

I also asked for access to the cluster; we will see if I can get it.

Comment 2 Abhinav Dahiya 2020-08-17 16:50:23 UTC
Can you include the steps to reproduce this on our end?

Some details that would be useful:

- How is the environment set up? Is there anything in the environment that might be causing this?
- The exact steps you used to install the cluster.
- The install-config.yaml, with the vCenter password removed.

Comment 3 cahl 2020-08-19 12:35:17 UTC
The vsphere setup was done by a member of the advanced cluster management (ACM) team.

One thing I noticed: even though openshift-installer gives 4 options for choosing the network (as shown by the govc commands previously), the vSphere client UI shows the following for the datacenter:
Networks:
Public Network
VM Network

Distributed Port Groups:
cluster-vlan-canary-i2e-lv

Uplink Port Groups:
Plot01 Cluster N-DVUplinks-376



The cluster shows the following Networks:
cluster-vlan-canary-i2e-lv
Plot01 Cluster N-DVUplinks-376
Public Network

The VM Network choice is not listed for the cluster, so maybe this is the reason for the exception.



Note that I am currently using the Public Network option and have been able to deploy.

Comment 8 Abhinav Dahiya 2020-10-12 17:37:23 UTC
Can you help us by providing some reproduction steps, that is, things we can do to set up our environment to reproduce this?

Comment 11 cahl 2020-11-02 20:17:09 UTC
I provided info previously that showed the VMware setup.  Here it is again:

```
The vsphere setup was done by a member of the advanced cluster management (ACM) team.

One thing I noticed: even though openshift-installer gives 4 options for choosing the network (as shown by the govc commands previously), the vSphere client UI shows the following for the datacenter:
Networks:
Public Network
VM Network

Distributed Port Groups:
cluster-vlan-canary-i2e-lv

Uplink Port Groups:
Plot01 Cluster N-DVUplinks-376



The cluster shows the following Networks:
cluster-vlan-canary-i2e-lv
Plot01 Cluster N-DVUplinks-376
Public Network

The VM Network choice is not listed for the cluster, so maybe this is the reason for the exception.



Note that I am currently using the Public Network option and have been able to deploy.
```

I was using the `VM Network` option because it was shown as valid, but picking it caused the error. Only when I used `Public Network` did it work. So it appears that the error occurs when a network is defined for the datacenter but not for the cluster.
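
To make the datacenter-versus-cluster mismatch concrete, here is a rough Go sketch that keeps only the datacenter networks attached to at least one host in the cluster. The network names are the ones listed above; the function `selectableNetworks`, the host name `esxi-1`, and the per-host network assignment are assumptions for illustration, not data from this environment or installer code.

```go
package main

import (
	"fmt"
	"sort"
)

// selectableNetworks returns the datacenter networks that at least one
// cluster host is attached to, sorted by name. hostNetworks maps a host
// name to the networks visible on that host (hypothetical data shape).
func selectableNetworks(datacenter []string, hostNetworks map[string][]string) []string {
	attached := make(map[string]bool)
	for _, nets := range hostNetworks {
		for _, n := range nets {
			attached[n] = true
		}
	}

	var out []string
	for _, n := range datacenter {
		if attached[n] {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// The four networks reported for the datacenter in this bug.
	datacenter := []string{
		"cluster-vlan-canary-i2e-lv",
		"Plot01 Cluster N-DVUplinks-376",
		"VM Network",
		"Public Network",
	}
	// Assumed: a single host that sees the three networks listed for the cluster.
	hostNetworks := map[string][]string{
		"esxi-1": {
			"cluster-vlan-canary-i2e-lv",
			"Plot01 Cluster N-DVUplinks-376",
			"Public Network",
		},
	}

	for _, n := range selectableNetworks(datacenter, hostNetworks) {
		fmt.Println(n)
	}
}
```

With this data, `VM Network` is filtered out because no cluster host is attached to it, which is the situation described in this comment.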

Comment 12 Brenton Leanhardt 2020-11-30 18:30:39 UTC
We plan to get to this in 4.7.

Comment 13 Scott Dodson 2020-12-09 18:44:59 UTC
Given that we don't believe it's possible to yield success in this scenario, this is mostly about improving an error message, so we're no longer marking this for 4.7.

Comment 17 jima 2021-03-02 12:26:36 UTC
Verified on Jeremiah's environment, since QE doesn't have such a specific environment; it passed.

Reproduced the issue on 4.7.0-0.nightly-2021-03-01-085007:
Set TF_LOG to DEBUG and ran openshift-install to create a cluster; got the same nil pointer error as in the Description.
time="2021-03-02T07:10:04-05:00" level=debug msg="2021-03-02T07:10:04.434-0500 [DEBUG] plugin.terraform-provider-vsphereprivate: 2021/03/02 07:10:04 [DEBUG] /home/admin/.cache/openshift-installer/image_cache/3b90b8f621548d33b166787e8d70207d: Beginning import ova create"
time="2021-03-02T07:10:04-05:00" level=debug msg="2021-03-02T07:10:04.485-0500 [DEBUG] plugin.terraform-provider-vsphereprivate: panic: runtime error: invalid memory address or nil pointer dereference"
time="2021-03-02T07:10:04-05:00" level=debug msg="2021-03-02T07:10:04.485-0500 [DEBUG] plugin.terraform-provider-vsphereprivate: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xabc7ece]"
...
time="2021-03-02T07:10:04-05:00" level=debug msg="2021/03/02 07:10:04 [DEBUG] vsphereprivate_import_ova.import: apply errored, but we're indicating that via the Error pointer rather than returning it: rpc error: code = Canceled desc = context canceled"
time="2021-03-02T07:10:04-05:00" level=debug msg="2021/03/02 07:10:04 [ERROR] <root>: eval: *terraform.EvalApplyPost, err: rpc error: code = Canceled desc = context canceled"
time="2021-03-02T07:10:04-05:00" level=debug msg="2021/03/02 07:10:04 [ERROR] <root>: eval: *terraform.EvalSequence, err: rpc error: code = Canceled desc = context canceled"
time="2021-03-02T07:10:04-05:00" level=debug msg="2021-03-02T07:10:04.501-0500 [DEBUG] plugin: plugin process exited: path=/tmp/openshift-install-803644701/plugins/terraform-provider-vsphereprivate pid=2462 error=\"exit status 2\""
time="2021-03-02T07:10:04-05:00" level=debug msg="2021-03-02T07:10:04.501-0500 [WARN]  plugin.stdio: received EOF, stopping recv loop: err=\"rpc error: code = Unavailable desc = transport is closing\""
time="2021-03-02T07:10:04-05:00" level=error
time="2021-03-02T07:10:04-05:00" level=error msg="Error: rpc error: code = Canceled desc = context canceled"
time="2021-03-02T07:10:04-05:00" level=error
time="2021-03-02T07:10:04-05:00" level=error
time="2021-03-02T07:10:04-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"


The same command run on nightly build 4.8.0-0.nightly-2021-03-01-143026 reports a detailed error message explaining why cluster creation failed:

time="2021-03-02T06:54:03-05:00" level=debug msg="vsphereprivate_import_ova.import: Creating..."
time="2021-03-02T06:54:03-05:00" level=error
time="2021-03-02T06:54:03-05:00" level=error msg="Error: failed to find provided vSphere objects: failed to find a host in the cluster that contains the provided network"
time="2021-03-02T06:54:03-05:00" level=error
time="2021-03-02T06:54:03-05:00" level=error msg="  on ../../../tmp/openshift-install-781371026/main.tf line 43, in resource \"vsphereprivate_import_ova\" \"import\":"
time="2021-03-02T06:54:03-05:00" level=error msg="  43: resource \"vsphereprivate_import_ova\" \"import\" {"
time="2021-03-02T06:54:03-05:00" level=error
time="2021-03-02T06:54:03-05:00" level=error
time="2021-03-02T06:54:03-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
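
The two-part error text in the 4.8 log above is consistent with an inner error being wrapped by its caller. A minimal Go sketch of that pattern follows. Only the two message strings are taken from the log; the names `host`, `findHostForNetwork`, `findImportParams`, and `isNoHostError`, and the use of `fmt.Errorf` with `%w`, are assumptions and may differ from what the fix in pull 4678 actually does.

```go
package main

import (
	"errors"
	"fmt"
)

// host is a hypothetical stand-in for a cluster host and its networks.
type host struct {
	name     string
	networks []string
}

// errNoHostForNetwork carries the inner message seen in the 4.8 log.
var errNoHostForNetwork = errors.New("failed to find a host in the cluster that contains the provided network")

// findHostForNetwork returns the name of the first host attached to the
// network, or errNoHostForNetwork when there is none.
func findHostForNetwork(hosts []host, network string) (string, error) {
	for _, h := range hosts {
		for _, n := range h.networks {
			if n == network {
				return h.name, nil
			}
		}
	}
	return "", errNoHostForNetwork
}

// findImportParams wraps the lookup error with caller context, producing
// the "outer: inner" shape of the logged message.
func findImportParams(hosts []host, network string) (string, error) {
	name, err := findHostForNetwork(hosts, network)
	if err != nil {
		return "", fmt.Errorf("failed to find provided vSphere objects: %w", err)
	}
	return name, nil
}

// isNoHostError reports whether err wraps errNoHostForNetwork.
func isNoHostError(err error) bool {
	return errors.Is(err, errNoHostForNetwork)
}

func main() {
	hosts := []host{{name: "esxi-1", networks: []string{"Public Network"}}}

	if _, err := findImportParams(hosts, "VM Network"); err != nil {
		fmt.Println("Error:", err)
		fmt.Println("no host for network:", isNoHostError(err))
	}
}
```

Wrapping with `%w` keeps the inner error inspectable through `errors.Is`, so callers can still distinguish this case while users see the full message.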

Comment 20 errata-xmlrpc 2021-07-27 22:32:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

