Bug 1990663

Summary: [Assisted-4.8][SaaS][vsphere] cluster deployment fails when using OpenShiftSDN and network adapter vmxnet3
Product: OpenShift Container Platform Reporter: Yuri Obshansky <yobshans>
Component: Networking Assignee: Nadia Pinaeva <npinaeva>
Networking sub component: openshift-sdn QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aconstan, aos-bugs, astoycos, lgamliel, sasha, sdodson
Version: 4.8 Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.8.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.8.9 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-01-11 22:31:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1998106    
Bug Blocks:    
Attachments:
Description Flags
installation logs none

Description Yuri Obshansky 2021-08-05 21:59:11 UTC
Created attachment 1811312 [details]
installation logs


Description of problem:
Cluster deployment on a VMware vSphere setup failed with:
"Failed installing cluster qe1. Reason: Timeout while waiting for cluster version to be available: context deadline exceeded"
The underlying failure is that CVO could not be installed:
Operator cvo status: progressing message: Unable to apply 4.8.2: an unknown error has occurred: MultipleErrors

The cluster was deployed with the following platform configuration (values omitted):
    "platform": {
        "type": "baremetal",
        "vsphere": {
            "cluster": 
            "datacenter": 
            "defaultDatastore": 
            "folder": 
            "network": 
            "password":
            "username": 
            "vCenter": 
        }
    },

Another error appears in the agent log:
Aug 05 18:09:28 master-0.qe1.e2e.bos.redhat.com domain_resoluti[2854]: time="05-08-2021 18:09:28" level=error msg="error occurred during domain resolution of api-int.qe1.e2e.bos.redhat.com" file="domain_resolution.go:33" error="lookup api-int.qe1.e2e.bos.redhat.com on 10.19.143.247:53: no such host"
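
To check from a given host whether the cluster DNS names resolve at all, a minimal sketch follows. It assumes the usual api, api-int and *.apps names; the "test" label used to probe the wildcard record and the helper names are hypothetical, and a failed lookup here does not by itself prove that DNS caused the installation failure.

```python
import socket


def required_records(cluster, base_domain):
    """Build the DNS names to probe for a cluster.

    The "test" label is an arbitrary (hypothetical) name used to
    exercise the *.apps wildcard record.
    """
    suffix = "{}.{}".format(cluster, base_domain)
    return [
        "api." + suffix,
        "api-int." + suffix,
        "test.apps." + suffix,
    ]


def unresolved_hosts(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames for which the resolver raises gaierror.

    The resolver is injectable so the check can be exercised without
    network access; by default the system resolver is used.
    """
    missing = []
    for name in hostnames:
        try:
            resolver(name, None)
        except socket.gaierror:
            missing.append(name)
    return missing


if __name__ == "__main__":
    names = required_records("qe1", "e2e.bos.redhat.com")
    for name in unresolved_hosts(names):
        print("does not resolve:", name)
```

Running the script on a node prints only the names that the system resolver cannot find.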

Version-Release number of selected component (if applicable):
v1.0.24.1

How reproducible:


Steps to Reproduce:
1. Install cluster on VMware setup
2. Cluster failed
3.

Actual results:


Expected results:


Additional info:

Comment 2 Yuri Obshansky 2021-08-06 16:50:54 UTC
Update: SNO cluster deployed successfully:

8/6/2021, 12:17:08 PM	Successfully finished installing cluster qe1
8/6/2021, 12:17:08 PM	Updated status of cluster qe1 to installed
8/6/2021, 12:15:19 PM	Operator console status: available message: All is well
8/6/2021, 12:13:18 PM	Operator cvo status: available message: Done applying 4.8.2
8/6/2021, 12:12:18 PM	Operator cvo status: progressing message: Working towards 4.8.2: downloading update

It looks like the problem occurs only in high_availability mode.

Comment 3 liat gamliel 2021-08-09 08:18:18 UTC
Slack thread: https://coreos.slack.com/archives/CUPJTHQ5P/p1628200825352800

Comment 4 Yuri Obshansky 2021-08-09 20:32:41 UTC
Updates:
1. The previous failure occurred with vip_dhcp_allocation=true, which is not supported yet.
As a result, ingress_vip and api_vip were not allocated correctly.
Issue reported: https://issues.redhat.com/browse/MGMT-7117

2. The latest failure happened with vip_dhcp_allocation=false
and with correct ingress_vip and api_vip values provided:
api.qe1.e2e.bos.redhat.com has address 10.19.114.250
*.apps.qe1.e2e.bos.redhat.com has address 10.19.114.251
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/470d3631-c13e-42d5-b1c3-e6b162d26a98
See attached installation logs

Comment 6 Yuri Obshansky 2021-08-12 14:05:13 UTC
OVNKubernetes + vmxnet3 - OK
OpenShiftSDN + vmxnet3 - Failed
OpenShiftSDN + e1000 - OK
OpenShiftSDN + vmxnet3 + version vmx-15 - Failed

Comment 7 Nadia Pinaeva 2021-08-25 16:47:06 UTC
This seems to be fixed by the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108
There are a couple of bugs with similar vSphere vmxnet3 failures; we will see how that one is resolved.

Comment 8 Yuri Obshansky 2021-08-27 12:39:41 UTC
The workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108 did not work.
Our workaround is to boot the vSphere VMs with a lower VMware hardware version.
Current settings:
[root@rh8-tools yuri]# govc vm.option.info -cluster "e2e" -json | grep HwVersion
    "HwVersion": 17,
The working HwVersion is 13.
So we should create the VM with the parameter -version=6.5.
Example:
govc vm.create -version=6.5 -net.adapter vmxnet3 -disk.controller pvscsi -c=16 -m=32768 -disk=120GB -disk-datastore=aos-vsphere -net.address="00:50:56:83:eb:fc" -iso-datastore=aos-vsphere -iso="discovery_image_qe1.iso" -folder="e2e-qe" master-0.qe1.e2e.bos.redhat.com 
This works for both network types, OpenShiftSDN and OVNKubernetes.
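
As an alternative to grepping the JSON, a small sketch that pulls every "HwVersion" value out of the output of govc vm.option.info -json and flags those above the working version. It searches recursively because the exact layout of that JSON is not shown in this report; the sample document in the code is hypothetical, and 13 is the only working value confirmed here.

```python
import json

# The only hardware version confirmed as working in this report
# (created with govc vm.create -version=6.5).
WORKING_HW_VERSION = 13


def find_hw_versions(node):
    """Recursively collect every "HwVersion" value in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "HwVersion":
                found.append(value)
            else:
                found.extend(find_hw_versions(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_hw_versions(item))
    return found


def above_working(versions, working=WORKING_HW_VERSION):
    """Return the versions that are newer than the known-working one."""
    return [v for v in versions if v > working]


if __name__ == "__main__":
    # Hypothetical sample; in practice, load the saved govc output instead.
    sample = json.loads('{"Options": {"HwVersion": 17}}')
    versions = find_hw_versions(sample)
    print("found:", versions, "above working:", above_working(versions))
```

The script could be fed a file saved from the govc command above by replacing the sample with json.load on that file.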

Comment 9 Scott Dodson 2021-08-27 20:29:53 UTC
(In reply to Yuri Obshansky from comment #8)
> Workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108 did not
> work.
> Our workaround is to boot vsphere vms with lower version of VMware Hardware.
> Current settings: 
> [root@rh8-tools yuri]# govc vm.option.info -cluster "e2e" -json | grep
> HwVersion
>     "HwVersion": 17,
> The working HwVersion is 13.
> So, we should start vm with parameter -version=6.5
> Example:
> govc vm.create -version=6.5 -net.adapter vmxnet3 -disk.controller pvscsi
> -c=16 -m=32768 -disk=120GB -disk-datastore=aos-vsphere
> -net.address="00:50:56:83:eb:fc" -iso-datastore=aos-vsphere
> -iso="discovery_image_qe1.iso" -folder="e2e-qe"
> master-0.qe1.e2e.bos.redhat.com 
> Works for both networks OpenShiftSDN and OVNKubernetes as well.

Is it understood why that didn't work? Are you saying that you've tried with 4.8.8 and latest 4.9 nightlies or you did some other implementation?

Comment 10 Yuri Obshansky 2021-08-27 20:43:18 UTC
sdodson
No, it is not understood.
We just disabled "tx-checksum-ip-generic" on all VMs, as suggested in the bug linked above, and used image 4.8.2.
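
For reference, a hedged sketch of how the offload setting can be turned off per interface with ethtool. The interface name ens192 is an assumption (substitute the vmxnet3 interface actually present on the node), the helper names are hypothetical, and the change requires root and does not persist across reboots.

```python
import subprocess


def disable_tx_checksum_cmd(interface):
    """Build the ethtool command that turns off tx-checksum-ip-generic."""
    return ["ethtool", "-K", interface, "tx-checksum-ip-generic", "off"]


def apply(interface, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    cmd = disable_tx_checksum_cmd(interface)
    if not dry_run:
        # Raises CalledProcessError if ethtool exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "ens192" is an assumed interface name, not taken from this report.
    print(" ".join(apply("ens192", dry_run=True)))
```

With dry_run=True the script only prints the command, which makes it easy to review before running it on every VM.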

Comment 11 Scott Dodson 2021-08-28 02:42:01 UTC
(In reply to Yuri Obshansky from comment #10)
> sdodson
> No, It is not understood.
> We just disabled "tx-checksum-ip-generic" on all VMs as suggested in above
> and use image 4.8.2

I would've expected that to work but it would be good to test 4.8.8 and see if that fixes the problem.

Comment 12 Nadia Pinaeva 2021-08-30 10:38:16 UTC
Target release is set to 4.9, and we're very close to code freeze. Should we change it to ---, or what's the plan?

Comment 13 Nadia Pinaeva 2021-08-31 13:14:47 UTC
We decided to wait for the next 4.8 release that includes the workaround and then test whether it works. Tracking: https://bugzilla.redhat.com/show_bug.cgi?id=1998106

Comment 14 Yuri Obshansky 2021-09-07 17:35:09 UTC
Verified on Staging UI 1.5.35 and BE v1.0.25.3

image 4.8.9 - Passed -> 
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/8c205e19-d683-4f24-b2db-8c0b78c0b8b8
    "name": "qe1",
    "network_type": "OpenShiftSDN",
    "ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.8.9-x86_64",
    "openshift_cluster_id": "879dd939-a12d-46d2-a392-e8c163daa5f3",
    "openshift_version": "4.8.9",
    "org_id": "13539309",
    "platform": {
        "type": "vsphere",
        "vsphere": {}
    },
    "progress": {
        "finalizing_stage_percentage": 100,
        "installing_stage_percentage": 100,
        "preparing_for_installation_stage_percentage": 100,
        "total_percentage": 100
    },
    "status": "installed",
    "status_info": "Cluster is installed",
    "status_updated_at": "2021-09-07T17:18:02.234Z",

image 4.9.0-fc.0 - Failed ->
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/c542e994-0d0e-4fdc-9935-d375e18e2923
    "name": "qe1",
    "network_type": "OpenShiftSDN",
    "ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.9.0-fc.0-x86_64",
    "openshift_cluster_id": "28f4794f-0227-4396-8ff4-1e40d4ee5514",
    "openshift_version": "4.9.0-fc.0",
    "org_id": "13539309",
    "platform": {
        "type": "vsphere",
        "vsphere": {}
    },
    "progress": {
        "installing_stage_percentage": 100,
        "preparing_for_installation_stage_percentage": 100,
        "total_percentage": 80
    },
    "status": "error",
    "status_info": "Timeout while waiting for cluster version to be available: context deadline exceeded",
    "status_updated_at": "2021-09-07T16:19:36.175Z",

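The two results above can be compared side by side with a short sketch like the following. It assumes each cluster record has been saved as a complete JSON object; the field names are the ones shown in the fragments above, and the summarize helper is hypothetical.

```python
def summarize(cluster):
    """Return a one-line summary of a cluster record."""
    progress = cluster.get("progress", {})
    return "{} / {}: {} ({}%)".format(
        cluster.get("openshift_version"),
        cluster.get("network_type"),
        cluster.get("status"),
        progress.get("total_percentage"),
    )


# Values copied from the two fragments in this comment.
passed = {
    "network_type": "OpenShiftSDN",
    "openshift_version": "4.8.9",
    "progress": {"total_percentage": 100},
    "status": "installed",
}
failed = {
    "network_type": "OpenShiftSDN",
    "openshift_version": "4.9.0-fc.0",
    "progress": {"total_percentage": 80},
    "status": "error",
}

if __name__ == "__main__":
    for record in (passed, failed):
        print(summarize(record))
```

The same helper works on records loaded with json.load, since it only reads keys and tolerates a missing progress block.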
Comment 15 Nadia Pinaeva 2021-09-08 07:46:34 UTC
Are you sure 4.9.0-fc.0 has the fix?
Release page says it's been created at 2021-08-20 12:29:17 +0000 UTC, and PR # is not in the list

Comment 16 Yuri Obshansky 2021-09-08 12:13:09 UTC
(In reply to Nadia Pinaeva from comment #15)
> Are you sure 4.9.0-fc.0 has the fix?
> Release page says it's been created at 2021-08-20 12:29:17 +0000 UTC, and PR
> # is not in the list

Hi Nadia, 

It looks like the fix is in 4.9.0-fc.1:
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.9.0-fc.1

I will verify it again when 4.9.0-fc.1 is released on the Staging environment.

Comment 17 zhaozhanqi 2021-09-09 07:48:36 UTC
Hi Yuri Obshansky, I assigned the QA contact to you. Thanks.

Comment 18 Yuri Obshansky 2021-09-09 12:59:41 UTC
Verified on Staging UI 1.5.35 and BE v1.0.25.3

image 4.8.9 - Passed -> 
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/8c205e19-d683-4f24-b2db-8c0b78c0b8b8
    "name": "qe1",
    "network_type": "OpenShiftSDN",
    "ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.8.9-x86_64",
    "openshift_cluster_id": "879dd939-a12d-46d2-a392-e8c163daa5f3",
    "openshift_version": "4.8.9",
    "org_id": "13539309",
    "platform": {
        "type": "vsphere",
        "vsphere": {}
    },
    "progress": {
        "finalizing_stage_percentage": 100,
        "installing_stage_percentage": 100,
        "preparing_for_installation_stage_percentage": 100,
        "total_percentage": 100
    },
    "status": "installed",
    "status_info": "Cluster is installed",
    "status_updated_at": "2021-09-07T17:18:02.234Z",

Comment 21 errata-xmlrpc 2022-01-11 22:31:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.26 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0021