Bug 1990663 - [Assisted-4.8 ][SaaS][vsphere] cluster deployment failed when use OpenShiftSDN and network adapter vmxnet3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.z
Assignee: Nadia Pinaeva
QA Contact: Yuri Obshansky
URL:
Whiteboard:
Depends On: 1998106
Blocks:
 
Reported: 2021-08-05 21:59 UTC by Yuri Obshansky
Modified: 2022-01-11 22:31 UTC (History)

Fixed In Version: 4.8.9
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-11 22:31:15 UTC
Target Upstream Version:
Embargoed:


Attachments
installation logs (80.50 KB, application/x-tar)
2021-08-05 21:59 UTC, Yuri Obshansky


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2022:0021 0 None None None 2022-01-11 22:31:35 UTC

Internal Links: 2016645

Description Yuri Obshansky 2021-08-05 21:59:11 UTC
Created attachment 1811312 [details]
installation logs


Description of problem:
Cluster deployment on a VMware vSphere setup failed with:
"Failed installing cluster qe1. Reason: Timeout while waiting for cluster version to be available: context deadline exceeded"
The actual failure is that the CVO could not be installed:
Operator cvo status: progressing message: Unable to apply 4.8.2: an unknown error has occurred: MultipleErrors

The cluster was deployed with:
    "platform": {
        "type": "baremetal",
        "vsphere": {
            "cluster": 
            "datacenter": 
            "defaultDatastore": 
            "folder": 
            "network": 
            "password":
            "username": 
            "vCenter": 
        }
    },

Another error, from the agent log:
Aug 05 18:09:28 master-0.qe1.e2e.bos.redhat.com domain_resoluti[2854]: time="05-08-2021 18:09:28" level=error msg="error occurred during domain resolution of api-int.qe1.e2e.bos.redhat.com" file="domain_resolution.go:33" error="lookup api-int.qe1.e2e.bos.redhat.com on 10.19.143.247:53: no such host"
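The agent error above means that api-int.qe1.e2e.bos.redhat.com had no record on the DNS server the host queried (10.19.143.247). A minimal sketch for re-checking the cluster names from a machine on the same network, assuming Python 3 is available there. The helper name `resolve_host` and its injectable `resolver` parameter are illustrative and not part of the assisted installer or its agent:

```python
import socket


def resolve_host(hostname, resolver=socket.getaddrinfo):
    """Return the sorted unique addresses for hostname, or [] if it does not resolve.

    `resolver` defaults to socket.getaddrinfo. It is injectable only so the
    helper can be exercised without a live DNS server (illustrative design).
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # Corresponds to the "no such host" case seen in the agent log.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({item[4][0] for item in results})
```

Calling `resolve_host("api-int.qe1.e2e.bos.redhat.com")` returns an empty list for as long as the record is missing, and the list of addresses once it exists.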

Version-Release number of selected component (if applicable):
v1.0.24.1

How reproducible:


Steps to Reproduce:
1. Install a cluster on a VMware setup
2. Cluster installation fails

Actual results:


Expected results:


Additional info:

Comment 2 Yuri Obshansky 2021-08-06 16:50:54 UTC
Update: SNO cluster deployed successfully:

8/6/2021, 12:17:08 PM	Successfully finished installing cluster qe1
8/6/2021, 12:17:08 PM	Updated status of cluster qe1 to installed
8/6/2021, 12:15:19 PM	Operator console status: available message: All is well
8/6/2021, 12:13:18 PM	Operator cvo status: available message: Done applying 4.8.2
8/6/2021, 12:12:18 PM	Operator cvo status: progressing message: Working towards 4.8.2: downloading update

Looks like the problem occurs in high_availability mode only.

Comment 3 liat gamliel 2021-08-09 08:18:18 UTC
Slack thread: https://coreos.slack.com/archives/CUPJTHQ5P/p1628200825352800

Comment 4 Yuri Obshansky 2021-08-09 20:32:41 UTC
Updates:
1. The previous failure was with vip_dhcp_allocation=true, which is not supported yet.
As a result, ingress_vip and api_vip were not allocated correctly.
Issue reported: https://issues.redhat.com/browse/MGMT-7117

2. The latest failure happened with vip_dhcp_allocation=false
and with correct ingress_vip and api_vip values provided:
api.qe1.e2e.bos.redhat.com has address 10.19.114.250
*.apps.qe1.e2e.bos.redhat.com has address 10.19.114.251
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/470d3631-c13e-42d5-b1c3-e6b162d26a98
See attached installation logs

Comment 6 Yuri Obshansky 2021-08-12 14:05:13 UTC
OVNKubernetes + vmxnet3 - OK
OpenShiftSDN + vmxnet3 - Failed
OpenShiftSDN + e1000 - OK
OpenShiftSDN + vmxnet3 + version vmx-15 - Failed

Comment 7 Nadia Pinaeva 2021-08-25 16:47:06 UTC
Seems to be fixed by the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108
There are a couple of bugs with similar vSphere vmxnet3 failures; we will see how that one ^ is resolved.

Comment 8 Yuri Obshansky 2021-08-27 12:39:41 UTC
The workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108 did not work.
Our workaround is to boot the vSphere VMs with a lower VMware hardware version.
Current setting:
[root@rh8-tools yuri]# govc vm.option.info -cluster "e2e" -json | grep HwVersion
    "HwVersion": 17,
The working HwVersion is 13,
so the VM should be created with the parameter -version=6.5
Example:
govc vm.create -version=6.5 -net.adapter vmxnet3 -disk.controller pvscsi -c=16 -m=32768 -disk=120GB -disk-datastore=aos-vsphere -net.address="00:50:56:83:eb:fc" -iso-datastore=aos-vsphere -iso="discovery_image_qe1.iso" -folder="e2e-qe" master-0.qe1.e2e.bos.redhat.com
This works for both network types, OpenShiftSDN and OVNKubernetes.
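The HwVersion check in this comment can be automated before creating VMs. The following is a minimal sketch, assuming the output of `govc vm.option.info -json` has been captured as text. The exact JSON layout is not shown in this report, so the code searches recursively for any "HwVersion" key instead of relying on a fixed path. The function names and the constant are illustrative:

```python
import json

# Highest hardware version reported as working in this comment
# (HwVersion 13, selected with govc's -version=6.5).
MAX_WORKING_HW_VERSION = 13


def find_hw_versions(node):
    """Recursively collect every integer stored under a "HwVersion" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "HwVersion" and isinstance(value, int):
                found.append(value)
            else:
                found.extend(find_hw_versions(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_hw_versions(item))
    return found


def needs_lower_hw_version(govc_json_text):
    """Return True if any reported HwVersion is above the known-working one."""
    versions = find_hw_versions(json.loads(govc_json_text))
    return any(v > MAX_WORKING_HW_VERSION for v in versions)
```

With the value reported above (17), `needs_lower_hw_version` returns True, which is the case where `-version=6.5` was needed in this environment.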

Comment 9 Scott Dodson 2021-08-27 20:29:53 UTC
(In reply to Yuri Obshansky from comment #8)
> Workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108 did not
> work.
> Our workaround is to boot vsphere vms with lower version of VMware Hardware.
> Current settings: 
> [root@rh8-tools yuri]# govc vm.option.info -cluster "e2e" -json | grep
> HwVersion
>     "HwVersion": 17,
> The working HwVersion is 13.
> So, we should start vm with parameter -version=6.5
> Example:
> govc vm.create -version=6.5 -net.adapter vmxnet3 -disk.controller pvscsi
> -c=16 -m=32768 -disk=120GB -disk-datastore=aos-vsphere
> -net.address="00:50:56:83:eb:fc" -iso-datastore=aos-vsphere
> -iso="discovery_image_qe1.iso" -folder="e2e-qe"
> master-0.qe1.e2e.bos.redhat.com 
> Works for both networks OpenShiftSDN and OVNKubernetes as well.

Is it understood why that didn't work? Are you saying that you've tried with 4.8.8 and latest 4.9 nightlies or you did some other implementation?

Comment 10 Yuri Obshansky 2021-08-27 20:43:18 UTC
sdodson
No, it is not understood.
We just disabled "tx-checksum-ip-generic" on all VMs, as suggested in the bug above, and used image 4.8.2
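The comment does not say how the offload setting was disabled. One common way is `ethtool -K`, and the sketch below assumes that approach. The interface name is whatever the vmxnet3 adapter is called on the node (for example `ens192`, a hypothetical name here), the command needs root, and a change made this way does not persist across reboots. The function names are illustrative:

```python
import subprocess


def tx_checksum_off_command(interface):
    """Build the ethtool invocation that turns off tx-checksum-ip-generic."""
    if not interface or any(ch.isspace() for ch in interface):
        raise ValueError("interface name must be a non-empty token")
    return ["ethtool", "-K", interface, "tx-checksum-ip-generic", "off"]


def disable_tx_checksum(interface, run=subprocess.run):
    """Run the command; requires root and the ethtool binary on the node.

    `run` is injectable so the helper can be tested without touching a NIC.
    """
    return run(tx_checksum_off_command(interface), check=True)
```

The current state can be confirmed afterwards with `ethtool -k <interface>`, which lists `tx-checksum-ip-generic` among the offload features.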

Comment 11 Scott Dodson 2021-08-28 02:42:01 UTC
(In reply to Yuri Obshansky from comment #10)
> sdodson
> No, It is not understood.
> We just disabled "tx-checksum-ip-generic" on all VMs as suggested in above
> and use image 4.8.2

I would've expected that to work but it would be good to test 4.8.8 and see if that fixes the problem.

Comment 12 Nadia Pinaeva 2021-08-30 10:38:16 UTC
Target release is set to 4.9 and we're super-close to code freeze. Should we change it to ---, or what's the plan?

Comment 13 Nadia Pinaeva 2021-08-31 13:14:47 UTC
Decided to wait for the next 4.8 release that includes the given workaround, to test whether it works. Tracking in https://bugzilla.redhat.com/show_bug.cgi?id=1998106

Comment 14 Yuri Obshansky 2021-09-07 17:35:09 UTC
Verified on Staging UI 1.5.35 and BE v1.0.25.3

image 4.8.9 - Passed -> 
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/8c205e19-d683-4f24-b2db-8c0b78c0b8b8
    "name": "qe1",
    "network_type": "OpenShiftSDN",
    "ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.8.9-x86_64",
    "openshift_cluster_id": "879dd939-a12d-46d2-a392-e8c163daa5f3",
    "openshift_version": "4.8.9",
    "org_id": "13539309",
    "platform": {
        "type": "vsphere",
        "vsphere": {}
    },
    "progress": {
        "finalizing_stage_percentage": 100,
        "installing_stage_percentage": 100,
        "preparing_for_installation_stage_percentage": 100,
        "total_percentage": 100
    },
    "status": "installed",
    "status_info": "Cluster is installed",
    "status_updated_at": "2021-09-07T17:18:02.234Z",

image 4.9.0-fc.0 - Failed ->
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/c542e994-0d0e-4fdc-9935-d375e18e2923
    "name": "qe1",
    "network_type": "OpenShiftSDN",
    "ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.9.0-fc.0-x86_64",
    "openshift_cluster_id": "28f4794f-0227-4396-8ff4-1e40d4ee5514",
    "openshift_version": "4.9.0-fc.0",
    "org_id": "13539309",
    "platform": {
        "type": "vsphere",
        "vsphere": {}
    },
    "progress": {
        "installing_stage_percentage": 100,
        "preparing_for_installation_stage_percentage": 100,
        "total_percentage": 80
    },
    "status": "error",
    "status_info": "Timeout while waiting for cluster version to be available: context deadline exceeded",
    "status_updated_at": "2021-09-07T16:19:36.175Z",
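The pass/fail verdicts in this comment follow from the `status`, `status_info` and `progress.total_percentage` fields of each cluster record. A minimal sketch of that reading, assuming each record has been loaded into a Python dict with the field names shown above. The function name `summarize_install` is illustrative and not part of any assisted-installer client:

```python
def summarize_install(cluster):
    """Return a one-line verdict for an assisted-installer cluster record."""
    version = cluster.get("openshift_version", "unknown")
    status = cluster.get("status", "unknown")
    total = cluster.get("progress", {}).get("total_percentage", 0)
    # A run counts as passed only when it is both marked installed and complete.
    if status == "installed" and total == 100:
        return f"{version}: passed"
    reason = cluster.get("status_info", "no status_info")
    return f"{version}: failed at {total}% ({reason})"
```

Applied to the two records above, this reports 4.8.9 as passed and 4.9.0-fc.0 as failed at 80% with the timeout message from `status_info`.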

Comment 15 Nadia Pinaeva 2021-09-08 07:46:34 UTC
Are you sure 4.9.0-fc.0 has the fix?
The release page says it was created at 2021-08-20 12:29:17 +0000 UTC, and the PR # is not in the list

Comment 16 Yuri Obshansky 2021-09-08 12:13:09 UTC
(In reply to Nadia Pinaeva from comment #15)
> Are you sure 4.9.0-fc.0 has the fix?
> Release page says it's been created at 2021-08-20 12:29:17 +0000 UTC, and PR
> # is not in the list

Hi Nadia, 

Looks like the fix is in 4.9.0-fc.1:
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.9.0-fc.1

I will verify it again when 4.9.0-fc.1 is released on the Staging env

Comment 17 zhaozhanqi 2021-09-09 07:48:36 UTC
Hi Yuri Obshansky, I assigned QA to you, thanks.

Comment 18 Yuri Obshansky 2021-09-09 12:59:41 UTC
Verified on Staging UI 1.5.35 and BE v1.0.25.3

image 4.8.9 - Passed -> 
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/8c205e19-d683-4f24-b2db-8c0b78c0b8b8
    "name": "qe1",
    "network_type": "OpenShiftSDN",
    "ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.8.9-x86_64",
    "openshift_cluster_id": "879dd939-a12d-46d2-a392-e8c163daa5f3",
    "openshift_version": "4.8.9",
    "org_id": "13539309",
    "platform": {
        "type": "vsphere",
        "vsphere": {}
    },
    "progress": {
        "finalizing_stage_percentage": 100,
        "installing_stage_percentage": 100,
        "preparing_for_installation_stage_percentage": 100,
        "total_percentage": 100
    },
    "status": "installed",
    "status_info": "Cluster is installed",
    "status_updated_at": "2021-09-07T17:18:02.234Z",

Comment 21 errata-xmlrpc 2022-01-11 22:31:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.26 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0021

