Bug 1990663
Summary: [Assisted-4.8][SaaS][vsphere] cluster deployment failed when use OpenShiftSDN and network adapter vmxnet3

Product: OpenShift Container Platform
Component: Networking (sub component: openshift-sdn)
Version: 4.8
Target Release: 4.8.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: TestBlocker
Reporter: Yuri Obshansky <yobshans>
Assignee: Nadia Pinaeva <npinaeva>
QA Contact: Yuri Obshansky <yobshans>
CC: aconstan, aos-bugs, astoycos, lgamliel, sasha, sdodson
Fixed In Version: 4.8.9
Doc Type: No Doc Update
Bug Depends On: 1998106
Type: Bug
Last Closed: 2022-01-11 22:31:15 UTC
Description
Yuri Obshansky
2021-08-05 21:59:11 UTC
Update: SNO cluster deployed successfully:

8/6/2021, 12:17:08 PM  Successfully finished installing cluster qe1
8/6/2021, 12:17:08 PM  Updated status of cluster qe1 to installed
8/6/2021, 12:15:19 PM  Operator console status: available, message: All is well
8/6/2021, 12:13:18 PM  Operator cvo status: available, message: Done applying 4.8.2
8/6/2021, 12:12:18 PM  Operator cvo status: progressing, message: Working towards 4.8.2: downloading update

Looks like the problem is with high_availability mode only.

Updates:
1. The previous failure occurred with vip_dhcp_allocation=true, which is not supported yet; as a result, ingress_vip and api_vip were not allocated correctly. Issue reported: https://issues.redhat.com/browse/MGMT-7117
2. The latest failure happened with vip_dhcp_allocation=false and the correct ingress_vip and api_vip provided:
   api.qe1.e2e.bos.redhat.com has address 10.19.114.250
   *.apps.qe1.e2e.bos.redhat.com has address 10.19.114.251
   https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/470d3631-c13e-42d5-b1c3-e6b162d26a98
   See attached installation logs.

Test matrix:
OVNKubernetes + vmxnet3 - OK
OpenShiftSDN + vmxnet3 - Failed
OpenShiftSDN + e1000 - OK
OpenShiftSDN + vmxnet3 + version vmx-15 - Failed

Seems to be fixed by the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108. There are a couple of bugs with similar vSphere vmxnet3 failures; we will see how this one is resolved.

Update: the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108 did not work. Our workaround is to boot the vSphere VMs with a lower VMware hardware version. Current settings:

[root@rh8-tools yuri]# govc vm.option.info -cluster "e2e" -json | grep HwVersion
"HwVersion": 17,

The working HwVersion is 13.
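The HwVersion check above can be scripted to flag VMs that need to be recreated. A minimal sketch, assuming the abridged JSON sample below stands in for the real `govc vm.option.info` output, and using the working version 13 reported here as the threshold:

```shell
# Abridged sample of what `govc vm.option.info -cluster "e2e" -json` printed above.
json='{"HwVersion": 17}'

# Extract the HwVersion field with python3 (jq would work equally well).
hw=$(printf '%s' "$json" | python3 -c 'import json, sys; print(json.load(sys.stdin)["HwVersion"])')

# HwVersion 13 is the known-working hardware version per this report.
if [ "$hw" -gt 13 ]; then
  echo "HwVersion $hw: recreate VMs with govc vm.create -version=6.5"
else
  echo "HwVersion $hw: OK"
fi
```

Against the settings quoted above this prints the "recreate" branch, since 17 > 13.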
So, we should start the VM with the parameter -version=6.5. Example:

govc vm.create -version=6.5 -net.adapter vmxnet3 -disk.controller pvscsi -c=16 -m=32768 -disk=120GB -disk-datastore=aos-vsphere -net.address="00:50:56:83:eb:fc" -iso-datastore=aos-vsphere -iso="discovery_image_qe1.iso" -folder="e2e-qe" master-0.qe1.e2e.bos.redhat.com

This works for both network types, OpenShiftSDN and OVNKubernetes.

(In reply to Yuri Obshansky from comment #8)
> Workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108 did not work.
> Our workaround is to boot vsphere vms with lower version of VMware Hardware.
> The working HwVersion is 13.
> Works for both networks OpenShiftSDN and OVNKubernetes as well.

Is it understood why that didn't work? Are you saying that you've tried with 4.8.8 and the latest 4.9 nightlies, or did you do some other implementation?

sdodson,
No, it is not understood. We just disabled "tx-checksum-ip-generic" on all VMs, as suggested above, and used image 4.8.2.

(In reply to Yuri Obshansky from comment #10)
> No, It is not understood.
> We just disabled "tx-checksum-ip-generic" on all VMs as suggested in above
> and use image 4.8.2

I would've expected that to work, but it would be good to test 4.8.8 and see if that fixes the problem.

Target release is set to 4.9 and we're super close to code freeze; should we change it to --- or what's the plan?
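For reference, the offload workaround discussed above (disabling "tx-checksum-ip-generic", per bug 1987108) amounts to one ethtool call per node. A hedged sketch with a dry-run guard; the interface name ens192 is an assumption and varies per VM:

```shell
# Hypothetical NIC name; substitute the actual vmxnet3 interface on each node.
NIC=${NIC:-ens192}
DRY_RUN=${DRY_RUN:-1}   # set to 0 to actually change offload settings

# The workaround from bug 1987108: turn off generic TX checksum offload.
cmd="ethtool -K $NIC tx-checksum-ip-generic off"

if [ "$DRY_RUN" -eq 1 ]; then
  echo "would run: $cmd"
else
  $cmd
fi
```

Note this setting does not persist across reboots, which is one reason the hardware-version downgrade was pursued as the more durable workaround.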
Decided to wait for the next 4.8 release including the given workaround to test whether it works; tracking https://bugzilla.redhat.com/show_bug.cgi?id=1998106

Verified on Staging UI 1.5.35 and BE v1.0.25.3.

Image 4.8.9 - Passed -> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/8c205e19-d683-4f24-b2db-8c0b78c0b8b8

"name": "qe1",
"network_type": "OpenShiftSDN",
"ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.8.9-x86_64",
"openshift_cluster_id": "879dd939-a12d-46d2-a392-e8c163daa5f3",
"openshift_version": "4.8.9",
"org_id": "13539309",
"platform": { "type": "vsphere", "vsphere": {} },
"progress": {
  "finalizing_stage_percentage": 100,
  "installing_stage_percentage": 100,
  "preparing_for_installation_stage_percentage": 100,
  "total_percentage": 100
},
"status": "installed",
"status_info": "Cluster is installed",
"status_updated_at": "2021-09-07T17:18:02.234Z",

Image 4.9.0-fc.0 - Failed -> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/c542e994-0d0e-4fdc-9935-d375e18e2923

"name": "qe1",
"network_type": "OpenShiftSDN",
"ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.9.0-fc.0-x86_64",
"openshift_cluster_id": "28f4794f-0227-4396-8ff4-1e40d4ee5514",
"openshift_version": "4.9.0-fc.0",
"org_id": "13539309",
"platform": { "type": "vsphere", "vsphere": {} },
"progress": {
  "installing_stage_percentage": 100,
  "preparing_for_installation_stage_percentage": 100,
  "total_percentage": 80
},
"status": "error",
"status_info": "Timeout while waiting for cluster version to be available: context deadline exceeded",
"status_updated_at": "2021-09-07T16:19:36.175Z",

Are you sure 4.9.0-fc.0 has the fix? The release page says it was created at 2021-08-20 12:29:17 +0000 UTC, and the PR # is not in the list.

(In reply to Nadia Pinaeva from comment #15)
> Are you sure 4.9.0-fc.0 has the fix?
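The status and progress fields quoted above can be pulled out of the cluster record JSON directly, which makes the pass/fail difference between the two runs easy to spot. A small sketch against an inline, abridged copy of the failed 4.9.0-fc.0 record:

```shell
# Abridged copy of the failed 4.9.0-fc.0 cluster record shown above.
json='{"status": "error", "progress": {"installing_stage_percentage": 100, "preparing_for_installation_stage_percentage": 100, "total_percentage": 80}}'

# Extract the overall status and total progress with python3.
status=$(printf '%s' "$json" | python3 -c 'import json, sys; print(json.load(sys.stdin)["status"])')
total=$(printf '%s' "$json" | python3 -c 'import json, sys; print(json.load(sys.stdin)["progress"]["total_percentage"])')

# total_percentage stuck below 100 with status "error" flags the timed-out install.
echo "status=$status, total=$total%"
```

The passing 4.8.9 record would yield status=installed with total=100 under the same extraction.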
> Release page says it's been created at 2021-08-20 12:29:17 +0000 UTC, and PR
> # is not in the list

Hi Nadia,
Looks like the fix is in 4.9.0-fc.1 - https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.9.0-fc.1
We will verify it again when 4.9.0-fc.1 is released on the Staging env.

Hi Yuri Obshansky,
I assigned QA to you, thanks.

Verified on Staging UI 1.5.35 and BE v1.0.25.3. Image 4.8.9 - Passed -> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/8c205e19-d683-4f24-b2db-8c0b78c0b8b8 ("openshift_version": "4.8.9", "status": "installed", "status_info": "Cluster is installed").

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.26 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0021