Created attachment 1811312 [details]
installation logs

Description of problem:
Cluster deployment on a VMware vSphere setup failed with "Failed installing cluster qe1. Reason: Timeout while waiting for cluster version to be available: context deadline exceeded". In reality, the CVO failed to install:

Operator cvo status: progressing message: Unable to apply 4.8.2: an unknown error has occurred: MultipleErrors

Cluster deployed with:
"platform": {
    "type": "baremetal",
    "vsphere": {
        "cluster":
        "datacenter":
        "defaultDatastore":
        "folder":
        "network":
        "password":
        "username":
        "vCenter":
    }
},

Another error in the agent log:
Aug 05 18:09:28 master-0.qe1.e2e.bos.redhat.com domain_resoluti[2854]: time="05-08-2021 18:09:28" level=error msg="error occurred during domain resolution of api-int.qe1.e2e.bos.redhat.com" file="domain_resolution.go:33" error="lookup api-int.qe1.e2e.bos.redhat.com on 10.19.143.247:53: no such host"

Version-Release number of selected component (if applicable):
v1.0.24.1

How reproducible:

Steps to Reproduce:
1. Install cluster on VMware setup
2. Cluster fails

Actual results:

Expected results:

Additional info:
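The agent's lookup failure can be reproduced directly against the DNS server from the log line above (a minimal sketch using standard tooling; the hostnames and the 10.19.143.247 server are taken from the log, nothing else is assumed):

# Query the DNS server from the agent log for the API records
$ dig +short api-int.qe1.e2e.bos.redhat.com @10.19.143.247
$ dig +short api.qe1.e2e.bos.redhat.com @10.19.143.247

An empty answer from the first query matches the "no such host" error reported by the agent.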
Update: SNO cluster deployed successfully:

8/6/2021, 12:17:08 PM  Successfully finished installing cluster qe1
8/6/2021, 12:17:08 PM  Updated status of cluster qe1 to installed
8/6/2021, 12:15:19 PM  Operator console status: available message: All is well
8/6/2021, 12:13:18 PM  Operator cvo status: available message: Done applying 4.8.2
8/6/2021, 12:12:18 PM  Operator cvo status: progressing message: Working towards 4.8.2: downloading update

Looks like the problem occurs in high_availability mode only.
Slack thread: https://coreos.slack.com/archives/CUPJTHQ5P/p1628200825352800
Updates:
1. The previous failure was with vip_dhcp_allocation=true, which is not supported yet. As a result, ingress_vip and api_vip were not allocated correctly. Issue reported: https://issues.redhat.com/browse/MGMT-7117
2. The latest failure happened with vip_dhcp_allocation=false and correct ingress_vip and api_vip provided:
api.qe1.e2e.bos.redhat.com has address 10.19.114.250
*.apps.qe1.e2e.bos.redhat.com has address 10.19.114.251
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/470d3631-c13e-42d5-b1c3-e6b162d26a98
See attached installation logs.
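For anyone reproducing this: with DHCP allocation disabled, the VIPs have to be set explicitly on the cluster. A minimal sketch of doing that through the assisted-service REST API (the base URL and $TOKEN are placeholders, and the field names are assumptions based on this installer version; the cluster ID is the one from the URL above):

$ curl -s -X PATCH \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"vip_dhcp_allocation": false, "api_vip": "10.19.114.250", "ingress_vip": "10.19.114.251"}' \
    "https://api.openshift.com/api/assisted-install/v2/clusters/470d3631-c13e-42d5-b1c3-e6b162d26a98"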
Test matrix:
OVNKubernetes + vmxnet3 - OK
OpenShiftSDN + vmxnet3 - Failed
OpenShiftSDN + e1000 - OK
OpenShiftSDN + vmxnet3 + version vmx-15 - Failed
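For reference, the NIC model in these runs is selected at VM creation time via govc's -net.adapter flag (a sketch based on the full create command in the workaround comment below; all other flags unchanged and elided here):

$ govc vm.create -net.adapter vmxnet3 ... master-0.qe1.e2e.bos.redhat.com   # vmxnet3 runs
$ govc vm.create -net.adapter e1000 ... master-0.qe1.e2e.bos.redhat.com     # e1000 run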
Seems to be fixed by the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108. There are a couple of bugs with similar vSphere vmxnet3 failures; we'll see how that one gets resolved.
The workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108 did not work.
Our workaround is to boot the vSphere VMs with a lower VMware hardware version.
Current settings:
[root@rh8-tools yuri]# govc vm.option.info -cluster "e2e" -json | grep HwVersion
  "HwVersion": 17,
The working HwVersion is 13, so we should create the VMs with the parameter -version=6.5.
Example:
govc vm.create -version=6.5 -net.adapter vmxnet3 -disk.controller pvscsi -c=16 -m=32768 -disk=120GB -disk-datastore=aos-vsphere -net.address="00:50:56:83:eb:fc" -iso-datastore=aos-vsphere -iso="discovery_image_qe1.iso" -folder="e2e-qe" master-0.qe1.e2e.bos.redhat.com
This works for both OpenShiftSDN and OVNKubernetes networks.
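For a full HA control plane the same flag applies to each node's create command. A minimal sketch (master-0's flags are from the example above; the loop itself is an illustration, not the exact procedure used, and the per-node -net.address values are omitted here):

# Create all three masters at HwVersion 13 via -version=6.5
# (the same discovery ISO is used for every node)
for i in 0 1 2; do
    govc vm.create -version=6.5 -net.adapter vmxnet3 -disk.controller pvscsi \
        -c=16 -m=32768 -disk=120GB -disk-datastore=aos-vsphere \
        -iso-datastore=aos-vsphere -iso="discovery_image_qe1.iso" \
        -folder="e2e-qe" "master-${i}.qe1.e2e.bos.redhat.com"
done
# Confirm the VM came up at the lower hardware version
govc vm.info -json master-0.qe1.e2e.bos.redhat.com | grep -i '"version"'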
(In reply to Yuri Obshansky from comment #8)
> Workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1987108 did not
> work.
> Our workaround is to boot vsphere vms with lower version of VMware Hardware.
> Current settings:
> [root@rh8-tools yuri]# govc vm.option.info -cluster "e2e" -json | grep
> HwVersion
> "HwVersion": 17,
> The working HwVersion is 13.
> So, we should start vm with parameter -version=6.5
> Example:
> govc vm.create -version=6.5 -net.adapter vmxnet3 -disk.controller pvscsi
> -c=16 -m=32768 -disk=120GB -disk-datastore=aos-vsphere
> -net.address="00:50:56:83:eb:fc" -iso-datastore=aos-vsphere
> -iso="discovery_image_qe1.iso" -folder="e2e-qe"
> master-0.qe1.e2e.bos.redhat.com
> Works for both networks OpenShiftSDN and OVNKubernetes as well.

Is it understood why that didn't work? Are you saying that you've tried with 4.8.8 and the latest 4.9 nightlies, or did you use some other implementation?
sdodson,
No, it is not understood. We just disabled "tx-checksum-ip-generic" on all VMs as suggested above and used image 4.8.2.
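For completeness, the offload in question is toggled with ethtool on each node (a minimal sketch; ens192 is an assumed interface name for the vmxnet3 NIC, not taken from this report):

# Assumption: the vmxnet3 interface is ens192; adjust to the actual NIC name
sudo ethtool -K ens192 tx-checksum-ip-generic off
# Verify the offload is now reported as off
sudo ethtool -k ens192 | grep tx-checksum-ip-generic

Note that an ethtool setting applied this way does not persist across reboots.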
(In reply to Yuri Obshansky from comment #10)
> sdodson
> No, It is not understood.
> We just disabled "tx-checksum-ip-generic" on all VMs as suggested in above
> and use image 4.8.2

I would've expected that to work, but it would be good to test 4.8.8 and see if that fixes the problem.
Target release is set to 4.9 and we're super-close to code freeze. Should we change it to --- or what's the plan?
Decided to wait for the next 4.8 release, which includes the given workaround, to test whether it works. Tracking https://bugzilla.redhat.com/show_bug.cgi?id=1998106
Verified on Staging UI 1.5.35 and BE v1.0.25.3

image 4.8.9 - Passed -> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/8c205e19-d683-4f24-b2db-8c0b78c0b8b8
"name": "qe1",
"network_type": "OpenShiftSDN",
"ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.8.9-x86_64",
"openshift_cluster_id": "879dd939-a12d-46d2-a392-e8c163daa5f3",
"openshift_version": "4.8.9",
"org_id": "13539309",
"platform": {
    "type": "vsphere",
    "vsphere": {}
},
"progress": {
    "finalizing_stage_percentage": 100,
    "installing_stage_percentage": 100,
    "preparing_for_installation_stage_percentage": 100,
    "total_percentage": 100
},
"status": "installed",
"status_info": "Cluster is installed",
"status_updated_at": "2021-09-07T17:18:02.234Z",

image 4.9.0-fc.0 - Failed -> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/c542e994-0d0e-4fdc-9935-d375e18e2923
"name": "qe1",
"network_type": "OpenShiftSDN",
"ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.9.0-fc.0-x86_64",
"openshift_cluster_id": "28f4794f-0227-4396-8ff4-1e40d4ee5514",
"openshift_version": "4.9.0-fc.0",
"org_id": "13539309",
"platform": {
    "type": "vsphere",
    "vsphere": {}
},
"progress": {
    "installing_stage_percentage": 100,
    "preparing_for_installation_stage_percentage": 100,
    "total_percentage": 80
},
"status": "error",
"status_info": "Timeout while waiting for cluster version to be available: context deadline exceeded",
"status_updated_at": "2021-09-07T16:19:36.175Z",
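The fields quoted above come from the assisted-service cluster object; a sketch of pulling them for any cluster via the REST API (the base URL and $TOKEN are placeholders for illustration):

$ curl -s -H "Authorization: Bearer $TOKEN" \
    "https://api.openshift.com/api/assisted-install/v2/clusters/8c205e19-d683-4f24-b2db-8c0b78c0b8b8" \
    | jq '{openshift_version, network_type, status, status_info, progress}'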
Are you sure 4.9.0-fc.0 has the fix? The release page says it was created at 2021-08-20 12:29:17 +0000 UTC, and PR # is not in the list.
(In reply to Nadia Pinaeva from comment #15)
> Are you sure 4.9.0-fc.0 has the fix?
> Release page says it's been created at 2021-08-20 12:29:17 +0000 UTC, and PR
> # is not in the list

Hi Nadia,
Looks like the fix is in 4.9.0-fc.1 - https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.9.0-fc.1
Will verify again when 4.9.0-fc.1 is released on the Staging env.
Hi Yuri Obshansky, I assigned QA to you, thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.26 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0021