Description: The installer sometimes fails with the following error:

level=error msg=Error: failed to attach disk: Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is "409". HTTP response message is "409 Conflict".
level=error
level=error msg= on ../tmp/openshift-install-367614815/template/main.tf line 46, in resource "ovirt_vm" "tmp_import_vm":
level=error msg= 46: resource "ovirt_vm" "tmp_import_vm" {
level=error
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change

For example:
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.7/1351411349434929152
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4375/pull-ci-openshift-installer-master-e2e-ovirt/1351351091190566912

The problem is probably the wait condition in the Terraform resource, which we need to examine:
https://github.com/oVirt/terraform-provider-ovirt/blob/master/ovirt/resource_ovirt_vm.go#L601-L605

How to test:
1. Start an installation of OpenShift on RHV version X.
2. Make sure the Terraform part is done, meaning the resources are created on oVirt and the installer prints: "INFO Waiting up to 20m0s for the Kubernetes API at ..."
3. Cancel the installation. There is no need to wait until it finishes, since this is purely a Terraform bug.
4. Repeat the process 3 times.

** Since this bug is related to the RHV version, please test on RHV versions 4.4.7 and 4.4.8.
Hi Gal, I'm just another customer but had some observations that might help you out. I was also doing a RHEV IPI install and came across this same exact reported error. I found in my install-config.yaml, my platform.ovirt.api_vip and platform.ovirt.ingress_vip fields were set to incorrect IPs, and when I went to correct that, it fixed my error. Hope that's something to look at for your issue. Kevin
(In reply to Kevin Chung from comment #1)

Hi Kevin, thanks for the help, but that is not the case for us. This is happening in our CI, where the VIPs are always the same. There are a lot of Terraform errors that can happen, and they look very similar. Do you happen to have your logs, just so we can see whether this is the exact same error?
From looking into Kevin's logs, I can confirm that this is the same issue. That was a surprise to me: I had assumed this bug was reproducible mostly on CI, and even there it wasn't common at all. I'm surprised to see that it can happen in user environments more frequently than in our CI. Increasing the priority and severity of the bug, since it affects both CI and users.
Due to capacity constraints, we will be revisiting this bug in the upcoming sprint.
This will be fixed when we switch to the new client library.
Hello Team,

Is there any workaround, or something else that we can check? In one of our cases, the customer is hitting the same issue again and again on version 4.8.9, even though the API and Ingress hostnames resolve to the correct VIPs.

Regards,
Ayush Garg
(In reply to aygarg from comment #12)

If the customer is getting the following error from Terraform repeatedly and can't start the installation:

```
level=error msg=Error: failed to attach disk: Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is "409". HTTP response message is "409 Conflict".
```

please attach the engine and vdsm logs from the customer so we can take a look. It is probably a problem with the oVirt engine or the customer's network. This error should be very flaky and should mostly affect our internal CI, not customers. If a customer is getting it consistently, then this is probably a different issue.
Hello Gal,

Thanks for replying.

Before I posted my comment on the Bugzilla, we tried to check the logs of the oVirt engine service, but there were no logs, so we asked the customer to check internally with their RHV team. Later we found that this was caused by their RHV configuration itself. In the customer's words, it was resolved by "adjusting some rules in the firewall between the RHEVM server segment and Openshift VMs segment".

We are still facing another issue, shown below:

ERROR Error: couldn't find resource (21 retries)
ERROR
ERROR on ../../../tmp/openshift-install-364336629/template/main.tf line 78, in resource "ovirt_template" "releaseimage_template":
ERROR 78: resource "ovirt_template" "releaseimage_template" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change

However, I think this is now a separate issue, again related only to the customer's RHV configuration, so I will check with them and raise a new Bugzilla if required. Thanks a lot for all of your help.

Regards,
Ayush Garg
@aygarg is this an entirely new attempt after adjusting the firewalls? (Did you run destroy cluster after the previous attempt or remove the local state files?) The reason why I'm asking is because the Terraform state file may remain in place with a half-broken system and may need to be removed. Ping me if you need help.
Every attempt was a fresh installation: each time we destroyed the old cluster and deleted the old files completely (even the hidden ones).
Hello Team,

I have a customer who is facing a similar issue, and I suspect it is hitting the same bug.

ERROR: An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is 409.
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Cannot attach Virtual Disk: Disk is locked. Please try again later.]\". HTTP response code is 409."}

Environment: RHV 4.4.8

I can share the log collector output if needed. The customer didn't face this issue on the earlier version, 4.4.7.

Regards,
Yash Motiyele
Red Hat
The workaround:
- When openshift-install is at the module.template.ovirt_image_transfer.releaseimage[0] step, you can pause the installation program with Ctrl+Z and, after some time (about 40 sec), resume it with the "fg" command.

Example:

  DEBUG module.template.ovirt_image_transfer.releaseimage[0]: Still creating... [50s elapsed]
  ^Z    <- here is Ctrl+Z
  [1]+ Stopped openshift-install create cluster --log-level=debug
  [ap@work install]$ fg

For me it took about 40 sec (I checked the disk state in the RHV console under Storage->Disks). You have to wait for the state to change from "Finalizing" to "OK".
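The manual wait in this workaround (pause until the disk leaves "Finalizing" and shows "OK") could in principle be scripted as a polling loop. The following is only a minimal Python sketch under stated assumptions, not the installer's or the Terraform provider's actual logic: `wait_for_disk_ok` and the `get_status` callable are hypothetical names, and the caller is assumed to supply a function that returns the disk's current status string as reported by RHV.

```python
import time
from typing import Callable


def wait_for_disk_ok(
    get_status: Callable[[], str],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a disk's status until it reads "ok" or the timeout expires.

    get_status is a hypothetical callable supplied by the caller; it is
    assumed to return the current status string (for example "locked"
    while the image transfer is finalizing, "ok" once it is usable).
    Returns True if the disk became "ok", False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if get_status().lower() == "ok":
            return True
        if clock() >= deadline:
            return False
        # Disk is still busy: wait a little before asking again.
        sleep(interval)
```

A caller would only proceed to attach the disk once this returns True. The `sleep` and `clock` parameters exist so the loop can be exercised without real delays.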
(In reply to aygarg from comment #14)

This doesn't look like a regular Terraform error but, as Janos mentioned, like something left over from previous attempts. I'm not sure what the customer is running, but please start from scratch: use a new installation directory, answer all the questions again, generate a new install config, and so on.

If you are still hitting the issue, please open a new bug and upload:
1. Installation logs
2. All Terraform files
3. Engine log
The same issue here. Looking at engine.log on the RHV side, the lock on the uploaded image has not yet been released when Terraform attempts to attach the disk to the VM, hence the error. The lock is released very shortly after the attempt. I can provide logs if that would be helpful.
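Since the lock is released shortly after the failed attach, one plausible mitigation is to retry the attach whenever the engine reports the lock. The sketch below illustrates that idea in Python. It is not necessarily the fix that was eventually merged, and `DiskLockedError`, `attach_with_retry`, and the `attach` callable are hypothetical stand-ins rather than Terraform provider or oVirt SDK APIs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DiskLockedError(Exception):
    """Hypothetical stand-in for the 409 "Disk is locked" fault."""


def attach_with_retry(
    attach: Callable[[], T],
    attempts: int = 5,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call attach(), retrying while it raises DiskLockedError.

    attach is a hypothetical callable that performs the disk attachment
    and raises DiskLockedError while the engine still holds the lock.
    Any other exception propagates immediately. After the last failed
    attempt the DiskLockedError is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return attach()
        except DiskLockedError:
            if attempt == attempts:
                raise
            # Lock is usually released soon after; back off and retry.
            sleep(delay)
    raise AssertionError("unreachable")
```

Retrying only on the lock-specific error keeps unrelated failures (wrong credentials, missing resources) visible instead of masking them behind repeated attempts.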
This is a serious issue that is causing failures reported from several directions, including customer support cases and field associates. Bumping the priority and targeting the earliest OpenShift 4.8.z stream. This will also need to be backported to our EUS release, OpenShift 4.6.
Targeting the current dev release, for urgent backporting to at least OpenShift 4.8.z.
RHV: 4.4.9.2-0.6
OCP: 4.10.0-0.nightly-2021-10-20-193037

Steps:
1) Install OCP on RHV.
2) Verify that this error doesn't appear:

Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is 409.
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Cannot attach Virtual Disk: Disk is locked. Please try again later.]\". HTTP response code is 409."}

Actual: there is no error.
Expected: no error in the installation process.
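The check in step 2 can also be done mechanically by scanning the installer output for the lock message. A minimal Python sketch, assuming the log text has already been read into a string; `find_lock_errors` is a hypothetical helper name, and the marker string is copied from the error reported in this bug.

```python
from typing import List

# Message fragment taken from the error reported in this bug.
LOCK_ERROR = "Cannot attach Virtual Disk: Disk is locked"


def find_lock_errors(log_text: str) -> List[str]:
    """Return every log line that contains the disk-lock error message."""
    return [line for line in log_text.splitlines() if LOCK_ERROR in line]
```

An empty result means the error did not appear in the given log; it does not by itself prove that the installation succeeded.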
This bug has been fixed and the fix has been merged into OCP 4.10.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056