Bug 1917893

Summary: [ovirt] install fails: due to terraform error "Cannot attach Virtual Disk: Disk is locked" on vm resource
Product: OpenShift Container Platform
Reporter: Gal Zaidman <gzaidman>
Component: Installer
Assignee: Janos Bonic <jpasztor>
Installer sub component: OpenShift on RHV
QA Contact: michal <mgold>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: urgent
CC: apoczeka, aygarg, dgautam, emarcus, gveitmic, jpasztor, kechung, kurathod, mburman, mgold, openshift-bugs-escalate, pelauter, pkhaire, plarsen, sstagnar, ymotiyel
Version: 4.8
Flags: jpasztor: needinfo-
jpasztor: needinfo-
jpasztor: needinfo-
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, disk uploads in the Terraform provider were not handled properly, and as a result the OpenShift Installer failed. In this release, disk upload handling has been fixed, and disk uploads succeed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-10 16:02:37 UTC
Type: Bug
Bug Blocks: 2015811    

Description Gal Zaidman 2021-01-19 15:41:13 UTC
Description:

The installer sometimes fails with this error:

level=error msg=Error: failed to attach disk: Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is "409". HTTP response message is "409 Conflict".
level=error
level=error msg=  on ../tmp/openshift-install-367614815/template/main.tf line 46, in resource "ovirt_vm" "tmp_import_vm":
level=error msg=  46: resource "ovirt_vm" "tmp_import_vm" {
level=error
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change


For example:

1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.7/1351411349434929152
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4375/pull-ci-openshift-installer-master-e2e-ovirt/1351351091190566912

The problem is probably the wait condition in the Terraform resource, which we need to examine:
https://github.com/oVirt/terraform-provider-ovirt/blob/master/ovirt/resource_ovirt_vm.go#L601-L605
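
A minimal sketch of the kind of wait condition in question, written in Python for brevity rather than in the provider's Go. It does not reproduce the linked provider code: `get_status` is a hypothetical callable that returns the disk's current status, and the status strings `locked` and `ok` are assumptions for illustration. The idea is only that the attach should not start until the disk reports an unlocked state.

```python
import time
from typing import Callable


def wait_for_disk_ok(
    get_status: Callable[[], str],  # hypothetical: returns the disk status, e.g. "locked" or "ok"
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until get_status() returns "ok", or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "ok":
            return
        if clock() >= deadline:
            raise TimeoutError(f"disk still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.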


How to test:
1. Start an installation of OpenShift on RHV version X.
2. Make sure the Terraform part is done, meaning the resources are created on oVirt and the installer prints: "INFO Waiting up to 20m0s for the Kubernetes API at ..."
3. Cancel the installation. There is no need to wait until it finishes, since this is just a Terraform bug.
4. Repeat the process 3 times.
** Since this bug is related to the RHV version, please test on RHV versions 4.4.7 and 4.4.8.
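
When repeating the runs, the outcome of each attempt can be read from the installer log. The sketch below is only an illustration: it matches the two strings quoted in this report, and the function names and outcome labels are made up for this example.

```python
from typing import Dict, List

# Both markers are substrings of messages quoted in this bug report.
LOCKED_DISK_MARKER = "Cannot attach Virtual Disk: Disk is locked"
TERRAFORM_DONE_MARKER = "Waiting up to 20m0s for the Kubernetes API"


def classify_install_log(log_text: str) -> str:
    """Return "locked-disk", "terraform-done", or "inconclusive" for one run."""
    if LOCKED_DISK_MARKER in log_text:
        return "locked-disk"
    if TERRAFORM_DONE_MARKER in log_text:
        return "terraform-done"
    return "inconclusive"


def summarize_runs(logs: List[str]) -> Dict[str, int]:
    """Count outcomes across repeated installation attempts."""
    counts = {"locked-disk": 0, "terraform-done": 0, "inconclusive": 0}
    for text in logs:
        counts[classify_install_log(text)] += 1
    return counts
```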

Comment 1 Kevin Chung 2021-02-02 04:03:36 UTC
Hi Gal, I'm just another customer but had some observations that might help you out.  I was also doing a RHEV IPI install and came across this same exact reported error.  I found in my install-config.yaml, my platform.ovirt.api_vip and platform.ovirt.ingress_vip fields were set to incorrect IPs, and when I went to correct that, it fixed my error.  Hope that's something to look at for your issue.


Kevin

Comment 2 Gal Zaidman 2021-02-02 16:37:21 UTC
(In reply to Kevin Chung from comment #1)
> Hi Gal, I'm just another customer but had some observations that might help
> you out.  I was also doing a RHEV IPI install and came across this same
> exact reported error.  I found in my install-config.yaml, my
> platform.ovirt.api_vip and platform.ovirt.ingress_vip fields were set to
> incorrect IPs, and when I went to correct that, it fixed my error.  Hope
> that's something to look at for your issue.
> 
> 
> Kevin

Hi Kevin, thanks for the help, but that is not the case for us.
This is happening in our CI, where the VIPs are always the same.
Many different Terraform errors can occur, and they look very similar.
Do you happen to have your logs, just to see whether this is the exact same error?

Comment 4 Gal Zaidman 2021-02-03 11:57:50 UTC
From looking into Kevin's logs I can confirm that this is the same issue.
This surprised me: I assumed the bug was mostly reproducible on CI, and it wasn't common at all even there.
I'm surprised to see that this can happen in user environments more frequently than in our CI.
Increasing the priority and severity of the bug since it affects both CI and users.

Comment 5 Gal Zaidman 2021-03-30 14:08:31 UTC
Due to capacity constraints, we will be revisiting this bug in the upcoming sprint.

Comment 6 Janos Bonic 2021-07-01 11:57:31 UTC
This will be fixed when we switch to the new client library.

Comment 12 aygarg 2021-09-15 17:35:44 UTC
Hello Team,

Is there any workaround or anything else that we can check? In one case, the customer is hitting the same issue again and again on version 4.8.9, and the API and Ingress hostnames resolve to the correct VIPs.

Regards,
Ayush Garg

Comment 13 Gal Zaidman 2021-09-19 07:04:01 UTC
(In reply to aygarg from comment #12)
> Hello Team,
> 
> Is there any workaround or something else that we can check? As in one of
> the case, the customer is facing the same issue again and again for 4.8.9
> version, the API and Ingress hostnames are resolving to correct VIP.
> 
> Regards,
> Ayush Garg

If the customer is getting the:
```
level=error msg=Error: failed to attach disk: Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is "409". HTTP response message is "409 Conflict".
```
error from Terraform repeatedly and can't start the installation, please attach the engine and VDSM logs from the customer so we can take a look. It is probably a problem with the oVirt engine or the customer's network.

This error should be very flaky and should mostly affect our internal CI, not customers. If a customer is getting this error consistently, it is probably a different issue.

Comment 14 aygarg 2021-09-20 12:15:09 UTC
Hello Gal,

Thanks for replying.

Before I posted my comment on the Bugzilla, we tried to check the oVirt engine service logs, but there were none, so we asked the customer to check internally with the RHV team. Later we found that this was happening because of their RHV configuration itself. In the customer's words: "adjusting some rules in the firewall between the RHEVM server segment and Openshift VMs segment".

We are still facing another issue, shown below.

ERROR Error: couldn't find resource (21 retries)
ERROR
ERROR   on ../../../tmp/openshift-install-364336629/template/main.tf line 78, in resource "ovirt_template" "releaseimage_template":
ERROR   78: resource "ovirt_template" "releaseimage_template" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change


However, I think this is now a separate issue, again related only to the customer's RHV configuration, so I will check with them and raise a new Bugzilla if required. Thanks a lot for all of your help.


Regards,
Ayush Garg

Comment 15 Janos Bonic 2021-09-20 12:17:56 UTC
@aygarg is this an entirely new attempt after adjusting the firewalls? (Did you run destroy cluster after the previous attempt or remove the local state files?) I'm asking because the Terraform state file may remain in place with a half-broken system and may need to be removed. Ping me if you need help.

Comment 16 aygarg 2021-09-20 12:33:52 UTC
We started a new installation each time by destroying the old cluster and deleting the old files completely (even the hidden ones).

Comment 19 Yash 2021-09-21 05:28:08 UTC
Hello Team,

I have a customer who is facing a similar issue and I suspect it is hitting the same bug.

ERROR: An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is 409.
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Cannot attach Virtual Disk: Disk is locked. Please try again later.]\". HTTP response code is 409."}

Environment: RHV 4.4.8

I can share the log collector output if needed. The customer didn't face this issue on the earlier version, 4.4.7.


Regards,
Yash Motiyele
Red Hat

Comment 31 apoczeka 2021-09-24 20:05:50 UTC
The workaround:
- When openshift-install reaches the module.template.ovirt_image_transfer.releaseimage[0] stage, pause the installation program with Ctrl+Z. After some time (about 40 seconds), use the "fg" command to resume the process.

Example:
DEBUG module.template.ovirt_image_transfer.releaseimage[0]: Still creating... [50s elapsed] 
^Z <- here is ctrl+Z
[1]+  Stopped                 openshift-install create cluster --log-level=debug
[ap@work install]$ fg


For me it took about 40 seconds (I checked the disk state in the RHV console under Storage->Disks). You have to wait for the state to change from "Finalizing" to "OK".

Comment 33 Gal Zaidman 2021-09-29 12:34:49 UTC
(In reply to aygarg from comment #14)
> Hello Gal,
> 
> Thanks for replying.
> 
> Before I posted my comment on the Bugzilla, we tried to check the logs of
> oVirt engine service but there were no logs, so requested them to check
> internally with the RHV team. Later on we found that this was happening due
> to their RHV configuration itself. In customer's words: "adjusting some
> rules in the firewall between the RHEVM server segment and Openshift VMs
> segment".
> 
> Still, we are facing some other issues as following.
> 
> ERROR Error: couldn't find resource (21 retries)
> ERROR
> ERROR   on ../../../tmp/openshift-install-364336629/template/main.tf line
> 78, in resource "ovirt_template" "releaseimage_template":
> ERROR   78: resource "ovirt_template" "releaseimage_template" {
> ERROR
> ERROR
> FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to
> create cluster: failed to apply Terraform: failed to complete the change
> 
> 
> However, I think this is a separate issue now and again related to the
> customer's RHV configuration only so I will check with them and if required
> raise a new Bugzilla. Thanks a lot for all of your help.
> 
> 
> Regards,
> Ayush Garg

This doesn't look like a regular Terraform error but, as Janos mentioned, like something left over from previous attempts.
I'm not sure what the customer is running, but please start from scratch: a new installation directory, completing all the questions, generating a new install config, and so on.
If you are still hitting the issue please open a new bug and upload:
1. Installation logs
2. All terraform files
3. Engine log

Comment 36 Peter Larsen 2021-10-13 15:39:25 UTC
The same issue here. Looking at engine.log on the RHV side, the lock on the uploaded image has not yet been released when Terraform attempts to attach the disk to the VM, hence the error. The lock is released very soon after the attempt. I can provide logs if that would be helpful.
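
Since the lock clears shortly after the failed attempt, one possible mitigation is to retry the attach when the engine answers with the 409 "Disk is locked" fault. The sketch below only illustrates that idea and is not the fix that was merged: `DiskLockedError` and the `attach` callable are hypothetical stand-ins, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DiskLockedError(Exception):
    """Hypothetical stand-in for the engine's 409 "Disk is locked" fault."""


def attach_with_retry(
    attach: Callable[[], T],  # hypothetical: performs the attach, raises DiskLockedError on 409
    attempts: int = 5,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call attach(), retrying only while the disk is reported as locked."""
    for attempt in range(1, attempts + 1):
        try:
            return attach()
        except DiskLockedError:
            if attempt == attempts:
                raise  # lock never cleared; surface the original error
            sleep(delay_s)
    raise ValueError("attempts must be at least 1")
```

Any other exception propagates immediately, so unrelated failures are not masked by the retry.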

Comment 37 Peter Lauterbach 2021-10-14 15:28:29 UTC
This is a serious issue that is causing failures reported from several directions, including customer support cases and field associates. Bumping the priority and targeting the earliest OpenShift 4.8.z stream.
This will also need to be backported to our EUS release, OpenShift 4.6.

Comment 39 Peter Lauterbach 2021-10-14 21:13:39 UTC
Targeting the current dev release, for urgent backporting to at least OpenShift 4.8.z.

Comment 43 michal 2021-10-21 08:21:03 UTC
RHV: 4.4.9.2-0.6
OCP: 4.10.0-0.nightly-2021-10-20-193037

Steps:
1) Install OCP on RHV.
2) Verify that this error doesn't appear:
Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is 409.
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Cannot attach Virtual Disk: Disk is locked. Please try again later.]\". HTTP response code is 409."}

Actual:
There is no error.

Expected:
No error in the installation process.

Comment 50 Janos Bonic 2022-02-03 07:17:21 UTC
This bug has been fixed and the fix has been merged into OCP 4.10.

Comment 52 errata-xmlrpc 2022-03-10 16:02:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056