Bug 1923178 - Can not download VM disks due to 'Cannot transfer Virtual Disk: Disk is locked'
Summary: Can not download VM disks due to 'Cannot transfer Virtual Disk: Disk is locked'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.5.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.9
Assignee: Pavel Bar
QA Contact: Amit Sharir
URL:
Whiteboard:
Duplicates: 2001894
Depends On: 1976020
Blocks: 1883949 1984308 1994001 2014017
 
Reported: 2021-02-01 14:17 UTC by Ilan Zuckerman
Modified: 2021-12-23 16:29 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2014017
Environment:
Last Closed: 2021-10-21 07:27:14 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+
eshenitz: blocker+
asharir: testing_plan_complete+


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 115593 0 master MERGED imagetransfer: fix 'finalize_transfer' step 2021-07-15 17:40:04 UTC
oVirt gerrit 115673 0 master ABANDONED core: add a new command infrastructure "conclude()" method 2021-10-13 22:09:20 UTC
oVirt gerrit 116264 0 master ABANDONED api: poll the job of TransferDiskImage 2021-08-24 07:04:19 UTC
oVirt gerrit 116781 0 master ABANDONED core: fixing the 'Disk is locked' issue by waiting for locks release 2021-10-13 22:08:41 UTC
oVirt gerrit 116984 0 master MERGED core: release memory locks before unlocking entities when image transfer is finished 2021-10-13 09:58:10 UTC
oVirt gerrit 117138 0 ovirt-engine-4.4 MERGED core: release memory locks before unlocking entities when image transfer is finished 2021-10-13 15:46:18 UTC

Comment 1 Nir Soffer 2021-02-01 14:43:47 UTC
(In reply to Ilan Zuckerman from comment #0)
...
> The failure is due to:
> ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is
> "[Cannot transfer Virtual Disk: Disk is locked. Please try again later.]".
> HTTP response code is 409.

This means the engine did not release the memory lock yet.

> Both, checksum checks, and the act of downloading the disks is done via the
> SDK example scripts:
> checksum_disk.py
> download_disk.py

Both scripts wait until the disk status is OK, but unfortunately the engine
changes the disk status to OK a few seconds before releasing the memory lock.

...
> From investigating both of the failures, I can see it occurs only on
> GlusterFS type storage.

Maybe because GlusterFS is slower; it may cause additional delays on the
engine side that cause the lock to be released later.

...
> This issue is not marked as a regression because those are new automated
> tests. They were recently introduced (i think in rhv-4.4.5-1) and they are
> using different kind of approach for taking the checksums and downloading
> the disks (using directly SDK example scripts).

I think this is a duplicate of the same issue already reported for virt-v2v.
This is a known issue.

> The flow of the failed TC is as following:
> 
> 1. clone vm from a pre-defined template latest-rhel-guest-image-8.3 and
> start it
> 2. Create 4 disks permutations with filesystem on VM:
> virtio_scsiraw, virtiocow, virtio_scsicow, virtioraw
> 3. Stop the VM
> 4. Check disks checksums as in this example taken from the failed TC [1]
> 6. WAIT FOR DISKS STATUS TO BE 'OK' 

This is not needed; checksum_disk and download_disk already do this, and should
wait until the disk is free to use for the next operation.
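
For reference, here is a minimal sketch of what such a wait looks like with the
SDK (not the example scripts' actual code). The helper name wait_for_disk_ok is
made up for this sketch, and it assumes an existing ovirtsdk4 Connection named
"connection" and a disk UUID "disk_id". Note that, per this bug, reaching OK
alone is not sufficient, because the engine flips the status a few seconds
before releasing the memory lock:

    import time

    import ovirtsdk4.types as types

    def wait_for_disk_ok(connection, disk_id, timeout=300, interval=1):
        """Poll the disk until the engine reports its status as OK."""
        disk_service = connection.system_service().disks_service().disk_service(disk_id)
        deadline = time.monotonic() + timeout
        while disk_service.get().status != types.DiskStatus.OK:
            if time.monotonic() > deadline:
                raise RuntimeError("timed out waiting for disk %s to unlock" % disk_id)
            time.sleep(interval)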

> 7. CHECK ENGINE LOG TO SEE THAT ALL OF THE DISKS COMPLETED THEIR CHECKSUMS
> BY SEARCHING THE FOLLOWING STRING: "Image Download with disk {disk}
> succeeded" ,AND ONLY THEN CONTINUE

This may not be enough. I think the only way that will work today is to
monitor the events for the transfer job. This should be fixed in the SDK
scripts in helpers/imagetransfer.py.

Both finalize/cancel must wait until the job has finished/failed before
completing.
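
For illustration only, a rough sketch of such a wait (not the actual
helpers/imagetransfer.py implementation; the name wait_for_transfer_completion
is made up here): after finalize is requested, keep polling the transfer until
it reaches a final phase or disappears from the API. On affected engine
versions even this can return a few seconds before the memory lock is
released, which is exactly what this bug tracks:

    import time

    import ovirtsdk4
    import ovirtsdk4.types as types

    def wait_for_transfer_completion(connection, transfer_id, timeout=300, interval=1):
        """Poll an image transfer until the engine reports a final phase."""
        transfer_service = (connection.system_service()
                            .image_transfers_service()
                            .image_transfer_service(transfer_id))
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                transfer = transfer_service.get()
            except ovirtsdk4.NotFoundError:
                # The engine may remove a finished transfer; treat that as done.
                return
            if transfer.phase == types.ImageTransferPhase.FINISHED_SUCCESS:
                return
            if transfer.phase == types.ImageTransferPhase.FINISHED_FAILURE:
                raise RuntimeError("transfer %s failed" % transfer_id)
            time.sleep(interval)
        raise RuntimeError("timed out waiting for transfer %s" % transfer_id)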

...
> * Per each storage type, steps 1 to 9 run once (with proxy download), AND
> steps 4 to 9 once more (this time with transfer)

Does it fail only when using the proxy? Maybe the engine is installed on a
very slow host, and the download on the engine machine causes the engine to
respond more slowly and unlock the disk later?

...
> Shortly after the failure to download the disk [2], We see a null pointer
> exception in the engine log [3].

...
> [3]:
> 2021-01-30 18:56:51,142+02 ERROR
> [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92)
> [a8eeea5b-2886-4224-9998-01b704d7a7aa] Exception:
> java.lang.NullPointerException
> 	at
> deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.
> TransferDiskImageCommand.getImageAlias(TransferDiskImageCommand.java:317)
> 	at
> deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.
> TransferDiskImageCommand.getTransferDescription(TransferDiskImageCommand.
> java:1379)
> 	at
> deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.
> TransferDiskImageCommand.handleFinishedSuccess(TransferDiskImageCommand.java:
> 893)

This error looks promising - the error is raised here:

    protected String getImageAlias() {
        return  getParameters().getAddDiskParameters() != null ?
                getParameters().getAddDiskParameters().getDiskInfo().getDiskAlias() :
                getDiskImage().getDiskAlias();
    }

    protected DiskImage getDiskImage() {
        if (!Guid.isNullOrEmpty(getParameters().getImageId())) {
            setImageId(getParameters().getImageId());
            return super.getDiskImage();
        }
        DiskImage diskImage = super.getDiskImage();
        if (diskImage == null) {
            diskImage = (DiskImage) diskDao.get(getParameters().getImageGroupID());
        }
        return diskImage;
    }

It looks like getDiskImage() or getParameters().getAddDiskParameters().getDiskInfo()
returns null, which is probably a bug.

Once we fix this issue and log a clear error message, it will be clearer why
this happened.

A second likely bug is that the failure to get the transfer description results
in a failure to clean up the job and release the lock properly.

Changing priority to high since we cannot have such issues in normal flows.
This will also affect backup applications.

Comment 3 Tal Nisan 2021-02-04 10:15:19 UTC

*** This bug has been marked as a duplicate of bug 1849861 ***

Comment 11 Nir Soffer 2021-08-10 12:28:19 UTC
This blocks v2v bug 1976020

In this comment we can see that the image transfer switched state to FINISHED
9 seconds before the lock was released.
https://bugzilla.redhat.com/show_bug.cgi?id=1976020#c7

Comment 12 Eyal Shenitzky 2021-10-04 06:54:34 UTC
*** Bug 2001894 has been marked as a duplicate of this bug. ***

Comment 13 Amit Sharir 2021-10-05 13:59:46 UTC
Hi, 

I tried to reproduce bug 1994001 on RHEV 4.4.9-3 and encountered an error similar to the one seen in this bug. 
During my reproduction scenario, I ran QE's automation test case "TestUploadImages", which uses the upload_disk.py and checksum_disk.py SDK example scripts. 
After the upload, the flow does a checksum of the disk.
In my tests, I got the following errors:

2021-10-05 15:37:13,716 - ThreadPoolExecutor-2_1 - VDS - ERROR - [10.46.12.145] Failed to run command ['python3', '/usr/share/doc/python3-ovirt-engine-sdk4/examples/checksum_disk.py', '-c', 'engine', 'bc443017-6542-46b7-8e3c-61c277f68f44'] ERR: Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/checksum_disk.py", line 50, in <module>
    connection, disk, types.ImageTransferDirection.DOWNLOAD)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 200, in create_transfer
    transfer = transfers_service.add(transfer)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 14153, in add
    return self._internal_add(image_transfer, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 232, in _internal_add
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 229, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is 409.
 OUT: 




What is weird is that the test fails on types.ImageTransferDirection.DOWNLOAD although the test doesn't perform a download operation at all.
Attaching relevant logs.

Pavel, can you please advise on this?

Comment 16 Pavel Bar 2021-10-18 08:03:09 UTC
(In reply to Amit Sharir from comment #13)
> Hi, 
> 
> I tried to reproduce bug 1994001 on RHEV 4.4.9-3 and encountered a similar
> error to what was seen in this bug. 
> During my reproduction scenario, I ran QE's automation test case
> "TestUploadImages" which uses upload_disk.py and checksum_disk.py SDK's. 
> After the upload, the flow does a checksum of the disk.
> In my tests, I got the following errors:
> 
> ...
> ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is
> "[Cannot transfer Virtual Disk: Disk is locked. Please try again later.]".
> HTTP response code is 409.
> 
> What is weird is that the test fails on
> types.ImageTransferDirection.DOWNLOAD although the test doesn't perform a
> download operation at all.
> Attaching relevant logs.
> 
> Pavel can you please advise on this?


The error that you encountered happens for the same reason as this bug (1923178).
After an image upload, "imagetransfer.finalize_transfer()" checks for "ImageTransferPhase.FINISHED_SUCCESS/FINISHED_FAILURE" and returns the moment one of the final statuses appears, allowing the next command to start running.
But the final status appears too soon, before all the disk locks have been released.
Thus the following image upload/download operation might fail.
Regarding your checksum issue and your surprise at seeing the seemingly unrelated "ImageTransferDirection.DOWNLOAD" there: the checksum is actually a two-step operation: (1) download the image and (2) calculate the checksum of the image downloaded in step 1.
https://github.com/oVirt/python-ovirt-engine-sdk4/blob/main/examples/checksum_disk.py#L50
So it is actually an upload+download scenario that fails due to the "locks not yet released although the first command's status is already marked as finished" issue.
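
As a client-side stop-gap (not the engine fix), a caller could retry the
transfer creation while the engine still answers with the 409 "Disk is locked"
fault. A rough sketch, assuming the SDK example helper
imagetransfer.create_transfer() shown in the traceback above and matching
conservatively on the fault text; the name create_transfer_with_retry is
hypothetical:

    import time

    import ovirtsdk4
    import ovirtsdk4.types as types

    # As in the SDK examples, run from the examples/ directory.
    from helpers import imagetransfer

    def create_transfer_with_retry(connection, disk, direction, retries=30, interval=2):
        """Retry transfer creation while the engine still holds the disk lock."""
        for _ in range(retries):
            try:
                return imagetransfer.create_transfer(connection, disk, direction)
            except ovirtsdk4.Error as e:
                if "Disk is locked" not in str(e):
                    raise
                time.sleep(interval)
        raise RuntimeError("disk %s stayed locked after %d attempts" % (disk.id, retries))

    # Example: the checksum flow would then start its download transfer with:
    # transfer = create_transfer_with_retry(
    #     connection, disk, types.ImageTransferDirection.DOWNLOAD)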

Comment 17 Avihai 2021-10-18 08:58:50 UTC
As bug 2001894 was marked as a duplicate of this bug, we/QE need to make sure we also cover the duplicate bug's flow.

@Amit, please add the bug 2001894 flow to the verification effort:

From the dup bug description:
"
We have a script that does the following flow:
1. Create snapshot
2. Download VM at the time of that snapshot
3. Delete snapshot

sometimes a failure when removing the snapshot
"

Comment 19 Amit Sharir 2021-10-19 10:48:46 UTC
Version:
ovirt-engine-4.4.9.2-0.6.el8ev.noarch
vdsm-4.40.90.2-1.el8ev.x86_64

Verification flow:
I used multiple automation/manual tests (with different storage types) that reproduced this issue in the past; the issue did not reproduce in any of them.

Verification Conclusions:
The expected output matched the actual output.
All the flows I mentioned were completed with no errors.


Bug verified.

Comment 20 Sandro Bonazzola 2021-10-21 07:27:14 UTC
This bug is included in the oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

