Bug 1796231 - VM disk remains in locked state if image transfer (image download) timesout due to inactivity.
Summary: VM disk remains in locked state if image transfer (image download) timesout d...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.7
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.4.4
: 4.4.4
Assignee: Bella Khizgiyaev
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-01-29 22:57 UTC by Bimal Chollera
Modified: 2023-12-15 17:14 UTC (History)
10 users (show)

Fixed In Version: ovirt-engine-4.4.4.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-02 13:57:12 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
preallocated snap disk finalized without 'DOWNLOAD_IMAGE_CANCELED_TIMEOUT' message (1.05 MB, application/zip)
2021-01-26 13:28 UTC, Ilan Zuckerman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4775701 0 None None None 2020-01-29 23:29:07 UTC
Red Hat Product Errata RHSA-2021:0381 0 None None None 2021-02-02 13:57:41 UTC
oVirt gerrit 112264 0 master MERGED core: Snapshot disk remains locked if image transfer times out 2021-02-12 16:07:33 UTC

Description Bimal Chollera 2020-01-29 22:57:53 UTC
Description of problem:


VM disk remains in locked state if image_transfer times out due to inactivity.
As a result the image transfer finalize fails.

-	Engine logs - download image cancelled due to transfer inactivity.

~~~
2020-01-29 16:07:33,012-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [663d2eb9] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.

2020-01-29 16:07:43,026-05 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [7ff556e1-9183-465e-8447-5988541884c5] Transfer cancelled by system for Download disk 'windows' (disk id: 'b005cb13-be14-40a1-8145-fb304ed8208f', image id: 'bac90005-68fb-4a4f-9091-e638376a0fb4')

2020-01-29 16:07:43,027-05 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [7ff556e1-9183-465e-8447-5988541884c5] Updating image transfer 3f2fcbbd-1246-49c6-8185-c09349b62383 phase to Finalizing Failure

2020-01-29 16:07:53,041-05 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] Finalizing failed transfer. Download disk 'windows' (disk id: 'b005cb13-be14-40a1-8145-fb304ed8208f', image id: 'bac90005-68fb-4a4f-9091-e638376a0fb4')

2020-01-29 16:07:53,050-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] START, RemoveImageTicketVDSCommand(HostName = rhevh23, RemoveImageTicketVDSCommandParameters:{hostId='8f22f76c-1428-4492-a25e-3787ea78d6b8', ticketId='a08b06b2-4612-45be-aaf8-cc3ea4e37cb2', timeout='null'}), log id: 6eca44b4

2020-01-29 16:07:53,057-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] FINISH, RemoveImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 6eca44b4

2020-01-29 16:07:53,057-05 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] Successfully stopped image transfer session for ticket 'a08b06b2-4612-45be-aaf8-cc3ea4e37cb2'

2020-01-29 16:07:53,057-05 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] Removing image ticket from ovirt-imageio-proxy, id a08b06b2-4612-45be-aaf8-cc3ea4e37cb2

2020-01-29 16:07:53,301-05 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] Updating image transfer 3f2fcbbd-1246-49c6-8185-c09349b62383 phase to Finished Failure

2020-01-29 16:08:03,377-05 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-57) [7ff556e1-9183-465e-8447-5988541884c5] Transfer failed. Download disk 'amar-windows' (disk id: 'b005cb13-be14-40a1-8145-fb304ed8208f', image id: 'bac90005-68fb-4a4f-9091-e638376a0fb4')

2020-01-29 16:08:04,386-05 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [7ff556e1-9183-465e-8447-5988541884c5] Updating image transfer 3f2fcbbd-1246-49c6-8185-c09349b62383 phase to Finished Failure
~~~

The image download api is widely use to to download VM disk image as backup.

Version-Release number of selected component (if applicable):

~~~
ovirt-engine-4.3.7.2-0.1.el7.noarch
ovirt-imageio-proxy-1.5.2-0.el7ev.noarch
~~~

How reproducible:
100%

Steps to Reproduce:

Using the following:
1.	https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/download_disk_snapshots.py
	Modify to comment out transfer_service.finalize() on line 146.

2.	Change the TransferImageClientInactivityTimeoutInSeconds from 60 to 5 to shorten the timeout.

engine-config -s TransferImageClientInactivityTimeoutInSeconds=5
systemctl restart ovirt-engine

Actual results:

After the inactivity timeout, the VM disk will remain LOCKED.

Expected results:

The VM disk should not be LOCKED after the timeout.

Additional info:

Comment 2 Bimal Chollera 2020-01-29 23:17:43 UTC
~~~
image_guid            | bac90005-68fb-4a4f-9091-e638376a0fb4
creation_date         | 2019-11-20 11:18:04-05
size                  | 21474836480
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 00000000-0000-0000-0000-000000000000
imagestatus           | 2  <<======================
lastmodified          | 2020-01-29 14:53:32.324-05
vm_snapshot_id        | 39258704-f692-4b49-a064-d55cff67366e
volume_type           | 2
volume_format         | 4
image_group_id        | b005cb13-be14-40a1-8145-fb304ed8208f
_create_date          | 2019-11-20 11:18:00.60448-05
_update_date          | 2020-01-29 14:53:32.325182-05
active                | f
volume_classification | 1
qcow_compat           | 2
~~~

Comment 3 Nir Soffer 2020-04-26 01:20:04 UTC
ovirt-imageio-daemon does not control disks state, this is engine issue.

Comment 8 Bella Khizgiyaev 2020-12-09 11:41:03 UTC
There were some changes in the SDK since the bug was opened, so in order to check that everything is working and reproduce:

using the download_disk_snapshot.py from the SDK -  https://github.com/oVirt/ovirt-engine-sdk/blob/450fe32573778e7d6fa7613a60e69a2b81a8959b/sdk/examples/download_disk_snapshot.py#L155 

1. checking that for regular snapshot download everything working fine.

2. checking that for download snapshot with time out also working and the image returns to OK status and not remains locked,
this can be done by comment out the finally section at the end of download_disk_snapshot.py

Comment 10 Ilan Zuckerman 2020-12-15 07:59:08 UTC
Verified on rhv-release-4.4.4-5-001.noarch

1. in the download_disk_snapshot.py script comment out the last 'finally' block where it finalizes the transfer.
2. Start downloading disk snapshot
3. Notice that the disk is now locked
4. After a minute, an engine ERROR will appear saying that "DOWNLOAD_IMAGE_CANCELED_TIMEOUT"
5. The transfer will be cancelled, and the disk should reclaim the 'ok' status.

All worked as expected.

Comment 12 Ilan Zuckerman 2021-01-26 13:27:41 UTC
Hi Bella, While working on automatic this test case i noticed the following:

When Downloading (without finalizing the transfer) snapshot disk with 'preallocated' allocation policy gets finalized without the ' DOWNLOAD_IMAGE_CANCELED_TIMEOUT ' message in engine log.
Is that an expected behavior?

Attaching the VM dumpxml

Attempting to download preallocated disk e634618d-d2ff-4a87-b6e0-9788c919cd5c without finalizing it (removed the 'finally' block from download_disk_snapshot):

[root@storage-ge13-vdsm3 ~]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py iscsi_1 e634618d-d2ff-4a87-b6e0-9788c919cd5c /home/ilan/download/e634618d-d2ff-4a87-b6e0-9788c919cd5c -c engine
[   0.0 ] Connecting...
[   0.2 ] Creating image transfer...
[   2.9 ] Transfer ID: c61bb5bd-7011-43a2-b168-e8e03c07c099
[   2.9 ] Transfer host name: host_mixed_3
[   2.9 ] Downloading image...
[ 100.00% ] 1.00 GiB, 6.83 seconds, 149.84 MiB/s                               

From engine log:

2021-01-26 15:10:53,594+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default tas
k-26) [09b73bef-6a8f-4cbf-bc01-6de764524b13] Successfully added Download disk 'disk_virtioraw_2614493995' (disk id: '
16b4e917-5acd-4ffa-802d-4c6a4561a367', image id: 'e634618d-d2ff-4a87-b6e0-9788c919cd5c') for image transfer command '
c61bb5bd-7011-43a2-b168-e8e03c07c099'
.
.
.
2021-01-26 15:11:18,316+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedSched
uledExecutorService-engineScheduledThreadPool-Thread-65) [ccb9265] Updating image transfer c61bb5bd-7011-43a2-b168-e8
e03c07c099 phase to Finalizing Success

Attaching VM dumpxml, engine log, vdsm log, vdsm daemon log, engine daemon log.

Comment 13 Ilan Zuckerman 2021-01-26 13:28:31 UTC
Created attachment 1750894 [details]
preallocated snap disk finalized without 'DOWNLOAD_IMAGE_CANCELED_TIMEOUT' message

Comment 14 Bella Khizgiyaev 2021-01-27 17:22:21 UTC
After looking if the image transfer operation is download and the disk is a snapshot disk the following behavior happening:

In case that the disk is the thin provision (raw) as in the previous cases we can know in advance when the transfer is finished so in that case if the finalized function is not called (commend out in the transfer script) it is not executed and after some time connectivity issue is rais and onFailed() is executed.

For pre-allocated disks (qcow) the engine knows in advance what the size needs to be transferred and at the end of the operation a check is performed and if all of the bytes was transfer the finalize function is preformed even though it's not explicitly being called and the timeout limit is not reached. 
https://github.com/oVirt/ovirt-engine/blob/9059df4d26f5d327c87495ee9e19cb1876ad92b7/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/TransferDiskImageCommand.java#L734

So according to the behavior explaining for each case, the behavior described in the previous comment is expected and not an issue.

Comment 17 errata-xmlrpc 2021-02-02 13:57:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: RHV-M(ovirt-engine) 4.4.z security, bug fix, enhancement update [ovirt-4.4.4]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0381

Comment 18 Nir Soffer 2021-05-13 10:51:37 UTC
Bella, can  you explain what was the issue in engine and how this was fixed?

Backup partners need to understand who to workaround this issue in 4.3.

Comment 19 Bella Khizgiyaev 2021-05-18 14:03:31 UTC
The issue in the engine was a part of the finalizing flow, when image_transfer times out due to inactivity handleFinalizingFailure() being called from TransferDiskImageCommand class which triggers the cleanup() method.

as part of the cleanup() there was a check "if(!Guid.isNullOrEmpty(getParameters().getImageGroupID())" and only if this condition is not apply the disk lock was realsed and the disk returns to be OK.
From looking into diskImage initialization this condition was set only to snapshot disks, hence only they remained locked.
After some additional checks, I didn't found any reason for the snapshot disk to remain locked in case of download action so this condition simply was removed and this solved the issue.

This is the patch with the fix https://gerrit.ovirt.org/#/c/ovirt-engine/+/112264/


Note You need to log in before you can comment on or make changes to this bug.