Description of problem: VM disk remains in locked state if image_transfer times out due to inactivity. As a result the image transfer finalize fails. - Engine logs - download image cancelled due to transfer inactivity. ~~~ 2020-01-29 16:07:33,012-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [663d2eb9] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity. 2020-01-29 16:07:43,026-05 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [7ff556e1-9183-465e-8447-5988541884c5] Transfer cancelled by system for Download disk 'windows' (disk id: 'b005cb13-be14-40a1-8145-fb304ed8208f', image id: 'bac90005-68fb-4a4f-9091-e638376a0fb4') 2020-01-29 16:07:43,027-05 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [7ff556e1-9183-465e-8447-5988541884c5] Updating image transfer 3f2fcbbd-1246-49c6-8185-c09349b62383 phase to Finalizing Failure 2020-01-29 16:07:53,041-05 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] Finalizing failed transfer. Download disk 'windows' (disk id: 'b005cb13-be14-40a1-8145-fb304ed8208f', image id: 'bac90005-68fb-4a4f-9091-e638376a0fb4') 2020-01-29 16:07:53,050-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] START, RemoveImageTicketVDSCommand(HostName = rhevh23, RemoveImageTicketVDSCommandParameters:{hostId='8f22f76c-1428-4492-a25e-3787ea78d6b8', ticketId='a08b06b2-4612-45be-aaf8-cc3ea4e37cb2', timeout='null'}), log id: 6eca44b4 2020-01-29 16:07:53,057-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] FINISH, RemoveImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 6eca44b4 2020-01-29 16:07:53,057-05 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] Successfully stopped image transfer session for ticket 'a08b06b2-4612-45be-aaf8-cc3ea4e37cb2' 2020-01-29 16:07:53,057-05 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] Removing image ticket from ovirt-imageio-proxy, id a08b06b2-4612-45be-aaf8-cc3ea4e37cb2 2020-01-29 16:07:53,301-05 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [7ff556e1-9183-465e-8447-5988541884c5] Updating image transfer 3f2fcbbd-1246-49c6-8185-c09349b62383 phase to Finished Failure 2020-01-29 16:08:03,377-05 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-57) [7ff556e1-9183-465e-8447-5988541884c5] Transfer failed. Download disk 'amar-windows' (disk id: 'b005cb13-be14-40a1-8145-fb304ed8208f', image id: 'bac90005-68fb-4a4f-9091-e638376a0fb4') 2020-01-29 16:08:04,386-05 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [7ff556e1-9183-465e-8447-5988541884c5] Updating image transfer 3f2fcbbd-1246-49c6-8185-c09349b62383 phase to Finished Failure ~~~ The image download api is widely use to to download VM disk image as backup. Version-Release number of selected component (if applicable): ~~~ ovirt-engine-4.3.7.2-0.1.el7.noarch ovirt-imageio-proxy-1.5.2-0.el7ev.noarch ~~~ How reproducible: 100% Steps to Reproduce: Using the following: 1. https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/download_disk_snapshots.py Modify to comment out transfer_service.finalize() on line 146. 2. Change the TransferImageClientInactivityTimeoutInSeconds from 60 to 5 to shorten the timeout. engine-config -s TransferImageClientInactivityTimeoutInSeconds=5 systemctl restart ovirt-engine Actual results: After the inactivity timeout, the VM disk will remain LOCKED. Expected results: The VM disk should not be LOCKED after the timeout. Additional info:
~~~ image_guid | bac90005-68fb-4a4f-9091-e638376a0fb4 creation_date | 2019-11-20 11:18:04-05 size | 21474836480 it_guid | 00000000-0000-0000-0000-000000000000 parentid | 00000000-0000-0000-0000-000000000000 imagestatus | 2 <<====================== lastmodified | 2020-01-29 14:53:32.324-05 vm_snapshot_id | 39258704-f692-4b49-a064-d55cff67366e volume_type | 2 volume_format | 4 image_group_id | b005cb13-be14-40a1-8145-fb304ed8208f _create_date | 2019-11-20 11:18:00.60448-05 _update_date | 2020-01-29 14:53:32.325182-05 active | f volume_classification | 1 qcow_compat | 2 ~~~
ovirt-imageio-daemon does not control disks state, this is engine issue.
There were some changes in the SDK since the bug was opened, so in order to check that everything is working and reproduce: using the download_disk_snapshot.py from the SDK - https://github.com/oVirt/ovirt-engine-sdk/blob/450fe32573778e7d6fa7613a60e69a2b81a8959b/sdk/examples/download_disk_snapshot.py#L155 1. checking that for regular snapshot download everything working fine. 2. checking that for download snapshot with time out also working and the image returns to OK status and not remains locked, this can be done by comment out the finally section at the end of download_disk_snapshot.py
Verified on rhv-release-4.4.4-5-001.noarch 1. in the download_disk_snapshot.py script comment out the last 'finally' block where it finalizes the transfer. 2. Start downloading disk snapshot 3. Notice that the disk is now locked 4. After a minute, an engine ERROR will appear saying that "DOWNLOAD_IMAGE_CANCELED_TIMEOUT" 5. The transfer will be cancelled, and the disk should reclaim the 'ok' status. All worked as expected.
Hi Bella, While working on automatic this test case i noticed the following: When Downloading (without finalizing the transfer) snapshot disk with 'preallocated' allocation policy gets finalized without the ' DOWNLOAD_IMAGE_CANCELED_TIMEOUT ' message in engine log. Is that an expected behavior? Attaching the VM dumpxml Attempting to download preallocated disk e634618d-d2ff-4a87-b6e0-9788c919cd5c without finalizing it (removed the 'finally' block from download_disk_snapshot): [root@storage-ge13-vdsm3 ~]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py iscsi_1 e634618d-d2ff-4a87-b6e0-9788c919cd5c /home/ilan/download/e634618d-d2ff-4a87-b6e0-9788c919cd5c -c engine [ 0.0 ] Connecting... [ 0.2 ] Creating image transfer... [ 2.9 ] Transfer ID: c61bb5bd-7011-43a2-b168-e8e03c07c099 [ 2.9 ] Transfer host name: host_mixed_3 [ 2.9 ] Downloading image... [ 100.00% ] 1.00 GiB, 6.83 seconds, 149.84 MiB/s From engine log: 2021-01-26 15:10:53,594+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default tas k-26) [09b73bef-6a8f-4cbf-bc01-6de764524b13] Successfully added Download disk 'disk_virtioraw_2614493995' (disk id: ' 16b4e917-5acd-4ffa-802d-4c6a4561a367', image id: 'e634618d-d2ff-4a87-b6e0-9788c919cd5c') for image transfer command ' c61bb5bd-7011-43a2-b168-e8e03c07c099' . . . 2021-01-26 15:11:18,316+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedSched uledExecutorService-engineScheduledThreadPool-Thread-65) [ccb9265] Updating image transfer c61bb5bd-7011-43a2-b168-e8 e03c07c099 phase to Finalizing Success Attaching VM dumpxml, engine log, vdsm log, vdsm daemon log, engine daemon log.
Created attachment 1750894 [details] preallocated snap disk finalized without 'DOWNLOAD_IMAGE_CANCELED_TIMEOUT' message
After looking if the image transfer operation is download and the disk is a snapshot disk the following behavior happening: In case that the disk is the thin provision (raw) as in the previous cases we can know in advance when the transfer is finished so in that case if the finalized function is not called (commend out in the transfer script) it is not executed and after some time connectivity issue is rais and onFailed() is executed. For pre-allocated disks (qcow) the engine knows in advance what the size needs to be transferred and at the end of the operation a check is performed and if all of the bytes was transfer the finalize function is preformed even though it's not explicitly being called and the timeout limit is not reached. https://github.com/oVirt/ovirt-engine/blob/9059df4d26f5d327c87495ee9e19cb1876ad92b7/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/TransferDiskImageCommand.java#L734 So according to the behavior explaining for each case, the behavior described in the previous comment is expected and not an issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Low: RHV-M(ovirt-engine) 4.4.z security, bug fix, enhancement update [ovirt-4.4.4]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0381
Bella, can you explain what was the issue in engine and how this was fixed? Backup partners need to understand who to workaround this issue in 4.3.
The issue in the engine was a part of the finalizing flow, when image_transfer times out due to inactivity handleFinalizingFailure() being called from TransferDiskImageCommand class which triggers the cleanup() method. as part of the cleanup() there was a check "if(!Guid.isNullOrEmpty(getParameters().getImageGroupID())" and only if this condition is not apply the disk lock was realsed and the disk returns to be OK. From looking into diskImage initialization this condition was set only to snapshot disks, hence only they remained locked. After some additional checks, I didn't found any reason for the snapshot disk to remain locked in case of download action so this condition simply was removed and this solved the issue. This is the patch with the fix https://gerrit.ovirt.org/#/c/ovirt-engine/+/112264/