Created attachment 1793017 [details]
engine.log

Description of problem:
If we are downloading a snapshot disk and the HE VM is rebooted, the image transfer fails but the disk remains in locked state in the database.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.6.8-0.1.el8ev.noarch
ovirt-imageio-daemon-2.1.1-1.el8ev.x86_64
RHVH host 4.4.6.1-0.20210527.0

How reproducible:
I've reproduced it two times (100%)

Steps to Reproduce:
1. Create a snapshot of a VM

2. Download the snapshot from an external machine using the proxy_url:

# curl -k -X POST \
  -H 'Version: 4' \
  -H 'Accept: application/json' \
  -d 'grant_type=password&scope=ovirt-app-api&username=admin%40internal&password=password' \
  https://jorti-rhvm44.lab.sbr-virt.gsslab.brq.redhat.com/ovirt-engine/sso/oauth/token

# curl -k -X POST \
  -H 'Version: 4' \
  -H 'Accept: application/xml' \
  -H 'Content-Type: application/xml' \
  -H 'Authorization: Bearer ISXD5wwMtWjBEFKQivXc_LHRyjTHFsojTGNnvIxWIXabxkCZK0TFzhqPQ0gXXtR63hz8SHAd3yuayBj0tk0OgQ' \
  -d '<image_transfer><snapshot id="595574ec-5b73-4407-978a-d986a21a735f"/><direction>download</direction></image_transfer>' \
  https://jorti-rhvm44.lab.sbr-virt.gsslab.brq.redhat.com/ovirt-engine/api/imagetransfers

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<image_transfer href="/ovirt-engine/api/imagetransfers/ccfef028-9541-466c-ab60-96e88b5ead00" id="ccfef028-9541-466c-ab60-96e88b5ead00">
    <actions>
        <link href="/ovirt-engine/api/imagetransfers/ccfef028-9541-466c-ab60-96e88b5ead00/resume" rel="resume"/>
        <link href="/ovirt-engine/api/imagetransfers/ccfef028-9541-466c-ab60-96e88b5ead00/cancel" rel="cancel"/>
        <link href="/ovirt-engine/api/imagetransfers/ccfef028-9541-466c-ab60-96e88b5ead00/extend" rel="extend"/>
        <link href="/ovirt-engine/api/imagetransfers/ccfef028-9541-466c-ab60-96e88b5ead00/pause" rel="pause"/>
        <link href="/ovirt-engine/api/imagetransfers/ccfef028-9541-466c-ab60-96e88b5ead00/finalize" rel="finalize"/>
    </actions>
    <active>false</active>
    <direction>download</direction>
    <format>cow</format>
    <inactivity_timeout>60</inactivity_timeout>
    <phase>transferring</phase>
    <proxy_url>https://jorti-rhvm44.lab.sbr-virt.gsslab.brq.redhat.com:54323/images/a3a3cbac-12df-4bf0-8e20-3707ce40fe8b</proxy_url>
    <shallow>false</shallow>
    <timeout_policy>legacy</timeout_policy>
    <transfer_url>https://jorti-rhvh44-01.lab.sbr-virt.gsslab.brq.redhat.com:54322/images/a3a3cbac-12df-4bf0-8e20-3707ce40fe8b</transfer_url>
    <transferred>0</transferred>
    <host href="/ovirt-engine/api/hosts/fd89aed2-fdae-4e90-bec6-8a6f8e16d5e9" id="fd89aed2-fdae-4e90-bec6-8a6f8e16d5e9"/>
</image_transfer>

# curl -vk \
  -H 'Version: 4' \
  -H 'Authorization: Bearer ISXD5wwMtWjBEFKQivXc_LHRyjTHFsojTGNnvIxWIXabxkCZK0TFzhqPQ0gXXtR63hz8SHAd3yuayBj0tk0OgQ' \
  https://jorti-rhvm44.lab.sbr-virt.gsslab.brq.redhat.com:54323/images/a3a3cbac-12df-4bf0-8e20-3707ce40fe8b -o /var/tmp/image_transfer.qcow2

3. While the download is in progress, reboot the HostedEngine VM.

Actual results:
The image transfer is marked as failed and it is logged that the lock on the disk is freed, however its status is still locked in the database.
2021-06-22 12:54:52,350+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [a8a0f626-642b-4aca-9333-3c31b8f04aab] Failed to transfer disk '595574ec-5b73-4407-978a-d986a21a735f' (command id 'ccfef028-9541-466c-ab60-96e88b5ead00')
2021-06-22 12:54:52,351+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [a8a0f626-642b-4aca-9333-3c31b8f04aab] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand' successfully.
2021-06-22 12:54:52,352+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [a8a0f626-642b-4aca-9333-3c31b8f04aab] Lock freed to object 'EngineLock:{exclusiveLocks='[116646a2-de9d-4339-bff2-1b4ea9ede5a5=DISK]', sharedLocks='[75465d8e-681f-4258-81b5-d0e9e9d092f6=VM]'}' <----
2021-06-22 12:54:52,457+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [a8a0f626-642b-4aca-9333-3c31b8f04aab] Error during log command: org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand. Exception null

engine=> select image_guid, parentid, image_group_id, imagestatus, active from images where image_group_id = '116646a2-de9d-4339-bff2-1b4ea9ede5a5';
              image_guid              |               parentid               |            image_group_id            | imagestatus | active
--------------------------------------+--------------------------------------+--------------------------------------+-------------+--------
 ebb3b181-272d-4ebe-a1c2-b0e075c57ea2 | 595574ec-5b73-4407-978a-d986a21a735f | 116646a2-de9d-4339-bff2-1b4ea9ede5a5 |           1 | t
 595574ec-5b73-4407-978a-d986a21a735f | 00000000-0000-0000-0000-000000000000 | 116646a2-de9d-4339-bff2-1b4ea9ede5a5 |           2 | f      <----
(2 rows)

Expected results:
Image unlocked after the transfer is marked as failed.

Additional info:
In customer's environment this happened after the restart of several ovirt services (ovirt-engine, ovirt-imageio, ovirt-vmconsole-proxy-sshd, ovirt-provider-ovn and ovirt-websocket-proxy) instead of the reboot of the full VM.
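A possible manual workaround until a fix is available (a sketch only, not verified on this build): the engine ships a dbutils script that can list and clear stale entity locks in the database. The -q and -t flags below are quoted from memory, so confirm them against the script's -h output before running anything:

# assumption: -q only queries locked entities, -t selects the entity type; verify with -h first
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -q -t disk
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t disk 116646a2-de9d-4339-bff2-1b4ea9ede5a5

The first command would only list the locked disks; the second one would clear the lock on the affected disk group from this report.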
I hit a similar problem on rhv-4.4.10.2-0.1.el8ev: the RHV node cannot be switched to Maintenance mode because an image transfer is reported as in progress, even though all images have already been cleaned up under Storage -> Disks (please refer to the screenshot). Stopping the ovirt-imageio service does not resolve this either.
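In case it helps with debugging, the stale transfer should also be visible in the engine database; this is only a sketch, since the column names of the image_transfers table below are assumptions from memory:

-- column names are assumptions; adjust to the actual image_transfers schema
engine=> select command_id, phase, disk_id, last_updated from image_transfers;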
(In reply to mxie from comment #2)
> I hit a similar problem on rhv-4.4.10.2-0.1.el8ev: the RHV node cannot be
> switched to Maintenance mode because an image transfer is reported as in
> progress, even though all images have already been cleaned up under
> Storage -> Disks (please refer to the screenshot). Stopping the
> ovirt-imageio service does not resolve this either.

This sounds like a different issue - we enable switching a host to maintenance when an image transfer is in status FAILED.
However, the issue you are facing might be the same as the one reported in bz 2037057 - so if you can provide logs (in a separate bug) that cover the lifecycle of the image transfer(s) blocking the move to maintenance, that would be appreciated.
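For reference, the transfers the engine currently tracks (and their phase) can be listed with the same imagetransfers endpoint used in the description; this is just a sketch, with the engine FQDN and bearer token as placeholders:

# curl -k \
  -H 'Version: 4' \
  -H 'Accept: application/xml' \
  -H 'Authorization: Bearer <token>' \
  https://<engine-fqdn>/ovirt-engine/api/imagetransfers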
A simple reproducer (command sketch below):
1. Start a disk download
2. While the transfer is in progress, stop ovirt-engine
3. Restart the ovirt-imageio service
4. Start ovirt-engine
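Roughly, with the download from step 1 still in progress, steps 2-4 come down to the following; this is a sketch that assumes the commands run on the engine machine and that ovirt-imageio there is the engine-side proxy service:

# assumption: run on the engine machine; ovirt-imageio is the engine-side (proxy) service
systemctl stop ovirt-engine
systemctl restart ovirt-imageio
systemctl start ovirt-engine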
This should be fixed by recent changes in image transfer, moving to ON_QA
(In reply to Arik from comment #4)
> (In reply to mxie from comment #2)
> > I hit a similar problem on rhv-4.4.10.2-0.1.el8ev: the RHV node cannot be
> > switched to Maintenance mode because an image transfer is reported as in
> > progress, even though all images have already been cleaned up under
> > Storage -> Disks (please refer to the screenshot). Stopping the
> > ovirt-imageio service does not resolve this either.
>
> This sounds like a different issue - we enable switching a host to
> maintenance when an image transfer is in status FAILED.
> However, the issue you are facing might be the same as the one reported in
> bz 2037057 - so if you can provide logs (in a separate bug) that cover the
> lifecycle of the image transfer(s) blocking the move to maintenance, that
> would be appreciated.

Just saw the comment; filed bug 2070491 to track the problem, thanks!
This issue was most likely resolved by the fixes for the following bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=2043984
https://bugzilla.redhat.com/show_bug.cgi?id=2057445 (this one was backported to 4.4.10)
Verified. The disk is not locked after reproducing the steps.

Versions:
ovirt-engine-4.5.0.5-0.7.el8ev
ovirt-imageio-daemon-2.4.3-1.el8ev.x86_64
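The query from the description can be reused to confirm that no image in the disk group is left with imagestatus 2 (the locked row in the original output; 1 was the OK image) after the engine comes back up; the UUID below is the one from the original report:

engine=> select image_guid, imagestatus from images where image_group_id = '116646a2-de9d-4339-bff2-1b4ea9ede5a5';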
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:4711
Due to QE capacity, we are not going to cover this issue in our automation