Created attachment 1699990 [details]
engine and vdsm logs

Description of problem:

When using the SDK to download a qcow2 disk, the transfer times out while waiting for the transfer to finalize, with the following ERROR from the SDK [1], the following error from the engine [2], and the following error from the vdsm host [3].

I doubt this is a regression bug for the following reasons:
- I wasn't able to reproduce it (I tried more than once)
- A very similar test (involving disk download) succeeded just before this one.

Having said that, this is still a bug and needs to be treated.
Unfortunately, daemon.log files couldn't be retrieved from the hosts and the engine.

There was a very similar bug for downloading from the UI:
https://bugzilla.redhat.com/show_bug.cgi?id=1835601

[1]:
2020-07-05 10:37:18,495 - MainThread - RemoteExecutor - DEBUG - [root.16.37/qum5net] Results of command: python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py download fac3911d-95b6-41be-a538-dad345691502 --password-file /root/pass.txt --engine-url https://hosted-engine-06.lab.eng.tlv2.redhat.com --username admin@internal --cafile /root/ca.crt --backup-dir /root/full_backup/ --backup-uuid 44ae937c-8c30-41a1-bb68-d4daf3177d7e
2020-07-05 10:37:18,496 - MainThread - RemoteExecutor - DEBUG - [root.16.37/qum5net] OUT: [ 0.0 ] Downloading VM fac3911d-95b6-41be-a538-dad345691502 disks
[ 0.2 ] Creating image transfer for disk 2d075957-7481-4af3-9920-e63d7759af83
[ 2.9 ] Image transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 is ready
[ ------- ] 0 bytes, 0.00 seconds, 0 bytes/s
Formatting '/root/full_backup/2d075957-7481-4af3-9920-e63d7759af83.202007051036.full.qcow2', fmt=qcow2 size=1073741824 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[ 99.89% ] 1022.88 MiB, 0.22 seconds, 4.56 GiB/s
[ 99.92% ] 1023.19 MiB, 0.33 seconds, 3.00 GiB/s
[ 99.99% ] 1023.88 MiB, 0.44 seconds, 2.29 GiB/s
[ 100.00% ] 1.00 GiB, 0.46 seconds, 2.17 GiB/s
[ 3.3 ] Finalizing image transfer
2020-07-05 10:37:18,496 - MainThread - RemoteExecutor - DEBUG - [root.16.37/qum5net] ERR: Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 476, in <module>
    main()
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 173, in main
    args.command(args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 244, in cmd_download
    connection, args.backup_uuid, args, incremental=args.incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 385, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 430, in download_disk
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 296, in finalize_transfer
    .format(transfer.id))
RuntimeError: Timed out waiting for transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 to finalize

---------------------------------------------------------------------------------------------

[2]:
2020-07-05 10:37:20,801+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] START, GetImageTicketVDSCommand(HostName = host_mixed_1, GetImageTicketVDSCommandParameters:{hostId='9fb37721-708b-4327-9148-d3f302e51436', ticketId='0287b970-6b73-4b9e-b683-47c1f44dbe86', timeout='null'}), log id: 382fd9d6
2020-07-05 10:37:20,807+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@dc80e36, log id: 382fd9d6
2020-07-05 10:37:20,825+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.
2020-07-05 10:37:20,826+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] Updating image transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 (image 2d075957-7481-4af3-9920-e63d7759af83) phase to Cancelled

---------------------------------------------------------------------------------------------

[3]:
2020-07-05 10:37:19,686+0300 INFO (jsonrpc/7) [api.virt] FINISH stop_backup return={'status': {'code': 0, 'message': 'Done'}} from=::ffff:10.46.16.252,60486, flow_id=backups_syncAction_9850ddba-f8b8-4b8, vmId=fac3911d-95b6-41be-a538-dad345691502 (api:54)
2020-07-05 10:37:19,686+0300 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.stop_backup succeeded in 0.01 seconds (__init__:312)
2020-07-05 10:37:19,694+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive sda was canceled (vm:5306)
2020-07-05 10:37:19,710+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive vdb was canceled (vm:5306)
2020-07-05 10:37:19,732+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive sdb was canceled (vm:5306)
2020-07-05 10:37:19,783+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive vdc was canceled (vm:5306)

Version-Release number of selected component (if applicable):
rhv-4.4.1-10

How reproducible:
Wasn't able to reproduce. Happened as a result of an automation run.

Steps to Reproduce:
1. Create a VM from a template
2. Attach permutation disks (with fs + data)
3. Perform a full backup of all disks
4. Download all of the backed-up disks

Actual results:
Timed out waiting for transfer XXX to finalize

Expected results:
Shouldn't time out
(In reply to Ilan Zuckerman from comment #0)
> Created attachment 1699990 [details]
> engine and vdsm logs
>
> Description of problem:
>
> When using the SDK to download a qcow2 disk, the transfer times out while
> waiting for the transfer to finalize

Looks like a bug; finalize should not time out under normal conditions.

> There was a very similar bug for downloading from the UI:
> https://bugzilla.redhat.com/show_bug.cgi?id=1835601

That bug is closed as verified.

> [1]:
> 2020-07-05 10:37:18,495 - MainThread - RemoteExecutor - DEBUG - [root.16.37/qum5net] Results of command: python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py download fac3911d-95b6-41be-a538-dad345691502 --password-file /root/pass.txt --engine-url https://hosted-engine-06.lab.eng.tlv2.redhat.com --username admin@internal --cafile /root/ca.crt --backup-dir /root/full_backup/ --backup-uuid 44ae937c-8c30-41a1-bb68-d4daf3177d7e
> 2020-07-05 10:37:18,496 - MainThread - RemoteExecutor - DEBUG - [root.16.37/qum5net] OUT: [ 0.0 ] Downloading VM fac3911d-95b6-41be-a538-dad345691502 disks
...
> [ 100.00% ] 1.00 GiB, 0.46 seconds, 2.17 GiB/s

The transfer was successful...

> [ 3.3 ] Finalizing image transfer
...
> "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 296, in finalize_transfer
> .format(transfer.id))
> RuntimeError: Timed out waiting for transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 to finalize

This means the engine could not finalize the transfer within 60 seconds, which is a lot of time for removing the ticket and tearing down the image.

> [2]:
> 2020-07-05 10:37:20,801+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] START, GetImageTicketVDSCommand(HostName = host_mixed_1, GetImageTicketVDSCommandParameters:{hostId='9fb37721-708b-4327-9148-d3f302e51436', ticketId='0287b970-6b73-4b9e-b683-47c1f44dbe86', timeout='null'}), log id: 382fd9d6
> 2020-07-05 10:37:20,807+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@dc80e36, log id: 382fd9d6
> 2020-07-05 10:37:20,825+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.

Looks like the engine did not receive the finalize request, or there is some race that caused it to ignore the finalize request.

> 2020-07-05 10:37:20,826+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] Updating image transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 (image 2d075957-7481-4af3-9920-e63d7759af83) phase to Cancelled

This is expected if a user did not finalize the transfer in time, which is not the case here.

I did not look at the vdsm log, but the engine logs suggest that this is an engine bug in the handling of the finalize call.

This may be a race when calling finalize very quickly after starting the transfer. In this case the transfer was completed in 0.46 seconds:

> [ 100.00% ] 1.00 GiB, 0.46 seconds, 2.17 GiB/s

So maybe the engine was not ready for the finalize call.

To reproduce this issue we can write a small script doing:

for i in range(1000):
    transfer = imagetransfer.create_transfer(...)
    imagetransfer.finalize_transfer(...)

It should work without any error, but I think this will quickly trigger this bug.

Ilan, do you want to try to create such a reproducer?
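For reference, a minimal self-contained sketch of such a reproducer could look as follows. This is only a sketch: it assumes it is run from the SDK examples directory so that the helpers package is importable, and the engine URL, credentials, and disk ID are placeholders.

from contextlib import closing

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

from helpers import imagetransfer  # helper module shipped with the SDK examples

# Placeholders - adjust to the environment under test.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    ca_file="ca.pem",
)

with closing(connection):
    # Look up the disk that will be transferred (placeholder disk ID).
    disk = connection.system_service().disks_service().disk_service("DISK-UUID").get()

    for i in range(1000):
        print("Iteration", i)
        transfer = imagetransfer.create_transfer(
            connection, disk, types.ImageTransferDirection.DOWNLOAD)
        print("Created transfer", transfer.id)
        imagetransfer.finalize_transfer(connection, transfer, disk)
        print("Finalized transfer", transfer.id)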
After trying to run the script, it failed on iteration 311, but the issue wasn't reproduced.
It failed with:
"Cannot transfer Virtual Disk: Disk is locked. Please try again later."

Nonetheless, a 10-second wait was added between the transfers, but apparently it didn't help.
Attaching the full script. The relevant lines in the script are:

with closing(connection):
    for i in range(1000):
        print("Iteration %", i)
        print("Creating transfer...")
        transfer = imagetransfer.create_transfer(connection, disk, types.ImageTransferDirection.DOWNLOAD)
        print("Transfer ID: %s" % transfer.id)
        print("Transfer host name: %s" % transfer.host.name)
        print("Finalizing image transfer...")
        imagetransfer.finalize_transfer(connection, transfer, disk)
        print("Transfer finalized. sleeping 10 sec")
        time.sleep(10)
        print("**********************************")

Iteration % 311
Creating transfer...
Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/bz_1854038.py", line 35, in <module>
    transfer = imagetransfer.create_transfer(connection, disk, types.ImageTransferDirection.DOWNLOAD)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 162, in create_transfer
    format=types.DiskFormat.RAW,
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 14009, in add
    return self._internal_add(image_transfer, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 232, in _internal_add
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 229, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is 409.

Engine:

2020-07-07 13:54:19,538+03 WARN [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-13) [fd790e43-b2af-4990-a908-7342e231b2a8] Validation of action 'TransferDiskImage' failed for user admin@internal-authz. Reasons: VAR__TYPE__DISK,VAR__ACTION__TRANSFER,ACTION_TYPE_FAILED_DISK_IS_LOCKED
2020-07-07 13:54:19,539+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-13) [] Operation Failed: [Cannot transfer Virtual Disk: Disk is locked. Please try again later.]
2020-07-07 13:54:19,583+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [5cdc9ffe-720a-4f4f-947f-df50bc95d6e3] Updating image transfer 16a5c8f7-f84e-4c27-ac82-5d528a8c20a0 (image 94139efa-8967-4738-b79e-3bd492919a8f) phase to Finished Success
2020-07-07 13:54:19,589+03 INFO [org.ovirt.engine.core.sso.servlets.OAuthRevokeServlet] (default task-16) [7f5f64d5] User admin@internal successfully logged out
2020-07-07 13:54:19,596+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [5cdc9ffe-720a-4f4f-947f-df50bc95d6e3] Successfully transferred disk '00000000-0000-0000-0000-000000000000' (command id '16a5c8f7-f84e-4c27-ac82-5d528a8c20a0')
Created attachment 1700149 [details] reproduction script
(In reply to Ilan Zuckerman from comment #3)
> After trying to run the script, it failed on iteration 311, but the issue
> wasn't reproduced.
> It failed with:
> "Cannot transfer Virtual Disk: Disk is locked. Please try again later."

This is the same issue again: finalize finished and we waited 10 seconds, but the disk is still locked in the engine.

You can try to use a 20-second sleep, and hopefully we will reproduce it.

Also, I would add a short sleep after create_transfer() and before finalize_transfer(), since a backup will never take zero time. Let's use 0.5 seconds, like we had in the failing test.

Once we fix imagetransfer.finalize_transfer() to wait properly until the transfer is really completed, it will be easier to test this issue.
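Applied to the reproducer loop from comment #3, the suggested timing would look roughly like this. A sketch only: it assumes the same connection, disk, and imagetransfer helper as the earlier excerpt, and uses the 0.5-second and 20-second values proposed above.

import time
from contextlib import closing

import ovirtsdk4.types as types

from helpers import imagetransfer

# `connection` and `disk` are created as in the earlier setup sketch.
with closing(connection):
    for i in range(1000):
        print("Iteration", i)
        transfer = imagetransfer.create_transfer(
            connection, disk, types.ImageTransferDirection.DOWNLOAD)
        print("Transfer ID: %s" % transfer.id)

        # A real transfer never takes zero time, so simulate a short one
        # before finalizing (0.5 seconds, as in the failing test).
        time.sleep(0.5)

        imagetransfer.finalize_transfer(connection, transfer, disk)
        print("Transfer finalized, sleeping 20 sec")

        # Give the engine time to release the disk lock before the next iteration.
        time.sleep(20)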
Nir, unfortunately the same kind of problematic behavior wasn't reproduced with the help of my script after 1k iterations [1] (as you suggested). Any other advice on what we can do here?

Attaching the updated script here (20 sec wait + 0.5 sec in between), just as you said.

[1]:
Iteration % 999
Creating transfer...
Transfer ID: 2eeff75d-199b-436a-b27a-b590fca9ca8c
Transfer host name: host_mixed_2
Finalizing image transfer...
Transfer finalized. sleeping 20 sec
**********************************
(In reply to Ilan Zuckerman from comment #6)
> Nir, unfortunately the same kind of problematic behavior wasn't reproduced
> with the help of my script after 1k iterations [1] (as you suggested).
> Any other advice on what we can do here?
>
> Attaching the updated script here (20 sec wait + 0.5 sec in between), just
> as you said.
>
> [1]:
> Iteration % 999
> Creating transfer...
> Transfer ID: 2eeff75d-199b-436a-b27a-b590fca9ca8c
> Transfer host name: host_mixed_2
> Finalizing image transfer...
> Transfer finalized. sleeping 20 sec
> **********************************

Thanks, this is very good info even if we could not reproduce. Now we know that adding a 20-second sleep works as a workaround :-)

I guess that something is missing in the flow - maybe we need to transfer some data to trigger this bug. The engine monitors the activity on the ticket, and the ticket status may affect the logic used to clean up the transfer.

But I think you have invested enough, and now we need to wait for the engine folks to debug this.
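If someone wants to extend the reproducer in that direction, a sketch that actually moves some data through the ticket before finalizing might look like this. It assumes the ovirt-imageio client package (ovirt_imageio.client) is installed on the client machine; the CA file and output paths are placeholders, and the exact download() keyword arguments may differ between imageio versions.

import time
from contextlib import closing

import ovirtsdk4.types as types

from ovirt_imageio import client  # provided by the ovirt-imageio-client package
from helpers import imagetransfer

out_path = "/var/tmp/bz1854038-probe.qcow2"  # throwaway download target

# `connection` and `disk` are created as in the earlier setup sketch.
with closing(connection):
    for i in range(1000):
        transfer = imagetransfer.create_transfer(
            connection, disk, types.ImageTransferDirection.DOWNLOAD)
        print("Iteration %d, transfer %s" % (i, transfer.id))

        # Move real data through the ticket so the engine sees activity
        # on it before we ask it to finalize.
        client.download(transfer.transfer_url, out_path, "ca.pem", fmt="qcow2")

        imagetransfer.finalize_transfer(connection, transfer, disk)

        # Keep the 20-second workaround interval between iterations.
        time.sleep(20)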
Created attachment 1700922 [details]
Reproduction logs when uploading disk

Just reproduced this issue when uploading a disk. There was nothing special in the upload and it completed successfully, but then the finalizing step got stuck. I interrupted this finalize after 30 seconds.

$ ./upload_disk.py --engine-url https://engine3 --username admin@internal --password-file engine3-password --cafile engine3.pem --sd-name nfs1 --disk-format raw /var/tmp/fedora-32.raw
Checking image...
Image format: raw
Disk format: raw
Disk content type: data
Disk provisioned size: 6442450944
Disk initial size: 6442450944
Disk name: fedora-32.raw
Disk backup: False
Connecting...
Creating disk...
Disk ID: e47fc0ab-01a2-48de-8ea7-9ff1af223b50
Creating image transfer...
Transfer ID: d8d1dfd7-351c-48e2-a18e-764d4f1e1f40
Transfer host name: host4
Uploading image...
[ 100.00% ] 6.00 GiB, 4.01 seconds, 1.50 GiB/s
Finalizing image transfer...
^CTraceback (most recent call last):
  File "./upload_disk.py", line 351, in <module>
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/home/nsoffer/src/ovirt-engine-sdk/sdk/examples/helpers/imagetransfer.py", line 275, in finalize_transfer
    time.sleep(1)
KeyboardInterrupt
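For context, finalize_transfer() in the example helpers is essentially a polling loop over the transfer phase, which is where the traceback above shows it sleeping. Below is a simplified, stand-alone sketch of that kind of wait using only public SDK calls; the 60-second timeout and the exact set of phases handled are assumptions for illustration, not a copy of the helper.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types


def wait_for_transfer_finish(connection, transfer_id, timeout=60):
    """Finalize an image transfer and poll until the engine finishes it."""
    transfer_service = (connection.system_service()
                        .image_transfers_service()
                        .image_transfer_service(transfer_id))
    transfer_service.finalize()

    deadline = time.monotonic() + timeout
    while True:
        try:
            transfer = transfer_service.get()
        except sdk.NotFoundError:
            # Some engine versions remove the transfer entity when it is done.
            return
        if transfer.phase == types.ImageTransferPhase.FINISHED_SUCCESS:
            return
        if transfer.phase in (types.ImageTransferPhase.FINISHED_FAILURE,
                              types.ImageTransferPhase.CANCELLED):
            raise RuntimeError(
                "Transfer {} failed in phase {}".format(transfer_id, transfer.phase))
        if time.monotonic() > deadline:
            raise RuntimeError(
                "Timed out waiting for transfer {} to finalize (last phase: {})"
                .format(transfer_id, transfer.phase))
        time.sleep(1)

In this bug the engine moves the transfer to "Cancelled" or "Paused by System" instead of finishing it, so a loop like this keeps polling until the timeout fires.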
Bumping priority and severity. Engine APIs cannot fail randomly.
This issue was reproduced again during a scheduled automation run on a PPC env with NFS storage.

From the automation log: [1]
From the engine log: [2]

'transfer_id': 'f26e8661-05f2-4c32-a5b7-18884588f7b2'

Attaching the daemon log from VDSM + the engine daemon.log.

rhv-4.4.3-13

[1]:
2020-11-12 00:12:18,485 - MainThread - RemoteExecutor - DEBUG - [root.214.116/qum5net] Results of command: python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py -c engine /root/backup/51c21155-b4d0-47cb-b6e8-a46a87e15818.202011120009.full.qcow2 --disk-sparse --sd-name nfs_0
2020-11-12 00:12:18,485 - MainThread - RemoteExecutor - DEBUG - [root.214.116/qum5net] OUT: [ 0.0 ] Checking image...
[ 0.0 ] Image format: qcow2
[ 0.0 ] Disk format: cow
[ 0.0 ] Disk content type: data
[ 0.0 ] Disk provisioned size: 1073741824
[ 0.0 ] Disk initial size: 1638400
[ 0.0 ] Disk name: 51c21155-b4d0-47cb-b6e8-a46a87e15818.202011120009.full.qcow2
[ 0.0 ] Disk backup: False
[ 0.0 ] Connecting...
[ 0.0 ] Creating disk...
[ 6.5 ] Disk ID: 52ac0207-c84e-421c-851b-cd426ebb187c
[ 6.5 ] Creating image transfer...
[ 10.3 ] Transfer ID: f26e8661-05f2-4c32-a5b7-18884588f7b2
[ 10.3 ] Transfer host name: host_mixed_2
[ 10.3 ] Uploading image...
[ ------- ] 0 bytes, 0.00 seconds, 0 bytes/s
[ 100.00% ] 1.00 GiB, 0.08 seconds, 12.89 GiB/s
[ 10.4 ] Finalizing image transfer...
2020-11-12 00:12:18,486 - MainThread - RemoteExecutor - DEBUG - [root.214.116/qum5net] ERR: Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py", line 276, in <module>
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 327, in finalize_transfer
    .format(transfer.id))
RuntimeError: Timed out waiting for transfer f26e8661-05f2-4c32-a5b7-18884588f7b2 to finalize

===============================================================================================

[2]:
2020-11-12 00:12:21,593+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [3c74ce45-f11b-460c-8bdd-5813998b8f36] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='f8307695-11d0-419d-a774-6ab6ce02df88', ticketId='4aaf638a-1007-4b92-880d-4f5eb5d77af0', timeout='null'}), log id: 6079253b
2020-11-12 00:12:21,761+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [3c74ce45-f11b-460c-8bdd-5813998b8f36] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@2fbf289a, log id: 6079253b
2020-11-12 00:12:21,780+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [3c74ce45-f11b-460c-8bdd-5813998b8f36] EVENT_ID: UPLOAD_IMAGE_PAUSED_BY_SYSTEM_TIMEOUT(1,071), Upload was paused by system. Reason: timeout due to transfer inactivity.
2020-11-12 00:12:21,781+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [3c74ce45-f11b-460c-8bdd-5813998b8f36] Updating image transfer f26e8661-05f2-4c32-a5b7-18884588f7b2 (image 52ac0207-c84e-421c-851b-cd426ebb187c) phase to Paused by System

RHV-4.4-tier1-PPC #27
Created attachment 1728689 [details] 3rd reproduction. engine + vdsm daemon logs
Created attachment 1728690 [details] 3rd reproduction vdsm + engine logs
Reproduced again during automation run. Attaching vdsm daemon log, engine daemon log, engine log, vdsm log Upload disk attempt: 2020-12-21 23:39:21,929 - MainThread - RemoteExecutor - DEBUG - [root.16.38/qum5net] Results of command: python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py -c engine /root/backup/208ebc8d-7eb1-4686-afff-48327ea6c210.202012212337.incremental.qcow2 --disk-sparse --sd-name fcp_2 2020-12-21 23:39:21,929 - MainThread - RemoteExecutor - DEBUG - [root.16.38/qum5net] OUT: [ 0.0 ] Checking image... [ 0.0 ] Image format: qcow2 [ 0.0 ] Disk format: cow [ 0.0 ] Disk content type: data [ 0.0 ] Disk provisioned size: 1073741824 [ 0.0 ] Disk initial size: 1572864 [ 0.0 ] Disk name: 208ebc8d-7eb1-4686-afff-48327ea6c210.202012212337.incremental.qcow2 [ 0.0 ] Disk backup: False [ 0.0 ] Connecting... [ 0.0 ] Creating disk... [ 29.4 ] Disk ID: 1ace3648-bee5-44b2-85f9-d1d38245d0b5 [ 29.4 ] Creating image transfer... [ 32.9 ] Transfer ID: c3c01c89-d7f7-4af4-8d91-f466c10f21e5 [ 32.9 ] Transfer host name: host_mixed_2 [ 32.9 ] Uploading image... [ ------- ] 0 bytes, 0.00 seconds, 0 bytes/s [ 100.00% ] 1.00 GiB, 0.09 seconds, 11.57 GiB/s [ 33.0 ] Finalizing image transfer... 2020-12-21 23:39:21,929 - MainThread - RemoteExecutor - DEBUG - [root.16.38/qum5net] ERR: Traceback (most recent call last): File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py", line 276, in <module> imagetransfer.finalize_transfer(connection, transfer, disk) File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 327, in finalize_transfer .format(transfer.id)) RuntimeError: Timed out waiting for transfer c3c01c89-d7f7-4af4-8d91-f466c10f21e5 to finalize Engine: 2020-12-21 23:39:57,696+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] Renewing transfer ticket for Upload disk '208ebc8d-7eb1-4686-afff-48327ea6c210.202012212337.incremental.qcow2' (disk id: '1ace3648-bee5-44b2-85f9-d1d38245d0b5', image id: '190c8cbb-19d7-4e64-b63c-2c64cf41d4f5') 2020-12-21 23:39:57,706+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] START, ExtendImageTicketVDSCommand(HostName = host_mixed_2, ExtendImageTicketVDSCommandParameters:{hostId='41e89cd1-d19c-4209-8074-411d1969e328', ticketId='36f18989-c9a2-4c57-9af1-da2600f8135b', timeout='300'}), log id: 4b6573a0 2020-12-21 23:39:57,738+02 INFO [org.ovirt.engine.extension.aaa.jdbc.core.Tasks] (default task-1) [] (house keeping) deleting failed logins prior to 2020-12-14 21:39:57Z. 
2020-12-21 23:39:57,783+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] FINISH, ExtendImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 4b6573a0 2020-12-21 23:39:57,783+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] Transfer session with ticket id 36f18989-c9a2-4c57-9af1-da2600f8135b extended, timeout 300 seconds 2020-12-21 23:39:57,800+02 INFO [org.ovirt.engine.core.sso.utils.AuthenticationUtils] (default task-1) [] User admin@internal-authz with profile [internal] successfully logged in with scopes: ovirt-app-api ovirt-ext=token-info:authz-search ovirt-ext=token-info:public-authz-search ovirt-ext=token-info:validate ovirt-ext=token:password-access 2020-12-21 23:39:57,931+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='41e89cd1-d19c-4209-8074-411d1969e328', ticketId='36f18989-c9a2-4c57-9af1-da2600f8135b', timeout='null'}), log id: 5e24d145 2020-12-21 23:39:57,948+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@7982fbe5, log id: 5e24d145 2020-12-21 23:39:57,960+02 INFO [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-2) [5feb4053] Running command: CreateUserSessionCommand internal: false. 2020-12-21 23:39:58,057+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] EVENT_ID: UPLOAD_IMAGE_PAUSED_BY_SYSTEM_TIMEOUT(1,071), Upload was paused by system. Reason: timeout due to transfer inactivity. 2020-12-21 23:39:58,064+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] Updating image transfer c3c01c89-d7f7-4af4-8d91-f466c10f21e5 (image 1ace3648-bee5-44b2-85f9-d1d38245d0b5) phase to Paused by System 2020-12-21 23:39:58,081+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-2) [5feb4053] EVENT_ID: USER_VDC_LOGIN(30), User admin@internal-authz connecting from '10.46.16.101' using session 'F4fvKEGVfVOHJ4/oEuKXwTK/rv9Akcx8gDq5GSbH4N9GI+C4hu0hPUpv+vm7om2Nvxk293rnJxXXDvXEnTgR2w==' logged in. 2020-12-21 23:39:58,304+02 INFO [org.ovirt.engine.core.bll.aaa.LogoutSessionCommand] (default task-2) [66c17bc6] Running command: LogoutSessionCommand internal: false. 2020-12-21 23:39:58,337+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-2) [66c17bc6] EVENT_ID: USER_VDC_LOGOUT(31), User admin@internal-authz connected from '10.46.16.101' using session 'F4fvKEGVfVOHJ4/oEuKXwTK/rv9Akcx8gDq5GSbH4N9GI+C4hu0hPUpv+vm7om2Nvxk293rnJxXXDvXEnTgR2w==' logged out. 
2020-12-21 23:39:58,353+02 INFO [org.ovirt.engine.core.bll.hostdev.RefreshHostDevicesCommand] (EE-ManagedThreadFactory-engine-Thread-5) [55a60555] Running command: RefreshHostDevicesCommand internal: true. Entities affected : ID: 01d89ff1-0855-47ba-8cf6-0c1eb31226bf Type: VDSAction group MANIPULATE_HOST with role type ADMIN 2020-12-21 23:39:58,359+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to lynx22.lab.eng.tlv2.redhat.com/10.46.16.37 2020-12-21 23:39:59,288+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [] EVENT_ID: NO_FAULTY_MULTIPATHS_ON_HOST(11,501), No faulty multipath paths on host host_mixed_1 2020-12-21 23:39:59,326+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-56) [] EVENT_ID: NO_FAULTY_MULTIPATHS_ON_HOST(11,501), No faulty multipath paths on host host_mixed_2 2020-12-21 23:39:59,331+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: NO_FAULTY_MULTIPATHS_ON_HOST(11,501), No faulty multipath paths on host host_mixed_3 2020-12-21 23:40:00,177+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-21) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] Transfer was paused by system. Upload disk '208ebc8d-7eb1-4686-afff-48327ea6c210.202012212337.incremental.qcow2' (disk id: '1ace3648-bee5-44b2-85f9-d1d38245d0b5', image id: '190c8cbb-19d7-4e64-b63c-2c64cf41d4f5') VDSM daemon log: 2020-12-21 23:38:20,566 INFO (Thread-12) [tickets] [local] ADD ticket={'dirty': False, 'ops': ['write'], 'size': 1073741824, 'spars e': False, 'transfer_id': 'c3c01c89-d7f7-4af4-8d91-f466c10f21e5', 'uuid': '36f18989-c9a2-4c57-9af1-da2600f8135b', 'timeout': 300, 'url ': 'nbd:unix:/run/vdsm/nbd/36f18989-c9a2-4c57-9af1-da2600f8135b.sock'} . . . 2020-12-21 23:38:21,743 INFO (Thread-14) [http] CLOSE connection=14 client=local [connection 1 ops, 0.069095 s] [dispatch 7 ops, 0. 021079 s] [write 2 ops, 0.013585 s, 704.00 KiB, 50.61 MiB/s] [write.read 2 ops, 0.008867 s, 704.00 KiB, 77.53 MiB/s] [write.write 2 op s, 0.004582 s, 704.00 KiB, 150.03 MiB/s] [zero 4 ops, 0.004011 s, 509.88 MiB, 124.14 GiB/s] [zero.zero 4 ops, 0.003920 s, 509.88 MiB, 127.03 GiB/s] [flush 1 ops, 0.000657 s] 2020-12-21 23:38:21,743 INFO (Thread-15) [backends.nbd] Close backend address='/run/vdsm/nbd/36f18989-c9a2-4c57-9af1-da2600f8135b.s ock' . . . 
2020-12-21 23:39:57,739 INFO (Thread-25) [tickets] [local] EXTEND timeout=300 ticket=36f18989-c9a2-4c57-9af1-da2600f8135b 2020-12-21 23:39:57,740 INFO (Thread-25) [http] CLOSE connection=25 client=local [connection 1 ops, 0.000861 s] [dispatch 1 ops, 0.000303 s] Engine daemon log: 2020-12-21 23:38:20,576 INFO (Thread-12) [tickets] [127.0.0.1] ADD ticket={'dirty': False, 'ops': ['write'], 'size': 1073741824, 's parse': False, 'transfer_id': 'c3c01c89-d7f7-4af4-8d91-f466c10f21e5', 'uuid': '36f18989-c9a2-4c57-9af1-da2600f8135b', 'timeout': 300, 'url': 'https://lynx23.lab.eng.tlv2.redhat.com:54322/images/36f18989-c9a2-4c57-9af1-da2600f8135b'} 2020-12-21 23:38:20,577 INFO (Thread-12) [http] CLOSE connection=12 client=127.0.0.1 [connection 1 ops, 0.002462 s] [dispatch 1 ops , 0.001231 s] 2020-12-21 23:39:57,800 INFO (Thread-13) [http] OPEN connection=13 client=127.0.0.1 2020-12-21 23:39:57,805 INFO (Thread-13) [tickets] [127.0.0.1] EXTEND timeout=300 ticket=36f18989-c9a2-4c57-9af1-da2600f8135b 2020-12-21 23:39:57,812 INFO (Thread-13) [http] CLOSE connection=13 client=127.0.0.1 [connection 1 ops, 0.010146 s] [dispatch 1 ops, 0.001279 s]
Created attachment 1741488 [details] 4th time reproduction
Ilan, what kind of info do you need? We know that this issue is not fixed yet.
After looking at the logs, it seems that something is not working correctly before the finalizing stage. From the vdsm and engine logs alone it's a bit hard to understand what exactly happened, and I didn't manage to reproduce the issue on my env; it seems to fail pretty randomly.

If you have the imageio logs as well, can you please attach them? If not, will you be able to collect them the next time you face this issue?
Hi Bella, please find the imageio (daemon) logs in the following attached archives:
- "4th time reproduction"
- "3rd reproduction. engine + vdsm daemon logs"
A similar issue was found by virt-v2v, but I can't find any records about the transfer id '89b947d5-7f90-434f-9914-ab995a4def07' in engine.log. Anyway, I attached the logs from when the issue happened.

2021-08-06 16:27:43,295 process L0416 DEBUG| [stdout] [ 332.1] Creating output metadata
2021-08-06 16:32:44,202 process L0416 DEBUG| [stderr] Traceback (most recent call last):
2021-08-06 16:32:44,203 process L0416 DEBUG| [stderr] File "/tmp/v2v.aKUb7Q/rhv-upload-finalize.py", line 172, in <module>
2021-08-06 16:32:44,203 process L0416 DEBUG| [stderr] finalize_transfer(connection, transfer_id, disk_id)
2021-08-06 16:32:44,203 process L0416 DEBUG| [stderr] File "/tmp/v2v.aKUb7Q/rhv-upload-finalize.py", line 130, in finalize_transfer
2021-08-06 16:32:44,203 process L0416 DEBUG| [stderr] raise RuntimeError(
2021-08-06 16:32:44,203 process L0416 DEBUG| [stderr] RuntimeError: timed out waiting for transfer 89b947d5-7f90-434f-9914-ab995a4def07 to finalize, transfer is transferring
2021-08-06 16:32:44,213 process L0416 DEBUG| [stderr] virt-v2v: error: failed to finalize the transfers, see earlier errors
Reproducing this bug is not easy; the verification I used for the patch is:
1. Start a small disk upload
2. Set a breakpoint at TransferImageStatusCommand#execute[1]
3. Once another thread reaches the same breakpoint with phase=null, switch to it, and let it finish
4. Continue the thread from step 2
(In reply to Benny Zlotnik from comment #22)
> Reproducing this bug is not easy; the verification I used for the patch is:
> 1. Start a small disk upload
> 2. Set a breakpoint at TransferImageStatusCommand#execute[1]
> 3. Once another thread reaches the same breakpoint with phase=null, switch
> to it, and let it finish
> 4. Continue the thread from step 2

As there is no clear verification flow, this bug was verified based on our automation on ovirt-engine-4.4.10.5:
- All Tier1 runs - passed successfully.
- Ran a few automation TCs (the ones where we occasionally saw this failure) in a loop overnight - passed successfully.
- Ran an SDK script 'download_disk_skip.py' (created by Nir) in a loop on 4 different disks, 1000 iterations each - passed successfully.

Moving to 'Verified'.