Bug 1854038 - Download or upload disk (SDK) fails due to 'Timed out waiting for transfer XXX to finalize'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: future
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.10-1
Target Release: 4.4.10.5
Assignee: Benny Zlotnik
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-06 07:22 UTC by Ilan Zuckerman
Modified: 2022-02-04 10:32 UTC
11 users

Fixed In Version: ovirt-engine-4.4.10.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-04 10:32:02 UTC
oVirt Team: Storage
pm-rhel: ovirt-4.4+
pm-rhel: exception+


Attachments
engine and vdsm logs (871.07 KB, application/zip), 2020-07-06 07:22 UTC, Ilan Zuckerman
reproduction script (1.37 KB, text/plain), 2020-07-07 12:42 UTC, Ilan Zuckerman
Reproduction logs when uploading disk (25.98 KB, application/gzip), 2020-07-14 00:16 UTC, Nir Soffer
3rd reproduction. engine + vdsm daemon logs (14.90 KB, application/zip), 2020-11-12 10:31 UTC, Ilan Zuckerman
3rd reproduction vdsm + engine logs (711.71 KB, application/zip), 2020-11-12 10:32 UTC, Ilan Zuckerman
4th time reproduction (418.14 KB, application/zip), 2020-12-23 07:42 UTC, Ilan Zuckerman


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 116698 0 None MERGED core: reduce transaction scope for TransferImageStatus 2021-09-19 08:14:05 UTC
oVirt gerrit 118307 0 ovirt-engine-4.4 MERGED core: reduce transaction scope for TransferImageStatus 2022-01-19 18:48:45 UTC

Description Ilan Zuckerman 2020-07-06 07:22:13 UTC
Created attachment 1699990 [details]
engine and vdsm logs

Description of problem:

When using the SDK to download a qcow2 disk, the transfer times out on finalize, with the following error from the SDK [1], from the engine [2], and from the vdsm host [3].

I doubt this is a regression bug, for the following reasons:
- I wasn't able to reproduce it (I tried more than once)
- A very similar test (involving disk download) succeeded just before this one.

Having said that, this is still a bug and needs to be addressed.
Unfortunately, the daemon.log files could not be retrieved from the hosts and the engine.

There was very similar bug for downloading from UI:
https://bugzilla.redhat.com/show_bug.cgi?id=1835601


[1]:
2020-07-05 10:37:18,495 - MainThread - RemoteExecutor - DEBUG - [root@10.46.16.37/qum5net] Results of command: python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py download fac3911d-95b6-41be-a538-dad345691502 --password-file /root/pass.txt --engine-url https://hosted-engine-06.lab.eng.tlv2.redhat.com --username admin@internal --cafile /root/ca.crt --backup-dir /root/full_backup/ --backup-uuid 44ae937c-8c30-41a1-bb68-d4daf3177d7e
2020-07-05 10:37:18,496 - MainThread - RemoteExecutor - DEBUG - [root@10.46.16.37/qum5net]   OUT: [   0.0 ] Downloading VM fac3911d-95b6-41be-a538-dad345691502 disks
[   0.2 ] Creating image transfer for disk 2d075957-7481-4af3-9920-e63d7759af83
[   2.9 ] Image transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 is ready
[ ------- ] 0 bytes, 0.00 seconds, 0 bytes/s                                   
Formatting '/root/full_backup/2d075957-7481-4af3-9920-e63d7759af83.202007051036.full.qcow2', fmt=qcow2 size=1073741824 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[  99.89% ] 1022.88 MiB, 0.22 seconds, 4.56 GiB/s                              
[  99.92% ] 1023.19 MiB, 0.33 seconds, 3.00 GiB/s                              
[  99.99% ] 1023.88 MiB, 0.44 seconds, 2.29 GiB/s                              
[ 100.00% ] 1.00 GiB, 0.46 seconds, 2.17 GiB/s                                 
[   3.3 ] Finalizing image transfer

2020-07-05 10:37:18,496 - MainThread - RemoteExecutor - DEBUG - [root@10.46.16.37/qum5net]   ERR: Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 476, in <module>
    main()
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 173, in main
    args.command(args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 244, in cmd_download
    connection, args.backup_uuid, args, incremental=args.incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 385, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 430, in download_disk
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 296, in finalize_transfer
    .format(transfer.id))
RuntimeError: Timed out waiting for transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 to finalize

---------------------------------------------------------------------------------------------

[2]:
2020-07-05 10:37:20,801+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] START, GetImageTicketVDSCommand(HostName = host_mixed_1, GetImageTicketVDSCommandParameters:{hostId='9fb37721-708b-4327-9148-d3f302e51436', ticketId='0287b970-6b73-4b9e-b683-47c1f44dbe86', timeout='null'}), log id: 382fd9d6
2020-07-05 10:37:20,807+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@dc80e36, log id: 382fd9d6
2020-07-05 10:37:20,825+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.
2020-07-05 10:37:20,826+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] Updating image transfer 391d9654-0e59-4e24-9524-e59a37cb8b53 (image 2d075957-7481-4af3-9920-e63d7759af83) phase to Cancelled

---------------------------------------------------------------------------------------------

[3]:
2020-07-05 10:37:19,686+0300 INFO  (jsonrpc/7) [api.virt] FINISH stop_backup return={'status': {'code': 0, 'message': 'Done'}} from=::ffff:10.46.16.252,60486, flow_id=backups_syncAction_9850ddba-f8b8-4b8, vmId=fac3911d-95b6-41be-a538-dad345691502 (api:54)
2020-07-05 10:37:19,686+0300 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.stop_backup succeeded in 0.01 seconds (__init__:312)
2020-07-05 10:37:19,694+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive sda was canceled (vm:5306)
2020-07-05 10:37:19,710+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive vdb was canceled (vm:5306)
2020-07-05 10:37:19,732+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive sdb was canceled (vm:5306)
2020-07-05 10:37:19,783+0300 ERROR (libvirt/events) [virt.vm] (vmId='fac3911d-95b6-41be-a538-dad345691502') Block job (untracked) type BACKUP for drive vdc was canceled (vm:5306)


Version-Release number of selected component (if applicable):
rhv-4.4.1-10

How reproducible:
Wasn't able to reproduce; it happened during an automation run.

Steps to Reproduce:
    1. Create a vm from template
    2. Attach permutation disks (with fs + data)
    3. Perform full backup on all disks
    4. Download all of the backed-up disks

Actual results:
Timed out waiting for transfer XXX to finalize

Expected results:
Shouldn't time out

Comment 1 Nir Soffer 2020-07-06 11:38:56 UTC
(In reply to Ilan Zuckerman from comment #0)
> Created attachment 1699990 [details]
> engine and vdsm logs
> 
> Description of problem:
> 
> When using SDK for downloading qcow2 disk, the transfer times out to
> finalize transfer 

Looks like a bug; finalize should not time out under normal conditions.

> There was very similar bug for downloading from UI:
> https://bugzilla.redhat.com/show_bug.cgi?id=1835601

That bug is closed verified.

> [1]:
> 2020-07-05 10:37:18,495 - MainThread - RemoteExecutor - DEBUG -
> [root@10.46.16.37/qum5net] Results of command: python3
> /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py download
> fac3911d-95b6-41be-a538-dad345691502 --password-file /root/pass.txt
> --engine-url https://hosted-engine-06.lab.eng.tlv2.redhat.com --username
> admin@internal --cafile /root/ca.crt --backup-dir /root/full_backup/
> --backup-uuid 44ae937c-8c30-41a1-bb68-d4daf3177d7e
> 2020-07-05 10:37:18,496 - MainThread - RemoteExecutor - DEBUG -
> [root@10.46.16.37/qum5net]   OUT: [   0.0 ] Downloading VM
> fac3911d-95b6-41be-a538-dad345691502 disks
...
> [ 100.00% ] 1.00 GiB, 0.46 seconds, 2.17 GiB/s

Transfer was successful...
                   
> [   3.3 ] Finalizing image transfer
...
> "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py",
> line 296, in finalize_transfer
>     .format(transfer.id))
> RuntimeError: Timed out waiting for transfer
> 391d9654-0e59-4e24-9524-e59a37cb8b53 to finalize

This means the engine could not finalize the transfer in 60 seconds, which is a lot
of time to remove the ticket and tear down the image.

> [2]:
> 2020-07-05 10:37:20,801+03 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48)
> [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] START,
> GetImageTicketVDSCommand(HostName = host_mixed_1,
> GetImageTicketVDSCommandParameters:{hostId='9fb37721-708b-4327-9148-
> d3f302e51436', ticketId='0287b970-6b73-4b9e-b683-47c1f44dbe86',
> timeout='null'}), log id: 382fd9d6
> 2020-07-05 10:37:20,807+03 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48)
> [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] FINISH, GetImageTicketVDSCommand,
> return:
> org.ovirt.engine.core.common.businessentities.storage.
> ImageTicketInformation@dc80e36, log id: 382fd9d6
> 2020-07-05 10:37:20,825+03 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48)
> [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] EVENT_ID:
> DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system.
> Reason: timeout due to transfer inactivity.

Looks like the engine did not receive the finalize request, or there is some race
that caused it to ignore the finalize request.


> 2020-07-05 10:37:20,826+03 INFO 
> [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48)
> [b434516f-9c7f-46fc-9f0c-50ee4ffc3775] Updating image transfer
> 391d9654-0e59-4e24-9524-e59a37cb8b53 (image
> 2d075957-7481-4af3-9920-e63d7759af83) phase to Cancelled

This is expected if a user did not finalize the transfer in time, which
is not the case here.

I did not look in the vdsm log, but the engine logs suggest that this is an engine bug
in the handling of the finalize call.

This may be a race when calling finalize very quickly after starting the transfer.
In this case the transfer was completed in 0.46 seconds:

    > [ 100.00% ] 1.00 GiB, 0.46 seconds, 2.17 GiB/s

So maybe engine was not ready for the finalize call.

To reproduce this issue we can write a small script doing:

    for i in range(1000):
        transfer = imagetransfer.create_transfer(...)
        imagetransfer.finalize_transfer(...)

It should work without any error, but I think this will quickly trigger this
bug.

Ilan, do you want to try to create such a reproducer?
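A runnable shape for such a reproducer, with the SDK calls injected as callables so the loop itself can be exercised without a live engine (create/finalize would wrap imagetransfer.create_transfer and imagetransfer.finalize_transfer from the SDK example helpers; the iteration count and delay are arbitrary):

```python
import time

def stress_transfers(create, finalize, iterations=1000, delay=0.5):
    """Repeatedly create and almost immediately finalize an image transfer.

    create: zero-argument callable returning a transfer object, e.g.
            lambda: imagetransfer.create_transfer(connection, disk, direction)
    finalize: callable taking that transfer, e.g.
            lambda t: imagetransfer.finalize_transfer(connection, t, disk)
    A short delay between the calls emulates a minimal real transfer.
    """
    for i in range(iterations):
        transfer = create()
        time.sleep(delay)
        finalize(transfer)
        print("Iteration %d finished" % i)
```

If the race exists, some iteration should hang or time out in finalize.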

Comment 3 Ilan Zuckerman 2020-07-07 12:41:10 UTC
After trying to run the script, it failed on iteration 311, but the issue wasn't reproduced.
It failed with:
"Cannot transfer Virtual Disk: Disk is locked. Please try again"

Nonetheless, a 10-second wait was added between the transfers, but apparently it didn't help.

Attaching the full script.

The relevant lines in the script are:

with closing(connection):
    for i in range(1000):
        print("Iteration %", i)
        print("Creating transfer...")
        transfer = imagetransfer.create_transfer(connection, disk, types.ImageTransferDirection.DOWNLOAD)
        print("Transfer ID: %s" % transfer.id)
        print("Transfer host name: %s" % transfer.host.name)
        print("Finalizing image transfer...")
        imagetransfer.finalize_transfer(connection, transfer, disk)
        print("Transfer finalized. sleeping 10 sec")
        time.sleep(10)
        print("**********************************")


Iteration % 311
Creating transfer...
Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/bz_1854038.py", line 35, in <module>
    transfer = imagetransfer.create_transfer(connection, disk, types.ImageTransferDirection.DOWNLOAD)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 162, in create_transfer
    format=types.DiskFormat.RAW,
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 14009, in add
    return self._internal_add(image_transfer, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 232, in _internal_add
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 229, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is 409.


Engine:

2020-07-07 13:54:19,538+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-13) [fd790e43-b2af-4990-a908-7342e231b2a8] Validation of action 'TransferDiskImage' failed for user admin@internal-authz. Reasons: VAR__TYPE__DISK,VAR__ACTION__TRANSFER,ACTION_TYPE_FAILED_DISK_IS_LOCKED
2020-07-07 13:54:19,539+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-13) [] Operation Failed: [Cannot transfer Virtual Disk: Disk is locked. Please try again later.]
2020-07-07 13:54:19,583+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [5cdc9ffe-720a-4f4f-947f-df50bc95d6e3] Updating image transfer 16a5c8f7-f84e-4c27-ac82-5d528a8c20a0 (image 94139efa-8967-4738-b79e-3bd492919a8f) phase to Finished Success
2020-07-07 13:54:19,589+03 INFO  [org.ovirt.engine.core.sso.servlets.OAuthRevokeServlet] (default task-16) [7f5f64d5] User admin@internal successfully logged out
2020-07-07 13:54:19,596+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [5cdc9ffe-720a-4f4f-947f-df50bc95d6e3] Successfully transferred disk '00000000-0000-0000-0000-000000000000' (command id '16a5c8f7-f84e-4c27-ac82-5d528a8c20a0')

Comment 4 Ilan Zuckerman 2020-07-07 12:42:09 UTC
Created attachment 1700149 [details]
reproduction script

Comment 5 Nir Soffer 2020-07-07 12:57:19 UTC
(In reply to Ilan Zuckerman from comment #3)
> After trying to run the script, it failed on iteration 311, but the issue
> wasnt reproduced.
> It has failed with:
> "Cannot transfer Virtual Disk: Disk is locked. Please try again"

This is the same issue again: finalize finished and we waited 10
seconds, but the disk is still locked in the engine.

You can try to use a 20-second sleep, and hopefully we will reproduce
it.

Also I would add a short sleep after create_transfer() and before finalize(),
since a backup will never take zero time. Let's use 0.5 seconds, like we had
in the failing test.

Once we fix imagetransfer.finalize() to wait properly until the transfer
has really completed, it will be easier to test this issue.
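A sketch of such a wait: after finalize returns, poll the disk status until the engine releases the lock, instead of sleeping a fixed time. The status getter is injected as a callable (for example wrapping str(disk_service.get().status) from the Python SDK; that wiring is an assumption about the caller), so the loop itself is self-contained:

```python
import time

def wait_for_transfer_completion(get_disk_status, timeout=60, poll_interval=1.0):
    """Poll the disk status until the engine releases the lock.

    get_disk_status: zero-argument callable returning the current disk
    status as a lowercase string ("locked", "ok", "illegal"); the exact
    wiring to the SDK connection is left to the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_disk_status()
        if status == "ok":
            return  # lock released, the transfer has really completed
        if status == "illegal":
            raise RuntimeError("Disk became illegal while finalizing")
        if time.monotonic() > deadline:
            raise RuntimeError(
                "Timed out waiting for disk to unlock (status: %s)" % status)
        time.sleep(poll_interval)
```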

Comment 6 Ilan Zuckerman 2020-07-08 12:34:58 UTC
Nir, unfortunately the same kind of problematic behavior wasn't reproduced with the help of my script after 1k iterations [1] (as you suggested).
Any other advice on what we can try here?

Attaching the updated script here (20 sec wait + 0.5 sec in between), just as you said.

[1]:
Iteration % 999
Creating transfer...
Transfer ID: 2eeff75d-199b-436a-b27a-b590fca9ca8c
Transfer host name: host_mixed_2
Finalizing image transfer...
Transfer finalized. sleeping 20 sec
**********************************

Comment 8 Nir Soffer 2020-07-13 14:36:48 UTC
(In reply to Ilan Zuckerman from comment #6)
> Nir, unfortunately the same kind of problematic behavior wasn't reproduced
> with a help of my script after 1k iterations [1] (suggested by you).
> Any other advises we can do here?
> 
> Attaching the updated script here (20 sec wait + 0.5 in between) just as you
> said.
> 
> [1]:
> Iteration % 999
> Creating transfer...
> Transfer ID: 2eeff75d-199b-436a-b27a-b590fca9ca8c
> Transfer host name: host_mixed_2
> Finalizing image transfer...
> Transfer finalized. sleeping 20 sec
> **********************************

Thanks, this is very good info, even if we could not reproduce.

Now we know that adding a 20-second sleep works as a workaround :-)

I guess that something is missing in the flow - maybe we need to
transfer some data to trigger this bug. The engine monitors the activity
on the ticket, and the ticket status may affect the logic used to
clean up the transfer.

But I think you have invested enough, and now we need to wait for the engine
folks to debug this.
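To put some activity on the ticket between create_transfer() and finalize, the reproducer could write a little data through the imageio transfer URL before finalizing. A minimal sketch using only the standard library; the URL would come from transfer.transfer_url in the SDK, and writing from offset 0 via a plain PUT is an assumption about the imageio endpoint:

```python
import ssl
import urllib.request

def put_some_data(transfer_url, data, cafile=None):
    """PUT `data` to the imageio transfer URL (writing from offset 0),
    so the engine sees activity on the ticket before finalize."""
    # Build a TLS context only when a CA file is given; plain default
    # verification is used otherwise.
    context = ssl.create_default_context(cafile=cafile) if cafile else None
    request = urllib.request.Request(transfer_url, data=data, method="PUT")
    request.add_header("Content-Length", str(len(data)))
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```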

Comment 9 Nir Soffer 2020-07-14 00:16:01 UTC
Created attachment 1700922 [details]
Reproduction logs when uploading disk

Just reproduced this issue when uploading a disk.

There was nothing special in the upload, and it completed successfully, but
then the finalize step got stuck. I interrupted the finalize after
30 seconds.

$ ./upload_disk.py --engine-url https://engine3 --username admin@internal --password-file engine3-password --cafile engine3.pem --sd-name nfs1 --disk-format raw /var/tmp/fedora-32.raw 
Checking image...
Image format: raw
Disk format: raw
Disk content type: data
Disk provisioned size: 6442450944
Disk initial size: 6442450944
Disk name: fedora-32.raw
Disk backup: False
Connecting...
Creating disk...
Disk ID: e47fc0ab-01a2-48de-8ea7-9ff1af223b50
Creating image transfer...
Transfer ID: d8d1dfd7-351c-48e2-a18e-764d4f1e1f40
Transfer host name: host4
Uploading image...
[ 100.00% ] 6.00 GiB, 4.01 seconds, 1.50 GiB/s
Finalizing image transfer...
^CTraceback (most recent call last):
  File "./upload_disk.py", line 351, in <module>
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/home/nsoffer/src/ovirt-engine-sdk/sdk/examples/helpers/imagetransfer.py", line 275, in finalize_transfer
    time.sleep(1)
KeyboardInterrupt

Comment 10 Nir Soffer 2020-07-14 00:31:38 UTC
Bumping priority and severity. Engine APIs cannot fail randomly.

Comment 11 Ilan Zuckerman 2020-11-12 10:30:49 UTC
This issue was reproduced again during a scheduled automation run on a PPC env with NFS storage.

From the automation log [1]
From the engine log [2]: 'transfer_id': 'f26e8661-05f2-4c32-a5b7-18884588f7b2'

Attaching the daemon log from VDSM + the engine daemon.log
Version: rhv-4.4.3-13


[1]:

2020-11-12 00:12:18,485 - MainThread - RemoteExecutor - DEBUG - [root@10.16.214.116/qum5net] Results of command: python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py -c engine /root/backup/51c21155-b4d0-47cb-b6e8-a46a87e15818.202011120009.full.qcow2 --disk-sparse --sd-name nfs_0
2020-11-12 00:12:18,485 - MainThread - RemoteExecutor - DEBUG - [root@10.16.214.116/qum5net]   OUT: [   0.0 ] Checking image...
[   0.0 ] Image format: qcow2
[   0.0 ] Disk format: cow
[   0.0 ] Disk content type: data
[   0.0 ] Disk provisioned size: 1073741824
[   0.0 ] Disk initial size: 1638400
[   0.0 ] Disk name: 51c21155-b4d0-47cb-b6e8-a46a87e15818.202011120009.full.qcow2
[   0.0 ] Disk backup: False
[   0.0 ] Connecting...
[   0.0 ] Creating disk...
[   6.5 ] Disk ID: 52ac0207-c84e-421c-851b-cd426ebb187c
[   6.5 ] Creating image transfer...
[  10.3 ] Transfer ID: f26e8661-05f2-4c32-a5b7-18884588f7b2
[  10.3 ] Transfer host name: host_mixed_2
[  10.3 ] Uploading image...
[ ------- ] 0 bytes, 0.00 seconds, 0 bytes/s                                   
[ 100.00% ] 1.00 GiB, 0.08 seconds, 12.89 GiB/s                                
[  10.4 ] Finalizing image transfer...

2020-11-12 00:12:18,486 - MainThread - RemoteExecutor - DEBUG - [root@10.16.214.116/qum5net]   ERR: Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py", line 276, in <module>
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 327, in finalize_transfer
    .format(transfer.id))
RuntimeError: Timed out waiting for transfer f26e8661-05f2-4c32-a5b7-18884588f7b2 to finalize

===============================================================================================

[2]:

2020-11-12 00:12:21,593+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [3c74ce45-f11b-460c-8bdd-5813998b8f36] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='f8307695-11d0-419d-a774-6ab6ce02df88', ticketId='4aaf638a-1007-4b92-880d-4f5eb5d77af0', timeout='null'}), log id: 6079253b
2020-11-12 00:12:21,761+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [3c74ce45-f11b-460c-8bdd-5813998b8f36] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@2fbf289a, log id: 6079253b
2020-11-12 00:12:21,780+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [3c74ce45-f11b-460c-8bdd-5813998b8f36] EVENT_ID: UPLOAD_IMAGE_PAUSED_BY_SYSTEM_TIMEOUT(1,071), Upload was paused by system. Reason: timeout due to transfer inactivity.
2020-11-12 00:12:21,781+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [3c74ce45-f11b-460c-8bdd-5813998b8f36] Updating image transfer f26e8661-05f2-4c32-a5b7-18884588f7b2 (image 52ac0207-c84e-421c-851b-cd426ebb187c) phase to Paused by System


RHV-4.4-tier1-PPC #27

Comment 12 Ilan Zuckerman 2020-11-12 10:31:26 UTC
Created attachment 1728689 [details]
3rd reproduction. engine + vdsm daemon logs

Comment 13 Ilan Zuckerman 2020-11-12 10:32:23 UTC
Created attachment 1728690 [details]
3rd reproduction vdsm + engine logs

Comment 14 Ilan Zuckerman 2020-12-23 07:36:18 UTC
Reproduced again during an automation run.
Attaching the vdsm daemon log, engine daemon log, engine log, and vdsm log.


Upload disk attempt:

2020-12-21 23:39:21,929 - MainThread - RemoteExecutor - DEBUG - [root@10.46.16.38/qum5net] Results of command: python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py -c engine /root/backup/208ebc8d-7eb1-4686-afff-48327ea6c210.202012212337.incremental.qcow2 --disk-sparse --sd-name fcp_2
2020-12-21 23:39:21,929 - MainThread - RemoteExecutor - DEBUG - [root@10.46.16.38/qum5net]   OUT: [   0.0 ] Checking image...
[   0.0 ] Image format: qcow2
[   0.0 ] Disk format: cow
[   0.0 ] Disk content type: data
[   0.0 ] Disk provisioned size: 1073741824
[   0.0 ] Disk initial size: 1572864
[   0.0 ] Disk name: 208ebc8d-7eb1-4686-afff-48327ea6c210.202012212337.incremental.qcow2
[   0.0 ] Disk backup: False
[   0.0 ] Connecting...
[   0.0 ] Creating disk...
[  29.4 ] Disk ID: 1ace3648-bee5-44b2-85f9-d1d38245d0b5
[  29.4 ] Creating image transfer...
[  32.9 ] Transfer ID: c3c01c89-d7f7-4af4-8d91-f466c10f21e5
[  32.9 ] Transfer host name: host_mixed_2
[  32.9 ] Uploading image...
[ ------- ] 0 bytes, 0.00 seconds, 0 bytes/s                                   
[ 100.00% ] 1.00 GiB, 0.09 seconds, 11.57 GiB/s                                
[  33.0 ] Finalizing image transfer...

2020-12-21 23:39:21,929 - MainThread - RemoteExecutor - DEBUG - [root@10.46.16.38/qum5net]   ERR: Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py", line 276, in <module>
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 327, in finalize_transfer
    .format(transfer.id))
RuntimeError: Timed out waiting for transfer c3c01c89-d7f7-4af4-8d91-f466c10f21e5 to finalize


Engine:

2020-12-21 23:39:57,696+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] Renewing transfer ticket for Upload disk '208ebc8d-7eb1-4686-afff-48327ea6c210.202012212337.incremental.qcow2' (disk id: '1ace3648-bee5-44b2-85f9-d1d38245d0b5', image id: '190c8cbb-19d7-4e64-b63c-2c64cf41d4f5')
2020-12-21 23:39:57,706+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] START, ExtendImageTicketVDSCommand(HostName = host_mixed_2, ExtendImageTicketVDSCommandParameters:{hostId='41e89cd1-d19c-4209-8074-411d1969e328', ticketId='36f18989-c9a2-4c57-9af1-da2600f8135b', timeout='300'}), log id: 4b6573a0
2020-12-21 23:39:57,738+02 INFO  [org.ovirt.engine.extension.aaa.jdbc.core.Tasks] (default task-1) [] (house keeping) deleting failed logins prior to 2020-12-14 21:39:57Z.
2020-12-21 23:39:57,783+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] FINISH, ExtendImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 4b6573a0
2020-12-21 23:39:57,783+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] Transfer session with ticket id 36f18989-c9a2-4c57-9af1-da2600f8135b extended, timeout 300 seconds
2020-12-21 23:39:57,800+02 INFO  [org.ovirt.engine.core.sso.utils.AuthenticationUtils] (default task-1) [] User admin@internal-authz with profile [internal] successfully logged in with scopes: ovirt-app-api ovirt-ext=token-info:authz-search ovirt-ext=token-info:public-authz-search ovirt-ext=token-info:validate ovirt-ext=token:password-access
2020-12-21 23:39:57,931+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] START, GetImageTicketVDSCommand(HostName = host_mixed_2, GetImageTicketVDSCommandParameters:{hostId='41e89cd1-d19c-4209-8074-411d1969e328', ticketId='36f18989-c9a2-4c57-9af1-da2600f8135b', timeout='null'}), log id: 5e24d145
2020-12-21 23:39:57,948+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] FINISH, GetImageTicketVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.ImageTicketInformation@7982fbe5, log id: 5e24d145
2020-12-21 23:39:57,960+02 INFO  [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-2) [5feb4053] Running command: CreateUserSessionCommand internal: false.
2020-12-21 23:39:58,057+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] EVENT_ID: UPLOAD_IMAGE_PAUSED_BY_SYSTEM_TIMEOUT(1,071), Upload was paused by system. Reason: timeout due to transfer inactivity.
2020-12-21 23:39:58,064+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] Updating image transfer c3c01c89-d7f7-4af4-8d91-f466c10f21e5 (image 1ace3648-bee5-44b2-85f9-d1d38245d0b5) phase to Paused by System
2020-12-21 23:39:58,081+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-2) [5feb4053] EVENT_ID: USER_VDC_LOGIN(30), User admin@internal-authz connecting from '10.46.16.101' using session 'F4fvKEGVfVOHJ4/oEuKXwTK/rv9Akcx8gDq5GSbH4N9GI+C4hu0hPUpv+vm7om2Nvxk293rnJxXXDvXEnTgR2w==' logged in.
2020-12-21 23:39:58,304+02 INFO  [org.ovirt.engine.core.bll.aaa.LogoutSessionCommand] (default task-2) [66c17bc6] Running command: LogoutSessionCommand internal: false.
2020-12-21 23:39:58,337+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-2) [66c17bc6] EVENT_ID: USER_VDC_LOGOUT(31), User admin@internal-authz connected from '10.46.16.101' using session 'F4fvKEGVfVOHJ4/oEuKXwTK/rv9Akcx8gDq5GSbH4N9GI+C4hu0hPUpv+vm7om2Nvxk293rnJxXXDvXEnTgR2w==' logged out.
2020-12-21 23:39:58,353+02 INFO  [org.ovirt.engine.core.bll.hostdev.RefreshHostDevicesCommand] (EE-ManagedThreadFactory-engine-Thread-5) [55a60555] Running command: RefreshHostDevicesCommand internal: true. Entities affected :  ID: 01d89ff1-0855-47ba-8cf6-0c1eb31226bf Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2020-12-21 23:39:58,359+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to lynx22.lab.eng.tlv2.redhat.com/10.46.16.37
2020-12-21 23:39:59,288+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [] EVENT_ID: NO_FAULTY_MULTIPATHS_ON_HOST(11,501), No faulty multipath paths on host host_mixed_1
2020-12-21 23:39:59,326+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-56) [] EVENT_ID: NO_FAULTY_MULTIPATHS_ON_HOST(11,501), No faulty multipath paths on host host_mixed_2
2020-12-21 23:39:59,331+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: NO_FAULTY_MULTIPATHS_ON_HOST(11,501), No faulty multipath paths on host host_mixed_3
2020-12-21 23:40:00,177+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-21) [fcfd41eb-3d65-4327-b5e0-031aef0f199d] Transfer was paused by system. Upload disk '208ebc8d-7eb1-4686-afff-48327ea6c210.202012212337.incremental.qcow2' (disk id: '1ace3648-bee5-44b2-85f9-d1d38245d0b5', image id: '190c8cbb-19d7-4e64-b63c-2c64cf41d4f5')


VDSM daemon log:

2020-12-21 23:38:20,566 INFO    (Thread-12) [tickets] [local] ADD ticket={'dirty': False, 'ops': ['write'], 'size': 1073741824, 'sparse': False, 'transfer_id': 'c3c01c89-d7f7-4af4-8d91-f466c10f21e5', 'uuid': '36f18989-c9a2-4c57-9af1-da2600f8135b', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/36f18989-c9a2-4c57-9af1-da2600f8135b.sock'}
...
2020-12-21 23:38:21,743 INFO    (Thread-14) [http] CLOSE connection=14 client=local [connection 1 ops, 0.069095 s] [dispatch 7 ops, 0.021079 s] [write 2 ops, 0.013585 s, 704.00 KiB, 50.61 MiB/s] [write.read 2 ops, 0.008867 s, 704.00 KiB, 77.53 MiB/s] [write.write 2 ops, 0.004582 s, 704.00 KiB, 150.03 MiB/s] [zero 4 ops, 0.004011 s, 509.88 MiB, 124.14 GiB/s] [zero.zero 4 ops, 0.003920 s, 509.88 MiB, 127.03 GiB/s] [flush 1 ops, 0.000657 s]
2020-12-21 23:38:21,743 INFO    (Thread-15) [backends.nbd] Close backend address='/run/vdsm/nbd/36f18989-c9a2-4c57-9af1-da2600f8135b.sock'
...
2020-12-21 23:39:57,739 INFO    (Thread-25) [tickets] [local] EXTEND timeout=300 ticket=36f18989-c9a2-4c57-9af1-da2600f8135b
2020-12-21 23:39:57,740 INFO    (Thread-25) [http] CLOSE connection=25 client=local [connection 1 ops, 0.000861 s] [dispatch 1 ops, 0.000303 s]



Engine daemon log:

2020-12-21 23:38:20,576 INFO    (Thread-12) [tickets] [127.0.0.1] ADD ticket={'dirty': False, 'ops': ['write'], 'size': 1073741824, 'sparse': False, 'transfer_id': 'c3c01c89-d7f7-4af4-8d91-f466c10f21e5', 'uuid': '36f18989-c9a2-4c57-9af1-da2600f8135b', 'timeout': 300, 'url': 'https://lynx23.lab.eng.tlv2.redhat.com:54322/images/36f18989-c9a2-4c57-9af1-da2600f8135b'}
2020-12-21 23:38:20,577 INFO    (Thread-12) [http] CLOSE connection=12 client=127.0.0.1 [connection 1 ops, 0.002462 s] [dispatch 1 ops, 0.001231 s]
2020-12-21 23:39:57,800 INFO    (Thread-13) [http] OPEN connection=13 client=127.0.0.1
2020-12-21 23:39:57,805 INFO    (Thread-13) [tickets] [127.0.0.1] EXTEND timeout=300 ticket=36f18989-c9a2-4c57-9af1-da2600f8135b
2020-12-21 23:39:57,812 INFO    (Thread-13) [http] CLOSE connection=13 client=127.0.0.1 [connection 1 ops, 0.010146 s] [dispatch 1 ops, 0.001279 s]
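The ADD and EXTEND entries above show imageio's ticket timeout mechanism: a ticket is installed with a timeout (here 300 seconds), and the engine must periodically EXTEND it or the daemon considers it expired. A minimal sketch of that expiry logic, for illustration only (the class and method names are made up and do not reflect the actual ovirt-imageio implementation):

```python
import time

class Ticket:
    """Illustrative model of an imageio ticket with an idle timeout."""

    def __init__(self, uuid, timeout, clock=time.monotonic):
        self.uuid = uuid
        self.timeout = timeout
        self._clock = clock
        self._deadline = clock() + timeout

    def extend(self, timeout=None):
        # An EXTEND request resets the deadline, as seen in the log
        # lines above ("EXTEND timeout=300 ticket=...").
        if timeout is not None:
            self.timeout = timeout
        self._deadline = self._clock() + self.timeout

    def expired(self):
        return self._clock() >= self._deadline
```

If the engine stops sending EXTEND requests (for example, because the transfer state machine is stuck), the ticket eventually expires even though the client believes the transfer is still in progress.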

Comment 15 Ilan Zuckerman 2020-12-23 07:42:40 UTC
Created attachment 1741488 [details]
4th time reproduction

Comment 16 Nir Soffer 2021-01-03 17:07:55 UTC
Ilan, what kind of info do you need? We know that this issue is not fixed yet.

Comment 18 Bella Khizgiyaev 2021-06-01 11:58:15 UTC
After looking at the logs, it seems that something goes wrong before the finalizing stage.
From the vdsm and engine logs alone it is hard to tell exactly what happened, and I did not manage to reproduce the issue on my env; it seems to fail fairly randomly.
If you have the imageio logs as well, can you please attach them? If not, could you collect them the next time you face this issue?

Comment 19 Ilan Zuckerman 2021-06-01 12:49:21 UTC
Hi Bella, please find the imageio (daemon) logs in the following attached archives:
"4th time reproduction"
"3rd reproduction. engine + vdsm daemon logs "

Comment 20 Xiaodai Wang 2021-08-09 11:03:45 UTC
A similar issue was found by virt-v2v, but I can't find any records of the transfer id
'89b947d5-7f90-434f-9914-ab995a4def07' in engine.log.
Anyway, I attached the logs from when the issue happened.

2021-08-06 16:27:43,295 process          L0416 DEBUG| [stdout] [ 332.1] Creating output metadata
2021-08-06 16:32:44,202 process          L0416 DEBUG| [stderr] Traceback (most recent call last):
2021-08-06 16:32:44,203 process          L0416 DEBUG| [stderr]   File "/tmp/v2v.aKUb7Q/rhv-upload-finalize.py", line 172, in <module>
2021-08-06 16:32:44,203 process          L0416 DEBUG| [stderr]     finalize_transfer(connection, transfer_id, disk_id)
2021-08-06 16:32:44,203 process          L0416 DEBUG| [stderr]   File "/tmp/v2v.aKUb7Q/rhv-upload-finalize.py", line 130, in finalize_transfer
2021-08-06 16:32:44,203 process          L0416 DEBUG| [stderr]     raise RuntimeError(
2021-08-06 16:32:44,203 process          L0416 DEBUG| [stderr] RuntimeError: timed out waiting for transfer 89b947d5-7f90-434f-9914-ab995a4def07 to finalize, transfer is transferring
2021-08-06 16:32:44,213 process          L0416 DEBUG| [stderr] virt-v2v: error: failed to finalize the transfers, see earlier errors
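The RuntimeError above comes from a polling loop in rhv-upload-finalize.py: after calling finalize, the script repeatedly checks the transfer phase until the transfer disappears (success) or a timeout elapses. A simplified sketch of that logic (the `get_phase` callable and the exact phase strings are illustrative; the real script queries the transfer through the oVirt SDK):

```python
import time

def wait_for_finalize(get_phase, transfer_id, timeout=300, poll_interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll get_phase() until the transfer completes or the timeout expires.

    get_phase returns the current transfer phase as a string, or None once
    the transfer has been removed by the engine (finalization succeeded).
    """
    deadline = clock() + timeout
    phase = "transferring"
    while clock() < deadline:
        phase = get_phase()
        if phase is None:
            return  # transfer is gone: finalization completed
        if phase not in ("transferring", "finalizing_success"):
            raise RuntimeError(
                "transfer %s failed: phase=%s" % (transfer_id, phase))
        sleep(poll_interval)
    raise RuntimeError(
        "timed out waiting for transfer %s to finalize, transfer is %s"
        % (transfer_id, phase))
```

In the virt-v2v run above, the transfer never left the "transferring" phase, so the loop hit its deadline and raised the "timed out ... transfer is transferring" error.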

Comment 22 Benny Zlotnik 2021-09-14 21:25:39 UTC
Reproducing this bug is not easy; the verification I used for the patch is:
1. Start a small disk upload
2. Set a breakpoint at TransferImageStatusCommand#execute[1]
3. Once another thread reaches the same breakpoint with phase=null, switch to it, and let it finish
4. continue the thread from step 2

Comment 25 Evelina Shames 2022-01-26 16:13:09 UTC
(In reply to Benny Zlotnik from comment #22)
> Reproducing this bug is not easy, the verification I used for the patch is:
> 1. Start a small disk upload
> 2. Set a breakpoint at TransferImageStatusCommand#execute[1]
> 3. Once another thread reaches the same breakpoint with phase=null, switch
> to it, and let it finish
> 4. continue the thread from step 2

As there is no clear verification flow, this bug was verified based on our automation on ovirt-engine-4.4.10.5:
- All Tier1 runs - passed successfully.
- Ran a few automation TCs (the ones where we occasionally saw this failure) in a loop overnight - passed successfully.
- Ran an SDK script 'download_disk_skip.py' (created by Nir) in a loop on 4 different disks, 1000 iterations each - passed successfully.

Moving to 'Verified'.
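A verification loop like the one described above can be scripted with a small driver that stops at the first failing iteration. This is only a sketch; download_disk_skip.py and its arguments are environment-specific and are not reproduced here:

```python
import subprocess
import sys

def stress_loop(cmd, iterations):
    """Run cmd repeatedly, stopping at the first failure.

    Returns the number of iterations that succeeded; a clean run
    returns `iterations`.
    """
    for i in range(iterations):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print("iteration %d failed (rc=%d)" % (i + 1, result.returncode),
                  file=sys.stderr)
            return i
    return iterations
```

For example, `stress_loop([sys.executable, "download_disk_skip.py", ...], 1000)` would match the 1000-iteration runs mentioned in the verification, with the script path and arguments filled in for the target environment.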

