Bug 1892676 - The ImageIO transfer doesn't finalize after timeout is reached
Summary: The ImageIO transfer doesn't finalize after timeout is reached
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backup-Restore.VMs
Version: 4.4.1.1
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: ovirt-4.4.5
: ---
Assignee: Ahmad Khiet
QA Contact: Ilan Zuckerman
bugs@ovirt.org
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-10-29 12:39 UTC by Yuriy Khokhlov (Veeam)
Modified: 2021-03-18 15:14 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 15:14:06 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+


Attachments (Terms of Use)
logs (1.03 MB, application/zip)
2021-03-08 08:02 UTC, Ilan Zuckerman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 113125 0 master MERGED restapi: add timeout_policy attribute to imagetransfer 2021-02-18 22:04:22 UTC
oVirt gerrit 113126 0 master MERGED core: add timeout policy to an image transfer 2021-03-01 14:41:51 UTC
oVirt gerrit 113163 0 master MERGED Update to model 4.4.23, metamodel 1.3.4 2021-02-18 22:04:22 UTC
oVirt gerrit 113184 0 master MERGED restapi: Refine ImageTransferTimeoutPolicy documentation 2021-02-18 22:04:22 UTC
oVirt gerrit 113235 0 master MERGED python-sdk: added timeout_policy attribute 2021-03-01 15:42:56 UTC

Description Yuriy Khokhlov (Veeam) 2020-10-29 12:39:25 UTC
Description of problem:
Open ImageIO transfer and wait until timeout expired
After expiration, the transfer became inactive, but not closed
VM disk left in a locked state until the user manually not finalizing the transfer.

Version-Release number of selected component (if applicable):
Two RHEV 4.4 nodes with layer rhvh-4.4.1.1-0.20200722.0+1 
Ovirt-manager 4.4.1
NFS 4.1 storage domain
ImageIO version:
# rpm -qa | grep image
ovirt-imageio-client-2.0.9-1.el8ev.x86_64
genisoimage-1.1.11-39.el8.x86_64
ovirt-imageio-common-2.0.9-1.el8ev.x86_64
ovirt-imageio-daemon-2.0.9-1.el8ev.x86_64
redhat-virtualization-host-image-update-placeholder-4.4.1-1.el8ev.noarch


How reproducible:


Steps to Reproduce:
1. Start an Image Transfer for upload. 
2. Wait until the timeout expired.

Actual results:
After expiration, the transfer became inactive, but not closed
VM disk left in a locked state until the user manually not finalizing the transfer.

Expected results:
After expiration, the transfer became inactive and closed.
VM disk should be unlocked.


Additional info:
Probably this is:
https://bugzilla.redhat.com/show_bug.cgi?id=1524184

Comment 1 Nir Soffer 2020-11-11 11:56:05 UTC
(In reply to Yuriy Khokhlov (Veeam) from comment #0)
> Steps to Reproduce:
> 1. Start an Image Transfer for upload. 
> 2. Wait until the timeout expired.

Do you mean inactivity_timeout?

> Actual results:
> After expiration, the transfer became inactive, but not closed
> VM disk left in a locked state until the user manually not finalizing the
> transfer.

This is the expected behavior in the current system.

We cancel transfer only for download and only if the the entire disk was downloaded.
This heuristics is required when downloading disk from the UI, when the download
is performed by the browser, and the browser has no way to tell the system when
the operation is finished.

When using the SDK, the SDK user is responsible for finalizing or cancelling transfers. 

> Expected results:
> After expiration, the transfer became inactive and closed.
> VM disk should be unlocked.

It is possible to change the behavior so the system cancels image transfers
after a timeout, but this may cause trouble to exiting applications that 
depend on the current behavior.

I think this should be closed as not a bug. The backup application must ensure that
image transfers are canceled or finalized in all cases.

Comment 2 Yury.Panchenko 2020-11-11 16:05:27 UTC
Hello, Nir.
> It is possible to change the behavior so the system cancels image transfers
> after a timeout, but this may cause trouble to exiting applications that 
> depend on the current behavior.
I understood your point. Its correct for UI download case.
But, i don't agreed that backup software can always close transfer session.
For example, backup server can crushed during backup, in that case vm will be locked in backup state, until system administrators fix backup server.
Second thing - lost network connection for backup server to hypervisor.
Third thing - just lost backup server database (ransomware attack?). 
Of course this issues can be fixed in some time, and after that backup software can finish session (but only if we have normal backup database with save sessions lock).
But, all this time customer cant poweroff or restart vm and cant make changing in configuration, i think this behavior isn't good for customer.

Is it possible to add special flag for UI, which preserve current behavior, but for regular API usage make this timeout working?
Thanks.

Comment 3 Nir Soffer 2020-11-11 16:48:41 UTC
(In reply to Yury.Panchenko from comment #2)
Yuriy, what if we add a flag so you can control timeout policy?

Example usage:

   <imagetransfer>
       <inactivity_timeout>60</inactivity_timeout>
       <timeout_policy>cancel</timeout_policy>
       ...
   </imagetransfer>

With this if the system does not detect any activity for 60 seconds
the system will cancel the transfer and unlock the disk.

So if your backup application crashes or cannot access the engine the
system will recover automatically.

With the flag existing users not specifying the flag will use the old
policy.

Comment 4 Yury.Panchenko 2020-11-11 17:37:44 UTC
> Yuriy, what if we add a flag so you can control timeout policy?
It will be great.
We can use that flag in our software, and problem is gone.

Comment 5 Yuriy Khokhlov (Veeam) 2020-11-19 08:46:38 UTC
Great!

Comment 8 Ilan Zuckerman 2021-03-08 08:01:44 UTC
Could NOT verify on rhv-release-4.4.5-7-001.noarch

See below the following cases:
Set timeout_policy to 'cancel' and Initiate Upload transfer
set the timeout_policy to 'pause' and Initiate Download transfer

Attaching engine + vdsm logs



Setup:
Modify the helpers/imagetransfer.py   'inactivity_timeout' to 10 sec:
Leave the timeout policy as 'legacy'

def create_transfer(
        connection, disk=None, direction=types.ImageTransferDirection.UPLOAD,
        host=None, backup=None, inactivity_timeout=10, timeout=60,
        disk_snapshot=None, shallow=None,
        timeout_policy=types.ImageTransferTimeoutPolicy.LEGACY):

initiate Upload transfer.
Expected: Upload will be paused due to inactivity.
Actual: As expected

[root@storage-ge13-vdsm1 examples]# python3 image_transfer.py -c engine upload 0826a12c-d05c-4331-bf4e-85153d2c3973


Engine.log:
2021-03-08 09:28:01,462+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [f29865ec-1a5a-4ff2-92a4-0343964e8780] EVENT_ID: UPLOAD_IMAGE_PAUSED_BY_SYSTEM_TIMEOUT(1,071), Upload was paused by system. Reason: timeout due to transfer inactivity.

-------------------------------------

Initiate Download transfer
Expected: The transfer is cancelled
Actual: As expected

[root@storage-ge13-vdsm1 examples]# python3 image_transfer.py -c engine download 0826a12c-d05c-4331-bf4e-85153d2c3973

engine.log:
2021-03-08 09:26:06,120+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [79e18f93-0c9d-490b-baf4-397554c42eab] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.


---------------------------------------

Edit the file again, and set the timeout_policy to 'cancel':

def create_transfer(
        connection, disk=None, direction=types.ImageTransferDirection.UPLOAD,
        host=None, backup=None, inactivity_timeout=10, timeout=60,
        disk_snapshot=None, shallow=None,
        timeout_policy=types.ImageTransferTimeoutPolicy.CANCEL):


Initiate Download transfer
Expected: The transfer is cancelled
Actual: As expected

2021-03-08 09:36:16,778+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [288c0127-33b9-4945-8889-ffb2ad9ea9a6] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.


Initiate Upload transfer
Expected: The transfer is cancelled
!!!!! Actual: THE TRANSFER IS PAUSED !!!!!

[root@storage-ge13-vdsm1 examples]# python3 image_transfer.py -c engine upload 0826a12c-d05c-4331-bf4e-85153d2c3973

engine.log:
2021-03-08 09:37:46,076+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [e4c7aa5f-654e-4982-9b8d-a628e4bd91e8] EVENT_ID: UPLOAD_IMAGE_PAUSED_BY_SYSTEM_TIMEOUT(1,071), Upload was paused by system. Reason: timeout due to transfer inactivity.

from image transfer log:
2021-03-08 09:37:31,407 INFO    (MainThread) [helpers] Creating image transfer for disk=0826a12c-d05c-4331-bf4e-85153d2c3973, direction=upload host=None backup=None shallow=False timeout_policy=cancel
2021-03-08 09:37:32,295 INFO    (MainThread) [helpers] Transfer 


---------------------------------------

Edit the file again, and set the timeout_policy to 'pause':

def create_transfer(
        connection, disk=None, direction=types.ImageTransferDirection.UPLOAD,
        host=None, backup=None, inactivity_timeout=10, timeout=60,
        disk_snapshot=None, shallow=None,
        timeout_policy=types.ImageTransferTimeoutPolicy.PAUSE):

Initiate Download transfer
Expected: The transfer is PAUSED
!!!! Actual: The transfer was CANCELLED !!!!

[root@storage-ge13-vdsm1 examples]# python3 image_transfer.py -c engine download 0826a12c-d05c-4331-bf4e-85153d2c3973

engine.log:
2021-03-08 09:49:15,356+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-77) [bb5964fb-e4f1-436e-8c15-5cc7834a343f] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.

from image transfer log:
2021-03-08 09:49:00,268 INFO    (MainThread) [helpers] Creating image transfer for disk=0826a12c-d05c-4331-bf4e-85153d2c3973, direction=download host=None backup=None shallow=False timeout_policy=pause


Initiate Upload transfer
Expected: The transfer is PAUSED
Actual: As expected

[root@storage-ge13-vdsm1 examples]# python3 image_transfer.py -c engine upload 0826a12c-d05c-4331-bf4e-85153d2c3973

engine.log:
2021-03-08 09:53:59,717+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-52) [cb58c751-ce7a-4c7d-8a14-4aa604a2a735] EVENT_ID: UPLOAD_IMAGE_PAUSED_BY_SYSTEM_TIMEOUT(1,071), Upload was paused by system. Reason: timeout due to transfer inactivity.

from image transfer log:
2021-03-08 09:53:44,802 INFO    (MainThread) [helpers] Creating image transfer for disk=0826a12c-d05c-4331-bf4e-85153d2c3973, direction=upload host=None backup=None shallow=False timeout_policy=pause

Comment 9 Ilan Zuckerman 2021-03-08 08:02:27 UTC
Created attachment 1761507 [details]
logs

Comment 10 Ahmad Khiet 2021-03-08 20:04:26 UTC
Hi Ilan, 

After a long debugging session, and forcing each transfer to be a timed-out transfer by disabling the time, and forcing each time to perform a timeout policy, 
I see that all the described is working in my environment. 

can you please provide more information, about image type, image size, or maybe share the same image.
I see that if the image size is too small the transfer always will succeed.

let's meet tomorrow.

Comment 11 Ahmad Khiet 2021-03-09 09:42:27 UTC
After further checking with Ilan we found that the patch https://gerrit.ovirt.org/c/ovirt-engine/+/113126
is not included in realse: rhv-release-4.4.5-7-001.noarch, and will be included in the next release.

Comment 12 Ahmad Khiet 2021-03-09 09:43:52 UTC
*build

Comment 14 Ilan Zuckerman 2021-03-10 12:34:55 UTC
Moving to verified. 

Both of the problematic cases from comment #8 were also verified on rhv-4.4.5-9

Set timeout_policy to 'cancel' and Initiate Upload transfer
Expected: The transfer is cancelled
Actual: the transfer is cancelled

set the timeout_policy to 'pause' and Initiate Download transfer
Expected: The transfer is Paused
Actual: the transfer is paused

Comment 15 Sandro Bonazzola 2021-03-18 15:14:06 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.