Bug 2057445

Summary: VM disk remains in locked state if image transfer (image download) times out
Product: Red Hat Enterprise Virtualization Manager
Reporter: Raul Aldaz <raldaz>
Component: ovirt-engine
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA
QA Contact: Evelina Shames <eshames>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.4.9
CC: aefrat, ahadas, bcholler, bzlotnik, emarcus, jortialc, mgokhool, mtessun, pelauter
Target Milestone: ovirt-4.4.10-3
Keywords: ZStream
Target Release: 4.4.10
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine-4.4.10.7
Doc Type: Bug Fix
Doc Text:
Previously, when an image transfer failed during cleanup, the cleanup was not retried as it should have been, and the disk remained locked. In this release, the retry mechanism for the transfer cleanup phase works correctly, and the disk is unlocked once the transfer is complete.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-24 13:30:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Raul Aldaz 2022-02-23 12:07:33 UTC
Description of problem:


During a NetBackup job, the VM snapshot image is downloaded successfully, but it is left in a locked state after the client connection times out and the ticket is closed on ovirt-imageio.

Looks similar to BZ #1796231

Engine logs showing the timeout during ticket cancellation and the failure to remove the locked image:

~~~
2022-02-15 20:45:49,697+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] Failed in 'RemoveImageTicketVDS' method
2022-02-15 20:45:49,703+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM <hostname> command RemoveImageTicketVDS failed: Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n'
2022-02-15 20:45:49,703+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand' return value 'StatusOnlyReturn [status=Status [code=482, message=Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n']]'
2022-02-15 20:45:49,703+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] HostName = <hostname>
2022-02-15 20:45:49,703+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] Command 'RemoveImageTicketVDSCommand(HostName = <hostname>, RemoveImageTicketVDSCommandParameters:{hostId='86ba23d1-8f2d-4852-93d8-c323ef879f76', ticketId='07fe38f2-c130-45f0-84b3-51c399936899', timeout='null'})' execution failed: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n', code = 482
2022-02-15 20:45:49,703+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] FINISH, RemoveImageTicketVDSCommand, return: , log id: 5f8b94d0
2022-02-15 20:45:49,703+01 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] Failed to stop image transfer session for ticket '07fe38f2-c130-45f0-84b3-51c399936899': {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n', code = 482 (Failed with error ImageDeamonError and code 482)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:114)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.removeImageTicketFromDaemon(TransferDiskImageCommand.java:1270)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.stopImageTransferSession(TransferDiskImageCommand.java:1244)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.handleFinalizingSuccess(TransferDiskImageCommand.java:811)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.executeStateHandler(TransferDiskImageCommand.java:526)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.proceedCommandExecution(TransferDiskImageCommand.java:492)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommandCallback.doPolling(TransferImageCommandCallback.java:21)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at org.glassfish.javax.enterprise.concurrent.1.redhat-00001//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:360)
        at org.glassfish.javax.enterprise.concurrent.1.redhat-00001//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:511)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
        at org.glassfish.javax.enterprise.concurrent.1.redhat-00001//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:227)
Caused by: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n', code = 482

2022-02-15 20:47:18,057+01 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-248) [30219dfb-71f3-4537-97e6-21ccd0618719] Lock Acquired to object 'EngineLock:{exclusiveLocks='[421cd54e-9f41-d987-976e-2e259c5e218f=VM]', sharedLocks=''}'
2022-02-15 20:47:18,089+01 WARN  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-248) [30219dfb-71f3-4537-97e6-21ccd0618719] Validation of action 'RemoveSnapshot' failed for user <username>. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_DISKS_LOCKED,$diskAliases <VM name>
2022-02-15 20:47:18,089+01 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-248) [30219dfb-71f3-4537-97e6-21ccd0618719] Lock freed to object 'EngineLock:{exclusiveLocks='[421cd54e-9f41-d987-976e-2e259c5e218f=VM]', sharedLocks=''}'
~~~

ovirt-imageio logs showing the timeout being reached after the default 60-second remove_timeout:

~~~
2022-02-15 20:44:49,693 DEBUG   (Thread-518) [auth] Cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899
2022-02-15 20:44:49,693 INFO    (Thread-518) [auth] Waiting until ticket 07fe38f2-c130-45f0-84b3-51c399936899 is unused
2022-02-15 20:45:49,693 ERROR   (Thread-518) [http] Client error: Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899
2022-02-15 20:46:59,153 DEBUG   (Thread-455) [auth] Removing connection 455 context from ticket 07fe38f2-c130-45f0-84b3-51c399936899
2022-02-15 20:46:59,155 DEBUG   (Thread-456) [auth] Removing connection 456 context from ticket 07fe38f2-c130-45f0-84b3-51c399936899
2022-02-15 20:46:59,158 DEBUG   (Thread-456) [auth] Removing last connection for ticket 07fe38f2-c130-45f0-84b3-51c399936899
~~~


Version-Release number of selected component (if applicable):

ovirt-engine-4.4.9.5-0.1.el8ev.noarch
ovirt-imageio-daemon-2.3.0-1.el8ev.x86_64


How reproducible:

Unable to reproduce so far.



Actual results:

The snapshot image is left in a locked state, which prevents its removal and blocks further backup jobs.


Expected results:

The snapshot image state should be cleared so that the image can be removed.


Additional info:

According to the associated support case, this behavior appeared after upgrading from 4.4.7 to 4.4.9.

Comment 1 Juan Orti 2022-02-25 12:50:40 UTC
Based on the following comment, a "409 Conflict" error should be retried by the client, but it looks like there is no attempt to retry the removal of the ticket:

https://github.com/oVirt/ovirt-imageio/blob/d5e9c757e0a44fdd88d9e54b77283e718407b9cc/ovirt_imageio/_internal/config.py#L166

    # Number of seconds to wait when removing a ticket. If ticket cannot be
    # removed within this timeout, the request will fail with "409 Conflict",
    # and the user need to retry the request again. A ticket can be removed
    # only when the number of connections using the ticket is zero.
    remove_timeout = 60
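
A minimal, hypothetical sketch of the retry behavior that comment describes (deleting the ticket again while the daemon answers 409 Conflict). The endpoint path, helper name, and status codes are illustrative assumptions, not the actual engine or vdsm code:

~~~
import time
import requests  # assumption: any HTTP client would do


def remove_ticket_with_retry(base_url, ticket_id, attempts=5, delay=10):
    """Delete an image transfer ticket, retrying while the daemon still
    reports active connections (409 Conflict)."""
    url = "{}/tickets/{}".format(base_url, ticket_id)
    for _ in range(attempts):
        response = requests.delete(url)
        if response.status_code in (200, 204):
            return  # ticket removed
        if response.status_code == 409:
            # Ticket still has open connections; wait and try again.
            time.sleep(delay)
            continue
        response.raise_for_status()
    raise RuntimeError("Timed out removing ticket {}".format(ticket_id))
~~~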

Comment 9 Benny Zlotnik 2022-03-07 09:33:17 UTC
Reproducer with a vdsm code change:

edit /usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py +95

# Module-level flag so the first remove_ticket() call fails and the
# engine's retry mechanism gets exercised on the next attempt.
failed_once = False
@requires_image_daemon
def remove_ticket(uuid):
    global failed_once

    if not failed_once:
        # Simulate a single transient failure; the engine should retry.
        failed_once = True
        raise Exception("failed remove ticked")

    failed_once = False
    _request("DELETE", uuid)

After editing the file, restart vdsm:
$ systemctl restart vdsmd

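To trigger the cleanup path, start a disk download and finalize it. Below is a minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, and DISK_ID are placeholders, and error handling is omitted:

~~~
import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

transfers_service = connection.system_service().image_transfers_service()

# Start a download transfer for the disk; this creates the imageio ticket.
transfer = transfers_service.add(
    types.ImageTransfer(
        disk=types.Disk(id='DISK_ID'),  # placeholder disk id
        direction=types.ImageTransferDirection.DOWNLOAD,
    )
)
transfer_service = transfers_service.image_transfer_service(transfer.id)

# Wait until the transfer is ready, then finalize it. Finalization is the
# step in which the engine removes the ticket, i.e. where the patched
# remove_ticket() fails on the first attempt.
while transfer_service.get().phase == types.ImageTransferPhase.INITIALIZING:
    time.sleep(1)
transfer_service.finalize()
~~~
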
Downloading a disk should show an exception like this:
2022-03-07 04:28:29,683-05 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-13) [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] Failed to stop image transfer 'a82a9461-da19-477e-8db5-b29370462ed9' for ticket '4a4f5a28-a53f-4dd7-a9a6-9ad4ab280b2d': {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = failed remove ticked, code = 100 (Failed with error GeneralException and code 100)

However, it will be retried and should succeed:
2022-03-07 04:28:33,720-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] START, RemoveImageTicketVDSCommand(HostName = hosto72, RemoveImageTicketVDSCommandParameters:{hostId='7ee44999-d01b-43c2-93dc-499f6345a9fe', ticketId='4a4f5a28-a53f-4dd7-a9a6-9ad4ab280b2d', timeout='null'}), log id: 5789d81c

2022-03-07 04:28:33,735-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] FINISH, RemoveImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 5789d81c

The disk should be unlocked, and subsequent transfers on the same disk should succeed.
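
To confirm the disk is unlocked, its status can be polled via the SDK as well; a short sketch reusing the connection and DISK_ID placeholders from the sketch above:

~~~
disk_service = connection.system_service().disks_service().disk_service('DISK_ID')

# Poll until the disk leaves the LOCKED state; with the fix it should
# return to OK shortly after the transfer is finalized.
for _ in range(30):
    if disk_service.get().status == types.DiskStatus.OK:
        break
    time.sleep(10)
else:
    raise RuntimeError('Disk is still locked')

connection.close()
~~~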

Comment 14 Evelina Shames 2022-03-14 10:26:22 UTC
(In reply to Benny Zlotnik from comment #9)
> Reproducer with a vdsm code change:
> 
> edit /usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py +95
> 
> failed_once = False
> @requires_image_daemon
> def remove_ticket(uuid):
>     global failed_once
> 
>     if not failed_once:
>         failed_once = True
>         raise Exception("failed remove ticked")
> 
>     failed_once = False
>     _request("DELETE", uuid)
> 
> after editing:
> $ systemctl restart vdsmd
> 
> Downloading a disk should show an exception like this:
> 2022-03-07 04:28:29,683-05 ERROR
> [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-13)
> [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] Failed to stop image transfer
> 'a82a9461-da19-477e-8db5-b29370462ed9' for ticket
> '4a4f5a28-a53f-4dd7-a9a6-9ad4ab280b2d': {}:
> org.ovirt.engine.core.common.errors.EngineException: EngineException:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException:
> VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS,
> error = failed remove ticked, code = 100 (Failed with error GeneralException
> and code 100)
> 
> However, it will be retried and should succeed:
> 2022-03-07 04:28:33,720-05 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54)
> [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] START,
> RemoveImageTicketVDSCommand(HostName = hosto72,
> RemoveImageTicketVDSCommandParameters:{hostId='7ee44999-d01b-43c2-93dc-
> 499f6345a9fe', ticketId='4a4f5a28-a53f-4dd7-a9a6-9ad4ab280b2d',
> timeout='null'}), log id: 5789d81c
> 
> 2022-03-07 04:28:33,735-05 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54)
> [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] FINISH, RemoveImageTicketVDSCommand,
> return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id:
> 5789d81c
> 
> The disk should be unlocked and subsequent transfers on the same disk should
> succeed

Verified with the above steps on ovirt-engine-4.4.10.7:
Downloading a disk showed an exception:
2022-03-14 12:10:43,211+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-28) [dfd772a6-671f-42f5-abc4-627d2963923f] Failed to stop image transfer session for ticket 'a5c2a518-9e9a-4690-93eb-0af5cd4b8f7e': {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = failed remove ticked, code = 100 (Failed with error GeneralException and code 100)

Retried:
2022-03-14 12:10:53,234+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-75) [dfd772a6-671f-42f5-abc4-627d2963923f] START, RemoveImageTicketVDSCommand(HostName = host_mixed_1, RemoveImageTicketVDSCommandParameters:{hostId='15614953-8aef-415c-bc8f-de97af12ee6f', ticketId='a5c2a518-9e9a-4690-93eb-0af5cd4b8f7e', timeout='null'}), log id: 5e75230d
2022-03-14 12:10:53,244+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-75) [dfd772a6-671f-42f5-abc4-627d2963923f] FINISH, RemoveImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 5e75230d

Operation succeeded:
2022-03-14 12:10:54,559+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [dfd772a6-671f-42f5-abc4-627d2963923f] EVENT_ID: TRANSFER_IMAGE_SUCCEEDED(1,032), Image Download with disk test succeeded.

The disk is in 'OK' state.

Moving to 'Verified'.

Comment 16 errata-xmlrpc 2022-03-24 13:30:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.4.10-3]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1052