Bug 2057445 - VM disk remains in locked state if image transfer (image download) times out
Summary: VM disk remains in locked state if image transfer (image download) times out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.10-3
Target Release: 4.4.10
Assignee: Benny Zlotnik
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-23 12:07 UTC by Raul Aldaz
Modified: 2022-08-21 07:56 UTC
CC List: 9 users

Fixed In Version: ovirt-engine-4.4.10.7
Doc Type: Bug Fix
Doc Text:
Previously, when an image transfer failed during cleanup, the cleanup was not retried as it should, and the disk remained locked. In this release, the retry mechanism for the transfer cleanup phase works correctly, and the disk is unlocked once the transfer is complete.
Clone Of:
Environment:
Last Closed: 2022-03-24 13:30:22 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments: None


Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-engine pull 112 0 None open core: fix retry for transfer image stopImageTransferSession failue 2022-03-02 12:38:41 UTC
Github oVirt ovirt-engine pull 125 0 None open core: fix retry for transfer image stopImageTransferSession failure 2022-03-06 19:24:14 UTC
Red Hat Issue Tracker RHV-44767 0 None None None 2022-02-23 12:17:26 UTC
Red Hat Knowledge Base (Solution) 6761491 0 None None None 2022-02-25 13:25:12 UTC
Red Hat Product Errata RHBA-2022:1052 0 None None None 2022-03-24 13:30:25 UTC

Description Raul Aldaz 2022-02-23 12:07:33 UTC
Description of problem:


During a NetBackup job, the image of a VM snapshot is downloaded successfully, but the disk is left in a locked state after the client connection times out and the ticket is closed on ovirt-imageio.

This looks similar to BZ #1796231.

Engine logs show a timeout during ticket cancellation and a failure to remove the image ticket, which leaves the image locked:

~~~
2022-02-15 20:45:49,697+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] Failed in 'RemoveImageTicketVDS' method
2022-02-15 20:45:49,703+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM <hostname> command RemoveImageTicketVDS failed: Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n'
2022-02-15 20:45:49,703+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand' return value 'StatusOnlyReturn [status=Status [code=482, message=Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n']]'
2022-02-15 20:45:49,703+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] HostName = <hostname>
2022-02-15 20:45:49,703+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] Command 'RemoveImageTicketVDSCommand(HostName = <hostname>, RemoveImageTicketVDSCommandParameters:{hostId='86ba23d1-8f2d-4852-93d8-c323ef879f76', ticketId='07fe38f2-c130-45f0-84b3-51c399936899', timeout='null'})' execution failed: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n', code = 482
2022-02-15 20:45:49,703+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] FINISH, RemoveImageTicketVDSCommand, return: , log id: 5f8b94d0
2022-02-15 20:45:49,703+01 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [f09f0e1c-4123-4778-917c-a7c823b42dc6] Failed to stop image transfer session for ticket '07fe38f2-c130-45f0-84b3-51c399936899': {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n', code = 482 (Failed with error ImageDeamonError and code 482)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:114)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.removeImageTicketFromDaemon(TransferDiskImageCommand.java:1270)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.stopImageTransferSession(TransferDiskImageCommand.java:1244)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.handleFinalizingSuccess(TransferDiskImageCommand.java:811)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.executeStateHandler(TransferDiskImageCommand.java:526)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.proceedCommandExecution(TransferDiskImageCommand.java:492)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommandCallback.doPolling(TransferImageCommandCallback.java:21)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at org.glassfish.javax.enterprise.concurrent.1.redhat-00001//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:360)
        at org.glassfish.javax.enterprise.concurrent.1.redhat-00001//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:511)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
        at org.glassfish.javax.enterprise.concurrent.1.redhat-00001//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:227)
Caused by: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899\n', code = 482

2022-02-15 20:47:18,057+01 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-248) [30219dfb-71f3-4537-97e6-21ccd0618719] Lock Acquired to object 'EngineLock:{exclusiveLocks='[421cd54e-9f41-d987-976e-2e259c5e218f=VM]', sharedLocks=''}'
2022-02-15 20:47:18,089+01 WARN  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-248) [30219dfb-71f3-4537-97e6-21ccd0618719] Validation of action 'RemoveSnapshot' failed for user <username>. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_DISKS_LOCKED,$diskAliases <VM name>
2022-02-15 20:47:18,089+01 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-248) [30219dfb-71f3-4537-97e6-21ccd0618719] Lock freed to object 'EngineLock:{exclusiveLocks='[421cd54e-9f41-d987-976e-2e259c5e218f=VM]', sharedLocks=''}'
~~~

ovirt-imageio logs show the timeout being reached after the default 60 seconds set by the remove_timeout config option:

~~~
2022-02-15 20:44:49,693 DEBUG   (Thread-518) [auth] Cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899
2022-02-15 20:44:49,693 INFO    (Thread-518) [auth] Waiting until ticket 07fe38f2-c130-45f0-84b3-51c399936899 is unused
2022-02-15 20:45:49,693 ERROR   (Thread-518) [http] Client error: Timeout cancelling ticket 07fe38f2-c130-45f0-84b3-51c399936899
2022-02-15 20:46:59,153 DEBUG   (Thread-455) [auth] Removing connection 455 context from ticket 07fe38f2-c130-45f0-84b3-51c399936899
2022-02-15 20:46:59,155 DEBUG   (Thread-456) [auth] Removing connection 456 context from ticket 07fe38f2-c130-45f0-84b3-51c399936899
2022-02-15 20:46:59,158 DEBUG   (Thread-456) [auth] Removing last connection for ticket 07fe38f2-c130-45f0-84b3-51c399936899
~~~
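
For context, the behavior in these logs matches the semantics described in comment 1 below: cancelling a ticket waits until no connection is using it, up to remove_timeout, and then fails (the daemon surfaces this to clients as "409 Conflict"). A minimal sketch of that wait-until-unused pattern, as an illustration only; the Ticket class and method names here are assumptions, not ovirt-imageio's actual code:

~~~python
# Illustrative sketch of "wait until ticket is unused, then time out";
# not ovirt-imageio source.
import threading
import time

class Ticket:
    def __init__(self, uuid):
        self.uuid = uuid
        self.connections = 0          # connections currently using the ticket
        self._cond = threading.Condition()

    def add_connection(self):
        with self._cond:
            self.connections += 1

    def remove_connection(self):
        with self._cond:
            self.connections -= 1
            if self.connections == 0:
                self._cond.notify_all()

    def cancel(self, remove_timeout=60):
        # Wait until the ticket is unused, as in the "Waiting until ticket
        # ... is unused" log line above, or give up after remove_timeout.
        deadline = time.monotonic() + remove_timeout
        with self._cond:
            while self.connections > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError(
                        "Timeout cancelling ticket %s" % self.uuid)
                self._cond.wait(remaining)
~~~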


Version-Release number of selected component (if applicable):

ovirt-engine-4.4.9.5-0.1.el8ev.noarch
ovirt-imageio-daemon-2.3.0-1.el8ev.x86_64


How reproducible:

Not reproducible so far.



Actual results:

The snapshot image is left in a locked state, which prevents its removal and blocks further backup jobs.


Expected results:

The snapshot image's locked state should be cleared so that it can be removed.


Additional info:

According to the associated support case, this behavior appeared after an upgrade from 4.4.7 to 4.4.9.

Comment 1 Juan Orti 2022-02-25 12:50:40 UTC
Based on the following comment, the "409 Conflict" error should be retried by the client, but it looks like there is no attempt to retry removing the ticket:

https://github.com/oVirt/ovirt-imageio/blob/d5e9c757e0a44fdd88d9e54b77283e718407b9cc/ovirt_imageio/_internal/config.py#L166

    # Number of seconds to wait when removing a ticket. If ticket cannot be
    # removed within this timeout, the request will fail with "409 Conflict",
    # and the user need to retry the request again. A ticket can be removed
    # only when the number of connections using the ticket is zero.
    remove_timeout = 60
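
To illustrate the retry the config comment calls for, here is a hedged sketch of a client that retries DELETE /tickets/<uuid> while the daemon keeps answering 409 Conflict. The host/port arguments and the helper name are assumptions for illustration; in RHV the engine drives ticket removal through vdsm rather than calling the daemon like this.

~~~python
# Hypothetical client-side retry for ticket removal; not engine/vdsm code.
import http.client
import time

def remove_ticket_with_retry(host, port, uuid, retries=5, delay=10):
    for attempt in range(1, retries + 1):
        conn = http.client.HTTPConnection(host, port)
        try:
            conn.request("DELETE", "/tickets/%s" % uuid)
            resp = conn.getresponse()
            if resp.status in (200, 204):
                return  # ticket removed
            if resp.status != 409:
                # Anything other than Conflict is not retryable here.
                raise RuntimeError("status=%s, reason=%s"
                                   % (resp.status, resp.reason))
        finally:
            conn.close()
        # 409 Conflict: connections still hold the ticket; wait and retry.
        time.sleep(delay)
    raise TimeoutError("could not remove ticket %s after %d attempts"
                       % (uuid, retries))
~~~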

Comment 9 Benny Zlotnik 2022-03-07 09:33:17 UTC
Reproducer with a vdsm code change:

edit /usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py +95

# Simulate a transient failure: the first call to remove_ticket raises,
# subsequent calls succeed, so the engine's retry can be observed.
failed_once = False

@requires_image_daemon
def remove_ticket(uuid):
    global failed_once

    if not failed_once:
        # Fail the first removal attempt (the "ticked" typo here matches
        # the engine log lines quoted below).
        failed_once = True
        raise Exception("failed remove ticked")

    failed_once = False
    _request("DELETE", uuid)

After editing, restart vdsm:
$ systemctl restart vdsmd

Downloading a disk should show an exception like this:
2022-03-07 04:28:29,683-05 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-13) [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] Failed to stop image transfer 'a82a9461-da19-477e-8db5-b29370462ed9' for ticket '4a4f5a28-a53f-4dd7-a9a6-9ad4ab280b2d': {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = failed remove ticked, code = 100 (Failed with error GeneralException and code 100)

However, it will be retried and should succeed:
2022-03-07 04:28:33,720-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] START, RemoveImageTicketVDSCommand(HostName = hosto72, RemoveImageTicketVDSCommandParameters:{hostId='7ee44999-d01b-43c2-93dc-499f6345a9fe', ticketId='4a4f5a28-a53f-4dd7-a9a6-9ad4ab280b2d', timeout='null'}), log id: 5789d81c

2022-03-07 04:28:33,735-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [d002a3d7-d9c1-41f7-a420-9f5ae48fcaad] FINISH, RemoveImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 5789d81c

The disk should be unlocked, and subsequent transfers on the same disk should succeed.
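
In other words, the fixed flow keeps the transfer in its cleanup phase and re-attempts stopping the session instead of giving up after one failure and leaving the disk locked. A minimal sketch of that behavior; the function and parameter names are illustrative assumptions, not the engine's actual Java implementation:

~~~python
# Illustrative retry wrapper; the real logic lives in the engine's
# TransferDiskImageCommand polling, not in a loop like this.
import logging
import time

log = logging.getLogger("transfer")

def stop_session_with_retry(remove_ticket, ticket_id, retries=3, delay=5):
    for attempt in range(1, retries + 1):
        try:
            remove_ticket(ticket_id)
            log.info("Ticket %s removed; disk can be unlocked", ticket_id)
            return True
        except Exception as e:
            log.error("Attempt %d to stop the transfer session failed: %s",
                      attempt, e)
            time.sleep(delay)
    # Cleanup still failing: report failure instead of silently finishing
    # with a locked disk.
    return False
~~~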

Comment 14 Evelina Shames 2022-03-14 10:26:22 UTC
(In reply to Benny Zlotnik from comment #9)
> Reproducer with a vdsm code change:
> [...]

Verified with the above steps on ovirt-engine-4.4.10.7:
Downloading a disk showed an exception:
2022-03-14 12:10:43,211+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-28) [dfd772a6-671f-42f5-abc4-627d2963923f] Failed to stop image transfer session for ticket 'a5c2a518-9e9a-4690-93eb-0af5cd4b8f7e': {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RemoveImageTicketVDS, error = failed remove ticked, code = 100 (Failed with error GeneralException and code 100)

Retried:
2022-03-14 12:10:53,234+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-75) [dfd772a6-671f-42f5-abc4-627d2963923f] START, RemoveImageTicketVDSCommand(HostName = host_mixed_1, RemoveImageTicketVDSCommandParameters:{hostId='15614953-8aef-415c-bc8f-de97af12ee6f', ticketId='a5c2a518-9e9a-4690-93eb-0af5cd4b8f7e', timeout='null'}), log id: 5e75230d
2022-03-14 12:10:53,244+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-75) [dfd772a6-671f-42f5-abc4-627d2963923f] FINISH, RemoveImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 5e75230d

Operation succeeded:
2022-03-14 12:10:54,559+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [dfd772a6-671f-42f5-abc4-627d2963923f] EVENT_ID: TRANSFER_IMAGE_SUCCEEDED(1,032), Image Download with disk test succeeded.

The disk is in 'OK' state.

Moving to 'Verified'.

Comment 16 errata-xmlrpc 2022-03-24 13:30:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.4.10-3]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1052

