Bug 2001894

Summary: ImageTransfer phase changes to finished_success before the transfer finish and locks released
Product: [oVirt] ovirt-engine Reporter: Jean-Louis Dupond <jean-louis>
Component: BLL.StorageAssignee: Eyal Shenitzky <eshenitz>
Status: CLOSED DUPLICATE QA Contact: Avihai <aefrat>
Severity: high Docs Contact:
Priority: high    
Version: 4.4.8.4CC: bugs, nsoffer, pbar
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-04 06:54:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jean-Louis Dupond 2021-09-07 11:53:01 UTC
Description of problem:

We have a script that does the following flow:
1. Create snapshot
2. Download VM at the time of that snapshot
3. Delete snapshot

Now while this works fine 99% of the time, we see sometimes a failure when removing the snapshot with the following error:
Cannot remove Snapshot: VM is locked. Please try again in a few minutes.

The biggest issue here is that the imagetransfer.finalize() was already completed. And the VM is not in a locked state when checked via the API.
So we have no single way to check if the VM is still locked and delay the snapshot removal.

Image transfer status BEFORE trying to remove the snapshot:
DEBUG:root:< GET /ovirt-engine/api/imagetransfers/f852baf5-1d16-4ab1-ab8b-d226443a9a7d HTTP/1.1
DEBUG:root:< Host: xxxxx
DEBUG:root:< User-Agent: PythonSDK/4.4.15
DEBUG:root:< Version: 4
DEBUG:root:< Content-Type: application/xml
DEBUG:root:< Accept: application/xml
DEBUG:root:< Authorization: Bearer xxxxx
DEBUG:root:> HTTP/1.1 200 OK
DEBUG:root:> Date: Tue, 07 Sep 2021 10:54:33 GMT
DEBUG:root:> Server: Apache/2.4.37 (centos) OpenSSL/1.1.1k mod_wsgi/4.6.4 Python/3.6
DEBUG:root:> Content-Type: application/xml;charset=UTF-8
DEBUG:root:> Content-Length: 1440
DEBUG:root:> Correlation-Id: 661ff01c-3fd3-425e-9dcd-ca29e9c50ee5
DEBUG:root:> Vary: Accept-Encoding
DEBUG:root:> 
DEBUG:root:> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
DEBUG:root:> <image_transfer href="/ovirt-engine/api/imagetransfers/f852baf5-1d16-4ab1-ab8b-d226443a9a7d" id="f852baf5-1d16-4ab1-ab8b-d226443a9a7d">
DEBUG:root:>     <actions>
DEBUG:root:>         <link href="/ovirt-engine/api/imagetransfers/f852baf5-1d16-4ab1-ab8b-d226443a9a7d/resume" rel="resume"/>
DEBUG:root:>         <link href="/ovirt-engine/api/imagetransfers/f852baf5-1d16-4ab1-ab8b-d226443a9a7d/cancel" rel="cancel"/>
DEBUG:root:>         <link href="/ovirt-engine/api/imagetransfers/f852baf5-1d16-4ab1-ab8b-d226443a9a7d/pause" rel="pause"/>
DEBUG:root:>         <link href="/ovirt-engine/api/imagetransfers/f852baf5-1d16-4ab1-ab8b-d226443a9a7d/extend" rel="extend"/>
DEBUG:root:>         <link href="/ovirt-engine/api/imagetransfers/f852baf5-1d16-4ab1-ab8b-d226443a9a7d/finalize" rel="finalize"/>
DEBUG:root:>     </actions>
DEBUG:root:>     <active>true</active>
DEBUG:root:>     <direction>download</direction>
DEBUG:root:>     <format>raw</format>
DEBUG:root:>     <inactivity_timeout>60</inactivity_timeout>
DEBUG:root:>     <phase>finalizing_success</phase>
DEBUG:root:>     <proxy_url>https://xxxxx:54323/images/593e4fad-8c26-4808-8d5c-4895b1e31a28</proxy_url>
DEBUG:root:>     <shallow>false</shallow>
DEBUG:root:>     <timeout_policy>legacy</timeout_policy>
DEBUG:root:>     <transfer_url>https://xxxxx:54322/images/593e4fad-8c26-4808-8d5c-4895b1e31a28</transfer_url>
DEBUG:root:>     <transferred>34367668224</transferred>
DEBUG:root:>     <host href="/ovirt-engine/api/hosts/37150444-cf80-4831-bf49-457227d9a22e" id="37150444-cf80-4831-bf49-457227d9a22e"/>
DEBUG:root:> </image_transfer>

Snapshot removal:
DEBUG:root:< DELETE /ovirt-engine/api/vms/6703070e-669d-43b4-b4fc-e94fdc298516/snapshots/6d31946d-b759-410b-9a6f-a4f0df9dc328 HTTP/1.1
DEBUG:root:< Host: xxxxx
DEBUG:root:< User-Agent: PythonSDK/4.4.15
DEBUG:root:< Version: 4
DEBUG:root:< Content-Type: application/xml
DEBUG:root:< Accept: application/xml
DEBUG:root:< Authorization: Bearer xxxxx
DEBUG:root:> HTTP/1.1 409 Conflict
DEBUG:root:> Date: Tue, 07 Sep 2021 10:54:39 GMT
DEBUG:root:> Server: Apache/2.4.37 (centos) OpenSSL/1.1.1k mod_wsgi/4.6.4 Python/3.6
DEBUG:root:> Content-Type: application/xml;charset=UTF-8
DEBUG:root:> Content-Length: 207
DEBUG:root:> Correlation-Id: 8fe3a13b-ae59-4a39-bcb0-756396d5fba2
DEBUG:root:> 
DEBUG:root:> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
DEBUG:root:> <fault>
DEBUG:root:>     <detail>[Cannot remove Snapshot: VM is locked. Please try again in a few minutes.]</detail>
DEBUG:root:>     <reason>Operation Failed</reason>
DEBUG:root:> </fault>


Now as seen from the logs there seems to be indeed a race condition:
2021-09-07 12:54:33,990+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-7) [7096fca4-3c23-4173-a16f-c3039bc94ee6] Updating image transfer f852baf5-1d16-4ab1-ab8b-d226443a9a7d phase to Finished Success
2021-09-07 12:54:34,463+02 WARN  [org.ovirt.engine.api.restapi.util.LinkHelper] (default task-152) [] Can't find relative path for class "org.ovirt.engine.api.resource.StorageDomainVmDiskAttachmentsResource", will return null
2021-09-07 12:54:34,463+02 WARN  [org.ovirt.engine.api.restapi.util.LinkHelper] (default task-152) [] Can't find relative path for class "org.ovirt.engine.api.resource.StorageDomainVmDiskAttachmentsResource", will return null
2021-09-07 12:54:39,664+02 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-152) [8fe3a13b-ae59-4a39-bcb0-756396d5fba2] Failed to Acquire Lock to object 'EngineLock:{exclusiveLocks='[6703070e-669d-43b4-b4fc-e94fdc298516=VM]', sharedLocks=''}'
2021-09-07 12:54:39,664+02 WARN  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-152) [8fe3a13b-ae59-4a39-bcb0-756396d5fba2] Validation of action 'RemoveSnapshot' failed for user admin@internal-authz. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_VM_IS_LOCKED
2021-09-07 12:54:39,667+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-152) [] Operation Failed: [Cannot remove Snapshot: VM is locked. Please try again in a few minutes.]
2021-09-07 12:54:44,040+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [7096fca4-3c23-4173-a16f-c3039bc94ee6] Transfer was successful. Download disk 'srv001-000' (disk id: '37d540f5-0882-4573-bed3-1365510e6775', image id: '0f1e7cb7-f468-4dab-b5e9-d1
caf00586d0')
2021-09-07 12:54:45,044+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-59) [7096fca4-3c23-4173-a16f-c3039bc94ee6] Updating image transfer f852baf5-1d16-4ab1-ab8b-d226443a9a7d phase to Finished Success
2021-09-07 12:54:45,048+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-59) [7096fca4-3c23-4173-a16f-c3039bc94ee6] Successfully transferred disk '0f1e7cb7-f468-4dab-b5e9-d1caf00586d0' (command id 'f852baf5-1d16-4ab1-ab8b-d226443a9a7d')
2021-09-07 12:54:45,048+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-59) [7096fca4-3c23-4173-a16f-c3039bc94ee6] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand' successfully.
2021-09-07 12:54:45,049+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-59) [7096fca4-3c23-4173-a16f-c3039bc94ee6] Lock freed to object 'EngineLock:{exclusiveLocks='[37d540f5-0882-4573-bed3-1365510e6775=DISK]', sharedLocks='[6703070e-669d-43b4-b4fc-e94fdc298516=VM]'}'
2021-09-07 12:54:45,064+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-59) [7096fca4-3c23-4173-a16f-c3039bc94ee6] EVENT_ID: TRANSFER_IMAGE_SUCCEEDED(1,032), Image Download with disk srv001-000 succeeded.


We have some redundant updates to FINISHED_SUCCESS in the code:
1) https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/TransferDiskImageCommand.java#L825

I think this one gets called directly after FINALIZING_SUCCESS.

2) https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/TransferDiskImageCommand.java#L1418

This one is called AFTER the lock is removed and everything is finished.
Guess this is the only one we should keep?

Comment 1 RHEL Program Management 2021-09-07 15:14:08 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Nir Soffer 2021-09-12 17:22:09 UTC
Changing priority/severity to hight, this breaks backup functionality.

Comment 3 Nir Soffer 2021-09-12 17:30:56 UTC
(In reply to Jean-Louis Dupond from comment #0)
> We have some redundant updates to FINISHED_SUCCESS in the code:
> 1)
> https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/
> bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/
> TransferDiskImageCommand.java#L825

This is wrong. The FINISHED_* states should be set only after the 
command completed. They should not be part of the state machine.

> I think this one gets called directly after FINALIZING_SUCCESS.
> 
> 2)
> https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/
> bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/
> TransferDiskImageCommand.java#L1418
> 
> This one is called AFTER the lock is removed and everything is finished.
> Guess this is the only one we should keep?

Yes, but even this is wrong, it should set the phase *after* calling 
endSuccessfuly().

These issues were hidden when we deleted the transfer quickly after
completion. Now that we keep the transfer for a while, clients waiting
correctly for completion are broken.

Comment 4 Eyal Shenitzky 2021-09-19 08:09:03 UTC
Pavel, is the root cause for this bug is the same as bug 1923178 ?

Comment 5 Pavel Bar 2021-10-03 15:20:28 UTC
(In reply to Eyal Shenitzky from comment #4)
> Pavel, is the root cause for this bug is the same as bug 1923178 ?

Yes, after looking at it I see the root case is the same as bug 1923178.
It's just another interesting flow in which we fail on being unable to lock the VM vs. being unable to lock the disk (as in bug 1923178).
But the reason is the same: "imagetransfer.finalize_transfer()" got "FINISHED_SUCCESS" too soon, before the disk and VM locks were actually released.

Comment 6 Eyal Shenitzky 2021-10-04 06:54:34 UTC
(In reply to Pavel Bar from comment #5)
> (In reply to Eyal Shenitzky from comment #4)
> > Pavel, is the root cause for this bug is the same as bug 1923178 ?
> 
> Yes, after looking at it I see the root case is the same as bug 1923178.
> It's just another interesting flow in which we fail on being unable to lock
> the VM vs. being unable to lock the disk (as in bug 1923178).
> But the reason is the same: "imagetransfer.finalize_transfer()" got
> "FINISHED_SUCCESS" too soon, before the disk and VM locks were actually
> released.

Thanks.
Closing as duplication of bug 1923178

*** This bug has been marked as a duplicate of bug 1923178 ***