Bug 1573184 - StreamingAPI - Transfer disk Lock is freed before transfer is complete - after Finalizing phase but before Finished Success
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.6
Target Release: 4.2.6
Assignee: Daniel Erez
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-30 12:09 UTC by Avihai
Modified: 2018-09-03 15:09 UTC
CC List: 3 users

Fixed In Version: ovirt-engine-4.2.6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-03 15:09:58 UTC
oVirt Team: Storage
rule-engine: ovirt-4.2+
rule-engine: ovirt-4.3+


Attachments
engine, vdsm, proxy, daemon logs (873.56 KB, application/x-gzip)
2018-04-30 12:09 UTC, Avihai


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 91674 master MERGED core: transfer image - unlock disk after teardown 2020-09-20 23:30:18 UTC
oVirt gerrit 93420 ovirt-engine-4.2 MERGED core: transfer image - unlock disk after teardown 2020-09-20 23:30:21 UTC

Description Avihai 2018-04-30 12:09:36 UTC
Created attachment 1428808 [details]
engine, vdsm, proxy, daemon logs

Description of problem:
IMHO, the root cause of this issue is that the disk lock is freed before transfer finalization is complete; to be more exact, after the Finalizing phase but before Finished Success (see the engine log below).

The impact does not seem high, as the downloaded disk is complete and the data is there.
However, a nasty NullPointerException is seen in the engine, and the final phase changes to Finished Failure.
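The failure mode can be illustrated with a minimal sketch (this is NOT engine code; the dict and function names here are hypothetical stand-ins): once the lock is freed early, a concurrent VM removal can delete the disk row, and the poller's Finished Success handler then dereferences a missing disk, which is the Python analogue of the NullPointerException in the log below.

```python
# Minimal illustrative sketch of the race, NOT engine code.
# 'disks' stands in for the engine database; all names are hypothetical.

disks = {"e121df20": {"alias": "vm1_Disk1"}}

def get_image_alias(disk_id):
    # Analogous to TransferDiskImageCommand.getImageAlias(): it assumes the
    # disk row still exists. If the VM (and its disk) was removed during the
    # window in which the lock was already freed, the lookup returns None and
    # the dereference fails -- the Python analogue of the engine's
    # NullPointerException.
    disk = disks.get(disk_id)
    return disk["alias"]

def remove_vm_and_disks(disk_id):
    # Stand-in for the concurrent RemoveVmCommand: deletes the disk row.
    disks.pop(disk_id, None)
```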

Engine log + timeline:
1) Transfer (download) initiated:
2018-04-30 14:31:16,017+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-14) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] EVENT_ID: TRANSFER_IMAGE_INITIATED(1,031), Image Download with disk vm1_Disk1 was initiated by admin@internal-authz.

2) Transfer finalizes successfully:
2018-04-30 14:33:55,326+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [196ff614] Updating image transfer 83781409-06ff-4db7-8e55-b49d3bfb382c (image e121df20-e1a7-46a9-8482-2aaec371ab42) phase to Finalizing Success

3) **** THIS IS THE MAIN CAUSE OF THE ISSUE, IMHO -> the lock on the disk is freed before transfer finalization is complete! ****

Lock is released:
2018-04-30 14:33:55,363+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferImageStatusCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [196ff614] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[e121df20-e1a7-46a9-8482-2aaec371ab42=DISK]'}'

4) Only after(!) the lock is released does the transfer move to phase "Finished Success":
2018-04-30 14:34:05,687+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-41) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Updating image transfer 83781409-06ff-4db7-8e55-b49d3bfb382c (image e121df20-e1a7-46a9-8482-2aaec371ab42) phase to Finished Success

2018-04-30 14:34:05,762+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.TeardownImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-41) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] START, TeardownImageVDSCommand(HostName = host_mixed_2, ImageActionsVDSCommandParameters:{hostId='14ddbea5-dd3b-44ae-9f39-c137b8d03344'}), log id: 567f6825
2018-04-30 14:34:06,435+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.TeardownImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-41) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] FINISH, TeardownImageVDSCommand, return: StatusReturn:{status='Status [code=0, message=Done]'}, log id: 567f6825


5) Remove VM (+ attached disk) is initiated:
2018-04-30 14:34:08,517+03 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (EE-ManagedThreadFactory-engine-Thread-816) [01f6d0a6-cccb-44c6-885a-0596d7090ab3] Running command: RemoveVmCommand internal: false. Entities affected :  ID: b6f8af84-7362-44dc-b48d-6ad2f1287604 Type: VMAction group DELETE_VM with role type USER

2018-04-30 14:34:11,181+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-816) [01f6d0a6-cccb-44c6-885a-0596d7090ab3] EVENT_ID: USER_REMOVE_VM_FINISHED(113), VM vm1 was successfully removed.

6) Exception seen:
2018-04-30 14:34:16,511+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Error invoking callback method 'doPolling' for 'ACTIVE' command '83781409-06ff-4db7-8e55-b49d3bfb382c'
2018-04-30 14:34:16,512+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.getImageAlias(TransferDiskImageCommand.java:172) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommand.getTransferDescription(TransferImageCommand.java:925) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommand.handleFinishedSuccess(TransferImageCommand.java:474) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommand.executeStateHandler(TransferImageCommand.java:188) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommand.proceedCommandExecution(TransferImageCommand.java:151) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommandCallback.doPolling(TransferImageCommandCallback.java:21) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:145) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:106) [bll.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_161]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_161]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_161]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_161]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

2018-04-30 14:34:17,541+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Updating image transfer 83781409-06ff-4db7-8e55-b49d3bfb382c (image e121df20-e1a7-46a9-8482-2aaec371ab42) phase to Finished Failure
2018-04-30 14:34:17,552+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Failed to transfer disk '00000000-0000-0000-0000-000000000000' (command id '83781409-06ff-4db7-8e55-b49d3bfb382c')
2018-04-30 14:34:17,561+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand' successfully.
2018-04-30 14:34:17,563+03 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Trying to release a shared lock for key: 'e121df20-e1a7-46a9-8482-2aaec371ab42DISK' , but lock does not exist
2018-04-30 14:34:17,563+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[e121df20-e1a7-46a9-8482-2aaec371ab42=DISK]'}'
2018-04-30 14:34:17,611+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Error during log command: org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand. Exception null


7) Transfer phase moves to Finished Failure:

2018-04-30 14:34:17,541+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Updating image transfer 83781409-06ff-4db7-8e55-b49d3bfb382c (image e121df20-e1a7-46a9-8482-2aaec371ab42) phase to Finished Failure

Version-Release number of selected component (if applicable):


How reproducible:
Not easily, as you need to remove the VM right after the transfer enters the Finalizing phase but before it reaches Finished Success.

However, this occurred several times in automation, and it was easily reproduced manually.

Steps to Reproduce:
1) Create a VM and attach a disk to it (I used a preallocated iSCSI disk).
2) Initiate a download via the UI/SDK.
3) Right after(!) the transfer finalizes successfully, remove the VM (and its attached disks).

Actual results:
VM removal starts, a NullPointerException occurs, and the transfer does not finish cleanly (it ends with the phase moved to Finished Failure).

This is because the disk lock is freed before transfer finalization is complete; to be more exact, after the Finalizing phase but before Finished Success.

Expected results:
The disk lock should not be freed before ALL transfer activity is done.


Additional info:

Comment 1 Avihai 2018-04-30 12:36:28 UTC
More info:
We did not test this scenario deliberately, as it was not in the test plan.
The issue popped up due to a problem in our automation (introduced about a week ago) that caused the disks NOT to be detached from the VM before starting the transfer.

I also succeeded in reproducing it manually via the UI on the first try:
just try to remove the VM right when the transfer reaches the end of the download, and you will hit it.

Or, in automation, remove the VM right after waiting for the disk to return to the 'OK' state. (I do not have an SDK script that does all of it, as our automation is only half SDK: the transfer is done via the SDK, and the VM removal via the REST API.)

Comment 2 Yaniv Kaul 2018-05-01 09:03:43 UTC
So if you wait a few seconds or so, all is good?

Comment 3 Avihai 2018-05-08 11:33:01 UTC
(In reply to Yaniv Kaul from comment #2)
> So if you wait a few seconds or so, all is good?

Yes, if you wait a few seconds (the number of seconds varied in each test), this does not occur, as transfer finalization has time to complete.
But that timeout solution amounts to betting on a blind timeout and hoping transfer finalization finishes before the disk is deleted.

IMHO, the better option is to wait until ALL transfer finalization tasks are done, and only then unlock the disk (which should be the fix for this bug).
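On the client side, the blind timeout can at least be replaced by polling the transfer phase until it is terminal. This is a hedged sketch: get_phase is a caller-supplied function (e.g. wrapping an SDK or REST lookup) and is an assumption; the phase strings follow the engine log above.

```python
import time

# Sketch of a client-side wait: poll the transfer phase until it is terminal
# instead of sleeping for a fixed, blind timeout. get_phase is a hypothetical
# caller-supplied callable; phase names follow the engine log in this bug.

TERMINAL_PHASES = {"Finished Success", "Finished Failure"}

def wait_for_transfer_completion(get_phase, timeout=60.0, interval=1.0):
    """Return the terminal phase, or raise TimeoutError if none is reached."""
    deadline = time.monotonic() + timeout
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "transfer did not reach a terminal phase within %.1fs "
                "(last phase: %s)" % (timeout, phase))
        time.sleep(interval)
```

Note that this only narrows the window from the client side; it does not replace the engine-side fix of holding the disk lock through finalization.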

Comment 4 Daniel Erez 2018-05-28 09:08:03 UTC
The lock release that appears in the log [1] is related only to TransferImageStatusCommand (the lock is taken in ImageTransferUpdater). However, we do unlock the disk in the database before completing the disk teardown. Hence, the suggested fix solves this issue.

[1]
[org.ovirt.engine.core.bll.storage.disk.image.TransferImageStatusCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [196ff614] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[e121df20-e1a7-46a9-8482-2aaec371ab42=DISK]'}'
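The direction of the merged fix ("transfer image - unlock disk after teardown", gerrit 91674/93420) is a reordering, which can be sketched as follows. This is illustrative only, NOT actual engine code; teardown_image and unlock_disk_in_db are hypothetical stand-ins for the corresponding engine steps.

```python
# Illustrative sketch of the fix ordering, NOT engine code.
# teardown_image / unlock_disk_in_db are hypothetical stand-in names.

def end_transfer_before_fix(calls):
    # Pre-fix ordering: the disk is unlocked in the database before the
    # image teardown completes, opening the window in which a concurrent
    # RemoveVmCommand can delete the disk mid-finalization.
    calls.append("unlock_disk_in_db")
    calls.append("teardown_image")
    return calls

def end_transfer_after_fix(calls):
    # Post-fix ordering: tear the image down first and only then unlock
    # the disk, so the disk stays protected for the whole finalization.
    calls.append("teardown_image")
    calls.append("unlock_disk_in_db")
    return calls
```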

Comment 5 Avihai 2018-08-07 13:27:15 UTC
Verified at 4.2.6-1.

