Created attachment 1428808 [details]
engine, vdsm, proxy, daemon logs

Description of problem:
The root cause of this issue appears to be that the disk lock is freed before transfer finalization is complete, or to be more exact, after the Finalizing phase but before Finished Success (see engine log below). The impact does not seem high, as the downloaded disk is complete and the data is intact, but a nasty exception is raised in the engine and the final phase changes to Finished Failure.

Engine log + timeline:

1) Transfer (download) initiated:

2018-04-30 14:31:16,017+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-14) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] EVENT_ID: TRANSFER_IMAGE_INITIATED(1,031), Image Download with disk vm1_Disk1 was initiated by admin@internal-authz.

2) Transfer finalizing completed successfully:

2018-04-30 14:33:55,326+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [196ff614] Updating image transfer 83781409-06ff-4db7-8e55-b49d3bfb382c (image e121df20-e1a7-46a9-8482-2aaec371ab42) phase to Finalizing Success

3) **** THIS IS THE MAIN CAUSE OF THE ISSUE, IMHO -> the disk lock is freed before transfer finalization is complete! ****

Lock is released:

2018-04-30 14:33:55,363+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferImageStatusCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [196ff614] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[e121df20-e1a7-46a9-8482-2aaec371ab42=DISK]'}'

4) After(!) the lock is released, the transfer moves to state "Finished Success":

2018-04-30 14:34:05,687+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-41) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Updating image transfer 83781409-06ff-4db7-8e55-b49d3bfb382c (image e121df20-e1a7-46a9-8482-2aaec371ab42) phase to Finished Success
2018-04-30 14:34:05,762+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.TeardownImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-41) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] START, TeardownImageVDSCommand(HostName = host_mixed_2, ImageActionsVDSCommandParameters:{hostId='14ddbea5-dd3b-44ae-9f39-c137b8d03344'}), log id: 567f6825
2018-04-30 14:34:06,435+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.TeardownImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-41) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] FINISH, TeardownImageVDSCommand, return: StatusReturn:{status='Status [code=0, message=Done]'}, log id: 567f6825

5) Remove VM + attached disk is initiated:

2018-04-30 14:34:08,517+03 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (EE-ManagedThreadFactory-engine-Thread-816) [01f6d0a6-cccb-44c6-885a-0596d7090ab3] Running command: RemoveVmCommand internal: false. Entities affected : ID: b6f8af84-7362-44dc-b48d-6ad2f1287604 Type: VMAction group DELETE_VM with role type USER
2018-04-30 14:34:11,181+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-816) [01f6d0a6-cccb-44c6-885a-0596d7090ab3] EVENT_ID: USER_REMOVE_VM_FINISHED(113), VM vm1 was successfully removed.
6) Exception seen:

2018-04-30 14:34:16,511+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Error invoking callback method 'doPolling' for 'ACTIVE' command '83781409-06ff-4db7-8e55-b49d3bfb382c'
2018-04-30 14:34:16,512+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Exception: java.lang.NullPointerException
	at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.getImageAlias(TransferDiskImageCommand.java:172) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommand.getTransferDescription(TransferImageCommand.java:925) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommand.handleFinishedSuccess(TransferImageCommand.java:474) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommand.executeStateHandler(TransferImageCommand.java:188) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommand.proceedCommandExecution(TransferImageCommand.java:151) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommandCallback.doPolling(TransferImageCommandCallback.java:21) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:145) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:106) [bll.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_161]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_161]
	at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_161]
	at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

2018-04-30 14:34:17,541+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Updating image transfer 83781409-06ff-4db7-8e55-b49d3bfb382c (image e121df20-e1a7-46a9-8482-2aaec371ab42) phase to Finished Failure
2018-04-30 14:34:17,552+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Failed to transfer disk '00000000-0000-0000-0000-000000000000' (command id '83781409-06ff-4db7-8e55-b49d3bfb382c')
2018-04-30 14:34:17,561+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand' successfully.
2018-04-30 14:34:17,563+03 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Trying to release a shared lock for key: 'e121df20-e1a7-46a9-8482-2aaec371ab42DISK' , but lock does not exist
2018-04-30 14:34:17,563+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[e121df20-e1a7-46a9-8482-2aaec371ab42=DISK]'}'
2018-04-30 14:34:17,611+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Error during log command: org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand. Exception null

7) Transfer phase moves to Finished Failure!

2018-04-30 14:34:17,541+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [c3247df1-7a05-4485-abf0-b3f2a4c616b2] Updating image transfer 83781409-06ff-4db7-8e55-b49d3bfb382c (image e121df20-e1a7-46a9-8482-2aaec371ab42) phase to Finished Failure

Version-Release number of selected component (if applicable):

How reproducible:
Not easily, as you need to remove the VM right after the transfer reaches the Finalizing phase but before Finished Success. However, this occurred several times in automation and was easily reproduced manually.

Steps to Reproduce:
1) Create a VM and attach a disk to it (I used a preallocated iSCSI disk).
2) Initiate a download via the UI/SDK.
3) Right after(!) the transfer finalizes successfully, remove the VM (+ attached disks).

Actual results:
The VM removal starts, a NullPointerException occurs, and the disk transfer does not finish cleanly (it ends with the phase moved to Finished Failure). This is due to the disk lock being freed before transfer finalization is complete, or to be more exact, after the Finalizing phase but before Finished Success.

Expected results:
The disk lock should not be freed before ALL transfer activity is done.

Additional info:
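As an illustration of the race described above, here is a minimal standalone Java sketch. All names in it (TransferLockRaceSketch, diskAlias, etc.) are hypothetical stand-ins, not the actual engine classes; it only simulates the ordering in which the lock release at Finalizing Success lets the disk be removed before the poller reaches Finished Success:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical simulation of the race: the disk lock is dropped at
// "Finalizing Success", so RemoveVm can delete the disk before the
// transfer reaches "Finished Success", and the poller then hits an NPE.
public class TransferLockRaceSketch {
    static final AtomicReference<String> diskAlias = new AtomicReference<>("vm1_Disk1");
    static final CountDownLatch lockFreed = new CountDownLatch(1);
    static final CountDownLatch diskRemoved = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread transfer = new Thread(() -> {
            System.out.println("phase -> Finalizing Success");
            lockFreed.countDown();          // BUG: lock released too early
            awaitQuietly(diskRemoved);      // teardown still in flight...
            try {
                // Stand-in for getTransferDescription()/getImageAlias():
                // the disk row is already gone, so this dereferences null.
                System.out.println("disk: " + diskAlias.get().length());
                System.out.println("phase -> Finished Success");
            } catch (NullPointerException e) {
                System.out.println("NPE in poller; phase -> Finished Failure");
            }
        });
        Thread removeVm = new Thread(() -> {
            awaitQuietly(lockFreed);        // RemoveVm is no longer blocked
            diskAlias.set(null);            // VM and its attached disk deleted
            System.out.println("RemoveVmCommand: vm1 removed");
            diskRemoved.countDown();
        });
        transfer.start();
        removeVm.start();
        transfer.join();
        removeVm.join();
    }

    static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException ignored) { }
    }
}

Running it always ends with the simulated "Finished Failure", mirroring the engine log above.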
More info: We did not test this deliberately, as it was not in the test plan. The issue popped up due to a problem with our automation (which started about a week ago) that caused the disks NOT to be detached from the VM before starting the transfer. I also succeeded in reproducing it manually via the UI on the first try: just remove the VM right when the transfer reaches the end of the download and you will hit it. In automation, remove the VM right after waiting for the disk to return to the 'OK' state. (I do not have an SDK script that does all of it, as our automation is half SDK, for the transfer, while the VM removal is done via the REST API.)
So if you wait a few seconds or so, all is good?
(In reply to Yaniv Kaul from comment #2)
> So if you wait a few seconds or so, all is good?

Yes, if you wait a few seconds (the number of seconds varied in each test) this will not occur, as transfer finalization will have completed. But that amounts to betting on a blind timeout and hoping finalization finishes before the disk is deleted. The better option, which should be the fix for this bug, is to wait until ALL transfer finalization tasks are done and only then unlock the disk.
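Until an engine-side fix lands, a sketch of what the safe client-side sequencing looks like: poll the transfer phase itself (not the disk status, which returns to 'OK' as soon as the lock is freed) and remove the VM only after a terminal phase is reported. fetchPhase() and removeVm() below are hypothetical stubs, not real SDK/REST calls:

// Hedged client-side mitigation sketch: instead of a blind sleep, poll
// the transfer until it reports a terminal phase, then remove the VM.
public class WaitForTransferSketch {
    public static void main(String[] args) throws InterruptedException {
        String transferId = "83781409-06ff-4db7-8e55-b49d3bfb382c";
        String phase;
        do {
            Thread.sleep(1_000);            // poll interval, not a blind timeout
            phase = fetchPhase(transferId); // hypothetical REST/SDK lookup
        } while (!phase.startsWith("finished_"));
        removeVm("vm1");                    // safe only after a terminal phase
    }

    // Stubs so the sketch compiles; a real client would call the API here.
    static String fetchPhase(String transferId) { return "finished_success"; }
    static void removeVm(String vmName) { System.out.println("removing " + vmName); }
}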
The lock release that appears in the log [1] is related only to TransferImageStatusCommand (the lock is taken in ImageTransferUpdater). However, we do unlock the disk in the database before completing the disk teardown, hence the suggested fix solves this issue.

[1] [org.ovirt.engine.core.bll.storage.disk.image.TransferImageStatusCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [196ff614] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[e121df20-e1a7-46a9-8482-2aaec371ab42=DISK]'}'
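A sketch of the suggested ordering, with hypothetical method names rather than the real engine code: complete the image teardown first, then move the transfer to Finished Success, and only then release the disk lock in the database, closing the window in which RemoveVm could delete the disk mid-finalization:

// Hypothetical ordering sketch; the stubs stand in for the VDS teardown
// call, the phase update, and the DB unlock.
public class FinalizeOrderingSketch {
    enum Phase { FINALIZING_SUCCESS, FINISHED_SUCCESS }

    public static void main(String[] args) {
        new FinalizeOrderingSketch().handleFinalizingSuccess();
    }

    void handleFinalizingSuccess() {
        teardownImage();                    // wait for teardown to finish first
        setPhase(Phase.FINISHED_SUCCESS);   // terminal phase reached...
        unlockDiskInDb();                   // ...and only now free the disk
    }

    void teardownImage()   { System.out.println("TeardownImage: Done"); }
    void setPhase(Phase p) { System.out.println("phase -> " + p); }
    void unlockDiskInDb()  { System.out.println("disk unlocked in DB"); }
}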
verified at 4.2.6-1