Bug 1598996 - deleteVolume task is not cleared after a failed LSM due to power off VM during snapshot creation
Summary: deleteVolume task is not cleared after a failed LSM due to power off VM during snapshot creation
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.4.1
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.2.7
Target Release: ---
Assignee: Benny Zlotnik
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-07 17:07 UTC by Elad
Modified: 2018-11-02 14:31 UTC
CC: 3 users

Fixed In Version: ovirt-engine-4.2.7
Clone Of:
Environment:
Last Closed: 2018-11-02 14:31:08 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+


Attachments
logs (16.71 MB, application/x-gzip)
2018-07-16 14:16 UTC, Elad
logs (1.20 MB, application/x-gzip)
2018-08-22 08:59 UTC, Yosi Ben Shimon


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 93120 0 master MERGED core: track DestroyImage command 2018-08-08 12:10:23 UTC
oVirt gerrit 93577 0 ovirt-engine-4.2 MERGED core: track DestroyImage command 2018-08-12 14:57:21 UTC
oVirt gerrit 93938 0 master MERGED core: detach DestroyImageCommand from parent 2018-08-29 09:55:23 UTC
oVirt gerrit 93999 0 ovirt-engine-4.2 MERGED core: detach DestroyImageCommand from parent 2018-08-30 08:46:51 UTC

Description Elad 2018-07-07 17:07:36 UTC
Description of problem:
After a failure to live migrate a disk due to VM power off during LSM's live snapshot creation, the deleteVolume task is not cleared on vdsm.


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.4.5-0.1.el7_3.noarch
vdsm-4.20.33-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Live migrate a VM disk and, during LSM's snapshot creation, power off the VM (a scripted sketch of this race follows)
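
A scripted sketch of this race using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, VM name, target storage domain name, and the sleep interval are all placeholder assumptions, and the power-off has to land while the LSM snapshot is still being created, so the timing may need tuning:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
system = connection.system_service()

# Locate the running VM and one of its disk attachments (names are placeholders).
vm = system.vms_service().list(search='name=test_vm')[0]
vm_service = system.vms_service().vm_service(vm.id)
attachment = vm_service.disk_attachments_service().list()[0]
disk_service = system.disks_service().disk_service(attachment.disk.id)

# Trigger the live storage migration (MoveDiskCommand on the engine side).
disk_service.move(storage_domain=types.StorageDomain(name='target_sd'))

# Power the VM off while the LSM snapshot is still being created.
time.sleep(2)  # racy by design; adjust so the stop hits mid-snapshot
vm_service.stop()

connection.close()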


Actual results:

LSM starts:

2018-07-07 06:01:30,483+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-63) [disks_syncAction_2e3f3f46-2d97-479f] Lock Acquired to object 'EngineLock:{exclusiveLocks='[4ea04b6e-9ed0-466f-afd5-728e053f9ea2=DISK]', sharedLocks=''}'
2018-07-07 06:01:30,485+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-63) [disks_syncAction_2e3f3f46-2d97-479f] Running command: MoveDiskCommand internal: false. Entities affected :  ID: 4ea04b6e-9ed0-466f-afd5-728e053f9ea2 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER,  ID: 9fc25352-4296-4e11-8aaa-c28bff1cd711 Type: StorageAction group CREATE_DISK with role type USER


Volume creation for LSM's snapshot:

2018-07-07 06:01:30,892+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-63) [disks_syncAction_2e3f3f46-2d97-479f] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='a43f9eca-8147-11e8-b6d6-00163e7bd001', ignoreFailoverLimit='false', storageDomainId='f008e138-6e5b-4cc5-9493-966060017522', imageGroupId='4ea04b6e-9ed0-466f-afd5-728e053f9ea2', imageSizeInBytes='1073741824', volumeFormat='COW', newImageId='464058e8-1196-4890-a003-bad0d967cb94', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='623f40d0-ab1f-4b77-87d4-7e7745767134', sourceImageGroupId='4ea04b6e-9ed0-466f-afd5-728e053f9ea2'}), log id: 1f251a13


In vdsm:

2018-07-07 06:01:30,896+0300 INFO  (jsonrpc/0) [vdsm.api] START createVolume(sdUUID='f008e138-6e5b-4cc5-9493-966060017522', spUUID='a43f9eca-8147-11e8-b6d6-00163e7bd001', imgUUID='4ea04b6e-9ed0-466f-afd5-728e053f9ea2', size='1073741824', volFormat=4, preallocate=2, diskType='DATA', volUUID='464058e8-1196-4890-a003-bad0d967cb94', desc=u'', srcImgUUID='4ea04b6e-9ed0-466f-afd5-728e053f9ea2', srcVolUUID='623f40d0-ab1f-4b77-87d4-7e7745767134', initialSize=None) from=::ffff:10.35.162.71,46396, flow_id=disks_syncAction_2e3f3f46-2d97-479f, task_id=d28bfc2b-f49d-49d1-b3dc-9700f37a6ed9 (api:46)


VM got stopped:

2018-07-07 06:01:33,670+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-2) [] START, DestroyVDSCommand(HostName = host_mixed_2, DestroyVmVDSCommandParameters:{hostId='6e8926ea-7859-47e1-8d9b-4e59238914de', vmId='5824e6f0-3337-4cec-a5ba-0853281788ac', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 539a8312
2018-07-07 06:01:33,672+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-2) [] Failed to destroy VM '5824e6f0-3337-4cec-a5ba-0853281788ac' because VM does not exist, ignoring
2018-07-07 06:01:33,672+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-2) [] FINISH, DestroyVDSCommand, log id: 539a8312
2018-07-07 06:01:33,672+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] VM '5824e6f0-3337-4cec-a5ba-0853281788ac'(vm_TestCase5967_0705593609) moved from 'Up' --> 'Down'
2018-07-07 06:01:33,677+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-44) [vms_syncAction_cfcd8adc-f69a-45e3] EVENT_ID: USER_STOP_VM(33), VM vm_TestCase5967_0705593609 powered off by admin@internal-authz (Host: host_mixed_2).


deleteVolume starts (LSM rollback) with task id 259861b9-bea1-4898-a3a2-49f6b0b151f5:


2018-07-07 06:01:35,122+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='259861b9-bea1-4898-a3a2-49f6b0b151f5') moving from state preparing -> state preparing (task:602)
2018-07-07 06:01:35,122+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='259861b9-bea1-4898-a3a2-49f6b0b151f5') moving from state preparing -> state preparing (task:602)
2018-07-07 06:01:35,122+0300 INFO  (jsonrpc/7) [vdsm.api] START deleteVolume(sdUUID='f008e138-6e5b-4cc5-9493-966060017522', spUUID='a43f9eca-8147-11e8-b6d6-00163e7bd001', imgUUID='4ea04b6e-9ed0-466f-afd5-728e053f9ea2', volumes=['464058e8-1196-4890-a003-bad0d967cb94'], postZero='true', force='true', discard=True) from=::ffff:10.35.162.71,46396, flow_id=disks_syncAction_2e3f3f46-2d97-479f, task_id=259861b9-bea1-4898-a3a2-49f6b0b151f5 (api:46)


The task remains in 'finished' state for ~10 hours:

2018-07-07 16:49:00,185+0300 INFO  (jsonrpc/5) [vdsm.api] FINISH getAllTasksStatuses return={'allTasksStatus': {'259861b9-bea1-4898-a3a2-49f6b0b151f5': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '259861b9-bea1-4898-a3a2-49f6b0b151f5'}}} from=::ffff:10.35.162.71,53814, flow_id=29222a1e, task_id=101c7515-b8df-4690-adda-15f5eac50784 (api:52)




Only at this point is it cleared:

2018-07-07 16:49:00,207+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [29222a1e] FINISH, HSMClearTaskVDSCommand, log id: 1b5dd4c8
2018-07-07 16:49:00,208+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [29222a1e] FINISH, SPMClearTaskVDSCommand, log id: 1ed14821
2018-07-07 16:49:00,212+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [29222a1e] BaseAsyncTask::removeTaskFromDB: Removed task '259861b9-bea1-4898-a3a2-49f6b0b151f5' from DataBase


2018-07-07 16:49:00,189+0300 INFO  (jsonrpc/1) [vdsm.api] START clearTask(taskID='259861b9-bea1-4898-a3a2-49f6b0b151f5', spUUID=None, options=None) from=::ffff:10.35.162.71,53814, flow_id=29222a1e, task_id=21650601-2709-41e3-ab74-759534f39e33 (api:46)
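
For reference, the getAllTasksStatuses/clearTask cycle above can be mimicked by hand on the SPM host. A minimal sketch, assuming vdsm 4.20's Python client module (vdsm.client) and local access to vdsmd; note that only a task that has reached the 'finished' state may be cleared:

from vdsm import client

# Connect to the local vdsmd (default port 54321, TLS by default).
cli = client.connect('localhost', 54321)

# Equivalent of `vdsm-client Host getAllTasks`: a map of task id -> task info.
tasks = cli.Host.getAllTasks()
for task_id, info in tasks.items():
    if info['state'] == 'finished':
        # Equivalent of `vdsm-client Task clear`: drop the finished task.
        cli.Task.clear(taskID=task_id)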



Expected results:
The task should be cleared

Comment 1 Benny Zlotnik 2018-07-16 12:33:54 UTC
Do we have full logs for this?

Comment 2 Elad 2018-07-16 14:16:36 UTC
Created attachment 1459190 [details]
logs

Comment 4 Evelina Shames 2018-08-15 12:37:16 UTC
Verified.

Versions:
ovirt-engine-4.2.6.3
vdsm-4.20.37-2

Comment 5 Yosi Ben Shimon 2018-08-22 08:58:38 UTC
Moving back to ASSIGNED because this bug seems to still exist.
Tested using:
ovirt-engine-4.2.6.3-1.el7.noarch
vdsm-4.20.37-3.git924eec4.el7.x86_64

While running the specified test case (5967), I found a "deleteVolume" task that is not being cleared.

[root@storage-ge16-vdsm3 ~]# vdsm-client Host getAllTasks
{
    "48a1e000-9eff-48a0-9bf5-9152978854a2": {
        "verb": "deleteVolume", 
        "code": 0, 
        "state": "finished", 
        "tag": "spm", 
        "result": "", 
        "message": "1 jobs completed successfully", 
        "id": "48a1e000-9eff-48a0-9bf5-9152978854a2"
    }
}


According to the VDSM log, this task started at 2018-08-22 10:38:28:

2018-08-22 10:38:28,982+0300 DEBUG (jsonrpc/1) [storage.TaskManager.Task] (Task='48a1e000-9eff-48a0-9bf5-9152978854a2') moving from state preparing -> state preparing (task:602)
2018-08-22 10:38:28,983+0300 INFO  (jsonrpc/1) [vdsm.api] START deleteVolume(sdUUID='affb03c6-52c8-4355-be7d-aa8d91452601', spUUID='493b429f-d032-476a-9dad-bb18a4ef4151', imgUUID='8e617cf0-0693-412e-bcc5-9eb5fe2037cb', volumes=['91ceae38-9a63-45c3-9091-9cf2ba819e3b'], postZero='true', force='false', discard=True) from=::ffff:10.35.163.251,41464, flow_id=disks_syncAction_77f49a65-13b4-465c, task_id=48a1e000-9eff-48a0-9bf5-9152978854a2 (api:46)

and was still present in the logs at 2018-08-22 11:10:45:

2018-08-22 11:10:45,270+0300 DEBUG (jsonrpc/3) [storage.TaskManager.Task] (Task='3e035f0d-d425-4dfb-ba1f-d9fefede9b4a') finished: {'allTasksStatus': {'48a1e000-9eff-48a0-9bf5-9152978854a2': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '48a1e000-9eff-48a0-9bf5-9152978854a2'}}} (task:1201)


Additional info:
When I tried to deactivate this host (SPM), I got this error in the engine log (which makes sense):
2018-08-22 11:10:45,291+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default task-36) [49f1eb8e-2a86-4371-bf85-c64974dc5a69] SpmStopVDSCommand::Not stopping SPM on vds 'host_mixed_3', pool id '493b429f-d032-476a-9dad-bb18a4ef4151' as there are uncleared tasks 'Task '48a1e000-9eff-48a0-9bf5-9152978854a2', status 'finished''

Comment 6 Red Hat Bugzilla Rules Engine 2018-08-22 08:58:42 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 7 Yosi Ben Shimon 2018-08-22 08:59:21 UTC
Created attachment 1477831 [details]
logs

Comment 8 Benny Zlotnik 2018-08-22 09:04:08 UTC
Yosi, does this happen often? Were there runs in which the task was cleared successfully?

Looks like the culprit is this NPE:
2018-08-22 10:38:44,139+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-198283) [] [within thread]: endAction for action type CreateSnapshot threw an exception.: java.lang.NullPointerException
	at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:351) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:347) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:160) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.lambda$endActionIfNecessary$0(CommandAsyncTask.java:112) [bll.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:96) [utils.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_181]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_181]
	at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent-1.0.jar:]
	at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)
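
One plausible reading of this failure mode (an assumption for illustration, not confirmed against the engine sources): the endAction callback for the finished task looks up command state and dereferences the result without a null check; if that state is already gone by the time the callback fires, the callback dies and the clearTask step is never reached. A rough Python analogue of the pattern (the real code is Java, in CoCoAsyncTaskHelper.endAction):

# Hypothetical stand-in for the engine's command bookkeeping; entries go away
# when a command is cleaned up before its async task's callback fires.
commands = {}

def end_action(command_id):
    command = commands.get(command_id)  # None if already cleaned up
    # Missing None-check: this raises AttributeError, the Python analogue of
    # the NullPointerException above, and the task is never cleared.
    command.end()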

Comment 9 Yosi Ben Shimon 2018-08-22 11:36:29 UTC
(In reply to Benny Zlotnik from comment #8)
> Yosi, does this happen often? Were there runs in which task was cleared
> successfully?

It happens in most cases; I haven't seen an execution of this test case in which it didn't occur.


Comment 10 Elad 2018-10-10 14:49:58 UTC
The deleteImage task is now cleared when LSM fails due to VM power off.

deleteImage task starts:

2018-10-10 17:34:32,992+0300 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='97a1bac7-b21f-49ed-98f2-687257b60f6f') moving from state preparing -> state preparing (task:602)
2018-10-10 17:34:32,993+0300 INFO  (jsonrpc/6) [vdsm.api] START deleteImage(sdUUID=u'2d633c7e-e417-4b4d-8b30-e973f1d6d870', spUUID=u'fb339a72-b298-451b-9a13-6554de8ef545', imgUUID=u'edc9bd45-2014-45a8-ab8d-d7c8d83b2f2b', postZero=u'false', force=u'false', discard=False) from=::ffff:10.35.161.127,51832, flow_id=vms_delete_fe902a07-01e2-4ee9, task_id=97a1bac7-b21f-49ed-98f2-687257b60f6f (api:46)


And cleared right after:

2018-10-10 17:34:39,476+0300 INFO  (jsonrpc/1) [vdsm.api] START clearTask(taskID=u'97a1bac7-b21f-49ed-98f2-687257b60f6f', spUUID=None, options=None) from=::ffff:10.35.161.127,51820, flow_id=vms_delete_fe902a07-01e2-4ee9, task_id=22a4ce37-b806-4def-b7fe-4a44eb4f9f3a (api:46)
2018-10-10 17:34:39,476+0300 DEBUG (jsonrpc/1) [storage.TaskManager] Entry. taskID: 97a1bac7-b21f-49ed-98f2-687257b60f6f (taskManager:172)



=============
Used:
rhvm-4.2.7.2-0.1.el7ev.noarch
vdsm-4.20.42-1.el7ev.x86_64

Comment 11 Sandro Bonazzola 2018-11-02 14:31:08 UTC
This bug is included in the oVirt 4.2.7 release, published on November 2nd 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

