Bug 1003506 - [vdsm] after a failure in moveImage, deleteImage task won't be cleaned on vdsm
Summary: [vdsm] after a failure in moveImage, deleteImage task won't be cleaned on vdsm
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Federico Simoncelli
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks: 3.3snap1
 
Reported: 2013-09-02 08:42 UTC by Elad
Modified: 2016-02-10 19:29 UTC
10 users

Fixed In Version: is19
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:
Embargoed:
amureini: Triaged+


Attachments
logs (4.17 MB, application/x-gzip)
2013-09-02 08:42 UTC, Elad


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 19802 0 None None None Never
oVirt gerrit 20044 0 None None None Never

Description Elad 2013-09-02 08:42:38 UTC
Created attachment 792758 [details]
logs

Description of problem:
The deleteImage task remains running forever on vdsm after vdsm crashes right after moveImage (op=copy) finishes.

Version-Release number of selected component (if applicable):
vdsm-4.12.0-72.git287bb7e.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
On a block storage pool with more than one SD:
- create a VM
- move the VM's disk to another SD (offline storage migration)
- restart vdsm right when moveImage ends (before deleteImage starts)

Actual results:

When vdsm starts, it gets the deleteImage task from the engine (to roll forward the MoveOrCopyDisk action):

engine:

 2013-09-01 17:35:12,045 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, postZeros = true, forceDelete = false), log id: 72fd0d2e


vdsm starts deleteImage:

Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076
f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')

lvremove starts and never ends:

f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/3514f0c54d5c00989|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " -f --autobackup n 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0' (cwd None)

The task remains running forever.

In the engine log:

2013-09-02 03:49:30,459 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler_Worker-75) Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling. End action for command 9994af32-7fdd-42d0-8105-a32dd1d67d6e will proceed when all the entitys tasks are completed.
2013-09-02 03:49:40,490 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-62) BaseAsyncTask::LogEndTaskFailure: Task f8f41c68-5941-4e99-97af-7eaa182d9ce2 (Parent Command RemoveImage, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').,
-- Exception: VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').


Expected results:
The task should either fail with an LVM failure or time out.

Additional info: logs

Comment 2 Ayal Baron 2013-09-02 15:06:19 UTC
Why is this a regression?

Comment 3 Elad 2013-09-02 16:29:53 UTC
(In reply to Ayal Baron from comment #2)
> Why is this a regression?

Does not happen on 3.2

Comment 4 Federico Simoncelli 2013-10-02 09:06:42 UTC
(In reply to Elad from comment #0)
> vdsm starts deleteImage:
> 
> Thread-74::INFO::2013-09-01
> 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect:
> deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4',
> spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076
> f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')
> 
> lvremove starts and never ends:
> 
> f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01
> 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n
> /sbin/lvm lvremove --config " devices { preferred_names =
> [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0
> disable_after_error_count=3 filter = [
> \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/
> 3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/
> 3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/
> 3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/
> 3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/
> 3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/
> 3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/
> 3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/
> 3514f0c54d5c00989|\', \'r|.*|\' ] }  global {  locking_type=1 
> prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50 
> retain_days = 0 } " -f --autobackup n
> 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0'
> (cwd None)

This is wrong; on the vdsm side, task f8f41c68-5941-4e99-97af-7eaa182d9ce2 begins at:

Thread-74::DEBUG::2013-09-01 17:35:22,879::task::579::TaskManager.Task::(_updateState) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::moving from state init -> state preparing
Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')
...

executes lvremove at:

f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/3514f0c54d5c00989|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " -f --autobackup n 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0' (cwd None)

and finishes at:

f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,968::task::897::TaskManager.Task::(_runJobs) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::Task.run: exit - success: result
f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,968::task::974::TaskManager.Task::(_decref) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::ref 0 aborting False


The problem is on the engine side, as the backend fails to detect that the task has ended.
Moving to the ovirt-engine-backend component.

Analyzing what happened:

1. the user requested to move an image from one storage domain to another (implemented as copy + delete)
2. moveImage (with copy operation) was sent to vdsm:

2013-09-01 17:33:55,121 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-5-thread-45) [24ea19a1] START, MoveImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, dstDomainId = 62fe552d-5bcb-4bee-bc94-808a0b566fca, vmId = 00000000-0000-0000-0000-000000000000, op = Copy, postZero = true, force = false), log id: 2839ca05

3. move image finishes successfully:

b6ffbde7-367a-4623-8e1a-459d792484da::INFO::2013-09-01 17:34:53,452::image::527::Storage.Image::(move) COPY task on image 076f4876-473a-4894-a11e-f847934372be was successfully finished
b6ffbde7-367a-4623-8e1a-459d792484da::DEBUG::2013-09-01 17:34:53,455::task::579::TaskManager.Task::(_updateState) Task=`b6ffbde7-367a-4623-8e1a-459d792484da`::moving from state running -> state finished

4. vdsm is restarted:

MainThread::INFO::2013-09-01 17:35:01,693::vdsm::101::vds::(run) (PID: 14922) I am the actual vdsm 4.12.0-72.git287bb7e.el6ev nott-vds2.qa.lab.tlv.redhat.com (2.6.32-358.18.1.el6.x86_64)

5. engine notices that the spm was lost:

2013-09-01 17:34:48,153 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-39) Command SpmStatusVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-09-01 17:34:58,183 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-41) Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value
TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM: ()]]

6. engine starts the spm:

2013-09-01 17:35:02,766 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-68) START, SpmStartVDSCommand(HostName = nott-vds2, HostId = 69721c22-bf76-4643-80e5-210b81abf930, storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, prevId=1, prevLVER=11, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 21c193c6

Thread-41::INFO::2013-09-01 17:35:13,581::logUtils::44::dispatcher::(wrapper) Run and protect: spmStart(spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', prevID=1, prevLVER='11', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='3', options=None)

7. the move image task is re-detected and cleaned:

b6ffbde7-367a-4623-8e1a-459d792484da::DEBUG::2013-09-01 17:35:22,482::task::1218::TaskManager.Task::(recover) Task=`b6ffbde7-367a-4623-8e1a-459d792484da`::(recover): recovering: state finished

2013-09-01 17:35:11,997 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-68) SPMAsyncTask::PollTask: Polling task b6ffbde7-367a-4623-8e1a-459d792484da (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
2013-09-01 17:35:12,001 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-68) BaseAsyncTask::OnTaskEndSuccess: Task b6ffbde7-367a-4623-8e1a-459d792484da (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
...
2013-09-01 17:35:12,013 INFO  [org.ovirt.engine.core.bll.MoveOrCopyDiskCommand] (pool-5-thread-50) Ending command successfully: org.ovirt.engine.core.bll.MoveOrCopyDiskCommand
2013-09-01 17:35:12,015 INFO  [org.ovirt.engine.core.bll.MoveImageGroupCommand] (pool-5-thread-50) Ending command successfully: org.ovirt.engine.core.bll.MoveImageGroupCommand

8. the moveImage EndAction is executed triggering the deleteImage command:

2013-09-01 17:35:12,045 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, postZeros = true, forceDelete = false), log id: 72fd0d2e

Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007',

9. the engine storagePoolUpEvent is triggered (probably by the successful spmStart) and it stops all the tasks, including the newly started deleteImage (race), according to:

    @Override
    public void storagePoolUpEvent(StoragePool storagePool, boolean isNewSpm) {
        if (isNewSpm) {
            // A new SPM was elected: stop all the tasks on this storage pool.
            AsyncTaskManager.getInstance().StopStoragePoolTasks(storagePool);
        } else {
            // Same SPM as before: re-add the pool's existing tasks to the manager.
            AsyncTaskManager.getInstance().AddStoragePoolExistingTasks(storagePool);
        }
    }

2013-09-01 17:35:12,006 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (pool-5-thread-45) Attempting to get and stop tasks on storage pool iscsi1
...
2013-09-01 17:35:12,887 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-5-thread-45) SPMAsyncTask::StopTask: Attempting to stop task f8f41c68-5941-4e99-97af-7eaa182d9ce2 (Parent Command RemoveImage, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters).

10. it seems that StopStoragePoolTasks marked the task as Ended (stopTask) only in the database but not in memory, as the engine keeps looping on:

2013-09-01 17:35:42,080 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler_Worker-83) Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling. End action for command 9994af32-7fdd-42d0-8105-a32dd1d67d6e will proceed when all the entitys tasks are completed.
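
To make the inconsistency concrete, here is a toy sketch (plain Java with invented names, not the actual ovirt-engine code): if the stop path only flips the persisted status while the in-memory task object stays in Polling, the poller never observes the change and the end action keeps being postponed.

    // Toy illustration only: invented names, not ovirt-engine code.
    import java.util.HashMap;
    import java.util.Map;

    public class StaleTaskStateDemo {
        enum State { POLLING, ENDED }

        public static void main(String[] args) {
            String taskId = "f8f41c68-5941-4e99-97af-7eaa182d9ce2";

            // Simulated database row for the task.
            Map<String, State> dbState = new HashMap<>();
            dbState.put(taskId, State.POLLING);

            // Simulated in-memory task object held by the async task manager.
            State inMemoryState = State.POLLING;

            // Stop path analogous to point 10: the task is marked Ended in the
            // database only; the in-memory object is never touched.
            dbState.put(taskId, State.ENDED);

            // Polling path: decisions are taken from the in-memory state, so the
            // end action keeps being postponed even though the DB says Ended.
            for (int i = 0; i < 3; i++) {
                System.out.printf("Task %s is in state %s (DB says %s) -- end action postponed%n",
                        taskId, inMemoryState, dbState.get(taskId));
            }
        }
    }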


To summarize the issue after the spm has been re-elected:

1. Attempting to EndAction MoveOrCopyDisk (issues deleteImage at point 4)
2. Attempting to get and stop tasks on storage pool iscsi1 (StopStoragePoolTasks)
3. Ending command successfully: org.ovirt.engine.core.bll.MoveImageGroupCommand
4. START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007...
5. BaseAsyncTask::StartPollingTask: Starting to poll task f8f41c68-5941-4e99-97af-7eaa182d9ce2
6. Discovered 2 tasks on Storage Pool iscsi1, 1 added to manager. (StopStoragePoolTasks)
7. Attempting to stop task f8f41c68-5941-4e99-97af-7eaa182d9ce2 ...
8. Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling (looping)

I think that part of the solution would be to remove StopStoragePoolTasks completely. When a new SPM is elected (isNewSpm), we don't need to stop the tasks, as all the previous tasks were already stopped automatically.
Removing StopStoragePoolTasks would avoid the race with the newly started tasks (e.g. deleteImage) and the inconsistency between the tasks in the DB (Ended) and in memory (Polling).
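
One possible shape of the change, as a sketch only (not a reviewed patch; whether the new-SPM branch should do nothing at all or perform some other cleanup is left open here):

    @Override
    public void storagePoolUpEvent(StoragePool storagePool, boolean isNewSpm) {
        // Sketch: no StopStoragePoolTasks call. When a new SPM has been elected,
        // the previous SPM's tasks were already stopped, so there is nothing to
        // stop and no race with tasks started by end actions (e.g. deleteImage).
        if (!isNewSpm) {
            AsyncTaskManager.getInstance().AddStoragePoolExistingTasks(storagePool);
        }
    }

Keeping AddStoragePoolExistingTasks for the non-new-SPM case preserves the current behavior of re-attaching the pool's existing tasks to the manager.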

Comment 5 Elad 2013-10-27 12:37:55 UTC
DeleteImageGroupVDSCommand task gets cleaned after it ends.

Verified on is19 RHEVM and is20 vdsm:

RHEVM:
rhevm-3.3.0-0.27.beta1.el6ev.noarch

VDSM:
vdsm-4.13.0-0.5.beta1.el6ev.x86_64

Comment 6 Itamar Heim 2014-01-21 22:26:12 UTC
Closing - RHEV 3.3 Released

Comment 7 Itamar Heim 2014-01-21 22:29:41 UTC
Closing - RHEV 3.3 Released

