Bug 1003506 - [vdsm] after a failure in moveImage, deleteImage task won't be cleaned on vdsm
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Federico Simoncelli
QA Contact: Elad
Whiteboard: storage
Keywords: Regression
Depends On:
Blocks: 3.3snap1
Reported: 2013-09-02 04:42 EDT by Elad
Modified: 2016-02-10 14:29 EST
Fixed In Version: is19
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
amureini: Triaged+


Attachments
logs (4.17 MB, application/x-gzip)
2013-09-02 04:42 EDT, Elad


External Trackers
Tracker        ID     Priority  Status  Summary  Last Updated
oVirt gerrit   19802  None      None    None     Never
oVirt gerrit   20044  None      None    None     Never

Description Elad 2013-09-02 04:42:38 EDT
Created attachment 792758 [details]
logs

Description of problem:
The deleteImage task remains running forever on vdsm after vdsm crashes right after moveImage (op=copy) finishes.

Version-Release number of selected component (if applicable):
vdsm-4.12.0-72.git287bb7e.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
On a block storage pool with more than one storage domain:
- Create a VM
- Move the VM's disk to another storage domain (offline storage migration)
- Restart vdsm right when moveImage ends (before deleteImage starts)

Actual results:

When vdsm starts, it gets the deleteImage task from the engine (to roll forward the MoveOrCopyDisk action):

In the engine log:

 2013-09-01 17:35:12,045 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, postZeros = true, forceDelete = false), log id: 72fd0d2e


vdsm starts deleteImage:

Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076
f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')

lvremove starts and never ends:

f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/3514f0c54d5c00989|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " -f --autobackup n 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0' (cwd None)

The task remains running forever.

In the engine log:

2013-09-02 03:49:30,459 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler_Worker-75) Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling. End action for command 9994af32-7fdd-42d0-8105-a32dd1d67d6e will proceed when all the entitys tasks are completed.
2013-09-02 03:49:40,490 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-62) BaseAsyncTask::LogEndTaskFailure: Task f8f41c68-5941-4e99-97af-7eaa182d9ce2 (Parent Command RemoveImage, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').,
-- Exception: VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').


Expected results:
The task should either fail with an LVM error or time out.

Additional info: logs
Comment 2 Ayal Baron 2013-09-02 11:06:19 EDT
Why is this a regression?
Comment 3 Elad 2013-09-02 12:29:53 EDT
(In reply to Ayal Baron from comment #2)
> Why is this a regression?

Does not happen on 3.2
Comment 4 Federico Simoncelli 2013-10-02 05:06:42 EDT
(In reply to Elad from comment #0)
> vdsm starts deleteImage:
> 
> Thread-74::INFO::2013-09-01
> 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect:
> deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4',
> spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076
> f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')
> 
> lvremove starts and never ends:
> 
> f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01
> 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n
> /sbin/lvm lvremove --config " devices { preferred_names =
> [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0
> disable_after_error_count=3 filter = [
> \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/
> 3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/
> 3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/
> 3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/
> 3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/
> 3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/
> 3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/
> 3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/
> 3514f0c54d5c00989|\', \'r|.*|\' ] }  global {  locking_type=1 
> prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50 
> retain_days = 0 } " -f --autobackup n
> 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0'
> (cwd None)

This is wrong; on the vdsm side, task f8f41c68-5941-4e99-97af-7eaa182d9ce2 begins at:

Thread-74::DEBUG::2013-09-01 17:35:22,879::task::579::TaskManager.Task::(_updateState) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::moving from state init -> state preparing
Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')
...

executes lvremove at:

f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/3514f0c54d5c00989|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " -f --autobackup n 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0' (cwd None)

and finishes at:

f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,968::task::897::TaskManager.Task::(_runJobs) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::Task.run: exit - success: result
f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,968::task::974::TaskManager.Task::(_decref) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::ref 0 aborting False


The problem is on the engine side, as the backend fails to detect that the task has ended.
Moving to the ovirt-engine-backend component.

So analyzing what happened:

1. the user requested to move an image from one storage domain to another (implemented as copy + delete)
2. moveImage (with copy operation) was sent to vdsm:

2013-09-01 17:33:55,121 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-5-thread-45) [24ea19a1] START, MoveImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, dstDomainId = 62fe552d-5bcb-4bee-bc94-808a0b566fca, vmId = 00000000-0000-0000-0000-000000000000, op = Copy, postZero = true, force = false), log id: 2839ca05

3. move image finishes successfully:

b6ffbde7-367a-4623-8e1a-459d792484da::INFO::2013-09-01 17:34:53,452::image::527::Storage.Image::(move) COPY task on image 076f4876-473a-4894-a11e-f847934372be was successfully finished
b6ffbde7-367a-4623-8e1a-459d792484da::DEBUG::2013-09-01 17:34:53,455::task::579::TaskManager.Task::(_updateState) Task=`b6ffbde7-367a-4623-8e1a-459d792484da`::moving from state running -> state finished

4. vdsm is restarted:

MainThread::INFO::2013-09-01 17:35:01,693::vdsm::101::vds::(run) (PID: 14922) I am the actual vdsm 4.12.0-72.git287bb7e.el6ev nott-vds2.qa.lab.tlv.redhat.com (2.6.32-358.18.1.el6.x86_64)

5. engine notices that the spm was lost:

2013-09-01 17:34:48,153 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-39) Command SpmStatusVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-09-01 17:34:58,183 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-41) Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value
TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM: ()]]

6. engine starts the spm:

2013-09-01 17:35:02,766 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-68) START, SpmStartVDSCommand(HostName = nott-vds2, HostId = 69721c22-bf76-4643-80e5-210b81abf930, storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, prevId=1, prevLVER=11, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 21c193c6

Thread-41::INFO::2013-09-01 17:35:13,581::logUtils::44::dispatcher::(wrapper) Run and protect: spmStart(spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', prevID=1, prevLVER='11', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='3', options=None)

7. the move image task is re-detected and cleaned:

b6ffbde7-367a-4623-8e1a-459d792484da::DEBUG::2013-09-01 17:35:22,482::task::1218::TaskManager.Task::(recover) Task=`b6ffbde7-367a-4623-8e1a-459d792484da`::(recover): recovering: state finished

2013-09-01 17:35:11,997 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-68) SPMAsyncTask::PollTask: Polling task b6ffbde7-367a-4623-8e1a-459d792484da (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
2013-09-01 17:35:12,001 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-68) BaseAsyncTask::OnTaskEndSuccess: Task b6ffbde7-367a-4623-8e1a-459d792484da (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
...
2013-09-01 17:35:12,013 INFO  [org.ovirt.engine.core.bll.MoveOrCopyDiskCommand] (pool-5-thread-50) Ending command successfully: org.ovirt.engine.core.bll.MoveOrCopyDiskCommand
2013-09-01 17:35:12,015 INFO  [org.ovirt.engine.core.bll.MoveImageGroupCommand] (pool-5-thread-50) Ending command successfully: org.ovirt.engine.core.bll.MoveImageGroupCommand

8. the moveImage EndAction is executed triggering the deleteImage command:

2013-09-01 17:35:12,045 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, postZeros = true, forceDelete = false), log id: 72fd0d2e

Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007',

9. the engine storagePoolUpEvent is triggered (probably by the successful spmStart) and it stops all the tasks, including the newly started deleteImage (a race), according to:

    @Override
    public void storagePoolUpEvent(StoragePool storagePool, boolean isNewSpm) {
        if (isNewSpm) {
            AsyncTaskManager.getInstance().StopStoragePoolTasks(storagePool);
        } else {
            AsyncTaskManager.getInstance().AddStoragePoolExistingTasks(storagePool);
        }
    }

2013-09-01 17:35:12,006 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (pool-5-thread-45) Attempting to get and stop tasks on storage pool iscsi1
...
2013-09-01 17:35:12,887 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-5-thread-45) SPMAsyncTask::StopTask: Attempting to stop task f8f41c68-5941-4e99-97af-7eaa182d9ce2 (Parent Command RemoveImage, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters).

10. it seems that StopStoragePoolTasks marked the task as Ended (stopTask) only in the database but not in memory (see the sketch below), as the engine keeps looping on:

2013-09-01 17:35:42,080 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler_Worker-83) Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling. End action for command 9994af32-7fdd-42d0-8105-a32dd1d67d6e will proceed when all the entitys tasks are completed.
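
To make point 10 concrete, here is a minimal, self-contained sketch of the suspected inconsistency. None of these names come from the real engine code (TaskEntry, stopTask, pollOnce and the two maps are invented purely for illustration); it only shows how writing Ended to the database while leaving the in-memory state at Polling would make the poller loop forever:

import java.util.HashMap;
import java.util.Map;

public class StopTaskRaceSketch {

    enum TaskState { Polling, Ended }

    static class TaskEntry {
        final String vdsmTaskId;
        TaskState state = TaskState.Polling;
        TaskEntry(String vdsmTaskId) { this.vdsmTaskId = vdsmTaskId; }
    }

    // Stand-in for the async_tasks table in the engine database.
    static final Map<String, TaskState> database = new HashMap<>();
    // Stand-in for the in-memory task map that the scheduler polls.
    static final Map<String, TaskEntry> tasksInMemory = new HashMap<>();

    // Suspected behaviour of the stop path (StopStoragePoolTasks/stopTask):
    // the terminal state is persisted to the database only.
    static void stopTask(String vdsmTaskId) {
        database.put(vdsmTaskId, TaskState.Ended);
        // Missing (suspected): tasksInMemory.get(vdsmTaskId).state = TaskState.Ended;
    }

    // The poller consults only the in-memory state, so it never sees the end.
    static void pollOnce() {
        for (TaskEntry task : tasksInMemory.values()) {
            if (task.state == TaskState.Polling) {
                System.out.println("Task " + task.vdsmTaskId + " is in state Polling");
            }
        }
    }

    public static void main(String[] args) {
        String id = "f8f41c68-5941-4e99-97af-7eaa182d9ce2";
        tasksInMemory.put(id, new TaskEntry(id));
        stopTask(id);                                         // DB row is now Ended ...
        System.out.println("DB state: " + database.get(id));
        pollOnce();                                           // ... but in memory it is still Polling
    }
}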


To summarize the issue after the spm has been re-elected:

1. Attempting to EndAction MoveOrCopyDisk (issues deleteImage at point 4)
2. Attempting to get and stop tasks on storage pool iscsi1 (StopStoragePoolTasks)
3. Ending command successfully: org.ovirt.engine.core.bll.MoveImageGroupCommand
4. START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007...
5. BaseAsyncTask::StartPollingTask: Starting to poll task f8f41c68-5941-4e99-97af-7eaa182d9ce2
6. Discovered 2 tasks on Storage Pool iscsi1, 1 added to manager. (StopStoragePoolTasks)
7. Attempting to stop task f8f41c68-5941-4e99-97af-7eaa182d9ce2 ...
8. Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling (looping)

I think that part of the solution would be removing StopStoragePoolTasks completely. When a new SPM is elected (isNewSpm) we don't need to stop the tasks, as all the previous tasks were automatically stopped.
Removing StopStoragePoolTasks would avoid the race with newly started tasks (e.g. deleteImage) and the inconsistency between tasks in the DB (Ended) and in memory (Polling).
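
A minimal sketch of the proposed change, derived from the storagePoolUpEvent method quoted at point 9 (whether any additional cleanup is needed on the new-SPM path is an assumption here, not something verified against the engine code):

    @Override
    public void storagePoolUpEvent(StoragePool storagePool, boolean isNewSpm) {
        // On a newly elected SPM all previous tasks were already stopped, so
        // there is nothing to stop here; dropping the StopStoragePoolTasks
        // call avoids the race with tasks started by roll-forward (e.g.
        // deleteImage) and the DB/memory inconsistency described above.
        if (!isNewSpm) {
            AsyncTaskManager.getInstance().AddStoragePoolExistingTasks(storagePool);
        }
    }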
Comment 5 Elad 2013-10-27 08:37:55 EDT
DeleteImageGroupVDSCommand task gets cleaned after it ends.

Verified on is19 RHEVM and is20 vdsm:

RHEVM:
rhevm-3.3.0-0.27.beta1.el6ev.noarch

VDSM:
vdsm-4.13.0-0.5.beta1.el6ev.x86_64
Comment 6 Itamar Heim 2014-01-21 17:26:12 EST
Closing - RHEV 3.3 Released
Comment 7 Itamar Heim 2014-01-21 17:29:41 EST
Closing - RHEV 3.3 Released
