Created attachment 792758 [details]
logs

Description of problem:
The deleteImage task remains running forever on vdsm after vdsm crashed right after moveImage (op=copy) finished.

Version-Release number of selected component (if applicable):
vdsm-4.12.0-72.git287bb7e.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
On a block pool with more than one SD:
- create a VM
- move the VM's disk to another SD (offline storage migration)
- restart vdsm right when moveImage ends (before deleteImage starts)

Actual results:
When vdsm starts, it gets the deleteImage task from the engine (to roll forward the MoveOrCopyDisk action):

engine:
2013-09-01 17:35:12,045 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, postZeros = true, forceDelete = false), log id: 72fd0d2e

vdsm starts deleteImage:
Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')

lvremove starts and never ends:
f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/3514f0c54d5c00989|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " -f --autobackup n 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0' (cwd None)

The task remains running forever. In the engine log:

2013-09-02 03:49:30,459 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler_Worker-75) Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling. End action for command 9994af32-7fdd-42d0-8105-a32dd1d67d6e will proceed when all the entitys tasks are completed.
2013-09-02 03:49:40,490 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-62) BaseAsyncTask::LogEndTaskFailure: Task f8f41c68-5941-4e99-97af-7eaa182d9ce2 (Parent Command RemoveImage, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').,
-- Exception: VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').

Expected results:
The task should either fail with the lvm failure or fail with a timeout.

Additional info:
logs
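For reference, the timeout expectation above simply means bounding the external lvremove call instead of waiting on it indefinitely. A minimal sketch of that idea (illustration only; vdsm itself is Python, and the command line, volume name and 600-second bound below are placeholders, not actual vdsm behavior):

import java.util.concurrent.TimeUnit;

public class BoundedLvremove {
    public static void main(String[] args) throws Exception {
        // Illustration only: run the external command but give up after a
        // bound instead of blocking forever, so the owning task can fail
        // cleanly rather than stay "running" indefinitely.
        Process p = new ProcessBuilder("/sbin/lvm", "lvremove", "-f",
                "--autobackup", "n", "vg/lv").start();   // placeholder VG/LV
        if (!p.waitFor(600, TimeUnit.SECONDS)) {         // hypothetical bound
            p.destroyForcibly();
            throw new RuntimeException("lvremove timed out");
        }
        System.out.println("lvremove exited with " + p.exitValue());
    }
}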
Why is this a regression?
(In reply to Ayal Baron from comment #2)
> Why is this a regression?

Does not happen on 3.2.
(In reply to Elad from comment #0)
> vdsm starts deleteImage:
>
> Thread-74::INFO::2013-09-01
> 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect:
> deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4',
> spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076
> f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')
>
> lvremove starts and never ends:
>
> f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01
> 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n
> /sbin/lvm lvremove --config " devices { preferred_names =
> [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0
> disable_after_error_count=3 filter = [
> \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/
> 3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/
> 3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/
> 3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/
> 3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/
> 3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/
> 3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/
> 3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/
> 3514f0c54d5c00989|\', \'r|.*|\' ] } global { locking_type=1
> prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50
> retain_days = 0 } " -f --autobackup n
> 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0'
> (cwd None)

This is wrong; on the vdsm side, task f8f41c68-5941-4e99-97af-7eaa182d9ce2 begins at:

Thread-74::DEBUG::2013-09-01 17:35:22,879::task::579::TaskManager.Task::(_updateState) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::moving from state init -> state preparing
Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', imgUUID='076f4876-473a-4894-a11e-f847934372be', postZero='true', force='false')
...

executes lvremove at:

f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,117::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvremove --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/1elad11370784|/dev/mapper/3514f0c54d5c00002|/dev/mapper/3514f0c54d5c00003|/dev/mapper/3514f0c54d5c00004|/dev/mapper/3514f0c54d5c00005|/dev/mapper/3514f0c54d5c00006|/dev/mapper/3514f0c54d5c00007|/dev/mapper/3514f0c54d5c00008|/dev/mapper/3514f0c54d5c00009|/dev/mapper/3514f0c54d5c0000a|/dev/mapper/3514f0c54d5c0000b|/dev/mapper/3514f0c54d5c0000c|/dev/mapper/3514f0c54d5c0000d|/dev/mapper/3514f0c54d5c0000e|/dev/mapper/3514f0c54d5c0000f|/dev/mapper/3514f0c54d5c00597|/dev/mapper/3514f0c54d5c00989|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " -f --autobackup n 07b2ea73-27d3-46ba-832b-7fd5620443e4/56786842-6c95-4f8c-913f-cf8944fc55b0' (cwd None)

and finishes at:

f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,968::task::897::TaskManager.Task::(_runJobs) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::Task.run: exit - success: result
f8f41c68-5941-4e99-97af-7eaa182d9ce2::DEBUG::2013-09-01 17:35:47,968::task::974::TaskManager.Task::(_decref) Task=`f8f41c68-5941-4e99-97af-7eaa182d9ce2`::ref 0 aborting False

The problem is on the engine side: the backend fails to detect that the task has ended.
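For clarity, a rough illustration of what "detecting that the task has ended" means on the engine side. This is hypothetical code; the class, enum and method names below are invented for this comment and are not the real engine implementation. The idea is that the parent command's end action is only released once every async task it owns has left the Polling state, so a task whose in-memory entry never changes keeps the command blocked:

// Hypothetical illustration only -- not the actual oVirt engine classes.
import java.util.List;

class EndActionGate {
    enum TaskState { Polling, Ended }

    // The engine keeps an in-memory entry per async task and releases the
    // parent command's end action only once every entry has left Polling.
    // If an entry is never updated, the command stays blocked even though
    // vdsm already reported the task as finished.
    static boolean canRunEndAction(List<TaskState> inMemoryTaskStates) {
        return inMemoryTaskStates.stream().allMatch(s -> s == TaskState.Ended);
    }
}

Under that model, a task entry stuck at Polling is consistent with the engine looping on the "is in state Polling" message shown above.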
Moving to the ovirt-engine-backend component.

So analyzing what happened:

1. user requested to move an image from a storage domain to another (implemented as copy + delete)

2. moveImage (with copy operation) was sent to vdsm:

2013-09-01 17:33:55,121 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-5-thread-45) [24ea19a1] START, MoveImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, dstDomainId = 62fe552d-5bcb-4bee-bc94-808a0b566fca, vmId = 00000000-0000-0000-0000-000000000000, op = Copy, postZero = true, force = false), log id: 2839ca05

3. move image finishes successfully:

b6ffbde7-367a-4623-8e1a-459d792484da::INFO::2013-09-01 17:34:53,452::image::527::Storage.Image::(move) COPY task on image 076f4876-473a-4894-a11e-f847934372be was successfully finished
b6ffbde7-367a-4623-8e1a-459d792484da::DEBUG::2013-09-01 17:34:53,455::task::579::TaskManager.Task::(_updateState) Task=`b6ffbde7-367a-4623-8e1a-459d792484da`::moving from state running -> state finished

4. vdsm is restarted:

MainThread::INFO::2013-09-01 17:35:01,693::vdsm::101::vds::(run) (PID: 14922) I am the actual vdsm 4.12.0-72.git287bb7e.el6ev nott-vds2.qa.lab.tlv.redhat.com (2.6.32-358.18.1.el6.x86_64)

5. engine notices that the spm was lost:

2013-09-01 17:34:48,153 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-39) Command SpmStatusVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-09-01 17:34:58,183 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-41) Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM: ()]]

6. engine starts the spm:

2013-09-01 17:35:02,766 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-68) START, SpmStartVDSCommand(HostName = nott-vds2, HostId = 69721c22-bf76-4643-80e5-210b81abf930, storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, prevId=1, prevLVER=11, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 21c193c6
Thread-41::INFO::2013-09-01 17:35:13,581::logUtils::44::dispatcher::(wrapper) Run and protect: spmStart(spUUID='7a93c0d1-1316-40e2-b946-3180c3415007', prevID=1, prevLVER='11', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='3', options=None)

7. the move image task is re-detected and cleaned:

b6ffbde7-367a-4623-8e1a-459d792484da::DEBUG::2013-09-01 17:35:22,482::task::1218::TaskManager.Task::(recover) Task=`b6ffbde7-367a-4623-8e1a-459d792484da`::(recover): recovering: state finished
2013-09-01 17:35:11,997 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-68) SPMAsyncTask::PollTask: Polling task b6ffbde7-367a-4623-8e1a-459d792484da (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
2013-09-01 17:35:12,001 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-68) BaseAsyncTask::OnTaskEndSuccess: Task b6ffbde7-367a-4623-8e1a-459d792484da (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
...
2013-09-01 17:35:12,013 INFO [org.ovirt.engine.core.bll.MoveOrCopyDiskCommand] (pool-5-thread-50) Ending command successfully: org.ovirt.engine.core.bll.MoveOrCopyDiskCommand
2013-09-01 17:35:12,015 INFO [org.ovirt.engine.core.bll.MoveImageGroupCommand] (pool-5-thread-50) Ending command successfully: org.ovirt.engine.core.bll.MoveImageGroupCommand

8. the moveImage EndAction is executed, triggering the deleteImage command:

2013-09-01 17:35:12,045 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 07b2ea73-27d3-46ba-832b-7fd5620443e4, imageGroupId = 076f4876-473a-4894-a11e-f847934372be, postZeros = true, forceDelete = false), log id: 72fd0d2e
Thread-74::INFO::2013-09-01 17:35:22,879::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='07b2ea73-27d3-46ba-832b-7fd5620443e4', spUUID='7a93c0d1-1316-40e2-b946-3180c3415007',

9. the engine storagePoolUpEvent is triggered (probably by the successful spmStart) and it stops all the tasks, including the newly started deleteImage task (race), according to:

@Override
public void storagePoolUpEvent(StoragePool storagePool, boolean isNewSpm) {
    if (isNewSpm) {
        AsyncTaskManager.getInstance().StopStoragePoolTasks(storagePool);
    } else {
        AsyncTaskManager.getInstance().AddStoragePoolExistingTasks(storagePool);
    }
}

2013-09-01 17:35:12,006 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (pool-5-thread-45) Attempting to get and stop tasks on storage pool iscsi1
...
2013-09-01 17:35:12,887 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-5-thread-45) SPMAsyncTask::StopTask: Attempting to stop task f8f41c68-5941-4e99-97af-7eaa182d9ce2 (Parent Command RemoveImage, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters).

10. it seems that StopStoragePoolTasks marked the task as Ended (stopTask) only in the database but not in memory, as the engine loops on:

2013-09-01 17:35:42,080 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler_Worker-83) Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling. End action for command 9994af32-7fdd-42d0-8105-a32dd1d67d6e will proceed when all the entitys tasks are completed.

To summarize the issue after the spm has been re-elected:

1. Attempting to EndAction MoveOrCopyDisk (issues deleteImage at point 4)
2. Attempting to get and stop tasks on storage pool iscsi1 (StopStoragePoolTasks)
3. Ending command successfully: org.ovirt.engine.core.bll.MoveImageGroupCommand
4. START, DeleteImageGroupVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007...
5. BaseAsyncTask::StartPollingTask: Starting to poll task f8f41c68-5941-4e99-97af-7eaa182d9ce2
6. Discovered 2 tasks on Storage Pool iscsi1, 1 added to manager. (StopStoragePoolTasks)
7. Attempting to stop task f8f41c68-5941-4e99-97af-7eaa182d9ce2
...
8. Task with DB Task ID 961b037e-a668-4dac-b67f-8cd72d47f659 and VDSM Task ID f8f41c68-5941-4e99-97af-7eaa182d9ce2 is in state Polling (looping)

I think that part of the solution would be to remove StopStoragePoolTasks completely. When a new spm is elected (isNewSpm) we don't need to stop the tasks, as all the previous tasks were already stopped automatically. Removing StopStoragePoolTasks would avoid the race with the newly started tasks (e.g. deleteImage) and the inconsistency between tasks in the DB (Ended) and in memory (Polling). A sketch of that change follows below.
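A minimal sketch of what that change could look like, based on the fragment quoted in point 9 (hypothetical; this is not the actual patch that was merged). Only the AddStoragePoolExistingTasks path is kept, so tasks started by in-flight end actions, such as the deleteImage task above, are no longer raced and stopped:

@Override
public void storagePoolUpEvent(StoragePool storagePool, boolean isNewSpm) {
    // Sketch only: tasks that ran on the previous SPM were already stopped
    // when that SPM went down, so on a new SPM election there is nothing left
    // to stop; just pick up whatever tasks still exist. This avoids stopping
    // tasks the engine itself just started while bringing the pool back up
    // (the race described in point 9) and the resulting DB/memory mismatch.
    AsyncTaskManager.getInstance().AddStoragePoolExistingTasks(storagePool);
}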
The DeleteImageGroupVDSCommand task gets cleaned after it ends.

Verified on is19 RHEVM and is20 vdsm:
RHEVM: rhevm-3.3.0-0.27.beta1.el6ev.noarch
VDSM: vdsm-4.13.0-0.5.beta1.el6ev.x86_64
Closing - RHEV 3.3 Released