Created attachment 741806 [details]
logs

Description of problem:
After a live storage migration failure, the engine keeps reporting that HSMGetAllTasksStatusesVDS failed with error = <task id>, while VDSM returns a reply for the task with no errors:

2013-04-30 12:27:09,069 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-100) BaseAsyncTask::LogEndTaskFailure: Task 31b39f81-27c2-4782-bccd-8436305a5ae9 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = 31b39f81-27c2-4782-bccd-8436305a5ae9,
-- Exception: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = 31b39f81-27c2-4782-bccd-8436305a5ae9

vdsm:

Thread-3920::INFO::2013-04-30 12:40:31,451::logUtils::42::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'e6337610-1662-4c07-97c3-84fae53e36da': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e6337610-1662-4c07-97c3-84fae53e36da'}, '31b39f81-27c2-4782-bccd-8436305a5ae9': {'code': 411, 'message': u'31b39f81-27c2-4782-bccd-8436305a5ae9', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '31b39f81-27c2-4782-bccd-8436305a5ae9'}, '316e828e-0fd2-454e-b2a2-67e35267d287': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '316e828e-0fd2-454e-b2a2-67e35267d287'}, 'a7dc15a0-795f-4bc1-9d6b-96e5758e2de3': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'a7dc15a0-795f-4bc1-9d6b-96e5758e2de3'}, '6fb93406-7b3a-4975-beb1-1099fcc7fae4': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '6fb93406-7b3a-4975-beb1-1099fcc7fae4'}, 'cb240b63-afe1-4edc-a063-008aa670ba28': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'cb240b63-afe1-4edc-a063-008aa670ba28'}, '0340452b-0c31-42d9-9c37-adc0443bf87a': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '0340452b-0c31-42d9-9c37-adc0443bf87a'}}}

Version-Release number of selected component (if applicable):
sf14

Steps to Reproduce:
1. Run 10 VMs from a blank template on iSCSI storage with two hosts; run all VMs on the HSM host.
2. Live-storage-migrate all the VMs' disks, one at a time.

Actual results:
VDSM restarted due to a timeout on storage, and the task was stopped in VDSM with an unexpected exception. The engine keeps reporting the same HSMGetAllTasksStatusesVDS error, although VDSM reports the task as finished.

Expected results:
The task should be cleared properly.

Additional info:
full logs
We have two tasks related to entity f606e454-8c06-4b95-9e35-70a9c8aa8534; each task belongs to a different live storage migration flow. The first task is fada3067-7741-486a-b479-865cdd46c6ef and the last task is 31b39f81-27c2-4782-bccd-8436305a5ae9.

The first task (fada3067-7741-486a-b479-865cdd46c6ef) is polled at 2013-04-30 11:16:29 and finally cleared and removed from the DB at 2013-04-30 11:20:03.

The other task, 31b39f81-27c2-4782-bccd-8436305a5ae9, is attached to the entity at 2013-04-30 11:19:43,318, and we start polling it one millisecond after we finish clearing the fada3067... task, at 2013-04-30 11:20:03,248.

When 31b39f81... ends, the log shows that we log the end of the task (BaseAsyncTask::LogEndTaskFailure). The next step should be to end the command and clear the task, but the engine never gets there.

Looking into the source code, the method called after that log is EntityAsyncTask#EndActionIfNecessary. We should have seen a log line saying "EntityAsyncTask::EndActionIfNecessary: All tasks of entity '{0}' has ended", but we didn't. That implies that not all tasks had actually ended - but why? Apparently not all tasks were cleared in the previous operation on entity f606e454-8c06-4b95-9e35-70a9c8aa8534; that is probably why we can't end the command and clear the task, and also why we get the BaseAsyncTask::LogEndTaskFailure message in an endless loop.

I added a patch at http://gerrit.ovirt.org/#/c/14433/ that will indicate which tasks prevent the process from ending.
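To make the gating behavior concrete, here is a minimal, self-contained sketch of the "end the action only when every task of the entity has ended" check described above. This is not the actual engine code; EndActionGateDemo, TaskState, EntityTasks and endActionIfNecessary are invented stand-in names, but the point is the same: one task stuck in a non-ended state keeps the failure being logged on every poll cycle.

import java.util.ArrayList;
import java.util.List;

public class EndActionGateDemo {
    enum TaskState { POLLING, ENDED, CLEARED }

    static class Task {
        final String id;
        TaskState state;
        Task(String id, TaskState state) { this.id = id; this.state = state; }
    }

    static class EntityTasks {
        final List<Task> tasks = new ArrayList<>();

        // The action for the entity is ended only when every attached task has ended.
        boolean endActionIfNecessary() {
            for (Task t : tasks) {
                if (t.state == TaskState.POLLING) {
                    // One lingering task is enough to keep the entity "open", so the
                    // failure for the finished task is logged again on every poll cycle.
                    System.out.println("Not ending action: task " + t.id + " is still " + t.state);
                    return false;
                }
            }
            System.out.println("All tasks of the entity ended; ending action and clearing tasks");
            return true;
        }
    }

    public static void main(String[] args) {
        EntityTasks entity = new EntityTasks();
        entity.tasks.add(new Task("31b39f81", TaskState.ENDED));   // the cleanSuccess task
        entity.tasks.add(new Task("fada3067", TaskState.POLLING)); // a task never marked ended
        entity.endActionIfNecessary();                             // never ends the action
    }
}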
We need to figure out the following:
1) Why the task did not proceed to end the action (probably because of related tasks that were not cleaned).
2) (Depends on 1) Which tasks were not cleaned - http://gerrit.ovirt.org/#/c/14433/ should help here.
3) If 2 is indeed the case, why the task was not cleared from the entity's _listTasks list.
I couldn't reproduce this scenario. I ran 10 VMs over HSM with 2 domains and moved all disks one by one. As described in the bug description, during this process the communication to storage should break and VDSM should restart; that didn't happen on my environment, so I manually restarted VDSM.

It looks like it all happened due to a race: EntityMultiAsyncTasks::ShouldEndAction is called from EndActionIfNecessary, which is called after HandleEndedTask; that should set all tasks to the 'ended' state. Apparently, we got to ShouldEndAction with a task that was not 'ended', otherwise we would have reached endAction and cleaned all the tasks.

3.2 and upstream are practically the same in these sections, so nothing changed between the versions I checked. Maor suggested adding a log comment that might help us if this happens again while testing.

From the VDSM log we see that all task results are "success", except the one with the exception, which stays on 'cleanSuccess' in every getAllStatuses request. That leads to the log we see in the engine; nothing should change until endCommand is called.
After the following error, the error code of the task is set to 411 (TaskAborted):

31b39f81-27c2-4782-bccd-8436305a5ae9::ERROR::2013-04-30 11:20:06,451::task::850::TaskManager.Task::(_setError) Task=`31b39f81-27c2-4782-bccd-8436305a5ae9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1798, in syncImageData
    syncType)
  File "/usr/share/vdsm/storage/image.py", line 682, in syncData
    {'srcChain': srcChain, 'dstChain': dstChain})
  File "/usr/share/vdsm/storage/image.py", line 570, in _interImagesCopy
    srcFmtStr, dstFmtStr)
  File "/usr/lib64/python2.6/site-packages/vdsm/qemuImg.py", line 114, in convert
    ioclass=misc.IOCLASS.IDLE)
  File "/usr/share/vdsm/storage/misc.py", line 286, in watchCmd
    raise se.ActionStopped()
ActionStopped: Action was stopped: ()

VDSM keeps reporting error code 411, the engine sets it as the task's exception, and we get:

2013-04-30 12:24:58,045 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-36) BaseAsyncTask::LogEndTaskFailure: Task 31b39f81-27c2-4782-bccd-8436305a5ae9 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = 31b39f81-27c2-4782-bccd-8436305a5ae9,
-- Exception: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = 31b39f81-27c2-4782-bccd-8436305a5ae9

After this report we see all the other tasks' statuses.

The questions are:
1. Even with that exception, should we set the task to ended and call endAction? It seems so, but it still looks like one of the tasks is missing, otherwise the engine would have called endCommand.
2. How many tasks should there be for each LiveMigrateDisk command? Is it a multi-entity async task or just an EntityAsyncTask? Which task is related to which part of the command?
3. If one task's status is not reported by VDSM, it should be set to unknown status and handled accordingly. Did that happen?
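To illustrate why the same error repeats, here is a small sketch under stated assumptions (TaskStatus and endedSuccessfully are invented names, not the real engine logic): a task that VDSM reports as 'finished' with result 'cleanSuccess' and code 411 is interpreted as a failure on every poll, and the report cannot change until endCommand runs on the engine side.

public class TaskStatusInterpretation {
    static class TaskStatus {
        String taskState;   // e.g. "finished"
        String taskResult;  // e.g. "success", "cleanSuccess"
        int code;           // 0 = OK; 411 = TaskAborted in this log
        String message;
    }

    static boolean endedSuccessfully(TaskStatus s) {
        // The task is "finished" either way; what decides success vs. failure
        // is the combination of result and code (assumed interpretation).
        return "finished".equals(s.taskState)
                && "success".equals(s.taskResult)
                && s.code == 0;
    }

    public static void main(String[] args) {
        TaskStatus aborted = new TaskStatus();
        aborted.taskState = "finished";
        aborted.taskResult = "cleanSuccess";
        aborted.code = 411; // TaskAborted
        aborted.message = "31b39f81-27c2-4782-bccd-8436305a5ae9";
        // Every poll returns the same status, so the same failure is logged every time
        // until endCommand clears the task on the engine side.
        System.out.println("Ended successfully: " + endedSuccessfully(aborted)); // false
    }
}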
Yaniv, you are confusing two terms here. EntityAsyncTask is a base class of SPMAsyncTask, which in the task-manager design means the task is associated with an entity. EntityMultiAsyncTasks is a container of EntityAsyncTasks; it has a containerId field, which is the entity ID, and a currentActionType field, which represents the current action that "task coordination" is performed for (the tasks being polled, i.e. their task state is no longer WaitingForPoll). EntityAsyncTask has a static map, _multiTasksByEntities, which maps containerId (entityId) to EntityMultiAsyncTasks (the container of tasks per entity + action).

IMHO, what you should figure out is, for each part of LiveMigrateDisk, which tasks are executed and what their entityId is.
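For reference, a rough, self-contained sketch of that relationship (simplified stand-in names, not the real engine classes): a static map from entity ID to a per-entity container that records the action the tasks are coordinated for and the tasks attached to it.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TaskContainerSketch {
    enum ActionType { LiveMigrateDisk }

    static class EntityTask {            // stands in for EntityAsyncTask / SPMAsyncTask
        final String taskId;
        EntityTask(String taskId) { this.taskId = taskId; }
    }

    static class EntityMultiTasks {      // stands in for EntityMultiAsyncTasks
        final String containerId;        // the entity ID
        final ActionType currentActionType;
        final List<EntityTask> listTasks = new ArrayList<>();
        EntityMultiTasks(String containerId, ActionType action) {
            this.containerId = containerId;
            this.currentActionType = action;
        }
    }

    // stands in for the static _multiTasksByEntities map: entityId -> container
    static final Map<String, EntityMultiTasks> multiTasksByEntities = new ConcurrentHashMap<>();

    static void attachTask(String entityId, ActionType action, String taskId) {
        multiTasksByEntities
                .computeIfAbsent(entityId, id -> new EntityMultiTasks(id, action))
                .listTasks.add(new EntityTask(taskId));
    }

    public static void main(String[] args) {
        String entityId = "f606e454-8c06-4b95-9e35-70a9c8aa8534";
        attachTask(entityId, ActionType.LiveMigrateDisk, "fada3067-7741-486a-b479-865cdd46c6ef");
        attachTask(entityId, ActionType.LiveMigrateDisk, "31b39f81-27c2-4782-bccd-8436305a5ae9");
        System.out.println("Tasks attached to entity: "
                + multiTasksByEntities.get(entityId).listTasks.size());
    }
}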
By following tasks e633, cf09, and cb24, we see that:
e633 starts, ends, and fails during the clear (VDSM throws an exception) - stays on cleanFailed.
cf09 starts afterwards and fails - the attempts to clear it and to remove it from the DB both succeed - stays on cleared.
cb24 also starts afterwards and ends with success - also cleared.

The last step checks whether all of the entity's related tasks are cleared or clearFailed. We suspect that somewhere between the handling of e633 and cb24, the state of cf09 changed from cleared. In that case, it might be that the task was removed from the AsyncTaskManager::_tasks list but not from its EntityMultiAsyncTasks::_listTasks. As a result, the AsyncTaskManager no longer polls this task, and each call to handleEndActionResult won't clean the containerId, because cf09 still exists there in a state that is neither cleared nor clearFailed (see the sketch below).
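A minimal sketch of that suspected inconsistency, with made-up class and method names (this is not the actual AsyncTaskManager code): if a task is removed from the manager-level map but its entry in the entity's task list is neither removed nor marked cleared/clearFailed, the container can never be cleaned.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClearSyncSketch {
    enum State { CLEARED, CLEAR_FAILED, ENDED }

    static class Task {
        final String id;
        State state = State.ENDED;
        Task(String id) { this.id = id; }
    }

    // manager-level map: taskId -> task (stands in for AsyncTaskManager::_tasks)
    static final Map<String, Task> managerTasks = new HashMap<>();
    // per-entity list (stands in for EntityMultiAsyncTasks::_listTasks)
    static final List<Task> entityListTasks = new ArrayList<>();

    // Hypothetical buggy clear path assumed by the analysis above: the task disappears
    // from the manager map (so it is never polled again), but its copy in the entity
    // list is neither removed nor marked CLEARED/CLEAR_FAILED.
    static void buggyClear(Task t) {
        managerTasks.remove(t.id);
        // missing: entityListTasks.remove(t) or t.state = State.CLEARED
    }

    static boolean canCleanContainer() {
        // handleEndActionResult-style check: the container is removed only when every
        // task in the entity list is cleared or failed to clear.
        return entityListTasks.stream()
                .allMatch(t -> t.state == State.CLEARED || t.state == State.CLEAR_FAILED);
    }

    public static void main(String[] args) {
        Task cf09 = new Task("cf09");
        managerTasks.put(cf09.id, cf09);
        entityListTasks.add(cf09);
        buggyClear(cf09);
        System.out.println("Container can be cleaned: " + canCleanContainer()); // false, forever
    }
}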
The patch http://gerrit.ovirt.org/#/c/14433 has been merged to allow better tracking and to locate the missing uncleared task, as suspected. So far I have not managed to reproduce the bug by following the bug description several times. If possible, please try to reproduce the scenario again, or add more details to the description, so we can investigate the case more deeply.
According to comment 7, the bug does not reproduce. If you are able to reproduce it, please reopen with updated steps and logs.