Bug 958061 - engine: failure in HSMGetAllTasksStatusesVDS with error=task id and no error in vdsm (task reported as finished)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Ayal Baron
QA Contact:
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-04-30 09:45 UTC by Dafna Ron
Modified: 2016-02-10 18:02 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-08-14 08:49:27 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
abaron: Triaged+


Attachments
logs (2.25 MB, application/x-gzip), 2013-04-30 09:45 UTC, Dafna Ron

Description Dafna Ron 2013-04-30 09:45:47 UTC
Created attachment 741806 [details]
logs

Description of problem:

After a live storage migration failure, the engine keeps reporting that HSMGetAllTasksStatusesVDS failed with error=<task id>.
vdsm returns a reply for the task with no error (the task is reported as finished):

2013-04-30 12:27:09,069 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-100) BaseAsyncTask::LogEndTaskFailure: Task 31b39f81-27c2-4782-bccd-8436305a5ae9 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = 31b39f81-27c2-4782-bccd-8436305a5ae9,
-- Exception: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = 31b39f81-27c2-4782-bccd-8436305a5ae9

vdsm:

Thread-3920::INFO::2013-04-30 12:40:31,451::logUtils::42::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {
 'e6337610-1662-4c07-97c3-84fae53e36da': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e6337610-1662-4c07-97c3-84fae53e36da'},
 '31b39f81-27c2-4782-bccd-8436305a5ae9': {'code': 411, 'message': u'31b39f81-27c2-4782-bccd-8436305a5ae9', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '31b39f81-27c2-4782-bccd-8436305a5ae9'},
 '316e828e-0fd2-454e-b2a2-67e35267d287': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '316e828e-0fd2-454e-b2a2-67e35267d287'},
 'a7dc15a0-795f-4bc1-9d6b-96e5758e2de3': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'a7dc15a0-795f-4bc1-9d6b-96e5758e2de3'},
 '6fb93406-7b3a-4975-beb1-1099fcc7fae4': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '6fb93406-7b3a-4975-beb1-1099fcc7fae4'},
 'cb240b63-afe1-4edc-a063-008aa670ba28': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'cb240b63-afe1-4edc-a063-008aa670ba28'},
 '0340452b-0c31-42d9-9c37-adc0443bf87a': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '0340452b-0c31-42d9-9c37-adc0443bf87a'}}}

Version-Release number of selected component (if applicable):

sf14

Steps to Reproduce:
1. Run 10 VMs from the Blank template on iSCSI storage with two hosts; run all VMs on the HSM host.
2. Live storage migrate all of the VMs' disks, one at a time.
 
Actual results:

vdsm restarted due to a timeout on storage.
The task was stopped in vdsm with an unexpected exception.
The engine keeps reporting the same error for HSMGetAllTasksStatusesVDS, although vdsm reports the task as finished.

Expected results:

The task should be cleared properly.

Additional info: full logs

Comment 2 Maor 2013-05-06 12:37:52 UTC
We have two tasks related to entity f606e454-8c06-4b95-9e35-70a9c8aa8534.
Each task belongs to a different live storage migration flow.

The first task is fada3067-7741-486a-b479-865cdd46c6ef and the last task is 31b39f81-27c2-4782-bccd-8436305a5ae9.

The first task (fada3067-7741-486a-b479-865cdd46c6ef) is polled at 2013-04-30 11:16:29 and is finally cleared and removed from the DB at 2013-04-30 11:20:03.

The other task is 31b39f81-27c2-4782-bccd-8436305a5ae9. It is attached to the entity at 2013-04-30 11:19:43,318, and we start polling it one millisecond after we finish clearing the fada3067... task, at 2013-04-30 11:20:03,248.

When 31b39f81... ends, we can see that the end of the task is logged (BaseAsyncTask::LogEndTaskFailure). The next operation should be to end the command and clear the task, but the engine never gets there.

Looking into the source code, the procedure that is called after that log is EntityAsyncTask#EndActionIfNecessary. We were supposed to get a log entry saying
"EntityAsyncTask::EndActionIfNecessary: All tasks of entity '{0}' has ended", but we didn't.
That implies that not all tasks actually ended, but why?

Apparently not all tasks were cleared in the previous operation on entity f606e454-8c06-4b95-9e35-70a9c8aa8534. That is probably why we can't end the command and clear the task, and it is also why we keep getting this (BaseAsyncTask::LogEndTaskFailure) in an endless loop.
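
For illustration, a minimal sketch (simplified, assumed names, not the actual ovirt-engine classes) of the gate described above: the command can only be ended and its tasks cleared once every task attached to the entity has ended, so a single task that never reaches that state keeps the failure message repeating on every poll.

// Illustrative sketch only; assumed names, not the real engine source.
import java.util.Arrays;
import java.util.List;

class EndActionGateSketch {
    enum TaskState { POLLING, ENDED }

    static class Task {
        final String id;
        final TaskState state;
        Task(String id, TaskState state) { this.id = id; this.state = state; }
    }

    // endAction() (which ends the command and clears the tasks) runs only if
    // *all* tasks of the entity have ended; one task stuck in another state
    // keeps the engine logging the same end-task failure on every poll cycle.
    static boolean endActionIfNecessary(String entityId, List<Task> entityTasks) {
        for (Task t : entityTasks) {
            if (t.state != TaskState.ENDED) {
                return false; // the "All tasks of entity '{0}' has ended" log is never reached
            }
        }
        System.out.println("All tasks of entity '" + entityId + "' have ended");
        return true; // would continue to endAction() and task cleanup
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(
                new Task("31b39f81-27c2-4782-bccd-8436305a5ae9", TaskState.ENDED),
                new Task("fada3067-7741-486a-b479-865cdd46c6ef", TaskState.POLLING)); // never ended
        System.out.println(endActionIfNecessary("f606e454-8c06-4b95-9e35-70a9c8aa8534", tasks)); // false
    }
}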

I added a patch at http://gerrit.ovirt.org/#/c/14433/ that will indicate which tasks prevent the process from ending.

Comment 3 Maor 2013-05-06 13:02:49 UTC
We need to figure out the following:
1) Why the task did not proceed to end the action (probably because related tasks were not cleaned).
2) (Dependent on 1) Which tasks were not cleaned - http://gerrit.ovirt.org/#/c/14433/ should be helpful here.
3) If 2 is indeed the case, why the task was not cleared from the entity's _listTasks list.

Comment 4 Yaniv Bronhaim 2013-06-03 13:19:03 UTC
I couldn't reproduce this scenario.
I ran 10 VMs over HSM with 2 domains and moved all disks one by one.

As described in the bz description, during this process communication to storage should break and vdsm should restart. That didn't happen on my environment, so I manually restarted vdsm.

It seems this all happened due to a race: EntityMultiAsyncTasks::ShouldEndAction is called from EndActionIfNecessary, which is called after HandleEndedTask, and by that point all tasks should already be in the 'ended' state.

Apparently, we reached ShouldEndAction with a task that is not 'ended'; otherwise we would have reached endAction and cleaned all the tasks.

3.2 and upstream are quite the same in those sections, so this code hasn't changed between the versions I checked.
Maor suggested adding a comment that might help us if this happens again while testing.

From the VDSM log we see that all task results are "success" except the one that hit the exception, which stays at 'cleanSuccess' in every getAllTasksStatuses request. That leads to the log we see in the engine. Nothing will change until endCommand is called.

Comment 5 Yaniv Bronhaim 2013-06-04 15:48:55 UTC
After the following error, the task's error code is set to 411 (TaskAborted):

31b39f81-27c2-4782-bccd-8436305a5ae9::ERROR::2013-04-30 11:20:06,451::task::850::TaskManager.Task::(_setError) Task=`31b39f81-27c2-4782-bccd-8436305a5ae9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1798, in syncImageData
    syncType)
  File "/usr/share/vdsm/storage/image.py", line 682, in syncData
    {'srcChain': srcChain, 'dstChain': dstChain})
  File "/usr/share/vdsm/storage/image.py", line 570, in _interImagesCopy
    srcFmtStr, dstFmtStr)
  File "/usr/lib64/python2.6/site-packages/vdsm/qemuImg.py", line 114, in convert
    ioclass=misc.IOCLASS.IDLE)
  File "/usr/share/vdsm/storage/misc.py", line 286, in watchCmd
    raise se.ActionStopped()
ActionStopped: Action was stopped: ()

Vdsm keeps reporting error code 411, the engine turns it into the task's exception, and we get:
2013-04-30 12:24:58,045 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-36) BaseAsyncTask::LogEndTaskFailure: Task 31b39f81-27c2-4782-bccd-8436305a5ae9 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = 31b39f81-27c2-4782-bccd-8436305a5ae9,
-- Exception: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = 31b39f81-27c2-4782-bccd-8436305a5ae9

After this report we see the statuses of all the other tasks.
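
A minimal sketch of the mapping, under the assumption that the engine builds the VDSErrorException text from the non-zero status's 'message' field (the names below are illustrative, not the real engine code). For the aborted task, vdsm put the task id into that field alongside code 411, which would explain the confusing "error = <task id>" text in engine.log.

// Illustrative sketch only; assumed names, not the real engine code.
class TaskStatusToErrorSketch {
    static class TaskStatus {
        final int code;        // 0 = OK, non-zero = error (411 = TaskAborted here)
        final String message;  // for the aborted task, vdsm put the task id in this field
        final String taskResult;
        TaskStatus(int code, String message, String taskResult) {
            this.code = code; this.message = message; this.taskResult = taskResult;
        }
    }

    // If the status carries a non-zero code, raise a VDS error whose text embeds
    // the status message; hence "error = <task id>" in engine.log.
    static String toEngineError(TaskStatus status) {
        if (status.code == 0) {
            return null; // no error
        }
        return "VDSGenericException: VDSErrorException: Failed in vdscommand to "
                + "HSMGetAllTasksStatusesVDS, error = " + status.message;
    }

    public static void main(String[] args) {
        TaskStatus aborted = new TaskStatus(411, "31b39f81-27c2-4782-bccd-8436305a5ae9", "cleanSuccess");
        System.out.println(toEngineError(aborted));
    }
}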

The questions are:
1. Even with that exception, should we set the task to ended and call endAction? It seems so, but it still looks like one of the tasks is missing; otherwise the engine would have called endCommand.
2. How many tasks should there be for each LiveMigrateDisk command? Is it EntityMultiAsyncTasks or just EntityAsyncTask? Which task relates to which part of the command?
3. If one task's status is unreported by vdsm, it should be set to unknown status and handled accordingly. Did that happen?

Comment 6 Yair Zaslavsky 2013-06-04 15:55:23 UTC
Yaniv, you are confusing two terms here. EntityAsyncTask is a base class of SPMAsyncTask, which in the task manager design means that the task is associated with an entity.
EntityMultiAsyncTasks is a container of EntityAsyncTasks. It has a containerId field, which is the entityId, and a "currentActionType" field, which represents the current action that "task coordination" is performed for (the tasks that are being polled, i.e. their task state is no longer WaitingForPoll).

EntityAsyncTask has a static map, _multiTasksByEntities, from containerId (entityId) to EntityMultiAsyncTasks (the container of tasks per entity + action).

IMHO, what you should figure out is which tasks are executed in each part of LiveMigrateDisk, and what the entityId is for them.
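
For reference, a minimal sketch (simplified, assumed declarations, not the actual engine classes) of the relationships described above: a static map from containerId (entityId) to an EntityMultiAsyncTasks container, which holds the currentActionType and the list of EntityAsyncTasks for that entity.

// Illustrative sketch only; assumed and simplified declarations.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class TaskContainerSketch {
    enum ActionType { LiveMigrateDisk }

    static class EntityAsyncTask {          // one async task tied to an entity
        UUID taskId;
        String state;                       // e.g. WaitingForPoll, Polling, Ended, Cleared
    }

    static class EntityMultiAsyncTasks {    // container of all tasks for one entity + action
        UUID containerId;                   // == entityId
        ActionType currentActionType;       // the action currently being coordinated
        List<EntityAsyncTask> listTasks = new ArrayList<>();
    }

    // Static map from containerId (entityId) to its EntityMultiAsyncTasks container.
    static final Map<UUID, EntityMultiAsyncTasks> multiTasksByEntities = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        EntityMultiAsyncTasks container = new EntityMultiAsyncTasks();
        container.containerId = UUID.randomUUID();          // stands in for the entityId
        container.currentActionType = ActionType.LiveMigrateDisk;
        multiTasksByEntities.put(container.containerId, container);
        System.out.println(multiTasksByEntities.size());    // 1
    }
}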

Comment 7 Yaniv Bronhaim 2013-06-11 08:09:34 UTC
By following tasks e633, cf09, and cb24, we see that:
e633 starts, ends, and fails during the clear (vdsm throws an exception) - stays at cleanFailed.
cf09 starts afterwards and fails - it is cleared and removed from the DB, both successfully - stays at cleared.
cb24 also starts afterwards and ends with success - also cleared.

The last part checks whether all entity-related tasks are cleared or clearFailed.

We suspect that somewhere between the handling of e633 and cb24 the state of cf09 changed from cleared. In that case, it might be that the task was removed from the AsyncTaskManager::_tasks list but not from its EntityMultiAsyncTasks::_listTasks. As a result, the AsyncTaskManager no longer polls this task, and each call to handleEndActionResult will not clean the containerId, because cf09 still exists there in a state that is neither cleared nor clearFailed.
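
To make the suspicion concrete, here is a minimal sketch (assumed names and states, not the real code) of the inconsistency described above: a task dropped from the manager's polling map but still listed in the entity container, in a state other than cleared or clearFailed, blocks the container cleanup indefinitely.

// Illustrative sketch only; models the suspected inconsistency, not the real code.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StuckContainerSketch {
    enum State { POLLING, ENDED, CLEARED, CLEAR_FAILED }

    static class Task {
        final String id;
        State state;
        Task(String id, State state) { this.id = id; this.state = state; }
    }

    // AsyncTaskManager's view: the tasks that are still being polled.
    static final Map<String, Task> managerTasks = new HashMap<>();
    // The entity container's view: all tasks of the entity for the current action.
    static final List<Task> containerListTasks = new ArrayList<>();

    // handleEndActionResult-style check: the container is cleaned only when every
    // task in _listTasks is cleared or clearFailed.
    static boolean canCleanContainer() {
        for (Task t : containerListTasks) {
            if (t.state != State.CLEARED && t.state != State.CLEAR_FAILED) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Task cf09 = new Task("cf09...", State.ENDED); // suspected: no longer 'cleared'
        containerListTasks.add(cf09);
        // cf09 was removed from the manager's map, so it is never polled again...
        managerTasks.remove(cf09.id);
        // ...yet it blocks the container cleanup forever:
        System.out.println(canCleanContainer()); // false
    }
}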

Comment 9 Yaniv Bronhaim 2013-07-21 09:08:36 UTC
The patch http://gerrit.ovirt.org/#/c/14433 was merged to allow better tracking and to locate the missing uncleared task, as suspected.

So far I haven't managed to reproduce the bug despite following the bug description a few times. If possible, please try to reproduce the scenario again, or add more details to the description, so we'll be able to investigate the case more deeply.

Comment 13 Allon Mureinik 2013-08-14 08:49:27 UTC
According to comment 7, the bug does not reproduce.
If you're able to reproduce it, please reopen with updated steps and logs.

