Bug 1751881 - Possible faulty storage task state transition on task abort
Summary: Possible faulty storage task state transition on task abort
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ovirt-4.4.0
: ---
Assignee: Amit Bawer
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-09-12 20:36 UTC by Amit Bawer
Modified: 2019-10-01 17:18 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-01 17:18:51 UTC
oVirt Team: Storage
Embargoed:
sbonazzo: ovirt-4.4?


Attachments (Terms of Use)
Good test run log (18.95 KB, text/plain)
2019-09-12 22:02 UTC, Amit Bawer
no flags Details
Bad test run log (19.46 KB, text/plain)
2019-09-12 22:03 UTC, Amit Bawer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 103339 0 master MERGED task: Fix race when aborting tasks 2020-07-08 08:22:16 UTC

Description Amit Bawer 2019-09-12 20:36:58 UTC
Description of problem: When running a storage task under taskManager (with recoveryPolicy set to "auto"), abort sequence may lead to faulty task state transition update attempt from 'aborting' to 'raborting'. This issue is reproduced non-deterministically (i.e. 40% of the times) from CI unit testing, on either py2 / py3 storage tests.

_____________________________ test_stop_clear_task _____________________________

    def test_stop_clear_task():
        called_object = Callable(hang_around=True)
        with task_manager() as tm:
            task_object = task.Task(id="task-id")
            task_object.prepare(tm.scheduleJob,
                                "test",
                                None,
                                task_object,
                                "test-job",
                                called_object)
            called_object.wait_until_running()
            assert "task-id" in tm.getAllTasks()
    
            assert called_object.is_finished() is False
            with task_object.abort_callback(called_object.finish):
>               tm.stopTask("task-id")

storage/taskmanager_test.py:123: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../lib/vdsm/storage/taskManager.py:157: in stopTask
    t.stop(force=force)
../lib/vdsm/storage/task.py:1275: in stop
    self._decref(force)
../lib/vdsm/storage/task.py:1006: in _decref
    self._doAbort(force)
../lib/vdsm/storage/task.py:958: in _doAbort
    self._updateState(State.raborting)
../lib/vdsm/storage/task.py:609: in _updateState
    self.state.moveto(state, force)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <vdsm.storage.task.State instance at 0x7f3fb5ed8b48>, state = 'raborting'
force = False

    def moveto(self, state, force=False):
        if state not in self._moveto:
            raise ValueError("not a valid target state: %s" % state)
        if not force and self.state not in self._moveto[state]:
            raise se.TaskStateTransitionError("from %s to %s" %
>                                             (self.state, state))
E           TaskStateTransitionError: cannot move task to requested state: ('from aborting to raborting',)

../lib/vdsm/storage/task.py:225: TaskStateTransitionError
---------------------------- Captured stderr setup -----------------------------
dsm_standard-check-patch/vdsm/lib/vdsm/storage/task.py", line 612, in _updateState
    self.persist()
  File "/home/jenkins/workspace/vdsm_standard-check-patch/vdsm/lib/vdsm/storage/task.py", line 1146, in persist
    self._save(self.store)
  File "/home/jenkins/workspace/vdsm_standard-check-patch/vdsm/lib/vdsm/storage/task.py", line 786, in _save
    getProcPool().fileUtils.cleanupdir(origTaskDir + BACKUP_EXT)
  File "/home/jenkins/workspace/vdsm_standard-check-patch/vdsm/lib/vdsm/storage/outOfProcess.py", line 126, in cleanupdir
    if _IOProcessOs(self._iop).path.isdir(fullpath):
  File "/home/jenkins/workspace/vdsm_standard-check-patch/vdsm/lib/vdsm/storage/outOfProcess.py", line 253, in isdir
    res = self._iop.stat(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 471, in stat
    resdict = self._sendCommand("stat", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 439, in _sendCommand
    self._pingPoller()
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 370, in _pingPoller
    raise Closed("Client %s was closed" % self.name)
Closed: Client Global was closed
2019-09-12 15:59:14,125 DEBUG (tasks/0) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:913)
2019-09-12 15:59:14,125 DEBUG (tasks/0) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:950)
2019-09-12 15:59:14,126 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Task.run: exit - success: result  (task:927)
2019-09-12 15:59:14,126 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') ref 0 aborting False (task:1004)

Version-Release number of selected component (if applicable):


How reproducible: 40% of CI test runs with test patch https://gerrit.ovirt.org/#/c/103320/1/


Steps to Reproduce:
1. Run test_stop_clear_task test repeatedly.
2.
3.

Actual results: Wrong state transition exception:   TaskStateTransitionError: cannot move task to requested state: ('from aborting to raborting',)


Expected results: No errors on task abort flow.


Additional info:

Comment 1 Amit Bawer 2019-09-12 20:53:06 UTC
Since issue occurs by chance, we can compare "good" and "bad" executions of test_stop_clear_task.

"good" flow:

2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60)
2019-09-12 20:41:42,534 INFO  (tasks/0) [storage.ThreadPool.WorkerThread] START task task-id (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f6e8293f048>>, args=None) (threadPool:208)
2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') committing task: task-id (task:1212)
2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state queued -> state running (task:604)
2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Task.run: running job 0: test-job: <storage.storagetestlib.Callable object at 0x7f6e81470e10> (args: () kwargs: {}) (task:916)
2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Job.run: running test-job: <storage.storagetestlib.Callable object at 0x7f6e81470e10> (args: () kwargs: {}) callback None (task:335)
2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager] Entry. (taskManager:132)
2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager] Return: {'task-id': {'id': 'task-id', 'verb': '', 'state': 'running', 'code': 0, 'message': 'running job 1 of 1', 'result': '', 'tag': 'test'}} (taskManager:141)
2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager] Entry. taskID: task-id (taskManager:157)
2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') stopping in state running (force False) (task:1258)
2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state running -> state aborting instead of finished (task:607)
2019-09-12 20:41:42,535 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') ref 1 aborting False (task:1004)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') _aborting: recover policy auto (task:559)
2019-09-12 20:41:42,535 DEBUG (MainThread) [storage.TaskManager] Return. (taskManager:160)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state aborting -> state racquiring (task:604)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state racquiring -> state recovering (task:604)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') _recover (task:807)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') running recovery None (task:814)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Recover Done: state recovering (task:795)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state recovering -> state recovered (task:604)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:913)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:950)
2019-09-12 20:41:42,536 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Task.run: exit - success: result  (task:927)
2019-09-12 20:41:42,536 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') ref 0 aborting False (task:1004)
2019-09-12 20:41:42,536 INFO  (tasks/0) [storage.ThreadPool.WorkerThread] FINISH task task-id (threadPool:210)
2019-09-12 20:41:42,536 DEBUG (tasks/0) [storage.ThreadPool] Number of running tasks: 0 (threadPool:60)



"bad" flow:

2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60)
2019-09-12 20:42:02,012 INFO  (tasks/1) [storage.ThreadPool.WorkerThread] START task task-id (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7fc0a3d4bf28>>, args=None) (threadPool:208)
2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') committing task: task-id (task:1212)
2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') moving from state queued -> state running (task:604)
2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') Task.run: running job 0: test-job: <storage.storagetestlib.Callable object at 0x7fc0a2895e10> (args: () kwargs: {}) (task:916)
2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') Job.run: running test-job: <storage.storagetestlib.Callable object at 0x7fc0a2895e10> (args: () kwargs: {}) callback None (task:335)
2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager] Entry. (taskManager:132)
2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager] Return: {'task-id': {'id': 'task-id', 'verb': '', 'state': 'running', 'code': 0, 'message': 'running job 1 of 1', 'result': '', 'tag': 'test'}} (taskManager:141)
2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager] Entry. taskID: task-id (taskManager:157)
2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') stopping in state running (force False) (task:1258)
2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') moving from state running -> state aborting instead of finished (task:607)
2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') Task.run: exit - success: result  (task:927)
2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') ref 1 aborting True (task:1004)
2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') ref 0 aborting True (task:1004)
2019-09-12 20:42:02,013 INFO  (tasks/1) [storage.ThreadPool.WorkerThread] FINISH task task-id (threadPool:210)
2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') Task._doAbort: force False (task:939)
2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.ThreadPool] Number of running tasks: 0 (threadPool:60)
2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:950)
2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') moving from state aborting -> state raborting (task:604)
2019-09-12 20:42:02,013 WARNING (MainThread) [storage.TaskManager.Task] (Task='task-id') Task exited in non terminal state. Setting tasks as failed. (task:967)
2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') moving from state aborting -> state failed (task:604)

Comment 2 Amit Bawer 2019-09-12 21:59:01 UTC
At "good" flow, the Main thread which calls Task.stop() manages to finish the Task.stop() sequence before the task transitions into 'aborting' state
as the Task refcount is reduced to 1 and aborting flag is set back to 'false', meaning the abort callbacks already finished: 

2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') stopping in state running (force False) (task:1258)
2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state running -> state aborting instead of finished (task:607)
2019-09-12 20:41:42,535 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') ref 1 aborting False (task:1004)
2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') _aborting: recover policy auto (task:559)
2019-09-12 20:41:42,535 DEBUG (MainThread) [storage.TaskManager] Return. (taskManager:160)


At "bad" flow, the Main thread is still executing the Task.stop sequence while the Task first manages to transition into 'aborting' state:

2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') stopping in state running (force False) (task:1258)
2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') moving from state running -> state aborting instead of finished (task:607)
2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') Task.run: exit - success: result  (task:927)
2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') ref 1 aborting True (task:1004)
2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') ref 0 aborting True (task:1004)
2019-09-12 20:42:02,013 INFO  (tasks/1) [storage.ThreadPool.WorkerThread] FINISH task task-id (threadPool:210)
2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') Task._doAbort: force False (task:939)
2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.ThreadPool] Number of running tasks: 0 (threadPool:60)

In this case, the Task transition handler will not start the recovery sequence like in the "good" flow, since the Task refcount is not reduced to 1, at least not until the Main thread finishes the
abort callbacks, and the Task jobs run is over (invoked by Task.commit() setting the task into running state). In this case, the state of the Task is already 'aborting' once it is time 
to set its status by the doAbort sequence taking place from the Main thread, so it would fallback into 'raborting' state (there is a logic for "canAbort()" preventing falling into 'aborting' once already 'aborting').
Thus leading to the error.

Since the recoveryPolicy is set to "auto" it is expected to fall into recovery sequence in case the task is aborted [1][2], this can only take place if Task.refcount <= 1 when handling 'aborting' state. 
if the refcount is > 1 then recovery will be skipped[3] , and the "canAbort()" logic[4] will prevent from it to be called again by the state transition handler, as it would fallback into 'raborting' state.[5][6]

[1] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L42
[2] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L560
[3] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L556
[4] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L211
[5] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L597
[6] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L955

Comment 3 Amit Bawer 2019-09-12 22:02:46 UTC
Created attachment 1614661 [details]
Good test run log

Comment 4 Amit Bawer 2019-09-12 22:03:42 UTC
Created attachment 1614662 [details]
Bad test run log

Comment 5 Nir Soffer 2019-09-13 19:20:27 UTC
Is this a real flow used by vdsm?

Comment 6 Amit Bawer 2019-09-13 19:53:36 UTC
From what I am able to tell, stopTask() is an exposed API [1], so it seems possible to attempt to stop SPM operation carried out as taskManager jobs, given access to API and task id.

[1] https://github.com/oVirt/vdsm/search?q=stopTask&unscoped_q=stopTask

Comment 7 Amit Bawer 2019-10-01 17:18:51 UTC
Fix was merged to Vdsm master (4.4) branch and verified by 100 runs of the original failing test.


Note You need to log in before you can comment on or make changes to this bug.