Bug 1751881
Summary: | Possible faulty storage task state transition on task abort | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [oVirt] vdsm | Reporter: | Amit Bawer <abawer> | ||||||
Component: | Core | Assignee: | Amit Bawer <abawer> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Elad <ebenahar> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 4.40.0 | CC: | bugs, frolland, nsoffer | ||||||
Target Milestone: | ovirt-4.4.0 | Flags: | sbonazzo:
ovirt-4.4?
|
||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2019-10-01 17:18:51 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Description
Amit Bawer
2019-09-12 20:36:58 UTC
Since issue occurs by chance, we can compare "good" and "bad" executions of test_stop_clear_task. "good" flow: 2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60) 2019-09-12 20:41:42,534 INFO (tasks/0) [storage.ThreadPool.WorkerThread] START task task-id (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f6e8293f048>>, args=None) (threadPool:208) 2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') committing task: task-id (task:1212) 2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state queued -> state running (task:604) 2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Task.run: running job 0: test-job: <storage.storagetestlib.Callable object at 0x7f6e81470e10> (args: () kwargs: {}) (task:916) 2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Job.run: running test-job: <storage.storagetestlib.Callable object at 0x7f6e81470e10> (args: () kwargs: {}) callback None (task:335) 2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager] Entry. (taskManager:132) 2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager] Return: {'task-id': {'id': 'task-id', 'verb': '', 'state': 'running', 'code': 0, 'message': 'running job 1 of 1', 'result': '', 'tag': 'test'}} (taskManager:141) 2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager] Entry. taskID: task-id (taskManager:157) 2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') stopping in state running (force False) (task:1258) 2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state running -> state aborting instead of finished (task:607) 2019-09-12 20:41:42,535 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') ref 1 aborting False (task:1004) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') _aborting: recover policy auto (task:559) 2019-09-12 20:41:42,535 DEBUG (MainThread) [storage.TaskManager] Return. (taskManager:160) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state aborting -> state racquiring (task:604) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state racquiring -> state recovering (task:604) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') _recover (task:807) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') running recovery None (task:814) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Recover Done: state recovering (task:795) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state recovering -> state recovered (task:604) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:913) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:950) 2019-09-12 20:41:42,536 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') Task.run: exit - success: result (task:927) 2019-09-12 20:41:42,536 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') ref 0 aborting False (task:1004) 2019-09-12 20:41:42,536 INFO (tasks/0) [storage.ThreadPool.WorkerThread] FINISH task task-id (threadPool:210) 2019-09-12 20:41:42,536 DEBUG (tasks/0) [storage.ThreadPool] Number of running tasks: 0 (threadPool:60) "bad" flow: 2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60) 2019-09-12 20:42:02,012 INFO (tasks/1) [storage.ThreadPool.WorkerThread] START task task-id (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7fc0a3d4bf28>>, args=None) (threadPool:208) 2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') committing task: task-id (task:1212) 2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') moving from state queued -> state running (task:604) 2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') Task.run: running job 0: test-job: <storage.storagetestlib.Callable object at 0x7fc0a2895e10> (args: () kwargs: {}) (task:916) 2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') Job.run: running test-job: <storage.storagetestlib.Callable object at 0x7fc0a2895e10> (args: () kwargs: {}) callback None (task:335) 2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager] Entry. (taskManager:132) 2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager] Return: {'task-id': {'id': 'task-id', 'verb': '', 'state': 'running', 'code': 0, 'message': 'running job 1 of 1', 'result': '', 'tag': 'test'}} (taskManager:141) 2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager] Entry. taskID: task-id (taskManager:157) 2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') stopping in state running (force False) (task:1258) 2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') moving from state running -> state aborting instead of finished (task:607) 2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') Task.run: exit - success: result (task:927) 2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') ref 1 aborting True (task:1004) 2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') ref 0 aborting True (task:1004) 2019-09-12 20:42:02,013 INFO (tasks/1) [storage.ThreadPool.WorkerThread] FINISH task task-id (threadPool:210) 2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') Task._doAbort: force False (task:939) 2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.ThreadPool] Number of running tasks: 0 (threadPool:60) 2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:950) 2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') moving from state aborting -> state raborting (task:604) 2019-09-12 20:42:02,013 WARNING (MainThread) [storage.TaskManager.Task] (Task='task-id') Task exited in non terminal state. Setting tasks as failed. (task:967) 2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') moving from state aborting -> state failed (task:604) At "good" flow, the Main thread which calls Task.stop() manages to finish the Task.stop() sequence before the task transitions into 'aborting' state as the Task refcount is reduced to 1 and aborting flag is set back to 'false', meaning the abort callbacks already finished: 2019-09-12 20:41:42,534 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') stopping in state running (force False) (task:1258) 2019-09-12 20:41:42,534 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') moving from state running -> state aborting instead of finished (task:607) 2019-09-12 20:41:42,535 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') ref 1 aborting False (task:1004) 2019-09-12 20:41:42,535 DEBUG (tasks/0) [storage.TaskManager.Task] (Task='task-id') _aborting: recover policy auto (task:559) 2019-09-12 20:41:42,535 DEBUG (MainThread) [storage.TaskManager] Return. (taskManager:160) At "bad" flow, the Main thread is still executing the Task.stop sequence while the Task first manages to transition into 'aborting' state: 2019-09-12 20:42:02,012 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') stopping in state running (force False) (task:1258) 2019-09-12 20:42:02,012 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') moving from state running -> state aborting instead of finished (task:607) 2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') Task.run: exit - success: result (task:927) 2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='task-id') ref 1 aborting True (task:1004) 2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') ref 0 aborting True (task:1004) 2019-09-12 20:42:02,013 INFO (tasks/1) [storage.ThreadPool.WorkerThread] FINISH task task-id (threadPool:210) 2019-09-12 20:42:02,013 DEBUG (MainThread) [storage.TaskManager.Task] (Task='task-id') Task._doAbort: force False (task:939) 2019-09-12 20:42:02,013 DEBUG (tasks/1) [storage.ThreadPool] Number of running tasks: 0 (threadPool:60) In this case, the Task transition handler will not start the recovery sequence like in the "good" flow, since the Task refcount is not reduced to 1, at least not until the Main thread finishes the abort callbacks, and the Task jobs run is over (invoked by Task.commit() setting the task into running state). In this case, the state of the Task is already 'aborting' once it is time to set its status by the doAbort sequence taking place from the Main thread, so it would fallback into 'raborting' state (there is a logic for "canAbort()" preventing falling into 'aborting' once already 'aborting'). Thus leading to the error. Since the recoveryPolicy is set to "auto" it is expected to fall into recovery sequence in case the task is aborted [1][2], this can only take place if Task.refcount <= 1 when handling 'aborting' state. if the refcount is > 1 then recovery will be skipped[3] , and the "canAbort()" logic[4] will prevent from it to be called again by the state transition handler, as it would fallback into 'raborting' state.[5][6] [1] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L42 [2] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L560 [3] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L556 [4] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L211 [5] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L597 [6] https://github.com/oVirt/vdsm/blob/ce7fff9d0f54114594bba8d73fe6328fe1dd5ef9/lib/vdsm/storage/task.py#L955 Created attachment 1614661 [details]
Good test run log
Created attachment 1614662 [details]
Bad test run log
Is this a real flow used by vdsm? From what I am able to tell, stopTask() is an exposed API [1], so it seems possible to attempt to stop SPM operation carried out as taskManager jobs, given access to API and task id. [1] https://github.com/oVirt/vdsm/search?q=stopTask&unscoped_q=stopTask Fix was merged to Vdsm master (4.4) branch and verified by 100 runs of the original failing test. |