Bug 1129858 - Task State Incorrect When Workers Shutdown or are Killed while Celerybeat is not Running
Summary: Task State Incorrect When Workers Shutdown or are Killed while Celerybeat is ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Pulp
Classification: Retired
Component: async/tasks
Version: Master
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: ---
: 2.4.1
Assignee: Brian Bouterse
QA Contact: Preethi Thomas
URL:
Whiteboard:
Depends On:
Blocks: 1129870 1131719
TreeView+ depends on / blocked
 
Reported: 2014-08-13 19:39 UTC by Brian Bouterse
Modified: 2014-09-23 17:54 UTC (History)
3 users (show)

Fixed In Version:
Clone Of:
: 1129870 (view as bug list)
Environment:
Last Closed: 2014-09-23 17:54:15 UTC
Embargoed:


Attachments (Terms of Use)

Description Brian Bouterse 2014-08-13 19:39:23 UTC
Description of problem: It is possible for a pulp task state to get stuck in "in progress" or "waiting", although in some cases that work will never be moved to a final state even though it will never be "picked up". The reason work that is dispatched won't be "picked up" is documented here [0]. This could easily occur if all services are on one machine and that machine looses power while performing a task, but starts up again in less that 5 minutes.


Version-Release number of selected component (if applicable): Pulp 2.4


How reproducible: always


Steps to Reproduce:
1. Ensure pulp_celerybeat is running, pulp_resource_manager is running, httpd, is running, and *exactly* one worker is running (ie: worker0).
2. Dispatch long-running syncs for two different repositories so that one will get started by the worker, and the other will be dispatched to the worker but will not start.
3. Stop pulp_celerybeat while the first sync is still running
4. Forcefully restart worker0 (kill -9 then start the worker again).
5. Start pulp_celerybeat again (needs to be < 5 minutes of when you stopped celerybeat in step 3.
6. List the status of the tasks and observe that one is in progress, and the second is waiting. These are both incorrect because both tasks will never start or cancel. Pulp's task state will never get updated unless the user knows to magically cancel the task. Since it's running or waiting, canceling the task as the resolution is not obvious.


Expected results:
I expected the pulp task state of work that is "lost" in the system due to a worker restart to be marked as cancelled. This should happen regardless of whether pulp_celerybeat is running or not.


Additional info:
[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129758

Comment 1 Brian Bouterse 2014-08-13 19:56:54 UTC
Note: workers already handle their own task state to in progress when tasks are started, and to completed when tasks are finished.

The recommended fix is multipart:

1) Add a worker startup behavior to workers that cancel all in progress or completed tasks assigned to a worker
2) Add an identical behavior for worker graceful shutdown
3) Adjust the graceful shutdown monitoring in pulp_celerybeat so that if a graceful shutdown of a worker is observed that pulp_celerybeat does NOT attempt to also cancel those tasks given that step 2 is taking care of that.
4) Ensure that pulp_celerybeat does still any and all releasing of reserved_resources for the worker that gracefully shut down. This fix only has the worker updating task status, it never requests that resources be deleted to avoid a potential race condition with new inbound work.

The cancelled tasks would then mirror the actual situation whereby the work will never be picked up if a worker hasn't completed it before it shuts down for any reason.

It is important that non-graceful worker detection behavior stay exactly as it is today (release reserved_resource AND cancel any outstanding tasks for that worker) to correctly handle the case where a worker is killed with kill -9 and doesn't update the task status itself. This cancellation would not occur until celerybeat runs at least 5 minutes after the worker last issued a heartbeat.

Comment 2 Brian Bouterse 2014-08-13 20:16:50 UTC
This BZ is very much related to [0], and probably should be handled at the same time.

[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1120270

Comment 3 Randy Barlow 2014-08-13 22:26:48 UTC
(In reply to bbouters from comment #1)
> 1) Add a worker startup behavior to workers that cancel all in progress or
> completed tasks assigned to a worker

I think it's actually important to go through everything the resource manager would do to delete a worker here. In addition to cancelling the tasks, the bookkeeping must also happen on the reserved_resources_collection so that resources aren't still mapped to the new worker.

Comment 4 Brian Bouterse 2014-08-22 18:54:59 UTC
PR available at:  https://github.com/pulp/pulp/pull/1124

Comment 5 Brian Bouterse 2014-08-22 20:20:32 UTC
Merged to 2.4.1-dev -> 2.4-dev -> master

Comment 6 Randy Barlow 2014-08-23 03:24:53 UTC
This was fixed in pulp-2.4.1-0.3.beta.

Comment 7 Preethi Thomas 2014-08-25 21:23:11 UTC
verified
[root@cloud-qe-13 ~]# rpm -qa pulp-server
pulp-server-2.4.1-0.3.beta.el6.noarch
[root@cloud-qe-13 ~]# 

with Brian's help I followed the steps from the original description


[root@cloud-qe-13 ~]# pulp-admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+



Operations:  sync
Resources:   rhel65 (repository)
State:       Running
Start Time:  2014-08-25T20:45:30Z
Finish Time: Incomplete
Task Id:     cbcaded4-c8d2-40bc-8498-26c94d31ab99

Operations:  sync
Resources:   rhel7 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     3a89488f-ac7a-444b-a4f1-d422c83b5645



After cancellation and restart



[root@cloud-qe-13 ~]# pulp-admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+



Operations:  sync
Resources:   rhel65 (repository)
State:       Running
Start Time:  2014-08-25T20:45:30Z
Finish Time: Incomplete
Task Id:     cbcaded4-c8d2-40bc-8498-26c94d31ab99

Operations:  sync
Resources:   rhel7 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     3a89488f-ac7a-444b-a4f1-d422c83b5645


[root@cloud-qe-13 ~]# pulp-admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  sync
Resources:   pulp (repository)
State:       Successful
Start Time:  2014-08-25T17:16:56Z
Finish Time: 2014-08-25T17:16:57Z
Task Id:     0042f483-f872-43f4-a39c-ee7924f1570e

Operations:  publish
Resources:   pulp (repository)
State:       Successful
Start Time:  2014-08-25T17:16:57Z
Finish Time: 2014-08-25T17:16:57Z
Task Id:     c664f547-16ec-49e5-8936-d7b028a954d6

Operations:  publish
Resources:   pulp (repository)
State:       Failed
Start Time:  2014-08-25T17:18:38Z
Finish Time: 2014-08-25T17:18:38Z
Task Id:     739b6100-2fb6-4c1e-b8f8-8a8b23ef2562

Operations:  publish
Resources:   pulp (repository)
State:       Successful
Start Time:  2014-08-25T17:22:08Z
Finish Time: 2014-08-25T17:22:09Z
Task Id:     6a82a2a0-a964-425f-84c7-f299f350d4e5

Operations:  unit_update
Resources:   node1 (consumer)
State:       Successful
Start Time:  2014-08-25T17:25:04Z
Finish Time: 2014-08-25T17:25:05Z
Task Id:     bcbe8c80-6e4f-4294-b5de-2d0d9ef0fb69

Operations:  unit_update
Resources:   node1 (consumer)
State:       Successful
Start Time:  2014-08-25T17:25:21Z
Finish Time: 2014-08-25T17:25:21Z
Task Id:     2da29d01-477f-4534-9fe6-c95674237ac5

Operations:  unit_update
Resources:   node1 (consumer)
State:       Successful
Start Time:  2014-08-25T17:25:53Z
Finish Time: 2014-08-25T17:26:00Z
Task Id:     64f95904-85c0-4529-b230-081b814f96d9

Operations:  sync
Resources:   rhel65 (repository)
State:       Failed
Start Time:  2014-08-25T20:05:40Z
Finish Time: 2014-08-25T20:05:42Z
Task Id:     1d9cfdb7-50ba-48c2-9e07-fd880bb95a39

Operations:  sync
Resources:   rhel65 (repository)
State:       Cancelled
Start Time:  2014-08-25T20:08:50Z
Finish Time: 2014-08-25T20:27:57Z
Task Id:     2151b068-ca35-4370-a7bb-908a561ecd65

Operations:  sync
Resources:   rhel65-ks (repository)
State:       Cancelled
Start Time:  2014-08-25T20:15:35Z
Finish Time: 2014-08-25T20:17:26Z
Task Id:     39b4815b-6106-4aaa-add6-ecb69cb4e4ca

Operations:  sync
Resources:   rhel7 (repository)
State:       Cancelled
Start Time:  2014-08-25T20:19:40Z
Finish Time: 2014-08-25T20:27:37Z
Task Id:     e32a19c0-76d0-4896-b3b1-58a9b338be83

Operations:  sync
Resources:   rhel65 (repository)
State:       Running
Start Time:  2014-08-25T20:45:30Z
Finish Time: Incomplete
Task Id:     cbcaded4-c8d2-40bc-8498-26c94d31ab99

Operations:  sync
Resources:   rhel7 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     3a89488f-ac7a-444b-a4f1-d422c83b5645


[root@cloud-qe-13 ~]# pulp-admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  sync
Resources:   pulp (repository)
State:       Successful
Start Time:  2014-08-25T17:16:56Z
Finish Time: 2014-08-25T17:16:57Z
Task Id:     0042f483-f872-43f4-a39c-ee7924f1570e

Operations:  publish
Resources:   pulp (repository)
State:       Successful
Start Time:  2014-08-25T17:16:57Z
Finish Time: 2014-08-25T17:16:57Z
Task Id:     c664f547-16ec-49e5-8936-d7b028a954d6

Operations:  publish
Resources:   pulp (repository)
State:       Failed
Start Time:  2014-08-25T17:18:38Z
Finish Time: 2014-08-25T17:18:38Z
Task Id:     739b6100-2fb6-4c1e-b8f8-8a8b23ef2562

Operations:  publish
Resources:   pulp (repository)
State:       Successful
Start Time:  2014-08-25T17:22:08Z
Finish Time: 2014-08-25T17:22:09Z
Task Id:     6a82a2a0-a964-425f-84c7-f299f350d4e5

Operations:  unit_update
Resources:   node1 (consumer)
State:       Successful
Start Time:  2014-08-25T17:25:04Z
Finish Time: 2014-08-25T17:25:05Z
Task Id:     bcbe8c80-6e4f-4294-b5de-2d0d9ef0fb69

Operations:  unit_update
Resources:   node1 (consumer)
State:       Successful
Start Time:  2014-08-25T17:25:21Z
Finish Time: 2014-08-25T17:25:21Z
Task Id:     2da29d01-477f-4534-9fe6-c95674237ac5

Operations:  unit_update
Resources:   node1 (consumer)
State:       Successful
Start Time:  2014-08-25T17:25:53Z
Finish Time: 2014-08-25T17:26:00Z
Task Id:     64f95904-85c0-4529-b230-081b814f96d9

Operations:  sync
Resources:   rhel65 (repository)
State:       Failed
Start Time:  2014-08-25T20:05:40Z
Finish Time: 2014-08-25T20:05:42Z
Task Id:     1d9cfdb7-50ba-48c2-9e07-fd880bb95a39

Operations:  sync
Resources:   rhel65 (repository)
State:       Cancelled
Start Time:  2014-08-25T20:08:50Z
Finish Time: 2014-08-25T20:27:57Z
Task Id:     2151b068-ca35-4370-a7bb-908a561ecd65

Operations:  sync
Resources:   rhel65-ks (repository)
State:       Cancelled
Start Time:  2014-08-25T20:15:35Z
Finish Time: 2014-08-25T20:17:26Z
Task Id:     39b4815b-6106-4aaa-add6-ecb69cb4e4ca

Operations:  sync
Resources:   rhel7 (repository)
State:       Cancelled
Start Time:  2014-08-25T20:19:40Z
Finish Time: 2014-08-25T20:27:37Z
Task Id:     e32a19c0-76d0-4896-b3b1-58a9b338be83

Operations:  sync
Resources:   rhel65 (repository)
State:       Cancelled
Start Time:  2014-08-25T20:45:30Z
Finish Time: Incomplete
Task Id:     cbcaded4-c8d2-40bc-8498-26c94d31ab99

Operations:  sync
Resources:   rhel7 (repository)
State:       Cancelled
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     3a89488f-ac7a-444b-a4f1-d422c83b5645


[root@cloud-qe-13 ~]#

Comment 8 Randy Barlow 2014-09-23 17:54:15 UTC
This is fixed in Pulp-2.4.1-1.


Note You need to log in before you can comment on or make changes to this bug.