Bug 1129858
| Summary: | Task State Incorrect When Workers Shutdown or are Killed while Celerybeat is not Running | |||
|---|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | Brian Bouterse <bmbouter> | |
| Component: | async/tasks | Assignee: | Brian Bouterse <bmbouter> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Preethi Thomas <pthomas> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | high | |||
| Version: | Master | CC: | ipanova, pthomas, rbarlow | |
| Target Milestone: | --- | Keywords: | Triaged | |
| Target Release: | 2.4.1 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1129870 (view as bug list) | Environment: | ||
| Last Closed: | 2014-09-23 17:54:15 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1129870, 1131719 | |||
|
Description
Brian Bouterse
2014-08-13 19:39:23 UTC
Note: workers already handle their own task state to in progress when tasks are started, and to completed when tasks are finished. The recommended fix is multipart: 1) Add a worker startup behavior to workers that cancel all in progress or completed tasks assigned to a worker 2) Add an identical behavior for worker graceful shutdown 3) Adjust the graceful shutdown monitoring in pulp_celerybeat so that if a graceful shutdown of a worker is observed that pulp_celerybeat does NOT attempt to also cancel those tasks given that step 2 is taking care of that. 4) Ensure that pulp_celerybeat does still any and all releasing of reserved_resources for the worker that gracefully shut down. This fix only has the worker updating task status, it never requests that resources be deleted to avoid a potential race condition with new inbound work. The cancelled tasks would then mirror the actual situation whereby the work will never be picked up if a worker hasn't completed it before it shuts down for any reason. It is important that non-graceful worker detection behavior stay exactly as it is today (release reserved_resource AND cancel any outstanding tasks for that worker) to correctly handle the case where a worker is killed with kill -9 and doesn't update the task status itself. This cancellation would not occur until celerybeat runs at least 5 minutes after the worker last issued a heartbeat. This BZ is very much related to [0], and probably should be handled at the same time. [0]: https://bugzilla.redhat.com/show_bug.cgi?id=1120270 (In reply to bbouters from comment #1) > 1) Add a worker startup behavior to workers that cancel all in progress or > completed tasks assigned to a worker I think it's actually important to go through everything the resource manager would do to delete a worker here. In addition to cancelling the tasks, the bookkeeping must also happen on the reserved_resources_collection so that resources aren't still mapped to the new worker. PR available at: https://github.com/pulp/pulp/pull/1124 Merged to 2.4.1-dev -> 2.4-dev -> master This was fixed in pulp-2.4.1-0.3.beta. verified
[root@cloud-qe-13 ~]# rpm -qa pulp-server
pulp-server-2.4.1-0.3.beta.el6.noarch
[root@cloud-qe-13 ~]#
with Brian's help I followed the steps from the original description
[root@cloud-qe-13 ~]# pulp-admin tasks list
+----------------------------------------------------------------------+
Tasks
+----------------------------------------------------------------------+
Operations: sync
Resources: rhel65 (repository)
State: Running
Start Time: 2014-08-25T20:45:30Z
Finish Time: Incomplete
Task Id: cbcaded4-c8d2-40bc-8498-26c94d31ab99
Operations: sync
Resources: rhel7 (repository)
State: Waiting
Start Time: Unstarted
Finish Time: Incomplete
Task Id: 3a89488f-ac7a-444b-a4f1-d422c83b5645
After cancellation and restart
[root@cloud-qe-13 ~]# pulp-admin tasks list
+----------------------------------------------------------------------+
Tasks
+----------------------------------------------------------------------+
Operations: sync
Resources: rhel65 (repository)
State: Running
Start Time: 2014-08-25T20:45:30Z
Finish Time: Incomplete
Task Id: cbcaded4-c8d2-40bc-8498-26c94d31ab99
Operations: sync
Resources: rhel7 (repository)
State: Waiting
Start Time: Unstarted
Finish Time: Incomplete
Task Id: 3a89488f-ac7a-444b-a4f1-d422c83b5645
[root@cloud-qe-13 ~]# pulp-admin tasks list
+----------------------------------------------------------------------+
Tasks
+----------------------------------------------------------------------+
Operations: sync
Resources: pulp (repository)
State: Successful
Start Time: 2014-08-25T17:16:56Z
Finish Time: 2014-08-25T17:16:57Z
Task Id: 0042f483-f872-43f4-a39c-ee7924f1570e
Operations: publish
Resources: pulp (repository)
State: Successful
Start Time: 2014-08-25T17:16:57Z
Finish Time: 2014-08-25T17:16:57Z
Task Id: c664f547-16ec-49e5-8936-d7b028a954d6
Operations: publish
Resources: pulp (repository)
State: Failed
Start Time: 2014-08-25T17:18:38Z
Finish Time: 2014-08-25T17:18:38Z
Task Id: 739b6100-2fb6-4c1e-b8f8-8a8b23ef2562
Operations: publish
Resources: pulp (repository)
State: Successful
Start Time: 2014-08-25T17:22:08Z
Finish Time: 2014-08-25T17:22:09Z
Task Id: 6a82a2a0-a964-425f-84c7-f299f350d4e5
Operations: unit_update
Resources: node1 (consumer)
State: Successful
Start Time: 2014-08-25T17:25:04Z
Finish Time: 2014-08-25T17:25:05Z
Task Id: bcbe8c80-6e4f-4294-b5de-2d0d9ef0fb69
Operations: unit_update
Resources: node1 (consumer)
State: Successful
Start Time: 2014-08-25T17:25:21Z
Finish Time: 2014-08-25T17:25:21Z
Task Id: 2da29d01-477f-4534-9fe6-c95674237ac5
Operations: unit_update
Resources: node1 (consumer)
State: Successful
Start Time: 2014-08-25T17:25:53Z
Finish Time: 2014-08-25T17:26:00Z
Task Id: 64f95904-85c0-4529-b230-081b814f96d9
Operations: sync
Resources: rhel65 (repository)
State: Failed
Start Time: 2014-08-25T20:05:40Z
Finish Time: 2014-08-25T20:05:42Z
Task Id: 1d9cfdb7-50ba-48c2-9e07-fd880bb95a39
Operations: sync
Resources: rhel65 (repository)
State: Cancelled
Start Time: 2014-08-25T20:08:50Z
Finish Time: 2014-08-25T20:27:57Z
Task Id: 2151b068-ca35-4370-a7bb-908a561ecd65
Operations: sync
Resources: rhel65-ks (repository)
State: Cancelled
Start Time: 2014-08-25T20:15:35Z
Finish Time: 2014-08-25T20:17:26Z
Task Id: 39b4815b-6106-4aaa-add6-ecb69cb4e4ca
Operations: sync
Resources: rhel7 (repository)
State: Cancelled
Start Time: 2014-08-25T20:19:40Z
Finish Time: 2014-08-25T20:27:37Z
Task Id: e32a19c0-76d0-4896-b3b1-58a9b338be83
Operations: sync
Resources: rhel65 (repository)
State: Running
Start Time: 2014-08-25T20:45:30Z
Finish Time: Incomplete
Task Id: cbcaded4-c8d2-40bc-8498-26c94d31ab99
Operations: sync
Resources: rhel7 (repository)
State: Waiting
Start Time: Unstarted
Finish Time: Incomplete
Task Id: 3a89488f-ac7a-444b-a4f1-d422c83b5645
[root@cloud-qe-13 ~]# pulp-admin tasks list
+----------------------------------------------------------------------+
Tasks
+----------------------------------------------------------------------+
Operations: sync
Resources: pulp (repository)
State: Successful
Start Time: 2014-08-25T17:16:56Z
Finish Time: 2014-08-25T17:16:57Z
Task Id: 0042f483-f872-43f4-a39c-ee7924f1570e
Operations: publish
Resources: pulp (repository)
State: Successful
Start Time: 2014-08-25T17:16:57Z
Finish Time: 2014-08-25T17:16:57Z
Task Id: c664f547-16ec-49e5-8936-d7b028a954d6
Operations: publish
Resources: pulp (repository)
State: Failed
Start Time: 2014-08-25T17:18:38Z
Finish Time: 2014-08-25T17:18:38Z
Task Id: 739b6100-2fb6-4c1e-b8f8-8a8b23ef2562
Operations: publish
Resources: pulp (repository)
State: Successful
Start Time: 2014-08-25T17:22:08Z
Finish Time: 2014-08-25T17:22:09Z
Task Id: 6a82a2a0-a964-425f-84c7-f299f350d4e5
Operations: unit_update
Resources: node1 (consumer)
State: Successful
Start Time: 2014-08-25T17:25:04Z
Finish Time: 2014-08-25T17:25:05Z
Task Id: bcbe8c80-6e4f-4294-b5de-2d0d9ef0fb69
Operations: unit_update
Resources: node1 (consumer)
State: Successful
Start Time: 2014-08-25T17:25:21Z
Finish Time: 2014-08-25T17:25:21Z
Task Id: 2da29d01-477f-4534-9fe6-c95674237ac5
Operations: unit_update
Resources: node1 (consumer)
State: Successful
Start Time: 2014-08-25T17:25:53Z
Finish Time: 2014-08-25T17:26:00Z
Task Id: 64f95904-85c0-4529-b230-081b814f96d9
Operations: sync
Resources: rhel65 (repository)
State: Failed
Start Time: 2014-08-25T20:05:40Z
Finish Time: 2014-08-25T20:05:42Z
Task Id: 1d9cfdb7-50ba-48c2-9e07-fd880bb95a39
Operations: sync
Resources: rhel65 (repository)
State: Cancelled
Start Time: 2014-08-25T20:08:50Z
Finish Time: 2014-08-25T20:27:57Z
Task Id: 2151b068-ca35-4370-a7bb-908a561ecd65
Operations: sync
Resources: rhel65-ks (repository)
State: Cancelled
Start Time: 2014-08-25T20:15:35Z
Finish Time: 2014-08-25T20:17:26Z
Task Id: 39b4815b-6106-4aaa-add6-ecb69cb4e4ca
Operations: sync
Resources: rhel7 (repository)
State: Cancelled
Start Time: 2014-08-25T20:19:40Z
Finish Time: 2014-08-25T20:27:37Z
Task Id: e32a19c0-76d0-4896-b3b1-58a9b338be83
Operations: sync
Resources: rhel65 (repository)
State: Cancelled
Start Time: 2014-08-25T20:45:30Z
Finish Time: Incomplete
Task Id: cbcaded4-c8d2-40bc-8498-26c94d31ab99
Operations: sync
Resources: rhel7 (repository)
State: Cancelled
Start Time: Unstarted
Finish Time: Incomplete
Task Id: 3a89488f-ac7a-444b-a4f1-d422c83b5645
[root@cloud-qe-13 ~]#
This is fixed in Pulp-2.4.1-1. |