Description of problem: It is possible for a pulp task state to get stuck in "in progress" or "waiting", although in some cases that work will never be moved to a final state even though it will never be "picked up". The reason work that is dispatched won't be "picked up" is documented here [0]. This could easily occur if all services are on one machine and that machine looses power while performing a task, but starts up again in less that 5 minutes. Version-Release number of selected component (if applicable): Pulp 2.4 How reproducible: always Steps to Reproduce: 1. Ensure pulp_celerybeat is running, pulp_resource_manager is running, httpd, is running, and *exactly* one worker is running (ie: worker0). 2. Dispatch long-running syncs for two different repositories so that one will get started by the worker, and the other will be dispatched to the worker but will not start. 3. Stop pulp_celerybeat while the first sync is still running 4. Forcefully restart worker0 (kill -9 then start the worker again). 5. Start pulp_celerybeat again (needs to be < 5 minutes of when you stopped celerybeat in step 3. 6. List the status of the tasks and observe that one is in progress, and the second is waiting. These are both incorrect because both tasks will never start or cancel. Pulp's task state will never get updated unless the user knows to magically cancel the task. Since it's running or waiting, canceling the task as the resolution is not obvious. Expected results: I expected the pulp task state of work that is "lost" in the system due to a worker restart to be marked as cancelled. This should happen regardless of whether pulp_celerybeat is running or not. Additional info: [0]: https://bugzilla.redhat.com/show_bug.cgi?id=1129758
Note: workers already handle their own task state to in progress when tasks are started, and to completed when tasks are finished. The recommended fix is multipart: 1) Add a worker startup behavior to workers that cancel all in progress or completed tasks assigned to a worker 2) Add an identical behavior for worker graceful shutdown 3) Adjust the graceful shutdown monitoring in pulp_celerybeat so that if a graceful shutdown of a worker is observed that pulp_celerybeat does NOT attempt to also cancel those tasks given that step 2 is taking care of that. 4) Ensure that pulp_celerybeat does still any and all releasing of reserved_resources for the worker that gracefully shut down. This fix only has the worker updating task status, it never requests that resources be deleted to avoid a potential race condition with new inbound work. The cancelled tasks would then mirror the actual situation whereby the work will never be picked up if a worker hasn't completed it before it shuts down for any reason. It is important that non-graceful worker detection behavior stay exactly as it is today (release reserved_resource AND cancel any outstanding tasks for that worker) to correctly handle the case where a worker is killed with kill -9 and doesn't update the task status itself. This cancellation would not occur until celerybeat runs at least 5 minutes after the worker last issued a heartbeat.
This BZ is very much related to [0], and probably should be handled at the same time. [0]: https://bugzilla.redhat.com/show_bug.cgi?id=1120270
(In reply to bbouters from comment #1) > 1) Add a worker startup behavior to workers that cancel all in progress or > completed tasks assigned to a worker I think it's actually important to go through everything the resource manager would do to delete a worker here. In addition to cancelling the tasks, the bookkeeping must also happen on the reserved_resources_collection so that resources aren't still mapped to the new worker.
PR available at: https://github.com/pulp/pulp/pull/1124
Merged to 2.4.1-dev -> 2.4-dev -> master
This was fixed in pulp-2.4.1-0.3.beta.
verified [root@cloud-qe-13 ~]# rpm -qa pulp-server pulp-server-2.4.1-0.3.beta.el6.noarch [root@cloud-qe-13 ~]# with Brian's help I followed the steps from the original description [root@cloud-qe-13 ~]# pulp-admin tasks list +----------------------------------------------------------------------+ Tasks +----------------------------------------------------------------------+ Operations: sync Resources: rhel65 (repository) State: Running Start Time: 2014-08-25T20:45:30Z Finish Time: Incomplete Task Id: cbcaded4-c8d2-40bc-8498-26c94d31ab99 Operations: sync Resources: rhel7 (repository) State: Waiting Start Time: Unstarted Finish Time: Incomplete Task Id: 3a89488f-ac7a-444b-a4f1-d422c83b5645 After cancellation and restart [root@cloud-qe-13 ~]# pulp-admin tasks list +----------------------------------------------------------------------+ Tasks +----------------------------------------------------------------------+ Operations: sync Resources: rhel65 (repository) State: Running Start Time: 2014-08-25T20:45:30Z Finish Time: Incomplete Task Id: cbcaded4-c8d2-40bc-8498-26c94d31ab99 Operations: sync Resources: rhel7 (repository) State: Waiting Start Time: Unstarted Finish Time: Incomplete Task Id: 3a89488f-ac7a-444b-a4f1-d422c83b5645 [root@cloud-qe-13 ~]# pulp-admin tasks list +----------------------------------------------------------------------+ Tasks +----------------------------------------------------------------------+ Operations: sync Resources: pulp (repository) State: Successful Start Time: 2014-08-25T17:16:56Z Finish Time: 2014-08-25T17:16:57Z Task Id: 0042f483-f872-43f4-a39c-ee7924f1570e Operations: publish Resources: pulp (repository) State: Successful Start Time: 2014-08-25T17:16:57Z Finish Time: 2014-08-25T17:16:57Z Task Id: c664f547-16ec-49e5-8936-d7b028a954d6 Operations: publish Resources: pulp (repository) State: Failed Start Time: 2014-08-25T17:18:38Z Finish Time: 2014-08-25T17:18:38Z Task Id: 739b6100-2fb6-4c1e-b8f8-8a8b23ef2562 Operations: publish Resources: pulp (repository) State: Successful Start Time: 2014-08-25T17:22:08Z Finish Time: 2014-08-25T17:22:09Z Task Id: 6a82a2a0-a964-425f-84c7-f299f350d4e5 Operations: unit_update Resources: node1 (consumer) State: Successful Start Time: 2014-08-25T17:25:04Z Finish Time: 2014-08-25T17:25:05Z Task Id: bcbe8c80-6e4f-4294-b5de-2d0d9ef0fb69 Operations: unit_update Resources: node1 (consumer) State: Successful Start Time: 2014-08-25T17:25:21Z Finish Time: 2014-08-25T17:25:21Z Task Id: 2da29d01-477f-4534-9fe6-c95674237ac5 Operations: unit_update Resources: node1 (consumer) State: Successful Start Time: 2014-08-25T17:25:53Z Finish Time: 2014-08-25T17:26:00Z Task Id: 64f95904-85c0-4529-b230-081b814f96d9 Operations: sync Resources: rhel65 (repository) State: Failed Start Time: 2014-08-25T20:05:40Z Finish Time: 2014-08-25T20:05:42Z Task Id: 1d9cfdb7-50ba-48c2-9e07-fd880bb95a39 Operations: sync Resources: rhel65 (repository) State: Cancelled Start Time: 2014-08-25T20:08:50Z Finish Time: 2014-08-25T20:27:57Z Task Id: 2151b068-ca35-4370-a7bb-908a561ecd65 Operations: sync Resources: rhel65-ks (repository) State: Cancelled Start Time: 2014-08-25T20:15:35Z Finish Time: 2014-08-25T20:17:26Z Task Id: 39b4815b-6106-4aaa-add6-ecb69cb4e4ca Operations: sync Resources: rhel7 (repository) State: Cancelled Start Time: 2014-08-25T20:19:40Z Finish Time: 2014-08-25T20:27:37Z Task Id: e32a19c0-76d0-4896-b3b1-58a9b338be83 Operations: sync Resources: rhel65 (repository) State: Running Start Time: 2014-08-25T20:45:30Z Finish Time: Incomplete Task Id: cbcaded4-c8d2-40bc-8498-26c94d31ab99 Operations: sync Resources: rhel7 (repository) State: Waiting Start Time: Unstarted Finish Time: Incomplete Task Id: 3a89488f-ac7a-444b-a4f1-d422c83b5645 [root@cloud-qe-13 ~]# pulp-admin tasks list +----------------------------------------------------------------------+ Tasks +----------------------------------------------------------------------+ Operations: sync Resources: pulp (repository) State: Successful Start Time: 2014-08-25T17:16:56Z Finish Time: 2014-08-25T17:16:57Z Task Id: 0042f483-f872-43f4-a39c-ee7924f1570e Operations: publish Resources: pulp (repository) State: Successful Start Time: 2014-08-25T17:16:57Z Finish Time: 2014-08-25T17:16:57Z Task Id: c664f547-16ec-49e5-8936-d7b028a954d6 Operations: publish Resources: pulp (repository) State: Failed Start Time: 2014-08-25T17:18:38Z Finish Time: 2014-08-25T17:18:38Z Task Id: 739b6100-2fb6-4c1e-b8f8-8a8b23ef2562 Operations: publish Resources: pulp (repository) State: Successful Start Time: 2014-08-25T17:22:08Z Finish Time: 2014-08-25T17:22:09Z Task Id: 6a82a2a0-a964-425f-84c7-f299f350d4e5 Operations: unit_update Resources: node1 (consumer) State: Successful Start Time: 2014-08-25T17:25:04Z Finish Time: 2014-08-25T17:25:05Z Task Id: bcbe8c80-6e4f-4294-b5de-2d0d9ef0fb69 Operations: unit_update Resources: node1 (consumer) State: Successful Start Time: 2014-08-25T17:25:21Z Finish Time: 2014-08-25T17:25:21Z Task Id: 2da29d01-477f-4534-9fe6-c95674237ac5 Operations: unit_update Resources: node1 (consumer) State: Successful Start Time: 2014-08-25T17:25:53Z Finish Time: 2014-08-25T17:26:00Z Task Id: 64f95904-85c0-4529-b230-081b814f96d9 Operations: sync Resources: rhel65 (repository) State: Failed Start Time: 2014-08-25T20:05:40Z Finish Time: 2014-08-25T20:05:42Z Task Id: 1d9cfdb7-50ba-48c2-9e07-fd880bb95a39 Operations: sync Resources: rhel65 (repository) State: Cancelled Start Time: 2014-08-25T20:08:50Z Finish Time: 2014-08-25T20:27:57Z Task Id: 2151b068-ca35-4370-a7bb-908a561ecd65 Operations: sync Resources: rhel65-ks (repository) State: Cancelled Start Time: 2014-08-25T20:15:35Z Finish Time: 2014-08-25T20:17:26Z Task Id: 39b4815b-6106-4aaa-add6-ecb69cb4e4ca Operations: sync Resources: rhel7 (repository) State: Cancelled Start Time: 2014-08-25T20:19:40Z Finish Time: 2014-08-25T20:27:37Z Task Id: e32a19c0-76d0-4896-b3b1-58a9b338be83 Operations: sync Resources: rhel65 (repository) State: Cancelled Start Time: 2014-08-25T20:45:30Z Finish Time: Incomplete Task Id: cbcaded4-c8d2-40bc-8498-26c94d31ab99 Operations: sync Resources: rhel7 (repository) State: Cancelled Start Time: Unstarted Finish Time: Incomplete Task Id: 3a89488f-ac7a-444b-a4f1-d422c83b5645 [root@cloud-qe-13 ~]#
This is fixed in Pulp-2.4.1-1.