+++ This bug was initially created as a clone of Bug #1129858 +++
Description of problem: It is possible for a pulp task state to get stuck in "in progress" or "waiting". In some cases that work will never be moved to a final state, even though it will never be "picked up". The reason dispatched work won't be "picked up" is documented here. This could easily occur if all services are on one machine and that machine loses power while performing a task, but starts up again in less than 5 minutes.
Version-Release number of selected component (if applicable): Pulp 2.4
How reproducible: always
Steps to Reproduce:
1. Ensure pulp_celerybeat is running, pulp_resource_manager is running, httpd is running, and *exactly* one worker is running (i.e. worker0).
2. Dispatch long-running syncs for two different repositories so that one will get started by the worker, and the other will be dispatched to the worker but will not start.
3. Stop pulp_celerybeat while the first sync is still running
4. Forcefully restart worker0 (kill -9 then start the worker again).
5. Start pulp_celerybeat again (this needs to happen within 5 minutes of when you stopped celerybeat in step 3).
6. List the status of the tasks and observe that one is in progress, and the second is waiting. These are both incorrect because neither task will ever start or be canceled. Pulp's task state will never get updated unless the user knows to magically cancel the task. Since the task is shown as running or waiting, it is not obvious that canceling it is the resolution.
I expected the pulp task state of work that is "lost" in the system due to a worker restart to be marked as cancelled. This should happen regardless of whether pulp_celerybeat is running or not.
--- Additional comment from on 2014-08-13 15:56:54 EDT ---
Note: workers already set their own task state to in progress when tasks are started, and to completed when tasks are finished.
The recommended fix is multipart:
1) Add a startup behavior to workers that cancels all in progress or waiting tasks assigned to that worker
2) Add an identical behavior for worker graceful shutdown
3) Adjust the graceful shutdown monitoring in pulp_celerybeat so that, if a graceful shutdown of a worker is observed, pulp_celerybeat does NOT also attempt to cancel those tasks, given that step 2 is taking care of that.
4) Ensure that pulp_celerybeat still does any and all releasing of reserved_resources for the worker that gracefully shut down. With this fix the worker only updates task status; it never requests that resources be deleted, to avoid a potential race condition with new inbound work.
The cancelled tasks would then mirror the actual situation whereby the work will never be picked up if a worker hasn't completed it before it shuts down for any reason.
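A minimal sketch of the worker-side behavior from steps 1 and 2, not Pulp's actual code. It assumes task statuses are plain in-memory records; the `TaskStatus` class, the state strings, and `cancel_incomplete_tasks` are hypothetical names introduced only for illustration. It also assumes that only tasks that have not reached a final state should be cancelled, and it deliberately leaves reserved resources alone, as step 4 requires.

```python
from dataclasses import dataclass

# Hypothetical state names; the real values used by Pulp may differ.
WAITING = "waiting"
RUNNING = "running"
FINISHED = "finished"
CANCELED = "canceled"

# States from which a task can no longer progress once its worker is gone.
INCOMPLETE_STATES = {WAITING, RUNNING}


@dataclass
class TaskStatus:
    """Hypothetical stand-in for a stored task status record."""
    task_id: str
    worker_name: str
    state: str


def cancel_incomplete_tasks(tasks, worker_name):
    """Mark every unfinished task assigned to worker_name as canceled.

    Intended to be called by the worker itself on startup and on graceful
    shutdown. Only task state is touched; releasing reserved resources is
    left to pulp_celerybeat. Returns the ids of the tasks that were changed.
    """
    canceled = []
    for task in tasks:
        if task.worker_name == worker_name and task.state in INCOMPLETE_STATES:
            task.state = CANCELED
            canceled.append(task.task_id)
    return canceled


if __name__ == "__main__":
    tasks = [
        TaskStatus("sync-a", "worker0", RUNNING),
        TaskStatus("sync-b", "worker0", WAITING),
        TaskStatus("sync-c", "worker0", FINISHED),
    ]
    print(cancel_incomplete_tasks(tasks, "worker0"))
```

Running the same function at both startup and graceful shutdown covers the reproduction case above: a worker restarted with kill -9 cleans up its own stale tasks as soon as it starts again, without waiting for pulp_celerybeat.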
It is important that non-graceful worker detection behavior stay exactly as it is today (release reserved_resource AND cancel any outstanding tasks for that worker) to correctly handle the case where a worker is killed with kill -9 and doesn't update the task status itself. This cancellation would not occur until celerybeat runs at least 5 minutes after the worker last issued a heartbeat.
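The timing in the non-graceful case can be illustrated with a small sketch. This is an assumption-laden model, not pulp_celerybeat's implementation: `WORKER_TIMEOUT` and `find_missing_workers` are hypothetical names, and heartbeats are modelled as a plain mapping from worker name to the time of its last heartbeat.

```python
from datetime import datetime, timedelta

# Assumed from the report: a worker is treated as gone once 5 minutes
# have passed since its last heartbeat.
WORKER_TIMEOUT = timedelta(minutes=5)


def find_missing_workers(last_heartbeats, now):
    """Return the names of workers whose last heartbeat is at least
    WORKER_TIMEOUT old, sorted for stable output.

    last_heartbeats maps a worker name to the datetime of its most
    recent heartbeat (hypothetical data shape).
    """
    return sorted(
        name
        for name, seen in last_heartbeats.items()
        if now - seen >= WORKER_TIMEOUT
    )


if __name__ == "__main__":
    now = datetime(2014, 8, 13, 16, 0, 0)
    heartbeats = {
        "worker0": now - timedelta(minutes=3),  # restarted quickly
        "worker1": now - timedelta(minutes=6),  # silent for too long
    }
    print(find_missing_workers(heartbeats, now))
```

Under this model, a worker that is killed and comes back within the timeout never appears in the result, which is why the reproduction steps above leave tasks stuck unless the worker cancels them itself.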
Marking Verified against Satellite-6.0.4-RHEL-7-20140904.1
This was delivered with Satellite 6.0 which was released on 10 September 2014.