Bug 1129870 - Task State Incorrect When Workers Shutdown or are Killed while Celerybeat is not Running
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Other
Version: Unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Brian Bouterse
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On: 1129858
Blocks: 950743
Reported: 2014-08-13 19:59 UTC by Brian Bouterse
Modified: 2014-09-11 12:26 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1129858
Environment:
Last Closed: 2014-09-11 12:26:25 UTC



Description Brian Bouterse 2014-08-13 19:59:07 UTC
+++ This bug was initially created as a clone of Bug #1129858 +++

Description of problem: It is possible for a Pulp task to get stuck in the "in progress" or "waiting" state. In some cases the task will never be moved to a final state even though the work will never be "picked up". The reason dispatched work won't be "picked up" is documented here [0]. This could easily occur if all services are on one machine and that machine loses power while performing a task, but starts up again in less than 5 minutes.


Version-Release number of selected component (if applicable): Pulp 2.4


How reproducible: always


Steps to Reproduce:
1. Ensure pulp_celerybeat is running, pulp_resource_manager is running, httpd is running, and *exactly* one worker is running (i.e., worker0).
2. Dispatch long-running syncs for two different repositories so that one will get started by the worker, and the other will be dispatched to the worker but will not start.
3. Stop pulp_celerybeat while the first sync is still running.
4. Forcefully restart worker0 (kill -9 then start the worker again).
5. Start pulp_celerybeat again (this needs to be less than 5 minutes after you stopped celerybeat in step 3).
6. List the status of the tasks and observe that one is in progress and the second is waiting. Both states are incorrect because neither task will ever run or be cancelled. Pulp's task state will never get updated unless the user knows to cancel the task manually. Since the task appears to be running or waiting, canceling it is not an obvious resolution.
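
The timing in steps 3 to 5 can be modeled with a short sketch. It assumes, based on the 5-minute window mentioned in this report, that celerybeat treats a worker as missing only when its last heartbeat is older than 5 minutes, and that the restarted worker heartbeats under the same name (worker0). The function name, constant, and timestamps are hypothetical and are not Pulp's actual code.

```python
from datetime import datetime, timedelta

# Assumption: the 5-minute window from the report, expressed as a timeout.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def is_worker_missing(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return True if the worker's last heartbeat is older than the timeout."""
    return now - last_heartbeat > timeout


# Hypothetical timeline for the reproduction steps.
t0 = datetime(2014, 8, 13, 12, 0, 0)
celerybeat_stopped = t0                         # step 3
worker_restarted = t0 + timedelta(minutes=1)    # step 4: worker0 heartbeats again
celerybeat_started = t0 + timedelta(minutes=3)  # step 5: less than 5 minutes later

# When celerybeat resumes, worker0's newest heartbeat is only 2 minutes old,
# so the worker is not considered missing and its old tasks are left untouched.
detected = is_worker_missing(worker_restarted, celerybeat_started)
print(detected)  # False
```

In this model nothing ever flags worker0 as lost, which is consistent with the two tasks keeping their stale states.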


Expected results:
I expected the pulp task state of work that is "lost" in the system due to a worker restart to be marked as cancelled. This should happen regardless of whether pulp_celerybeat is running or not.


Additional info:
[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129758

--- Additional comment from  on 2014-08-13 15:56:54 EDT ---

Note: workers already update their own task state, setting it to in progress when a task starts and to completed when it finishes.

The recommended fix is multipart:

1) Add a startup behavior to workers that cancels all in progress or waiting tasks assigned to that worker
2) Add an identical behavior for worker graceful shutdown
3) Adjust the graceful shutdown monitoring in pulp_celerybeat so that when a graceful shutdown of a worker is observed, pulp_celerybeat does NOT also attempt to cancel those tasks, given that step 2 takes care of that.
4) Ensure that pulp_celerybeat still performs any and all releasing of reserved_resources for the worker that gracefully shut down. With this fix the worker only updates task status; it never requests that resources be deleted, which avoids a potential race condition with new inbound work.
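
The four parts above might be sketched as follows. This is a minimal in-memory model, not Pulp's implementation: the `Task` class, the state names, and the functions `cancel_incomplete_tasks` and `on_graceful_shutdown_observed` are hypothetical and only illustrate the division of responsibility between the worker and pulp_celerybeat. It also assumes the tasks to cancel are those not yet in a final state (waiting or in progress).

```python
from dataclasses import dataclass

# Hypothetical state names, following the states mentioned in this report.
INCOMPLETE_STATES = {"waiting", "in progress"}


@dataclass
class Task:
    task_id: str
    worker_name: str
    state: str


def cancel_incomplete_tasks(tasks, worker_name):
    """Parts 1 and 2: run by the worker itself at startup and graceful shutdown.

    Only task state is updated; reserved resources are deliberately not touched.
    """
    cancelled = []
    for task in tasks:
        if task.worker_name == worker_name and task.state in INCOMPLETE_STATES:
            task.state = "cancelled"
            cancelled.append(task.task_id)
    return cancelled


def on_graceful_shutdown_observed(reserved_resources, worker_name):
    """Parts 3 and 4: run by pulp_celerybeat when it sees a graceful shutdown.

    Releases the worker's reserved resources but does NOT cancel tasks,
    because the worker already did that in cancel_incomplete_tasks().
    """
    return [r for r in reserved_resources if r["worker_name"] != worker_name]


tasks = [
    Task("sync-a", "worker0", "in progress"),
    Task("sync-b", "worker0", "waiting"),
    Task("sync-c", "worker1", "in progress"),
]
reserved = [
    {"resource": "repo-a", "worker_name": "worker0"},
    {"resource": "repo-c", "worker_name": "worker1"},
]

cancelled_ids = cancel_incomplete_tasks(tasks, "worker0")
remaining = on_graceful_shutdown_observed(reserved, "worker0")
```

Keeping the resource release on the celerybeat side follows the stated goal of avoiding a race with new inbound work.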

The cancelled tasks would then mirror the actual situation whereby the work will never be picked up if a worker hasn't completed it before it shuts down for any reason.

It is important that non-graceful worker detection behavior stay exactly as it is today (release reserved_resource AND cancel any outstanding tasks for that worker) to correctly handle the case where a worker is killed with kill -9 and doesn't update the task status itself. This cancellation would not occur until celerybeat runs at least 5 minutes after the worker last issued a heartbeat.
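
For contrast, the unchanged non-graceful path could look like the sketch below. It is an illustrative model under the assumption that pulp_celerybeat compares each worker's last heartbeat against a 5-minute timeout; `handle_missing_workers`, the constant, and the dictionary layout are hypothetical and not taken from Pulp.

```python
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(minutes=5)  # assumption: the 5-minute window


def handle_missing_workers(heartbeats, tasks, reserved_resources, now):
    """Release resources AND cancel outstanding tasks for every worker whose
    last heartbeat is older than HEARTBEAT_TIMEOUT."""
    missing = {
        name for name, last_seen in heartbeats.items()
        if now - last_seen > HEARTBEAT_TIMEOUT
    }
    for task in tasks:
        if task["worker_name"] in missing and task["state"] in ("waiting", "in progress"):
            task["state"] = "cancelled"
    kept = [r for r in reserved_resources if r["worker_name"] not in missing]
    return missing, kept


now = datetime(2014, 8, 13, 12, 10, 0)
heartbeats = {
    "worker0": now - timedelta(minutes=6),   # killed with kill -9, not restarted
    "worker1": now - timedelta(seconds=30),  # healthy
}
tasks = [
    {"task_id": "sync-a", "worker_name": "worker0", "state": "in progress"},
    {"task_id": "sync-c", "worker_name": "worker1", "state": "in progress"},
]
reserved = [
    {"resource": "repo-a", "worker_name": "worker0"},
    {"resource": "repo-c", "worker_name": "worker1"},
]

missing, reserved = handle_missing_workers(heartbeats, tasks, reserved, now)
```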

Comment 4 Corey Welton 2014-09-05 15:53:15 UTC
Marking Verified against Satellite-6.0.4-RHEL-7-20140904.1

Comment 5 Bryan Kearney 2014-09-11 12:26:25 UTC
This was delivered with Satellite 6.0 which was released on 10 September 2014.

