1129870 – Task State Incorrect When Workers Shutdown or are Killed while Celerybeat is not Running




Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Other
Sub Component:
Version:	Unspecified
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	Unspecified
Assignee:	Brian Bouterse
QA Contact:	Katello QA List
Docs Contact:
URL:
Whiteboard:
Depends On:	1129858
Blocks:	950743

Reported:	2014-08-13 19:59 UTC by Brian Bouterse
Modified:	2014-09-11 12:26 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	1129858
Environment:
Last Closed:	2014-09-11 12:26:25 UTC
Target Upstream Version:
Embargoed:


Description Brian Bouterse 2014-08-13 19:59:07 UTC

+++ This bug was initially created as a clone of Bug #1129858 +++

Description of problem: It is possible for a pulp task state to get stuck in "in progress" or "waiting". In these cases the task will never be moved to a final state, even though the work will never be "picked up". The reason work that is dispatched won't be "picked up" is documented here [0]. This could easily occur if all services are on one machine and that machine loses power while performing a task, but starts up again in less than 5 minutes.


Version-Release number of selected component (if applicable): Pulp 2.4


How reproducible: always


Steps to Reproduce:
1. Ensure pulp_celerybeat is running, pulp_resource_manager is running, httpd is running, and *exactly* one worker is running (i.e. worker0).
2. Dispatch long-running syncs for two different repositories so that one will get started by the worker, and the other will be dispatched to the worker but will not start.
3. Stop pulp_celerybeat while the first sync is still running.
4. Forcefully restart worker0 (kill -9 then start the worker again).
5. Start pulp_celerybeat again (this needs to be < 5 minutes after you stopped celerybeat in step 3).
6. List the status of the tasks and observe that one is in progress and the second is waiting. Both states are incorrect because neither task will ever start or be canceled. Pulp's task state will never get updated unless the user happens to know to cancel the task. Because the task appears to be running or waiting, canceling it is not an obvious resolution.


Expected results:
I expected the pulp task state of work that is "lost" in the system due to a worker restart to be marked as cancelled. This should happen regardless of whether pulp_celerybeat is running or not.


Additional info:
[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129758

--- Additional comment from  on 2014-08-13 15:56:54 EDT ---

Note: workers already handle their own task state, setting it to in progress when a task starts and to completed when it finishes.

The recommended fix is multipart:

1) Add a startup behavior to workers that cancels all in progress or completed tasks assigned to that worker.
2) Add an identical behavior for worker graceful shutdown.
3) Adjust the graceful shutdown monitoring in pulp_celerybeat so that, if a graceful shutdown of a worker is observed, pulp_celerybeat does NOT also attempt to cancel those tasks, given that step 2 is taking care of that.
4) Ensure that pulp_celerybeat still performs any and all releasing of reserved_resources for the worker that gracefully shut down. With this fix the worker only updates task status; it never requests that resources be deleted, to avoid a potential race condition with new inbound work.
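
The division of responsibility in the four parts above can be sketched roughly as follows. This is a minimal in-memory model, not Pulp's actual code: the names Task, Cluster, cancel_incomplete_tasks, and on_graceful_shutdown_observed, as well as the state strings, are hypothetical and chosen only for illustration. The sketch also assumes the intent is to cancel tasks that have not yet reached a final state (waiting or in progress).

```python
from dataclasses import dataclass, field

# Hypothetical state names; the real Pulp constants may differ.
WAITING, RUNNING, FINISHED, CANCELED = "waiting", "in progress", "finished", "canceled"
INCOMPLETE_STATES = {WAITING, RUNNING}


@dataclass
class Task:
    task_id: str
    worker_name: str
    state: str = WAITING


@dataclass
class Cluster:
    """In-memory stand-in for the task status and reserved resource records."""
    tasks: list = field(default_factory=list)
    reserved_resources: dict = field(default_factory=dict)  # resource id -> worker name


def cancel_incomplete_tasks(cluster, worker_name):
    """Parts 1 and 2: run by the worker itself at startup and at graceful shutdown.

    Only task state is touched; reserved resources are deliberately left alone.
    """
    canceled = []
    for task in cluster.tasks:
        if task.worker_name == worker_name and task.state in INCOMPLETE_STATES:
            task.state = CANCELED
            canceled.append(task.task_id)
    return canceled


def on_graceful_shutdown_observed(cluster, worker_name):
    """Parts 3 and 4: celerybeat releases the worker's reservations but cancels nothing."""
    for resource_id, owner in list(cluster.reserved_resources.items()):
        if owner == worker_name:
            del cluster.reserved_resources[resource_id]
```

In this model the worker is the only party that writes task state during a graceful shutdown, and celerybeat is the only party that deletes reservations, which is the separation the comment proposes to avoid racing with new inbound work.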

The cancelled tasks would then mirror the actual situation whereby the work will never be picked up if a worker hasn't completed it before it shuts down for any reason.

It is important that non-graceful worker detection behavior stay exactly as it is today (release reserved_resource AND cancel any outstanding tasks for that worker) to correctly handle the case where a worker is killed with kill -9 and doesn't update the task status itself. This cancellation would not occur until celerybeat runs at least 5 minutes after the worker last issued a heartbeat.
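
The non-graceful path described above could look roughly like the sketch below. It is an illustration under stated assumptions, not Pulp's implementation: find_missing_workers and clean_up_missing_worker are hypothetical names, the dictionaries stand in for whatever storage Pulp uses, and the 5-minute timeout is taken from the comment above.

```python
from datetime import datetime, timedelta

# The 5-minute window mentioned in this bug; the name is hypothetical.
WORKER_TIMEOUT = timedelta(minutes=5)


def find_missing_workers(last_heartbeats, now):
    """Return the names of workers whose last heartbeat is at least WORKER_TIMEOUT old.

    last_heartbeats maps worker name -> datetime of the last heartbeat seen.
    """
    return sorted(
        name for name, seen in last_heartbeats.items()
        if now - seen >= WORKER_TIMEOUT
    )


def clean_up_missing_worker(worker_name, tasks, reserved_resources):
    """Release the worker's reservations AND cancel its outstanding tasks.

    tasks maps task id -> {"worker": name, "state": str};
    reserved_resources maps resource id -> worker name.
    """
    for resource_id in [r for r, owner in reserved_resources.items() if owner == worker_name]:
        del reserved_resources[resource_id]
    for task in tasks.values():
        if task["worker"] == worker_name and task["state"] in ("waiting", "in progress"):
            task["state"] = "canceled"
```

A worker that is killed and restarted under the same name inside the timeout keeps producing fresh heartbeats, so a check like find_missing_workers never flags it. That is why the startup cancellation in the recommended fix is needed in addition to this path.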

Comment 4 Corey Welton 2014-09-05 15:53:15 UTC

Marking Verified against Satellite-6.0.4-RHEL-7-20140904.1

Comment 5 Bryan Kearney 2014-09-11 12:26:25 UTC

This was delivered with Satellite 6.0 which was released on 10 September 2014.
