Bug 1415484 - Killing child celery process processing a task leaves the task orphaned as "running" forever
Keywords:
Status: CLOSED DUPLICATE of bug 1353248
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.2.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-22 13:41 UTC by Pavel Moravec
Modified: 2020-07-16 09:08 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-24 17:17:48 UTC
Target Upstream Version:




Links:
Pulp Redmine issue 1673 (last updated 2017-01-24 17:14:39 UTC)

Description Pavel Moravec 2017-01-22 13:41:25 UTC
Description of problem:
Customer scenario: the OOM killer abruptly killed a child celery worker process while it was running a task. Even though the child process is re-spawned after a while, it does not resume the job, so the pulp task stays "running" forever and the corresponding foreman task remains "waiting on pulp to finish the task".
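The failure mode can be illustrated with a minimal, self-contained Python sketch. This is not Pulp/celery code; the task name and the single-dict stand-in for the task_status collection are hypothetical, purely to show why nothing resets the orphaned record:

```python
import os
import signal
import time

# In-memory stand-in for Pulp's task_status collection (hypothetical names).
task_status = {"sync-1": "running"}

pid = os.fork()
if pid == 0:
    # Child "worker": pretend to sync a large repo.
    time.sleep(60)
    os._exit(0)

time.sleep(0.1)
os.kill(pid, signal.SIGKILL)      # simulate the OOM killer
_, status = os.waitpid(pid, 0)
assert os.WIFSIGNALED(status)     # the child died abnormally...

# The parent re-spawns a worker for *new* tasks, but nothing ever revisits
# the record of the task the dead child was running:
print(task_status["sync-1"])      # still "running" -- orphaned
```

The child is killed mid-task, the parent carries on, and the "running" record is never updated, which mirrors the stuck state described above.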


Version-Release number of selected component (if applicable):
Sat 6.2.6


How reproducible:
100%


Steps to Reproduce:
1. Sync some bigger repo (just to ensure the task runs long enough to watch it progressing).
2. While the (pulp) task is in progress, find which worker took it (e.g. via the dynflow console, where Output will contain something like "queue: reserved_resource_worker-0@pmoravec-sat62-rhel7.gsslab.brq.redhat.com.dq", or check qpid-stat -q output to see which worker has a message in its queue).
3. Find that worker's child process and kill -9 it.
4. Wait as long as you like. Verify that the child process is re-spawned, then check the task status.
5. To check the task status: qpid-stat -q (grep for the same queue), the foreman task, or the task_status mongo collection.


Actual results:
In step 5:
- qpid-stat -q shows empty queues (the resource_manager queue is empty, so the task wasn't put back for re-scheduling; the worker's queue is empty as well)
- the foreman task is "waiting for pulp to finish the task"
- the task_status mongo collection contains:

{
  "_id" : ObjectId("5884a6e393370a373be8c633"),
  "task_id" : "7b88e4ad-ea2b-41f1-a9d4-d0e5f5e94d5b",
  "exception" : null,
  "task_type" : "pulp.server.managers.repo.sync.sync",
  "tags" : [
    "pulp:repository:RedHat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_RPMs_x86_64_7Server",
    "pulp:action:sync"
  ],
  "finish_time" : null,
  "_ns" : "task_status",
  "traceback" : null,
  "spawned_tasks" : [ ],
  "progress_report" : {
    "yum_importer" : {
      "content" : {
        "size_total" : 0,
        "items_left" : 0,
        "items_total" : 0,
        "state" : "NOT_STARTED",
        "size_left" : 0,
        "details" : { "rpm_total" : 0, "rpm_done" : 0, "drpm_total" : 0, "drpm_done" : 0 },
        "error_details" : [ ]
      },
      "comps" : { "state" : "NOT_STARTED" },
      "purge_duplicates" : { "state" : "NOT_STARTED" },
      "distribution" : { "items_total" : 0, "state" : "NOT_STARTED", "error_details" : [ ], "items_left" : 0 },
      "errata" : { "state" : "NOT_STARTED" },
      "metadata" : { "state" : "IN_PROGRESS" }
    }
  },
  "worker_name" : "reserved_resource_worker-0@pmoravec-sat62-rhel7.gsslab.brq.redhat.com",
  "result" : null,
  "error" : null,
  "group_id" : null,
  "id" : null,
  "state" : "canceled",
  "start_time" : "2017-01-22T12:34:43Z"
}

i.e. it gives the false impression that the task is running / in progress


Expected results:
The task is either canceled/failed, or rescheduled to a different worker, or to the same worker once it is re-spawned (i.e. when a worker starts, shouldn't it check its tasks in the task_status collection?).
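The startup check suggested above could look roughly like this. All names here are hypothetical and the in-memory dict stands in for the mongo collection; this is a sketch of the idea, not the actual Pulp implementation:

```python
# Sketch: on startup, a worker reconciles any task records still marked
# "running" under its own name, left over from a previous incarnation.

def reconcile_on_startup(task_status, worker_name):
    """Cancel tasks orphaned by a previous incarnation of this worker."""
    recovered = []
    for task_id, record in task_status.items():
        if record["worker_name"] == worker_name and record["state"] == "running":
            record["state"] = "canceled"   # or re-queue for another worker
            recovered.append(task_id)
    return recovered

# Example: one orphaned record, shaped loosely like the mongo document above.
tasks = {
    "7b88e4ad": {"worker_name": "reserved_resource_worker-0", "state": "running"},
    "11aa22bb": {"worker_name": "reserved_resource_worker-1", "state": "running"},
}
print(reconcile_on_startup(tasks, "reserved_resource_worker-0"))  # ['7b88e4ad']
```

Only the record owned by the restarting worker is touched; tasks genuinely running on other workers are left alone.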


Additional info:
Restarting the pulp services is an effective workaround.

Comment 2 Michael Hrivnak 2017-01-24 04:53:41 UTC
Brian, thoughts on why the task isn't getting auto-canceled?

Comment 3 Brian Bouterse 2017-01-24 17:08:55 UTC
This was a defect that was fixed in upstream Pulp in 2.9.1. Sat 6.2.6 is running pulp-server-2.8.7.3-1.el7sat.noarch. I associated the upstream issue also.

The OOM killer only kills the celery child process, so the parent process continues to send heartbeats. The parent spawns another child, which is why subsequent tasks are processed without error. Task cancellation requires the heartbeats to stop flowing for the failure to be detected; since that never happens, the task is never canceled. In 2.9.1 an additional check was put in place to recover from this exact scenario. The same thing happens when the child process segfaults, which is how the issue was discovered upstream.
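The extra check described above can be sketched as a parent-side monitor: since the parent is the one that keeps heartbeating, it must itself notice that its child exited abnormally mid-task and update the task's state. A minimal stand-alone illustration with hypothetical names (not celery's or Pulp's actual code):

```python
import os
import signal

def run_monitored(task_status, task_id, child_fn):
    """Run child_fn in a child process; if the child dies abnormally,
    mark the task canceled instead of leaving it "running" forever."""
    task_status[task_id] = "running"
    pid = os.fork()
    if pid == 0:
        child_fn()
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):         # e.g. SIGKILL from the OOM killer
        task_status[task_id] = "canceled"
    else:
        task_status[task_id] = "finished"

task_status = {}
# The child SIGKILLs itself to stand in for the OOM killer.
run_monitored(task_status, "sync-1",
              lambda: os.kill(os.getpid(), signal.SIGKILL))
print(task_status["sync-1"])  # "canceled"
```

The key point is that the abnormal exit is observed by the surviving parent, so the task record is corrected even though the heartbeats never stopped.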

Comment 4 Brian Bouterse 2017-01-24 17:14:40 UTC
The wrong upstream issue was associated. Now it is fixed.

Comment 5 Michael Hrivnak 2017-01-24 17:17:48 UTC

*** This bug has been marked as a duplicate of bug 1353248 ***

