Bug 1415484 - Killing child celery process processing a task leaves the task orphaned as "running" forever
Keywords:
Status: CLOSED DUPLICATE of bug 1353248
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.2.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-22 13:41 UTC by Pavel Moravec
Modified: 2020-07-16 09:08 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-24 17:17:48 UTC
Target Upstream Version:




Links:
Pulp Redmine issue 1673 (last updated 2017-01-24 17:14:39 UTC)

Description Pavel Moravec 2017-01-22 13:41:25 UTC
Description of problem:
Customer scenario: the OOM killer abruptly killed a child celery worker process while it was running a task. Even though the child process is re-spawned after a while, it does not resume the job, so the pulp task stays "running" forever and the corresponding foreman task remains "waiting on pulp to finish the task".
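The failure mode can be illustrated with a minimal, self-contained Python sketch. This is not Pulp/celery code; the task name and the single-dict stand-in for the task_status collection are hypothetical, purely to show why nothing resets the orphaned record:

```python
import os
import signal
import time

# In-memory stand-in for Pulp's task_status collection (hypothetical names).
task_status = {"sync-1": "running"}

pid = os.fork()
if pid == 0:
    # Child "worker": pretend to sync a large repo.
    time.sleep(60)
    os._exit(0)

time.sleep(0.1)
os.kill(pid, signal.SIGKILL)      # simulate the OOM killer
_, status = os.waitpid(pid, 0)
assert os.WIFSIGNALED(status)     # the child died abnormally...

# The parent re-spawns a worker for *new* tasks, but nothing ever revisits
# the record of the task the dead child was running:
print(task_status["sync-1"])      # still "running" -- orphaned
```

The child is killed mid-task, the parent carries on, and the "running" record is never updated, which mirrors the stuck state described above.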


Version-Release number of selected component (if applicable):
Sat 6.2.6


How reproducible:
100%


Steps to Reproduce:
1. Sync some bigger repo (just to ensure the task runs long enough to watch it progressing).
2. While the (pulp) task is in progress, find which worker took it (e.g. via the dynflow console, where Output will contain something like "queue: reserved_resource_worker-0@pmoravec-sat62-rhel7.gsslab.brq.redhat.com.dq", or check qpid-stat -q output to see which worker has a message in its queue).
3. Find that worker's child process and kill -9 it.
4. Wait as long as you like. Verify that the child process is re-spawned, then check the task status.
5. To check the task status: qpid-stat -q (grep for the same queue), the foreman task, or the task_status mongo collection.


Actual results:
In step 5:
- qpid-stat -q shows empty queues (the resource_manager queue is empty, so the task wasn't put back for re-scheduling; the worker's queue is empty as well)
- the foreman task is "waiting for pulp to finish the task"
- the task_status mongo collection contains:

{
  "_id" : ObjectId("5884a6e393370a373be8c633"),
  "task_id" : "7b88e4ad-ea2b-41f1-a9d4-d0e5f5e94d5b",
  "exception" : null,
  "task_type" : "pulp.server.managers.repo.sync.sync",
  "tags" : [
    "pulp:repository:RedHat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_RPMs_x86_64_7Server",
    "pulp:action:sync"
  ],
  "finish_time" : null,
  "_ns" : "task_status",
  "traceback" : null,
  "spawned_tasks" : [ ],
  "progress_report" : {
    "yum_importer" : {
      "content" : {
        "size_total" : 0,
        "items_left" : 0,
        "items_total" : 0,
        "state" : "NOT_STARTED",
        "size_left" : 0,
        "details" : { "rpm_total" : 0, "rpm_done" : 0, "drpm_total" : 0, "drpm_done" : 0 },
        "error_details" : [ ]
      },
      "comps" : { "state" : "NOT_STARTED" },
      "purge_duplicates" : { "state" : "NOT_STARTED" },
      "distribution" : { "items_total" : 0, "state" : "NOT_STARTED", "error_details" : [ ], "items_left" : 0 },
      "errata" : { "state" : "NOT_STARTED" },
      "metadata" : { "state" : "IN_PROGRESS" }
    }
  },
  "worker_name" : "reserved_resource_worker-0@pmoravec-sat62-rhel7.gsslab.brq.redhat.com",
  "result" : null,
  "error" : null,
  "group_id" : null,
  "id" : null,
  "state" : "canceled",
  "start_time" : "2017-01-22T12:34:43Z"
}

i.e. it gives the false impression that the task is running / in progress


Expected results:
The task is either canceled/failed, or rescheduled to a different worker, or to the same worker once it is re-spawned (i.e. when a worker starts, shouldn't it check its tasks in the task_status collection?).
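The startup check suggested above could look roughly like this. All names here are hypothetical and the in-memory dict stands in for the mongo collection; this is a sketch of the idea, not the actual Pulp implementation:

```python
# Sketch: on startup, a worker reconciles any task records still marked
# "running" under its own name, left over from a previous incarnation.

def reconcile_on_startup(task_status, worker_name):
    """Cancel tasks orphaned by a previous incarnation of this worker."""
    recovered = []
    for task_id, record in task_status.items():
        if record["worker_name"] == worker_name and record["state"] == "running":
            record["state"] = "canceled"   # or re-queue for another worker
            recovered.append(task_id)
    return recovered

# Example: one orphaned record, shaped loosely like the mongo document above.
tasks = {
    "7b88e4ad": {"worker_name": "reserved_resource_worker-0", "state": "running"},
    "11aa22bb": {"worker_name": "reserved_resource_worker-1", "state": "running"},
}
print(reconcile_on_startup(tasks, "reserved_resource_worker-0"))  # ['7b88e4ad']
```

Only the record owned by the restarting worker is touched; tasks genuinely running on other workers are left alone.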


Additional info:
Restarting the pulp services is an effective workaround.

Comment 2 Michael Hrivnak 2017-01-24 04:53:41 UTC
Brian, thoughts on why the task isn't getting auto-canceled?

Comment 3 Brian Bouterse 2017-01-24 17:08:55 UTC
This was a defect that was fixed in upstream Pulp in 2.9.1. Sat 6.2.6 is running pulp-server-2.8.7.3-1.el7sat.noarch. I associated the upstream issue also.

The OOM killer only kills the celery child process, so the parent process continues to send heartbeats. The parent spawns another child, which is why subsequent tasks are processed without error. Task cancellation requires the heartbeats to stop flowing for the failure to be detected; since that never happens, the task is never canceled. In 2.9.1 an additional check was put in place to recover from this exact scenario. The same thing happens when the child process segfaults, which is how the issue was discovered upstream.
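The extra check described above can be sketched as a parent-side monitor: since the parent is the one that keeps heartbeating, it must itself notice that its child exited abnormally mid-task and update the task's state. A minimal stand-alone illustration with hypothetical names (not celery's or Pulp's actual code):

```python
import os
import signal

def run_monitored(task_status, task_id, child_fn):
    """Run child_fn in a child process; if the child dies abnormally,
    mark the task canceled instead of leaving it "running" forever."""
    task_status[task_id] = "running"
    pid = os.fork()
    if pid == 0:
        child_fn()
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):         # e.g. SIGKILL from the OOM killer
        task_status[task_id] = "canceled"
    else:
        task_status[task_id] = "finished"

task_status = {}
# The child SIGKILLs itself to stand in for the OOM killer.
run_monitored(task_status, "sync-1",
              lambda: os.kill(os.getpid(), signal.SIGKILL))
print(task_status["sync-1"])  # "canceled"
```

The key point is that the abnormal exit is observed by the surviving parent, so the task record is corrected even though the heartbeats never stopped.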

Comment 4 Brian Bouterse 2017-01-24 17:14:40 UTC
The wrong upstream issue was associated. Now it is fixed.

Comment 5 Michael Hrivnak 2017-01-24 17:17:48 UTC

*** This bug has been marked as a duplicate of bug 1353248 ***

