Bug 1415484

Summary: Killing child celery process processing a task leaves the task orphaned as "running" forever
Product: Red Hat Satellite
Component: Pulp
Version: 6.2.6
Hardware: x86_64
OS: Linux
Reporter: Pavel Moravec <pmoravec>
Assignee: satellite6-bugs <satellite6-bugs>
QA Contact: Katello QA List <katello-qa-list>
CC: mhrivnak
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Target Milestone: Unspecified
Target Release: Unused
Type: Bug
Last Closed: 2017-01-24 17:17:48 UTC

Description Pavel Moravec 2017-01-22 13:41:25 UTC
Description of problem:
Customer scenario: the OOM killer abruptly killed a child celery worker process while it was running a task. Even though the child process is re-spawned after a while, it does not resume the job, so the Pulp task stays "running" and the corresponding foreman task stays "waiting on pulp to finish the task".

Version-Release number of selected component (if applicable):
Sat 6.2.6

How reproducible:

Steps to Reproduce:
1. Sync a larger repo (just to ensure you have some time to watch it progressing).
2. While the (pulp) task is in progress, check which worker took it (e.g. via the dynflow console, where Output will contain something like "queue: reserved_resource_worker-0@pmoravec-sat62-rhel7.gsslab.brq.redhat.com.dq" - or check qpid-stat -q output and see which worker has a message in its queue).
3. Identify the worker's child process and kill -9 it.
4. Wait any amount of time. Verify the child process has been re-spawned, then check the task status.
5. To check the task status: qpid-stat -q (grep for the same queue), or the foreman task, or the task_status mongo collection.

Actual results:
- qpid-stat -q shows empty queues (the resource_manager queue is empty, so the job/task wasn't put back for re-scheduling, and the worker's queue is empty as well)
- foreman task is "waiting for pulp to finish the task"
- task_status mongo collection has:

{
  "_id" : ObjectId("5884a6e393370a373be8c633"),
  "task_id" : "7b88e4ad-ea2b-41f1-a9d4-d0e5f5e94d5b",
  "exception" : null,
  "task_type" : "pulp.server.managers.repo.sync.sync",
  "tags" : [ "pulp:repository:RedHat-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_RPMs_x86_64_7Server", "pulp:action:sync" ],
  "finish_time" : null,
  "_ns" : "task_status",
  "traceback" : null,
  "spawned_tasks" : [ ],
  "progress_report" : {
    "yum_importer" : {
      "content" : { "size_total" : 0, "items_left" : 0, "items_total" : 0, "state" : "NOT_STARTED", "size_left" : 0, "details" : { "rpm_total" : 0, "rpm_done" : 0, "drpm_total" : 0, "drpm_done" : 0 }, "error_details" : [ ] },
      "comps" : { "state" : "NOT_STARTED" },
      "purge_duplicates" : { "state" : "NOT_STARTED" },
      "distribution" : { "items_total" : 0, "state" : "NOT_STARTED", "error_details" : [ ], "items_left" : 0 },
      "errata" : { "state" : "NOT_STARTED" },
      "metadata" : { "state" : "IN_PROGRESS" }
    }
  },
  "worker_name" : "reserved_resource_worker-0@pmoravec-sat62-rhel7.gsslab.brq.redhat.com",
  "result" : null,
  "error" : null,
  "group_id" : null,
  "id" : null,
  "state" : "canceled",
  "start_time" : "2017-01-22T12:34:43Z"
}

i.e. the false impression that the task is running / in progress
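Finding such stuck documents in the task_status collection can be sketched as a small filter. This is illustrative only, not Pulp code; the field names follow the document dumped above, and which states count as "in flight" is an assumption:

```python
def orphaned_tasks(task_docs):
    """Return task_ids of documents that claim to be in flight but have
    no finish_time -- candidates for the stuck-forever state described
    above.  Field names match the task_status collection dump."""
    in_flight = {"running", "waiting"}  # assumed set of in-flight states
    return [
        doc["task_id"]
        for doc in task_docs
        if doc.get("state") in in_flight and doc.get("finish_time") is None
    ]
```

In practice the documents would come from a pymongo query such as `db.task_status.find()` against the pulp_database; here the function just takes the dicts directly so it is easy to check.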

Expected results:
The task is either canceled/failed, or rescheduled to a different worker, or to the same worker once it is re-spawned (i.e. when a worker starts, should it not check for its tasks in the task_status collection?).

Additional info:
Restarting the pulp services is an effective workaround.
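The workaround can be sketched as a shell fragment; the service names below are the Pulp 2 units shipped with Satellite 6.2 and may differ on other versions, so treat them as assumptions:

```shell
# On Satellite, the bundled wrapper restarts all related services in order:
katello-service restart

# Or restart only the Pulp/celery services directly:
systemctl restart pulp_workers pulp_resource_manager pulp_celerybeat
```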

Comment 2 Michael Hrivnak 2017-01-24 04:53:41 UTC
Brian, thoughts on why the task isn't getting auto-canceled?

Comment 3 Brian Bouterse 2017-01-24 17:08:55 UTC
This was a defect that was fixed in upstream Pulp 2.9.1. Sat 6.2.6 is running pulp-server-. I associated the upstream issue as well.

The OOM killer only kills the celery child process, so the parent process continues to heartbeat. The parent then spawns another child, which is why additional tasks are processed without error. Task cancellation requires the heartbeats to stop flowing for the failure to be detected; since that never occurs, the task will never be canceled. In 2.9.1 an additional check was put in place to recover from exactly this scenario. The same thing happens when the child process segfaults, which is how we discovered the issue upstream.
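The failure mode described above can be illustrated with a toy model. This is not Pulp's actual code; the timeout value and names are made up. Worker liveness is judged purely by heartbeat age, and because the parent keeps heartbeating after the child dies, the worker never looks dead and its task is never auto-canceled:

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds -- hypothetical value

def dead_workers(last_heartbeat, now):
    """Workers whose last heartbeat is older than the timeout.
    Only tasks owned by these workers would be auto-canceled."""
    return {
        worker
        for worker, stamp in last_heartbeat.items()
        if now - stamp > HEARTBEAT_TIMEOUT
    }

# The parent process emits heartbeats; the child only runs the task.
# SIGKILLing the child therefore leaves the heartbeat stream intact:
heartbeats = {"reserved_resource_worker-0": 100.0}
# ... child is killed at t=100, but the parent keeps updating the stamp ...
heartbeats["reserved_resource_worker-0"] = 1000.0
```

At t=1005 the worker still looks alive (`dead_workers(heartbeats, 1005.0)` is empty), so the task owned by it is never canceled, which matches the symptom in this bug; the 2.9.1 fix adds a separate check for a dead child rather than relying on heartbeats alone.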

Comment 4 Brian Bouterse 2017-01-24 17:14:40 UTC
The wrong upstream issue was associated. Now it is fixed.

Comment 5 Michael Hrivnak 2017-01-24 17:17:48 UTC

*** This bug has been marked as a duplicate of bug 1353248 ***