Bug 1607535

Summary: Pulp tasks getting stuck, tasks piling up on worker
Product: Red Hat Satellite
Reporter: Brian Bouterse <bmbouter>
Component: Pulp
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED WONTFIX
QA Contact: Kersom <koliveir>
Severity: medium
Priority: unspecified
Version: Unspecified
CC: cmarinea, roywilli, ttereshc
Target Milestone: Unspecified
Keywords: Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-01-03 15:57:35 UTC Type: Bug
Bug Blocks: 1115190    

Description Brian Bouterse 2018-07-23 16:38:31 UTC
I got a report that a Satellite user had its Pulp resource_manager stuck. Here are some details:

They were no longer able to get one of their capsules to sync, and Pulp tasks would become ready for work but never get executed. This hotfix (https://bugzilla.redhat.com/show_bug.cgi?id=1491032) was applied but it did not help. It was observed that the resource_manager queue had 19K items in it.

I believe the resource_manager was deadlocked. Note that there were no log messages indicating that qpid.messaging was in an illegal state. Those log messages were added for deadlock detection as part of (https://bugzilla.redhat.com/show_bug.cgi?id=1279502). Their absence means it's deadlocking for another reason.

I believe this can happen to any worker or the resource_manager on start or upon fork during operation.

I do not know how to reproduce this issue. It occurs rarely.

The Celery devs and I have discussed that there is a threading/forking incompatibility between Celery and qpid.messaging, and also between Celery and the pymongo driver, both of which use threading. Mixing threading and forking is a known-bad practice and can cause "deadlocking on start". This would produce effectively the same symptoms as (https://bugzilla.redhat.com/show_bug.cgi?id=1279502).
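To illustrate the general hazard of mixing threads and fork, here is a minimal sketch. It does not use Celery, qpid.messaging, or pymongo; it only assumes a POSIX system where os.fork() is available. The names hold_lock and child_can_acquire_after_fork are made up for this example. A background thread holds a lock when the process forks. The child gets a copy of the locked lock, but not the thread that would release it, so the child can never acquire it.

```python
import os
import threading

# A lock shared by all threads in the parent process.
lock = threading.Lock()


def hold_lock(acquired, release):
    """Background thread: take the lock and keep it until told to let go."""
    with lock:
        acquired.set()
        release.wait()


def child_can_acquire_after_fork(timeout=1.0):
    """Fork while another thread holds `lock`; report whether the child got it."""
    acquired = threading.Event()
    release = threading.Event()
    holder = threading.Thread(target=hold_lock, args=(acquired, release))
    holder.start()
    acquired.wait()  # make sure the lock is held before forking

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here. The lock was copied in
        # its locked state and the holder thread is gone, so nothing will
        # ever release it. A timeout keeps this demo from hanging forever.
        os.close(read_fd)
        got_it = lock.acquire(timeout=timeout)
        os.write(write_fd, b"1" if got_it else b"0")
        os._exit(0)

    # Parent: collect the child's answer, then let the holder thread finish.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    release.set()
    holder.join()
    return answer == b"1"


if __name__ == "__main__":
    print("child acquired lock:", child_can_acquire_after_fork())
```

Without the timeout, the child would block forever, which is the kind of hang described above. Whether the real deadlock in this report involves a lock inside qpid.messaging, pymongo, or something else is not known; the sketch only shows the mechanism.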

These types of issues are also reported in various other trackers:

# Celery issues
https://github.com/celery/celery/issues/4185
https://github.com/celery/celery/issues/4316
https://github.com/celery/celery/issues/3898
https://github.com/celery/celery/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+hang

# Pymongo issues
https://projects.engineering.redhat.com/browse/DELIVERY-3041
https://pulp.plan.io/issues/2979         <----- You could still deadlock on start with this fix. It only causes Pulp to fix up its connection so that it can still run in more cases. Also, not using MAX_TASKS_PER_CHILD causes the worker to start fewer times, which doesn't fix the issue; it just makes you experience it less frequently.
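
The point about MAX_TASKS_PER_CHILD can be made concrete with a rough back-of-the-envelope sketch. It assumes each child process start has some small, independent chance p of deadlocking; that probability is hypothetical and has not been measured. The function names below are made up for this example.

```python
import math


def child_starts(total_tasks, max_tasks_per_child=None):
    """Number of child process starts needed to run `total_tasks` tasks.

    With no limit, one child handles everything, so it starts once.
    With a limit, the child is recycled after every `max_tasks_per_child` tasks.
    """
    if max_tasks_per_child is None:
        return 1
    return max(1, math.ceil(total_tasks / max_tasks_per_child))


def chance_of_any_deadlock(p_per_start, starts):
    """Chance that at least one start deadlocks, assuming independent starts."""
    return 1.0 - (1.0 - p_per_start) ** starts


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-start deadlock probability, for illustration only
    tasks = 19000
    for limit in (None, 2):
        starts = child_starts(tasks, limit)
        print(limit, starts, chance_of_any_deadlock(p, starts))
```

Under these assumptions, fewer starts lower the chance of hitting the deadlock but never bring it to zero, which matches the observation that leaving the limit unset only makes the problem less frequent.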

Comment 2 Chris Duryee 2018-07-24 17:30:24 UTC
*** Bug 1600625 has been marked as a duplicate of this bug. ***

Comment 3 Tanya Tereshchenko 2020-01-03 15:57:35 UTC
This issue won't be fixed in Pulp 2.

The problem doesn't exist in Pulp 3 (no Celery).