Bug 1491032 - [deadlock] pulp workers appear idle even though many pulp tasks are in 'waiting' status
Summary: [deadlock] pulp workers appear idle even though many pulp tasks are in 'waiting' status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.2.11
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: jcallaha
URL:
Whiteboard:
Duplicates: 1495532
Depends On:
Blocks: 1122832 1590906
 
Reported: 2017-09-12 20:31 UTC by Chris Duryee
Modified: 2023-03-24 13:52 UTC (History)
CC List: 28 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1590906
Environment:
Last Closed: 2017-12-20 17:13:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Pulp Redmine 2979 0 Normal CLOSED - CURRENTRELEASE Celery workers may deadlock when PULP_MAX_TASKS_PER_CHILD is used 2017-10-27 00:01:55 UTC
Red Hat Knowledge Base (Solution) 3201502 0 None None None 2017-10-02 10:11:49 UTC
Red Hat Product Errata RHBA-2017:3492 0 normal SHIPPED_LIVE Satellite 6.2.13 Async bug fix update - potentially long upgrade 2017-12-20 22:12:36 UTC

Description Chris Duryee 2017-09-12 20:31:25 UTC
Description of problem:

Note: this bug is related to Pulp tasks, not Foreman tasks.

In some cases, pulp will have idle workers even though tasks are in the 'waiting' state. Restarting the pulp_workers service appears to clear this up.

This is likely caused by https://pulp.plan.io/issues/2979; https://pulp.plan.io/issues/2979#note-4 has further information on the root cause of the issue.

Version-Release number of selected component (if applicable): 6.2.11


How reproducible: difficult to reproduce


Steps to Reproduce:
1. create lots of pulp tasks over a long period of time
2. observe the number of tasks in 'running' state

Actual results: the count of tasks in the 'running' state slowly decreases, even though there are enough pulp workers to handle all tasks. This can be observed via 'pulp-admin tasks list'.


Expected results: there should be roughly the same number of tasks in 'running' state as there are workers


Additional info: restarting pulp_workers seems to help, but commenting out PULP_MAX_TASKS_PER_CHILD in /etc/default/pulp_workers is another workaround. However, this may cause increased memory usage.
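Not part of the original report, but a quick sketch of checking for the symptom from the CLI. It assumes 'pulp-admin login' has already been run and that each entry printed by 'pulp-admin tasks list' includes a 'State:' field:

  # Count Pulp tasks by state; many 'Waiting' entries while the workers sit idle
  # matches the symptom described above.
  pulp-admin tasks list | grep -i 'State:' | sort | uniq -c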

Comment 2 pulp-infra@redhat.com 2017-09-12 21:03:33 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 3 pulp-infra@redhat.com 2017-09-12 21:03:35 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 5 pulp-infra@redhat.com 2017-09-27 14:30:57 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 6 pulp-infra@redhat.com 2017-09-30 21:03:00 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 7 pulp-infra@redhat.com 2017-09-30 21:31:57 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 8 Simon Reber 2017-10-02 10:00:28 UTC
*** Bug 1495532 has been marked as a duplicate of this bug. ***

Comment 10 Chris Duryee 2017-10-02 15:59:26 UTC
To confirm you are hitting this bug, ensure the pulp_workers processes are not using any CPU while there are tasks waiting. If no tasks appear to be running but CPU is being used, the system is likely performing applicability regeneration and no further action is needed.
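One way to make that check (a sketch, not from the original comment; it assumes the workers run as celery processes whose command line contains 'reserved_resource_worker'):

  # Show CPU usage of the Pulp worker processes; values stuck near 0.0 while
  # 'pulp-admin tasks list' still shows waiting tasks match this bug.
  ps -eo pid,pcpu,args | grep '[r]eserved_resource_worker'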

Comment 14 pulp-infra@redhat.com 2017-10-19 14:02:08 UTC
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 25 pulp-infra@redhat.com 2017-10-27 00:01:56 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 26 Mike McCune 2017-11-08 17:22:08 UTC
## WORKAROUND INSTRUCTIONS ##

To avoid the deadlock described in this bug, please do the following:

1) edit /etc/default/pulp_workers

2) comment out this line:

PULP_MAX_TASKS_PER_CHILD=2

so it looks like:

# PULP_MAX_TASKS_PER_CHILD=2

3) katello-service restart

This may cause increased memory consumption on the Celery worker processes, but it will avoid the deadlock described in this bug.
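For reference, steps 1-3 above can also be scripted; this is only a sketch, assuming the default file location and that the line is present and uncommented:

  # Comment out PULP_MAX_TASKS_PER_CHILD (keeps a .bak copy), then restart services.
  sed -i.bak 's/^PULP_MAX_TASKS_PER_CHILD=/# PULP_MAX_TASKS_PER_CHILD=/' /etc/default/pulp_workers
  katello-service restart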

Hotfixes for this are available upon request.

Comment 27 jcallaha 2017-12-13 20:22:27 UTC
Verified in Satellite 6.2.13 Snap 1.

Followed the steps outlined in the linked Pulp issue.

1. Ensure PULP_MAX_TASKS_PER_CHILD is uncommented in /etc/default/pulp_workers
2. Modify /usr/lib64/python2.7/site-packages/pymongo/pool.py
  -bash-4.2# cp /usr/lib64/python2.7/site-packages/pymongo/pool.py /usr/lib64/python2.7/site-packages/pymongo/pool.py.old

  ** edit pool.py **

  -bash-4.2# diff /usr/lib64/python2.7/site-packages/pymongo/pool.py /usr/lib64/python2.7/site-packages/pymongo/pool.py.old
  19d18
  < import time
  568d566
  <                     time.sleep(.1)
3. Remove the pool.pyc and pool.pyo files
  -bash-4.2# rm /usr/lib64/python2.7/site-packages/pymongo/pool.pyc
  -bash-4.2# rm /usr/lib64/python2.7/site-packages/pymongo/pool.pyo
4. Restart katello services
5. Start the test
   a. In one terminal, monitor journalctl with journalctl -f | grep 'succeeded in'
   b. In a second terminal, run this command (change hostname)
      enqueue(){ celery --app=pulp.server.async.app call --exchange=C.dq --routing-key=reserved_resource_worker-2@<hostname> pulp.server.async.tasks._release_resource '--args=["test"]'; }; while true; do for i in $(seq 1 5); do for j in $(seq 1 20); do enqueue & done; sleep 1; done; wait; done
6. Wait for at least two hours, monitoring the journalctl output for any stoppage.
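The stoppage check in step 6 can also be automated; a rough sketch, not part of the original verification, assuming the 'succeeded in' journal messages shown below keep appearing while the workers are healthy:

  # Warn if no Pulp task has completed within the last 10 minutes.
  while true; do
    recent=$(journalctl --since "10 minutes ago" | grep -c 'succeeded in')
    if [ "$recent" -eq 0 ]; then
      echo "$(date): no tasks completed in the last 10 minutes - possible deadlock"
    fi
    sleep 600
  done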


-bash-4.2# enqueue(){ celery --app=pulp.server.async.app call --exchange=C.dq --routing-key=reserved_resource_worker-2@<hostname> pulp.server.async.tasks._release_resource '--args=["test"]'; }; while true; do for i in $(seq 1 5); do for j in $(seq 1 20); do enqueue & done; sleep 1; done; wait; done
...
Dec 13 11:21:15 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[42a9f392-216d-44ec-9db1-bab4137fa931] succeeded in 0.134064707003s: None
Dec 13 11:21:17 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[2f4fbf52-d7f1-4163-b0c2-fc76bcf460cd] succeeded in 0.663019994012s: None
Dec 13 11:21:18 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[41f4d7b6-07b3-4543-b0d8-8bf680f2ca70] succeeded in 0.105703887006s: None
Dec 13 11:21:20 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de01038d-571c-4b9e-837f-0910b787ec13] succeeded in 0.720609048003s: None
...
4 hours later
Dec 13 15:21:14 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[6bfb84e4-aa3d-4e26-b02a-f3803c2b8199] succeeded in 0.216811780992s: None
Dec 13 15:21:14 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[828506ec-89da-4b62-9859-21042d06fcb4] succeeded in 0.10148732402s: None
Dec 13 15:21:16 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[0bbe01b5-b969-4116-927a-83979c7f9e81] succeeded in 0.216806504992s: None

At no point did I encounter the deadlock.

Comment 30 errata-xmlrpc 2017-12-20 17:13:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3492

