Bug 1491032

Summary: [deadlock] pulp workers appear idle even though many pulp tasks are in 'waiting' status
Product: Red Hat Satellite
Reporter: Chris Duryee <cduryee>
Component: Pulp
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA
QA Contact: jcallaha
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.2.11
CC: adprice, ajoseph, andrew.schofield, aperotti, bbuckingham, bkearney, bmbouter, brubisch, daniele, daviddavis, dkliban, ggainey, ipanova, jentrena, mhrivnak, mmccune, pcreech, pdwyer, peter.vreman, pmoravec, rchan, satellite6-bugs, sreber, tbrisker, tstrachota, ttereshc, xdmoon, zhunting
Target Milestone: Unspecified
Keywords: FieldEngineering, PrioBumpField, Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1590906
Environment:
Last Closed: 2017-12-20 17:13:23 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1122832, 1590906    

Description Chris Duryee 2017-09-12 20:31:25 UTC
Description of problem:

Note: this bug relates to pulp tasks, not foreman tasks.

In some cases, pulp will have idle workers even though tasks are in 'waiting' state. Restarting the pulp_workers service seems to clear this up.

This is likely caused by https://pulp.plan.io/issues/2979; https://pulp.plan.io/issues/2979#note-4 has further info on the root cause of the issue.

Version-Release number of selected component (if applicable): 6.2.11


How reproducible: difficult to reproduce


Steps to Reproduce:
1. create lots of pulp tasks over a long period of time
2. observe the number of tasks in 'running' state

Actual results: the count of tasks in 'running' state slowly goes down, even though there are enough pulp workers to handle all tasks. This can be observed via 'pulp-admin tasks list'
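
A rough way to watch those counts (a sketch, not part of the original report; it assumes a logged-in pulp-admin session and, for the MongoDB query, the stock Pulp 2 layout of a 'task_status' collection with a 'state' field in 'pulp_database'):

  # Tally Pulp tasks by state directly in MongoDB (Pulp 2 schema assumed)
  mongo pulp_database --quiet --eval 'printjson(db.task_status.aggregate([{$group: {_id: "$state", count: {$sum: 1}}}]).toArray())'

  # Or count tasks currently shown as Running via the CLI
  # (requires a prior "pulp-admin login"; state labels may vary slightly by version)
  pulp-admin tasks list | grep -c 'Running'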


Expected results: there should be roughly the same number of tasks in 'running' state as there are workers


Additional info: restarting pulp_workers seems to help, but commenting out PULP_MAX_TASKS_PER_CHILD in /etc/default/pulp_workers is another workaround. This may cause increased memory usage, however.

Comment 2 pulp-infra@redhat.com 2017-09-12 21:03:33 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 3 pulp-infra@redhat.com 2017-09-12 21:03:35 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 5 pulp-infra@redhat.com 2017-09-27 14:30:57 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 6 pulp-infra@redhat.com 2017-09-30 21:03:00 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 7 pulp-infra@redhat.com 2017-09-30 21:31:57 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 8 Simon Reber 2017-10-02 10:00:28 UTC
*** Bug 1495532 has been marked as a duplicate of this bug. ***

Comment 10 Chris Duryee 2017-10-02 15:59:26 UTC
To confirm you are hitting this bug, check that the pulp_workers processes are not using any CPU while there are tasks waiting. If no tasks appear to be running yet CPU is being used, the system is likely doing applicability regeneration and no further action is needed.
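
One way to make that check from a shell on the Satellite (a sketch, not part of the original comment; the MongoDB query assumes the stock Pulp 2 layout of a 'task_status' collection with a 'state' field in 'pulp_database'):

  # Snapshot of per-process CPU usage; deadlocked workers sit at ~0.0 %CPU
  top -b -n 1 | grep -i celery

  # Number of tasks Pulp currently has in 'waiting' (Pulp 2 schema assumed)
  mongo pulp_database --quiet --eval 'print(db.task_status.find({state: "waiting"}).count())'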

Comment 14 pulp-infra@redhat.com 2017-10-19 14:02:08 UTC
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 25 pulp-infra@redhat.com 2017-10-27 00:01:56 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 26 Mike McCune 2017-11-08 17:22:08 UTC
## WORKAROUND INSTRUCTIONS ##

To avoid the deadlock described in this bug, please do the following:

1) edit /etc/default/pulp_workers

2) comment out this line:

PULP_MAX_TASKS_PER_CHILD=2

so it looks like:

# PULP_MAX_TASKS_PER_CHILD=2

3) katello-service restart

This may cause increased memory consumption in the Celery worker processes, but it will avoid the deadlock described in this bug.
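
The same edit can be scripted; a minimal sketch, assuming the PULP_MAX_TASKS_PER_CHILD line is present and uncommented exactly as shipped in /etc/default/pulp_workers:

  # Comment out the per-child task cap and restart the stack
  sed -i 's/^PULP_MAX_TASKS_PER_CHILD/# PULP_MAX_TASKS_PER_CHILD/' /etc/default/pulp_workers
  katello-service restart

  # Worth watching worker memory afterwards, since worker processes are no longer recycled
  ps -eo pid,rss,args | grep '[c]elery' | sort -k2 -n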

Hotfixes for this are available upon request.

Comment 27 jcallaha 2017-12-13 20:22:27 UTC
Verified in Satellite 6.2.13 Snap 1.

Followed the steps outlined in the linked upstream Pulp issue.

1. Ensure PULP_MAX_TASKS_PER_CHILD is uncommented in /etc/default/pulp_workers
2. Modify /usr/lib64/python2.7/site-packages/pymongo/pool.py
  -bash-4.2# cp /usr/lib64/python2.7/site-packages/pymongo/pool.py /usr/lib64/python2.7/site-packages/pymongo/pool.py.old

  ** edit pool.py **

  -bash-4.2# diff /usr/lib64/python2.7/site-packages/pymongo/pool.py /usr/lib64/python2.7/site-packages/pymongo/pool.py.old
  19d18
  < import time
  568d566
  <                     time.sleep(.1)
3. Remove the pool.pyc and pool.pyo files
  -bash-4.2# rm /usr/lib64/python2.7/site-packages/pymongo/pool.pyc
  -bash-4.2# rm /usr/lib64/python2.7/site-packages/pymongo/pool.pyo
4. Restart katello services
5. Start the test
   a. In one terminal, monitor journalctl with journalctl -f | grep 'succeeded in'
   b. In a second terminal, run this command (change hostname; a reflowed, readable version of this one-liner appears right after these steps)
      enqueue(){ celery --app=pulp.server.async.app call --exchange=C.dq --routing-key=reserved_resource_worker-2@<hostname> pulp.server.async.tasks._release_resource '--args=["test"]'; }; while true; do for i in $(seq 1 5); do for j in $(seq 1 20); do enqueue & done; sleep 1; done; wait; done
6. Wait for at least two hours, monitoring the journalctl output for any stoppage.
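
For readability, here is the one-liner from step 5b reflowed into a script (same commands, only split across lines; <hostname> stays a placeholder for the Satellite FQDN, and the routing key assumes the second reserved-resource worker, as in the original command):

  # Flood worker 2 with cheap _release_resource tasks: bursts of 20, one burst per second
  enqueue() {
      celery --app=pulp.server.async.app call \
          --exchange=C.dq \
          --routing-key=reserved_resource_worker-2@<hostname> \
          pulp.server.async.tasks._release_resource '--args=["test"]'
  }

  while true; do
      for i in $(seq 1 5); do
          for j in $(seq 1 20); do
              enqueue &        # 20 parallel submissions per burst
          done
          sleep 1              # one burst per second
      done
      wait                     # drain outstanding submissions before the next round
  done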


-bash-4.2# enqueue(){ celery --app=pulp.server.async.app call --exchange=C.dq --routing-key=reserved_resource_worker-2@<hostname> pulp.server.async.tasks._release_resource '--args=["test"]'; }; while true; do for i in $(seq 1 5); do for j in $(seq 1 20); do enqueue & done; sleep 1; done; wait; done
...
Dec 13 11:21:15 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[42a9f392-216d-44ec-9db1-bab4137fa931] succeeded in 0.134064707003s: None
Dec 13 11:21:17 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[2f4fbf52-d7f1-4163-b0c2-fc76bcf460cd] succeeded in 0.663019994012s: None
Dec 13 11:21:18 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[41f4d7b6-07b3-4543-b0d8-8bf680f2ca70] succeeded in 0.105703887006s: None
Dec 13 11:21:20 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de01038d-571c-4b9e-837f-0910b787ec13] succeeded in 0.720609048003s: None
...
4 hours later
Dec 13 15:21:14 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[6bfb84e4-aa3d-4e26-b02a-f3803c2b8199] succeeded in 0.216811780992s: None
Dec 13 15:21:14 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[828506ec-89da-4b62-9859-21042d06fcb4] succeeded in 0.10148732402s: None
Dec 13 15:21:16 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[0bbe01b5-b969-4116-927a-83979c7f9e81] succeeded in 0.216806504992s: None

At no point did I encounter the deadlock.

Comment 30 errata-xmlrpc 2017-12-20 17:13:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3492