Bug 1491032
| Summary: | [deadlock] pulp workers appear idle even though many pulp tasks are in 'waiting' status |
|---|---|
| Product: | Red Hat Satellite |
| Component: | Pulp |
| Version: | 6.2.11 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Reporter: | Chris Duryee <cduryee> |
| Assignee: | satellite6-bugs <satellite6-bugs> |
| QA Contact: | jcallaha |
| CC: | adprice, ajoseph, andrew.schofield, aperotti, bbuckingham, bkearney, bmbouter, brubisch, daniele, daviddavis, dkliban, ggainey, ipanova, jentrena, mhrivnak, mmccune, pcreech, pdwyer, peter.vreman, pmoravec, rchan, satellite6-bugs, sreber, tbrisker, tstrachota, ttereshc, xdmoon, zhunting |
| Keywords: | FieldEngineering, PrioBumpField, Triaged |
| Target Milestone: | Unspecified |
| Target Release: | Unused |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Clones: | 1590906 (view as bug list) |
| Last Closed: | 2017-12-20 17:13:23 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| Category: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Bug Blocks: | 1122832, 1590906 |
Description
Chris Duryee, 2017-09-12 20:31:25 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

*** Bug 1495532 has been marked as a duplicate of this bug. ***

To confirm you are hitting this bug, check whether the pulp_workers processes are using any CPU while there are tasks waiting. If no tasks appear to be running yet CPU is being used, the system is likely doing applicability regeneration and no further action is needed.

The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

## WORKAROUND INSTRUCTIONS ##

To avoid the deadlock introduced in this bug, please do the following:

1) Edit /etc/default/pulp_workers.
2) Comment out this line:

       PULP_MAX_TASKS_PER_CHILD=2

   so it looks like:

       # PULP_MAX_TASKS_PER_CHILD=2

3) Run katello-service restart.

This may increase memory consumption of the Celery worker processes, but it avoids the deadlock situation caused by this bug. Hotfixes for this are available upon request.

Verified in Satellite 6.2.13 Snap 1. Followed the steps outlined in the aligned Pulp issue.

1. Ensure PULP_MAX_TASKS_PER_CHILD is uncommented in /etc/default/pulp_workers.
2. Modify /usr/lib64/python2.7/site-packages/pymongo/pool.py:

       -bash-4.2# cp /usr/lib64/python2.7/site-packages/pymongo/pool.py /usr/lib64/python2.7/site-packages/pymongo/pool.py.old
       ** edit pool.py **
       -bash-4.2# diff /usr/lib64/python2.7/site-packages/pymongo/pool.py /usr/lib64/python2.7/site-packages/pymongo/pool.py.old
       19d18
       < import time
       568d566
       < time.sleep(.1)

3. Remove the pool .pyc and .pyo files:

       -bash-4.2# rm /usr/lib64/python2.7/site-packages/pymongo/pool.pyc
       -bash-4.2# rm /usr/lib64/python2.7/site-packages/pymongo/pool.pyo

4. Restart katello services.
5. Start the test:

   a. In one terminal, monitor journalctl with:

          journalctl -f | grep 'succeeded in'

   b. In a second terminal, run this command (change the hostname):

          enqueue(){ celery --app=pulp.server.async.app call --exchange=C.dq --routing-key=reserved_resource_worker-2@<hostname> pulp.server.async.tasks._release_resource '--args=["test"]'; }; while true; do for i in $(seq 1 5); do for j in $(seq 1 20); do enqueue & done; sleep 1; done; wait; done

6. Wait for at least two hours, monitoring the journalctl output for any stoppage.

Sample output while the loop from step 5b was running:

    Dec 13 11:21:15 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[42a9f392-216d-44ec-9db1-bab4137fa931] succeeded in 0.134064707003s: None
    Dec 13 11:21:17 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[2f4fbf52-d7f1-4163-b0c2-fc76bcf460cd] succeeded in 0.663019994012s: None
    Dec 13 11:21:18 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[41f4d7b6-07b3-4543-b0d8-8bf680f2ca70] succeeded in 0.105703887006s: None
    Dec 13 11:21:20 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de01038d-571c-4b9e-837f-0910b787ec13] succeeded in 0.720609048003s: None
    ...
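Step 6 above amounts to watching for a long gap between consecutive "succeeded in" lines while tasks are still queued. As a rough aid, that check could be scripted along these lines (a sketch only; `detect_stalls` is a hypothetical helper, not part of the documented procedure, and it ignores midnight rollover in the timestamps):

```shell
# Hypothetical helper: read journalctl lines on stdin and report any gap
# longer than $1 seconds between consecutive task completions. In default
# journalctl output, field 3 is the HH:MM:SS timestamp.
detect_stalls() {
    awk -v gap="$1" '
    /succeeded in/ {
        split($3, t, ":")                 # HH:MM:SS -> seconds since midnight
        now = t[1]*3600 + t[2]*60 + t[3]
        if (prev != "" && now - prev > gap)
            printf "possible stall: %ds with no completed tasks\n", now - prev
        prev = now
    }'
}
```

It would be used against the monitoring command from step 5a, e.g. `journalctl -f | detect_stalls 60`.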
Four hours later:

    Dec 13 15:21:14 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[6bfb84e4-aa3d-4e26-b02a-f3803c2b8199] succeeded in 0.216811780992s: None
    Dec 13 15:21:14 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[828506ec-89da-4b62-9859-21042d06fcb4] succeeded in 0.10148732402s: None
    Dec 13 15:21:16 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[0bbe01b5-b969-4116-927a-83979c7f9e81] succeeded in 0.216806504992s: None

At no point did I encounter the deadlock.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3492
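The workaround steps (edit /etc/default/pulp_workers, comment out the PULP_MAX_TASKS_PER_CHILD line, restart services) can also be applied non-interactively. The sketch below is a hypothetical wrapper, not part of the documented workaround; it assumes the setting appears at the start of a line, as in the stock file:

```shell
# Hypothetical helper: comment out the PULP_MAX_TASKS_PER_CHILD line in the
# given pulp_workers config file, keeping a backup copy. Disabling worker
# recycling avoids the deadlock at the cost of higher worker memory use.
disable_task_recycling() {
    conf="$1"
    cp "$conf" "$conf.bak"                          # keep a backup copy
    # Prefix the setting with '# ', exactly as the workaround describes.
    sed -i 's/^\(PULP_MAX_TASKS_PER_CHILD=.*\)$/# \1/' "$conf"
}
```

Usage would follow the workaround's own steps, e.g. `disable_task_recycling /etc/default/pulp_workers && katello-service restart`.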