Red Hat Bugzilla – Bug 1491032
[deadlock] pulp workers appear idle even though many pulp tasks are in 'waiting' status
Last modified: 2018-06-13 11:44:14 EDT
Description of problem:

Note: this bug is related to pulp tasks, not foreman tasks.

In some cases, pulp will have idle workers even though tasks are in 'waiting' state. Restarting the pulp_workers service seems to clear this up.

This is likely caused by https://pulp.plan.io/issues/2979; https://pulp.plan.io/issues/2979#note-4 has further information on the root cause of the issue.

Version-Release number of selected component (if applicable):
6.2.11

How reproducible:
Difficult to reproduce.

Steps to Reproduce:
1. Create lots of pulp tasks over a long period of time.
2. Observe the number of tasks in 'running' state.

Actual results:
The number of 'running' tasks will slowly decrease, even though there are enough pulp workers to handle all tasks. This can be observed via 'pulp-admin tasks list'.

Expected results:
There should be roughly the same number of tasks in 'running' state as there are workers.

Additional info:
Restarting pulp_workers seems to help; commenting out PULP_MAX_TASKS_PER_CHILD in /etc/default/pulp_workers is another workaround, though it may cause increased memory usage.
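As a quick check of the symptom, the running-task count can be compared against the configured worker count. The following is a sketch only; the 'State:' field in the pulp-admin output and the PULP_CONCURRENCY setting name are assumptions that may differ between versions:

# Tasks currently reported as running (assumes 'State: Running' lines
# in the 'pulp-admin tasks list' output)
pulp-admin tasks list | grep -c 'State:.*Running'

# Configured number of pulp workers, if PULP_CONCURRENCY is set
grep PULP_CONCURRENCY /etc/default/pulp_workers

If the first number stays well below the second while tasks sit in 'waiting', the system is likely exhibiting this bug.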
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
*** Bug 1495532 has been marked as a duplicate of this bug. ***
To confirm you are hitting this bug, verify that the pulp_workers processes are not using any CPU while there are tasks waiting. If no tasks appear to be running but CPU is being used, the system is likely performing applicability regeneration and no further action is needed.
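One way to check this is sketched below; the 'celery worker' process pattern and the pulp-admin output format are assumptions that may need adjusting for your environment:

# Count tasks that pulp reports as waiting
pulp-admin tasks list | grep -c 'State:.*Waiting'

# Show CPU usage of the pulp worker (celery) processes
top -b -n 1 -p "$(pgrep -d, -f 'celery worker')"

If the waiting count is non-zero and the worker processes show roughly 0% CPU, you are likely hitting this bug.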
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.
## WORKAROUND INSTRUCTIONS ##

To avoid the deadlock introduced in this bug, please do the following:

1) Edit /etc/default/pulp_workers
2) Comment out this line:
   PULP_MAX_TASKS_PER_CHILD=2
   so it looks like:
   # PULP_MAX_TASKS_PER_CHILD=2
3) katello-service restart

This may cause increased memory consumption on the Celery worker processes, but it will avoid the deadlock situation caused by this bug. Hotfixes for this are available upon request.
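The same change can be made non-interactively. This is a sketch only; the sed pattern assumes the line appears exactly as PULP_MAX_TASKS_PER_CHILD=2, and it keeps a .bak backup of the original file:

# Comment out PULP_MAX_TASKS_PER_CHILD, then restart services as in step 3
sed -i.bak 's/^PULP_MAX_TASKS_PER_CHILD=2/# PULP_MAX_TASKS_PER_CHILD=2/' /etc/default/pulp_workers
katello-service restart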
Verified in Satellite 6.2.13 Snap 1. Followed the steps outlined in the aligned pulp issue.

1. Ensure PULP_MAX_TASKS_PER_CHILD is uncommented in /etc/default/pulp_workers

2. Modify /usr/lib64/python2.7/site-packages/pymongo/pool.py

-bash-4.2# cp /usr/lib64/python2.7/site-packages/pymongo/pool.py /usr/lib64/python2.7/site-packages/pymongo/pool.py.old

** edit pool.py **

-bash-4.2# diff /usr/lib64/python2.7/site-packages/pymongo/pool.py /usr/lib64/python2.7/site-packages/pymongo/pool.py.old
19d18
< import time
568d566
< time.sleep(.1)

3. Remove the pool .pyc and .pyo files

-bash-4.2# rm /usr/lib64/python2.7/site-packages/pymongo/pool.pyc
-bash-4.2# rm /usr/lib64/python2.7/site-packages/pymongo/pool.pyo

4. Restart katello services

5. Start the test

a. In one terminal, monitor journalctl with:

journalctl -f | grep 'succeeded in'

b. In a second terminal, run this command (change hostname):

enqueue(){ celery --app=pulp.server.async.app call --exchange=C.dq --routing-key=reserved_resource_worker-2@<hostname> pulp.server.async.tasks._release_resource '--args=["test"]'; }; while true; do for i in $(seq 1 5); do for j in $(seq 1 20); do enqueue & done; sleep 1; done; wait; done

6. Wait for at least two hours, monitoring the journalctl output for any stoppage.

-bash-4.2# enqueue(){ celery --app=pulp.server.async.app call --exchange=C.dq --routing-key=reserved_resource_worker-2@<hostname> pulp.server.async.tasks._release_resource '--args=["test"]'; }; while true; do for i in $(seq 1 5); do for j in $(seq 1 20); do enqueue & done; sleep 1; done; wait; done

...
Dec 13 11:21:15 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[42a9f392-216d-44ec-9db1-bab4137fa931] succeeded in 0.134064707003s: None
Dec 13 11:21:17 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[2f4fbf52-d7f1-4163-b0c2-fc76bcf460cd] succeeded in 0.663019994012s: None
Dec 13 11:21:18 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[41f4d7b6-07b3-4543-b0d8-8bf680f2ca70] succeeded in 0.105703887006s: None
Dec 13 11:21:20 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de01038d-571c-4b9e-837f-0910b787ec13] succeeded in 0.720609048003s: None
...

4 hours later:

Dec 13 15:21:14 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[6bfb84e4-aa3d-4e26-b02a-f3803c2b8199] succeeded in 0.216811780992s: None
Dec 13 15:21:14 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[828506ec-89da-4b62-9859-21042d06fcb4] succeeded in 0.10148732402s: None
Dec 13 15:21:16 ibm-x3250m4-06.lab.eng.rdu2.redhat.com pulp[27611]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[0bbe01b5-b969-4116-927a-83979c7f9e81] succeeded in 0.216806504992s: None

At no point did I encounter the deadlock.
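For readability, the reproducer one-liner from step 5b can also be laid out as a short script (a sketch of the same command; <hostname> and the worker name reserved_resource_worker-2 must match the target system):

#!/bin/bash
# Enqueue _release_resource tasks directly onto one pulp worker's queue.
enqueue() {
    celery --app=pulp.server.async.app call \
        --exchange=C.dq \
        --routing-key=reserved_resource_worker-2@<hostname> \
        pulp.server.async.tasks._release_resource '--args=["test"]'
}

# Fire 20 tasks per second in 5-second bursts, indefinitely.
while true; do
    for i in $(seq 1 5); do
        for j in $(seq 1 20); do
            enqueue &
        done
        sleep 1
    done
    wait
done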
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3492