Bug 1643650
| Summary: | pulp tasking system in Sat6.4 gets stuck, CV promotion waiting on start the pulp task | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Installation | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED DUPLICATE | QA Contact: | Perry Gagne <pgagne> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.4 | CC: | daviddavis, dconsoli, dgross, ehelms, hyu, ktordeur, peter.vreman, pmoravec |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-11 16:12:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1122832 | | |
Description
Pavel Moravec
2018-10-26 21:23:13 UTC
We hit this once on our internal Sat6.4 (provisioning.usersys.redhat.com) - in a situation when the (powerful, but still) VM got impacted by the host system running 100% CPU on all cores - maybe an idea for a reproducer..?

(In reply to Pavel Moravec from comment #2)
> We hit this once on our internal Sat6.4 (provisioning.usersys.redhat.com) -
> in a situation when the (powerfull but still) VM got impacted by host system
> running 100% CPU on all cores - maybe an idea for reproducer..?

Very reliable reproducer on my VM (under the same host, details will follow):

```
# qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q | grep resource
  reserved_resource_worker-0.lan.celery.pidbox  Y  0   2    2    0      1.30k  1.30k  1  2
  reserved_resource_worker-0.lan.dq2            Y  1   56   55   1.10k  70.9k  69.8k  1  2
  reserved_resource_worker-1.lan.celery.pidbox  Y  0   2    2    0      1.30k  1.30k  1  2
  reserved_resource_worker-1.lan.dq2            Y  4   50   46   5.06k  63.3k  58.2k  1  2
  reserved_resource_worker-2.lan.celery.pidbox  Y  0   2    2    0      1.30k  1.30k  1  2
  reserved_resource_worker-2.lan.dq2            Y  2   60   58   2.53k  75.9k  73.4k  1  2
  reserved_resource_worker-3.lan.celery.pidbox  Y  0   2    2    0      1.30k  1.30k  1  2
  reserved_resource_worker-3.lan.dq2            Y  7   50   43   8.70k  63.3k  54.6k  1  2
  resource_manager                              Y  10  111  101  17.9k  198k   180k   1  2
  resource_manager.lan.dq2                      Y  0   0    0    0      0      0      0  2
  resource_manager.lan.celery.pidbox            Y  0   2    2    0      1.30k  1.30k  1  2
  resource_manager.lan.dq2                      Y  0   0    0    0      0      0      1  2

# qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -u | grep resource
  1  resource_manager                              qpid.[::1]:5671-[::1]:44372  celery  19825  Y  CREDIT  57  6
  0  resource_manager.lan.celery.pidbox            qpid.[::1]:5671-[::1]:44372  celery  19825  Y  CREDIT  2   6
  2  resource_manager.lan.dq2                      qpid.[::1]:5671-[::1]:44372  celery  19825  Y  CREDIT  0   6
  0  reserved_resource_worker-3.lan.celery.pidbox  qpid.[::1]:5671-[::1]:44484  celery  20040  Y  CREDIT  2   8
  2  reserved_resource_worker-3.lan.dq2            qpid.[::1]:5671-[::1]:44484  celery  20040  Y  CREDIT  29  8
  0  reserved_resource_worker-0.lan.celery.pidbox  qpid.[::1]:5671-[::1]:44454  celery  20031  Y  CREDIT  2   2
  2  reserved_resource_worker-0.lan.dq2            qpid.[::1]:5671-[::1]:44454  celery  20031  Y  CREDIT  34  2
  0  reserved_resource_worker-2.lan.celery.pidbox  qpid.[::1]:5671-[::1]:44488  celery  20037  Y  CREDIT  2   2
  2  reserved_resource_worker-2.lan.dq2            qpid.[::1]:5671-[::1]:44488  celery  20037  Y  CREDIT  29  2
  0  reserved_resource_worker-1.lan.celery.pidbox  qpid.[::1]:5671-[::1]:44470  celery  20034  Y  CREDIT  2   4
  2  reserved_resource_worker-1.lan.dq2            qpid.[::1]:5671-[::1]:44470  celery  20034  Y  CREDIT  23  4

# celery -A pulp.server.async.app inspect active --timeout=200
-> reserved_resource_worker-1.lan: OK
    - empty -
-> reserved_resource_worker-0.lan: OK
    - empty -
^C

# pulp-admin -u admin -p $pulpAdminPassword tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:   publish
Resources:    c65968ce-3b4f-42a5-a208-337a2e3e03ac (repository)
State:        Waiting
Start Time:   Unstarted
Finish Time:  Incomplete
Task Id:      6e51fc42-2dff-4a5c-9bce-b86712bcd5e4
Worker Name:  reserved_resource_worker-1.lan

Operations:   publish
Resources:    a05dbb22-8756-44f1-9988-de6eec16df30 (repository)
State:        Waiting
Start Time:   Unstarted
Finish Time:  Incomplete
Task Id:      7ecb17c1-adc4-40b1-9e8d-d9efd68420dd
Worker Name:  reserved_resource_worker-3.lan
..
```

And the coredumps match the above backtraces.

Forgot to mention the reproducer steps:
1) generate some CPU load (here by querying mongo for all collections):

```shell
cd /root/pulp_db
for i in $(cat _tables.txt); do
  echo $i
  mongo pulp_database --eval "load('.mongorc.js'); db.${i}.find().shellPrint()" > /dev/null
done
```

2) forcefully publish all repos (except some of the biggest):

```shell
pulpAdminPassword=$(grep ^default_password /etc/pulp/server.conf | cut -d' ' -f2)
for i in 1 2 3 4 5; do
  for repo in $(ls /var/lib/pulp/published/yum/master/yum_distributor/ | grep -v -e c1db7f26-b874-4a9b-b0c9-233db901f114 -e d46be6ac-8f46-40da-890e-a983fb7ca6bb); do
    curl -i -H "Content-Type: application/json" -X POST \
      -d "{\"id\":\"$repo\",\"override_config\":{\"force_full\":true}}" \
      -u admin:$pulpAdminPassword \
      "https://$(hostname -f)/pulp/api/v2/repositories/$repo/actions/publish/" &
  done
  sleep 2
done
```

3) from time to time, check the celery status / qpid queues.
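The periodic check in step 3 can be automated. Below is a minimal, hypothetical helper (not part of the reproducer as originally posted) that counts tasks stuck in the Waiting/Unstarted state, given text in the `pulp-admin tasks list` output format shown in the description:

```python
def count_stuck_tasks(tasks_output):
    """Count tasks that are Waiting and were never started, parsing
    the `pulp-admin tasks list` output format shown above."""
    stuck = 0
    waiting = unstarted = False
    for line in tasks_output.splitlines():
        line = line.strip()
        if line.startswith("State:"):
            waiting = line.endswith("Waiting")
        elif line.startswith("Start Time:"):
            unstarted = line.endswith("Unstarted")
        elif line.startswith("Task Id:") and waiting and unstarted:
            stuck += 1
            waiting = unstarted = False
    return stuck

sample = """\
State:        Waiting
Start Time:   Unstarted
Task Id:      6e51fc42-2dff-4a5c-9bce-b86712bcd5e4

State:        Running
Start Time:   2018-10-26T21:00:00Z
Task Id:      7ecb17c1-adc4-40b1-9e8d-d9efd68420dd
"""
print(count_stuck_tasks(sample))  # -> 1
```

A non-zero count that never drains is the symptom this bug describes.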
The above reproducer _is_ reliable; reproduced on dedicated HW:

- have a separate PC and run a VM with Satellite on it
- generate CPU load on the PC (i.e. run another VM and compile some program in a loop)
- restart the pulp services on the Satellite VM
- bulk-create more (forcefully re)publish tasks
- check the celery status

Yes, I am actively working on this. It's my top priority at the moment. I'll try to provide more updates as I have them.

An update: it seems that disabling PULP_MAX_TASKS_PER_CHILD fixes the problem (at least for the reproducer we're using to test this bug). It looks like process recycling is the issue. Going to see if reconnecting to qpid helps at all.

I can confirm that:

- the crucial command to trigger the stuck workers is "celery -A pulp.server.async.app inspect active"
- both workarounds prevent the hang:
  - disabling worker recycling by commenting out "PULP_MAX_TASKS_PER_CHILD=2" in /etc/default/pulp_workers
  - or running the workers in "--pool solo" mode

I think our only option is to disable PULP_MAX_TASKS_PER_CHILD or at least set it to a higher number. 2 is overly aggressive and prone to causing problems. I was thinking that we could maybe attempt to reconnect to qpid after forking, but it appears that we've been down that road before: https://github.com/apache/qpid-python/commit/e859964c379a72a3d9fd6502829176ddb4f1b90b#diff-498240a0a2a6c24273a734324dfd386c

Quite reliable reproducer: have pulp celery worker recycling enabled, and run a few times:

```
celery -A pulp.server.async.app inspect active
```

That is enough.. :-/

Based on comment 17, the solution is to raise or disable PULP_MAX_TASKS_PER_CHILD. I believe this is set in the installer, so I'm reassigning.

Technically, the bug is within pulp (or in libraries it uses). Increasing PULP_MAX_TASKS_PER_CHILD can just alleviate the problem and decrease the probability of hitting it, but it isn't a resolution.
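For reference, the PULP_MAX_TASKS_PER_CHILD workarounds discussed above amount to a one-line change in the document's own `/etc/default/pulp_workers` file. A sketch of the relevant fragment (the value 10000 is taken from the comments in this bug; choose per your tolerance for worker memory growth):

```shell
# /etc/default/pulp_workers (sketch; only the relevant line shown)

# The shipped default recycles each celery worker after two tasks and
# triggers the hang described in this bug:
#PULP_MAX_TASKS_PER_CHILD=2

# Workaround: either leave the variable commented out entirely (recycling
# disabled, at the cost of gradual worker memory growth), or set it high
# enough that re-forking is rare:
PULP_MAX_TASKS_PER_CHILD=10000
```

The pulp worker services have to be restarted afterwards for the new value to take effect.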
Disabling pulp child recycling prevents this bug, but it is assumed the worker processes will then accumulate memory. Just 2c from a support guy: I am OK with whatever resolution of the bugzilla, as long as the fix won't cause regressions elsewhere.

Pavel, you're absolutely correct. There is an architectural problem in that celery uses a forking model and qpid doesn't support that. One solution might be to move to rabbitmq, but I believe that isn't an option. The other solution would be to move off celery and onto something like rq (which Pulp 3 currently uses). However, I'd worry about introducing regressions. Thus, I think increasing or disabling PULP_MAX_TASKS_PER_CHILD is the safest bet for now. We can reevaluate, though, if we continue to see more celery problems.

Pavel, David, I understand the issue and that it is hard (time-consuming and risky) to replace the child spawning in the core. Better to spend the time on the future and get pulp3 out quicker. I will start with PULP_MAX_TASKS_PER_CHILD at 10000 and see the frequency. As long as it gets down to once per 60 days, it is acceptable for me.

Peter

*** This bug has been marked as a duplicate of bug 1649938 ***
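As an editorial aside for readers unfamiliar with why forking breaks an open connection: the sketch below is a generic illustration (plain Python buffered file I/O, not pulp or qpid code, and POSIX-only). Unflushed userspace state is duplicated by `fork()`, and when both the parent and the recycled child later flush it, the underlying stream is corrupted. An AMQP connection inherited across a worker fork suffers the same class of problem, which is why the qpid-python commit linked above dealt with reconnecting after fork.

```python
import os
import tempfile

# Create a scratch file and a buffered writer over it.
fd, path = tempfile.mkstemp()
os.close(fd)
f = open(path, "w")
f.write("pending")   # small write: sits in the userspace buffer, not on disk yet

pid = os.fork()      # the child inherits a copy of the unflushed buffer
if pid == 0:
    f.close()        # child's close flushes its copy of "pending"
    os._exit(0)

os.waitpid(pid, 0)
f.close()            # parent's close flushes the very same bytes again

with open(path) as fh:
    data = fh.read()
os.unlink(path)
print(data)          # -> pendingpending: the stream was corrupted by the fork
```

The safe patterns are exactly the ones discussed in this bug: don't fork with live connection state (disable recycling), or give each child a fresh connection after the fork.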