Bug 1643382
| Summary: | pulp celery workers stuck in kombu polling events forever, all tasks stuck in NOT_STARTED state | | |
| --- | --- | --- | --- |
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Katello QA List <katello-qa-list> |
| Severity: | high | Priority: | high |
| Version: | 6.3.4 | CC: | bmbouter, daviddavis, koliveir, ktordeur, snemeth |
| Target Milestone: | Unspecified | Target Release: | Unused |
| Hardware: | x86_64 | OS: | Linux |
| Type: | Bug | Last Closed: | 2018-11-14 19:52:51 UTC |
Description (Pavel Moravec, 2018-10-26 06:42:39 UTC)
Created attachment 1497706 [details]: debugging patch

When the problem happens, it is worth trying this patch, which:

- will log (a lot of records) about polling of kombu events
- might prevent this race bug, since the extra logging can break the bad concurrency that causes it

To apply, upload the file to /tmp and run:

```
cd /usr/lib/python2.7/site-packages
cat /tmp/bz1643382-debugging.patch | patch -p1
for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done
```

To revert:

```
cd /usr/lib/python2.7/site-packages
cat /tmp/bz1643382-debugging.patch | patch -p1 -R
for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done
```

Created attachment 1497721 [details]: debugging patch for Sat 6.4

(hub.py differs slightly there, so a slightly different patch needs to be applied on 6.4.)
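As a small precaution beyond the instructions above, one might dry-run the patch first and confirm the services came back up afterwards. A minimal sketch, assuming the patch was uploaded to /tmp as described and that the pulp services are managed by systemd, as on RHEL 7:

```
cd /usr/lib/python2.7/site-packages
# Check that the patch would apply cleanly before modifying any files
patch -p1 --dry-run < /tmp/bz1643382-debugging.patch

# After applying and restarting, confirm the services came back up
systemctl status pulp_workers pulp_resource_manager pulp_celerybeat --no-pager
```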
Pavel,

Was there any event that precipitated this? Do you know if this happened on earlier versions of Satellite (< 6.3)? Also, any reproducer info would be helpful (e.g. does a heavy load trigger this?).

(In reply to David Davis from comment #6)
> Was there any event that precipitated this? Do you know if this happened on
> earlier versions of Satellite (< 6.3)? Also, any reproducer info would be
> helpful (e.g. does a heavy load trigger this?).

The trigger was a Capsule sync, i.e. an attempt to generate hundreds (or thousands) of sync tasks on the Capsule. I haven't seen this on earlier versions. Conversely, comparing against kombu upstream, I haven't noticed any upstream fix in the relevant area that is missing downstream (assuming kombu is the guilty one). I can try the above reproducer idea (fire many tasks in bulk), but I am afraid the race condition might require some specific factor (HW, hypervisor tuning of the VM with the Capsule, kernel tuning, ...) that I could easily miss.

FYI, the 2nd case attached has a different cause; it was split off to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1643650 (but sadly also without a reproducer). We can work on reproducers for either BZ, but we would appreciate an idea of what to try, based on what the threads are attempting to execute.

David, could you please explain to me at what phase of execution the threads are? My guess is they get stuck during startup due to some wrong interleaving of threads. So a potential (artificial) reproducer could be adding a sleep (with a very fine-tuned interval) to *some* places executed during startup such that the wrong interleaving is triggered. Does that sound reasonable? What areas of code are worth trying for that?

Pavel, I've been trying for days now to reproduce this without much success. I'm going to dig into the core dumps and code to see if I see anything. I'll try to look for some places we can add sleep statements to in order to reproduce.

Sporadic reproducer: follow https://bugzilla.redhat.com/show_bug.cgi?id=1643650#c5 (PLUS do a pulp services restart!) but on Satellite 6.3. On my reproducer system, there is a >50% probability that one (always just one of four) celery workers gets stuck: whatever amount of tasks one generates, the worker remains idle and celery inspect shows "empty", but coincidentally some tasks remain unstarted forever (even after all others have been processed and all workers are idle; it is quite bad to have orphaned tasks). Coredump analysis shows this stuck worker hits the backtraces described above.

Clearing the needinfo on this bug. I think at this point we are testing disabling or increasing PULP_MAX_TASKS_PER_CHILD to see if it fixes the problem.

Disabling PULP_MAX_TASKS_PER_CHILD in /etc/default/pulp_workers does *not* prevent this bugzilla. With child worker recycling disabled, I can still reproduce the same worker getting stuck, now even with the polling debugging patch applied (something that prevented the hang at customer(s), since the extra logging broke the bad concurrency triggering the underlying bug).

This "deadlocking", AIUI, occurs with some unknown but low probability on startup and fork. So I expect disabling PULP_MAX_TASKS_PER_CHILD to mitigate but not fully resolve the issue. Anytime a celery setup meets these 3 criteria:

1) pulp tasks are "stuck"
2) GDB shows the worker is waiting on a poll event that never happens
3) the Selector thread in qpid.messaging is confirmed to have *not* died

then I believe it's the "celery isn't fork safe w/ qpid.messaging" known issue. The actual root cause issue is in celery's architecture and their fork of the standard multiprocessing library.
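For reference, a minimal sketch of how one might check criteria 2) and 3) on a stuck worker with GDB. This assumes gdb plus the matching Python debuginfo (which provides the py-bt command) are installed, and that <PID> is a placeholder for the stuck celery worker process:

```
# Dump Python-level backtraces of all threads in the stuck worker
gdb -p <PID> \
    -ex 'set pagination off' \
    -ex 'thread apply all py-bt' \
    -ex detach -ex quit > /tmp/stuck-worker-bt.txt 2>&1

# Criterion 2: the main thread should be blocked in kombu's event loop poll
grep -n 'hub.py' /tmp/stuck-worker-bt.txt

# Criterion 3: confirm the qpid.messaging Selector thread is still present
grep -in 'selector' /tmp/stuck-worker-bt.txt
```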
@pmoravec, are you able to reproduce this problem on Sat 6.4? If not, I wonder if there was some improvement in celery 4 that fixes this.

(In reply to David Davis from comment #15)
> @pmoravec, are you able to reproduce this problem on Sat 6.4? If not, I
> wonder if there was some improvement in celery 4 to fix this.

I am not. I will try more today, but on 6.4:

- I can reproduce "the other worker stuck" bug
- if I disable pulp worker recycling, this BZ happens on 6.3 but it does not happen on 6.4

So I *think* the upgrade of kombu (or something else) to the version in 6.4 contains a fix for this BZ. But I will test this on 6.4 with worker recycling disabled more thoroughly, with focus on the symptoms of this bug.

(In reply to Pavel Moravec from comment #16)
> I am not. I will try more today but on 6.4:
>
> - I can reproduce "the other worker stuck" bug
> - if I disable pulp worker recycling, this BZ happens on 6.3 but it does not
> happen on 6.4 - so I *think* upgrade of kombu or else to the version in 6.4
> contains a fix of this BZ.
>
> But I will test this on 6.4 with disabled workers recycling more thoroughly,
> with focus on symptoms of this bug.

Indeed, I can't reproduce this bug on 6.4. There I get either the bz1643650 hang, or, when I disable worker recycling, no hang at all, across really many test runs. So I am closing this one as fixed in 6.4 by an unknown fix that can hardly be identified and backported.

Sounds good. Thank you, Pavel.
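For reference, the worker-recycling setting discussed in the comments above lives in /etc/default/pulp_workers. A minimal sketch of how it was toggled during these tests follows; the exact default value may differ between installations, and this is illustrative rather than a recommendation:

```
# /etc/default/pulp_workers (excerpt)

# Number of tasks a pulp worker child processes before it is recycled.
# Commenting the line out disables worker recycling entirely; raising the
# value makes recycling (and the fork it involves) less frequent.
# PULP_MAX_TASKS_PER_CHILD=2

# After editing, restart the workers so the change takes effect:
#   service pulp_workers restart
```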