Bug 1643382

Summary: pulp celery workers stuck in kombu polling events forever, all tasks stuck in NOT_STARTED state
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Pulp
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED NEXTRELEASE
QA Contact: Katello QA List <katello-qa-list>
Severity: high
Priority: high
Version: 6.3.4
CC: bmbouter, daviddavis, koliveir, ktordeur, snemeth
Target Milestone: Unspecified
Target Release: Unused
Hardware: x86_64
OS: Linux
Last Closed: 2018-11-14 19:52:51 UTC
Type: Bug
Attachments:
- debugging patch
- debugging patch for Sat6.4

Description Pavel Moravec 2018-10-26 06:42:39 UTC
Description of problem:
We got a situation when pulp tasking system got stuck with:
- some or many tasks pending/"running"
- celery workers not exhibiting any visible activity
- qpid queues for resource_manager and for workers being empty
- coredumps of all celery workers show they are polling some event from kombu (details follow)

Restarting pulp services sometimes helped, sometimes not.

Adding debug statements to kombu around the event polling unblocks the workers completely - thus the error must be a race condition, and adding those debug statements breaks the faulty interleaving of thread execution behind this bug.

This was observed repeatedly during a Capsule sync, blocking every task on the Capsule, but the generic symptoms suggest it can affect any user activity in Satellite that relies on pulp tasks.


Version-Release number of selected component (if applicable):
Sat 6.3.4
  pulp-server-2.13.4.14-1.el7sat.noarch
  python-billiard-3.3.0.20-2.el7sat.x86_64
  python-celery-3.1.17-1.el7sat.noarch
  python-kombu-3.0.33-8.pulp.el7sat.noarch


How reproducible:
??? no reproducer known


Steps to Reproduce:
???


Actual results:
- see symptoms above
- coredumps having backtraces with the problematic thread:
    Thread 4 (Thread 0x7fca4e3f0740
    #3  PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca23df0218, for file /usr/lib/python2.7/site-packages/kombu/utils/eventio.py, line 92, in _poll
    #7  PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca23de99b0, for file /usr/lib/python2.7/site-packages/kombu/utils/eventio.py, line 64, in poll
    #11 PyEval_EvalFrameEx (f=f@entry=Frame 0x34b8510, for file /usr/lib/python2.7/site-packages/kombu/async/hub.py, line 285, in create_loop
    #15 PyEval_EvalFrameEx (f=f@entry=Frame 0x34ba190, for file /usr/lib/python2.7/site-packages/celery/worker/loops.py, line 70, in asynloop
    #20 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca23dd63d0, for file /usr/lib/python2.7/site-packages/celery/worker/consumer.py, line 821, in start
    #24 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca29114050, for file /usr/lib/python2.7/site-packages/celery/bootsteps.py, line 123, in start
    #28 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca29105810, for file /usr/lib/python2.7/site-packages/celery/worker/consumer.py, line 278, in start
    #32 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca2910b3c0, for file /usr/lib/python2.7/site-packages/celery/bootsteps.py, line 374, in start
    #36 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca3ad645f0, for file /usr/lib/python2.7/site-packages/celery/bootsteps.py, line 123, in start
    #40 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca3ad64410, for file /usr/lib/python2.7/site-packages/celery/worker/__init__.py, line 206, in start
    #44 PyEval_EvalFrameEx (f=f@entry=Frame 0x2e791a0, for file /usr/lib/python2.7/site-packages/celery/bin/worker.py, line 212, in run
    #49 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca3c9e1620, for file /usr/lib/python2.7/site-packages/celery/bin/base.py, line 270, in __call__
    #58 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca3d1c6638, for file /usr/lib/python2.7/site-packages/celery/bin/worker.py, line 179, in run_from_argv
    #62 PyEval_EvalFrameEx (f=f@entry=Frame 0x2e279a0, for file /usr/lib/python2.7/site-packages/celery/bin/celery.py, line 693, in execute
    #66 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca3d25aa00, for file /usr/lib/python2.7/site-packages/celery/bin/celery.py, line 761, in handle_argv
    #70 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca40f01d00, for file /usr/lib/python2.7/site-packages/celery/bin/base.py, line 307, in execute_from_commandline
    #74 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca3ff64d38, for file /usr/lib/python2.7/site-packages/celery/bin/celery.py, line 769, in execute_from_commandline
    #78 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca408119b0, for file /usr/lib/python2.7/site-packages/celery/bin/celery.py, line 81, in main
    #82 PyEval_EvalFrameEx (f=f@entry=Frame 0x7fca44e82208, for file /usr/lib/python2.7/site-packages/celery/__main__.py, line 30, in main

Here, "kombu/async/hub.py, line 285, in create_loop" seems to be the stuck point (and adding logs around this line unblocked it).

Coredumps to be provided separately.


Expected results:
- no stuck celery workers
- pulp tasks executed

Additional info:

Comment 3 Pavel Moravec 2018-10-26 11:48:36 UTC
Created attachment 1497706 [details]
debugging patch

When the problem happens, it is worth trying this patch (the general idea is illustrated by the sketch after this list), which:

- logs (a lot of records) around kombu event polling
- might prevent this race bug, as the extra logs can break the bad concurrency causing it
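
For illustration only - the real change is the attached patch, which is not reproduced here - a minimal monkey-patch sketch of the same idea, assuming the kombu 3.0.x layout seen in the backtrace (kombu.async.hub.Hub.create_loop), would be:

    # Illustration only: the actual debugging patch is the attached file.
    # Assumes kombu 3.0.x, where celery's asynloop drives the generator
    # returned by Hub.create_loop() (see the backtrace in the description).
    import logging
    from kombu.async import hub as kombu_hub

    log = logging.getLogger(__name__)
    _orig_create_loop = kombu_hub.Hub.create_loop

    def create_loop_with_logging(self, *args, **kwargs):
        # Delegate to the original generator, logging every iteration
        # of the hub's event loop.
        log.debug('hub.create_loop: entering event loop')
        for item in _orig_create_loop(self, *args, **kwargs):
            log.debug('hub.create_loop: loop iteration done, yielding')
            yield item

    kombu_hub.Hub.create_loop = create_loop_with_logging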


To apply:

- upload the file to /tmp

cd /usr/lib/python2.7/site-packages

cat /tmp/bz1643382-debugging.patch | patch -p1

for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done


To revert back:

cd /usr/lib/python2.7/site-packages

cat /tmp/bz1643382-debugging.patch | patch -p1 -R

for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done

Comment 4 Pavel Moravec 2018-10-26 12:12:13 UTC
Created attachment 1497721 [details]
debugging patch for Sat6.4

(hub.py differs slightly, so a slightly different patch needs to be applied on 6.4)

Comment 6 David Davis 2018-10-26 20:53:34 UTC
Pavel, 

Was there any event that precipitated this? Do you know if this happened on earlier versions of Satellite (< 6.3)? Also, any reproducer info would be helpful (e.g. does a heavy load trigger this?).

Comment 7 Pavel Moravec 2018-10-26 21:25:18 UTC
(In reply to David Davis from comment #6)
> Pavel, 
> 
> Was there any event that precipitated this? Do you know if this happened on
> earlier versions of Satellite (< 6.3)? Also, any reproducer info would be
> helpful (e.g. does a heavy load trigger this?).

The trigger was a Capsule sync, i.e. an attempt to generate hundreds (or thousands) of sync tasks on the Capsule.

I haven't seen this earlier. On the contrary, comparing with kombu upstream, I haven't noticed any upstream fix in the relevant area that is missing downstream (if I assume kombu is the guilty one).

I can try the above reproducer idea (fire many tasks in bulk), but I am afraid the race condition might require some specific factor (HW, hypervisor tuning of the Capsule VM, kernel tuning, ..) that I can easily miss.


FYI the 2nd case attached has a different cause - split off to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1643650 (but sadly also without a reproducer).

We can work on reproducers for either BZ, but we would appreciate an idea of what to try, based on what the threads are attempting to execute.

Comment 8 Pavel Moravec 2018-11-01 15:16:12 UTC
David,
could you please explain to me what phase of execution the threads are in?

My guess is they get stuck during startup due to some wrong interleaving of threads. So a potential (artificial) reproducer could be adding a sleep (with a very finely tuned interval) to *some* places executed during startup such that the wrong interleaving is invoked; a sketch of the idea follows.
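
Something along these lines, as an untested sketch (the environment variable is a made-up knob for experimenting, and asynloop is picked only because it appears in the backtrace):

    # Artificial-reproducer experiment, sketch only; apply (import) this
    # before the worker sets up its consumer, otherwise the patch has no effect.
    # It delays entry into celery's async event loop by a tunable amount,
    # trying to provoke the suspected bad interleaving of startup threads.
    import os
    import time
    from celery.worker import loops

    _orig_asynloop = loops.asynloop

    def delayed_asynloop(*args, **kwargs):
        # PULP_BZ1643382_SLEEP is a hypothetical knob, not a real setting.
        time.sleep(float(os.environ.get('PULP_BZ1643382_SLEEP', '0')))
        return _orig_asynloop(*args, **kwargs)

    loops.asynloop = delayed_asynloop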

Does that sound reasonable?

What areas of the code are worth trying for that?

Comment 9 David Davis 2018-11-01 16:22:16 UTC
Pavel,

I've been trying for days now to reproduce this without much success. I'm going to dig into the core dumps and code to see if I can spot anything. I'll try to look for some places where we can add sleep statements in order to reproduce.

Comment 10 Pavel Moravec 2018-11-08 07:39:52 UTC
Sporadic reproducer: follow https://bugzilla.redhat.com/show_bug.cgi?id=1643650#c5 (PLUS do pulp service restart!) but on Satellite 6.3.

On my reproducer system, there is a >50% probability that one (always just one of four) celery workers gets stuck - whatever amount of tasks one generates, that worker stays idle and celery inspect shows it as "empty", yet some tasks remain unstarted forever (even after all the others have been processed and all workers are idle - having orphaned tasks like this is quite bad..). Coredump analysis shows this stuck worker hits the backtraces described above.
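
(For reference, the "celery inspect" check above boils down to something like the following from Python; the module path of Pulp's celery app is an assumption here and may need adjusting:)

    # Ask the running workers what they think they are doing.
    # 'pulp.server.async.celery_instance' is assumed to be Pulp 2's celery app.
    from pulp.server.async.celery_instance import celery

    insp = celery.control.inspect(timeout=5)
    print('active:  ', insp.active())    # stuck worker reports nothing...
    print('reserved:', insp.reserved())  # ...while some tasks stay unstarted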

Comment 12 David Davis 2018-11-13 18:42:03 UTC
Clearing the needinfo on this bug. I think at this point we are testing disabling or increasing PULP_MAX_TASKS_PER_CHILD to see if it fixes the problem.

Comment 13 Pavel Moravec 2018-11-13 21:24:06 UTC
Disabling PULP_MAX_TASKS_PER_CHILD in /etc/default/pulp_workers does *not* prevent this bugzilla.

With child worker recycling disabled, I can still reproduce the same worker getting stuck. Now even with the polling debugging patch applied (something that prevented the hang at customer(s), as the extra logs broke the bad concurrency triggering the underlying bug).

Comment 14 Brian Bouterse 2018-11-14 14:47:29 UTC
This "deadlocking", AIUI, occurs with some unknown but low probability on startup and fork. So I expect disabling PULP_MAX_TASKS_PER_CHILD to mitigate but not fully resolve the issue.

Anytime celery tasking meets these 3 criteria:
1) pulp tasks are "stuck"
2) GDB shows that it's waiting on a poll event that never happens
3) the Selector thread in qpid.messaging is confirmed to have *not* died

then I believe it's the known "celery isn't fork safe w/ qpid.messaging" issue. The actual root cause is in celery's architecture and their fork of the standard multiprocessing library.
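
(As a crude way to check criterion 3 from inside a suspect worker, e.g. when dropped into it with a debugger, one can simply enumerate the live threads and see whether the qpid.messaging selector/I/O thread is still among them; how that thread is named is an internal detail of qpid.messaging, so treat this purely as a sketch:)

    # Sketch only: list the live threads in the worker process so one can
    # check whether the qpid.messaging selector/I/O thread is still there.
    import threading

    for t in threading.enumerate():
        print(t.name, t.daemon, t.ident)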

Comment 15 David Davis 2018-11-14 15:27:32 UTC
@pmoravec, are you able to reproduce this problem on Sat 6.4? If not, I wonder if there was some improvement in celery 4 to fix this.

Comment 16 Pavel Moravec 2018-11-14 15:49:25 UTC
(In reply to David Davis from comment #15)
> @pmoravec, are you able to reproduce this problem on Sat 6.4? If not, I
> wonder if there was some improvement in celery 4 to fix this.

I am not. I will try more today but on 6.4:

- I can reproduce "the other worker stuck" bug
- if I disable pulp worker recycling, this BZ happens on 6.3 but not on 6.4 - so I *think* the upgrade of kombu (or something else) to the version in 6.4 contains a fix for this BZ.

But I will test this on 6.4 with worker recycling disabled more thoroughly, with a focus on the symptoms of this bug.

Comment 17 Pavel Moravec 2018-11-14 19:52:51 UTC
(In reply to Pavel Moravec from comment #16)
> (In reply to David Davis from comment #15)
> > @pmoravec, are you able to reproduce this problem on Sat 6.4? If not, I
> > wonder if there was some improvement in celery 4 to fix this.
> 
> I am not. I will try more today but on 6.4:
> 
> - I can reproduce "the other worker stuck" bug
> - if I disable pulp worker recycling, this BZ happens on 6.3 but not on 6.4 -
> so I *think* the upgrade of kombu (or something else) to the version in 6.4
> contains a fix for this BZ.
> 
> But I will test this on 6.4 with worker recycling disabled more thoroughly,
> with a focus on the symptoms of this bug.

Indeed, I can't reproduce this bug in 6.4. There I get either the bz1643650 hang, or, when I disable worker recycling, no hang at all - in really many tests tried.

So, I am closing this one as fixed in 6.4 by an unknown fix that can hardly be identified & backported.

Comment 18 David Davis 2018-11-14 20:43:11 UTC
Sounds good. Thank you Pavel.