Bug 1272656 - pulp celery worker memory leak and high CPU after restarting qpidd
Status: NEW
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Pulp
Version: 6.1.1
Hardware: All  OS: Linux
Priority: low  Severity: medium
Target Milestone: GA
Target Release: --
Assigned To: satellite6-bugs
QA Contact: Katello QA List
: Triaged
Reported: 2015-10-17 07:15 EDT by Pavel Moravec
Modified: 2017-08-15 15:24 EDT
CC: 15 users

Doc Type: Bug Fix
Type: Bug


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2137771 None None None 2016-01-25 04:00 EST
Pulp Redmine 1330 Normal NEW celery worker memory leak and high CPU after restarting qpidd many times 2017-03-21 15:20 EDT

Description Pavel Moravec 2015-10-17 07:15:34 EDT
Description of problem:
There is a race condition when 1) restarting qpidd on the Satellite while 2) (re-)registering a content host (either to a Capsule or to the Satellite): some celery worker processes (usually two) start to consume more and more CPU and memory, while some other celery worker processes consume "only" much CPU.


Version-Release number of selected component (if applicable):
python-celery-3.1.11-1.el7sat.noarch
python-qpid-0.30-6.el7.noarch


How reproducible:
100%, within a few minutes


Steps to Reproduce:
1. On Satellite, restart qpidd regularly
while true; do service qpidd restart; sleep 1.1; sleep $(($((RANDOM))%2)); done

2. On a content host, re-register it regularly:
while true; do subscription-manager register --org="Default_Organization" --environment="Library" --username=admin --password=SWcwZ226oErH6ZqC --force; sleep 5; done

3. Monitor memory usage of celery processes on Satellite:
date; ps aux | grep celer | awk '{ print $4 }' | sort -nu
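A variant of the one-liner above (my own suggestion, not part of the original report) that prints pid, %CPU, and %MEM per worker, sorted by memory, so an individual growing process is easier to spot:

```shell
# Show pid, %cpu, %mem of celery workers, highest memory first.
# The [y] bracket trick keeps awk's own process entry from matching.
ps aux | awk '/celer[y]/ { print $2, $3, $4 }' | sort -k3 -nr | head
```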


Actual results:
After some time, some celery worker process starts consuming more and more memory (and also CPU).

From that point on, you can stop the whole reproducer and the affected celery process(es) will keep boiling your system - they consume more and more CPU and memory on their own.


Expected results:
CPU and memory usage should not increase over time.


Additional info:
Backtrace of busy thread is almost always:

/usr/lib/python2.7/site-packages/kombu/transport/qpid.py, line 473
/usr/lib/python2.7/site-packages/kombu/transport/qpid.py, line 1097
/usr/lib/python2.7/site-packages/kombu/messaging.py, line 184
/usr/lib/python2.7/site-packages/kombu/messaging.py, line 168
/usr/lib/python2.7/site-packages/celery/events/__init__.py, line 215
/usr/lib/python2.7/site-packages/celery/events/__init__.py, line 238
/usr/lib/python2.7/site-packages/celery/events/__init__.py, line 251
/usr/lib/python2.7/site-packages/celery/worker/consumer.py, line 512
/usr/lib/python2.7/site-packages/celery/bootsteps.py, line 123
/usr/lib/python2.7/site-packages/celery/worker/consumer.py, line 278
/usr/lib/python2.7/site-packages/celery/bootsteps.py, line 373
/usr/lib/python2.7/site-packages/celery/bootsteps.py, line 123
/usr/lib/python2.7/site-packages/celery/worker/__init__.py, line 206
/usr/lib/python2.7/site-packages/celery/bin/worker.py, line 212
/usr/lib/python2.7/site-packages/celery/bin/base.py, line 269
/usr/lib/python2.7/site-packages/celery/bin/worker.py, line 179
/usr/lib/python2.7/site-packages/celery/bin/celery.py, line 693
/usr/lib/python2.7/site-packages/celery/bin/celery.py, line 761
/usr/lib/python2.7/site-packages/celery/bin/base.py, line 306
/usr/lib/python2.7/site-packages/celery/bin/celery.py, line 769
/usr/lib/python2.7/site-packages/celery/bin/celery.py, line 81
/usr/lib/python2.7/site-packages/celery/__main__.py, line 30

(once the top frame was "/usr/lib/python2.7/site-packages/kombu/messaging.py, line 150" instead of "messaging.py, line 168")
Comment 2 Pavel Moravec 2015-10-17 07:29:08 EDT
Raised sat-6.1.z? since I first noticed the problem only at a late stage, when my Satellite was swapping, with system load > 10 and totally unresponsive.
Comment 4 Michael Hrivnak 2015-10-18 22:43:10 EDT
Brian, what do you think?
Comment 5 Brian Bouterse 2015-10-19 18:28:02 EDT
I've dug into this some today and I was able to reproduce it on a vanilla Pulp 2.6.4 system. I didn't even need to register consumers; I only restarted qpidd a lot using the command from comment 1. After maybe 10 minutes, my celery processes were at 80-100% CPU. That continued even after I stopped restarting qpidd, so once the processes are "dirty" they stay that way, even hours later.

I do observe the following in my logs:

  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 795, in close
    self.sync(timeout=timeout)
  File "<string>", line 6, in sync
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 786, in sync
    if not self._ewait(lambda: not self.outgoing and not self.acked, timeout=timeout):
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 596, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 234, in _ewait
    self.check_error()
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 226, in check_error
    self.close()
  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 345, in close
    ssn.close(timeout=timeout)
  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 795, in close
    self.sync(timeout=timeout)
  File "<string>", line 6, in sync
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 786, in sync
    if not self._ewait(lambda: not self.outgoing and not self.acked, timeout=timeout):
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 596, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 234, in _ewait
    self.check_error()
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 226, in check_error
    self.close()
  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 345, in close
    ssn.close(timeout=timeout)
  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 795, in close
    self.sync(timeout=timeout)
  File "<string>", line 6, in sync
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 786, in sync
    if not self._ewait(lambda: not self.outgoing and not self.acked, timeout=timeout):
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 596, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
RuntimeError: maximum recursion depth exceeded

The above traceback is produced by all affected processes regularly until they are halted. I believe qpid.messaging has an intermittent disconnect problem where, if the server side closes unexpectedly, the client qpid.messaging threads remain stuck in an endless ewait() loop. I'm going to talk with the qpid team about troubleshooting steps tomorrow. I'll post back here with what I learn.
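The "maximum recursion depth exceeded" failure can be illustrated with a minimal model of the cycle visible in the traceback (close -> sync -> _ewait -> check_error -> close). This is my own simplified sketch, not the actual qpid.messaging code:

```python
# Minimal model (not qpid.messaging itself) of the mutual recursion in the
# traceback: an error discovered during close() leads back into close() via
# sync() -> _ewait() -> check_error(), until Python's recursion limit trips.
class Endpoint:
    def __init__(self):
        # Hypothetical persistent error state, e.g. set when the broker dies.
        self.error = ConnectionError("broker connection lost")

    def close(self):
        self.sync()                      # close waits for outstanding work

    def sync(self):
        self._ewait(lambda: True)        # wait helper checks for errors first

    def _ewait(self, predicate):
        self.check_error()
        return predicate()

    def check_error(self):
        if self.error:
            self.close()                 # error handling re-enters close()

err = None
try:
    Endpoint().close()
except RuntimeError as exc:              # RecursionError subclasses RuntimeError
    err = exc
print(type(err).__name__)               # -> RecursionError
```

Because the error state is never cleared, any fix would have to break the cycle (e.g. not re-entering close() from the error check) rather than just handling the disconnect.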
Comment 6 Gordon Sim 2015-10-20 14:25:06 EDT
Pavel, does this look like https://issues.apache.org/jira/browse/QPID-6297 ? Seems like the same sort of error, and I see you raised that issue upstream also. From https://brewweb.devel.redhat.com/buildinfo?buildID=425731 it looks like the client side of the fix did not make it into the build you were using here?
Comment 7 Brian Bouterse 2015-10-20 14:40:28 EDT
Not sure if this is helpful, but I was able to reproduce the issue described in comment 5 on fedora 22 with these packages:

python-qpid-proton-0.9-3.fc22.x86_64
qpid-cpp-server-store-0.32-4.fc22.x86_64
qpid-cpp-server-0.32-4.fc22.x86_64
python-qpid-0.32-4.fc22.x86_64
qpid-proton-c-0.9-3.fc22.x86_64
qpid-cpp-client-0.32-4.fc22.x86_64
python-qpid-qmf-0.28-29.fc22.x86_64
qpid-tools-0.32-4.fc22.x86_64
qpid-qmf-0.28-29.fc22.x86_64
python-qpid-common-0.32-4.fc22.x86_64
Comment 8 Gordon Sim 2015-10-20 14:48:59 EDT
(In reply to Brian Bouterse from comment #7)
> Not sure if this is helpful, but I was able to reproduce the issue described
> in comment 5 on fedora 22 with these packages:
[...]
> python-qpid-0.32-4.fc22.x86_64

I can't find exactly what's in that package through koji, but 0.32 was released in March (i.e. before the fix for QPID-6297) and the latest fedora package (also based on 0.32) doesn't seem to contain the bug fix: http://koji.fedoraproject.org/koji/buildinfo?buildID=687194.
Comment 9 pulp-infra@redhat.com 2015-10-20 17:00:19 EDT
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
Comment 10 pulp-infra@redhat.com 2015-10-20 17:00:23 EDT
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
Comment 11 Pavel Moravec 2015-10-27 17:25:18 EDT
(In reply to Gordon Sim from comment #6)
> Pavel, does this look like https://issues.apache.org/jira/browse/QPID-6297
> ?Seems like the same sort of error and I see you raised that issue upstream
> also. From https://brewweb.devel.redhat.com/buildinfo?buildID=425731 it
> looks like the client side of the fix did not make it into the build you
> were using here?

Even with https://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/messaging/driver.py?r1=1684716&r2=1684715&pathrev=1684716 backported, I can still reproduce the same high CPU usage.
Comment 12 Gordon Sim 2015-10-28 05:25:07 EDT
(In reply to Pavel Moravec from comment #11)
> Backporting the
> https://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/messaging/driver.
> py?r1=1684716&r2=1684715&pathrev=1684716 , I can still reproduce the same
> high CPU usage.

What version of qpidd was that using?
Comment 13 Brian Bouterse 2015-10-28 08:48:41 EDT
I was also able to reproduce this outside of satellite (just using Pulp+celery+qpid). My packages are listed here in comment 7.
Comment 14 Gordon Sim 2015-10-28 08:53:31 EDT
(In reply to Brian Bouterse from comment #13)
> I was also able to reproduce this outside of satellite (just using
> Pulp+celery+qpid). My packages are listed here in comment 7.

I don't think those packages include either the client-side or server-side part of the fix, as it was committed upstream after 0.32 was released.
Comment 15 Brian Bouterse 2015-10-28 08:55:38 EDT
I can easily reproduce this if you want me to test a specific version. Having some rpms to install for F22 would make the testing even easier, but another way would work too. Let me know how I can help.
Comment 16 Gordon Sim 2015-10-28 09:34:13 EDT
(In reply to Brian Bouterse from comment #15)
> I can easily reproduce this if you want me to test a specific version.
> Having some rpms to install for F22 would make the testing even easier, but
> another way would work too. Let me know how I can help.

Unfortunately I don't think there are rpms at present that would fix this. Until there are, it would require either patching the source for the current rpm and rebuilding, or just using the trunk upstream.
Comment 17 Brian Bouterse 2015-10-28 10:44:27 EDT
I can use the upstream trunk, no problem. Which specific python client package do you want me to test? Is it python-qpid? What about the other packages below? Can I mix a trunk-based python-qpid with my existing versions from comment 7?

python-qpid-qmf
qpid-tools
qpid-qmf
python-qpid-common
Comment 18 Gordon Sim 2015-10-28 10:53:21 EDT
(In reply to Brian Bouterse from comment #17)
> I can use the upstream trunk, no problem. Which specific python client
> package do you want me to test? Is it python-qpid?

Yes, though I think you will also need an up-to-date broker (the fix, which I was not directly involved with, appears to have two parts, one in the client and one in the broker).

> What about these other
> packages below? Can I mix a trunk-based python-qpid my existing versions
> from comment 7?
> 
> python-qpid-qmf
> qpid-tools
> qpid-qmf

Yes, no need to update any of the above

> python-qpid-common

From koji, this appears to be essentially the old 0-9/0-10 client, so I don't think there is any need to worry about that either.

On the client, make sure the place you load qpid.messaging from is the place with the fix, and make sure the qpidd running also has the fix.
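Gordon's last point - making sure the worker actually loads the patched copy of qpid.messaging - can be checked from Python. A small helper (my own suggestion, not from the thread; demonstrated with a stdlib module, substitute "qpid.messaging" on the target system):

```python
import importlib.util

def module_location(name):
    """Return the file a module would be loaded from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# Run this inside the same environment the celery workers use, e.g.:
#   module_location("qpid.messaging")
print(module_location("json"))  # stdlib example; prints a path ending in .py
```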
Comment 19 Pavel Moravec 2015-10-28 12:26:11 EDT
(In reply to Gordon Sim from comment #12)
> (In reply to Pavel Moravec from comment #11)
> > Backporting the
> > https://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/messaging/driver.
> > py?r1=1684716&r2=1684715&pathrev=1684716 , I can still reproduce the same
> > high CPU usage.
> 
> What version of qpidd was that using?

qpid-cpp-server-0.30-9.el7.x86_64, where I expect the broker part of the fix is not applied.

I can try building upstream qpid on my Satellite with the post-0.32 fix, to see if it prevents the celery high CPU usage etc.

Anyway, it sounds like there is some issue on the client side (or in pulp/celery?) even on trunk. Assume a python-qpid client from trunk is used against qpid-cpp-server 0.30 - it would exhibit the same high CPU usage and memory leak. I.e. the client (or pulp/celery) is not robust enough against a misbehaving broker it communicates with.

Anyway, I will (try to) build upstream qpid on Satellite and test it.
Comment 20 Pavel Moravec 2015-10-28 17:13:30 EDT
(In reply to Pavel Moravec from comment #19)
> Anyway, I will (try to) build upstream qpid on Satellite and test it.

It was quite tricky due to linearstore incompatibility between 0.30 and upstream, but:

- the problem is reproducible on 0.30 when I disable the linearstore.so plugin
- the problem is reproducible on qpid trunk (r1711138) with linearstore disabled and with python-qpid-0.30-6 patched by r1684716 (the client part of the patch in driver.py)

So just backporting QPID-6297 (both the client and broker side) won't help. :(
Comment 21 Gordon Sim 2015-10-29 05:25:33 EDT
Ok, thanks for checking and sorry for the wild goose chase!
Comment 22 Pavel Moravec 2015-10-29 06:00:51 EDT
Adding some debug statements shows the busy loop is due to this cycle:

celery/events/__init__.py:

    def flush(self):
        """Flushes the outbound buffer."""
        while self._outbound_buffer:
            try:
                type, fields, _ = self._outbound_buffer.popleft()
            except IndexError:
                return
            self.send(type, **fields)


That while loop executes forever, calling channel.basic_publish, which calls _put for address 'celeryev/worker.offline; {assert: always, node: {type: topic}}'.
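Under the assumption (mine; the actual celery code path may differ) that a failed publish puts the event back on the outbound buffer, the flush() loop quoted above can never drain the deque. A minimal sketch, with an artificial cap so it terminates:

```python
from collections import deque

# Minimal model (assumed behavior, not celery's actual code) of why flush()
# can spin forever: publish fails, send() re-buffers the event, and the
# deque that controls the while loop never empties.
class Dispatcher:
    def __init__(self):
        self._outbound_buffer = deque([("worker.offline", {}, None)])
        self.attempts = 0

    def publish(self, type, fields):
        raise ConnectionError("broker unreachable")  # simulated dead broker

    def send(self, type, **fields):
        try:
            self.publish(type, fields)
        except Exception:
            # Re-buffering the failed event recreates the loop condition.
            self._outbound_buffer.append((type, fields, None))

    def flush(self, max_iterations=1000):
        # Same loop shape as the celery flush() quoted above, but capped.
        while self._outbound_buffer and self.attempts < max_iterations:
            self.attempts += 1
            try:
                type, fields, _ = self._outbound_buffer.popleft()
            except IndexError:
                return
            self.send(type, **fields)

d = Dispatcher()
d.flush()
print(d.attempts, len(d._outbound_buffer))  # -> 1000 1
```

With the cap removed, this loop spins at 100% CPU indefinitely, which matches the busy-loop behavior observed in the debug log below.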

In logs with my debugs, I see:

Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 systemd: Stopping An AMQP message broker daemon....
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.online; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: pulp.server.async.scheduler:ERROR: connection aborted
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
..

The self._outbound_buffer never gets empty; other debug output shows the loop never finishes.

So this sounds like a bug in celery, not in python-qpid or the qpid broker.


Brian, could you confirm that by adding debug statements like mine?

[root@pmoravec-sat61-rhel7 ~]# grep -rn PavelM /usr/lib/python2.7/site-packages/
/usr/lib/python2.7/site-packages/celery/events/__init__.py:249:        logger.error("PavelM: flush:pred cyklem: self._outbound_buffer.size()=%s", len(self._outbound_buffer))
/usr/lib/python2.7/site-packages/celery/events/__init__.py:251:#            logger.error("PavelM: flush: self._outbound_buffer.size()=%s", len(self._outbound_buffer))
/usr/lib/python2.7/site-packages/celery/events/__init__.py:257:        logger.error("PavelM: flush:za cyklem")
/usr/lib/python2.7/site-packages/kombu/messaging.py:184:        logger.error("PavelM: channel.basic_publish")
/usr/lib/python2.7/site-packages/kombu/messaging.py:206:        logger.error("PavelM: revive: is_connection(channel)=%s, isinstance(channel, ChannelPromise)=%s" %(is_connection(channel), isinstance(channel, ChannelPromise)))
/usr/lib/python2.7/site-packages/kombu/transport/qpid.py:473:        logger.error("PavelM: new sender: '%s'" % address)
[root@pmoravec-sat61-rhel7 ~]#
Comment 24 Brian Bouterse 2015-11-30 10:00:18 EST
@pmoravec, on IRC you mentioned that there is a race condition which can cause memory usage to jump significantly even with a single restart of qpidd. What is the race condition exactly?

Also, the reproducer shows a slowly growing memory leak which takes hundreds of restarts to occur. How do you know the slow memory leak is the same as / related to the race condition?
Comment 25 Pavel Moravec 2015-12-05 07:02:39 EST
(In reply to Brian Bouterse from comment #24)
> @pmoravec, on IRC you mentioned that there is a race condition which can
> cause memory usage to jump significantly even with a single restart of
> qpidd. What is the race condition exactly?
That is something I don't know. I noticed that a customer did a single qpidd restart, and since that time several celery processes started to consume more memory and CPU. When reproducing this, I had to restart qpidd many times to hit it - apparently there is *some* race condition where the qpidd restart must happen at a specific moment, when a celery process is in some particular state / executing some particular command, in order to trigger this error. By "race condition" I meant that the reproducer is not 100% deterministic on each broker restart; there is just some small probability of triggering the bug.

> Also, the reproducer is of a slowly growing memory leak which takes hundreds
> of restarts to occur. How do you know the slow memory leak is the
> same/related to the race condition?
Once a qpidd restart triggers "something bad" in a celery process, that celery process starts to consume memory and CPU regardless of any further qpidd restarts - it is unaffected by such further restarts.

Simply put, if you are "lucky" enough, a single qpidd broker restart triggers a celery process to start consuming more and more CPU and memory, without any further intervention at all.

If you are not lucky, restart qpidd again in an attempt to trigger the flaw once more. And so on.
Comment 26 Pavel Moravec 2016-01-25 04:00:23 EST
Confirmed that the simplified reproducer of just restarting qpidd, without anything else, *can* trigger this bug on 6.1.6.
Comment 27 pulp-infra@redhat.com 2016-02-04 16:00:30 EST
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
Comment 30 Pavel Moravec 2016-06-17 06:24:02 EDT
(In reply to Pavel Moravec from comment #26)
> Confirmed the simplified reproducer in just restarting qpidd without
> anything else *can* trigger this bug on 6.1.6.

Now I have a Sat 6.1.9 where *every* qpidd restart triggers the high CPU usage :(
Comment 31 pulp-infra@redhat.com 2016-12-12 11:03:13 EST
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
Comment 37 Brian Bouterse 2017-03-16 11:47:33 EDT
In my testing, this issue is triggered by a large number of restarts. It's important to track, but I don't think it's one a customer will experience. With that in mind, I think the priority and severity are wrong; I'm lowering them to low priority and medium severity. If this isn't right, please adjust accordingly.

Also, as an aside, upstream is starting to upgrade to Celery 4 instead of 3.1, so we are going to retest against Celery 4 as an approach to resolving this bug.
Comment 38 pulp-infra@redhat.com 2017-03-21 15:20:49 EDT
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
