Description of problem:
There is a race condition when doing 1) a qpidd restart on Satellite and 2) a (re-)registration of a content host (either to a Capsule or to the Satellite): some celery worker processes (usually two) start to consume more and more CPU and memory. Some other celery worker processes consume much CPU "only".

Version-Release number of selected component (if applicable):
python-celery-3.1.11-1.el7sat.noarch
python-qpid-0.30-6.el7.noarch

How reproducible:
100% within a few minutes

Steps to Reproduce:
1. On the Satellite, restart qpidd regularly:
while true; do service qpidd restart; sleep 1.1; sleep $(($((RANDOM))%2)); done
2. On a content host, re-register it regularly:
while true; do subscription-manager register --org="Default_Organization" --environment="Library" --username=admin --password=SWcwZ226oErH6ZqC --force; sleep 5; done
3. Monitor memory usage of the celery processes on the Satellite (column 4 of ps aux is %MEM):
date; ps aux | grep celer | awk '{ print $4 }' | sort -nu

Actual results:
After some time, some celery worker process starts consuming more and more memory (and also CPU). From that point on, you can stop the whole reproducer and still let the celery process(es) boil your system - they will consume more and more CPU and memory.

Expected results:
No CPU or memory increase over time.

Additional info:
Backtrace of the busy thread is almost always:
/usr/lib/python2.7/site-packages/kombu/transport/qpid.py, line 473
/usr/lib/python2.7/site-packages/kombu/transport/qpid.py, line 1097
/usr/lib/python2.7/site-packages/kombu/messaging.py, line 184
/usr/lib/python2.7/site-packages/kombu/messaging.py, line 168
/usr/lib/python2.7/site-packages/celery/events/__init__.py, line 215
/usr/lib/python2.7/site-packages/celery/events/__init__.py, line 238
/usr/lib/python2.7/site-packages/celery/events/__init__.py, line 251
/usr/lib/python2.7/site-packages/celery/worker/consumer.py, line 512
/usr/lib/python2.7/site-packages/celery/bootsteps.py, line 123
/usr/lib/python2.7/site-packages/celery/worker/consumer.py, line 278
/usr/lib/python2.7/site-packages/celery/bootsteps.py, line 373
/usr/lib/python2.7/site-packages/celery/bootsteps.py, line 123
/usr/lib/python2.7/site-packages/celery/worker/__init__.py, line 206
/usr/lib/python2.7/site-packages/celery/bin/worker.py, line 212
/usr/lib/python2.7/site-packages/celery/bin/base.py, line 269
/usr/lib/python2.7/site-packages/celery/bin/worker.py, line 179
/usr/lib/python2.7/site-packages/celery/bin/celery.py, line 693
/usr/lib/python2.7/site-packages/celery/bin/celery.py, line 761
/usr/lib/python2.7/site-packages/celery/bin/base.py, line 306
/usr/lib/python2.7/site-packages/celery/bin/celery.py, line 769
/usr/lib/python2.7/site-packages/celery/bin/celery.py, line 81
/usr/lib/python2.7/site-packages/celery/__main__.py, line 30

(Once, "/usr/lib/python2.7/site-packages/kombu/messaging.py, line 150" was on top instead of "messaging.py, line 168".)
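For convenience, a minimal monitoring sketch (my own helper, not part of the original reproducer) that samples %CPU and %MEM of the celery processes over time, so the growth in step 3 is easy to spot:

    import subprocess
    import time

    # Print a timestamped %CPU/%MEM sample for every celery process
    # every 10 seconds.
    while True:
        stamp = time.strftime('%H:%M:%S')
        for line in subprocess.check_output(['ps', 'aux']).splitlines():
            if 'celer' in line:
                cols = line.split()
                # ps aux columns: USER PID %CPU %MEM ...
                print('%s pid=%s cpu=%s%% mem=%s%%' % (stamp, cols[1], cols[2], cols[3]))
        time.sleep(10)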
Raising the sat-6.1.z? flag, since I first hit the problem at such a late stage that my Satellite was swapping, with system load > 10, and totally unresponsive.
Brian, what do you think?
I've dug into this some today and I was able to reproduce it on a vanilla Pulp 2.6.4 system. I didn't even need to register consumers; I only restarted qpidd a lot using the command from comment 1. After maybe 10 minutes, my celery processes were at 80-100% CPU. That continued even after I stopped restarting qpidd, so once the processes are "dirty" they stay that way, even hours later.

I do observe the following in my logs:

  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 795, in close
    self.sync(timeout=timeout)
  File "<string>", line 6, in sync
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 786, in sync
    if not self._ewait(lambda: not self.outgoing and not self.acked, timeout=timeout):
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 596, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 234, in _ewait
    self.check_error()
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 226, in check_error
    self.close()
  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 345, in close
    ssn.close(timeout=timeout)
  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 795, in close
    self.sync(timeout=timeout)
  File "<string>", line 6, in sync
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 786, in sync
    if not self._ewait(lambda: not self.outgoing and not self.acked, timeout=timeout):
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 596, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 234, in _ewait
    self.check_error()
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 226, in check_error
    self.close()
  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 345, in close
    ssn.close(timeout=timeout)
  File "<string>", line 6, in close
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 795, in close
    self.sync(timeout=timeout)
  File "<string>", line 6, in sync
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 786, in sync
    if not self._ewait(lambda: not self.outgoing and not self.acked, timeout=timeout):
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 596, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
RuntimeError: maximum recursion depth exceeded

The above traceback is produced by all affected processes regularly until they are halted. I believe qpid.messaging has an intermittent disconnect problem: if the server side closes unexpectedly, the client qpid.messaging threads remain stuck in an endless _ewait() loop. I'm going to talk with the qpid team about troubleshooting steps tomorrow. I'll post back here with what I learn.
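To make the failure mode concrete, here is a minimal sketch (my own simplification, not the real qpid.messaging code) of the call cycle visible in that traceback: once the connection is in an error state, _ewait() calls check_error(), which closes, which syncs, which waits again, and the stack never unwinds:

    class FakeConnection(object):
        """Mimics the shape of qpid.messaging's endpoint error handling."""
        def __init__(self):
            self.error = 'connection aborted'   # broker side went away

        def _ewait(self, predicate):
            self.check_error()                  # re-enters close() instead of raising once
            return predicate()

        def check_error(self):
            if self.error:
                self.close()                    # tear down before reporting

        def close(self):
            self.sync()                         # drain outstanding work first ...

        def sync(self):
            self._ewait(lambda: True)           # ... by waiting on the dead connection

    FakeConnection().close()  # RuntimeError: maximum recursion depth exceeded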
Pavel, does this look like https://issues.apache.org/jira/browse/QPID-6297 ? It seems like the same sort of error, and I see you raised that issue upstream also. From https://brewweb.devel.redhat.com/buildinfo?buildID=425731 it looks like the client side of the fix did not make it into the build you were using here?
Not sure if this is helpful, but I was able to reproduce the issue described in comment 5 on Fedora 22 with these packages:

python-qpid-proton-0.9-3.fc22.x86_64
qpid-cpp-server-store-0.32-4.fc22.x86_64
qpid-cpp-server-0.32-4.fc22.x86_64
python-qpid-0.32-4.fc22.x86_64
qpid-proton-c-0.9-3.fc22.x86_64
qpid-cpp-client-0.32-4.fc22.x86_64
python-qpid-qmf-0.28-29.fc22.x86_64
qpid-tools-0.32-4.fc22.x86_64
qpid-qmf-0.28-29.fc22.x86_64
python-qpid-common-0.32-4.fc22.x86_64
(In reply to Brian Bouterse from comment #7)
> Not sure if this is helpful, but I was able to reproduce the issue described
> in comment 5 on Fedora 22 with these packages:
[...]
> python-qpid-0.32-4.fc22.x86_64

I can't find exactly what's in that package through koji, but 0.32 was released in March (i.e. before the fix for QPID-6297) and the latest Fedora package (also based on 0.32) doesn't seem to contain the bug fix: http://koji.fedoraproject.org/koji/buildinfo?buildID=687194.
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
(In reply to Gordon Sim from comment #6)
> Pavel, does this look like https://issues.apache.org/jira/browse/QPID-6297
> ? It seems like the same sort of error, and I see you raised that issue
> upstream also. From https://brewweb.devel.redhat.com/buildinfo?buildID=425731
> it looks like the client side of the fix did not make it into the build you
> were using here?

Backporting the https://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/messaging/driver.py?r1=1684716&r2=1684715&pathrev=1684716 , I can still reproduce the same high CPU usage.
(In reply to Pavel Moravec from comment #11)
> Backporting the
> https://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/messaging/driver.
> py?r1=1684716&r2=1684715&pathrev=1684716 , I can still reproduce the same
> high CPU usage.

What version of qpidd was that using?
I was also able to reproduce this outside of satellite (just using Pulp+celery+qpid). My packages are listed here in comment 7.
(In reply to Brian Bouterse from comment #13)
> I was also able to reproduce this outside of satellite (just using
> Pulp+celery+qpid). My packages are listed here in comment 7.

I don't think those packages include either the client side or the server side part of the fix, as it was committed upstream after 0.32 was released.
I can easily reproduce this if you want me to test a specific version. Having some rpms to install for F22 would make the testing even easier, but another way would work too. Let me know how I can help.
(In reply to Brian Bouterse from comment #15)
> I can easily reproduce this if you want me to test a specific version.
> Having some rpms to install for F22 would make the testing even easier, but
> another way would work too. Let me know how I can help.

Unfortunately I don't think there are rpms at present that would fix this. Until there are, it would require either patching the source for the current rpm and rebuilding, or just using the trunk upstream.
I can use the upstream trunk, no problem. Which specific python client package do you want me to test? Is it python-qpid? What about these other packages below? Can I mix a trunk-based python-qpid with my existing versions from comment 7?

python-qpid-qmf
qpid-tools
qpid-qmf
python-qpid-common
(In reply to Brian Bouterse from comment #17)
> I can use the upstream trunk, no problem. Which specific python client
> package do you want me to test? Is it python-qpid?

Yes, though I think you will also need an up to date broker (the fix, which I was not directly involved with, appears to have two parts, one in the client and one in the broker).

> What about these other
> packages below? Can I mix a trunk-based python-qpid with my existing
> versions from comment 7?
>
> python-qpid-qmf
> qpid-tools
> qpid-qmf

Yes, no need to update any of the above.

> python-qpid-common

From koji, this appears to be essentially the old 0-9/0-10 client, so I don't think you need to worry about that either.

On the client, make sure the place you load qpid.messaging from is the place with the fix, and make sure the qpidd running also has the fix.
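A quick way to verify which copy the interpreter actually picks up (plain Python, no assumptions about qpid internals):

    import qpid.messaging
    print(qpid.messaging.__file__)   # should point into the patched tree, not the system rpm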
(In reply to Gordon Sim from comment #12)
> (In reply to Pavel Moravec from comment #11)
> > Backporting the
> > https://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/messaging/driver.
> > py?r1=1684716&r2=1684715&pathrev=1684716 , I can still reproduce the same
> > high CPU usage.
>
> What version of qpidd was that using?

qpid-cpp-server-0.30-9.el7.x86_64, where I expect the broker part of the fix is not applied. I can try building upstream qpid on my Satellite with the post-0.32 fix, to see if it prevents the celery high CPU usage etc.

Anyway, it sounds like there is some issue on the client side (or in pulp/celery?) even on trunk. Assume a python-qpid client from trunk is used against qpid-cpp-server 0.30 - it would exhibit the same high CPU usage and memory leak. I.e. the client (or pulp/celery) is not robust enough against a misbehaving broker it communicates with.

Anyway, I will (try to) build upstream qpid on Satellite and test it.
(In reply to Pavel Moravec from comment #19)
> Anyway, I will (try to) build upstream qpid on Satellite and test it.

It was quite tricky due to linearstore incompatibility between 0.30 and upstream, but:

- the problem is reproducible on 0.30 when I disable the linearstore.so plugin
- the problem is reproducible on qpid trunk (r1711138) with linearstore disabled and with python-qpid-0.30-6 patched by r1684716 (the client part of the patch, in driver.py)

So just backporting QPID-6297 (both client & broker side) won't help.. :(
Ok, thanks for checking and sorry for the wild goose chase!
Adding some debugs, the busy loop is due to this cycle in celery/events/__init__.py:

    def flush(self):
        """Flushes the outbound buffer."""
        while self._outbound_buffer:
            try:
                type, fields, _ = self._outbound_buffer.popleft()
            except IndexError:
                return
            self.send(type, **fields)

That while loop is executed forever, calling channel.basic_publish, which calls _put for the address 'celeryev/worker.offline; {assert: always, node: {type: topic}}'. In logs with my debugs, I see:

Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 systemd: Stopping An AMQP message broker daemon....
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.online; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: pulp.server.async.scheduler:ERROR: connection aborted
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.transport.qpid:ERROR: PavelM: new sender: 'celeryev/worker.offline; {assert: always, node: {type: topic}}'
Oct 29 10:15:09 pmoravec-sat61-rhel7 pulp: kombu.messaging:ERROR: PavelM: channel.basic_publish
..

The self._outbound_buffer never gets empty; some other debugs show the loop never finishes. So this sounds like a bug in celery, not in python-qpid or the qpid broker.

Brian, could you confirm that by adding debugs like mine?

[root@pmoravec-sat61-rhel7 ~]# grep -rn PavelM /usr/lib/python2.7/site-packages/
/usr/lib/python2.7/site-packages/celery/events/__init__.py:249: logger.error("PavelM: flush:pred cyklem: self._outbound_buffer.size()=%s", len(self._outbound_buffer))
/usr/lib/python2.7/site-packages/celery/events/__init__.py:251:# logger.error("PavelM: flush: self._outbound_buffer.size()=%s", len(self._outbound_buffer))
/usr/lib/python2.7/site-packages/celery/events/__init__.py:257: logger.error("PavelM: flush:za cyklem")
/usr/lib/python2.7/site-packages/kombu/messaging.py:184: logger.error("PavelM: channel.basic_publish")
/usr/lib/python2.7/site-packages/kombu/messaging.py:206: logger.error("PavelM: revive: is_connection(channel)=%s, isinstance(channel, ChannelPromise)=%s" %(is_connection(channel), isinstance(channel, ChannelPromise)))
/usr/lib/python2.7/site-packages/kombu/transport/qpid.py:473: logger.error("PavelM: new sender: '%s'" % address)
[root@pmoravec-sat61-rhel7 ~]#
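For illustration, a minimal standalone sketch of the suspected feedback loop (my own simplification; it assumes send() puts the event straight back on the buffer whenever publishing fails, which is what the repeated worker.offline publishes in the log above suggest):

    from collections import deque

    # One pending event, mirroring celery's (type, fields, _) buffer entries.
    outbound_buffer = deque([('worker-offline', {'hostname': 'sat61'}, None)])

    def publish(event_type, fields):
        # Stand-in for channel.basic_publish; the broker was just restarted.
        raise IOError('connection aborted')

    def send(event_type, **fields):
        try:
            publish(event_type, fields)
        except Exception:
            # Assumed behaviour: a failed publish re-buffers the event.
            outbound_buffer.append((event_type, fields, None))

    def flush():
        while outbound_buffer:                  # never becomes empty
            event_type, fields, _ = outbound_buffer.popleft()
            send(event_type, **fields)          # fails, re-buffers, repeat

    # flush()  # uncommenting this spins forever, like the busy celery workers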
@pmoravec, on IRC you mentioned that there is a race condition which can cause memory usage to jump significantly even with a single restart of qpidd. What is the race condition exactly? Also, the reproducer is of a slowly growing memory leak which takes hundreds of restarts to occur. How do you know the slow memory leak is the same/related to the race condition?
(In reply to Brian Bouterse from comment #24)
> @pmoravec, on IRC you mentioned that there is a race condition which can
> cause memory usage to jump significantly even with a single restart of
> qpidd. What is the race condition exactly?

That is something I don't know. I noticed a customer did a single qpidd restart, and since that time several celery processes started to consume more memory and CPU. Reproducing this, I had to restart qpidd many times to hit it - apparently there is *some* race condition where the qpidd restart must happen at some specific time, when a celery process is in some particular state / executing some particular command, in order to trigger this error. By "race condition" I meant that the reproducer is not 100% deterministic on each broker restart; there is just some small probability of triggering the bug.

> Also, the reproducer is of a slowly growing memory leak which takes hundreds
> of restarts to occur. How do you know the slow memory leak is the
> same/related to the race condition?

Once a qpidd restart triggers "something bad" in a celery process, the celery process starts to consume memory and CPU regardless of any further restarts of qpidd - it is idempotent to such further restarts.

Simply, if you are "lucky" enough, a single qpidd broker restart triggers a celery process to start consuming more and more CPU and memory, without any further intervention at all. If you are not lucky, restart qpidd again in an attempt to trigger the flaw once more. And so on.
Confirmed: the simplified reproducer of just restarting qpidd, without anything else, *can* trigger this bug on 6.1.6.
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
(In reply to Pavel Moravec from comment #26)
> Confirmed: the simplified reproducer of just restarting qpidd, without
> anything else, *can* trigger this bug on 6.1.6.

Now I have Sat 6.1.9 where *every* qpidd restart triggers the high CPU usage :(
In my testing, this issue is triggered through a large number of restarts. It's important to track, but I don't think it's one a customer will experience. With that in mind I think the prio and sev are wrong; I'm lowering it to low priority and medium severity. If this isn't right, please adjust accordingly.

Also, as an aside, upstream is starting to upgrade to Celery 4 instead of 3.1, so we are going to retest against Celery 4 as an approach to resolving this bug.
Thank you for your interest in Satellite 6. We have evaluated this request, and we do not expect this to be implemented in the product in the foreseeable future. We are therefore closing this out as WONTFIX. If you have any concerns about this, please feel free to contact Rich Jerrido or Bryan Kearney. Thank you.
The Pulp upstream bug status is at CLOSED - WONTFIX. Updating the external tracker on this bug.