Description of problem:
When qdrouterd is not accessible, the goferd process has a slow memory leak that eventually leads to a crash.

Version-Release number of selected component (if applicable):
python-pulp-common-2.6.0.20-1.el6.noarch
python-pulp-rpm-common-2.6.0.20-1.el6.noarch
pulp-rpm-handlers-2.6.0.20-1.el6.noarch
python-isodate-0.5.0-4.pulp.el6_6sat.noarch
python-pulp-agent-lib-2.6.0.20-1.el6.noarch

How reproducible:
Always, by shutting down qdrouterd on the Satellite server.

Steps to Reproduce:

[root@sputnik ~]# systemctl stop qdrouterd      <-- on Satellite server
[root@rhel6local ~]# while true; do ps aux | grep python | grep gofer; sleep 5; done
root  1518  0.2 11.3 1197740 116100 ?  Ssl 13:44  0:41 python /usr/bin/goferd
root  1518  0.2 11.3 1197740 116104 ?  Ssl 13:44  0:41 python /usr/bin/goferd
root  1518  0.2 11.3 1197740 116128 ?  Ssl 13:44  0:41 python /usr/bin/goferd
root  1518  0.2 11.3 1197740 116136 ?  Ssl 13:44  0:41 python /usr/bin/goferd
root  1518  0.2 11.3 1197740 116144 ?  Ssl 13:44  0:41 python /usr/bin/goferd
root  1518  0.2 11.3 1197740 116152 ?  Ssl 13:44  0:41 python /usr/bin/goferd
(many near-identical samples elided; the RSS column climbs slowly)

-----
[root@sputnik ~]# systemctl start qdrouterd     <-- on Satellite server
root  1518  0.2 11.3 1197740 116156 ?  Ssl 13:44  0:41 python /usr/bin/goferd
root  1518  0.2 11.3 1197740 116196 ?  Ssl 13:44  0:42 python /usr/bin/goferd
(again, many near-identical samples elided)

So as you can see, the RSS column grows over time when goferd can't contact the Satellite server properly.

24 hours later, the daemon has actually given some memory back:
root  1518  0.2 11.3 1197740 116188 ?  Ssl Mar14  4:59 python /usr/bin/goferd

Actual results:
RSS of the goferd process grows over time until it crashes.

Expected results:
goferd should not leak memory.

Additional info:
Trivially reproducible:
- have a Content Host registered to a Capsule
- on the Satellite, stop the qdrouterd service
- monitor memory usage of goferd on the Content Host

Stopping qdrouterd on the Capsule instead of the Satellite does _not_ cause the leak. So this appears to be a bug in the qpid-proton library, since only the link fails, not the (TCP+AMQP) connection.
Trivial reproducer pointing at the bug in proton:
- have qdrouterd with link routing elsewhere (i.e. to qpidd), with the target service _stopped_
- run the script below; using the proton library, it just creates a connection to qdrouterd and tries to create a receiver on an address link-routed to the stopped qpidd:

#!/usr/bin/python
from time import sleep
from uuid import uuid4

from proton import ConnectionException
from proton.utils import BlockingConnection

while True:
    sleep(0.1)
    conn = None
    try:
        conn = BlockingConnection("proton+amqp://qdrouterd.example.com:5648",
                                  ssl_domain=None, heartbeat=2)
        print "connection"
        rec = conn.create_receiver("pulp.1318015", name=str(uuid4()),
                                   dynamic=False, options=None)
    except Exception:
        if conn:
            conn.close()

- monitor the script's memory usage growth
- wait until it even terminates after a (longer) while :)

Assigning to kgiusti as I know he fixed past proton bugs (feel free to change it).
Forgot to mention, reproduced on:
python-qpid-proton-0.9-16.el7.x86_64
qpid-proton-c-0.9-16.el7.x86_64
Testing with package versions:
python-qpid-proton-0.9-17.el7.x86_64
qpid-proton-c-0.9-17.el7.x86_64

Tested both #c4 and #c5; the leak seems not to be present. (Technically, goferd memory usage per #c4 was still growing very slowly over time; I wasn't able to reproduce that outside Satellite, and the growth was really tiny, about 4 kB RSS per minute.)
Please add verification steps for this bug to help QE verify.
Published EL7 X86-64 hotfix packages at http://people.redhat.com/~lzapleta/hotfix/qpid-memleak-1318015/el7/
The fix has a regression worse than the original problem :( Starting goferd with qpid-proton-0.9-18, the process consumes 100% CPU within a minute. It spins in a busy loop with this backtrace:

Thread 13
 f=f@entry=Frame 0x7fa3401b4080, for file /usr/lib64/python2.7/site-packages/proton/reactor.py, line 138, in _check_errors
 f=f@entry=Frame 0x7fa3401b18d0, for file /usr/lib64/python2.7/site-packages/proton/reactor.py, line 146, in process
 f=f@entry=Frame 0x7fa3401ecfc0, for file /usr/lib64/python2.7/site-packages/proton/utils.py, line 251, in run
 f=f@entry=Frame 0x7fa3401ff370, for file /usr/lib64/python2.7/site-packages/proton/utils.py, line 241, in close
 f=f@entry=Frame 0x7fa3401f2f60, for file /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py, line 152, in close
 f=f@entry=Frame 0x7fa340048260, for file /usr/lib/python2.7/site-packages/gofer/common.py, line 272, in _fn
 f=f@entry=Frame 0x1f12b70, for file /usr/lib/python2.7/site-packages/gofer/agent/plugin.py, line 316, in attach
 f=f@entry=Frame 0x7fa340003c50, for file /usr/lib/python2.7/site-packages/gofer/common.py, line 228, in sfn

goferd invokes connection.close(), and the proton reactor infinitely executes the cycle in utils.py line 251:

    while self.container.process(): pass

because the result = pn_reactor_process(self._impl) call in reactor.py line 146 returns "true" every time, making the above loop infinite. I will try to reproduce it outside goferd via a standalone python script now.
Reproducer outside Satellite causing the high CPU on 0.9-18: try to close an amqps connection. When not using SSL, everything works well. When using SSL, any attempt to close an opened BlockingConnection triggers the high CPU. Particular script:

#!/usr/bin/python
from proton import SSLDomain
from proton.utils import BlockingConnection

ca_certificate = '/etc/rhsm/ca/katello-server-ca.pem'
client_certificate = '/etc/pki/consumer/bundle.pem'
client_key = None

domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials(client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

SSL = True  # set to False to check the non-SSL version that works well
connectionURL = "amqps://pmoravec-caps62-rhel7.gsslab.brq.redhat.com:5647" if SSL \
    else "proton+amqp://pmoravec-caps62-rhel7.gsslab.brq.redhat.com:5648"

qmf_conn = BlockingConnection(connectionURL,
                              ssl_domain=domain if SSL else None,
                              heartbeat=None)
qmf_conn.close()
I think this fix prevents the problem in upstream:

https://github.com/apache/qpid-proton/commit/92b822c1077f08e68934a630ddef1c17bce24171#diff-1263a41bd7a5218b033cc33321977fb9R411

PROTON-1243: Stop pn_reactor_stop from causing an infinite loop when called from within a callback

(esp. the "reactor->stop" check is the crucial part)
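The idea behind that fix is that pn_reactor_stop() only records a stop request, and pn_reactor_process() consults the flag, so a stop requested from inside a callback ends the loop instead of re-entering the reactor. A hedged Python sketch of the pattern (names mirror the C API loosely but are illustrative, not the real proton code):

```python
# Sketch of the PROTON-1243 "reactor->stop" flag pattern.
class Reactor:
    def __init__(self, pending):
        self.stop_requested = False
        self.pending = pending   # stand-in for queued reactor work

    def stop(self):
        # The fix: just record the request; do not re-enter process().
        self.stop_requested = True

    def process(self):
        """Handle one unit of work; returning False ends the
        `while container.process(): pass` loop from proton/utils.py."""
        if self.stop_requested or self.pending == 0:
            return False
        self.pending -= 1
        if self.pending == 2:    # simulate a callback calling stop() mid-loop
            self.stop()
        return True


def run(reactor):
    iterations = 0
    while reactor.process():     # the loop that previously never terminated
        iterations += 1
    return iterations
```

Without the flag, a stop issued from within a callback could push the reactor back into a state where process() keeps returning true, which is the busy loop seen in the backtrace above.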
I regret that I am unable to reproduce using qpid-proton-0.9-18 with the source in comment 23. (Just changing certificates and host:port.) My client was on RHEL7. I have tried against a "broker" using Proton's SSL io. Also against qpidd with its (separate) SSL implementation. I have also tried against an address with no listener. In all cases the reproducer ran to completion without hanging or busy looping. The socket resources (which were previously leaking) were closed. Do I need a dispatch router in the middle? If so, what version and what relevant SSL config for it? Thanks.
(In reply to Cliff Jansen from comment #34) > I regret that I am unable to reproduce using qpid-proton-0.9-18 with the > source in comment 23. (Just changing certificates and host:port.) My > client was on RHEL7. > > I have tried against a "broker" using Proton's SSL io. Also against qpidd > with its (separate) SSL implementation. I have also tried against an > address with no listener. > > In all cases the reproducer ran to completion without hanging or busy > looping. The socket resources (which were previously leaking) were closed. > > Do I need a dispatch router in the middle? If so, what version and what > relevant SSL config for it? > > > Thanks. I managed to reproduce it against qpidd (0.34 but version shall not matter, I hope), though it takes a bit longer than against qdrouterd to see the high CPU. But the busy loop is the same.
I have an UNCONFIRMED WORKAROUND. Minimize the time when qdrouterd is turned off; it is turned off when doing offline backups, which can make things worse. Change the backup script not to turn it off; this was patched for 6.3 already:

https://github.com/Katello/katello-packaging/pull/398/files

Until the leak is fixed, I also recommend restarting the goferd client on all nodes every day via cron or systemd.
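The daily restart can be automated on the clients; a hedged sketch using a systemd timer (unit names, path, and schedule are illustrative, not from the product):

```ini
# /etc/systemd/system/goferd-restart.service  (illustrative)
[Unit]
Description=Daily goferd restart to reclaim leaked memory

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl restart goferd

# /etc/systemd/system/goferd-restart.timer  (illustrative)
[Unit]
Description=Run goferd-restart daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now goferd-restart.timer`; on RHEL 6 clients (no systemd) an equivalent /etc/cron.d entry would serve the same purpose.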
Created attachment 1275146 [details]
Patch fixing 100% CPU in 0.9-18

Work-in-progress patch that fixes the high CPU when closing a BlockingConnection. Basically it requires the PROTON-1243 backport plus proton-c/src/reactor/reactor.c from upstream (with some small change).

Now, working with proton connections seems to work well (no memory leak, no high CPU) in all situations: when the server accepts the connection, rejects it on the AMQP level, or rejects it on the TCP level.

But goferd still has a memory leak using that package / library, because there is a common memory leak when opening a consumer or producer that hits an error. I.e. the original reproducer is still valid; a smaller reproducer is trivial:

qmf_conn = None
qmf_rec = None
qmf_snd = None
while True:
    sleep(0.1)
    try:
        qmf_conn = BlockingConnection("proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648",
                                      ssl_domain=None, heartbeat=None)
        # qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True,
        #                                    options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
        qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
    except Exception, e:
        print e
    finally:
        try:
            if qmf_rec:
                qmf_rec.close()
                qmf_rec = None
            if qmf_snd:
                qmf_snd.close()
                qmf_snd = None
            if qmf_conn:
                qmf_conn.close()
                qmf_conn = None
        except Exception, e2:
            print "nested: %s" % e2

Gladly, the memory leak is now approx. half of its former magnitude (i.e. the code change from 0.9-16 to 0.9-18 did help). Just something is still leaking when using consumers or producers.
I am giving up here. Below is a patch over 0.9-18 that fixes the spinning CPU, but leaves 1/2 of the original memory leak unfixed. The reproducer for the leak is VERY trivial: just calling BlockingConnection against a closed port :-/ (in a cycle, of course).

Attempts to find the remaining memory leak resulted in:
- the discovery that pn_incref / pn_decref is apparently called on all expected objects properly (no evident frequent pn_incref calls without pn_decref)
- the suspicion that the flaw is in reactor.c and not in the .py part, since 0.9-16 does _not_ have that leak and the change in .py from 0.9-16 to 0.9-18-pmoravec just cleans some memory
- raising https://issues.jboss.org/browse/ENTMQCL-504 (imho upstream is affected as well)

Here is the patch over 0.9-18 that fixes the spinning CPU:

diff -rup 0.18/qpid-proton-0.9/proton-c/bindings/python/proton/utils.py 0.18_new/qpid-proton-0.9/proton-c/bindings/python/proton/utils.py
--- 0.18/qpid-proton-0.9/proton-c/bindings/python/proton/utils.py	2017-04-28 18:51:55.789573553 +0200
+++ 0.18_new/qpid-proton-0.9/proton-c/bindings/python/proton/utils.py	2017-05-02 14:31:25.248310418 +0200
@@ -249,6 +249,8 @@ class BlockingConnection(Handler):
     def run(self):
         """ Hand control over to the event loop (e.g. if waiting indefinitely for incoming messages) """
         while self.container.process(): pass
+        self.container.stop()
+        self.container.process()
 
     def wait(self, condition, timeout=False, msg=None):
         """Call process until condition() is true"""
diff -rup 0.18/qpid-proton-0.9/proton-c/src/reactor/reactor.c 0.18_new/qpid-proton-0.9/proton-c/src/reactor/reactor.c
--- 0.18/qpid-proton-0.9/proton-c/src/reactor/reactor.c	2017-04-28 18:51:55.842573250 +0200
+++ 0.18_new/qpid-proton-0.9/proton-c/src/reactor/reactor.c	2017-05-02 14:40:50.502008406 +0200
@@ -28,6 +28,7 @@
 #include <proton/session.h>
 #include <proton/link.h>
 #include <proton/delivery.h>
+
 #include <stdio.h>
 #include <stdlib.h>
 #include <assert.h>
@@ -51,6 +52,7 @@ struct pn_reactor_t {
   int selectables;
   int timeout;
   bool yield;
+  bool stop;
 };
 
 pn_timestamp_t pn_reactor_mark(pn_reactor_t *reactor) {
@@ -79,6 +81,7 @@ static void pn_reactor_initialize(pn_rea
   reactor->selectables = 0;
   reactor->timeout = 0;
   reactor->yield = false;
+  reactor->stop = false;
   pn_reactor_mark(reactor);
 }
@@ -404,21 +407,21 @@ bool pn_reactor_process(pn_reactor_t *re
       previous = reactor->previous = type;
       pn_decref(event);
       pn_collector_pop(reactor->collector);
+    } else if (!reactor->stop && pni_reactor_more(reactor)) {
+      if (previous != PN_REACTOR_QUIESCED && reactor->previous != PN_REACTOR_FINAL) {
+        pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_QUIESCED);
+      } else {
+        return true;
+      }
     } else {
-      if (pni_reactor_more(reactor)) {
-        if (previous != PN_REACTOR_QUIESCED && reactor->previous != PN_REACTOR_FINAL) {
-          pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_QUIESCED);
-        } else {
-          return true;
-        }
+      if (reactor->selectable) {
+        pn_selectable_terminate(reactor->selectable);
+        pn_reactor_update(reactor, reactor->selectable);
+        reactor->selectable = NULL;
       } else {
-        if (reactor->selectable) {
-          pn_selectable_terminate(reactor->selectable);
-          pn_reactor_update(reactor, reactor->selectable);
-          reactor->selectable = NULL;
-        } else {
-          return false;
-        }
+        if (reactor->previous != PN_REACTOR_FINAL)
+          pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_FINAL);
+        return false;
       }
     }
   }
@@ -478,14 +481,11 @@ void pn_reactor_start(pn_reactor_t *reac
   assert(reactor);
   pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_INIT);
   reactor->selectable = pni_timer_selectable(reactor);
- }
+}
 
 void pn_reactor_stop(pn_reactor_t *reactor) {
   assert(reactor);
-  pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_FINAL);
-  // XXX: should consider removing this fron stop to avoid reentrance
-  pn_reactor_process(reactor);
-  pn_collector_release(reactor->collector);
+  reactor->stop = true;
 }
 
 void pn_reactor_run(pn_reactor_t *reactor) {
@@ -493,5 +493,6 @@ void pn_reactor_run(pn_reactor_t *reacto
   pn_reactor_set_timeout(reactor, 3141);
   pn_reactor_start(reactor);
   while (pn_reactor_process(reactor)) {}
-  pn_reactor_stop(reactor);
+  pn_reactor_process(reactor);
+  pn_collector_release(reactor->collector);
 }
*** Bug 1447834 has been marked as a duplicate of this bug. ***
Possible fix upstream (further testing pending) see: https://issues.apache.org/jira/browse/PROTON-1394
Newer patch available. Fixes segfault in goferd if a BlockingLink is incompletely created due to an exception. https://issues.apache.org/jira/browse/PROTON-1394
(In reply to Cliff Jansen from comment #54)
> Newer patch available. Fixes segfault in goferd if a BlockingLink
> is incompletely created due to an exception.
>
> https://issues.apache.org/jira/browse/PROTON-1394

Added to the backported patch_1. No more segfault, but still memory accumulation:

Original behaviour on proton 0.9-16:
- reproducer per #c4 (goferd bounced on the AMQP level, probe delays decreased from the default 10 s to 0.1-1 s): RSS growth of 124 MB in 10 minutes
- reproducer from #c5 (standalone reproducer): RSS growth of 65 MB in 10 minutes

Behaviour on patched proton:
- reproducer per #c4 (goferd bounced on the AMQP level, probe delays decreased from the default 10 s to 0.1-1 s): RSS growth of 143 MB in 10 minutes
- reproducer from #c5 (standalone reproducer): steady RSS growth of 76 MB in 10 minutes

Even if I add "self.container = None" to the newest patch (which might have been unintentionally removed from the code by the patch), nothing changes. If I remove the patch for PROTON-1366, nothing changes either.

Reproducer:
- machine like 2 posts ago
- python goferd_mem_leak-reproducer-c5.py
or
- python /usr/bin/goferd --foreground
Sorry for the confusion: applying the three patches cumulatively, there is no further memory leak \o/

BUT the high CPU usage is still present. See #c22 and the reproducer in #c23 (the python code lines just differ a bit due to the patches applied). Reproducer on the same machine:

python goferd_mem_leak-reproducer-c23-CPU.py

(It is worth trying the patch from #c38 to fix the high CPU usage; the change is in C code and I don't have enough time to build the sources at the moment, but that patch worked well for that problem.)
Created attachment 1305891 [details]
perspective patch

Thanks to Cliff for the final patch that also fixes the CPU spinning (see #c23 or #c56). Now the cumulative patch has passed all my tests:
- goferd exhibits flat memory usage when failing to connect on the AMQP level
- the standalone reproducer does as well
- the same applies when failing to connect on the TCP level (qdrouterd down)
- no memory increase when installing/removing a package via goferd in a loop
- no high CPU when closing a connection (#c23)

I attached the cumulative patch I used:

cd /usr/lib64/python2.*/site-packages
cat /tmp/bz1318015.patch | patch -p4
service goferd restart

(Do not apply this at customers' sites; otherwise you get an unsupported system, though with the memory leak resolved.)
Created attachment 1343803 [details] python-qpid-proton-0.9-20.el6.x86_64.rpm
Created attachment 1343805 [details] python-qpid-proton-0.9-20.el7.x86_64.rpm
**** HOTFIX INSTRUCTIONS ****

I've attached a qpid-proton build that can be applied to clients as a fix. Download the appropriate rpm onto the client, run:

yum localinstall <rpm>

and restart the goferd service on the client.

To install this hotfix on multiple client hosts, you may need to create a custom Satellite repository and upload the hotfix RPM to it. After uploading, ensure the clients needing the hotfix have a subscription granting them access to this new custom repository, so you can update via your standard mechanism. If you have questions about how to accomplish this, you can find our documentation here:

https://access.redhat.com/documentation/en-us/red_hat_satellite/6.2/html/content_management_guide/importing_custom_content

or feel free to contact support.
The attached RPMs don't work. I get a dependency error:

# yum localinstall python-qpid-proton-0.9-20.el7.x86_64.rpm
...
--> Finished Dependency Resolution
Error: Package: python-qpid-proton-0.9-20.el7.x86_64 (/python-qpid-proton-0.9-20.el7.x86_64)
           Requires: qpid-proton-c(x86-64) = 0.9-20.el7
           Installed: qpid-proton-c-0.9-16.el7.x86_64 (@rhel-7-server-satellite-tools-6.2-rpms)
               qpid-proton-c(x86-64) = 0.9-16.el7
**********************************************************************
yum can be configured to try to resolve such errors by temporarily enabling
disabled repos and searching for missing dependencies.
To enable this functionality please set 'notify_only=0' in
/etc/yum/pluginconf.d/search-disabled-repos.conf
**********************************************************************
 You could try using --skip-broken to work around the problem

Looks like we need the qpid-proton-c-0.9-20 packages as well. Same problem on both EL6 and EL7.
Created attachment 1343988 [details] qpid-proton-c-0.9-20.el6.x86_64.rpm
Created attachment 1343989 [details] qpid-proton-c-0.9-20.el7.x86_64.rpm
Matt, You are correct, the package was missing the dependency. I've attached the updated qpid-proton-c for both el6 + 7 to this BZ. Let me know if you hit any other issues installing the package. -John
Thank you for the hard work on this.... Would this patch be available for RHEL 6 clients as well? -Patrick
Patrick, Both RHEL6 + 7 packages are attached to this BZ.
OMG, I just saw that. Apologies....is there an ETA as to when it will be put in the repos?
Jeff, see the previous internal comment for the reproducer machine; in general it is enough to stop qpidd on the Satellite to trigger the memory leak in goferd. Please investigate there, I am starting to run out of ideas (and time).

Some facts from other experiments:
- in goferd, decreasing DELAY in /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py to 0.1 second makes RSS usage (almost) flat
- decreasing it to 1 second halves the memory leak (halves the increase of RSS over time), compared to the default 10 seconds
- comparing default goferd with 0.9-20 proton against 0.9-16 proton does show an improvement in memory growth (70 kB increase per 10 s with 0.9-16 versus 55 kB increase per 10 s with 0.9-20)
- standalone reproducers don't show a memory leak, regardless of the delay used there
- I am trying to improve the standalone reproducers; no leak so far

I am starting to speculate that the problem is outside the proton libraries, or in the way goferd uses them (haven't tried the reliable(fn) trick, e.g.).

(Why did I think the memory leak in goferd was gone? Because I ran only the standalone reproducers and goferd with a 0.1 s DELAY, to make the reproducer faster.)
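When comparing leak rates across builds (like the 70 kB vs 55 kB per 10 s above), it helps to reduce a series of RSS samples to a single growth rate. A minimal sketch (plain least-squares slope, no external dependencies; the sample data format is an assumption):

```python
def growth_rate(samples):
    """Least-squares slope of (t_seconds, rss_kb) samples, in kB/s.

    A positive, steady slope across many samples suggests a leak;
    a slope near zero suggests flat memory usage.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den
```

For example, samples taken every 10 seconds from /proc/&lt;pid&gt;/status could be fed in as `[(0, 116100), (10, 116170), ...]` to compare two proton builds on equal terms.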
Looking into it.
Created attachment 1357300 [details]
proposed patch for proton 0.9

The 2017-11-13 leak reproducer seems to be timing dependent. It never (or very rarely) occurred in my local environment; it occurred about 50% of the time on the internal reproducer machine.

This backport from Proton 0.16 seems to do the trick when a LinkDetached exception ('qd:no-route-to-dest') prompts an application to abruptly close a connection.

Normally I am wary of transplanting code between such distant versions because the underlying reactor has changed so much. In this case, the breaking of the circular reference happens around unchanged code, so it seems pretty safe to me.

Many thanks to those who crafted the reproducer and made the target machine available to me.
I derived the new memory leak from previous (internal) comment to https://bugzilla.redhat.com/show_bug.cgi?id=1518008 That one should be independent on this one, until we identify it is a regression in qpid-proton.
(In reply to Pavel Moravec from comment #97)
> I derived the new memory leak from previous (internal) comment to
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1518008
> 
> That one should be independent on this one, until we identify it is a
> regression in qpid-proton.

The same leak (verified against the tiny reproducer per [1]) is also present in 0.9-16. Therefore the two leaks are independent, and the new one is not a regression introduced by the code fix for this one.

So from my side:
- qpid-proton 0.9-21 fully fixes this memory leak
- no regression is known to me
- bz1518008 is a fully independent memory leak that has been present for a longer time
- I would not object to sharing 0.9-21 as an interim fix for this memory leak, while I suggest fixing bz1518008 in the next z-stream as well (to have one official z-stream release fixing both leaks)

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1518008#c3
Hello all, We've made a KCS article to address the needed changes for this Bugzilla. You can reach it here: https://access.redhat.com/solutions/3254571
Verified with Sat 6.2.14 (qpid-proton-c-0.9-21) on both RHEL 6 and 7. There is a small increase in the RSS column on clients after stopping qdrouterd on the Satellite, which stabilizes afterwards (60 kB and 24 kB respectively). No CPU spinning experienced.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0272