Bug 1318015 - goferd continues to leak memory when qdrouterd is not accessible. Was supposedly fixed as per bz 1260963
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: katello-agent
Version: 6.1.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent (2 votes)
Target Milestone: Unspecified
Assignee: Cliff Jansen
QA Contact: Peter Ondrejka
URL:
Whiteboard:
Duplicates: 1447834
Depends On:
Blocks: 1353215 1399395 CEE_Sat6_Top_BZs, GSS_Sat6_Top_Bugs 1426380 1482208
 
Reported: 2016-03-15 18:40 UTC by Marc Richter
Modified: 2023-03-24 13:40 UTC
CC: 54 users

Fixed In Version: qpid-proton-0.9-21
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1426380
Environment:
Last Closed: 2018-02-05 14:11:55 UTC
Target Upstream Version:
Embargoed:


Attachments
- Patch fixing 100% CPU in 0.9-18 (6.81 KB, patch), 2017-04-29 13:50 UTC, Pavel Moravec
- perspective patch (11.96 KB, patch), 2017-07-28 10:37 UTC, Pavel Moravec
- python-qpid-proton-0.9-20.el6.x86_64.rpm (192.21 KB, application/x-rpm), 2017-10-26 14:43 UTC, John Mitsch
- python-qpid-proton-0.9-20.el7.x86_64.rpm (200.21 KB, application/x-rpm), 2017-10-26 14:44 UTC, John Mitsch
- qpid-proton-c-0.9-20.el6.x86_64.rpm (125.15 KB, application/x-rpm), 2017-10-26 21:13 UTC, John Mitsch
- qpid-proton-c-0.9-20.el7.x86_64.rpm (134.43 KB, application/x-rpm), 2017-10-26 21:14 UTC, John Mitsch
- proposed patch for proton 0.9 (792 bytes, patch), 2017-11-22 09:01 UTC, Cliff Jansen


Links
- Apache JIRA PROTON-1312, last updated 2016-12-09 22:00:13 UTC
- Red Hat Issue Tracker ENTMQCL-457 (Major, Closed): [python] BlockingConnection leaks Proton-C memory, last updated 2020-09-22 21:39:40 UTC
- Red Hat Knowledge Base (Solution) 3198642, last updated 2017-09-28 06:29:36 UTC
- Red Hat Knowledge Base (Solution) 3254571, last updated 2017-11-30 22:03:10 UTC
- Red Hat Product Errata RHBA-2018:0272 (SHIPPED_LIVE): Satellite Tools 6.2.14 bug fix update, last updated 2018-02-06 05:01:15 UTC

Description Marc Richter 2016-03-15 18:40:59 UTC
Description of problem: When qdrouterd is not accessible, the goferd process has a slow memory leak that eventually leads to a crash


Version-Release number of selected component (if applicable):
python-pulp-common-2.6.0.20-1.el6.noarch
python-pulp-rpm-common-2.6.0.20-1.el6.noarch
pulp-rpm-handlers-2.6.0.20-1.el6.noarch
python-isodate-0.5.0-4.pulp.el6_6sat.noarch
python-pulp-agent-lib-2.6.0.20-1.el6.noarch


How reproducible: shut down qdrouterd on the Satellite server

Steps to Reproduce:
[root@sputnik ~]# systemctl stop qdrouterd  <-- on Sat server


[root@rhel6local ~]# while true; do ps aux | grep python | grep gofer; sleep 5; done
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116100 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116104 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116104 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116128 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116128 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116132 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116132 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116132 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116132 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116132 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116132 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116132 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116136 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116144 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116144 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116144 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116144 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116144 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116144 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116148 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116148 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116148 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116148 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116148 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116148 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116148 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116152 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116152 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116152 ?      Ssl  13:44   0:41 python /usr/bin/goferd

----- [root@sputnik ~]# systemctl start qdrouterd <-- on sat server

root      1518  0.2 11.3 1197740 116152 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116156 ?      Ssl  13:44   0:41 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116196 ?      Ssl  13:44   0:42 python /usr/bin/goferd


As you can see, the RSS column grows over time while goferd cannot contact the Satellite server properly.

24 hours later, the daemon has actually given some memory back:

root      1518  0.2 11.3 1197740 116188 ?      Ssl  Mar14   4:59 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116188 ?      Ssl  Mar14   4:59 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116188 ?      Ssl  Mar14   4:59 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116188 ?      Ssl  Mar14   4:59 python /usr/bin/goferd
root      1518  0.2 11.3 1197740 116188 ?      Ssl  Mar14   4:59 python /usr/bin/goferd

Actual results: RSS of goferd process grows over time until it crashes


Expected results: goferd should not leak memory.


Additional info:

Comment 4 Pavel Moravec 2016-06-08 13:16:44 UTC
Trivially reproducible:

- have a Content Host registered to a Capsule
- on Satellite, stop qdrouterd service
- monitor memory usage of goferd on the Content Host

Stopping qdrouterd on the Capsule instead of the Satellite does _not_ cause the leak. So it seems to be a bug in the qpid-proton library, since only the link fails, not the (TCP+AMQP) connection.
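
For the monitoring step, a minimal sketch; the pgrep-based PID lookup and the 5-second interval are my own convenient choices, not part of the reproducer:

#!/usr/bin/python
# Sample the RSS of the goferd process every 5 seconds (Linux /proc assumed).
import subprocess
from time import sleep

pid = subprocess.check_output(["pgrep", "-f", "/usr/bin/goferd"]).split()[0]

while True:
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS"):
                print line.strip()      # e.g. "VmRSS:    116100 kB"
    sleep(5)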

Comment 5 Pavel Moravec 2016-06-08 14:30:10 UTC
Trivial reproducer pointing the bug in proton:

- have qdrouterd with link routing to a service elsewhere (e.g. qpidd) that is _stopped_
- run the script below, which uses the proton library to create a connection to qdrouterd and then tries to create a receiver on an address link-routed to the stopped qpidd:


#!/usr/bin/python

from time import sleep
from uuid import uuid4

from proton import ConnectionException

from proton.utils import BlockingConnection

conn = None

while True:
  sleep(0.1)
  try:
    conn = BlockingConnection("proton+amqp://qdrouterd.example.com:5648", ssl_domain=None, heartbeat=2)
    print "connection"
    rec = conn.create_receiver("pulp.1318015", name=str(uuid4()), dynamic=False, options=None)
  except Exception:
    if conn:
      conn.close()
      conn = None

- monitor the script's memory usage growth
- wait long enough and the script even terminates after a (longer) while :)


Assigning to kgiusti as I know he fixed past proton bugs (feel free to change it).

Comment 6 Pavel Moravec 2016-06-08 14:30:46 UTC
Forgot to mention:

Reproduced on:
python-qpid-proton-0.9-16.el7.x86_64
qpid-proton-c-0.9-16.el7.x86_64

Comment 12 Pavel Moravec 2017-01-15 14:50:51 UTC
Testing with package versions:

python-qpid-proton-0.9-17.el7.x86_64
qpid-proton-c-0.9-17.el7.x86_64

Tested both #c4 and #c5; the leak seems not to be present.

(technically, goferd memory usage per #c4 was still growing quite slowly over time - I wasn't able to reproduce this outside Satellite, and the growth was really tiny, around 4k RSS per minute)

Comment 18 Satellite Program 2017-02-23 21:08:21 UTC
Please add verification steps for this bug to help QE verify it.

Comment 21 Lukas Zapletal 2017-02-28 14:34:31 UTC
Published EL7 X86-64 hotfix packages at http://people.redhat.com/~lzapleta/hotfix/qpid-memleak-1318015/el7/

Comment 22 Pavel Moravec 2017-03-17 19:19:00 UTC
The fix has a regression worse than the original problem :(

Starting goferd with qpid-proton-0.9-18, the process consumes 100% CPU within a minute. It spins in a busy loop with this backtrace:

Thread 13 
    f=f@entry=Frame 0x7fa3401b4080, for file /usr/lib64/python2.7/site-packages/proton/reactor.py, line 138, in _check_errors 
    f=f@entry=Frame 0x7fa3401b18d0, for file /usr/lib64/python2.7/site-packages/proton/reactor.py, line 146, in process 
    f=f@entry=Frame 0x7fa3401ecfc0, for file /usr/lib64/python2.7/site-packages/proton/utils.py, line 251, in run 
    f=f@entry=Frame 0x7fa3401ff370, for file /usr/lib64/python2.7/site-packages/proton/utils.py, line 241, in close 
    f=f@entry=Frame 0x7fa3401f2f60, for file /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py, line 152, in close 
    f=f@entry=Frame 0x7fa340048260, for file /usr/lib/python2.7/site-packages/gofer/common.py, line 272, in _fn 
    f=f@entry=Frame 0x1f12b70, for file /usr/lib/python2.7/site-packages/gofer/agent/plugin.py, line 316, in attach 
    f=f@entry=Frame 0x7fa340003c50, for file /usr/lib/python2.7/site-packages/gofer/common.py, line 228, in sfn

goferd invokes connection.close(), and the proton reactor then executes the loop at utils.py line 251 forever:

while self.container.process(): pass

because the call at reactor.py line 146,

result = pn_reactor_process(self._impl)

returns "true" every time, making the loop above infinite.

I will try to reproduce it outside goferd via a standalone python script now.

Comment 23 Pavel Moravec 2017-03-17 20:52:03 UTC
Reproducer outside Satellite that triggers the high CPU on 0.9-18: try to close an amqps connection. When not using SSL, everything works well. When using SSL, any attempt to close an opened BlockingConnection triggers the high CPU. The particular script:

#!/usr/bin/python
from proton.utils import BlockingConnection
from proton import SSLDomain

ca_certificate='/etc/rhsm/ca/katello-server-ca.pem'
client_certificate='/etc/pki/consumer/bundle.pem'
client_key=None

domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials(client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

SSL = True  # set to False to check nonSSL version that works well
connectionURL="amqps://pmoravec-caps62-rhel7.gsslab.brq.redhat.com:5647" if SSL else "proton+amqp://pmoravec-caps62-rhel7.gsslab.brq.redhat.com:5648"

qmf_conn = BlockingConnection(connectionURL, ssl_domain=domain if SSL else None, heartbeat=None)
qmf_conn.close()

Comment 28 Pavel Moravec 2017-03-21 11:14:16 UTC
I think this fix prevents the problem upstream:

https://github.com/apache/qpid-proton/commit/92b822c1077f08e68934a630ddef1c17bce24171#diff-1263a41bd7a5218b033cc33321977fb9R411

PROTON-1243: Stop pn_reactor_stop from causing an infinite loop when called from within callback

(esp. the "reactor->stop" check is the crucial one)
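
To put the idea in Python terms, here is a toy sketch of mine (not the actual proton C code, where the fix lives): the old pn_reactor_stop re-entered event processing from within a callback, while the fixed version merely records a stop request that the drain loop checks.

# Toy model: why a stop flag lets "while process(): pass" terminate even when
# stop() is called from inside the loop (all names here are illustrative only).
class ToyReactor(object):
    def __init__(self):
        self.events = ["INIT", "CONNECTION_CLOSED"]
        self._stop = False

    def stop(self):
        self._stop = True          # fixed behaviour: just record the request

    def process(self):
        if self.events:
            print "handling", self.events.pop(0)
        # keep going only while there is still work and no stop was requested
        return bool(self.events) and not self._stop

reactor = ToyReactor()
while reactor.process():
    reactor.stop()                 # stopping from "inside" the loop still converges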

Comment 34 Cliff Jansen 2017-04-19 18:44:54 UTC
I regret that I am unable to reproduce using qpid-proton-0.9-18 with the source in comment 23.  (Just changing certificates and host:port.)  My client was on RHEL7.

I have tried against a "broker" using Proton's SSL io.  Also against qpidd with its (separate) SSL implementation.  I have also tried against an address with no listener.

In all cases the reproducer ran to completion without hanging or busy looping.  The socket resources (which were previously leaking) were closed.

Do I need a dispatch router in the middle?  If so, what version and what relevant SSL config for it?


Thanks.

Comment 35 Pavel Moravec 2017-04-20 06:56:10 UTC
(In reply to Cliff Jansen from comment #34)
> I regret that I am unable to reproduce using qpid-proton-0.9-18 with the
> source in comment 23.  (Just changing certificates and host:port.)  My
> client was on RHEL7.
> 
> I have tried against a "broker" using Proton's SSL io.  Also against qpidd
> with its (separate) SSL implementation.  I have also tried against an
> address with no listener.
> 
> In all cases the reproducer ran to completion without hanging or busy
> looping.  The socket resources (which were previously leaking) were closed.
> 
> Do I need a dispatch router in the middle?  If so, what version and what
> relevant SSL config for it?
> 
> 
> Thanks.

I managed to reproduce it against qpidd (0.34, but the version should not matter, I hope), though it takes a bit longer than against qdrouterd to see the high CPU. But the busy loop is the same.

Comment 37 Lukas Zapletal 2017-04-21 08:59:28 UTC
I have an UNCONFIRMED WORKAROUND: minimize the time when qdrouterd is turned off. It is turned off when doing offline backups, which can make things worse; change the backup script not to turn it off (this was already patched for 6.3):

https://github.com/Katello/katello-packaging/pull/398/files

Until the leak is fixed, I also recommend restarting the goferd client on all nodes every day via cron or systemd.

Comment 38 Pavel Moravec 2017-04-29 13:50:31 UTC
Created attachment 1275146 [details]
Patch fixing 100% CPU in 0.9-18

Work-in-progress patch that fixes the high CPU when closing BlockingConnection. Basically it requires PROTON-1243 backport plus proton-c/src/reactor/reactor.c from upstream (with some small change).

Now, working with proton connections seems to work well (no memory leak, no high CPU) whether the server accepts the connection, or rejects it at the AMQP or TCP level.

But..

goferd still has a memory leak with that package/library, because there is a common memory leak when opening a consumer or producer that hits an error there. I.e. the original reproducer is still valid; a smaller reproducer is trivial:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton.utils import BlockingConnection
# the commented-out receiver line below additionally needs DynamicNodeProperties (from proton.reactor)

qmf_conn = None
qmf_rec = None
qmf_snd = None

while True:
  sleep(0.1)
  try:
    qmf_conn = BlockingConnection("proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648", ssl_domain=None, heartbeat=None)
#    qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True, options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
    qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
  except Exception, e:
    print e
  finally:
    try:
      if qmf_rec:
        qmf_rec.close()
        qmf_rec = None
      if qmf_snd:
        qmf_snd.close()
        qmf_snd = None
      if qmf_conn:
        qmf_conn.close()
        qmf_conn = None
    except Exception, e2:
      print "nested: %s" % e2


Happily, the memory leak is now approximately half of its original magnitude (i.e. the code change from 0.9-16 to 0.9-18 did help). Something is still leaking when using consumers or producers, though.

Comment 39 Pavel Moravec 2017-05-02 13:00:48 UTC
I am giving up here. Below is a patch over 0.9-18 that fixes the spinning CPU, but leaves 1/2 of the original memory leak unfixed.

The reproducer for the leak is VERY trivial: just calling BlockingConnection against a closed port :-/ (in a loop, of course); see the sketch below.
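
A minimal sketch of that reproducer (the host and port are placeholders for any address with no AMQP listener):

#!/usr/bin/python
from time import sleep
from proton.utils import BlockingConnection

# keep connecting to a port nothing listens on; every failed attempt leaks a bit
while True:
    sleep(0.1)
    try:
        conn = BlockingConnection("proton+amqp://localhost:5999", ssl_domain=None, heartbeat=None)
        conn.close()
    except Exception, e:
        print e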


Attempts to track down the remaining memory leak resulted in:
- the discovery that pn_incref / pn_decref is apparently called on all expected objects properly (no evident frequent pn_incref calls without a matching pn_decref)

- the suspicion that the flaw is in reactor.c and not in the .py part - since 0.16 does _not_ have that leak and the .py change from 0.16 to 0.18-pmoravec just cleans up some memory

- raising https://issues.jboss.org/browse/ENTMQCL-504 (imho upstream is affected as well)


Here is the patch over 0.18 that fixes the spinning CPU:

diff -rup 0.18/qpid-proton-0.9/proton-c/bindings/python/proton/utils.py 0.18_new/qpid-proton-0.9/proton-c/bindings/python/proton/utils.py
--- 0.18/qpid-proton-0.9/proton-c/bindings/python/proton/utils.py	2017-04-28 18:51:55.789573553 +0200
+++ 0.18_new/qpid-proton-0.9/proton-c/bindings/python/proton/utils.py	2017-05-02 14:31:25.248310418 +0200
@@ -249,6 +249,8 @@ class BlockingConnection(Handler):
     def run(self):
         """ Hand control over to the event loop (e.g. if waiting indefinitely for incoming messages) """
         while self.container.process(): pass
+        self.container.stop()
+        self.container.process()
 
     def wait(self, condition, timeout=False, msg=None):
         """Call process until condition() is true"""
diff -rup 0.18/qpid-proton-0.9/proton-c/src/reactor/reactor.c 0.18_new/qpid-proton-0.9/proton-c/src/reactor/reactor.c
--- 0.18/qpid-proton-0.9/proton-c/src/reactor/reactor.c	2017-04-28 18:51:55.842573250 +0200
+++ 0.18_new/qpid-proton-0.9/proton-c/src/reactor/reactor.c	2017-05-02 14:40:50.502008406 +0200
@@ -28,6 +28,7 @@
 #include <proton/session.h>
 #include <proton/link.h>
 #include <proton/delivery.h>
+
 #include <stdio.h>
 #include <stdlib.h>
 #include <assert.h>
@@ -51,6 +52,7 @@ struct pn_reactor_t {
   int selectables;
   int timeout;
   bool yield;
+  bool stop;
 };
 
 pn_timestamp_t pn_reactor_mark(pn_reactor_t *reactor) {
@@ -79,6 +81,7 @@ static void pn_reactor_initialize(pn_rea
   reactor->selectables = 0;
   reactor->timeout = 0;
   reactor->yield = false;
+  reactor->stop = false;
   pn_reactor_mark(reactor);
 }
 
@@ -404,21 +407,21 @@ bool pn_reactor_process(pn_reactor_t *re
       previous = reactor->previous = type;
       pn_decref(event);
       pn_collector_pop(reactor->collector);
+    } else if (!reactor->stop && pni_reactor_more(reactor)) {
+      if (previous != PN_REACTOR_QUIESCED && reactor->previous != PN_REACTOR_FINAL) {
+        pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_QUIESCED);
+      } else {
+        return true;
+      }
     } else {
-      if (pni_reactor_more(reactor)) {
-        if (previous != PN_REACTOR_QUIESCED && reactor->previous != PN_REACTOR_FINAL) {
-          pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_QUIESCED);
-        } else {
-          return true;
-        }
+      if (reactor->selectable) {
+        pn_selectable_terminate(reactor->selectable);
+        pn_reactor_update(reactor, reactor->selectable);
+        reactor->selectable = NULL;
       } else {
-        if (reactor->selectable) {
-          pn_selectable_terminate(reactor->selectable);
-          pn_reactor_update(reactor, reactor->selectable);
-          reactor->selectable = NULL;
-        } else {
-          return false;
-        }
+        if (reactor->previous != PN_REACTOR_FINAL)
+          pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_FINAL);
+        return false;
       }
     }
   }
@@ -478,14 +481,11 @@ void pn_reactor_start(pn_reactor_t *reac
   assert(reactor);
   pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_INIT);
   reactor->selectable = pni_timer_selectable(reactor);
- }
+}
 
 void pn_reactor_stop(pn_reactor_t *reactor) {
   assert(reactor);
-  pn_collector_put(reactor->collector, PN_OBJECT, reactor, PN_REACTOR_FINAL);
-  // XXX: should consider removing this fron stop to avoid reentrance
-  pn_reactor_process(reactor);
-  pn_collector_release(reactor->collector);
+  reactor->stop = true;
 }
 
 void pn_reactor_run(pn_reactor_t *reactor) {
@@ -493,5 +493,6 @@ void pn_reactor_run(pn_reactor_t *reacto
   pn_reactor_set_timeout(reactor, 3141);
   pn_reactor_start(reactor);
   while (pn_reactor_process(reactor)) {}
-  pn_reactor_stop(reactor);
+  pn_reactor_process(reactor);
+  pn_collector_release(reactor->collector);
 }

Comment 41 Mike McCune 2017-05-12 14:54:26 UTC
*** Bug 1447834 has been marked as a duplicate of this bug. ***

Comment 49 Cliff Jansen 2017-07-17 15:07:30 UTC
Possible fix upstream (further testing pending)

see: https://issues.apache.org/jira/browse/PROTON-1394

Comment 54 Cliff Jansen 2017-07-25 18:42:40 UTC
Newer patch available.  Fixes segfault in goferd if a BlockingLink
is incompletely created due to an exception.

   https://issues.apache.org/jira/browse/PROTON-1394

Comment 55 Pavel Moravec 2017-07-26 08:09:16 UTC
(In reply to Cliff Jansen from comment #54)
> Newer patch available.  Fixes segfault in goferd if a BlockingLink
> is incompletely created due to an exception.
> 
>    https://issues.apache.org/jira/browse/PROTON-1394

Added to the backported patch_1. No more segfault, but still memory accumulation:

Original behaviour on proton 0.9-16:

- reproducer per #c4 (goferd bounced on AMQP level, probes delays decreased from default 10s to 0.1-1s): RSS growth by 124MB in 10 minutes

- reproducer from #c5 (standalone reproducer): RSS growth by 65MB in 10 minutes


Behaviour on patched proton:

- reproducer per #c4 (goferd bounced on AMQP level, probes delays decreased from default 10s to 0.1-1s): RSS growth by 143MB in 10 minutes

- reproducer from #c5 (standalone reproducer): steady RSS growth by 76MB in 10 minutes

Even if I add "self.container = None" to the newest patch (it might have been unintentionally removed from the code by the patch), nothing changes. If I remove the patch for PROTON-1366, nothing changes either.


Reproducer:

- same machine as two posts ago
  - python goferd_mem_leak-reproducer-c5.py
  or
  - python /usr/bin/goferd --foreground

Comment 56 Pavel Moravec 2017-07-27 08:05:48 UTC
Sorry for the confusion - applying the three patches cumulatively, there is no further memory leak \o/

BUT..

But the high CPU usage is still present. See #c22 and the reproducer in #c23 (the python code line numbers just differ a bit due to the patches applied).

Reproducer on the same machine:

python goferd_mem_leak-reproducer-c23-CPU.py


(it is worth trying the patch from #c38 to fix the high CPU usage - the change is in C code and I don't have enough time to build the sources ATM. But that patch worked well for that problem)

Comment 57 Pavel Moravec 2017-07-28 10:37:02 UTC
Created attachment 1305891 [details]
perspective patch

Thanks to Cliff for the final patch that also fixes the CPU spinning (see #c23 or #c56). The cumulative patch has now passed all my tests:

- goferd exhibits flat memory usage when failing to connect at the AMQP level
- the standalone reproducer does as well
- the same applies when failing to connect at the TCP level (qdrouterd down)
- no memory increase when installing/removing a package via goferd in a loop
- no high CPU when closing a connection (#c23)


I attached the cumulative patch I used:

cd /usr/lib64/python2.*/site-packages
cat /tmp/bz1318015.patch | patch -p4
service goferd restart

(do not apply this at customer sites - otherwise you end up with an unsupported system, though with the memory leak resolved)

Comment 73 John Mitsch 2017-10-26 14:43:37 UTC
Created attachment 1343803 [details]
python-qpid-proton-0.9-20.el6.x86_64.rpm

Comment 74 John Mitsch 2017-10-26 14:44:18 UTC
Created attachment 1343805 [details]
python-qpid-proton-0.9-20.el7.x86_64.rpm

Comment 75 John Mitsch 2017-10-26 14:54:32 UTC
**** HOTFIX INSTRUCTIONS ****

I've attached a qpid-proton build that can be applied to clients as a fix.

You can download the appropriate rpm onto the client and run:

yum localinstall <rpm>

and then restart the goferd service on the client.

In order to install this hotfix on multiple client hosts needing the update, you may need to use a custom Satellite repository to which you upload the hotfix RPM. After uploading, you will need to ensure the clients needing the hotfix have a subscription granting them access to this new custom repository, so you can update via your standard mechanism. If you have questions about how to accomplish this, you can find our documentation here: https://access.redhat.com/documentation/en-us/red_hat_satellite/6.2/html/content_management_guide/importing_custom_content or feel free to contact support.

Comment 76 Matt 2017-10-26 20:45:43 UTC
The attached RPMs don't work. I get a dependency error:

# yum localinstall python-qpid-proton-0.9-20.el7.x86_64.rpm
...
--> Finished Dependency Resolution
Error: Package: python-qpid-proton-0.9-20.el7.x86_64 (/python-qpid-proton-0.9-20.el7.x86_64)
           Requires: qpid-proton-c(x86-64) = 0.9-20.el7
           Installed: qpid-proton-c-0.9-16.el7.x86_64 (@rhel-7-server-satellite-tools-6.2-rpms)
               qpid-proton-c(x86-64) = 0.9-16.el7
**********************************************************************
yum can be configured to try to resolve such errors by temporarily enabling
disabled repos and searching for missing dependencies.
To enable this functionality please set 'notify_only=0' in /etc/yum/pluginconf.d/search-disabled-repos.conf
**********************************************************************

Error: Package: python-qpid-proton-0.9-20.el7.x86_64 (/python-qpid-proton-0.9-20.el7.x86_64)
           Requires: qpid-proton-c(x86-64) = 0.9-20.el7
           Installed: qpid-proton-c-0.9-16.el7.x86_64 (@rhel-7-server-satellite-tools-6.2-rpms)
               qpid-proton-c(x86-64) = 0.9-16.el7
 You could try using --skip-broken to work around the problem


Looks like we need the qpid-proton-c-0.9-20 packages as well. Same problem on both EL6 & EL7.

Comment 77 John Mitsch 2017-10-26 21:13:14 UTC
Created attachment 1343988 [details]
qpid-proton-c-0.9-20.el6.x86_64.rpm

Comment 78 John Mitsch 2017-10-26 21:14:02 UTC
Created attachment 1343989 [details]
qpid-proton-c-0.9-20.el7.x86_64.rpm

Comment 79 John Mitsch 2017-10-26 21:17:37 UTC
Matt,

You are correct, the package was missing the dependency. I've attached the updated qpid-proton-c for both el6 + 7 to this BZ. Let me know if you hit any other issues installing the package.

-John

Comment 80 Patrick M 2017-10-31 13:29:20 UTC
Thank you for the hard work on this....

Would this patch be available for RHEL 6 clients as well?

-Patrick

Comment 81 John Mitsch 2017-10-31 13:39:18 UTC
Patrick,

Both RHEL6 + 7 packages are attached to this BZ.

Comment 82 Patrick M 2017-10-31 14:18:22 UTC
OMG, I just saw that.  Apologies....is there an ETA as to when it will be put in the repos?

Comment 86 Pavel Moravec 2017-11-10 13:41:38 UTC
Jeff,
see the previous internal comment for the reproducer machine - in general it is enough to stop qpidd on the Satellite to trigger the memory leak in goferd. Please investigate there; I am starting to run out of ideas (and time).


Some facts from other experiments:

- in goferd, decreasing DELAY in /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py to 0.1 second makes RSS usage (almost) flat

- decreasing it to 1 second halves the memory leak (halves the increase of RSS over time), compared to the default of 10 seconds

- comparing default goferd with 0.9-20 proton against 0.9-16 proton does show an improvement in memory growth (70kB increase per 10s with 0.9-16 versus 55kB increase per 10s with 0.9-20)

- standalone reproducers don't show a memory leak, regardless of the delay used there

- trying to improve the standalone reproducers - no leak so far

I am starting to speculate that the problem is outside the proton libraries, or in the way goferd uses them? (haven't tried the reliable(fn) trick, e.g.)

(Why did I think the memory leak in goferd was gone? Because I ran the standalone reproducers and goferd with the 0.1 DELAY only, to make the reproducer faster.)
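
For illustration, here is a minimal sketch of mine of the reconnect-with-delay pattern these DELAY observations refer to (the names and structure are placeholders, not gofer's actual reliability.py code):

from time import sleep
from proton.utils import BlockingConnection

DELAY = 10   # seconds between reconnect attempts - the knob tuned above

def connect_forever(url):
    # Every failed attempt builds and tears down a BlockingConnection; if any
    # per-attempt allocation is not freed, the leak rate scales with 1/DELAY.
    while True:
        try:
            return BlockingConnection(url, ssl_domain=None, heartbeat=None)
        except Exception, e:
            print e
            sleep(DELAY)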

Comment 87 Jeff Ortel 2017-11-10 15:49:34 UTC
Looking into it.

Comment 93 Cliff Jansen 2017-11-22 09:01:32 UTC
Created attachment 1357300 [details]
proposed patch for proton 0.9

The 2017-11-13 leak reproducer seems to be timing dependent.  It never (or very rarely) occurred in my local environment.  It occurred about 50% of the time on the internal reproducer machine.

This backport from Proton 0.16 seems to do the trick when a LinkDetached exception ('qd:no-route-to-dest') prompts an application to abruptly close a connection.  Normally I am wary of transplanting code between such distant versions because the underlying reactor has changed so much.  In this case, the breaking of the circular reference happens around unchanged code, so it seems pretty safe to me.
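
To illustrate the circular-reference point with a toy sketch of mine (not proton's actual classes): a connection and its handler referencing each other keep each other alive until something breaks the cycle, which is what the close path does here.

class Handler(object):
    def __init__(self, connection):
        self.connection = connection        # handler -> connection

class Connection(object):
    def __init__(self):
        self.handler = Handler(self)        # connection -> handler: a reference cycle

    def close(self):
        # Breaking the cycle lets plain reference counting free both objects as
        # soon as the caller drops them, instead of leaving them to a later
        # garbage-collection pass - or leaking them outright when the cycle pins
        # memory on the C side, as in this bug.
        if self.handler is not None:
            self.handler.connection = None
            self.handler = None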

Many thanks to those who crafted the reproducer and made the target machine available to me.

Comment 97 Pavel Moravec 2017-11-27 21:49:36 UTC
I have split the new memory leak from the previous (internal) comment off into

https://bugzilla.redhat.com/show_bug.cgi?id=1518008

That one should be independent of this one, unless we identify it as a regression in qpid-proton.

Comment 98 Pavel Moravec 2017-11-28 09:05:26 UTC
(In reply to Pavel Moravec from comment #97)
> I derived the new memory leak from previous (internal) comment to
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1518008
> 
> That one should be independent on this one, until we identify it is a
> regression in qpid-proton.

The same leak (verified against the tiny reproducer per [1]) is also present in 0.9-16.

Therefore those two leaks are independent and the new one is not a regression introduced by the code fix for this one.

So from my side:
- qpid-proton 0.9-21 fully fixes this memory leak
- no regression is known to me
- bz1518008 is a fully independent memory leak that has been present for a longer time
- I would not object to sharing 0.9-21 as an interim fix for the memory leak
- though I suggest also fixing bz1518008 in the next z-stream (so that one official (z-stream) release fixes both leaks)


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1518008#c3

Comment 99 Zach Huntington-Meath 2017-11-30 16:57:53 UTC
Hello all,

We've made a KCS article to address the needed changes for this Bugzilla. You can reach it here:

https://access.redhat.com/solutions/3254571

Comment 106 Peter Ondrejka 2018-01-15 13:56:06 UTC
Verified with Sat 6.2.14 (qpid-proton-c-0.9-21) on both RHEL 6 and 7. There is a small increase in the RSS column on clients after stopping qdrouterd on the Satellite, which stabilizes afterwards (60 kB and 24 kB respectively). No CPU spinning experienced.

Comment 115 errata-xmlrpc 2018-02-05 14:11:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0272

