Bug 1754314

Summary: memory leak in qpid-proton 0.28.0-1 libraries used by goferd when connection to qdrouterd is bounced
Product: Red Hat Satellite Reporter: Pavel Moravec <pmoravec>
Component: Qpid Assignee: Mike Cressman <mcressma>
Status: CLOSED ERRATA QA Contact: Radovan Drazny <rdrazny>
Severity: high Docs Contact:
Priority: high    
Version: 6.5.0 CC: aeladawy, bkearney, bvassova, christian.klier, cjansen, dsynk, fhirtz, gkadam, gmurthy, gpadholi, gpayelka, greartes, hmore, jalviso, kagarwal, ktordeur, kupadhya, mawerner, mcressma, mkalyat, mmccune, momran, mschibli, mvanderw, patalber, pcreech, pdwyer, rcavalca, sadas, saydas, shisingh, skudupud, spetrosi, sraut, vmeghana, wclark, whitedm
Target Milestone: 6.7.0 Keywords: Regression, Triaged
Target Release: Unused   
Hardware: x86_64   
OS: Linux   
Whiteboard: hotfix_delivered
Fixed In Version: qpid-proton-0.28.0-2.{el6,el7,el8} Doc Type: Known Issue
Doc Text:
Satellite hosts that use katello-agent might experience a memory leak caused by the qpid-proton package.
Story Points: ---
Clone Of:
: 1769895 1774268 Environment:
Last Closed: 2020-04-14 13:25:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
RHEL7 Hotfix RPMs none

Description Pavel Moravec 2019-09-22 18:22:28 UTC
Description of problem:
Two scenarios show a memory leak in the goferd process that disappears when the qpid-proton libraries are downgraded from 0.28.0-1.el7 to 0.26.0-3.el7. The leak is therefore expected to be a regression in the python-qpid-proton-0.28.0-1.el7 or qpid-proton-c-0.28.0-1.el7 package.

Reproducer using Satellite (a reproducer outside Satellite can be provided later, if needed): either restart qdrouterd regularly, or set up the SSL certs wrongly. Every connection bounce makes goferd consume 2.5M-3M of extra RSS memory.


Version-Release number of selected component (if applicable):
current Sat6.5 tools, in particular:
python-gofer-2.12.5-3.el7sat.noarch
gofer-2.12.5-3.el7sat.noarch
qpid-proton-c-0.28.0-1.el7.x86_64
python-gofer-proton-2.12.5-3.el7sat.noarch
python-qpid-proton-0.28.0-1.el7.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Have a katello-agent running.
2. Every 11s, restart the qdrouterd that goferd connects to.
3. Alternatively to step 2, disable SSL on the qdrouterd listening port and leave goferd trying (and failing) to connect over SSL. To do so, comment out the ssl-profile line in /etc/qpid-dispatch/qdrouterd.conf:

listener {
    port: 5647
    sasl-mechanisms: ANONYMOUS
#    ssl-profile: server    ### comment out this line
}

To speed up the reproducers, one can increase the frequency of the reconnects by changing these values in /usr/lib/python2.7/site-packages/gofer/messaging/adapter/connect.py:

DELAY = 2 # was 10
MAX_DELAY = 2 # was 90
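
Step 2 can be automated with a small loop. The following is a minimal sketch and not part of the original reproducer: it assumes the router runs as a systemd unit named qdrouterd and that the script runs as root on the router host. The function name bounce_service and its parameters are made up for illustration.

```python
import subprocess
import time


def bounce_service(service="qdrouterd", interval=11, count=None,
                   run=subprocess.call, sleep=time.sleep):
    """Restart `service` every `interval` seconds.

    count=None loops until interrupted; a number stops after that many
    restarts. `run` and `sleep` can be replaced to dry-run the loop.
    Returns the list of exit codes of the restart command.
    """
    codes = []
    done = 0
    while count is None or done < count:
        # Assumption: the router is managed by systemd under this unit name.
        codes.append(run(["systemctl", "restart", service]))
        done += 1
        sleep(interval)
    return codes
```

Calling bounce_service() with the defaults restarts the router every 11s until interrupted; bounce_service(count=100) stops after 100 bounces.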



Actual results:
Either scenario, 2 or 3 (they are fully independent), shows the memory leak on each and every reconnection attempt.
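
The growth can be tracked by sampling the resident set size of the goferd process while the reproducer runs. A minimal sketch, assuming a Linux /proc filesystem; the helper names are illustrative and the PID of goferd has to be supplied by the caller.

```python
import re
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kb(pid):
    """Read the current RSS of process `pid` in kB (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_vmrss_kb(handle.read())


def sample_rss(pid, samples, interval, sleep=time.sleep):
    """Collect `samples` RSS readings, `interval` seconds apart."""
    readings = []
    for _ in range(samples):
        readings.append(read_rss_kb(pid))
        sleep(interval)
    return readings
```

If readings are taken about once per reconnect, the 2.5M-3M per bounce reported in the description would show up as a steady climb between consecutive readings (about 2500-3000 kB each, if M is read as megabytes).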


Expected results:
obviously no mem.leak :)


Additional info:

Comment 3 Pavel Moravec 2019-09-23 06:46:09 UTC
Reproducer script outside Satellite:

1) Have qdrouterd link-route everything (or at least the prefix pulp.*) to qpidd.
2) Have the queue pulp.agent.TEST.2 in qpidd.
3) Scenarios:
- A: use SSL in both qdrouterd and the client program (its code is below), run the client and restart qdrouterd frequently. The client will reconnect automatically.
- B: disable SSL in qdrouterd, leave it enabled in the client, and run the client; it will repeatedly fail to connect because qdrouterd rejects "SSL rubbish" on a plain AMQP connection.
- C: disable SSL in the client as well (just set "SSL = False" in the client code), run the client and restart qdrouterd frequently. The client will reconnect automatically.

In any of the scenarios A, B or C:
- with the 0.26.0-3.el7 proton libraries on the client, there is no memory leak (a tiny memory growth is observed, which sometimes stabilises after 15 minutes)
- with the 0.28.0-1.el7 proton libraries on the client, an evident memory leak is observed
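
One way to tell a tiny growth that stabilises apart from an evident leak is to fit a line through RSS readings from the second half of a run only. This is a sketch under assumptions: the function names are illustrative, and the threshold is an arbitrary value, not one taken from this report.

```python
def growth_per_sample(readings_kb):
    """Least-squares slope of RSS readings, in kB per sample."""
    n = len(readings_kb)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = (n - 1) / 2.0
    mean_y = sum(readings_kb) / float(n)
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings_kb))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(readings_kb, threshold_kb=100.0):
    """True if the second half of the run still grows faster than threshold.

    Only the second half is fitted, so warm-up growth that later flattens
    out is not counted as a leak. Needs at least four readings.
    threshold_kb is an illustrative default, not a value from the report.
    """
    tail = readings_kb[len(readings_kb) // 2:]
    return growth_per_sample(tail) > threshold_kb
```

The readings can come from any periodic RSS measurement of the client process, as long as the sampling interval is constant.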

the script itself:


# Reproducer client for scenarios A, B and C above.
from proton import Timeout
from proton import SSLDomain
from proton.utils import BlockingConnection

from time import sleep
from uuid import uuid4

from gofer.config import Config

RHSM_CONFIG_PATH = '/etc/rhsm/rhsm.conf'
SSL = True  # set to False for scenario C
SSL_S = 'amqps' if SSL else 'proton+amqp'

# Client-side SSL settings; only passed to the connection when SSL is True.
domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db('/etc/rhsm/ca/katello-default-ca.pem')
domain.set_credentials('/etc/pki/consumer/bundle.pem', '/etc/pki/consumer/bundle.pem', None)
domain.set_peer_authentication(SSLDomain.ANONYMOUS_PEER)

rhsm_conf = Config(RHSM_CONFIG_PATH)  # loaded but not used below
ROUTER_ADDRESS = '%s://pmoravec-sat65-on-rhev.gsslab.brq2.redhat.com:5647' % SSL_S
ADDRESS = "pulp.agent.TEST.2"
HEARTBEAT = 5
SLEEP = 5  # receive timeout in seconds

recv = None
conn = None
while True:
    # Phase 1: keep trying until connected and subscribed to the queue.
    subscribed = False
    while not subscribed:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=domain if SSL else None, heartbeat=HEARTBEAT)
            recv = conn.create_receiver(ADDRESS, name=str(uuid4()), dynamic=False, options=None)
            subscribed = True
        except Exception as e:
            print("received exception %s on connect/subscribe, trying again in 0.5s" % e)
            sleep(0.5)

    print("connected => running")
    # Phase 2: receive until the connection breaks, then fall back to phase 1.
    while subscribed:
        try:
            print(recv.receive(SLEEP))
        except Timeout:
            # No message within SLEEP seconds; the link is still fine.
            pass
        except Exception as e:
            print(e)
            # Best-effort cleanup of the broken receiver and connection.
            try:
                recv.close()
                recv = None
            except Exception:
                pass
            try:
                conn.close()
                conn = None
            except Exception:
                pass
            subscribed = False

Comment 4 Pavel Moravec 2019-09-23 06:53:01 UTC
(The A and C reproducer scenarios differ only in the use of SSL, which proves that the proton memory leak is not in the SSL part of the proton code.)

Comment 11 Frank Hirtz 2019-10-14 13:38:23 UTC
Good morning!

How goes progress on a test build and/or candidate?

Thanks again,

Frank.

Comment 22 wclark 2019-11-04 16:05:36 UTC
Created attachment 1632607
RHEL7 Hotfix RPMs

Hotfix is available for RHEL7. To install:

1. Download attached file qpid-proton-HF1754314-RHEL7.tar.gz and extract it

2. Copy the two RPMs inside the archive to each affected RHEL7 gofer client

3. On each client, run: # yum localinstall ./python-qpid-proton-0.28.0-2.el7.x86_64.rpm ./qpid-proton-c-0.28.0-2.el7.x86_64.rpm

4. On each client, run: # systemctl restart goferd
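
After installing, it may be worth confirming that both packages are at the fixed build. The sketch below assumes the output of `rpm -q python-qpid-proton qpid-proton-c` has the name-version-release.arch form seen in the package list in the description. The function names are illustrative, the release comparison only looks at the leading integer of the release field, and treating later versions as fixed is an assumption.

```python
import re

FIXED_VERSION = "0.28.0"
FIXED_RELEASE = 2


def parse_nvr(package_line, name):
    """Split 'name-version-release.arch' into (version, release_number)."""
    pattern = r"^%s-([0-9.]+)-(\d+)\." % re.escape(name)
    match = re.match(pattern, package_line.strip())
    if match is None:
        raise ValueError("unexpected rpm output: %r" % package_line)
    return match.group(1), int(match.group(2))


def at_or_after_fixed_build(package_line, name):
    """True for 0.28.0 with release >= 2, or for any later version.

    Assumption: versions after 0.28.0 carry the fix as well.
    """
    version, release = parse_nvr(package_line, name)
    as_tuple = tuple(int(part) for part in version.split("."))
    fixed = tuple(int(part) for part in FIXED_VERSION.split("."))
    if as_tuple != fixed:
        return as_tuple > fixed
    return release >= FIXED_RELEASE
```

Each line of the rpm output is passed in together with the package name it belongs to. Note that 0.26.0-3 is reported as not at the fixed build even though it does not show this leak, since the check only compares against 0.28.0-2.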

Comment 27 Radovan Drazny 2019-12-12 12:59:49 UTC
Tested with python-qpid-proton-0.28.0-2.el7.x86_64 from the Sat 6.7 Snap 5 Sat Tools, using the reproducer from the initial report (option 3) with lowered DELAY and MAX_DELAY values. After the initial memory initialisation, memory usage settled and remained completely constant even after a few hundred failed connection attempts.

VERIFIED

Comment 32 Pavel Moravec 2020-03-05 16:14:14 UTC
If you report a memory leak on version 0.28.0-2:

1) confirm what the symptoms are (was qdrouterd restarted? are the goferd logs as described?)

2) check whether you are instead hitting https://bugzilla.redhat.com/show_bug.cgi?id=1810549 (a different scenario, present on any recent qpid-proton version)

Comment 37 errata-xmlrpc 2020-04-14 13:25:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1454