Bug 1816831

Summary: Memory leak in python-proton library when goferd connections are timeouted due to TCP packets dropped
Product: Red Hat Satellite Reporter: Amar Ganbat <aganbat>
Component: QpidAssignee: Mike Cressman <mcressma>
Status: CLOSED DUPLICATE QA Contact: vijsingh
Severity: medium Docs Contact:
Priority: medium    
Version: 6.6.0CC: pmoravec, wpinheir
Target Milestone: UnspecifiedKeywords: Triaged
Target Release: Unused   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-14 08:01:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Amar Ganbat 2020-03-24 20:09:40 UTC
Description of problem:

The content host's memory consumption is happening and affecting the content host after a while (weeks or months) according to the usage.
Even with the updated the proton packages.  

~~~
qpid-proton-c-0.28.0-2.el7.x86_64
python-qpid-proton-0.28.0-2.el7.x86_64
~~~


How reproducible:
Install the latest version of katello-agent and monitor the usage.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Brad Buckingham 2020-03-27 14:14:59 UTC
*** Bug 1816829 has been marked as a duplicate of this bug. ***

Comment 4 Brad Buckingham 2020-03-27 14:17:40 UTC
Hello Amar,

Can you provide some more information to help with investigating the bugzilla?
Do you have reproducer steps or a reproducer environment?
Are there logs from the server or client that would help with investigation?  (e.g. foreman-debug)
Is there are there system statistics that can be shared? (e.g. memory usage, where did it start, client activities, where did it end...etc?)

Comment 8 Pavel Moravec 2020-04-13 15:33:06 UTC
Here is a reproducer inside Satellite:

1) Start goferd on a client, ensure everything is connected properly (this step is must)

2) Simulate a networking issue where packets towards Sat/Caps port 5647 are dropped:

iptables -I INPUT -p tcp --dport 5647 -j DROP

3) /var/log/messages on the client system notice it:

Apr 13 17:25:53 pmoravec-rhel76 goferd: [WARNING][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.reliability:54 - Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 disconnected: Condition('amqp:resource-limit-exceeded', 'local-idle-timeout expired')

4) after a short time (probably in the 10s), a reconnect attempt happens, but it is timeouted:
Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647
Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647
Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Traceback (most recent call last):
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -     return fn(*args, **keywords)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 611, in get
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -     return self._impl.get(timeout)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py", line 44, in _fn
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -     repair()
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py", line 81, in repair
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -     self.connection.open()
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/connect.py", line 29, in inner
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -     impl = fn(connection)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py", line 91, in open
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -     ssl_domain=domain)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -   File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 245, in __init__
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -     msg="Opening connection")
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -   File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 314, in wait
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 -     raise Timeout(txt)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Timeout: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - Traceback (most recent call last):
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 98, in read
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 -     message, document = reader.next(wait)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 -     return fn(*args, **keywords)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 645, in next
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 -     message = self.get(timeout)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 45, in _fn
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 -     raise ModelError(*e.args)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - ModelError: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection

5) after some time, the above logs re-occur

6) meantime, measure memory usage of goferd - it grows very linearly



TODO on me: prepare a reproducer outside Satellite / goferd, purely using proton (here I assume the leak is in proton, which does not need to be the case)

Comment 9 Pavel Moravec 2020-04-13 20:13:12 UTC
And here is a reproducer outside Satellite, against qpid-proton-c-0.28.0-2.el7.x86_64 :

1) Have qpidd running with queue pulp.TEST.0
2) Have qdrouterd link-routing everything to qpidd
3) modify in below script ROUTER_ADDRESS to the qdrouterd (I *think* the script below can connect directly to qpidd and bypass qdrouterd)

---------------8<---------

import random
import threading

from proton import Timeout
from proton.utils import BlockingConnection

from time import sleep
from uuid import uuid4

ROUTER_ADDRESS = 'proton+amqp://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5648'
ADDRESS = "pulp.TEST"
HEARTBEAT = 5
SLEEP = 2.0
THREADS = 1
ADDRESSES = 20
 
class ReceiverThread(threading.Thread):
    def __init__(self, address=ADDRESS):
        super(ReceiverThread, self).__init__()
        self.address = address
        self.running = True

    def subscribe(self):
	self.recv = self.conn.create_receiver('%s' %(self.address), name=str(uuid4()), dynamic=False, options=None)
 
    def connect(self):
        connected = False
        while not connected:
            try:
                self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT, timeout=3)
                connected = True
            except Exception as e:
                print("received exception %s on connect, trying again in 0.5s" % e)
                sleep(0.5)

	self.subscribe()
 
    def reconnect(self):
        try:
            self.recv.close()
	except:
	    pass
        try:
    	    self.conn.close()
	except:
            pass
        self.connect()
 
    def run(self):
        subscribed = False
        while not subscribed:
            try:
                self.connect()
                subscribed = True
            except:
                print "received exception on connect/subscribe, trying again in 0.5s"
                sleep(0.5)

	print "connected => running"
        while self.running:
	    try:
                msg = None
                print "receive with timeout %s" % SLEEP
		msg = self.recv.receive(timeout=SLEEP)
		if msg:
	            print "(%s): accepting msg %s" % (self.address, msg)
		    self.recv.accept()
		    sleep(timeout)
                else:
                    print "(%s): no msg received" % self.address
	    except KeyboardInterrupt:
                self.running = False
		pass
            except Timeout:
		pass
	    except Exception as e:
		print "(%s): receiver thread: exception %s" % (self.address, e)
		self.reconnect()
		pass
 
    def stop(self):
        self.running = False
 
threads = []
for i in range(THREADS):
  threads.append(ReceiverThread('%s.%s' %(ADDRESS, i%ADDRESSES)))
  threads[i].start()

running = True
while running:
    try:
	sleep(1)
    except KeyboardInterrupt:
        running = False
	pass
#_in = raw_input("Press Enter to exit:")
 
for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()

---------------8<---------


4) ensure all ports are opened on firewall, and run the python client script
5) Once it connects and logs "receive with timeout 2.0", add iptables rule:

iptables -I INPUT -p tcp --dport 5648 -j DROP

on the system with qdrouterd/qpidd, to drop incoming traffic from the client

6) monitor memory usage of the client, trying to reconnect after a timeout exception - it will grow

7) delete the iptables rule and observe that memory usage of the script is stabilised

8) goto 5) :)

Comment 10 Pavel Moravec 2020-04-14 08:01:38 UTC
.. and a memory leak in my head hid me the fact the same reproducer for same circumstances applied already to https://bugzilla.redhat.com/show_bug.cgi?id=1810549

Closing this BZ as a DUPLICATE of https://bugzilla.redhat.com/show_bug.cgi?id=1810549 .

*** This bug has been marked as a duplicate of bug 1810549 ***