Bug 1816831
Summary: | Memory leak in python-proton library when goferd connections are timeouted due to TCP packets dropped | ||
---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Amar Ganbat <aganbat> |
Component: | Qpid | Assignee: | Mike Cressman <mcressma> |
Status: | CLOSED DUPLICATE | QA Contact: | vijsingh |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 6.6.0 | CC: | pmoravec, wpinheir |
Target Milestone: | Unspecified | Keywords: | Triaged |
Target Release: | Unused | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-04-14 08:01:38 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Amar Ganbat
2020-03-24 20:09:40 UTC
*** Bug 1816829 has been marked as a duplicate of this bug. *** Hello Amar, Can you provide some more information to help with investigating the bugzilla? Do you have reproducer steps or a reproducer environment? Are there logs from the server or client that would help with investigation? (e.g. foreman-debug) Is there are there system statistics that can be shared? (e.g. memory usage, where did it start, client activities, where did it end...etc?) Here is a reproducer inside Satellite: 1) Start goferd on a client, ensure everything is connected properly (this step is must) 2) Simulate a networking issue where packets towards Sat/Caps port 5647 are dropped: iptables -I INPUT -p tcp --dport 5647 -j DROP 3) /var/log/messages on the client system notice it: Apr 13 17:25:53 pmoravec-rhel76 goferd: [WARNING][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.reliability:54 - Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 disconnected: Condition('amqp:resource-limit-exceeded', 'local-idle-timeout expired') 4) after a short time (probably in the 10s), a reconnect attempt happens, but it is timeouted: Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Traceback (most recent call last): Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - return fn(*args, **keywords) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 611, in get Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - return self._impl.get(timeout) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py", line 44, in _fn Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - repair() Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py", line 81, in repair Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - self.connection.open() Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/connect.py", line 29, in inner Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - impl = fn(connection) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py", line 91, in open Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - ssl_domain=domain) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 245, in __init__ Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - msg="Opening connection") Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 314, in wait Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - raise Timeout(txt) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Timeout: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2 Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - Traceback (most recent call last): Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 98, in read Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - message, document = reader.next(wait) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - return fn(*args, **keywords) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 645, in next Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - message = self.get(timeout) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 45, in _fn Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - raise ModelError(*e.args) Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - ModelError: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection 5) after some time, the above logs re-occur 6) meantime, measure memory usage of goferd - it grows very linearly TODO on me: prepare a reproducer outside Satellite / goferd, purely using proton (here I assume the leak is in proton, which does not need to be the case) And here is a reproducer outside Satellite, against qpid-proton-c-0.28.0-2.el7.x86_64 : 1) Have qpidd running with queue pulp.TEST.0 2) Have qdrouterd link-routing everything to qpidd 3) modify in below script ROUTER_ADDRESS to the qdrouterd (I *think* the script below can connect directly to qpidd and bypass qdrouterd) ---------------8<--------- import random import threading from proton import Timeout from proton.utils import BlockingConnection from time import sleep from uuid import uuid4 ROUTER_ADDRESS = 'proton+amqp://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5648' ADDRESS = "pulp.TEST" HEARTBEAT = 5 SLEEP = 2.0 THREADS = 1 ADDRESSES = 20 class ReceiverThread(threading.Thread): def __init__(self, address=ADDRESS): super(ReceiverThread, self).__init__() self.address = address self.running = True def subscribe(self): self.recv = self.conn.create_receiver('%s' %(self.address), name=str(uuid4()), dynamic=False, options=None) def connect(self): connected = False while not connected: try: self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT, timeout=3) connected = True except Exception as e: print("received exception %s on connect, trying again in 0.5s" % e) sleep(0.5) self.subscribe() def reconnect(self): try: self.recv.close() except: pass try: self.conn.close() except: pass self.connect() def run(self): subscribed = False while not subscribed: try: self.connect() subscribed = True except: print "received exception on connect/subscribe, trying again in 0.5s" sleep(0.5) print "connected => running" while self.running: try: msg = None print "receive with timeout %s" % SLEEP msg = self.recv.receive(timeout=SLEEP) if msg: print "(%s): accepting msg %s" % (self.address, msg) self.recv.accept() sleep(timeout) else: print "(%s): no msg received" % self.address except KeyboardInterrupt: self.running = False pass except Timeout: pass except Exception as e: print "(%s): receiver thread: exception %s" % (self.address, e) self.reconnect() pass def stop(self): self.running = False threads = [] for i in range(THREADS): threads.append(ReceiverThread('%s.%s' %(ADDRESS, i%ADDRESSES))) threads[i].start() running = True while running: try: sleep(1) except KeyboardInterrupt: running = False pass #_in = raw_input("Press Enter to exit:") for i in range(THREADS): threads[i].stop() for i in range(THREADS): threads[i].join() ---------------8<--------- 4) ensure all ports are opened on firewall, and run the python client script 5) Once it connects and logs "receive with timeout 2.0", add iptables rule: iptables -I INPUT -p tcp --dport 5648 -j DROP on the system with qdrouterd/qpidd, to drop incoming traffic from the client 6) monitor memory usage of the client, trying to reconnect after a timeout exception - it will grow 7) delete the iptables rule and observe that memory usage of the script is stabilised 8) goto 5) :) .. and a memory leak in my head hid me the fact the same reproducer for same circumstances applied already to https://bugzilla.redhat.com/show_bug.cgi?id=1810549 Closing this BZ as a DUPLICATE of https://bugzilla.redhat.com/show_bug.cgi?id=1810549 . *** This bug has been marked as a duplicate of bug 1810549 *** |