Bug 1816831
| Summary: | Memory leak in python-proton library when goferd connections are timeouted due to TCP packets dropped | ||
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Amar Ganbat <aganbat> |
| Component: | Qpid | Assignee: | Mike Cressman <mcressma> |
| Status: | CLOSED DUPLICATE | QA Contact: | vijsingh |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 6.6.0 | CC: | pmoravec, wpinheir |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-04-14 08:01:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Amar Ganbat
2020-03-24 20:09:40 UTC
*** Bug 1816829 has been marked as a duplicate of this bug. *** Hello Amar, Can you provide some more information to help with investigating the bugzilla? Do you have reproducer steps or a reproducer environment? Are there logs from the server or client that would help with investigation? (e.g. foreman-debug) Is there are there system statistics that can be shared? (e.g. memory usage, where did it start, client activities, where did it end...etc?) Here is a reproducer inside Satellite:
1) Start goferd on a client, ensure everything is connected properly (this step is must)
2) Simulate a networking issue where packets towards Sat/Caps port 5647 are dropped:
iptables -I INPUT -p tcp --dport 5647 -j DROP
3) /var/log/messages on the client system notice it:
Apr 13 17:25:53 pmoravec-rhel76 goferd: [WARNING][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.reliability:54 - Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 disconnected: Condition('amqp:resource-limit-exceeded', 'local-idle-timeout expired')
4) after a short time (probably in the 10s), a reconnect attempt happens, but it is timeouted:
Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647
Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647
Apr 13 17:26:03 pmoravec-rhel76 goferd: [INFO][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Traceback (most recent call last):
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - return fn(*args, **keywords)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 611, in get
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - return self._impl.get(timeout)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py", line 44, in _fn
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - repair()
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py", line 81, in repair
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - self.connection.open()
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/connect.py", line 29, in inner
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - impl = fn(connection)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py", line 91, in open
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - ssl_domain=domain)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 245, in __init__
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - msg="Opening connection")
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 314, in wait
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - raise Timeout(txt)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.adapter.model:44 - Timeout: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - Traceback (most recent call last):
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 98, in read
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - message, document = reader.next(wait)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - return fn(*args, **keywords)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 645, in next
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - message = self.get(timeout)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 45, in _fn
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - raise ModelError(*e.args)
Apr 13 17:26:04 pmoravec-rhel76 goferd: [ERROR][pulp.agent.32279d52-1393-4e0a-b19f-b0258f7d86c2] gofer.messaging.consumer:112 - ModelError: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection
5) after some time, the above logs re-occur
6) meantime, measure memory usage of goferd - it grows very linearly
TODO on me: prepare a reproducer outside Satellite / goferd, purely using proton (here I assume the leak is in proton, which does not need to be the case)
And here is a reproducer outside Satellite, against qpid-proton-c-0.28.0-2.el7.x86_64 :
1) Have qpidd running with queue pulp.TEST.0
2) Have qdrouterd link-routing everything to qpidd
3) modify in below script ROUTER_ADDRESS to the qdrouterd (I *think* the script below can connect directly to qpidd and bypass qdrouterd)
---------------8<---------
import random
import threading
from proton import Timeout
from proton.utils import BlockingConnection
from time import sleep
from uuid import uuid4
ROUTER_ADDRESS = 'proton+amqp://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5648'
ADDRESS = "pulp.TEST"
HEARTBEAT = 5
SLEEP = 2.0
THREADS = 1
ADDRESSES = 20
class ReceiverThread(threading.Thread):
def __init__(self, address=ADDRESS):
super(ReceiverThread, self).__init__()
self.address = address
self.running = True
def subscribe(self):
self.recv = self.conn.create_receiver('%s' %(self.address), name=str(uuid4()), dynamic=False, options=None)
def connect(self):
connected = False
while not connected:
try:
self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT, timeout=3)
connected = True
except Exception as e:
print("received exception %s on connect, trying again in 0.5s" % e)
sleep(0.5)
self.subscribe()
def reconnect(self):
try:
self.recv.close()
except:
pass
try:
self.conn.close()
except:
pass
self.connect()
def run(self):
subscribed = False
while not subscribed:
try:
self.connect()
subscribed = True
except:
print "received exception on connect/subscribe, trying again in 0.5s"
sleep(0.5)
print "connected => running"
while self.running:
try:
msg = None
print "receive with timeout %s" % SLEEP
msg = self.recv.receive(timeout=SLEEP)
if msg:
print "(%s): accepting msg %s" % (self.address, msg)
self.recv.accept()
sleep(timeout)
else:
print "(%s): no msg received" % self.address
except KeyboardInterrupt:
self.running = False
pass
except Timeout:
pass
except Exception as e:
print "(%s): receiver thread: exception %s" % (self.address, e)
self.reconnect()
pass
def stop(self):
self.running = False
threads = []
for i in range(THREADS):
threads.append(ReceiverThread('%s.%s' %(ADDRESS, i%ADDRESSES)))
threads[i].start()
running = True
while running:
try:
sleep(1)
except KeyboardInterrupt:
running = False
pass
#_in = raw_input("Press Enter to exit:")
for i in range(THREADS):
threads[i].stop()
for i in range(THREADS):
threads[i].join()
---------------8<---------
4) ensure all ports are opened on firewall, and run the python client script
5) Once it connects and logs "receive with timeout 2.0", add iptables rule:
iptables -I INPUT -p tcp --dport 5648 -j DROP
on the system with qdrouterd/qpidd, to drop incoming traffic from the client
6) monitor memory usage of the client, trying to reconnect after a timeout exception - it will grow
7) delete the iptables rule and observe that memory usage of the script is stabilised
8) goto 5) :)
.. and a memory leak in my head hid me the fact the same reproducer for same circumstances applied already to https://bugzilla.redhat.com/show_bug.cgi?id=1810549 Closing this BZ as a DUPLICATE of https://bugzilla.redhat.com/show_bug.cgi?id=1810549 . *** This bug has been marked as a duplicate of bug 1810549 *** |