Description of problem:

When port 5647 is blocked on the firewall, or the Satellite/Capsule is stopped - meaning packets to the port are dropped - goferd consumes more and more memory over time. The problem appears to be inside the qpid-proton library: repeatedly calling

    self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10)

fails with a timeout exception and leaks memory on each attempt. The leak is present even in qpid-proton-c-0.28.0-2, which fixed bz1754314.

Version-Release number of selected component (if applicable):
qpid-proton-c-0.28.0-2.el7.x86_64
python-qpid-proton-0.28.0-2.el7.x86_64
python-gofer-proton-2.12.5-5.el7sat.noarch
gofer-2.12.5-5.el7sat.noarch
katello-agent-3.5.1-2.el7sat.noarch

How reproducible:
100%

Steps to Reproduce:
Reproducer with goferd:
0. optionally, to speed up the reproducer, replace the constants 10, 60 and 90 in various places by values 10 times lower; to locate them:
   grep -e 90 -e 60 -e 10 $(rpm -ql $(rpm -qa | grep -e goferd -e katello-agent) | grep py$)
2. Restart the goferd service and monitor its memory usage.
3. On the Content Host, drop packets towards Satellite's port 5647:
   iptables -I OUTPUT -p tcp --dport=5647 -j DROP
4. Wait an hour while monitoring goferd memory usage.

Actual results:
4. shows evident memory growth; the growth happens in steps, whenever goferd logs:

Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - Traceback (most recent call last):
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - return fn(*args, **keywords)
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 584, in open
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - self._impl.open()
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py", line 45, in _fn
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - return fn(messenger, *args, **kwargs)
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py", line 72, in open
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - self.connection.open()
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/connect.py", line 29, in inner
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - impl = fn(connection)
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py", line 91, in open
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - ssl_domain=domain)
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 245, in __init__
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - msg="Opening connection")
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 314, in wait
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - raise Timeout(txt)
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - Timeout: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - Traceback (most recent call last):
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 72, in open
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - self.reader.open()
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 45, in _fn
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - raise ModelError(*e.args)
Mar 5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - ModelError: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection

Expected results:
4. no such memory leak

Additional info:
Reproducer outside Satellite to follow in next comment
Reproducer using qpid-proton:
- in qpidd, create queue pulp.TEST.0
- in the script below, replace ROUTER_ADDRESS to point to the broker
- run the script and randomly block the qpidd listening port (iptables -j DROP; dropping, rather than rejecting, is what matters for hitting the timeout)

import random
import threading
from proton import Timeout
from proton.utils import BlockingConnection
from time import sleep
from uuid import uuid4

ROUTER_ADDRESS = 'proton+amqp://broker.address:5671'
ADDRESS = "pulp.TEST"
HEARTBEAT = 5
SLEEP = 2.0
THREADS = 1
ADDRESSES = 20

class ReceiverThread(threading.Thread):
    def __init__(self, address=ADDRESS):
        super(ReceiverThread, self).__init__()
        self.address = address
        self.running = True

    def subscribe(self):
        self.recv = self.conn.create_receiver('%s' % (self.address),
                                              name=str(uuid4()),
                                              dynamic=False,
                                              options=None)

    def connect(self):
        connected = False
        while not connected:
            try:
                self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None,
                                               heartbeat=HEARTBEAT)
                connected = True
            except:
                print "received exception on connect, trying again in 0.5s"
                sleep(0.5)
        self.subscribe()

    def reconnect(self):
        try:
            self.recv.close()
        except:
            pass
        try:
            self.conn.close()
        except:
            pass
        self.connect()

    def run(self):
        subscribed = False
        while not subscribed:
            try:
                self.connect()
                subscribed = True
            except:
                print "received exception on connect/subscribe, trying again in 0.5s"
                sleep(0.5)
        print "connected => running"
        while self.running:
            try:
                msg = None
                print "receive with timeout %s" % SLEEP
                msg = self.recv.receive(timeout=SLEEP)
                if msg:
                    print "(%s): accepting msg %s" % (self.address, msg)
                    self.recv.accept()
                    sleep(SLEEP)  # original script had sleep(timeout); 'timeout' was undefined
                else:
                    print "(%s): no msg received" % self.address
            except KeyboardInterrupt:
                self.running = False
            except Timeout:
                pass
            except Exception as e:
                print "(%s): receiver thread: exception %s" % (self.address, e)
                self.reconnect()

    def stop(self):
        self.running = False

threads = []
for i in range(THREADS):
    threads.append(ReceiverThread('%s.%s' % (ADDRESS, i % ADDRESSES)))
    threads[i].start()

running = True
while running:
    try:
        sleep(1)
    except KeyboardInterrupt:
        running = False

for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()
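For the "randomly block the qpidd listening port" part, a loop along these lines can be run on the broker host (a sketch; port 5671 matches ROUTER_ADDRESS above, and the random interval bounds are my own choice):

# toggle dropping of the broker port at random intervals (run as root)
while true; do
    iptables -I INPUT -p tcp --dport 5671 -j DROP
    sleep $((RANDOM % 90 + 30))
    iptables -D INPUT -p tcp --dport 5671 -j DROP
    sleep $((RANDOM % 90 + 30))
done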
It is reproducible on any of these qpid-proton versions:
0.28.0-1
0.28.0-2
0.30.0-2

An even more simplified (and faster) reproducer outside Satellite:

from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://qpid.broker.url:5672'

while True:
    connected = False
    while not connected:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None,
                                      heartbeat=10, timeout=3)
            connected = True
        except:
            print "received exception on connect, trying again in 0.1s"
            sleep(0.1)
    print "connected, sleep 2s then start again"
    sleep(2)
    conn.close()
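To confirm the growth comes from native (qpid-proton-c) allocations rather than Python objects, the script can be run under valgrind's massif tool (my suggestion, not from the original report; "repro.py" is a hypothetical filename for the script above):

# profile total heap/mapping growth of the reproducer
valgrind --tool=massif --pages-as-heap=yes python repro.py
ms_print massif.out.*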
Upstream fix for the reproducer in comment 3: commit e5e1b7d, see https://issues.apache.org/jira/browse/PROTON-2187
*** Bug 1816831 has been marked as a duplicate of this bug. ***
Bryan (or Brad, with the needinfo already raised): to which version of qpid-proton do you request the upstream fix be backported? (see #c8)
I tested the packages:

qpid-proton-c-0.28.0-3.el7.x86_64
python-qpid-proton-0.28.0-3.el7.x86_64

against both the goferd reproducer and the standalone script reproducer, with mixed outcomes:

1) goferd memory consumption is stable over time
2) the script's memory consumption still grows over time, e.g. using a safer version of the script:

from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5648'

while True:
    connected = False
    while not connected:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None,
                                      heartbeat=10, timeout=10)
            connected = True
        except:
            print "received exception on connect, trying again in 0.1s"
            try:
                conn.close()
            except:
                pass
            conn = None
            sleep(0.1)
    print "connected, sleep 2s then start again"
    sleep(2)
    conn.close()

and alternating between dropping packets and passing them via:

# run on the qpidd/qdrouterd side:
while true; do
    echo "$(date): deleting"
    iptables -D INPUT -p tcp --dport 5648 -j DROP
    sleep 60
    echo "$(date): inserting"
    iptables -I INPUT -p tcp --dport 5648 -j DROP
    sleep 60
done

I *think* the more particular reproducer is when:
- "conn" is an established connection, heartbeats not expired
- we are dropping packets at the moment
- conn.close() is executed

This scenario can happen in goferd only very sporadically (I can elaborate if needed), so imho there is no need to fix it specifically for goferd. BUT it looks like a residual memory leak in qpid-proton.
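To bias the test toward that suspected window (connection established, drop starts, close() executed), the toggle period can be shortened to roughly match the client's 2-second connected phase - a sketch under that timing assumption, not something I have measured:

# run on the qpidd/qdrouterd side; 3s period vs. the client's ~2s connected window
while true; do
    iptables -I INPUT -p tcp --dport 5648 -j DROP
    sleep 3
    iptables -D INPUT -p tcp --dport 5648 -j DROP
    sleep 3
done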
FYI there is yet another scenario - quite unusual, an evident networking misconfiguration - where even 0.28.0-3 has a memory leak (*maybe* the root cause is the same as for the #c17 reproducer).

The reproducer using goferd:
1) start goferd while no networking issues are happening
2) drop _established_ traffic to qdrouterd's port; on Satellite:
   iptables -I INPUT -p tcp --dport 5647 -m conntrack --ctstate ESTABLISHED -j DROP
3) observe goferd memory usage growing over time

The reproducer "without goferd/Satellite bits": replace goferd by the script from #c17.

So far, neither of these two reproducers (the script from #c17 with either way of playing with iptables) is very applicable to Satellite. But they can happen in upstream proton, or anywhere else the downstream packages are used.

Cliff, Mike: is either of these reproducers an issue for you (elsewhere than in Satellite)?
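To revert step 2 after the test (a sketch; the conntrack invocation assumes conntrack-tools is installed):

iptables -D INPUT -p tcp --dport 5647 -m conntrack --ctstate ESTABLISHED -j DROP
conntrack -D -p tcp --orig-port-dst 5647   # optional: flush the already-tracked connections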
Verified.

Version Tested: @Satellite 6.8.0 Snap 3.0

Steps followed:
~~~~~~~~~~~~~~~~~~~~~~
a. to speed up the reproducer, replaced the constants 10, 60 and 90 in various places by values 10 times lower; located via:
   grep -e 90 -e 60 -e 10 $(rpm -ql $(rpm -qa | grep -e goferd -e katello-agent) | grep py$)
b. restarted the goferd service and monitored its memory usage
c. on the Content Host, dropped packets towards Satellite's port 5647:
   iptables -I OUTPUT -p tcp --dport=5647 -j DROP
d. waited an hour while monitoring goferd memory usage
~~~~~~~~~~~~~~~~~~~~~~

Observation: goferd memory consumption is stable over time.

Package versions:
~~~~~~~~~~~~~~~~~~~~~~
katello-agent                   noarch  3.5.4-1.el7sat
gofer                           noarch  2.12.5-7.el7sat
katello-host-tools              noarch  3.5.4-1.el7sat
katello-host-tools-fact-plugin  noarch  3.5.4-1.el7sat
python-gofer                    noarch  2.12.5-7.el7sat
python-gofer-proton             noarch  2.12.5-7.el7sat
python-qpid-proton              x86_64  0.28.0-3.el7
qpid-proton-c                   x86_64  0.28.0-3.el7
~~~~~~~~~~~~~~~~~~~~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.8 release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4366