Bug 1810549 - dropping packets to qdrouterd triggers a memory leak in qpid-proton 0.28.0-2 libraries used by goferd
Summary: dropping packets to qdrouterd triggers a memory leak in qpid-proton 0.28.0-2 libraries used by goferd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 6.8.0
Assignee: Cliff Jansen
QA Contact: vijsingh
URL:
Whiteboard:
Duplicates: 1816831 (view as bug list)
Depends On:
Blocks:
 
Reported: 2020-03-05 13:27 UTC by Pavel Moravec
Modified: 2024-10-01 16:30 UTC
CC List: 33 users

Fixed In Version: qpid-proton-0.28.0-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1832585 (view as bug list)
Environment:
Last Closed: 2020-10-27 13:00:31 UTC
Target Upstream Version:
Embargoed:




Links
System                             ID              Last Updated
Apache JIRA                        PROTON-2187     2020-04-03 18:01:51 UTC
Red Hat Knowledge Base (Solution)  4886141         2020-03-07 09:32:43 UTC
Red Hat Product Errata             RHSA-2020:4366  2020-10-27 13:00:48 UTC

Description Pavel Moravec 2020-03-05 13:27:47 UTC
Description of problem:
When port 5647 is blocked by a firewall, or the Satellite/Capsule is stopped - in either case, packets to the port are dropped - goferd consumes more and more memory over time.

The problem appears to be inside the qpid-proton library, in repeatedly calling

self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10)

which fails with a timeout exception.


The leak is present even in qpid-proton-c-0.28.0-2, which fixed bz1754314.


Version-Release number of selected component (if applicable):
qpid-proton-c-0.28.0-2.el7.x86_64
python-qpid-proton-0.28.0-2.el7.x86_64

python-gofer-proton-2.12.5-5.el7sat.noarch
gofer-2.12.5-5.el7sat.noarch
katello-agent-3.5.1-2.el7sat.noarch


How reproducible:
100%


Steps to Reproduce:
Reproducer with goferd:
0. optionally, to speed up the reproducer, replace the constants 10, 60 and 90 in various places with values 10 times lower; locate them via:

grep -e 90 -e 60 -e 10 $(rpm -ql $(rpm -qa | grep -e goferd -e katello-agent) | grep py$)

1. Restart the goferd service and monitor its memory usage (see the sketch after these steps).

2. On the Content Host, drop packets towards Satellite's port 5647:

iptables -I OUTPUT -p tcp --dport=5647 -j DROP

3. Wait an hour while monitoring goferd memory usage.
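
A minimal monitoring sketch for steps 1 and 3 (my addition, not part of the original reproducer; it assumes pgrep is available and a single goferd process is running):

import subprocess
from time import sleep

# print goferd's VmRSS once a minute; the leak shows up as steady growth
while True:
    pid = subprocess.check_output(['pgrep', '-f', '-o', 'goferd']).strip()
    for line in open('/proc/%s/status' % pid):
        if line.startswith('VmRSS'):
            print line.strip()
    sleep(60)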


Actual results:
3. shows evident memory growth; the growth happens in steps, whenever goferd logs:

Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - Traceback (most recent call last):
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     return fn(*args, **keywords)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 584, in open
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     self._impl.open()
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py", line 45, in _fn
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     return fn(messenger, *args, **kwargs)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py", line 72, in open
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     self.connection.open()
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/connect.py", line 29, in inner
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     impl = fn(connection)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py", line 91, in open
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     ssl_domain=domain)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 245, in __init__
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     msg="Opening connection")
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 314, in wait
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     raise Timeout(txt)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - Timeout: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - Traceback (most recent call last):
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 72, in open
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 -     self.reader.open()
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 45, in _fn
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 -     raise ModelError(*e.args)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - ModelError: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection



Expected results:
3. no such memory leak


Additional info:
A reproducer outside Satellite follows in the next comment.

Comment 2 Pavel Moravec 2020-03-05 13:36:58 UTC
Reproducer using qpid-proton:

- in qpidd, create the queue pulp.TEST.0 (an example command follows this list)
- in the script below, change ROUTER_ADDRESS to point to the broker
- run the script and randomly block the qpidd listening port (with iptables -j DROP; the DROP, as opposed to REJECT, matters for the timeout)
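
For the first step, the queue can be created with qpid-config from qpid-tools (a sketch, assuming the tool is installed and the broker accepts the default credentials; -b takes the broker address):

qpid-config -b broker.address:5672 add queue pulp.TEST.0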


import threading

from proton import Timeout
from proton.utils import BlockingConnection

from time import sleep
from uuid import uuid4

ROUTER_ADDRESS = 'proton+amqp://broker.address:5671'
ADDRESS = "pulp.TEST"
HEARTBEAT = 5
SLEEP = 2.0
THREADS = 1
ADDRESSES = 20

class ReceiverThread(threading.Thread):
    def __init__(self, address=ADDRESS):
        super(ReceiverThread, self).__init__()
        self.address = address
        self.running = True

    def subscribe(self):
        self.recv = self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)

    def connect(self):
        # retry until a BlockingConnection is established; each failed
        # attempt is what leaks memory while packets are being dropped
        connected = False
        while not connected:
            try:
                self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
                connected = True
            except:
                print "received exception on connect, trying again in 0.5s"
                sleep(0.5)
        self.subscribe()

    def reconnect(self):
        try:
            self.recv.close()
        except:
            pass
        try:
            self.conn.close()
        except:
            pass
        self.connect()

    def run(self):
        subscribed = False
        while not subscribed:
            try:
                self.connect()
                subscribed = True
            except:
                print "received exception on connect/subscribe, trying again in 0.5s"
                sleep(0.5)
        print "connected => running"
        while self.running:
            try:
                msg = None
                print "receive with timeout %s" % SLEEP
                msg = self.recv.receive(timeout=SLEEP)
                if msg:
                    print "(%s): accepting msg %s" % (self.address, msg)
                    self.recv.accept()
                    sleep(SLEEP)
                else:
                    print "(%s): no msg received" % self.address
            except KeyboardInterrupt:
                self.running = False
            except Timeout:
                pass
            except Exception as e:
                # any other failure (e.g. dropped packets): tear down and reconnect
                print "(%s): receiver thread: exception %s" % (self.address, e)
                self.reconnect()

    def stop(self):
        self.running = False

threads = []
for i in range(THREADS):
    threads.append(ReceiverThread('%s.%s' % (ADDRESS, i % ADDRESSES)))
    threads[i].start()

running = True
while running:
    try:
        sleep(1)
    except KeyboardInterrupt:
        running = False

for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()

Comment 3 Pavel Moravec 2020-03-05 14:11:27 UTC
It is reproducible on all of these qpid-proton versions:
0.28.0-1
0.28.0-2
0.30.0-2


An even simpler (and faster) reproducer outside Satellite:

from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://qpid.broker.url:5672'

while True:
    connected = False
    while not connected:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10, timeout=3)
            connected = True
        except:
            print "received exception on connect, trying again in 0.1s"
        sleep(0.1)
    print "connected, sleep 2s then start again"
    sleep(2)
    conn.close()

Comment 6 Cliff Jansen 2020-03-30 07:03:42 UTC
Fixed upstream for the reproducer in comment 3.

See commit e5e1b7d from https://issues.apache.org/jira/browse/PROTON-2187

Comment 10 Pavel Moravec 2020-04-14 08:01:38 UTC
*** Bug 1816831 has been marked as a duplicate of this bug. ***

Comment 11 Pavel Moravec 2020-04-14 08:07:40 UTC
Bryan (or Brad, who already has a needinfo raised):

to which version of qpid-proton do you want the upstream fix backported? (see #c8)

Comment 17 Pavel Moravec 2020-04-17 08:31:38 UTC
I tested the packages

qpid-proton-c-0.28.0-3.el7.x86_64
python-qpid-proton-0.28.0-3.el7.x86_64

against both the goferd reproducer and the standalone script reproducer, with mixed outcomes:

1) goferd memory consumption is stable over time
2) the script's memory consumption still grows over time, e.g. using a safer version of the script:

from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5648'

while True:
    connected = False
    while not connected:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10, timeout=10)
            connected = True
        except:
            print "received exception on connect, trying again in 0.1s"
            try:
                conn.close()
            except:
                pass
            conn = None
        sleep(0.1)
    print "connected, sleep 2s then start again"
    sleep(2)
    conn.close()


while alternating between dropping and passing packets via:

# run on the qpidd/qdrouterd side:
while true; do
  echo "$(date): deleting"
  iptables -D INPUT -p tcp --dport 5648 -j DROP
  sleep 60
  echo "$(date): inserting"
  iptables -I INPUT -p tcp --dport 5648 -j DROP
  sleep 60
done


I *think* the more specific trigger is when (sketched below):
- "conn" is an established connection whose heartbeats have not expired
- packets are being dropped at that moment
- conn.close() is executed
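
A minimal sketch of that sequence (my illustration of the suspected path, not a verified reproducer; it assumes the address above and that the iptables loop above flips to DROP while the script sleeps):

from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5648'

while True:
    try:
        # establish while traffic is passing; heartbeats keep the link alive
        conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10, timeout=10)
    except:
        sleep(1)
        continue
    # wait long enough for the iptables loop to start dropping packets
    sleep(90)
    try:
        conn.close()  # close while packets are (possibly) being dropped
    except:
        pass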


This scenario can happen in goferd only very sporadically (I can elaborate if needed), so imho there is no need to fix it specifically for goferd. BUT it does sound like a residual memory leak in qpid-proton.

Comment 18 Pavel Moravec 2020-04-30 10:53:43 UTC
FYI, there is yet another scenario - quite unusual, an evident networking misconfiguration - where even 0.28.0-3 has a memory leak (*maybe* the root cause is the same as for the #c17 reproducer):

the reproducer using goferd:
1) start goferd while no networking issues are present
2) drop _established_ traffic to qdrouterd's port; on Satellite:

iptables -I INPUT -p tcp --dport 5647 -m conntrack --ctstate ESTABLISHED -j DROP

3) observe goferd memory usage growing over time

the reproducer "without goferd/Satellite bits": replace goferd with the script from #c17.


So far, neither of these two reproducers (the script from #c17 with either iptables manipulation) is very applicable to Satellite. But they can be hit by upstream proton users or by other consumers of the downstream packages.


Cliff, Mike: is either of these reproducers an issue for you (outside of Satellite)?

Comment 25 vijsingh 2020-06-09 11:46:55 UTC
Verified.

Version tested: Satellite 6.8.0 Snap 3.0

Steps followed: 

~~~~~~~~~~~~~~~~~~~~~~
a. to speed up the reproducer, replaced the constants 10, 60 and 90 in various places with values 10 times lower; located via:

 grep -e 90 -e 60 -e 10 $(rpm -ql $(rpm -qa | grep -e goferd -e katello-agent) | grep py$)

b. Restarted the goferd service and monitored its memory usage

c. on the Content Host, dropped packets towards Satellite's port 5647:

iptables -I OUTPUT -p tcp --dport=5647 -j DROP

d. waited an hour while monitoring goferd memory usage
~~~~~~~~~~~~~~~~~~~~~~

Observation: 

goferd memory consumption is stable over time.


Package versions:

~~~~~~~~~~~~~~~~~~~~~~
 katello-agent                                   noarch                  3.5.4-1.el7sat                     
 gofer                                           noarch                  2.12.5-7.el7sat                     
 katello-host-tools                              noarch                  3.5.4-1.el7sat                      
 katello-host-tools-fact-plugin                  noarch                  3.5.4-1.el7sat                      
 python-gofer                                    noarch                  2.12.5-7.el7sat                     
 python-gofer-proton                             noarch                  2.12.5-7.el7sat                     
 python-qpid-proton                              x86_64                  0.28.0-3.el7                        
 qpid-proton-c                                   x86_64                  0.28.0-3.el7                        
~~~~~~~~~~~~~~~~~~~~~~

Comment 31 errata-xmlrpc 2020-10-27 13:00:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.8 release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4366

