Bug 1810549 - dropping packets to qdrouterd triggers a memory leak in qpid-proton 0.28.0-2 libraries used by goferd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 6.8.0
Assignee: Cliff Jansen
QA Contact: vijsingh
URL:
Whiteboard:
Duplicates: 1816831
Depends On:
Blocks:
 
Reported: 2020-03-05 13:27 UTC by Pavel Moravec
Modified: 2024-03-25 15:44 UTC (History)
CC List: 33 users

Fixed In Version: qpid-proton-0.28.0-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1832585
Environment:
Last Closed: 2020-10-27 13:00:31 UTC
Target Upstream Version:
Embargoed:




Links
System                             ID               Last Updated
Apache JIRA                        PROTON-2187      2020-04-03 18:01:51 UTC
Red Hat Knowledge Base (Solution)  4886141          2020-03-07 09:32:43 UTC
Red Hat Product Errata             RHSA-2020:4366   2020-10-27 13:00:48 UTC

Description Pavel Moravec 2020-03-05 13:27:47 UTC
Description of problem:
When port 5647 is blocked on the firewall, or the Satellite/Capsule is stopped (meaning packets to the port are dropped), goferd consumes more and more memory over time.

The problem appears to be inside the qpid-proton library, triggered by repeatedly calling

self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10)

which fails with a timeout exception.


The leak is present even in qpid-proton-c-0.28.0-2, which fixed bz1754314.
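
A minimal sketch (not from the original report) for observing the growth numerically from inside such a reconnect loop; it assumes Linux, where ru_maxrss from the standard resource module is reported in kilobytes, and uses a placeholder broker address:

import resource
from time import sleep

from proton.utils import BlockingConnection

ROUTER_ADDRESS = 'proton+amqp://broker.address:5647'  # placeholder

while True:
    try:
        conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None,
                                  heartbeat=10, timeout=3)
        conn.close()
    except Exception:
        # with packets dropped, this is the Timeout path suspected to leak
        pass
    # peak resident set size so far; on Linux the value is in kB
    print "peak RSS: %d kB" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    sleep(0.1)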


Version-Release number of selected component (if applicable):
qpid-proton-c-0.28.0-2.el7.x86_64
python-qpid-proton-0.28.0-2.el7.x86_64

python-gofer-proton-2.12.5-5.el7sat.noarch
gofer-2.12.5-5.el7sat.noarch
katello-agent-3.5.1-2.el7sat.noarch


How reproducible:
100%


Steps to Reproduce:
Reproducer with goferd:
0. Optionally, to speed up the reproducer, replace the constants 10, 60 and 90 in various places with values 10 times lower; the relevant files can be located with:

grep -e 90 -e 60 -e 10 $(rpm -ql $(rpm -qa | grep -e goferd -e katello-agent) | grep py$)

2. Restart the goferd service and monitor its memory usage.

3. On the Content Host, drop packets towards the Satellite's port 5647:

iptables -I OUTPUT -p tcp --dport=5647 -j DROP

4. Wait an hour while monitoring goferd memory usage (a minimal monitoring sketch follows below).
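
A minimal monitoring sketch (my addition, not from the original report; it assumes Linux /proc and a single process named "goferd"):

import os
import re
from time import sleep

def find_pid(name):
    # naive /proc scan for a process whose comm matches `name`
    for entry in os.listdir('/proc'):
        if entry.isdigit():
            try:
                with open('/proc/%s/comm' % entry) as f:
                    if f.read().strip() == name:
                        return int(entry)
            except IOError:
                pass  # process exited while scanning
    raise RuntimeError('%s not running' % name)

def rss_kb(pid):
    # VmRSS line of /proc/<pid>/status, value in kB
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(re.search(r'\d+', line).group())

pid = find_pid('goferd')
while True:
    print "goferd RSS: %s kB" % rss_kb(pid)
    sleep(60)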


Actual results:
Step 4 shows evident memory growth; the growth happens in steps whenever goferd logs:

Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - Traceback (most recent call last):
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 40, in _fn
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     return fn(*args, **keywords)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 584, in open
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     self._impl.open()
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py", line 45, in _fn
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     return fn(messenger, *args, **kwargs)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py", line 72, in open
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     self.connection.open()
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/connect.py", line 29, in inner
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     impl = fn(connection)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py", line 91, in open
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     ssl_domain=domain)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 245, in __init__
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     msg="Opening connection")
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -   File "/usr/lib64/python2.7/site-packages/proton/_utils.py", line 314, in wait
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 -     raise Timeout(txt)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.adapter.model:44 - Timeout: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - Traceback (most recent call last):
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 72, in open
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 -     self.reader.open()
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 -   File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 45, in _fn
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 -     raise ModelError(*e.args)
Mar  5 11:46:57 pmoravec-rhel76 goferd: [ERROR][pulp.agent.9d2e1446-d3b9-4ca3-9004-17ddc389453e] gofer.messaging.consumer:79 - ModelError: Connection amqps://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5647 timed out: Opening connection



Expected results:
Step 4 shows no such memory leak.


Additional info:
A reproducer outside Satellite follows in the next comment.

Comment 2 Pavel Moravec 2020-03-05 13:36:58 UTC
Reproducer using qpid-proton:

- in qpidd, create the queue pulp.TEST.0
- in the script below, change ROUTER_ADDRESS to point to the broker
- run the script below and randomly block the qpidd listening port (use iptables -j DROP; dropping rather than rejecting matters, because it makes the connect time out)


import threading

from proton import Timeout
from proton.utils import BlockingConnection

from time import sleep
from uuid import uuid4

ROUTER_ADDRESS = 'proton+amqp://broker.address:5671'
ADDRESS = "pulp.TEST"
HEARTBEAT = 5
SLEEP = 2.0
THREADS = 1
ADDRESSES = 20

class ReceiverThread(threading.Thread):
    def __init__(self, address=ADDRESS):
        super(ReceiverThread, self).__init__()
        self.address = address
        self.running = True

    def subscribe(self):
        self.recv = self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)

    def connect(self):
        connected = False
        while not connected:
            try:
                self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
                connected = True
            except:
                print "received exception on connect, trying again in 0.5s"
                sleep(0.5)
        self.subscribe()

    def reconnect(self):
        try:
            self.recv.close()
        except:
            pass
        try:
            self.conn.close()
        except:
            pass
        self.connect()

    def run(self):
        subscribed = False
        while not subscribed:
            try:
                self.connect()
                subscribed = True
            except:
                print "received exception on connect/subscribe, trying again in 0.5s"
                sleep(0.5)
        print "connected => running"
        while self.running:
            try:
                msg = None
                print "receive with timeout %s" % SLEEP
                msg = self.recv.receive(timeout=SLEEP)
                if msg:
                    print "(%s): accepting msg %s" % (self.address, msg)
                    self.recv.accept()
                    sleep(SLEEP)  # was sleep(timeout); 'timeout' was undefined
                else:
                    print "(%s): no msg received" % self.address
            except KeyboardInterrupt:
                self.running = False
            except Timeout:
                pass
            except Exception as e:
                print "(%s): receiver thread: exception %s" % (self.address, e)
                self.reconnect()

    def stop(self):
        self.running = False

threads = []
for i in range(THREADS):
    threads.append(ReceiverThread('%s.%s' % (ADDRESS, i % ADDRESSES)))
    threads[i].start()

running = True
while running:
    try:
        sleep(1)
    except KeyboardInterrupt:
        running = False

for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()

Comment 3 Pavel Moravec 2020-03-05 14:11:27 UTC
It is reproducible on any of these qpid-proton versions:
0.28.0-1
0.28.0-2
0.30.0-2


An even more simplified (and faster) reproducer outside Satellite:

from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://qpid.broker.url:5672'

while True:
    connected = False
    while not connected:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10, timeout=3)
            connected = True
        except:
            print "received exception on connect, trying again in 0.1s"
        sleep(0.1)
    print "connected, sleep 2s then start again"
    sleep(2)
    conn.close()

Comment 6 Cliff Jansen 2020-03-30 07:03:42 UTC
Fixed upstream for the reproducer in comment 3.

See commit e5e1b7d from https://issues.apache.org/jira/browse/PROTON-2187

Comment 10 Pavel Moravec 2020-04-14 08:01:38 UTC
*** Bug 1816831 has been marked as a duplicate of this bug. ***

Comment 11 Pavel Moravec 2020-04-14 08:07:40 UTC
Bryan (or Brad, with the needinfo already raised):

to what version of qpid-proton do you want the upstream fix backported? (see #c8)

Comment 17 Pavel Moravec 2020-04-17 08:31:38 UTC
I tested the packages

qpid-proton-c-0.28.0-3.el7.x86_64
python-qpid-proton-0.28.0-3.el7.x86_64

against both the goferd reproducer and the standalone script reproducer, with mixed outcomes:

1) goferd memory consumption is stable over time
2) the script's memory consumption still grows over time, e.g. using a safer version of the script:

from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5648'

while True:
    connected = False
    while not connected:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10, timeout=10)
            connected = True
        except:
            print "received exception on connect, trying again in 0.1s"
            try:
                conn.close()
            except:
                pass
            conn = None
        sleep(0.1)
    print "connected, sleep 2s then start again"
    sleep(2)
    conn.close()


while alternating between dropping packets and letting them pass, via:

# run on the qpidd/qdrouterd side:
while true; do
  echo "$(date): deleting"
  iptables -D INPUT -p tcp --dport 5648 -j DROP
  sleep 60
  echo "$(date): inserting"
  iptables -I INPUT -p tcp --dport 5648 -j DROP
  sleep 60
done


I *think* the more precise reproducer is when:
- "conn" is an established connection, heartbeats not expired
- we are dropping packets at that moment
- conn.close() is executed


This scenario can happen in goferd only very sporadically (I can elaborate if needed), so imho there is no need to fix it specifically for goferd. BUT it sounds like a residual memory leak in qpid-proton.
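
A sketch of that suspected sequence (my illustration, not from the report; the broker address is a placeholder, and the iptables DROP rule must be inserted on the broker side during the sleep window, before the heartbeat expires):

from time import sleep

from proton.utils import BlockingConnection

ROUTER_ADDRESS = 'proton+amqp://broker.address:5648'  # placeholder

while True:
    try:
        # 1) establish the connection while the network is healthy
        conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None,
                                  heartbeat=10, timeout=10)
        # 2) window in which to insert the iptables DROP rule on the
        #    broker side (heartbeat not yet expired)
        sleep(5)
        # 3) close() runs while packets are being dropped - the step
        #    suspected to leak
        conn.close()
    except Exception as e:
        print "exception: %s" % e
        sleep(1)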

Comment 18 Pavel Moravec 2020-04-30 10:53:43 UTC
FYI, there is yet another scenario - quite unusual, an evident networking misconfiguration - where even 0.28.0-3 has a memory leak (*maybe* the root cause is the same as for the #c17 reproducer):

the reproducer using goferd:
1) start goferd while no networking issues are happening
2) on the Satellite, drop _established_ traffic to qdrouterd's port:
iptables -I INPUT -p tcp --dport 5647 -m conntrack --ctstate ESTABLISHED -j DROP
3) observe goferd memory usage growing over time

the reproducer "without goferd/Satellite bits": replace goferd with the script from #c17.


So far, neither of these two reproducers (the script from #c17 with either iptables variation) is particularly applicable to Satellite. But they can occur with upstream proton or with other consumers of the downstream packages.


Cliff, Mike: is either of these reproducers an issue for you (outside of Satellite)?

Comment 25 vijsingh 2020-06-09 11:46:55 UTC
Verified.

Version tested: Satellite 6.8.0 Snap 3.0

Steps followed: 

~~~~~~~~~~~~~~~~~~~~~~
a. To speed up the reproducer, replaced the constants 10, 60 and 90 in various places with values 10 times lower; located them via:

 grep -e 90 -e 60 -e 10 $(rpm -ql $(rpm -qa | grep -e goferd -e katello-agent) | grep py$)

b. Restarted the goferd service and monitored its memory usage.

c. On the Content Host, dropped packets towards the Satellite's port 5647:

iptables -I OUTPUT -p tcp --dport=5647 -j DROP

d. Waited an hour while monitoring goferd memory usage.
~~~~~~~~~~~~~~~~~~~~~~

Observation: 

goferd memory consumption is stable over time.


Package versions:

~~~~~~~~~~~~~~~~~~~~~~
 katello-agent                                   noarch                  3.5.4-1.el7sat                     
 gofer                                           noarch                  2.12.5-7.el7sat                     
 katello-host-tools                              noarch                  3.5.4-1.el7sat                      
 katello-host-tools-fact-plugin                  noarch                  3.5.4-1.el7sat                      
 python-gofer                                    noarch                  2.12.5-7.el7sat                     
 python-gofer-proton                             noarch                  2.12.5-7.el7sat                     
 python-qpid-proton                              x86_64                  0.28.0-3.el7                        
 qpid-proton-c                                   x86_64                  0.28.0-3.el7                        
~~~~~~~~~~~~~~~~~~~~~~

Comment 31 errata-xmlrpc 2020-10-27 13:00:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.8 release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4366

