Bug 1810549
| Summary: | dropping packets to qdrouterd triggers a memory leak in qpid-proton 0.28.0-2 libraries used by goferd | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Qpid | Assignee: | Cliff Jansen <cjansen> |
| Status: | CLOSED ERRATA | QA Contact: | vijsingh |
| Severity: | high | Priority: | high |
| Version: | 6.7.0 | Target Milestone: | 6.8.0 |
| Target Release: | Unused | Keywords: | Triaged |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | qpid-proton-0.28.0-3 | Doc Type: | If docs needed, set a value |
| Clone Of: | | : | 1832585 (view as bug list) |
| Last Closed: | 2020-10-27 13:00:31 UTC | Type: | Bug |
| CC: | aganbat, alexandre.chanu, alsouza, bbuckingham, bkearney, cjansen, dojones, dsynk, egolov, ehelms, francesco.trentini, gpayelka, kkohli, ktordeur, kupadhya, mcressma, momran, mshimura, mvanderw, myoder, rbdiri, rrajput, sadas, saydas, smajumda, smutkule, sraut, susalvi, tlongo, vijsingh, whitedm, wpinheir, zhunting | | |
|
Description
Pavel Moravec, 2020-03-05 13:27:47 UTC
Reproducer using qpid-proton:
- in qpidd, create the queue pulp.TEST.0
- in the script below, replace ROUTER_ADDRESS to point to the broker
- run the script below and randomly block the qpidd listening port (using iptables -j DROP; dropping rather than rejecting matters, so that the client hits the timeout)
import random
import threading
from proton import Timeout
from proton.utils import BlockingConnection
from time import sleep
from uuid import uuid4

ROUTER_ADDRESS = 'proton+amqp://broker.address:5671'
ADDRESS = "pulp.TEST"
HEARTBEAT = 5
SLEEP = 2.0
THREADS = 1
ADDRESSES = 20

class ReceiverThread(threading.Thread):
    def __init__(self, address=ADDRESS):
        super(ReceiverThread, self).__init__()
        self.address = address
        self.running = True

    def subscribe(self):
        self.recv = self.conn.create_receiver('%s' % (self.address), name=str(uuid4()), dynamic=False, options=None)

    def connect(self):
        # retry until the BlockingConnection is established, then subscribe
        connected = False
        while not connected:
            try:
                self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
                connected = True
            except:
                print "received exception on connect, trying again in 0.5s"
                sleep(0.5)
        self.subscribe()

    def reconnect(self):
        # best-effort close of the old receiver and connection before reconnecting
        try:
            self.recv.close()
        except:
            pass
        try:
            self.conn.close()
        except:
            pass
        self.connect()

    def run(self):
        subscribed = False
        while not subscribed:
            try:
                self.connect()
                subscribed = True
            except:
                print "received exception on connect/subscribe, trying again in 0.5s"
                sleep(0.5)
        print "connected => running"
        while self.running:
            try:
                msg = None
                print "receive with timeout %s" % SLEEP
                msg = self.recv.receive(timeout=SLEEP)
                if msg:
                    print "(%s): accepting msg %s" % (self.address, msg)
                    self.recv.accept()
                    sleep(SLEEP)  # was sleep(timeout); 'timeout' is undefined here
                else:
                    print "(%s): no msg received" % self.address
            except KeyboardInterrupt:
                self.running = False
            except Timeout:
                pass
            except Exception as e:
                print "(%s): receiver thread: exception %s" % (self.address, e)
                self.reconnect()

    def stop(self):
        self.running = False

threads = []
for i in range(THREADS):
    threads.append(ReceiverThread('%s.%s' % (ADDRESS, i % ADDRESSES)))
    threads[i].start()

running = True
while running:
    try:
        sleep(1)
    except KeyboardInterrupt:
        running = False

for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()
It is reproducible on any of these qpid-proton versions:
- 0.28.0-1
- 0.28.0-2
- 0.30.0-2
An even more simplified (and faster) reproducer outside Satellite:
from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://qpid.broker.url:5672'

while True:
    connected = False
    while not connected:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10, timeout=3)
            connected = True
        except:
            print "received exception on connect, trying again in 0.1s"
            sleep(0.1)
    print "connected, sleep 2s then start again"
    sleep(2)
    conn.close()
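To quantify the growth these reproducers cause, the process's resident set size can be sampled from /proc while the script runs. The helper below is an illustrative Linux-only addition, not part of the original reproducer:

```python
# Sample the resident set size (VmRSS) of a process from /proc.
# Illustrative sketch; helper names are not from the bug report.
def rss_kib(pid):
    """Return the VmRSS of `pid` in KiB, parsed from /proc/<pid>/status."""
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # kernel reports the value in kB
    return None

if __name__ == '__main__':
    import os
    import time
    # e.g. point this at the reproducer script's PID, or goferd's PID
    pid = os.getpid()
    for _ in range(3):
        print('%s VmRSS: %s KiB' % (time.strftime('%T'), rss_kib(pid)))
        time.sleep(0.5)
```

A steadily increasing VmRSS across connect/close cycles is what the reproducers above exhibit on the affected versions.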
Fix upstream for the reproducer in comment 3: see commit e5e1b7d from https://issues.apache.org/jira/browse/PROTON-2187

*** Bug 1816831 has been marked as a duplicate of this bug. ***

Bryan (or Brad, with a needinfo already raised): to what version of qpid-proton do you want the upstream fix backported? (see #c8)

I tested the packages
qpid-proton-c-0.28.0-3.el7.x86_64
python-qpid-proton-0.28.0-3.el7.x86_64
against both goferd reproducer and standalone script reproducer, with mixed outcome:
1) goferd memory consumption is stable over time
2) the script's memory consumption still grows over time, e.g. using this safer version of the script:
from proton.utils import BlockingConnection
from time import sleep

ROUTER_ADDRESS = 'proton+amqp://pmoravec-sat66-on-rhev.gsslab.brq2.redhat.com:5648'

while True:
    connected = False
    while not connected:
        try:
            conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=10, timeout=10)
            connected = True
        except:
            print "received exception on connect, trying again in 0.1s"
            try:
                conn.close()
            except:
                pass
            conn = None
            sleep(0.1)
    print "connected, sleep 2s then start again"
    sleep(2)
    conn.close()
and alternating between dropping and passing packets via:
# run on the qpidd/qdrouterd side:
while true; do
    echo "$(date): deleting"
    iptables -D INPUT -p tcp --dport 5648 -j DROP
    sleep 60
    echo "$(date): inserting"
    iptables -I INPUT -p tcp --dport 5648 -j DROP
    sleep 60
done
I *think* the more precise reproducer is when:
- "conn" is an established connection, heartbeats not expired
- we are dropping packets at the moment
- conn.close() is executed
This scenario can happen in goferd only very sporadically (I can elaborate if needed), so IMHO there is no need to fix it specifically for goferd. BUT it sounds like a residual memory leak in qpid-proton.
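The third condition above (conn.close() executed while packets are being dropped) can at least be bounded on the caller's side. The sketch below is an illustrative workaround idea only; close_with_timeout is not a proton API and this is not the fix shipped in 0.28.0-3. It runs the close in a daemon thread so the caller cannot block indefinitely, though a close that hangs (and any leak behind it) still happens in the background:

```python
# Defensive close: run conn.close() in a helper thread so a close that
# blocks (e.g. while packets are being dropped) cannot hang the caller.
# Illustrative sketch; 'close_with_timeout' is not part of proton.
import threading

def close_with_timeout(conn, timeout=5.0):
    """Call conn.close() in a daemon thread; return True if it finished in time."""
    done = threading.Event()

    def worker():
        try:
            conn.close()
        except Exception:
            pass  # a failed close is treated the same as a finished one
        done.set()

    t = threading.Thread(target=worker)
    t.daemon = True   # do not keep the process alive for a stuck close
    t.start()
    return done.wait(timeout)
```

A False return would indicate exactly the suspicious case described here: a close that did not complete while the network was blackholed.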
FYI there is yet another scenario - quite unusual, an evident networking misconfiguration - where even 0.28.0-3 has a memory leak (*maybe* the root cause is the same as for the #c17 reproducer).

The reproducer using goferd:
1) start goferd while no networking issues are present
2) drop _established_ traffic to qdrouterd's port; on the Satellite: iptables -I INPUT -p tcp --dport 5647 -m conntrack --ctstate ESTABLISHED -j DROP
3) observe goferd memory usage growing over time

The reproducer "without goferd/Satellite bits": replace goferd with the script from #c17.

So far, neither of these two reproducers (the script from #c17 with either iptables variant) is very applicable to Satellite, but they can happen in upstream proton or in other usage of the downstream packages. Cliff, Mike: is either of these reproducers an issue for you (elsewhere than in Satellite)?

Verified.

Version Tested: @Satellite 6.8.0 Snap 3.0

Steps followed:
~~~~~~~~~~~~~~~~~~~~~~
a. to speed up the reproducer, replaced the constants 10, 60 and 90 in various places with values 10 times lower: grep -e 90 -e 60 -e 10 $(rpm -ql $(rpm -qa | grep -e goferd -e katello-agent) | grep py$)
b. restarted the goferd service and monitored its memory usage
c. on the Content Host, dropped packets towards Satellite's port 5647: iptables -I OUTPUT -p tcp --dport=5647 -j DROP
d. waited an hour while monitoring goferd memory usage
~~~~~~~~~~~~~~~~~~~~~~

Observation: goferd memory consumption is stable over time.

Package versions:
~~~~~~~~~~~~~~~~~~~~~~
katello-agent noarch 3.5.4-1.el7sat
gofer noarch 2.12.5-7.el7sat
katello-host-tools noarch 3.5.4-1.el7sat
katello-host-tools-fact-plugin noarch 3.5.4-1.el7sat
python-gofer noarch 2.12.5-7.el7sat
python-gofer-proton noarch 2.12.5-7.el7sat
python-qpid-proton x86_64 0.28.0-3.el7
qpid-proton-c x86_64 0.28.0-3.el7
~~~~~~~~~~~~~~~~~~~~~~

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: Satellite 6.8 release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4366