Bug 1319165
| Summary: | goferd close_wait leak and occasional segfault when qpidd restarted |
|---|---|
| Product: | Red Hat Satellite |
| Component: | katello-agent |
| Version: | 6.1.6 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Reporter: | Pavel Moravec <pmoravec> |
| Assignee: | Cliff Jansen <cjansen> |
| QA Contact: | Peter Ondrejka <pondrejk> |
| CC: | anerurka, aperotti, bkearney, brubisch, cduryee, cjansen, dmoessne, egolov, jentrena, jross, mcressma, mmccune, mmithaiw, mvanderw, mverma, oshtaier, patrick.mcquade, pdwyer, sghai, ssherkar, xdmoon |
| Target Milestone: | Unspecified |
| Target Release: | Unused |
| Keywords: | PrioBumpGSS, Reopened, Triaged |
| Fixed In Version: | qpid-proton-0.9-20 |
| Doc Type: | Bug Fix |
| Clones: | 1530709 (view as bug list) |
| Last Closed: | 2018-01-19 14:31:57 UTC |
| Type: | Bug |
| Bug Blocks: | 1122832, 1353215 |
Description Pavel Moravec 2016-03-18 15:57:05 UTC
Re occasional segfault:
Letting the reproducer run with the "idle-timeout-seconds: 0" parameter commented out, goferd segfaulted with this backtrace:
#0 pn_reactor_selectable (reactor=0x0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/reactor/reactor.c:189
sel = <optimized out>
#1 0x00007f9a3a5bc2e3 in _wrap_pn_reactor_attachments (self=<optimized out>, args=<optimized out>)
at /usr/src/debug/qpid-proton-0.9/proton-c/bindings/python/cprotonPYTHON_wrap.c:23077
_swig_thread_allow = 0x7f9a58147490
arg1 = 0x0
argp1 = 0x0
res1 = <optimized out>
obj0 = None
result = 0x0
..
#27 PyEval_EvalFrameEx (
f=f@entry=Frame 0x7f9a4c004230, for file /usr/lib64/python2.7/site-packages/proton/utils.py, line 201, in __init__ (self=<BlockingConnection(disconnected=False, timeout=None) at remote 0x7f9a26303c50>, url='amqp://satellite.example.com:5647', timeout=None, container=None, ssl_domain=None, heartbeat=10), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
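(Reading the trace: BlockingConnection.__init__ creates the reactor on which pn_reactor_selectable() is called; here that reactor is NULL (reactor=0x0) and is dereferenced anyway, consistent with the observation below that proton does not check the return value when reactor creation fails, e.g. once the process runs out of file descriptors.)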
I expect the segfault and the close_wait leak have a common cause, hence reporting both issues in one BZ.
Reproducer of both the CLOSE_WAIT leak and the segfault:
1. Restart qpidd frequently (no qdrouterd.conf change needed):
while true; do service qpidd restart; sleep 10; sleep $(($((RANDOM))%5)); done
2. Run this script:
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton.utils import BlockingConnection
import traceback
import random
while True:
    sleep(random.uniform(0.3, 3))
    try:
        conn = BlockingConnection("proton+amqp://pmoravec-sat61-rhel7.gsslab.brq.redhat.com:5648",
                                  ssl_domain=None, heartbeat=2)
        rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                                   dynamic=False, options=None)
        print "sleeping.."
        sleep(random.uniform(2.1, 3))
        try:
            if conn:
                conn.close()
        except Exception:
            pass
    except Exception:
        pass
When netstat shows it has approx 340 TCP connections (almost all in CLOSE_WAIT), the python script coredumps.
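(A plausible arithmetic check, assuming the default ulimit -n of 1024: per the lsof output later in this report, each leaked connection pins three file descriptors, two pipe ends plus one socket, so ~340 connections come to ~1020 descriptors, right at the limit.)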
(In reply to Pavel Moravec from comment #4)
> When netstat shows it has approx 340 TCP connections (almost all in
> CLOSE_WAIT), the python script coredumps.

Sounds like reaching ulimit -n caused the segfault.

(In reply to Pavel Moravec from comment #5)
> (In reply to Pavel Moravec from comment #4)
> > When netstat shows it has approx 340 TCP connections (almost all in
> > CLOSE_WAIT), the python script coredumps.
>
> Sounds like reaching ulimit -n caused the segfault.

.. which suggests there are 2 separate bugs:

1) the close_waits
2) no check of the return value (of proton failing to open a local socket?) that is then used as a reference/pointer, causing the segfault

1 leads to 2, but does not need to be its only trigger. Anyway, leaving it to the devels whether to split this into 2 separate BZs or not.

Checking what proton versions are affected:

Affected:
- 0.9-14.bz1317484.el7
- 0.9-12.el7sat
- 0.9-11.el7

Expected to be affected:
- 0.9-7.el7 (here the reproducer froze, but all established TCP connections went to CLOSE_WAIT)

Haven't checked proton upstream - let me know how to build it and I can try.

Checking further packages, both:
- qpid-proton-0.10-2
- qpid-proton-0.12.0-1

are imho affected by both errors (close_wait and segfault). Some minor observations follow (can be ignored - the main message is "the packages are affected the same"):

- 0.12.0-1 behaves like the 0.9-* versions - both close_wait and segfault happen the same way
- 0.10-2 has a deadlock/livelock when calling connection.close(), with various backtraces seen when breaking the script. After a while, connections went to CLOSE_WAIT, which I assume is the 1st error being hit (the segfault hasn't been hit because not enough close_waits could accumulate, but I expect it is not fixed).

Just for the sake of completeness, the backtraces seen:

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 227, in wait
    self.container.process()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 155, in process
    result = pn_reactor_process(self._impl)
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 226, in wait
    while not condition():
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 214, in <lambda>
    self.wait(lambda: not (self.conn.state & Endpoint.REMOTE_ACTIVE),
  File "/usr/lib64/python2.7/site-packages/proton/__init__.py", line 2513, in state
    return pn_connection_state(self._impl)
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 227, in wait
    self.container.process()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 156, in process
    self._check_errors()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 148, in _check_errors
    if self.errors:
  File "/usr/lib64/python2.7/site-packages/proton/wrapper.py", line 62, in __getattr__
    def __getattr__(self, name):
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 226, in wait
    while not condition():
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 227, in wait
    self.container.process()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 156, in process
    self._check_errors()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 148, in _check_errors
    if self.errors:
  File "/usr/lib64/python2.7/site-packages/proton/wrapper.py", line 63, in __getattr__
    attrs = self.__dict__["_attrs"]
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 227, in wait
    self.container.process()
KeyboardInterrupt

The file descriptor leak is caused by the lack of a call to BlockingConnection.close() if an exception interrupts the setup of the receiver.

The use of a try..finally block, or an equivalent Python mechanism to guard resource usage, is the correct way to use a BlockingConnection, where the finally clause calls close() on the blocking connection.

https://issues.apache.org/jira/browse/PROTON-1175 has been raised to remedy the lack of information about this in the documentation and/or the examples.

https://issues.apache.org/jira/browse/PROTON-1176 has been raised separately to deal with the collateral core dump when new reactors (needed by new BlockingConnections) cannot be created but are referenced anyway.

(In reply to Cliff Jansen from comment #10)
> The file descriptor leak is caused by the lack of a call to
> BlockingConnection.close() if an exception interrupts the setup of the
> receiver.
>
> The use of a try..finally block, or equivalent Python mechanism to guard
> resource usage is the correct way to use a BlockingConnection, where the
> finally clause calls close() on the blocking connection.

So the script should be corrected to:

try:
    rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                               dynamic=False, options=None)
except Exception:
    if conn:
        conn.close()
    pass

to prevent the leak?

Doing so, I see no CLOSE_WAITs any more (but I do see a small memory consumption increase, at least in python-qpid-proton-0.9-12.el7sat.x86_64).

Cliff, can you confirm the "patch of script", and do you see the mem.leak as well?
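A minimal sketch of the pattern Cliff describes (not taken from this BZ; the broker URL and queue name are reused from the reproducer), where the finally clause guarantees a matching close() even when create_receiver() raises mid-restart:

#!/usr/bin/python
from uuid import uuid4
from proton.utils import BlockingConnection

# The connection is created first; everything that can fail afterwards
# (receiver setup, receiving) sits inside try, so finally always runs.
conn = BlockingConnection("proton+amqp://pmoravec-sat61-rhel7.gsslab.brq.redhat.com:5648",
                          ssl_domain=None, heartbeat=2)
try:
    rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                               dynamic=False, options=None)
    # ... use the receiver here ...
finally:
    conn.close()    # releases the socket and the reactor's pipe FDs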
Btw. the mem.leak (probably smaller) is present even without the qpidd restarts (qpidd is just used as the end-point the script subscribes to via qdrouterd linkRouting).

Full reproducer of the mem.leak:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton.utils import BlockingConnection
import traceback
import random

while True:
    sleep(random.uniform(0.3, 3))
    try:
        conn = BlockingConnection("proton+amqp://pmoravec-sat61-rhel7.gsslab.brq.redhat.com:5648",
                                  ssl_domain=None, heartbeat=2)
        try:
            rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                                       dynamic=False, options=None)
        except Exception:
            if conn:
                conn.close()
            pass
        print "sleeping.."
        sleep(random.uniform(2.1, 3))
        try:
            if conn:
                conn.close()
        except Exception:
            pass
    except Exception:
        pass

Your patch to the script matches mine for testing the file descriptor leak. However, I believe that the more Pythonic mechanism for dealing with a resource is to use try...finally blocks (as used by the sync_client.py example). But in the end, any defensive coding that ensures a matching BlockingConnection.close() for each one created will do.

Concerning the slow memory growth that you reported, I confirm that I see it too. To the best of my knowledge, the Proton-C connections and reactors used by the BlockingXXX classes are not fully cleaned up (but they obviously are partially cleaned up, since the tcp sockets and pipe file descriptors are not piling up). I seem to be able to force similar leaks from the simple helloworld example too.

I will enter JIRAs in Qpid Proton for this cleanup deficiency in reactive containers and the corresponding blocking versions.

Moving 6.2 bugs out to sat-backlog.

https://issues.apache.org/jira/browse/PROTON-1312 is the last pending upstream JIRA regarding the slow memory growth mentioned above.

I can reproduce it using goferd inter-connected to qpidd via qdrouterd. Reliably, one qpidd restart triggers one TCP CLOSE_WAIT. Need to find a reproducer outside goferd, or find a bug in goferd's usage of the proton library.

Here is a standalone reproducer outside Satellite, using just proton bits:
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from logging import getLogger
from proton import ConnectionException, Timeout
from proton.utils import LinkDetached, BlockingConnection
ROUTER_ADDRESS = "proton+amqp://pmoravec-caps62-rhel7.gsslab.brq.redhat.com:5648"
ADDRESS = "pulp.agent"
HEARTBEAT = 10
log = getLogger(__name__)
def open_conn_and_recv(conn, recv):
    conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
    recv = conn.create_receiver(ADDRESS, name=str(uuid4()), dynamic=False, options=None)
    return (conn, recv)
def log_and_repair(e, conn, recv):
    log.error(str(e))
    sleep(HEARTBEAT)
    # close the receiver and then the connection
    auxrecv = recv
    recv = None
    try:
        auxrecv.close()
    except Exception:
        pass
    # close connection
    auxconn = conn
    conn = None
    try:
        auxconn.close()
        log.warn("01768892: closed connection (the lengthy way)")
    except Exception, pe:
        log.debug(str(pe))
    return open_conn_and_recv(conn, recv)
conn = None
recv = None
conn, recv = open_conn_and_recv(conn, recv)
while True:
    print conn
    print recv
    try:
        # wait for a message
        while True:
            try:
                impl = recv.receive(3)
            except Timeout:
                pass
    except LinkDetached, le:
        if le.condition != 'amqp:not-found:':
            conn, recv = log_and_repair(le, conn, recv)
        else:
            log.error("not found: %s" % str(le))
            raise Exception(*le.args)
    except ConnectionException, pe:
        conn, recv = log_and_repair(pe, conn, recv)
What the code does:
- establishes a connection to qdrouterd (which link-routes everything to a qpidd that has the durable queue pulp.agent)
- creates a consumer and fetches messages every 3 seconds
- when a link or connection exception is raised, both the receiver and the connection are closed and reopened, just like goferd does (a try/finally variant is sketched below)
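For comparison, a minimal sketch (not from this BZ) of the same open/receive/repair cycle rearranged around try/finally, per Cliff's earlier recommendation; it reuses the script's ROUTER_ADDRESS, ADDRESS, HEARTBEAT constants and imports, and simplifies away the 'amqp:not-found:' special case:

while True:
    try:
        conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
    except Exception, e:
        log.error(str(e))
        sleep(HEARTBEAT)            # broker still down; back off and retry
        continue
    try:
        recv = conn.create_receiver(ADDRESS, name=str(uuid4()), dynamic=False, options=None)
        while True:
            try:
                recv.receive(3)     # wait for a message
            except Timeout:
                pass
    except (LinkDetached, ConnectionException), e:
        log.error(str(e))
        sleep(HEARTBEAT)            # reconnect on the next loop iteration
    finally:
        try:
            conn.close()            # matching close() for every connection
        except Exception:
            pass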
To reproduce:
- have qpidd with the durable queue pulp.agent
- have qdrouterd link-routing pulp.* to qpidd
- update ROUTER_ADDRESS in the script to the qdrouterd URL and port
- run the script
- meanwhile, restart qpidd from time to time (say every 15 seconds, given the 10s sleep in the script); a suitable loop follows this list
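For example, a variant of the restart loop from earlier in this report (assuming the same sysvinit service name), with the interval stretched to roughly 15 seconds:

while true; do service qpidd restart; sleep 15; done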
Current behaviour:
- every qpidd restart adds a new CLOSE_WAIT connection
- in tcpdump:
  - the AMQP performative "detach" is sent from the router
  - followed by detach+close from the client after a delay
  - followed by close from the router
  - immediately followed by a FIN+ACK TCP frame from the router
  - replied to by an ACK TCP frame from the client - but no FIN :(
Proper behaviour:
- no CLOSE_WAIT connections
- tcpdump shows a FIN+ACK from the client as well
Cliff, could you please verify whether the above script uses the proton API properly?

Hello, is there progress on this? I'm getting burned by this on RHEL 6.8.

I can reproduce the problem on either proton 0.9-16 or on the patched one per bz1318015. Standalone reproducer:

1) Setup:
- qpidd with a _durable_ queue pulp.bz1319165
- qdrouterd link-routing pulp.* to qpidd and listening for the client
- this python client (update the URL in "conn = .." accordingly):

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException, Timeout
from proton.utils import BlockingConnection
import traceback
import random

while True:
    sleep(random.uniform(0.3, 3))
    try:
        conn = BlockingConnection("proton+amqp://pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:5648",
                                  ssl_domain=None, heartbeat=2)
        rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                                   dynamic=False, options=None)
        print "receiving.."
        try:
            impl = rec.receive(3)
        except Timeout:
            print ".. timeouted (ok)"
            pass
        except Exception:
            print "disconnected"
            connected = False
            pass
        print "closing.."
        try:
            print "closing 2"
            if conn:
                conn.close()
        except Exception:
            pass
    except Exception:
        try:
            print "closing 3"
            if conn:
                conn.close()
        except Exception:
            pass

2) run the client

3) while true; do service qpidd restart; sleep 10; sleep $(($((RANDOM))%5)); done

4) monitor lsof (and netstat) - from time to time, a new triple of FDs appears, like:

> python 22780 root 3r FIFO 0,8 0t0 604583 pipe
> python 22780 root 4w FIFO 0,8 0t0 604583 pipe
> python 22780 root 5u IPv4 604585 0t0 TCP pmoravec-rhel72.gsslab.brq2.redhat.com:35806->pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:5648 (CLOSE_WAIT)

2 pipes and one CLOSE_WAIT socket, every time.

The key point is to restart qpidd during "rec.receive(3)"; that is when the FD leak appears. Commenting out that line, there is no FD leak.

.. and the FD leak is also present in proton 0.16.0-3.

Created attachment 1311425 [details]
proposed patch for proton 0.9

Assumes all patches from https://bugzilla.redhat.com/show_bug.cgi?id=1318015 are applied before this one.

Verified on Sat 6.2.14, both on rhel 6 and 7; no increase in close wait connections experienced when restarting qpidd.

This was fixed in 6.2.14.