Bug 1319165
| Summary: | goferd close_wait leak and occasional segfault when qpidd restarted |
|---|---|
| Product: | Red Hat Satellite |
| Component: | katello-agent |
| Version: | 6.1.6 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Reporter: | Pavel Moravec <pmoravec> |
| Assignee: | Cliff Jansen <cjansen> |
| QA Contact: | Peter Ondrejka <pondrejk> |
| CC: | anerurka, aperotti, bkearney, brubisch, cduryee, cjansen, dmoessne, egolov, jentrena, jross, mcressma, mmccune, mmithaiw, mvanderw, mverma, oshtaier, patrick.mcquade, pdwyer, sghai, ssherkar, xdmoon |
| Target Milestone: | Unspecified |
| Target Release: | Unused |
| Keywords: | PrioBumpGSS, Reopened, Triaged |
| Fixed In Version: | qpid-proton-0.9-20 |
| Doc Type: | Bug Fix |
| Clones: | 1530709 (view as bug list) |
| Last Closed: | 2018-01-19 14:31:57 UTC |
| Type: | Bug |
| Bug Blocks: | 1122832, 1353215 |
Description Pavel Moravec 2016-03-18 15:57:05 UTC
Re occasional segfault:
Letting the reproducer run with the "idle-timeout-seconds: 0" parameter commented out, goferd segfaulted with this backtrace:
#0 pn_reactor_selectable (reactor=0x0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/reactor/reactor.c:189
sel = <optimized out>
#1 0x00007f9a3a5bc2e3 in _wrap_pn_reactor_attachments (self=<optimized out>, args=<optimized out>)
at /usr/src/debug/qpid-proton-0.9/proton-c/bindings/python/cprotonPYTHON_wrap.c:23077
_swig_thread_allow = 0x7f9a58147490
arg1 = 0x0
argp1 = 0x0
res1 = <optimized out>
obj0 = None
result = 0x0
..
#27 PyEval_EvalFrameEx (
f=f@entry=Frame 0x7f9a4c004230, for file /usr/lib64/python2.7/site-packages/proton/utils.py, line 201, in __init__ (self=<BlockingConnection(disconnected=False, timeout=None) at remote 0x7f9a26303c50>, url='amqp://satellite.example.com:5647', timeout=None, container=None, ssl_domain=None, heartbeat=10), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
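(Reading the trace: BlockingConnection.__init__ creates the reactor on which pn_reactor_selectable() is called; here that reactor is NULL (reactor=0x0) and is dereferenced anyway, consistent with the observation below that proton does not check the return value when reactor creation fails, e.g. once the process runs out of file descriptors.)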
I expect the segfault and the close_wait leak have a common cause, hence reporting both issues in one BZ.
Reproducer of both the CLOSE_WAIT leak and the segfault:
1. Restart qpidd frequently (no qdrouterd.conf change needed):
while true; do service qpidd restart; sleep 10; sleep $(($((RANDOM))%5)); done
2. Run this script:
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton.utils import BlockingConnection
import traceback
import random
while True:
    sleep(random.uniform(0.3, 3))
    try:
        conn = BlockingConnection("proton+amqp://pmoravec-sat61-rhel7.gsslab.brq.redhat.com:5648",
                                  ssl_domain=None, heartbeat=2)
        rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                                   dynamic=False, options=None)
        print "sleeping.."
        sleep(random.uniform(2.1, 3))
        try:
            if conn:
                conn.close()
        except Exception:
            pass
    except Exception:
        pass
When netstat shows it has approx 340 TCP connections (almost all in CLOSE_WAIT), the python script coredumps.
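(A plausible arithmetic check, assuming the default ulimit -n of 1024: per the lsof output later in this report, each leaked connection pins three file descriptors, two pipe ends plus one socket, so ~340 connections come to ~1020 descriptors, right at the limit.)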
(In reply to Pavel Moravec from comment #4)
> When netstat shows it has approx 340 TCP connections (almost all in
> CLOSE_WAIT), the python script coredumps.

Sounds like reaching ulimit -n caused the segfault.

(In reply to Pavel Moravec from comment #5)
> (In reply to Pavel Moravec from comment #4)
> > When netstat shows it has approx 340 TCP connections (almost all in
> > CLOSE_WAIT), the python script coredumps.
>
> Sounds like reaching ulimit -n caused the segfault.

.. which suggests there are 2 separate bugs:

1) the close_waits
2) no check of the return value (of proton failing to open a local socket?) that is then used as a reference/pointer, causing the segfault

1 leads to 2, but does not need to be its only trigger. Anyway, leaving it to the devels whether to split this into 2 separate BZs or not.

Checking what proton versions are affected:

Affected:
- 0.9-14.bz1317484.el7
- 0.9-12.el7sat
- 0.9-11.el7

Expected to be affected:
- 0.9-7.el7 (here the reproducer froze, but all established TCP connections went to CLOSE_WAIT)

Haven't checked proton upstream - let me know how to build it and I can try.

Checking further packages, both:
- qpid-proton-0.10-2
- qpid-proton-0.12.0-1

are imho affected by both errors (close_wait and segfault). Some minor observations follow (can be ignored - the main message is "the packages are affected the same"):

- 0.12.0-1 behaves like the 0.9-* versions - both close_wait and segfault happen the same way
- 0.10-2 has a deadlock/livelock when calling connection.close(), with various backtraces seen when breaking the script. After a while, connections went to CLOSE_WAIT, which I assume is the 1st error being hit (the segfault hasn't been hit because not enough close_waits could accumulate, but I expect it is not fixed).

Just for the sake of completeness, the backtraces seen:

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 227, in wait
    self.container.process()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 155, in process
    result = pn_reactor_process(self._impl)
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 226, in wait
    while not condition():
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 214, in <lambda>
    self.wait(lambda: not (self.conn.state & Endpoint.REMOTE_ACTIVE),
  File "/usr/lib64/python2.7/site-packages/proton/__init__.py", line 2513, in state
    return pn_connection_state(self._impl)
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 227, in wait
    self.container.process()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 156, in process
    self._check_errors()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 148, in _check_errors
    if self.errors:
  File "/usr/lib64/python2.7/site-packages/proton/wrapper.py", line 62, in __getattr__
    def __getattr__(self, name):
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 226, in wait
    while not condition():
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 227, in wait
    self.container.process()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 156, in process
    self._check_errors()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 148, in _check_errors
    if self.errors:
  File "/usr/lib64/python2.7/site-packages/proton/wrapper.py", line 63, in __getattr__
    attrs = self.__dict__["_attrs"]
KeyboardInterrupt

sleeping..
^CTraceback (most recent call last):
  File "bz1319165-closeWait-segfault.py", line 22, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 215, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 227, in wait
    self.container.process()
KeyboardInterrupt

The file descriptor leak is caused by the lack of a call to BlockingConnection.close() if an exception interrupts the setup of the receiver.

The use of a try..finally block, or an equivalent Python mechanism to guard resource usage, is the correct way to use a BlockingConnection, where the finally clause calls close() on the blocking connection.

https://issues.apache.org/jira/browse/PROTON-1175 has been raised to remedy the lack of information about this in the documentation and/or the examples.

https://issues.apache.org/jira/browse/PROTON-1176 has been raised separately to deal with the collateral core dump when new reactors (needed by new BlockingConnections) cannot be created but are referenced anyway.

(In reply to Cliff Jansen from comment #10)
> The file descriptor leak is caused by the lack of a call to
> BlockingConnection.close() if an exception interrupts the setup of the
> receiver.
>
> The use of a try..finally block, or equivalent Python mechanism to guard
> resource usage is the correct way to use a BlockingConnection, where the
> finally clause calls close() on the blocking connection.

So the script should be corrected to:

try:
    rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                               dynamic=False, options=None)
except Exception:
    if conn:
        conn.close()
    pass

to prevent the leak?

Doing so, I see no CLOSE_WAITs any more (but I do see a small memory consumption increase, at least in python-qpid-proton-0.9-12.el7sat.x86_64).

Cliff, can you confirm the "patch of script", and do you see the mem.leak as well?
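A minimal sketch of the pattern Cliff describes (not taken from this BZ; the broker URL and queue name are reused from the reproducer), where the finally clause guarantees a matching close() even when create_receiver() raises mid-restart:

#!/usr/bin/python
from uuid import uuid4
from proton.utils import BlockingConnection

# The connection is created first; everything that can fail afterwards
# (receiver setup, receiving) sits inside try, so finally always runs.
conn = BlockingConnection("proton+amqp://pmoravec-sat61-rhel7.gsslab.brq.redhat.com:5648",
                          ssl_domain=None, heartbeat=2)
try:
    rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                               dynamic=False, options=None)
    # ... use the receiver here ...
finally:
    conn.close()    # releases the socket and the reactor's pipe FDs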
Btw. the mem.leak (probably smaller) is present even without the qpidd restarts (qpidd is just used as the end-point the script subscribes to via qdrouterd linkRouting).

Full reproducer of the mem.leak:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton.utils import BlockingConnection
import traceback
import random

while True:
    sleep(random.uniform(0.3, 3))
    try:
        conn = BlockingConnection("proton+amqp://pmoravec-sat61-rhel7.gsslab.brq.redhat.com:5648",
                                  ssl_domain=None, heartbeat=2)
        try:
            rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                                       dynamic=False, options=None)
        except Exception:
            if conn:
                conn.close()
            pass
        print "sleeping.."
        sleep(random.uniform(2.1, 3))
        try:
            if conn:
                conn.close()
        except Exception:
            pass
    except Exception:
        pass

Your patch to the script matches mine for testing the file descriptor leak. However, I believe that the more Pythonic mechanism for dealing with a resource is to use try...finally blocks (as used by the sync_client.py example). But in the end, any defensive coding that ensures a matching BlockingConnection.close() for each one created will do.

Concerning the slow memory growth that you reported, I confirm that I see it too. To the best of my knowledge, the Proton-C connections and reactors used by the BlockingXXX classes are not fully cleaned up (but they obviously are partially cleaned up, since the tcp sockets and pipe file descriptors are not piling up). I seem to be able to force similar leaks from the simple helloworld example too.

I will enter JIRAs in Qpid Proton for this cleanup deficiency in reactive containers and the corresponding blocking versions.

Moving 6.2 bugs out to sat-backlog.

https://issues.apache.org/jira/browse/PROTON-1312 is the last pending upstream JIRA regarding the slow memory growth mentioned above.

I can reproduce it using goferd inter-connected to qpidd via qdrouterd. Reliably, one qpidd restart triggers one TCP CLOSE_WAIT. Need to find a reproducer outside goferd, or find a bug in goferd's usage of the proton library.

Here is a standalone reproducer outside Satellite, using just proton bits:
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from logging import getLogger
from proton import ConnectionException, Timeout
from proton.utils import LinkDetached, BlockingConnection
ROUTER_ADDRESS = "proton+amqp://pmoravec-caps62-rhel7.gsslab.brq.redhat.com:5648"
ADDRESS = "pulp.agent"
HEARTBEAT = 10
log = getLogger(__name__)
def open_conn_and_recv(conn, recv):
    conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
    recv = conn.create_receiver(ADDRESS, name=str(uuid4()), dynamic=False, options=None)
    return (conn, recv)
def log_and_repair(e, conn, recv):
    log.error(str(e))
    sleep(HEARTBEAT)
    # close the receiver and then the connection
    auxrecv = recv
    recv = None
    try:
        auxrecv.close()
    except Exception:
        pass
    # close connection
    auxconn = conn
    conn = None
    try:
        auxconn.close()
        log.warn("01768892: closed connection (the lengthy way)")
    except Exception, pe:
        log.debug(str(pe))
    return open_conn_and_recv(conn, recv)
conn = None
recv = None
conn, recv = open_conn_and_recv(conn, recv)
while True:
    print conn
    print recv
    try:
        # wait for a message
        while True:
            try:
                impl = recv.receive(3)
            except Timeout:
                pass
    except LinkDetached, le:
        if le.condition != 'amqp:not-found:':
            conn, recv = log_and_repair(le, conn, recv)
        else:
            log.error("not found: %s" % str(le))
            raise Exception(*le.args)
    except ConnectionException, pe:
        conn, recv = log_and_repair(pe, conn, recv)
What the code does:
- establishes a connection to qdrouterd (which link-routes everything to a qpidd that has the durable queue pulp.agent)
- creates a consumer and fetches messages every 3 seconds
- when a link or connection exception is raised, both the receiver and the connection are closed and reopened, just like goferd does (a try/finally variant is sketched below)
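For comparison, a minimal sketch (not from this BZ) of the same open/receive/repair cycle rearranged around try/finally, per Cliff's earlier recommendation; it reuses the script's ROUTER_ADDRESS, ADDRESS, HEARTBEAT constants and imports, and simplifies away the 'amqp:not-found:' special case:

while True:
    try:
        conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
    except Exception, e:
        log.error(str(e))
        sleep(HEARTBEAT)            # broker still down; back off and retry
        continue
    try:
        recv = conn.create_receiver(ADDRESS, name=str(uuid4()), dynamic=False, options=None)
        while True:
            try:
                recv.receive(3)     # wait for a message
            except Timeout:
                pass
    except (LinkDetached, ConnectionException), e:
        log.error(str(e))
        sleep(HEARTBEAT)            # reconnect on the next loop iteration
    finally:
        try:
            conn.close()            # matching close() for every connection
        except Exception:
            pass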
To reproduce:
- have qpidd with the durable queue pulp.agent
- have qdrouterd link-routing pulp.* to qpidd
- update ROUTER_ADDRESS in the script to the qdrouterd URL and port
- run the script
- meanwhile, restart qpidd from time to time (say every 15 seconds, given the 10s sleep in the script); a suitable loop follows this list
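For example, a variant of the restart loop from earlier in this report (assuming the same sysvinit service name), with the interval stretched to roughly 15 seconds:

while true; do service qpidd restart; sleep 15; done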
Current behaviour:
- every qpidd restart adds a new CLOSE_WAIT connection
- in tcpdump:
  - the AMQP performative "detach" is sent from the router
  - followed by detach+close from the client after a delay
  - followed by close from the router
  - immediately followed by a FIN+ACK TCP frame from the router
  - replied to by an ACK TCP frame from the client - but no FIN :(
Proper behaviour:
- no CLOSE_WAIT connections
- tcpdump shows a FIN+ACK from the client as well
Cliff, could you please verify whether the above script uses the proton API properly?

Hello, is there progress on this? I'm getting burned by this on RHEL 6.8.

I can reproduce the problem on either proton 0.9-16 or on the patched one per bz1318015. Standalone reproducer:

1) Setup:
- qpidd with a _durable_ queue pulp.bz1319165
- qdrouterd link-routing pulp.* to qpidd and listening for the client
- this python client (update the URL in "conn = .." accordingly):

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException, Timeout
from proton.utils import BlockingConnection
import traceback
import random

while True:
    sleep(random.uniform(0.3, 3))
    try:
        conn = BlockingConnection("proton+amqp://pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:5648",
                                  ssl_domain=None, heartbeat=2)
        rec = conn.create_receiver("pulp.bz1319165", name=str(uuid4()),
                                   dynamic=False, options=None)
        print "receiving.."
        try:
            impl = rec.receive(3)
        except Timeout:
            print ".. timeouted (ok)"
            pass
        except Exception:
            print "disconnected"
            connected = False
            pass
        print "closing.."
        try:
            print "closing 2"
            if conn:
                conn.close()
        except Exception:
            pass
    except Exception:
        try:
            print "closing 3"
            if conn:
                conn.close()
        except Exception:
            pass

2) run the client

3) while true; do service qpidd restart; sleep 10; sleep $(($((RANDOM))%5)); done

4) monitor lsof (and netstat) - from time to time, a new triple of FDs appears, like:

> python 22780 root 3r FIFO 0,8 0t0 604583 pipe
> python 22780 root 4w FIFO 0,8 0t0 604583 pipe
> python 22780 root 5u IPv4 604585 0t0 TCP pmoravec-rhel72.gsslab.brq2.redhat.com:35806->pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:5648 (CLOSE_WAIT)

2 pipes and one CLOSE_WAIT socket, every time.

The key point is to restart qpidd during "rec.receive(3)"; that is when the FD leak appears. Commenting out that line, there is no FD leak.

.. and the FD leak is also present in proton 0.16.0-3.

Created attachment 1311425 [details]
proposed patch for proton 0.9

Assumes all patches from https://bugzilla.redhat.com/show_bug.cgi?id=1318015 are applied before this one.

Verified on Sat 6.2.14, both on rhel 6 and 7; no increase in close wait connections experienced when restarting qpidd.

This was fixed in 6.2.14.