Bug 1484028

Summary: qdrouterd segfaults on frequent goferd reconnect
Product: Red Hat Satellite Reporter: Pavel Moravec <pmoravec>
Component: Qpid Assignee: Mike Cressman <mcressma>
Status: CLOSED DUPLICATE QA Contact: Katello QA List <katello-qa-list>
Severity: high Docs Contact:
Priority: high    
Version: 6.2.11 CC: bbuckingham, pmoravec
Target Milestone: Unspecified   
Target Release: Unused   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-09-15 20:19:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Pavel Moravec 2017-08-22 13:26:43 UTC
Description of problem:
Under an unknown scenario within Satellite, qdrouterd on the Satellite segfaults with a backtrace matching https://issues.jboss.org/browse/ENTMQ-1162 / https://access.redhat.com/solutions/2042333 .

When running the scenario that reproduces that bug (outside any Satellite bits), I get the same crash on qdrouterd 0.4-22 / proton 0.9-16.


Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-16.el7.x86_64
qpid-dispatch-router-0.4-22.el7sat.x86_64


How reproducible:
100% within a few minutes


Steps to Reproduce:
(reproducer outside Satellite; for a Satellite-based reproducer, concurrently restart goferd in a loop on a few hundred Content Hosts)

1. Add many queues to qpidd:
for j in $(seq 0 10); do for i in $(seq 0 50); do qpid-config add queue pulp.pmoravec.${j}.${i}; done; echo $j; done

2. Start a qdrouterd that link-routes the pulp.* prefix to qpidd

3. Run the script below (set ROUTER_ADDRESS to your router) - the script just concurrently creates receivers on addresses pulp.pmoravec.[0-10].[0-50] :


import random
import threading
import traceback

from proton import ConnectionException
from proton.utils import BlockingConnection
from proton import SSLDomain

from time import sleep
from uuid import uuid4
 
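ROUTER_ADDRESS = "proton+amqp://dispatch.router.fqdn:5648"  # replace with your qdrouterd host:port
ADDRESS = "pulp.pmoravec"  # queue name prefix created in step 1
HEARTBEAT = 2              # heartbeat value passed to BlockingConnection
SLEEP_MIN = 1.9            # bounds (seconds) of the random delay between receiver attaches
SLEEP_MAX = 2.1
THREADS = 10               # number of concurrent receiver threads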
 
class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS, domain=None):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        self.domain = domain
        self.running = True
        self.nr = 1
 
    def connect(self):
        self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=self.domain, heartbeat=HEARTBEAT)
        self.conn.create_receiver('%s.%s' %(self.address, 0), name=str(uuid4()), dynamic=False, options=None)
 
    def reconnect(self):
        print "(%s): something got broken, reconnecting.." % self._id
        try:
            self.conn.close()
        except Exception:
            # closing an already broken connection may fail; ignore and open a new one
            print "(%s): receiver thread: failed to close connection" % self._id
        self.connect()
 
    def run(self):
        self.connect()
        while self.running:
            sleep(random.uniform(SLEEP_MIN,SLEEP_MAX))
            self.nr += 1
            try:
                self.recv = self.conn.create_receiver('%s.%s' %(self.address, self.nr), name=str(uuid4()), dynamic=False, options=None)
            except Exception as e:
                print "(%s): receiver failed (%s), retrying.." % (self._id, e)
                self.reconnect()
 
    def stop(self):
        self.running = False
 
threads = []
for i in range(THREADS):
    threads.append(ReceiverThread(i, '%s.%s' %(ADDRESS, i)))
    threads[i].start()
 
_in = raw_input("Press Enter to exit:")
 
for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()


4. Wait a few minutes for qdrouterd to segfault.
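
To see how the queues from step 1 relate to the addresses the script in step 3 actually attaches to, the naming logic can be replayed offline. This is only a sketch that mirrors the string formatting of the shell loop and of the script; it talks to neither qpidd nor qdrouterd, and the helper names created_queues and attached_addresses are made up for illustration.

```python
ADDRESS = "pulp.pmoravec"
THREADS = 10  # same value as in the reproducer script


def created_queues():
    """Queue names created by the shell loop in step 1 (seq 0 10 x seq 0 50)."""
    return set("%s.%d.%d" % (ADDRESS, j, i)
               for j in range(11) for i in range(51))


def attached_addresses(max_nr):
    """Addresses the script attaches to until self.nr reaches max_nr.

    Hypothetical helper: it only replays the address formatting of the
    script and ignores reconnects.
    """
    addrs = set()
    for thread_id in range(THREADS):
        prefix = "%s.%d" % (ADDRESS, thread_id)
        addrs.add(prefix + ".0")          # connect() attaches to suffix 0
        for nr in range(2, max_nr + 1):   # run() increments nr from 1 before first use
            addrs.add("%s.%d" % (prefix, nr))
    return addrs


queues = created_queues()
touched = attached_addresses(50)
print("queues created:     %d" % len(queues))
print("queues attached to: %d" % len(touched & queues))
print("never attached to:  %d" % len(queues - touched))
```

With THREADS = 10 the script only uses prefixes pulp.pmoravec.0 to pulp.pmoravec.9, and suffix 1 is skipped because self.nr starts at 1 and is incremented before its first use.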


Actual results:
qdrouterd segfaults


Expected results:
no qdrouterd segfault


Additional info:
Backtrace:

#0  pni_record_find (record=<optimized out>, record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
#1  pn_record_get (record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:120
#2  0x00007f9fb52e1593 in pn_connection_get_context (conn=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
#3  0x00007f9fb5525e61 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:937
#4  0x00007f9fb5533c95 in router_link_attach_handler (context=0x8de8d0, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
#5  0x00007f9fb552511c in handle_link_open (container=<optimized out>, pn_link=0x7f9f7895ecb0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:217
#6  process_handler (unused=<optimized out>, qd_conn=0x7f9f88025fe0, container=0x836cf0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:481
#7  handler (handler_context=0x836cf0, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f9f88025fe0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:643
#8  0x00007f9fb553795c in process_connector (cxtr=0x7f9f880188e0, qd_server=0x8f06f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#9  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#10 0x00007f9fb55389c0 in qd_server_run (qd=0x604030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:971
#11 0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7ffc5a7c2f2d "/etc/qpid-dispatch/qdrouterd.conf", 
    python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
#12 0x0000000000401950 in main (argc=3, argv=0x7ffc5a7c1688) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335

Comment 4 Pavel Moravec 2017-08-23 10:14:59 UTC
Sorry for the confusion; in fact there must be two different segfaults, hence two BZs. The second BZ will follow once I reproduce it (its backtrace should match the KCS / ENTMQ JIRA, but the reproducer will differ).

This BZ follows the reproducer described above but generates a different backtrace / coredump:

..
Program terminated with signal 11, Segmentation fault.
#0  qd_link_close (link=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:994
994	    if (link->pn_link)
(gdb) bt full
#0  qd_link_close (link=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:994
No locals.
#1  0x00007f8d0f62ea7b in qd_router_detach_routed_link (context=0x7f8cfc0a1ed0, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1253
        link = 0x0
        pn_link = <optimized out>
        ld = 0x7f8cfc0a1ed0
#2  0x00007f8d0f6320b6 in invoke_deferred_calls (conn=conn@entry=0x7f8cfc00cb30, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:304
        calls = {head = 0x0, tail = <optimized out>, scratch = 0x1b3fde0, size = <optimized out>}
        dc = 0x1b3fde0
#3  0x00007f8d0f63293d in process_connector (cxtr=0x7f8cfc010290, qd_server=0x1a9cff0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:397
        ctx = 0x7f8cfc00cb30
        events = <optimized out>
        passes = <optimized out>
#4  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
        work_done = 1
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f8cfc010290
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x1a9cff0
#5  0x00007f8d0f1a3e25 in start_thread (arg=0x7f8cf9dfa700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f8cf9dfa700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140243464333056, 6390606335961769466, 0, 140243464333760, 140243464333056, 0, -6361882968312732166, -6361931856609892870}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#6  0x00007f8d0e6f934d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
(gdb) 


Note that qd_link_close is called with link=0x0, so the link->pn_link check at container.c:994 dereferences a NULL pointer.
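
To compare this backtrace with the one from the description, each gdb frame line can be reduced to its function name. The following is a minimal sketch, not part of the original analysis: the regular expression and the helper name frame_functions are assumptions, and only a few frames copied from the two backtraces are used as sample input.

```python
import re

# Matches gdb frame lines such as "#3  0x00007f... in func (args) at file:line"
# as well as "#0  func (args) at file:line" (no address on the innermost frame).
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?([A-Za-z_]\w*) \(")


def frame_functions(backtrace):
    """Return the function names of a gdb backtrace, innermost frame first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


# Selected frames from the backtrace in the description.
DESCRIPTION_BT = """
#0  pni_record_find (record=<optimized out>, record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
#3  0x00007f9fb5525e61 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:937
#8  0x00007f9fb553795c in process_connector (cxtr=0x7f9f880188e0, qd_server=0x8f06f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
"""

# Selected frames from the backtrace in comment 4.
COMMENT4_BT = """
#0  qd_link_close (link=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:994
#1  0x00007f8d0f62ea7b in qd_router_detach_routed_link (context=0x7f8cfc0a1ed0, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1253
#3  0x00007f8d0f63293d in process_connector (cxtr=0x7f8cfc010290, qd_server=0x1a9cff0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:397
"""

first = frame_functions(DESCRIPTION_BT)
second = frame_functions(COMMENT4_BT)
print("crashing frames: %s vs %s" % (first[0], second[0]))
print("common frames: %s" % sorted(set(first) & set(second)))
```

The two crashes share the process_connector path in server.c but fault in different functions, which is why they are treated as separate segfaults.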

(The reproducer triggers the segfault _after_ iterating over the existing queues, once qpidd starts to complain "Error on attach: Node not found: pulp.pmoravec.2.83" or similar; the segfault can still be hit after these errors start to appear.)
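
A rough estimate of when the "Node not found" errors begin can be derived from the script alone: each thread increments self.nr once per loop pass, while step 1 only creates suffixes 0 to 50. The sketch below replays just that counter; it assumes no reconnect happens earlier (a reconnect does not reset self.nr anyway), and first_missing_attach is a hypothetical helper name.

```python
ADDRESS = "pulp.pmoravec"
QUEUES_PER_PREFIX = 51           # step 1 creates suffixes 0..50 per prefix
SLEEP_MIN, SLEEP_MAX = 1.9, 2.1  # sleep bounds from the reproducer script


def first_missing_attach(thread_id):
    """Return (loop pass, address) of the first attach to a non-existent queue."""
    nr = 1                       # initial value of self.nr
    loop_pass = 0
    while True:
        loop_pass += 1
        nr += 1                  # run() increments nr before attaching
        if nr >= QUEUES_PER_PREFIX:
            return loop_pass, "%s.%d.%d" % (ADDRESS, thread_id, nr)


loop_pass, address = first_missing_attach(2)
print("first missing address: %s (loop pass %d)" % (address, loop_pass))
print("reached after roughly %.0f to %.0f seconds"
      % (loop_pass * SLEEP_MIN, loop_pass * SLEEP_MAX))
```

Under these assumptions every thread runs out of existing queues at about the same time, and an address such as pulp.pmoravec.2.83 is reached some 30 loop passes later.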

The bug is reproducible with qpid-proton-0.9-20.el7 as well.

I don't have a system with qpid-proton-0.16.0-6.el7 / qpid-dispatch-0.8.0-9 or -10 now; I will try later on.

Comment 5 Pavel Moravec 2017-09-15 20:19:16 UTC
Unable to reproduce within Satellite; the relevant Satellite segfault is tracked under https://bugzilla.redhat.com/show_bug.cgi?id=1491160 . Closing this as WONTFIX (no fix required for Sat).

Comment 6 Pavel Moravec 2018-03-05 12:07:55 UTC
(Technically, as was just realized, the segfault is fixed by the 0.4-29 build / bug 1535891.)

*** This bug has been marked as a duplicate of bug 1535891 ***