Bug 1484028 - qdrouterd segfaults on frequent goferd reconnect
Summary: qdrouterd segfaults on frequent goferd reconnect
Keywords:
Status: CLOSED DUPLICATE of bug 1535891
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.2.11
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-22 13:26 UTC by Pavel Moravec
Modified: 2019-08-12 14:36 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-15 20:19:16 UTC
Target Upstream Version:




Links
Red Hat Bugzilla 1491160 (high, CLOSED): qdrouterd segfault when processing bursts of goferd requests (last updated 2021-03-11 15:45:54 UTC)

Internal Links: 1491160

Description Pavel Moravec 2017-08-22 13:26:43 UTC
Description of problem:
Under an as-yet-unidentified scenario within Satellite, qdrouterd on the Satellite server segfaults with a backtrace matching https://issues.jboss.org/browse/ENTMQ-1162 / https://access.redhat.com/solutions/2042333 .

When running the scenario that reproduces that bug (outside any Satellite bits), I get the same crash with qdrouterd 0.4-22 / proton 0.9-16.


Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-16.el7.x86_64
qpid-dispatch-router-0.4-22.el7sat.x86_64


How reproducible:
100%, within a few minutes


Steps to Reproduce:
(reproducer outside Satellite; for a Satellite-based reproducer, concurrently restart goferd in a loop on a few hundred Content Hosts)

1. Add many queues to qpidd:
for j in $(seq 0 10); do for i in $(seq 0 50); do qpid-config add queue pulp.pmoravec.${j}.${i}; done; echo $j; done

2. Start a qdrouterd that link-routes the pulp.* prefix to qpidd (a sample config sketch is shown below, after step 4)

3. Run the script below (set ROUTER_ADDRESS appropriately); it just concurrently creates receivers for addresses pulp.pmoravec.[0-10].[0-50]:


import random
import threading
import traceback

from proton import ConnectionException
from proton.utils import BlockingConnection
from proton import SSLDomain

from time import sleep
from uuid import uuid4
 
ROUTER_ADDRESS = "proton+amqp://dispatch.router.fqdn:5648"
ADDRESS = "pulp.pmoravec"
HEARTBEAT = 2
SLEEP_MIN = 1.9
SLEEP_MAX = 2.1
THREADS = 10
 
class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS, domain=None):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        self.domain = domain
        self.running = True
        self.nr = 1
 
    def connect(self):
        self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=self.domain, heartbeat=HEARTBEAT)
        self.conn.create_receiver('%s.%s' %(self.address, 0), name=str(uuid4()), dynamic=False, options=None)
 
    def reconnect(self):
        print "(%s): something got broken, reconnecting.." % self._id
        try:
            self.conn.close()
        except:
            print "(%s): receiver thread: failed to close connection" % self._id
            pass
        self.connect()
 
    def run(self):
        self.connect()
        while self.running:
            sleep(random.uniform(SLEEP_MIN,SLEEP_MAX))
            self.nr += 1
            try:
                self.recv = self.conn.create_receiver('%s.%s' %(self.address, self.nr), name=str(uuid4()), dynamic=False, options=None)
            except Exception as e:
                print "(%s): receiver failed, retrying.." % self._id
                self.reconnect()
 
    def stop(self):
        self.running = False
 
threads = []
for i in range(THREADS):
  threads.append(ReceiverThread(i, '%s.%s' %(ADDRESS, i)))
  threads[i].start()
 
_in = raw_input("Press Enter to exit:")
 
for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()


4. Wait a few minutes for qdrouterd to segfault.
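
For step 2, a minimal qdrouterd.conf sketch (this is not the Satellite-generated config; the listener/connector ports, SASL settings and connector name are assumptions, and SSL is left out). The essential part is the linkRoutePattern entity that link-routes the pulp. prefix to the connector pointing at qpidd:

router {
    mode: standalone
    router-id: test.router
}

# listener the reproducer script connects to (port matches ROUTER_ADDRESS above)
listener {
    addr: 0.0.0.0
    port: 5648
    sasl-mechanisms: ANONYMOUS
}

# on-demand connector to the local qpidd broker (plain AMQP assumed; Satellite normally uses SSL on 5671)
connector {
    name: broker
    addr: localhost
    port: 5672
    sasl-mechanisms: ANONYMOUS
    role: on-demand
}

# link-route everything under the pulp. prefix to the broker
linkRoutePattern {
    prefix: pulp.
    connector: broker
}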


Actual results:
qdrouterd segfaults


Expected results:
no qdrouterd segfault


Additional info:
Backtrace:

#0  pni_record_find (record=<optimized out>, record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
#1  pn_record_get (record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:120
#2  0x00007f9fb52e1593 in pn_connection_get_context (conn=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
#3  0x00007f9fb5525e61 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:937
#4  0x00007f9fb5533c95 in router_link_attach_handler (context=0x8de8d0, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
#5  0x00007f9fb552511c in handle_link_open (container=<optimized out>, pn_link=0x7f9f7895ecb0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:217
#6  process_handler (unused=<optimized out>, qd_conn=0x7f9f88025fe0, container=0x836cf0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:481
#7  handler (handler_context=0x836cf0, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f9f88025fe0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:643
#8  0x00007f9fb553795c in process_connector (cxtr=0x7f9f880188e0, qd_server=0x8f06f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#9  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#10 0x00007f9fb55389c0 in qd_server_run (qd=0x604030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:971
#11 0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7ffc5a7c2f2d "/etc/qpid-dispatch/qdrouterd.conf", 
    python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
#12 0x0000000000401950 in main (argc=3, argv=0x7ffc5a7c1688) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335

Comment 4 Pavel Moravec 2017-08-23 10:14:59 UTC
Sorry for the confusion; in fact there must be two different segfaults / two BZs. The second BZ will follow once I reproduce it (its backtrace shall match the KCS / ENTMQ JIRA, but the reproducer will differ).

This one follows the reproducer described above but generates a different backtrace / coredump:

..
Program terminated with signal 11, Segmentation fault.
#0  qd_link_close (link=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:994
994	    if (link->pn_link)
(gdb) bt full
#0  qd_link_close (link=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:994
No locals.
#1  0x00007f8d0f62ea7b in qd_router_detach_routed_link (context=0x7f8cfc0a1ed0, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1253
        link = 0x0
        pn_link = <optimized out>
        ld = 0x7f8cfc0a1ed0
#2  0x00007f8d0f6320b6 in invoke_deferred_calls (conn=conn@entry=0x7f8cfc00cb30, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:304
        calls = {head = 0x0, tail = <optimized out>, scratch = 0x1b3fde0, size = <optimized out>}
        dc = 0x1b3fde0
#3  0x00007f8d0f63293d in process_connector (cxtr=0x7f8cfc010290, qd_server=0x1a9cff0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:397
        ctx = 0x7f8cfc00cb30
        events = <optimized out>
        passes = <optimized out>
#4  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
        work_done = 1
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f8cfc010290
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x1a9cff0
#5  0x00007f8d0f1a3e25 in start_thread (arg=0x7f8cf9dfa700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f8cf9dfa700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140243464333056, 6390606335961769466, 0, 140243464333760, 140243464333056, 0, -6361882968312732166, -6361931856609892870}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#6  0x00007f8d0e6f934d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
(gdb) 


Note that qd_link_close is called with link=0x0.
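
To illustrate the failure mode, a minimal self-contained sketch (hypothetical code, not the actual qpid-dispatch source; only the link->pn_link test mirrors container.c:994, everything else here is an assumption):

#include <stddef.h>

/* hypothetical stand-ins; only the link->pn_link test mirrors container.c:994 */
typedef struct pn_link_t pn_link_t;
typedef struct { pn_link_t *pn_link; } qd_link_t;

static void qd_link_close(qd_link_t *link)
{
    if (link->pn_link) {   /* dereferences link unconditionally - crashes when link == NULL */
        /* ... detach/close the proton link ... */
    }
}

int main(void)
{
    qd_link_close(NULL);   /* NULL link, as passed by qd_router_detach_routed_link in frame #1 */
    return 0;
}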

(the reproducer triggers the segfault _after_ it has iterated over the existing queues, once qpidd starts to complain "Error on attach: Node not found: pulp.pmoravec.2.83" or similar; even after these errors start to appear, you can still hit the segfault)

The bug is reproducible with qpid-proton-0.9-20.el7 as well.

I don't have a system with qpid-proton-0.16.0-6.el7 / qpid-dispatch-0.8.0-9 or -10 at hand now; I will try that later on.

Comment 5 Pavel Moravec 2017-09-15 20:19:16 UTC
Unable to reproduce this within Satellite, and the relevant Satellite segfault is tracked under https://bugzilla.redhat.com/show_bug.cgi?id=1491160 . So closing this as WONTFIX (no fix required for Satellite).

Comment 6 Pavel Moravec 2018-03-05 12:07:55 UTC
(technically, as was just realized, the segfault is fixed by the 0.4-29 build / bug 1535891)

*** This bug has been marked as a duplicate of bug 1535891 ***

