Bug 1553189 - qdrouterd segfaults under load of clients (dis)connects
Summary: qdrouterd segfaults under load of clients (dis)connects
Keywords:
Status: CLOSED DUPLICATE of bug 1561876
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On: 1553116
Blocks:
 
Reported: 2018-03-08 13:25 UTC by Pavel Moravec
Modified: 2022-03-13 14:45 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1553116
Environment:
Last Closed: 2018-11-09 19:02:27 UTC
Target Upstream Version:
Embargoed:



Description Pavel Moravec 2018-03-08 13:25:33 UTC
+++ This bug was initially created as a clone of Bug #1553116 +++
The same reproducer as in the original bug 1553116 causes a segfault on Sat 6.3 / qpid-dispatch-debuginfo-0.8.0-16 + qpid-proton-debuginfo-0.16.0-7, while the backtrace is completely different:

- once I got this (on the Satellite):

(gdb) bt
#0  qdr_link_second_attach (link=link@entry=0x0, source=0x7f14c90a7f10, target=target@entry=0x7f14c90a7fd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:427
#1  0x00007f150de879b3 in AMQP_link_attach_handler (context=<optimized out>, link=0x7f145c228090) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:527
#2  0x00007f150de68b1c in handle_link_open (container=0x13a3020, pn_link=0x7f14702feff0) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:165
#3  pn_event_handler (handler_context=0x13a3020, conn_context=0x13a1590, event=<optimized out>, qd_conn=0x7f14f4000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:514
#4  0x00007f150de8ceb8 in process_connector (cxtr=<optimized out>, qd_server=0x15271f0) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:782
#5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#6  0x00007f150d9e3e25 in start_thread (arg=0x7f14faffd700) at pthread_create.c:308
#7  0x00007f150cd1734d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 

- another time (on the Capsules):

(gdb) bt
#0  pn_session_set_incoming_capacity (ssn=0x0, capacity=35184372072448) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:1059
#1  0x00007f718e97562c in qd_link (node=0xe6b850, conn=conn@entry=0x7f7174000b10, dir=QD_INCOMING, name=name@entry=0x7f717cfce990 "8f2ec3cc-9803-4d78-8f8b-5fa2b39e6ab3")
    at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:831
#2  0x00007f718e994269 in CORE_link_first_attach (context=0x11437f0, conn=<optimized out>, link=0x7f717cfccc90, source=0x7f715e873550, target=0x7f715e873450)
    at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:849
#3  0x00007f718e988b32 in qdr_connection_process (conn=0x7f7174027510) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:214
#4  0x00007f718e974298 in writable_handler (container=0xf8e2a0, container=0xf8e2a0, conn=<optimized out>, qd_conn=0x7f7174000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#5  handler (handler_context=0xf8e2a0, conn_context=0xf8c590, event=<optimized out>, qd_conn=0x7f7174000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
#6  0x00007f718e998deb in process_connector (cxtr=<optimized out>, qd_server=0x11121f0) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#8  0x00007f718e4efe25 in start_thread (arg=0x7f7180f08700) at pthread_create.c:308
#9  0x00007f718d82334d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 


The description and reproducer are the same, and so is the impact; only the technical cause is different (the original segfault does not happen here).


Description of problem:
With several Capsules (10 in the reproducer) and a Satellite, and *many* goferd clients frequently (re)connecting, qdrouterd segfaults within 10 minutes, sometimes on a Capsule, sometimes on the Satellite.

A particular detailed reproducer outside Satellite will follow; basically, if one mimics goferd activity by frequently creating receivers link-routed to qpidd, and does so massively in parallel, the segfault will happen.
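
For illustration only, here is a minimal client sketch of that churn. It is not the attached reproducer and not goferd itself; it uses the (since deprecated) proton-c Messenger API, and the router host, port and queue name are placeholders I made up, not values from this bug:

/* churn_client.c - illustrative sketch only; build against proton-c.
 * Repeatedly open a connection, attach a receiver to a link-routed
 * address, then tear everything down again. */
#include <proton/messenger.h>
#include <stdio.h>

int main(void)
{
    /* Placeholder address: host, port and queue name are assumptions. */
    const char *source = "amqp://capsule.example.com:5672/pulp.agent.DEMO-UUID";
    int i;

    for (i = 0; i < 10000; i++) {
        pn_messenger_t *m = pn_messenger(NULL);   /* container with a generated name */
        pn_messenger_set_timeout(m, 500);         /* do not block forever in recv */
        pn_messenger_start(m);                    /* start the messenger */
        pn_messenger_subscribe(m, source);        /* attach a receiving link to the link-routed address */
        pn_messenger_recv(m, 1);                  /* drive the I/O so the attach actually goes out */
        pn_messenger_stop(m);                     /* detach the link, close the connection */
        pn_messenger_free(m);
        if (i % 100 == 0)
            fprintf(stderr, "iteration %d\n", i);
    }
    return 0;
}

Running many such processes in parallel against each Capsule router approximates the (dis)connect load described above.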


Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-21.el7.x86_64
qpid-dispatch-router-0.4-29.el7sat.x86_64


How reproducible:
100% in 10 minutes


Steps to Reproduce:
1. See above; the detailed reproducer is in the next comment (with the script, router configs, etc.)


Actual results:
one of the qdrouterd instances segfaults, with a few possible backtraces falling into 2 basic variants:

1)
(gdb) bt
#0  qd_alloc (desc=desc@entry=0x7fc8ad51a760 <__desc_qd_router_link_t>, tpool=0x7fc89fa906a0) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:162
#1  0x00007fc8ad3043c7 in new_qd_router_link_t () at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60
#2  0x00007fc8ad30542f in qd_router_attach_routed_link (context=0x7fc894ab96b0, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1182
#3  0x00007fc8ad3091d9 in invoke_deferred_calls (conn=conn@entry=0x7fc89000cb30, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305
#4  0x00007fc8ad309a9d in process_connector (cxtr=0x7fc8900102d0, qd_server=0x1553bf0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#6  0x00007fc8ace7ae25 in start_thread (arg=0x7fc89fa90700) at pthread_create.c:308
#7  0x00007fc8ac3d034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) list
157	        desc->stats->batches_rebalanced_to_threads++;
158	        desc->stats->held_by_threads += desc->config->transfer_batch_size;
159	        for (idx = 0; idx < desc->config->transfer_batch_size; idx++) {
160	            item = DEQ_HEAD(desc->global_pool->free_list);
161	            DEQ_REMOVE_HEAD(desc->global_pool->free_list);
162	            DEQ_INSERT_TAIL(pool->free_list, item);
163	        }
164	    } else {
165	        //
166	        // Allocate a full batch from the heap and put it on the thread list.
(gdb) p item
$1 = (qd_alloc_item_t *) 0x0
(gdb) p desc->global_pool->free_list
$2 = {head = 0x0, tail = 0x0, scratch = 0x7fc8929d6be0, size = 13557}
(gdb) 

2)
(gdb) bt
#0  qd_dealloc (desc=desc@entry=0x7f0fc187e760 <__desc_qd_router_link_t>, tpool=<optimized out>, p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:244
#1  0x00007f0fc16683f7 in free_qd_router_link_t (p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60
#2  0x00007f0fc16692ee in qd_router_link_free_LH (rlink=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:433
#3  0x00007f0fc1669bb0 in qd_router_detach_routed_link (context=0x7f0fa440ed70, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1258
#4  0x00007f0fc166d206 in invoke_deferred_calls (conn=conn@entry=0x27a1cb0, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305
#5  0x00007f0fc166da9d in process_connector (cxtr=0x7f0fac0052b0, qd_server=0x270e1f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#6  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#7  0x00007f0fc11dee25 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f0fc073434d in clone () from /lib64/libc.so.6
(gdb) list
239	    desc->stats->batches_rebalanced_to_global++;
240	    desc->stats->held_by_threads -= desc->config->transfer_batch_size;
241	    for (idx = 0; idx < desc->config->transfer_batch_size; idx++) {
242	        item = DEQ_HEAD(pool->free_list);
243	        DEQ_REMOVE_HEAD(pool->free_list);
244	        DEQ_INSERT_TAIL(desc->global_pool->free_list, item);
245	    }
246	
247	    //
248	    // If there's a global_free_list size limit, remove items until the limit is
(gdb) p item
$1 = (qd_alloc_item_t *) 0x0
(gdb) p pool->free_list
$2 = {head = 0x0, tail = 0x0, scratch = 0x7f0fac067500, size = 99}
(gdb) 



Expected results:
no such segfault


Additional info:
A few interesting observations:
- technical cause of the segfault: we try to DEQ_INSERT_TAIL a null item fetched from a free_list deque; that happened because the deque was empty (head = tail = 0x0) BUT reported a positive size (how could that ever happen?), so the code assumed it could dequeue something (see the sketch after this list)
- the above problem affects both desc->global_pool->free_list and pool->free_list, but always while manipulating links
- when running the reproducer and then stopping *all* clients, qpidd still keeps some link-routed subscriptions to the relevant queues, coming from some Capsule's qdrouterd (qpid-stat -q shows a positive number of subscriptions)
  - after restarting qdrouterd on the Capsules, these subscriptions are gone
- that leads me to deduce there is a leak of "links routed further" (i.e. links from a Capsule router to the Satellite router, maintained on the Capsule router; or links from the Satellite router to qpidd, maintained by the Satellite router). This is evident in qpid-stat and, IMHO, it triggers the weird behaviour in the memory management of link objects (the deques)
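
To make the first observation concrete, here is a minimal, self-contained model of the failing pattern. This is not the qpid-dispatch alloc.c code (names and the batch size are made up); it just shows a free list whose size counter claims items are available while head/tail are NULL, so the batch-transfer loop pops NULL and would crash on the subsequent insert:

/* free_list_sketch.c - illustrative model of the crash, not qpid-dispatch code */
#include <stdio.h>
#include <stddef.h>

typedef struct item { struct item *next; } item_t;

typedef struct {
    item_t *head;
    item_t *tail;
    size_t  size;   /* supposed to match the number of linked items */
} free_list_t;

static item_t *pop_head(free_list_t *l)
{
    item_t *it = l->head;
    if (it) {
        l->head = it->next;
        if (!l->head)
            l->tail = NULL;
    }
    l->size--;                      /* decremented even when it == NULL */
    return it;
}

static void push_tail(free_list_t *l, item_t *it)
{
    it->next = NULL;                /* the crash site: dereferences NULL if it == NULL */
    if (l->tail)
        l->tail->next = it;
    else
        l->head = it;
    l->tail = it;
    l->size++;
}

int main(void)
{
    /* Corrupted state as seen in the core: head = tail = 0x0, size = 13557 */
    free_list_t global_pool = { NULL, NULL, 13557 };
    free_list_t thread_pool = { NULL, NULL, 0 };
    size_t transfer_batch_size = 16;   /* made-up value */
    size_t i;

    for (i = 0; i < transfer_batch_size; i++) {
        item_t *it = pop_head(&global_pool);
        if (!it) {                  /* the real loop trusts 'size' and has no such guard */
            fprintf(stderr, "size counter disagrees with an empty list (size now %zu)\n",
                    global_pool.size);
            return 1;
        }
        push_tail(&thread_pool, it);
    }
    return 0;
}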

--- Additional comment from RHEL Product and Program Management on 2018-03-08 05:55:45 EST ---

Since this issue was entered in Red Hat Bugzilla, the pm_ack has been
set to + automatically for the next planned release

--- Additional comment from Pavel Moravec on 2018-03-08 06:32 EST ---

To reproduce:

- unpack and cd to the dir

- ulimit -c unlimited   # to get coredumps, if interested

- run start_reproducer.sh
  - very first run will customize configs per your hostname
  - it will start and provision qpidd + "hub" router + 10 Capsule routers

- then run the clients on one, or preferably two, systems (per the script output)

- I recommend using a different machine (or machines) for the clients, as they generate a load above 10 :-/

- optionally, check "qpid-stat -q": each queue should have (close to) 100 consumers per system running the clients (i.e. with two client systems, each queue will have 190-200 consumers). If the number of consumers is higher, you have already hit the link-leak symptom

- wait up to 10 minutes; a segfault will come (in one of the routers)

Comment 4 Pavel Moravec 2018-03-22 15:07:07 UTC
Testing with:

qpid-dispatch-0.8.0-18.el7
qpid-proton-0.16.0-9.el7

no segfault or mem.leak is present \o/ .

Comment 5 Pavel Moravec 2018-03-24 11:28:10 UTC
(In reply to Pavel Moravec from comment #4)
> Testing with:
> 
> qpid-dispatch-0.8.0-18.el7
> qpid-proton-0.16.0-9.el7
> 
> no segfault or mem.leak is present \o/ .

... but using the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=1553116#c4, I get a coredump in approx. 10 minutes with this backtrace:

#0  0x00007f7f92f861f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f7f92f878e8 in __GI_abort () at abort.c:90
#2  0x00007f7f92f7f266 in __assert_fail_base (fmt=0x7f7f930d1e68 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f7f941c40f3 "result == 0", 
    file=file@entry=0x7f7f941c40a8 "/builddir/build/BUILD/qpid-dispatch-0.8.0/src/posix/threading.c", line=line@entry=57, 
    function=function@entry=0x7f7f941c41b1 <__PRETTY_FUNCTION__.3971> "sys_mutex_lock") at assert.c:92
#3  0x00007f7f92f7f312 in __GI___assert_fail (assertion=assertion@entry=0x7f7f941c40f3 "result == 0", 
    file=file@entry=0x7f7f941c40a8 "/builddir/build/BUILD/qpid-dispatch-0.8.0/src/posix/threading.c", line=line@entry=57, 
    function=function@entry=0x7f7f941c41b1 <__PRETTY_FUNCTION__.3971> "sys_mutex_lock") at assert.c:101
#4  0x00007f7f941a7411 in sys_mutex_lock (mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:57
#5  0x00007f7f941a1988 in qd_message_check (in_msg=in_msg@entry=0x13e4710, depth=depth@entry=QD_DEPTH_MESSAGE_ANNOTATIONS) at /usr/src/debug/qpid-dispatch-0.8.0/src/message.c:1097
#6  0x00007f7f941a2a68 in qd_message_send (in_msg=0x13e4710, link=link@entry=0x21438d0, strip_annotations=strip_annotations@entry=false)
    at /usr/src/debug/qpid-dispatch-0.8.0/src/message.c:900
#7  0x00007f7f941ba061 in CORE_link_deliver (context=0xde2a10, link=<optimized out>, dlv=0x7f7f7c0a3750, settled=<optimized out>)
    at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1038
#8  0x00007f7f941b8a96 in qdr_link_process_deliveries (core=0x10c1e50, link=0x7f7f800a3e10, credit=0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/transfer.c:143
#9  0x00007f7f941ba162 in CORE_link_push (context=0xde2a10, link=0x7f7f800a3e10, limit=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1000
#10 0x00007f7f941aef64 in qdr_connection_process (conn=0x7f7f70028bd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:251
#11 0x00007f7f9419a388 in writable_handler (container=0x10b6050, container=0x10b6050, conn=<optimized out>, qd_conn=0x7f7f78000e90) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#12 handler (handler_context=0x10b6050, conn_context=0x1205290, event=<optimized out>, qd_conn=0x7f7f78000e90) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:616
#13 0x00007f7f941bef53 in process_connector (cxtr=<optimized out>, qd_server=0x10c5870) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:805
#14 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1025
#15 0x00007f7f93d15e25 in start_thread (arg=0x7f7f85eec700) at pthread_create.c:308
#16 0x00007f7f9304934d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


(Why could pthread_mutex_lock fail? Either because the lock is not initialized, or because it is already held by the same thread? Changing "ulimit -l" has no impact, so this should not be a limits issue.)

Packages used:
qpid-dispatch-router-0.8.0-18.el7.x86_64
qpid-proton-c-0.16.0-9.el7.x86_64
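
To illustrate the question above (what can make pthread_mutex_lock(), and therefore the assert in sys_mutex_lock(), fail), here is a small self-contained sketch. It is not qpid-dispatch code; it just makes the two guessed failure modes visible, using an ERRORCHECK mutex so the double lock returns an error instead of deadlocking:

/* mutex_fail_sketch.c - illustrative only; build with: cc -pthread mutex_fail_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    /* Case 1: re-locking a non-recursive mutex from the same thread.
     * With PTHREAD_MUTEX_ERRORCHECK this returns EDEADLK, which would
     * trip an assert(result == 0). */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    rc = pthread_mutex_lock(&m);            /* first lock: rc == 0 */
    rc = pthread_mutex_lock(&m);            /* second lock from the same thread */
    printf("relock from same thread: %s\n", strerror(rc));
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);

    /* Case 2: locking a mutex that was already destroyed (or never
     * initialized) is undefined behaviour; in practice glibc may return
     * EINVAL, which would also trip the assert. */
    rc = pthread_mutex_lock(&m);
    printf("lock after destroy: %s\n", strerror(rc));
    return 0;
}

Note that a default-type mutex would normally just deadlock on a double lock rather than return an error, so corruption or use-after-free of the mutex itself is another plausible explanation for the assertion failure seen here.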

Comment 7 Pavel Moravec 2018-03-24 12:22:10 UTC
FYI, the time the reproducer has to run until the segfault happens closely tracks the number of clients running in parallel: when I used half of the clients in the stress test, the segfault happened after twice as long. Some internal resource gets exhausted, I guess?

Comment 10 Pavel Moravec 2018-11-09 19:02:27 UTC
bz1561876 is a duplicate of this one (this one was filed first), but the real progress has been made in bz1561876, so I am closing this BZ instead.

*** This bug has been marked as a duplicate of bug 1561876 ***

