Bug 1553189 - qdrouterd segfaults under load of clients (dis)connects
Summary: qdrouterd segfaults under load of clients (dis)connects
Keywords:
Status: CLOSED DUPLICATE of bug 1561876
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On: 1553116
Blocks:
 
Reported: 2018-03-08 13:25 UTC by Pavel Moravec
Modified: 2022-03-13 14:45 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1553116
Environment:
Last Closed: 2018-11-09 19:02:27 UTC
Target Upstream Version:
Embargoed:



Description Pavel Moravec 2018-03-08 13:25:33 UTC
+++ This bug was initially created as a clone of Bug #1553116 +++
The same reproducer as in the original bug 1553116 causes a segfault on Sat 6.3 / qpid-dispatch-debuginfo-0.8.0-16 + qpid-proton-debuginfo-0.16.0-7, while the backtrace is completely different:

- once I got this (on the Satellite):

(gdb) bt
#0  qdr_link_second_attach (link=link@entry=0x0, source=0x7f14c90a7f10, target=target@entry=0x7f14c90a7fd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:427
#1  0x00007f150de879b3 in AMQP_link_attach_handler (context=<optimized out>, link=0x7f145c228090) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:527
#2  0x00007f150de68b1c in handle_link_open (container=0x13a3020, pn_link=0x7f14702feff0) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:165
#3  pn_event_handler (handler_context=0x13a3020, conn_context=0x13a1590, event=<optimized out>, qd_conn=0x7f14f4000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:514
#4  0x00007f150de8ceb8 in process_connector (cxtr=<optimized out>, qd_server=0x15271f0) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:782
#5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#6  0x00007f150d9e3e25 in start_thread (arg=0x7f14faffd700) at pthread_create.c:308
#7  0x00007f150cd1734d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 

- another time (on the Capsules):

(gdb) bt
#0  pn_session_set_incoming_capacity (ssn=0x0, capacity=35184372072448) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:1059
#1  0x00007f718e97562c in qd_link (node=0xe6b850, conn=conn@entry=0x7f7174000b10, dir=QD_INCOMING, name=name@entry=0x7f717cfce990 "8f2ec3cc-9803-4d78-8f8b-5fa2b39e6ab3")
    at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:831
#2  0x00007f718e994269 in CORE_link_first_attach (context=0x11437f0, conn=<optimized out>, link=0x7f717cfccc90, source=0x7f715e873550, target=0x7f715e873450)
    at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:849
#3  0x00007f718e988b32 in qdr_connection_process (conn=0x7f7174027510) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:214
#4  0x00007f718e974298 in writable_handler (container=0xf8e2a0, container=0xf8e2a0, conn=<optimized out>, qd_conn=0x7f7174000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#5  handler (handler_context=0xf8e2a0, conn_context=0xf8c590, event=<optimized out>, qd_conn=0x7f7174000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
#6  0x00007f718e998deb in process_connector (cxtr=<optimized out>, qd_server=0x11121f0) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#8  0x00007f718e4efe25 in start_thread (arg=0x7f7180f08700) at pthread_create.c:308
#9  0x00007f718d82334d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 


The description and reproducer are the same, and so is the impact; only the technical cause is different (the original segfault does not happen here).


Description of problem:
With several Capsules (10 in the reproducer) and a Satellite, and *many* goferd clients frequently (re)connecting, qdrouterd segfaults within 10 minutes, sometimes on a Capsule, sometimes on the Satellite.

A particular detailed reproducer outside Satellite will follow; basically, if one mimics goferd activity by frequently creating receivers link-routed to qpidd, and does so massively in parallel, the segfault will happen.
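
For illustration only, here is a minimal client sketch of that churn. It is not the attached reproducer and not goferd itself; it uses the (since deprecated) proton-c Messenger API, and the router host, port and queue name are placeholders I made up, not values from this bug:

/* churn_client.c - illustrative sketch only; build against proton-c.
 * Repeatedly open a connection, attach a receiver to a link-routed
 * address, then tear everything down again. */
#include <proton/messenger.h>
#include <stdio.h>

int main(void)
{
    /* Placeholder address: host, port and queue name are assumptions. */
    const char *source = "amqp://capsule.example.com:5672/pulp.agent.DEMO-UUID";
    int i;

    for (i = 0; i < 10000; i++) {
        pn_messenger_t *m = pn_messenger(NULL);   /* container with a generated name */
        pn_messenger_set_timeout(m, 500);         /* do not block forever in recv */
        pn_messenger_start(m);                    /* start the messenger */
        pn_messenger_subscribe(m, source);        /* attach a receiving link to the link-routed address */
        pn_messenger_recv(m, 1);                  /* drive the I/O so the attach actually goes out */
        pn_messenger_stop(m);                     /* detach the link, close the connection */
        pn_messenger_free(m);
        if (i % 100 == 0)
            fprintf(stderr, "iteration %d\n", i);
    }
    return 0;
}

Running many such processes in parallel against each Capsule router approximates the (dis)connect load described above.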


Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-21.el7.x86_64
qpid-dispatch-router-0.4-29.el7sat.x86_64


How reproducible:
100% in 10 minutes


Steps to Reproduce:
1. See above; the detailed reproducer is in the next comment (with the script, router configs, etc.)


Actual results:
one of the qdrouterd instances segfaults, with a few possible backtraces falling into 2 basic variants:

1)
(gdb) bt
#0  qd_alloc (desc=desc@entry=0x7fc8ad51a760 <__desc_qd_router_link_t>, tpool=0x7fc89fa906a0) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:162
#1  0x00007fc8ad3043c7 in new_qd_router_link_t () at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60
#2  0x00007fc8ad30542f in qd_router_attach_routed_link (context=0x7fc894ab96b0, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1182
#3  0x00007fc8ad3091d9 in invoke_deferred_calls (conn=conn@entry=0x7fc89000cb30, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305
#4  0x00007fc8ad309a9d in process_connector (cxtr=0x7fc8900102d0, qd_server=0x1553bf0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#6  0x00007fc8ace7ae25 in start_thread (arg=0x7fc89fa90700) at pthread_create.c:308
#7  0x00007fc8ac3d034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) list
157	        desc->stats->batches_rebalanced_to_threads++;
158	        desc->stats->held_by_threads += desc->config->transfer_batch_size;
159	        for (idx = 0; idx < desc->config->transfer_batch_size; idx++) {
160	            item = DEQ_HEAD(desc->global_pool->free_list);
161	            DEQ_REMOVE_HEAD(desc->global_pool->free_list);
162	            DEQ_INSERT_TAIL(pool->free_list, item);
163	        }
164	    } else {
165	        //
166	        // Allocate a full batch from the heap and put it on the thread list.
(gdb) p item
$1 = (qd_alloc_item_t *) 0x0
(gdb) p desc->global_pool->free_list
$2 = {head = 0x0, tail = 0x0, scratch = 0x7fc8929d6be0, size = 13557}
(gdb) 

2)
(gdb) bt
#0  qd_dealloc (desc=desc@entry=0x7f0fc187e760 <__desc_qd_router_link_t>, tpool=<optimized out>, p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:244
#1  0x00007f0fc16683f7 in free_qd_router_link_t (p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60
#2  0x00007f0fc16692ee in qd_router_link_free_LH (rlink=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:433
#3  0x00007f0fc1669bb0 in qd_router_detach_routed_link (context=0x7f0fa440ed70, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1258
#4  0x00007f0fc166d206 in invoke_deferred_calls (conn=conn@entry=0x27a1cb0, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305
#5  0x00007f0fc166da9d in process_connector (cxtr=0x7f0fac0052b0, qd_server=0x270e1f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#6  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#7  0x00007f0fc11dee25 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f0fc073434d in clone () from /lib64/libc.so.6
(gdb) list
239	    desc->stats->batches_rebalanced_to_global++;
240	    desc->stats->held_by_threads -= desc->config->transfer_batch_size;
241	    for (idx = 0; idx < desc->config->transfer_batch_size; idx++) {
242	        item = DEQ_HEAD(pool->free_list);
243	        DEQ_REMOVE_HEAD(pool->free_list);
244	        DEQ_INSERT_TAIL(desc->global_pool->free_list, item);
245	    }
246	
247	    //
248	    // If there's a global_free_list size limit, remove items until the limit is
(gdb) p item
$1 = (qd_alloc_item_t *) 0x0
(gdb) p pool->free_list
$2 = {head = 0x0, tail = 0x0, scratch = 0x7f0fac067500, size = 99}
(gdb) 



Expected results:
no such segfault


Additional info:
A few interesting observations:
- technical cause of the segfault: we try to DEQ_INSERT_TAIL a null item fetched from a free_list deque; that happened because the deque was empty (head = tail = 0x0) BUT reported a positive size (how could that ever happen?), so the code assumed it could dequeue something (see the sketch after this list)
- the above problem affects both desc->global_pool->free_list and pool->free_list, but always while manipulating links
- when running the reproducer and then stopping *all* clients, qpidd still keeps some link-routed subscriptions to the relevant queues, coming from some Capsule's qdrouterd (qpid-stat -q shows a positive number of subscriptions)
  - after restarting qdrouterd on the Capsules, these subscriptions are gone
- that leads me to deduce there is a leak of "links routed further" (i.e. links from a Capsule router to the Satellite router, maintained on the Capsule router; or links from the Satellite router to qpidd, maintained by the Satellite router). This is evident in qpid-stat and, IMHO, it triggers the weird behaviour in the memory management of link objects (the deques)
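
To make the first observation concrete, here is a minimal, self-contained model of the failing pattern. This is not the qpid-dispatch alloc.c code (names and the batch size are made up); it just shows a free list whose size counter claims items are available while head/tail are NULL, so the batch-transfer loop pops NULL and would crash on the subsequent insert:

/* free_list_sketch.c - illustrative model of the crash, not qpid-dispatch code */
#include <stdio.h>
#include <stddef.h>

typedef struct item { struct item *next; } item_t;

typedef struct {
    item_t *head;
    item_t *tail;
    size_t  size;   /* supposed to match the number of linked items */
} free_list_t;

static item_t *pop_head(free_list_t *l)
{
    item_t *it = l->head;
    if (it) {
        l->head = it->next;
        if (!l->head)
            l->tail = NULL;
    }
    l->size--;                      /* decremented even when it == NULL */
    return it;
}

static void push_tail(free_list_t *l, item_t *it)
{
    it->next = NULL;                /* the crash site: dereferences NULL if it == NULL */
    if (l->tail)
        l->tail->next = it;
    else
        l->head = it;
    l->tail = it;
    l->size++;
}

int main(void)
{
    /* Corrupted state as seen in the core: head = tail = 0x0, size = 13557 */
    free_list_t global_pool = { NULL, NULL, 13557 };
    free_list_t thread_pool = { NULL, NULL, 0 };
    size_t transfer_batch_size = 16;   /* made-up value */
    size_t i;

    for (i = 0; i < transfer_batch_size; i++) {
        item_t *it = pop_head(&global_pool);
        if (!it) {                  /* the real loop trusts 'size' and has no such guard */
            fprintf(stderr, "size counter disagrees with an empty list (size now %zu)\n",
                    global_pool.size);
            return 1;
        }
        push_tail(&thread_pool, it);
    }
    return 0;
}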

--- Additional comment from RHEL Product and Program Management on 2018-03-08 05:55:45 EST ---

Since this issue was entered in Red Hat Bugzilla, the pm_ack has been
set to + automatically for the next planned release

--- Additional comment from Pavel Moravec on 2018-03-08 06:32 EST ---

To reproduce:

- unpack and cd to the dir

- ulimit -c unlimited   # to get coredumps, if interested

- run start_reproducer.sh
  - very first run will customize configs per your hostname
  - it will start and provision qpidd + "hub" router + 10 Capsule routers

- then run the clients on one, or preferably two, systems (per the script output)

- I recommend using a different machine (or machines) for the clients, as they generate a load above 10 :-/

- optionally, check "qpid-stat -q": each queue should have (close to) 100 consumers per system running the clients (i.e. with two client systems, each queue will have 190-200 consumers). If the number of consumers is higher, you have already hit the link-leak symptom

- wait up to 10 minutes; a segfault will come (in one of the routers)

Comment 4 Pavel Moravec 2018-03-22 15:07:07 UTC
Testing with:

qpid-dispatch-0.8.0-18.el7
qpid-proton-0.16.0-9.el7

no segfault or mem.leak is present \o/ .

Comment 5 Pavel Moravec 2018-03-24 11:28:10 UTC
(In reply to Pavel Moravec from comment #4)
> Testing with:
> 
> qpid-dispatch-0.8.0-18.el7
> qpid-proton-0.16.0-9.el7
> 
> no segfault or mem.leak is present \o/ .

... but using the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=1553116#c4, I get a coredump in approx. 10 minutes with this backtrace:

#0  0x00007f7f92f861f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f7f92f878e8 in __GI_abort () at abort.c:90
#2  0x00007f7f92f7f266 in __assert_fail_base (fmt=0x7f7f930d1e68 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f7f941c40f3 "result == 0", 
    file=file@entry=0x7f7f941c40a8 "/builddir/build/BUILD/qpid-dispatch-0.8.0/src/posix/threading.c", line=line@entry=57, 
    function=function@entry=0x7f7f941c41b1 <__PRETTY_FUNCTION__.3971> "sys_mutex_lock") at assert.c:92
#3  0x00007f7f92f7f312 in __GI___assert_fail (assertion=assertion@entry=0x7f7f941c40f3 "result == 0", 
    file=file@entry=0x7f7f941c40a8 "/builddir/build/BUILD/qpid-dispatch-0.8.0/src/posix/threading.c", line=line@entry=57, 
    function=function@entry=0x7f7f941c41b1 <__PRETTY_FUNCTION__.3971> "sys_mutex_lock") at assert.c:101
#4  0x00007f7f941a7411 in sys_mutex_lock (mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:57
#5  0x00007f7f941a1988 in qd_message_check (in_msg=in_msg@entry=0x13e4710, depth=depth@entry=QD_DEPTH_MESSAGE_ANNOTATIONS) at /usr/src/debug/qpid-dispatch-0.8.0/src/message.c:1097
#6  0x00007f7f941a2a68 in qd_message_send (in_msg=0x13e4710, link=link@entry=0x21438d0, strip_annotations=strip_annotations@entry=false)
    at /usr/src/debug/qpid-dispatch-0.8.0/src/message.c:900
#7  0x00007f7f941ba061 in CORE_link_deliver (context=0xde2a10, link=<optimized out>, dlv=0x7f7f7c0a3750, settled=<optimized out>)
    at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1038
#8  0x00007f7f941b8a96 in qdr_link_process_deliveries (core=0x10c1e50, link=0x7f7f800a3e10, credit=0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/transfer.c:143
#9  0x00007f7f941ba162 in CORE_link_push (context=0xde2a10, link=0x7f7f800a3e10, limit=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1000
#10 0x00007f7f941aef64 in qdr_connection_process (conn=0x7f7f70028bd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:251
#11 0x00007f7f9419a388 in writable_handler (container=0x10b6050, container=0x10b6050, conn=<optimized out>, qd_conn=0x7f7f78000e90) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#12 handler (handler_context=0x10b6050, conn_context=0x1205290, event=<optimized out>, qd_conn=0x7f7f78000e90) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:616
#13 0x00007f7f941bef53 in process_connector (cxtr=<optimized out>, qd_server=0x10c5870) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:805
#14 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1025
#15 0x00007f7f93d15e25 in start_thread (arg=0x7f7f85eec700) at pthread_create.c:308
#16 0x00007f7f9304934d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


(Why could pthread_mutex_lock fail? Either because the lock is not initialized, or because it is already held by the same thread? Changing "ulimit -l" has no impact, so this should not be a limits issue.)

Packages used:
qpid-dispatch-router-0.8.0-18.el7.x86_64
qpid-proton-c-0.16.0-9.el7.x86_64
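
To illustrate the question above (what can make pthread_mutex_lock(), and therefore the assert in sys_mutex_lock(), fail), here is a small self-contained sketch. It is not qpid-dispatch code; it just makes the two guessed failure modes visible, using an ERRORCHECK mutex so the double lock returns an error instead of deadlocking:

/* mutex_fail_sketch.c - illustrative only; build with: cc -pthread mutex_fail_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    /* Case 1: re-locking a non-recursive mutex from the same thread.
     * With PTHREAD_MUTEX_ERRORCHECK this returns EDEADLK, which would
     * trip an assert(result == 0). */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    rc = pthread_mutex_lock(&m);            /* first lock: rc == 0 */
    rc = pthread_mutex_lock(&m);            /* second lock from the same thread */
    printf("relock from same thread: %s\n", strerror(rc));
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);

    /* Case 2: locking a mutex that was already destroyed (or never
     * initialized) is undefined behaviour; in practice glibc may return
     * EINVAL, which would also trip the assert. */
    rc = pthread_mutex_lock(&m);
    printf("lock after destroy: %s\n", strerror(rc));
    return 0;
}

Note that a default-type mutex would normally just deadlock on a double lock rather than return an error, so corruption or use-after-free of the mutex itself is another plausible explanation for the assertion failure seen here.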

Comment 7 Pavel Moravec 2018-03-24 12:22:10 UTC
FYI, the time the reproducer has to run until the segfault happens closely tracks the number of clients running in parallel: when I used half of the clients in the stress test, the segfault happened after twice as long. Some internal resource gets exhausted, I guess?

Comment 10 Pavel Moravec 2018-11-09 19:02:27 UTC
bz1561876 is a duplicate of this one (this one was filed first), but the real progress has been made in bz1561876, so I am closing this BZ instead.

*** This bug has been marked as a duplicate of bug 1561876 ***

