+++ This bug was initially created as a clone of Bug #1553116 +++

The same reproducer as in the original bug 1553116 causes a segfault in Sat 6.3 / qpid-dispatch-debuginfo-0.8.0-16 + qpid-proton-debuginfo-0.16.0-7, while the backtrace is completely different:

- once I got (on Satellite):

(gdb) bt
#0  qdr_link_second_attach (link=link@entry=0x0, source=0x7f14c90a7f10, target=target@entry=0x7f14c90a7fd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:427
#1  0x00007f150de879b3 in AMQP_link_attach_handler (context=<optimized out>, link=0x7f145c228090) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:527
#2  0x00007f150de68b1c in handle_link_open (container=0x13a3020, pn_link=0x7f14702feff0) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:165
#3  pn_event_handler (handler_context=0x13a3020, conn_context=0x13a1590, event=<optimized out>, qd_conn=0x7f14f4000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:514
#4  0x00007f150de8ceb8 in process_connector (cxtr=<optimized out>, qd_server=0x15271f0) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:782
#5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#6  0x00007f150d9e3e25 in start_thread (arg=0x7f14faffd700) at pthread_create.c:308
#7  0x00007f150cd1734d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb)

- another time (on Capsules):

(gdb) bt
#0  pn_session_set_incoming_capacity (ssn=0x0, capacity=35184372072448) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:1059
#1  0x00007f718e97562c in qd_link (node=0xe6b850, conn=conn@entry=0x7f7174000b10, dir=QD_INCOMING, name=name@entry=0x7f717cfce990 "8f2ec3cc-9803-4d78-8f8b-5fa2b39e6ab3") at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:831
#2  0x00007f718e994269 in CORE_link_first_attach (context=0x11437f0, conn=<optimized out>, link=0x7f717cfccc90, source=0x7f715e873550, target=0x7f715e873450) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:849
#3  0x00007f718e988b32 in qdr_connection_process (conn=0x7f7174027510) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:214
#4  0x00007f718e974298 in writable_handler (container=0xf8e2a0, container=0xf8e2a0, conn=<optimized out>, qd_conn=0x7f7174000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#5  handler (handler_context=0xf8e2a0, conn_context=0xf8c590, event=<optimized out>, qd_conn=0x7f7174000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
#6  0x00007f718e998deb in process_connector (cxtr=<optimized out>, qd_server=0x11121f0) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#8  0x00007f718e4efe25 in start_thread (arg=0x7f7180f08700) at pthread_create.c:308
#9  0x00007f718d82334d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb)

The description and reproducer are the same, and the impact is as well; only the technical cause differs (the original segfault does not happen here).

Description of problem:
With several Capsules (10 in the reproducer), a Satellite, and *many* goferd clients frequently re-connecting, qdrouterd segfaults within 10 minutes - sometimes on a Capsule, sometimes on the Satellite. A detailed reproducer outside of Satellite will follow; basically, if one mimics goferd activity by frequently creating receivers link-routed to qpidd, and does so massively in parallel, the segfault will happen.
Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-21.el7.x86_64
qpid-dispatch-router-0.4-29.el7sat.x86_64

How reproducible:
100% in 10 minutes

Steps to Reproduce:
1. See above; the detailed reproducer is in the next comment (with script, configs of routers etc.)

Actual results:
either qdrouterd (Capsule's or Satellite's) segfaults, with a few possible backtraces of two basic variants:

1)
(gdb) bt
#0  qd_alloc (desc=desc@entry=0x7fc8ad51a760 <__desc_qd_router_link_t>, tpool=0x7fc89fa906a0) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:162
#1  0x00007fc8ad3043c7 in new_qd_router_link_t () at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60
#2  0x00007fc8ad30542f in qd_router_attach_routed_link (context=0x7fc894ab96b0, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1182
#3  0x00007fc8ad3091d9 in invoke_deferred_calls (conn=conn@entry=0x7fc89000cb30, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305
#4  0x00007fc8ad309a9d in process_connector (cxtr=0x7fc8900102d0, qd_server=0x1553bf0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#6  0x00007fc8ace7ae25 in start_thread (arg=0x7fc89fa90700) at pthread_create.c:308
#7  0x00007fc8ac3d034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) list
157             desc->stats->batches_rebalanced_to_threads++;
158             desc->stats->held_by_threads += desc->config->transfer_batch_size;
159             for (idx = 0; idx < desc->config->transfer_batch_size; idx++) {
160                 item = DEQ_HEAD(desc->global_pool->free_list);
161                 DEQ_REMOVE_HEAD(desc->global_pool->free_list);
162                 DEQ_INSERT_TAIL(pool->free_list, item);
163             }
164         } else {
165             //
166             // Allocate a full batch from the heap and put it on the thread list.
(gdb) p item
$1 = (qd_alloc_item_t *) 0x0
(gdb) p desc->global_pool->free_list
$2 = {head = 0x0, tail = 0x0, scratch = 0x7fc8929d6be0, size = 13557}
(gdb)

2)
(gdb) bt
#0  qd_dealloc (desc=desc@entry=0x7f0fc187e760 <__desc_qd_router_link_t>, tpool=<optimized out>, p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:244
#1  0x00007f0fc16683f7 in free_qd_router_link_t (p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60
#2  0x00007f0fc16692ee in qd_router_link_free_LH (rlink=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:433
#3  0x00007f0fc1669bb0 in qd_router_detach_routed_link (context=0x7f0fa440ed70, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1258
#4  0x00007f0fc166d206 in invoke_deferred_calls (conn=conn@entry=0x27a1cb0, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305
#5  0x00007f0fc166da9d in process_connector (cxtr=0x7f0fac0052b0, qd_server=0x270e1f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#6  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#7  0x00007f0fc11dee25 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f0fc073434d in clone () from /lib64/libc.so.6
(gdb) list
239             desc->stats->batches_rebalanced_to_global++;
240             desc->stats->held_by_threads -= desc->config->transfer_batch_size;
241             for (idx = 0; idx < desc->config->transfer_batch_size; idx++) {
242                 item = DEQ_HEAD(pool->free_list);
243                 DEQ_REMOVE_HEAD(pool->free_list);
244                 DEQ_INSERT_TAIL(desc->global_pool->free_list, item);
245             }
246
247             //
248             // If there's a global_free_list size limit, remove items until the limit is
(gdb) p item
$1 = (qd_alloc_item_t *) 0x0
(gdb) p pool->free_list
$2 = {head = 0x0, tail = 0x0, scratch = 0x7f0fac067500, size = 99}
(gdb)

Expected results:
no such segfault

Additional info:
A few interesting observations:
- Technical cause of the segfault: we try to DEQ_INSERT_TAIL a null object fetched from a free_list deque; that happened because the deque was empty (head = tail = 0x0) BUT had a positive size (how could this ever happen?), so there was an assumption that something could be dequeued (a minimal illustrative sketch of this failure mode follows the reproducer comment below).
- The above problem affects both desc->global_pool->free_list and pool->free_list, but always when manipulating links.
- When running the reproducer and then stopping *all* clients, qpidd still keeps some link-routed subscriptions to the relevant queues, from some Capsule's qdrouterd (qpid-stat -q shows a positive number of subscribers).
- When restarting qdrouterd on the Capsules, these subscriptions are gone.
- That leads me to the deduction that there is a leak in "links routed further" (i.e. links from the Caps router to the Sat router, maintained on the Caps router, or links from the Sat router to qpidd, maintained by the Sat router); this is evident in qpid-stat and IMHO triggers the weird behaviour in the memory management of link objects (the deques).
--- Additional comment from RHEL Product and Program Management on 2018-03-08 05:55:45 EST ---

Since this issue was entered in Red Hat Bugzilla, the pm_ack has been set to + automatically for the next planned release.

--- Additional comment from Pavel Moravec on 2018-03-08 06:32 EST ---

To reproduce:
- unpack the attachment and cd to the dir
- ulimit -c unlimited  # to get coredumps, if interested
- run start_reproducer.sh - the very first run will customize the configs per your hostname
- it will start and provision qpidd + the "hub" router + 10 Capsule routers
- then run the clients (per the script output) on 1 or rather 2 systems - I recommend using different machine(s) for the clients, as it generates a load >10 :-/
- optionally, check "qpid-stat -q" - either queue should have (close to) 100 consumers for each system running the clients (i.e. if I use 2 systems running the clients, each queue will have 190-200 consumers). If the number of consumers is higher, you have already hit the symptom with the link leak.
- wait up to 10 minutes; a segfault will come (of either router)
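To illustrate the technical cause noted in the Additional info above (an empty free_list deque with a positive size counter), here is a minimal, self-contained sketch of that failure mode. It is illustrative only - the struct, the simplified DEQ_* equivalents and the transfer_batch_size value are stand-ins, not the actual qpid-dispatch alloc.c code:

/*
 * Sketch of the crash mechanism: a free-list deque whose size counter
 * claims it holds items while head/tail are already NULL.  The batch
 * rebalancing loop trusts the counter, pops a NULL item and then
 * dereferences it.  Compile with: gcc -o deq_sketch deq_sketch.c
 */
#include <stddef.h>
#include <stdio.h>

typedef struct item_t {
    struct item_t *prev;
    struct item_t *next;
} item_t;

typedef struct {
    item_t *head;
    item_t *tail;
    size_t  size;   /* bookkeeping counter, maintained separately from head/tail */
} deq_t;

/* simplified equivalents of the DEQ_* macros used in alloc.c */
static item_t *deq_head(deq_t *q) { return q->head; }

static void deq_remove_head(deq_t *q)
{
    item_t *i = q->head;
    if (i) {
        q->head = i->next;
        if (!q->head) q->tail = NULL;
    }
    q->size--;                      /* counter decremented regardless */
}

static void deq_insert_tail(deq_t *q, item_t *i)
{
    i->next = NULL;                 /* <-- segfaults here when i == NULL */
    i->prev = q->tail;
    if (q->tail) q->tail->next = i; else q->head = i;
    q->tail = i;
    q->size++;
}

int main(void)
{
    /* corrupted state as seen in the core: empty list, positive size */
    deq_t global_free_list = { .head = NULL, .tail = NULL, .size = 99 };
    deq_t thread_free_list = { .head = NULL, .tail = NULL, .size = 0 };
    const size_t transfer_batch_size = 64;

    /* the rebalancing loop assumes 'size' items are really present */
    if (global_free_list.size >= transfer_batch_size) {
        for (size_t idx = 0; idx < transfer_batch_size; idx++) {
            item_t *item = deq_head(&global_free_list);   /* returns NULL */
            deq_remove_head(&global_free_list);
            deq_insert_tail(&thread_free_list, item);     /* crashes      */
        }
    }
    printf("not reached\n");
    return 0;
}

Whatever desynchronizes the counter from the actual list contents (the suspected leak of routed links above would be one way to get there), the rebalancing loop then pops a NULL item and dereferences it in the insert, matching both backtrace variants shown in the Actual results.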
Testing with:

qpid-dispatch-0.8.0-18.el7
qpid-proton-0.16.0-9.el7

no segfault or mem.leak is present \o/ .
(In reply to Pavel Moravec from comment #4)
> Testing with:
>
> qpid-dispatch-0.8.0-18.el7
> qpid-proton-0.16.0-9.el7
>
> no segfault or mem.leak is present \o/ .

.. but using the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=1553116#c4 , I get a coredump in approx. 10 minutes with this backtrace:

#0  0x00007f7f92f861f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f7f92f878e8 in __GI_abort () at abort.c:90
#2  0x00007f7f92f7f266 in __assert_fail_base (fmt=0x7f7f930d1e68 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f7f941c40f3 "result == 0", file=file@entry=0x7f7f941c40a8 "/builddir/build/BUILD/qpid-dispatch-0.8.0/src/posix/threading.c", line=line@entry=57, function=function@entry=0x7f7f941c41b1 <__PRETTY_FUNCTION__.3971> "sys_mutex_lock") at assert.c:92
#3  0x00007f7f92f7f312 in __GI___assert_fail (assertion=assertion@entry=0x7f7f941c40f3 "result == 0", file=file@entry=0x7f7f941c40a8 "/builddir/build/BUILD/qpid-dispatch-0.8.0/src/posix/threading.c", line=line@entry=57, function=function@entry=0x7f7f941c41b1 <__PRETTY_FUNCTION__.3971> "sys_mutex_lock") at assert.c:101
#4  0x00007f7f941a7411 in sys_mutex_lock (mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:57
#5  0x00007f7f941a1988 in qd_message_check (in_msg=in_msg@entry=0x13e4710, depth=depth@entry=QD_DEPTH_MESSAGE_ANNOTATIONS) at /usr/src/debug/qpid-dispatch-0.8.0/src/message.c:1097
#6  0x00007f7f941a2a68 in qd_message_send (in_msg=0x13e4710, link=link@entry=0x21438d0, strip_annotations=strip_annotations@entry=false) at /usr/src/debug/qpid-dispatch-0.8.0/src/message.c:900
#7  0x00007f7f941ba061 in CORE_link_deliver (context=0xde2a10, link=<optimized out>, dlv=0x7f7f7c0a3750, settled=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1038
#8  0x00007f7f941b8a96 in qdr_link_process_deliveries (core=0x10c1e50, link=0x7f7f800a3e10, credit=0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/transfer.c:143
#9  0x00007f7f941ba162 in CORE_link_push (context=0xde2a10, link=0x7f7f800a3e10, limit=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1000
#10 0x00007f7f941aef64 in qdr_connection_process (conn=0x7f7f70028bd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:251
#11 0x00007f7f9419a388 in writable_handler (container=0x10b6050, container=0x10b6050, conn=<optimized out>, qd_conn=0x7f7f78000e90) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#12 handler (handler_context=0x10b6050, conn_context=0x1205290, event=<optimized out>, qd_conn=0x7f7f78000e90) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:616
#13 0x00007f7f941bef53 in process_connector (cxtr=<optimized out>, qd_server=0x10c5870) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:805
#14 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1025
#15 0x00007f7f93d15e25 in start_thread (arg=0x7f7f85eec700) at pthread_create.c:308
#16 0x00007f7f9304934d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(Why could pthread_mutex_lock fail? Either because the lock isn't initialized, or because it is already acquired by the same thread? Changing "ulimit -l" has no impact; this shouldn't be a limit.)

Packages used:
qpid-dispatch-router-0.8.0-18.el7.x86_64
qpid-proton-c-0.16.0-9.el7.x86_64
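Regarding why pthread_mutex_lock() could fail: both suspected causes indeed make it return a non-zero error code, which is exactly what the "result == 0" assertion in src/posix/threading.c checks for. A small standalone sketch of that pattern (illustrative only, not the qpid-dispatch code; it just demonstrates the documented POSIX behaviour; compile with -pthread):

/*
 * Sketch: two documented ways pthread_mutex_lock() returns non-zero and
 * trips an assert of the "result == 0" style used in threading.c.
 */
#define _GNU_SOURCE            /* for PTHREAD_MUTEX_ERRORCHECK on older glibc */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* hypothetical stand-in for sys_mutex_lock() */
static void sys_mutex_lock_like(pthread_mutex_t *mutex)
{
    int result = pthread_mutex_lock(mutex);
    if (result != 0)
        fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(result));
    assert(result == 0);       /* same style of check as threading.c:57 */
}

int main(void)
{
    /* Case 1: re-locking an error-checking mutex from the owning thread
       returns EDEADLK (a default mutex would simply deadlock instead). */
    pthread_mutex_t m;
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    sys_mutex_lock_like(&m);   /* first lock succeeds                 */
    sys_mutex_lock_like(&m);   /* second lock -> EDEADLK -> assert/abort */

    /* Case 2 (not reached here): locking a never-initialized or already
       destroyed mutex is undefined behaviour and typically surfaces as
       EINVAL or garbage, which would trip the same assert. */
    return 0;
}

Either path - a mutex re-locked by its owner, or a mutex that was never initialized / already freed (e.g. because the owning qd_message_t was freed concurrently) - produces the same abort-via-assert seen in frames #2-#4, which is consistent with "ulimit -l" having no effect.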
FYI, the time the reproducer needs to run before the segfault happens is strongly related to the number of clients running in parallel - when I used half the clients in the stress test, the segfault happened after roughly twice as long. Some internal resource gets exhausted, I guess?
bz1561876 is a duplicate of this one (this BZ was filed first), but the real progress has been made in bz1561876, so this BZ is being closed instead.

*** This bug has been marked as a duplicate of bug 1561876 ***