Description of problem: Having more Capsules (10 in reproducer) and a Satellite, and *many* goferd clients frequently re-connecting, qdrouterd segfaults within 10 minutes. Sometimes on a Capsule, sometimes on the Satellite. Particular detailed reproducer outside Satellite will follow, basically if one mimics goferd activity by frequent creating receiver link-routed to qpidd, and one does it massivelly in parallel, the segfault will happen. Version-Release number of selected component (if applicable): qpid-proton-c-0.9-21.el7.x86_64 qpid-dispatch-router-0.4-29.el7sat.x86_64 How reproducible: 100% in 10 minutes Steps to Reproduce: 1. See above, detailed reproducer in next comment (with script, configs of routers etc) Actual results: either qdrouterd segfaults with few possible backtraces of 2 basic variants: 1) (gdb) bt #0 qd_alloc (desc=desc@entry=0x7fc8ad51a760 <__desc_qd_router_link_t>, tpool=0x7fc89fa906a0) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:162 #1 0x00007fc8ad3043c7 in new_qd_router_link_t () at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60 #2 0x00007fc8ad30542f in qd_router_attach_routed_link (context=0x7fc894ab96b0, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1182 #3 0x00007fc8ad3091d9 in invoke_deferred_calls (conn=conn@entry=0x7fc89000cb30, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305 #4 0x00007fc8ad309a9d in process_connector (cxtr=0x7fc8900102d0, qd_server=0x1553bf0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398 #5 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627 #6 0x00007fc8ace7ae25 in start_thread (arg=0x7fc89fa90700) at pthread_create.c:308 #7 0x00007fc8ac3d034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 (gdb) list 157 desc->stats->batches_rebalanced_to_threads++; 158 desc->stats->held_by_threads += desc->config->transfer_batch_size; 159 for (idx = 0; idx < desc->config->transfer_batch_size; idx++) { 160 item = DEQ_HEAD(desc->global_pool->free_list); 161 DEQ_REMOVE_HEAD(desc->global_pool->free_list); 162 DEQ_INSERT_TAIL(pool->free_list, item); 163 } 164 } else { 165 // 166 // Allocate a full batch from the heap and put it on the thread list. (gdb) p item $1 = (qd_alloc_item_t *) 0x0 (gdb) p desc->global_pool->free_list $2 = {head = 0x0, tail = 0x0, scratch = 0x7fc8929d6be0, size = 13557} (gdb) 2) (gdb) bt #0 qd_dealloc (desc=desc@entry=0x7f0fc187e760 <__desc_qd_router_link_t>, tpool=<optimized out>, p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:244 #1 0x00007f0fc16683f7 in free_qd_router_link_t (p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60 #2 0x00007f0fc16692ee in qd_router_link_free_LH (rlink=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:433 #3 0x00007f0fc1669bb0 in qd_router_detach_routed_link (context=0x7f0fa440ed70, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1258 #4 0x00007f0fc166d206 in invoke_deferred_calls (conn=conn@entry=0x27a1cb0, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305 #5 0x00007f0fc166da9d in process_connector (cxtr=0x7f0fac0052b0, qd_server=0x270e1f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398 #6 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627 #7 0x00007f0fc11dee25 in start_thread () from /lib64/libpthread.so.0 #8 0x00007f0fc073434d in clone () from /lib64/libc.so.6 (gdb) list 239 desc->stats->batches_rebalanced_to_global++; 240 desc->stats->held_by_threads -= desc->config->transfer_batch_size; 241 for (idx = 0; idx < desc->config->transfer_batch_size; idx++) { 242 item = DEQ_HEAD(pool->free_list); 243 DEQ_REMOVE_HEAD(pool->free_list); 244 DEQ_INSERT_TAIL(desc->global_pool->free_list, item); 245 } 246 247 // 248 // If there's a global_free_list size limit, remove items until the limit is (gdb) p item $1 = (qd_alloc_item_t *) 0x0 (gdb) p pool->free_list $2 = {head = 0x0, tail = 0x0, scratch = 0x7f0fac067500, size = 99} (gdb) Expected results: no such segfault Additional info: Few interesting observations: - technical cause for segfault: we try to DEQ_INSERT_TAIL a null object, fetched from a free_list dequeue; that happened because that dequeue was empty (head = tail = 0x0) BUT of positive size (how this ever could happen???) so there was an assumption we can dequeue something - the above problem affects both desc->global_pool->free_list and pool->free_list, but always when manipulating with links - when running reproducer, and stopping *all* clients, qpidd still gets some lnik-routed subscriptions to the relevant queues, from some Capsule's qdrouterd (qpid-stat -q shows positive number of subs) - when restarting qdrouterd on Capsules, these subscriptions are gone - that leads me to a deduction there is a leak in "links routed further" (i.e. links from Caps router to Sat router, maintained on Caps router; or links from Sat router to qpidd maintained by Sat router) - this is evident in qpid-stat and imho triggers the weird behaviour in memory management of link objects (the dequeues)
Created attachment 1405823 [details] reproducer configs and script To reproduce: - unpack and cd to the dir - ulimit -c unlimited # to get coredumps, if interested) - run start_reproducer.sh - very first run will customize configs per your hostname - it will start and provision qpidd + "hub" router + 10 Capsule routers - then run on 1 or rather 2 systems the clients (per the script output) - I recommend using a different machine(s) for the clients, as it generates load >10 :-/ - optionally, check "qpid-stat -q" - either queue should have (close to) 100 consumers for each system running the clients (i.e. if I use 2 systems running the clients, each queue will have 190-200 consumers). If # of consumers is higher, you hit the symptom with link leak already - wait up to 10 minutes, a segfault will come (of either router)
Improved reproducer that leads to segfault even on 0.4-30 (where the original reproducer does not crash the router): 1) run same setup like before, just add some messages to the tested queues at the end: for i in $(seq 0 19); do qpid-config bind amq.fanout pulp.bz1553116.${i} done qpid-send -a amq.fanout -m10 --content-string="hello" 2) as a client script mimicking goferd, use this one: import random import threading import traceback import sys from proton import Timeout from proton import ConnectionException from proton.utils import BlockingConnection from proton import SSLDomain from time import sleep from uuid import uuid4 ROUTER_ADDRESS = "proton+amqp://pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:%s" ADDRESS = "pulp.bz1553116" HEARTBEAT = 5 SLEEP_MIN = 1.0 SLEEP_MAX = 3.0 THREADS = 20 ADDRESSES = 20 class ReceiverThread(threading.Thread): def __init__(self, port, address=ADDRESS, domain=None): super(ReceiverThread, self).__init__() self.port = port self.address = address self.domain = domain self.running = True def subscribe(self): self.recv = self.conn.create_receiver('%s' %(self.address), name=str(uuid4()), dynamic=False, options=None) def connect(self): self.conn = BlockingConnection(ROUTER_ADDRESS % (self.port), ssl_domain=self.domain, heartbeat=HEARTBEAT) self.subscribe() def reconnect(self): try: self.recv.close() except: pass try: self.conn.close() except: pass self.connect() def run(self): self.connect() while self.running: try: msg = None timeout = random.uniform(SLEEP_MIN,SLEEP_MAX) msg = self.recv.receive(timeout) if msg: print "(%s): accepting msg %s" % (self.port, msg) self.recv.accept() sleep(timeout) except KeyboardInterrupt: self.running = False pass except Timeout: pass try: print "(%s): bouncing receiver" % self.port self.recv.close() self.subscribe() except: print "(%s): receiver thread: failed to close receiver" % self.port self.reconnect() pass def stop(self): self.running = False threads = [] for i in range(THREADS): threads.append(ReceiverThread(sys.argv[1], '%s.%s' %(ADDRESS, i%ADDRESSES))) threads[i].start() running = True while running: try: sleep(10) except KeyboardInterrupt: running = False pass #_in = raw_input("Press Enter to exit:") for i in range(THREADS): threads[i].stop() for i in range(THREADS): threads[i].join() 3) in very few minutes, segfaults will happen with either backtrace: #0 0x0000000000000000 in ?? () #1 0x00007f37ff244788 in pn_class_incref (clazz=<optimized out>, object=object@entry=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:75 #2 0x00007f37ff2449f2 in pn_incref (object=object@entry=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:247 #3 0x00007f37ff251a96 in pn_delivery_settle (delivery=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1778 #4 0x00007f37ff494256 in qd_delivery_free_LH (delivery=0x7f37ec0153d0, final_disposition=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:1094 #5 0x00007f37ff4a03da in router_writable_link_handler (context=0x273c9b0, link=0x28d13d0) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:571 #6 0x00007f37ff493b9e in do_writable (pn_link=0x299fea0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234 #7 process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625 #8 handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675 #9 0x00007f37ff4a5c7c in process_connector (cxtr=0x7f37ec004180, qd_server=0x27a5f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399 #10 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627 #11 0x00007f37ff4a6cf0 in qd_server_run (qd=0x24ee030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:973 #12 0x0000000000401b38 in main_process (config_path=config_path@entry=0x7ffc50f4a695 "/root/bz1553116_reproducer/qdrouterd.caps03.conf", python_pkgdir=python_pkgdir@entry=0x402261 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135 #13 0x00000000004017b0 in main (argc=3, argv=0x7ffc50f49098) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335 or: #0 pni_record_find (record=0x7fe7846172d0, record=0x7fe7846172d0, key=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71 #1 pn_record_set (record=0x7fe7846172d0, key=key@entry=0, value=value@entry=0x0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:132 #2 0x00007fe797fca351 in pn_delivery_set_context (delivery=<optimized out>, context=context@entry=0x0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1562 #3 0x00007fe79820d24d in qd_delivery_free_LH (delivery=0x7fe78402b8e0, final_disposition=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:1093 #4 0x00007fe7982193da in router_writable_link_handler (context=0x1e8d9b0, link=0x7fe77c341220) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:571 #5 0x00007fe79820cb9e in do_writable (pn_link=0x2381120) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234 #6 process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625 #7 handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675 #8 0x00007fe79821ec7c in process_connector (cxtr=0x7fe784004180, qd_server=0x1ef6f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399 #9 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627 #10 0x00007fe797d8fe25 in start_thread (arg=0x7fe78b1a6700) at pthread_create.c:308 #11 0x00007fe7972e534d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 or also (quite different - maybe b'cos I started multiple routers with the same config?): #0 0x00007f89fdaad4d9 in router_disposition_handler (context=0x21c99b0, link=<optimized out>, delivery=0x7f89e0241560) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1110 #1 0x00007f89fdaa15ce in do_updated (pnd=0x7f89d8464040) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:286 #2 process_handler (unused=<optimized out>, qd_conn=0x2585430, container=0x21941d0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:582 #3 handler (handler_context=0x21941d0, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x2585430) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675 #4 0x00007f89fdab3c7c in process_connector (cxtr=0x7f89e08056f0, qd_server=0x2232f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399 #5 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627 #6 0x00007f89fd624e25 in start_thread (arg=0x7f89f023a700) at pthread_create.c:308 #7 0x00007f89fcb7a34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 or (at customer only): #0 pn_add_tpwork (delivery=0x7fdf44a2fc90) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:677 #1 0x00007fdf70029d3e in pn_delivery_update (delivery=<optimized out>, state=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1928 #2 0x00007fdf702783b1 in router_writable_link_handler (context=0x25263f0, link=0x7fdf2444e8d0) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:562 #3 0x00007fdf7026bb9e in do_writable (pn_link=0x26a2360) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234 #4 process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625 #5 handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675 #6 0x00007fdf7027dc7c in process_connector (cxtr=0x7fdf2c000dd0, qd_server=0x22e13a0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399 #7 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627 #8 0x00007fdf6fdeee25 in start_thread (arg=0x7fdf58ff9700) at pthread_create.c:308 #9 0x00007fdf6f34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 (technicaly the message accept doesnt work, despite I call the method, _no_ message is really discarded from the broker - I guess this causes the difference between mine and customer's BT)
qdrouterd 0.4-31 fixes the above reproducer from #c4 .
** Satellite 6.2 Hotfix Instructions ** 1) Download qpid-dispatch-0.4-31.tar attachment from this bugzilla to your Satellite 6.2 server 2) Untar archive 3) rpm -Uvh *.rpm 4) katello-service restart 5) Resume operations
Created attachment 1470793 [details] qpid-dispatch-0.4-31.tar
Mike, thank you very much for providing this! Best regards, Michael
Removing 6.2.z flag since 6.2.z is now in maintenance phase 2 [1]. The bug will remain flagged for 6.3.z. [1] https://access.redhat.com/support/policy/updates/satellite
This was addressed in 6.6 by https://bugzilla.redhat.com/show_bug.cgi?id=1561876. Since 6.3 is EOL this will not be backported.