Description of problem:
Under unknown circumstances (some events are noted below), qdrouterd segfaulted when many clients were connecting to it.

Version-Release number of selected component (if applicable):
libqpid-dispatch-0.4-13.el7sat.x86_64
qpid-dispatch-router-0.4-13.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64

How reproducible:
???

Steps to Reproduce:
???

Actual results:
segfault with backtrace:

(gdb) bt
#0  0x00007f9a412f95f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f9a412face8 in __GI_abort () at abort.c:90
#2  0x00007f9a41339327 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f9a41443488 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007f9a41341053 in malloc_printerr (ar_ptr=0x7f99b4000020, ptr=<optimized out>, str=0x7f9a41443588 "double free or corruption (!prev)", action=3) at malloc.c:5022
#4  _int_free (av=0x7f99b4000020, p=<optimized out>, have_lock=0) at malloc.c:3842
#5  0x00007f9a4208d806 in pn_class_decref (clazz=0x7f9a422c12e0 <clazz.4933>, object=0x7f99b402af60) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:103
#6  0x00007f9a4209b580 in pn_event_finalize (event=0x7f99d40847f0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:190
#7  pn_event_finalize_cast (object=0x7f99d40847f0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:235
#8  0x00007f9a4208d7e8 in pn_class_decref (clazz=0x7f9a422c1460 <clazz.2272>, object=0x7f99d40847f0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:97
#9  0x00007f9a4208da12 in pn_decref (object=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:252
#10 0x00007f9a4209b722 in pn_collector_pop (collector=collector@entry=0x20dad80) at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:167
#11 0x00007f9a422daf00 in process_handler (unused=<optimized out>, qd_conn=0x7f9a2800cb30, container=0x1fd3e20) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:422
#12 handler (handler_context=0x1fd3e20, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f9a2800cb30) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:486
#13 0x00007f9a422edb9c in process_connector (cxtr=0x7f9a28010270, qd_server=0x1fe37d0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#14 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#15 0x00007f9a41e5fdc5 in start_thread (arg=0x7f9a227f4700) at pthread_create.c:308
#16 0x00007f9a413baced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Expected results:
no segfault

Additional info:
/var/log/messages from the relevant time:

Aug 10 08:29:10 ip-10-1-1-2 qdrouterd: Wed Aug 10 08:29:10 2016 ROUTER_LS (info) Router Link Lost - link_id=3
Aug 10 08:29:10 ip-10-1-1-2 qpidd: 2016-08-10 08:29:10 [Protocol] error Error on attach: Node not found: pulp.agent.bb79963e-92e2-4020-a0db-d34d082b0eb7

(the error on attach repeated multiple times, until..)

Aug 10 08:29:11 ip-10-1-1-2 qdrouterd: *** Error in `/usr/sbin/qdrouterd': double free or corruption (!prev): 0x00007f99b402af50 ***
Standalone reproducer:

1) Configure link routing to qpidd to route pulp.*

2) Run the script below 10 times in parallel. It tries to create a receiver to qdrouterd/qpidd, but the broker does not have such a queue (i.e. qpidd prints a "Node not found" error):

#!/usr/bin/python
from time import sleep
from uuid import uuid4

from proton.utils import BlockingConnection, LinkDetached

routerURL = "proton+amqp://0.0.0.0:5648"

conn = BlockingConnection(routerURL, ssl_domain=None, heartbeat=2)

while True:
    sleep(0.05)
    try:
        # Attach a receiver to a random pulp.* address that has no queue on the broker.
        rcv = conn.create_receiver("pulp."+str(uuid4()), name=str(uuid4()))
        rcv.close()
    except LinkDetached as e:
        print(e)
        # The attach was refused; drop the connection and open a fresh one.
        if conn:
            conn.close()
        conn = BlockingConnection(routerURL, ssl_domain=None, heartbeat=2)

<end-of-the-script>

This segfault is usually not expected to happen in a Sat6 environment, since it relies on a _missing_ pulp.agent.* queue that goferd tries to subscribe to. Usually, goferd should create its queue during startup.
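Step 2 of the reproducer (running the script 10 times in parallel) could be driven by a small launcher such as the one below. This is only a sketch: the function names and the file name "reproducer.py" are hypothetical, and it assumes the reproducer script has been saved next to the launcher.

```python
import subprocess
import sys


def python_command(script):
    """Build the command line that runs `script` with the current interpreter."""
    return [sys.executable, script]


def launch_copies(command, copies=10):
    """Start `copies` independent processes running `command`.

    Returns the list of subprocess.Popen handles so the caller can wait
    for the processes or terminate them.
    """
    return [subprocess.Popen(command) for _ in range(copies)]


def wait_all(procs):
    """Block until every process has exited; return the exit codes in order."""
    return [p.wait() for p in procs]


def terminate_all(procs):
    """Ask every process that is still running to stop."""
    for p in procs:
        if p.poll() is None:
            p.terminate()
```

Typical use would be procs = launch_copies(python_command("reproducer.py")), leaving the copies running until qdrouterd aborts, and then calling terminate_all(procs), because the reproducer loops forever.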
*** Bug 1366231 has been marked as a duplicate of this bug. ***
May need to keep this assigned to tross. The mitigation possible by goferd is to re-create the queue when getting LinkDetached with condition = amqp:not-found. This means goferd could still try to create a receiver (Link) when the queue does not exist and crash the router. Note: This can only happen in cases where the queue existed (or was created by goferd on startup) and then disappeared.
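The mitigation described above might look roughly like the sketch below. It is not goferd code: LinkDetached here is a local stand-in for proton.utils.LinkDetached that carries only a condition name, and open_receiver / create_queue are hypothetical callables supplied by the caller (for example, wrappers around the messaging and broker-management calls the agent already uses).

```python
NOT_FOUND = "amqp:not-found"


class LinkDetached(Exception):
    """Stand-in for proton.utils.LinkDetached (assumption: only the
    AMQP error condition name is needed for this sketch)."""

    def __init__(self, condition):
        super().__init__(condition)
        self.condition = condition


def attach_with_recreate(open_receiver, create_queue, address, max_attempts=3):
    """Open a receiver on `address`, re-creating the queue if it is missing.

    open_receiver(address) -> receiver; raises LinkDetached on a refused attach
    create_queue(address)  -> None; (re)creates the queue on the broker
    Both callables are hypothetical hooks, not part of any real API.
    """
    for _ in range(max_attempts):
        try:
            return open_receiver(address)
        except LinkDetached as e:
            if e.condition != NOT_FOUND:
                # Some other failure; re-creating the queue would not help.
                raise
            # The queue is gone: re-create it before attaching again,
            # instead of repeating the failing attach unchanged.
            create_queue(address)
    raise RuntimeError("could not attach to %s after %d attempts"
                       % (address, max_attempts))
```

Note that the first attach still reaches the router while the queue is missing, so this only limits how often the failing attach is repeated. It does not remove the router-side defect.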
(In reply to Jeff Ortel from comment #8)
> May need to keep this assigned to tross. The mitigation possible by goferd
> is to re-create the queue when getting LinkDetached with condition =
> amqp:not-found. This means goferd could still try to create a receiver
> (Link) when the queue does not exist and crash the router.
>
> Note: This can only happen in cases where the queue existed (or was created
> by goferd on startup) and then disappeared.

+1. The primary problem is qdrouterd segfaulting in some scenario. goferd can be improved as Jeff suggests, since the repeated link failures from the same agent increased the probability of the failure/segfault.
Created attachment 1193891
(gdb) thread apply all bt
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2699
*** Bug 1385890 has been marked as a duplicate of this bug. ***