Bug 1553116 - any qdrouterd 0.4 segfaults under load of clients (dis)connects
Summary: any qdrouterd 0.4 segfaults under load of clients (dis)connects
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.2.14
Hardware: x86_64
OS: Linux
urgent
urgent
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Radovan Drazny
URL:
Whiteboard:
Depends On:
Blocks: 1553189
TreeView+ depends on / blocked
 
Reported: 2018-03-08 10:51 UTC by Pavel Moravec
Modified: 2022-03-13 14:45 UTC (History)
17 users (show)

Fixed In Version: qpid-dispatch-0.4-31
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1553189 (view as bug list)
Environment:
Last Closed: 2019-11-05 19:19:36 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer configs and script (2.83 KB, application/x-gzip)
2018-03-08 11:32 UTC, Pavel Moravec
no flags Details
qpid-dispatch-0.4-31.tar (300.00 KB, application/x-tar)
2018-07-26 14:26 UTC, Mike McCune
no flags Details

Description Pavel Moravec 2018-03-08 10:51:06 UTC
Description of problem:
Having more Capsules (10 in reproducer) and a Satellite, and *many* goferd clients frequently re-connecting, qdrouterd segfaults within 10 minutes. Sometimes on a Capsule, sometimes on the Satellite.

Particular detailed reproducer outside Satellite will follow, basically if one mimics goferd activity by frequent creating receiver link-routed to qpidd, and one does it massivelly in parallel, the segfault will happen.


Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-21.el7.x86_64
qpid-dispatch-router-0.4-29.el7sat.x86_64


How reproducible:
100% in 10 minutes


Steps to Reproduce:
1. See above, detailed reproducer in next comment (with script, configs of routers etc)


Actual results:
either qdrouterd segfaults with few possible backtraces of 2 basic variants:

1)
(gdb) bt
#0  qd_alloc (desc=desc@entry=0x7fc8ad51a760 <__desc_qd_router_link_t>, tpool=0x7fc89fa906a0) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:162
#1  0x00007fc8ad3043c7 in new_qd_router_link_t () at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60
#2  0x00007fc8ad30542f in qd_router_attach_routed_link (context=0x7fc894ab96b0, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1182
#3  0x00007fc8ad3091d9 in invoke_deferred_calls (conn=conn@entry=0x7fc89000cb30, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305
#4  0x00007fc8ad309a9d in process_connector (cxtr=0x7fc8900102d0, qd_server=0x1553bf0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#6  0x00007fc8ace7ae25 in start_thread (arg=0x7fc89fa90700) at pthread_create.c:308
#7  0x00007fc8ac3d034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) list
157	        desc->stats->batches_rebalanced_to_threads++;
158	        desc->stats->held_by_threads += desc->config->transfer_batch_size;
159	        for (idx = 0; idx < desc->config->transfer_batch_size; idx++) {
160	            item = DEQ_HEAD(desc->global_pool->free_list);
161	            DEQ_REMOVE_HEAD(desc->global_pool->free_list);
162	            DEQ_INSERT_TAIL(pool->free_list, item);
163	        }
164	    } else {
165	        //
166	        // Allocate a full batch from the heap and put it on the thread list.
(gdb) p item
$1 = (qd_alloc_item_t *) 0x0
(gdb) p desc->global_pool->free_list
$2 = {head = 0x0, tail = 0x0, scratch = 0x7fc8929d6be0, size = 13557}
(gdb) 

2)
(gdb) bt
#0  qd_dealloc (desc=desc@entry=0x7f0fc187e760 <__desc_qd_router_link_t>, tpool=<optimized out>, p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/alloc.c:244
#1  0x00007f0fc16683f7 in free_qd_router_link_t (p=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:60
#2  0x00007f0fc16692ee in qd_router_link_free_LH (rlink=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:433
#3  0x00007f0fc1669bb0 in qd_router_detach_routed_link (context=0x7f0fa440ed70, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1258
#4  0x00007f0fc166d206 in invoke_deferred_calls (conn=conn@entry=0x27a1cb0, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:305
#5  0x00007f0fc166da9d in process_connector (cxtr=0x7f0fac0052b0, qd_server=0x270e1f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#6  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#7  0x00007f0fc11dee25 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f0fc073434d in clone () from /lib64/libc.so.6
(gdb) list
239	    desc->stats->batches_rebalanced_to_global++;
240	    desc->stats->held_by_threads -= desc->config->transfer_batch_size;
241	    for (idx = 0; idx < desc->config->transfer_batch_size; idx++) {
242	        item = DEQ_HEAD(pool->free_list);
243	        DEQ_REMOVE_HEAD(pool->free_list);
244	        DEQ_INSERT_TAIL(desc->global_pool->free_list, item);
245	    }
246	
247	    //
248	    // If there's a global_free_list size limit, remove items until the limit is
(gdb) p item
$1 = (qd_alloc_item_t *) 0x0
(gdb) p pool->free_list
$2 = {head = 0x0, tail = 0x0, scratch = 0x7f0fac067500, size = 99}
(gdb) 



Expected results:
no such segfault


Additional info:
Few interesting observations:
- technical cause for segfault: we try to DEQ_INSERT_TAIL a null object, fetched from a free_list dequeue; that happened because that dequeue was empty (head = tail = 0x0) BUT of positive size (how this ever could happen???) so there was an assumption we can dequeue something
- the above problem affects both desc->global_pool->free_list and pool->free_list, but always when manipulating with links
- when running reproducer, and stopping *all* clients, qpidd still gets some lnik-routed subscriptions to the relevant queues, from some Capsule's qdrouterd (qpid-stat -q shows positive number of subs)
  - when restarting qdrouterd on Capsules, these subscriptions are gone
- that leads me to a deduction there is a leak in "links routed further" (i.e. links from Caps router to Sat router, maintained on Caps router; or links from Sat router to qpidd maintained by Sat router) - this is evident in qpid-stat and imho triggers the weird behaviour in memory management of link objects (the dequeues)

Comment 2 Pavel Moravec 2018-03-08 11:32:46 UTC
Created attachment 1405823 [details]
reproducer configs and script

To reproduce:

- unpack and cd to the dir

- ulimit -c unlimited   # to get coredumps, if interested)

- run start_reproducer.sh
  - very first run will customize configs per your hostname
  - it will start and provision qpidd + "hub" router + 10 Capsule routers

- then run on 1 or rather 2 systems the clients (per the script output)

- I recommend using a different machine(s) for the clients, as it generates load >10 :-/

- optionally, check "qpid-stat -q" - either queue should have (close to) 100 consumers for each system running the clients (i.e. if I use 2 systems running the clients, each queue will have 190-200 consumers). If # of consumers is higher, you hit the symptom with link leak already

- wait up to 10 minutes, a segfault will come (of either router)

Comment 4 Pavel Moravec 2018-03-24 09:51:11 UTC
Improved reproducer that leads to segfault even on 0.4-30 (where the original reproducer does not crash the router):


1) run same setup like before, just add some messages to the tested queues at the end:

for i in $(seq 0 19); do 
    qpid-config bind amq.fanout pulp.bz1553116.${i}
done

qpid-send -a amq.fanout -m10 --content-string="hello"


2) as a client script mimicking goferd, use this one:

import random
import threading
import traceback
import sys

from proton import Timeout
from proton import ConnectionException
from proton.utils import BlockingConnection
from proton import SSLDomain

from time import sleep
from uuid import uuid4
 
ROUTER_ADDRESS = "proton+amqp://pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:%s"
ADDRESS = "pulp.bz1553116"
HEARTBEAT = 5
SLEEP_MIN = 1.0
SLEEP_MAX = 3.0
THREADS = 20
ADDRESSES = 20
 
class ReceiverThread(threading.Thread):
    def __init__(self, port, address=ADDRESS, domain=None):
        super(ReceiverThread, self).__init__()
        self.port = port
        self.address = address
        self.domain = domain
        self.running = True

    def subscribe(self):
	self.recv = self.conn.create_receiver('%s' %(self.address), name=str(uuid4()), dynamic=False, options=None)
 
    def connect(self):
        self.conn = BlockingConnection(ROUTER_ADDRESS % (self.port), ssl_domain=self.domain, heartbeat=HEARTBEAT)
	self.subscribe()
 
    def reconnect(self):
        try:
            self.recv.close()
	except:
	    pass
        try:
    	    self.conn.close()
	except:
            pass
        self.connect()
 
    def run(self):
        self.connect()
        while self.running:
	    try:
                msg = None
                timeout = random.uniform(SLEEP_MIN,SLEEP_MAX)
		msg = self.recv.receive(timeout)
		if msg:
	            print "(%s): accepting msg %s" % (self.port, msg)
		    self.recv.accept()
		    sleep(timeout)
	    except KeyboardInterrupt:
                self.running = False
		pass
            except Timeout:
		pass
            try:
		print "(%s): bouncing receiver" % self.port
                self.recv.close()
		self.subscribe()
	    except:
		print "(%s): receiver thread: failed to close receiver" % self.port
		self.reconnect()
		pass
 
    def stop(self):
        self.running = False
 
threads = []
for i in range(THREADS):
  threads.append(ReceiverThread(sys.argv[1], '%s.%s' %(ADDRESS, i%ADDRESSES)))
  threads[i].start()

running = True
while running:
    try:
	sleep(10)
    except KeyboardInterrupt:
        running = False
	pass
#_in = raw_input("Press Enter to exit:")
 
for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()




3) in very few minutes, segfaults will happen with either backtrace:

#0  0x0000000000000000 in ?? ()
#1  0x00007f37ff244788 in pn_class_incref (clazz=<optimized out>, object=object@entry=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:75
#2  0x00007f37ff2449f2 in pn_incref (object=object@entry=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:247
#3  0x00007f37ff251a96 in pn_delivery_settle (delivery=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1778
#4  0x00007f37ff494256 in qd_delivery_free_LH (delivery=0x7f37ec0153d0, final_disposition=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:1094
#5  0x00007f37ff4a03da in router_writable_link_handler (context=0x273c9b0, link=0x28d13d0) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:571
#6  0x00007f37ff493b9e in do_writable (pn_link=0x299fea0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234
#7  process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625
#8  handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675
#9  0x00007f37ff4a5c7c in process_connector (cxtr=0x7f37ec004180, qd_server=0x27a5f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399
#10 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#11 0x00007f37ff4a6cf0 in qd_server_run (qd=0x24ee030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:973
#12 0x0000000000401b38 in main_process (config_path=config_path@entry=0x7ffc50f4a695 "/root/bz1553116_reproducer/qdrouterd.caps03.conf", 
    python_pkgdir=python_pkgdir@entry=0x402261 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
#13 0x00000000004017b0 in main (argc=3, argv=0x7ffc50f49098) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335

or:

#0  pni_record_find (record=0x7fe7846172d0, record=0x7fe7846172d0, key=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
#1  pn_record_set (record=0x7fe7846172d0, key=key@entry=0, value=value@entry=0x0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:132
#2  0x00007fe797fca351 in pn_delivery_set_context (delivery=<optimized out>, context=context@entry=0x0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1562
#3  0x00007fe79820d24d in qd_delivery_free_LH (delivery=0x7fe78402b8e0, final_disposition=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:1093
#4  0x00007fe7982193da in router_writable_link_handler (context=0x1e8d9b0, link=0x7fe77c341220) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:571
#5  0x00007fe79820cb9e in do_writable (pn_link=0x2381120) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234
#6  process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625
#7  handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675
#8  0x00007fe79821ec7c in process_connector (cxtr=0x7fe784004180, qd_server=0x1ef6f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399
#9  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#10 0x00007fe797d8fe25 in start_thread (arg=0x7fe78b1a6700) at pthread_create.c:308
#11 0x00007fe7972e534d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

or also (quite different - maybe b'cos I started multiple routers with the same config?):

#0  0x00007f89fdaad4d9 in router_disposition_handler (context=0x21c99b0, link=<optimized out>, delivery=0x7f89e0241560) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1110
#1  0x00007f89fdaa15ce in do_updated (pnd=0x7f89d8464040) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:286
#2  process_handler (unused=<optimized out>, qd_conn=0x2585430, container=0x21941d0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:582
#3  handler (handler_context=0x21941d0, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x2585430) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675
#4  0x00007f89fdab3c7c in process_connector (cxtr=0x7f89e08056f0, qd_server=0x2232f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399
#5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#6  0x00007f89fd624e25 in start_thread (arg=0x7f89f023a700) at pthread_create.c:308
#7  0x00007f89fcb7a34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

or (at customer only):

#0  pn_add_tpwork (delivery=0x7fdf44a2fc90) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:677
#1  0x00007fdf70029d3e in pn_delivery_update (delivery=<optimized out>, state=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1928
#2  0x00007fdf702783b1 in router_writable_link_handler (context=0x25263f0, link=0x7fdf2444e8d0) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:562
#3  0x00007fdf7026bb9e in do_writable (pn_link=0x26a2360) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234
#4  process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625
#5  handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675
#6  0x00007fdf7027dc7c in process_connector (cxtr=0x7fdf2c000dd0, qd_server=0x22e13a0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
#8  0x00007fdf6fdeee25 in start_thread (arg=0x7fdf58ff9700) at pthread_create.c:308
#9  0x00007fdf6f34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


(technicaly the message accept doesnt work, despite I call the method, _no_ message is really discarded from the broker - I guess this causes the difference between mine and customer's BT)

Comment 5 Pavel Moravec 2018-03-25 21:14:35 UTC
qdrouterd 0.4-31 fixes the above reproducer from #c4 .

Comment 17 Mike McCune 2018-07-26 14:25:55 UTC
** Satellite 6.2 Hotfix Instructions ** 

1) Download qpid-dispatch-0.4-31.tar attachment from this bugzilla to your Satellite 6.2 server

2) Untar archive 

3) rpm -Uvh *.rpm

4) katello-service restart

5) Resume operations

Comment 18 Mike McCune 2018-07-26 14:26:24 UTC
Created attachment 1470793 [details]
qpid-dispatch-0.4-31.tar

Comment 19 Michael Schwabe 2018-07-26 14:45:33 UTC
Mike,
thank you very much for providing this!

Best regards,
Michael

Comment 24 Bryan Kearney 2018-11-26 19:33:09 UTC
Removing 6.2.z flag since 6.2.z is now in maintenance phase 2 [1]. The bug will remain flagged for 6.3.z.


[1] https://access.redhat.com/support/policy/updates/satellite

Comment 25 Bryan Kearney 2019-11-05 19:19:36 UTC
This was addressed in 6.6 by https://bugzilla.redhat.com/show_bug.cgi?id=1561876. Since 6.3 is EOL this will not be backported.


Note You need to log in before you can comment on or make changes to this bug.