Bug 1561876

Summary: qdrouterd crashes when burst of requests arise from katello-agent clients
Product: Red Hat Satellite
Reporter: sbadhwar
Component: Qpid
Assignee: Ganesh Murthy <gmurthy>
Status: CLOSED ERRATA
QA Contact: Radovan Drazny <rdrazny>
Severity: high
Priority: high
Version: 6.3.0
CC: aconway, alsouza, ehelms, fbroussy, gmurthy, jhutar, joboyer, mcressma, mmccune, pdwyer, pmoravec, psuriset, rchauhan, rdrazny, sthirugn, vvasilev, wpinheir, zhunting
Target Milestone: 6.6.0
Keywords: Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Last Closed: 2019-10-22 12:46:42 UTC
Type: Bug
Attachments:
qpid-dispatch-router-0.8.0-21.tar.gz

Comment 1 Alan Conway 2018-03-29 15:11:39 UTC
max_channels can't be increased by much: it is defined by the AMQP protocol as a 16-bit integer, and I believe dispatch reserves the high bit for internal purposes. However, you can create far more senders/receivers per session than you can create sessions per connection (I think that limit is pretty much available memory), so I'd suggest sharing one or a few sessions across many senders/receivers, along the lines of the sketch below.
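
For illustration, a minimal sketch of that pattern using the proton-c engine API (the link names and the target address are made up, and error handling is omitted):

    /* Sketch: many links sharing a single session, so only one AMQP channel
     * is consumed regardless of how many senders are opened. */
    #include <stdio.h>
    #include <proton/connection.h>
    #include <proton/session.h>
    #include <proton/link.h>
    #include <proton/terminus.h>

    void open_senders_on_one_session(pn_connection_t *conn, int n)
    {
        pn_session_t *ssn = pn_session(conn);       /* one session = one channel */
        pn_session_open(ssn);
        for (int i = 0; i < n; i++) {
            char name[64];
            snprintf(name, sizeof(name), "sender-%d", i);   /* illustrative name */
            pn_link_t *snd = pn_sender(ssn, name);  /* links do not consume channels */
            pn_terminus_set_address(pn_link_target(snd), "example.target.address");
            pn_link_open(snd);
        }
    }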

The question is: is it satellite/katello code that is creating such a large number of sessions? If so, I would fix it there; if not, we need to figure out why the router is creating so many sessions.

I've raised these two issues upstream for the proton and router bugs:

https://issues.apache.org/jira/browse/PROTON-1819
https://issues.apache.org/jira/browse/DISPATCH-952

You might want to try this with the latest router release (1.0.1), but I didn't find any sign that such a problem was fixed recently. It would help if you could get a stack trace from the router core dump:

    gdb -batch -quiet -ex "thread apply all bt full" -ex "quit" "$@" <executable> <corefile>

Comment 2 sbadhwar 2018-03-30 06:50:14 UTC
After some more analysis of the coredump generated by the crash, this is what I could interpret:

It seems that pn_session_set_incoming_capacity() in qpid-proton can receive a NULL session pointer. The function then tries to assign the capacity to that session and, with no effective null-pointer check in the way, the code segfaults (see the sketch below).
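
Note that engine.c does have an assert(ssn) on the line just before the crash site (visible in the listing further down); the fact that execution still reached line 1059 suggests the shipped proton build was compiled with NDEBUG, which turns assert() into a no-op. A simplified, self-contained sketch of that effect (the struct here is a stand-in, not the real pn_session_t layout):

    /* Sketch only: why assert(ssn) does not protect a release build.
     * With -DNDEBUG the assert expands to nothing and the NULL pointer
     * is dereferenced on the next line, producing SIGSEGV. */
    #include <assert.h>
    #include <stddef.h>

    typedef struct { size_t incoming_capacity; } session_t;   /* simplified stand-in */

    void set_incoming_capacity(session_t *ssn, size_t capacity)
    {
        assert(ssn);                        /* compiled out when NDEBUG is defined */
        ssn->incoming_capacity = capacity;  /* ssn == NULL -> segfault */
    }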

Here is the raw analysis of the coredump:
[root@c02-h10-r620-vm1 ccpp-2018-03-29-16:18:59-6747]# gdb /usr/sbin/qdrouterd coredump 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/qdrouterd...Reading symbols from /usr/lib/debug/usr/sbin/qdrouterd.debug...done.
done.
[New LWP 6751]
[New LWP 6748]
[New LWP 6752]
[New LWP 6753]
[New LWP 6747]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf'.
Program terminated with signal 11, Segmentation fault.
#0  pn_session_set_incoming_capacity (ssn=0x0, capacity=35184372072448) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:1059
1059	  ssn->incoming_capacity = capacity;
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-21.el7.x86_64 glibc-2.17-196.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libdb-5.3.21-21.el7_4.x86_64 libffi-3.0.13-18.el7.x86_64 libselinux-2.5-11.el7.x86_64 libwebsockets-2.1.0-3.el7.x86_64 nss-softokn-freebl-3.28.3-6.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 python-libs-2.7.5-58.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  pn_session_set_incoming_capacity (ssn=0x0, capacity=35184372072448) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:1059
#1  0x00007f52ee0de62c in qd_link (node=0x20bc390, conn=conn@entry=0x7f52d7a709d0, dir=QD_INCOMING, name=name@entry=0x7f52dc193b30 "38802e68-2a81-4698-80d5-9d0c16a2b109")
    at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:831
#2  0x00007f52ee0fd269 in CORE_link_first_attach (context=0x2357490, conn=<optimized out>, link=0x7f52b86f36d0, source=0x7f52c40189d0, target=0x7f52c4018ad0)
    at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:849
#3  0x00007f52ee0f1b32 in qdr_connection_process (conn=0x29409d0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:214
#4  0x00007f52ee0dd298 in writable_handler (container=0x20137b0, container=0x20137b0, conn=<optimized out>, qd_conn=0x7f52d7a709d0) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#5  handler (handler_context=0x20137b0, conn_context=0x21eb0b0, event=<optimized out>, qd_conn=0x7f52d7a709d0) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
#6  0x00007f52ee101deb in process_connector (cxtr=<optimized out>, qd_server=0x2367ab0) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#8  0x00007f52edc58e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f52ecf8c34d in clone () from /lib64/libc.so.6
(gdb) frame 0
#0  pn_session_set_incoming_capacity (ssn=0x0, capacity=35184372072448) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:1059
1059	  ssn->incoming_capacity = capacity;
(gdb) list
1054	
1055	void pn_session_set_incoming_capacity(pn_session_t *ssn, size_t capacity)
1056	{
1057	  assert(ssn);
1058	  // XXX: should this trigger a flow?
1059	  ssn->incoming_capacity = capacity;
1060	}
1061	
1062	size_t pn_session_get_outgoing_window(pn_session_t *ssn)
1063	{
(gdb) info locals
No locals.
(gdb) info ssn
Undefined info command: "ssn".  Try "help info".
(gdb) print ssn
$1 = (pn_session_t *) 0x0
(gdb) print capacity
$2 = 35184372072448
(gdb) 


And here is some more data on a per-thread basis:
[root@c02-h10-r620-vm1 ccpp-2018-03-29-16:18:59-6747]# gdb -batch -quiet -ex "thread apply all bt full" -ex "quit" "$@" /usr/sbin/qdrouterd coredump 
[New LWP 6751]
[New LWP 6748]
[New LWP 6752]
[New LWP 6753]
[New LWP 6747]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf'.
Program terminated with signal 11, Segmentation fault.
#0  pn_session_set_incoming_capacity (ssn=0x0, capacity=35184372072448) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:1059
1059	  ssn->incoming_capacity = capacity;

Thread 5 (Thread 0x7f52ee52a1c0 (LWP 6747)):
#0  0x00007f52ecf81a3d in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f52ee0e9d6a in poll (__timeout=<optimized out>, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:46
No locals.
#2  qdpn_driver_wait_2 (d=0x20b7190, timeout=<optimized out>, timeout@entry=246) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/driver.c:914
        result = <optimized out>
#3  0x00007f52ee101629 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:920
        duration = 246
        tv = {tv_sec = 1522354739, tv_nsec = 6739792}
        milliseconds = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = <optimized out>
        ctx = <optimized out>
        error = 0
        poll_result = <optimized out>
        qd_server = 0x2367ab0
#4  0x00007f52ee102606 in qd_server_run (qd=0x1d72030) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1421
        qd_server = 0x2367ab0
        i = <optimized out>
#5  0x0000000000401bd7 in main_process (config_path=config_path@entry=0x7ffd0235ff2d "/etc/qpid-dispatch/qdrouterd.conf", python_pkgdir=python_pkgdir@entry=0x402321 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.8.0/router/src/main.c:147
        st = {st_dev = 64768, st_ino = 2642725, st_nlink = 3, st_mode = 16877, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 36, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1522352459, tv_nsec = 652543987}, st_mtim = {tv_sec = 1522047647, tv_nsec = 727721790}, st_ctim = {tv_sec = 1522047647, tv_nsec = 727721790}, __unused = {0, 0, 0}}
        d = <optimized out>
#6  0x0000000000401844 in main (argc=3, argv=0x7ffd0235e318) at /usr/src/debug/qpid-dispatch-0.8.0/router/src/main.c:353
        config_path = 0x7ffd0235ff2d "/etc/qpid-dispatch/qdrouterd.conf"
        python_pkgdir = 0x402321 "/usr/lib/qpid-dispatch/python"
        pidfile = 0x0
        user = 0x0
        daemon_mode = false
        long_options = {{name = 0x40239c "config", has_arg = 1, flag = 0x0, val = 99}, {name = 0x4023a3 "include", has_arg = 1, flag = 0x0, val = 73}, {name = 0x4023ab "daemon", has_arg = 0, flag = 0x0, val = 100}, {name = 0x4023b2 "pidfile", has_arg = 1, flag = 0x0, val = 80}, {name = 0x4023ba "user", has_arg = 1, flag = 0x0, val = 85}, {name = 0x4023bf "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x4023c4 "version", has_arg = 0, flag = 0x0, val = 118}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}

Thread 4 (Thread 0x7f52daffd700 (LWP 6753)):
#0  0x00007f52edc5c945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f52ee0ea3d9 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
        result = <optimized out>
        __PRETTY_FUNCTION__ = "sys_cond_wait"
#2  0x00007f52ee101a4c in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:894
        timer = <optimized out>
        thread = <optimized out>
        work = 0x0
        cxtr = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x2367ab0
#3  0x00007f52edc58e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f52ecf8c34d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x7f52db7fe700 (LWP 6752)):
#0  0x00007f52ec898a8c in bn_sqr8x_internal () from /lib64/libcrypto.so.10
No symbol table info available.
#1  0x00007f52ec89828f in bn_power5 () from /lib64/libcrypto.so.10
No symbol table info available.
#2  0x00007f529ce70980 in ?? ()
No symbol table info available.
#3  0x00007f52db7fc8b0 in ?? ()
No symbol table info available.
#4  0x00007f529ce70700 in ?? ()
No symbol table info available.
#5  0x00007f52db7fcda0 in ?? ()
No symbol table info available.
#6  0xf2419f003f10e605 in ?? ()
No symbol table info available.
#7  0x00007f52db7fca88 in ?? ()
No symbol table info available.
#8  0x6fc5cec5a2a3a62b in ?? ()
No symbol table info available.
#9  0xe0db79f593bc9d05 in ?? ()
No symbol table info available.
#10 0x4a472f35cc80462c in ?? ()
No symbol table info available.
#11 0xa2645e0a98f250c5 in ?? ()
No symbol table info available.
#12 0x0050eea563b319e5 in ?? ()
No symbol table info available.
#13 0x04a7f32401ece035 in ?? ()
No symbol table info available.
#14 0x557ddf34f4d690af in ?? ()
No symbol table info available.
#15 0xf7538e4e4db8b215 in ?? ()
No symbol table info available.
#16 0x3ca72573e5e20225 in ?? ()
No symbol table info available.
#17 0xd702b6535e0b400c in ?? ()
No symbol table info available.
#18 0x66753d1eb1acde85 in ?? ()
No symbol table info available.
#19 0xfab103830325f2aa in ?? ()
No symbol table info available.
#20 0x39a9029ef5ea3b0d in ?? ()
No symbol table info available.
#21 0x0a39fc4fd93d9735 in ?? ()
No symbol table info available.
#22 0x2c3d6242bebd8766 in ?? ()
No symbol table info available.
#23 0xa5273d7fa9652d71 in ?? ()
No symbol table info available.
#24 0x7eb2d4c2923500f9 in ?? ()
No symbol table info available.
#25 0x629add77bbf2407e in ?? ()
No symbol table info available.
#26 0x703d566108428fdd in ?? ()
No symbol table info available.
#27 0xbfd899f7fc8c09ec in ?? ()
No symbol table info available.
#28 0xa32b0f913d1a8dd5 in ?? ()
No symbol table info available.
#29 0xfa7aabe9284f48ab in ?? ()
No symbol table info available.
#30 0xd0a3e51802caabfa in ?? ()
No symbol table info available.
#31 0xab7b075fcaf49461 in ?? ()
No symbol table info available.
#32 0x1cc659614865ad53 in ?? ()
No symbol table info available.
#33 0xee0f166994cdec1b in ?? ()
No symbol table info available.
#34 0x9fd984db1b5ac2c3 in ?? ()
No symbol table info available.
#35 0xad0e2770cab61d28 in ?? ()
No symbol table info available.
#36 0xd7c3140b230282c4 in ?? ()
No symbol table info available.
#37 0x4655396352f8ec67 in ?? ()
No symbol table info available.
#38 0x7de4114bbf347a9c in ?? ()
No symbol table info available.
#39 0x05210b1b7908ec70 in ?? ()
No symbol table info available.
#40 0x2f0ffad65ed51413 in ?? ()
No symbol table info available.
#41 0xb04cd15d69a9af54 in ?? ()
No symbol table info available.
#42 0x335e704bcadb8797 in ?? ()
No symbol table info available.
#43 0x5a38950577aa8db5 in ?? ()
No symbol table info available.
#44 0x7d66bf7d3ef0525b in ?? ()
No symbol table info available.
#45 0xc253cce773e649ab in ?? ()
No symbol table info available.
#46 0x24828f86b5c56a80 in ?? ()
No symbol table info available.
#47 0x07fa9348e4ac1c9c in ?? ()
No symbol table info available.
#48 0x9e956e5eaba53230 in ?? ()
No symbol table info available.
#49 0xadbb30e98eba0887 in ?? ()
No symbol table info available.
#50 0xd5f34c5abd1e2218 in ?? ()
No symbol table info available.
#51 0x2038f5296a8014d7 in ?? ()
No symbol table info available.
#52 0x37233fd7bd3fce06 in ?? ()
No symbol table info available.
#53 0x543b62ec5ef012e9 in ?? ()
No symbol table info available.
#54 0xdad8950f10d65fa9 in ?? ()
No symbol table info available.
#55 0x30ae8ae17755f6a3 in ?? ()
No symbol table info available.
#56 0x013660b35920f02d in ?? ()
No symbol table info available.
#57 0x93f269f76d476924 in ?? ()
No symbol table info available.
#58 0x6de63d8bf4f61c1d in ?? ()
No symbol table info available.
#59 0x7e30614eb02ec692 in ?? ()
No symbol table info available.
#60 0x506028e67b1425a8 in ?? ()
No symbol table info available.
#61 0xeb544e07d4d17c93 in ?? ()
No symbol table info available.
#62 0xc574e6ae1edd49cb in ?? ()
No symbol table info available.
#63 0xe68d652fbadd229d in ?? ()
No symbol table info available.
#64 0xe2ae4298bef74adb in ?? ()
No symbol table info available.
#65 0xcb1502d4b92b38a1 in ?? ()
No symbol table info available.
#66 0xc92466dc54520451 in ?? ()
No symbol table info available.
#67 0x8a0ad4ff8fa60354 in ?? ()
No symbol table info available.
#68 0xed0048be1be5b222 in ?? ()
No symbol table info available.
#69 0x2a61705b5eadaedd in ?? ()
No symbol table info available.
#70 0xe779eaa9fa0c6490 in ?? ()
No symbol table info available.
#71 0x7adb527616bf2450 in ?? ()
No symbol table info available.
#72 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 2 (Thread 0x7f52e0f7f700 (LWP 6748)):
#0  0x00007f52edc5c945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f52ee0ea3d9 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
        result = <optimized out>
        __PRETTY_FUNCTION__ = "sys_cond_wait"
#2  0x00007f52ee0f83a5 in router_core_thread (arg=0x2352920) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/router_core_thread.c:66
        core = 0x2352920
        action_list = <optimized out>
        action = <optimized out>
#3  0x00007f52edc58e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f52ecf8c34d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7f52dbfff700 (LWP 6751)):
#0  pn_session_set_incoming_capacity (ssn=0x0, capacity=35184372072448) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:1059
No locals.
#1  0x00007f52ee0de62c in qd_link (node=0x20bc390, conn=conn@entry=0x7f52d7a709d0, dir=QD_INCOMING, name=name@entry=0x7f52dc193b30 "38802e68-2a81-4698-80d5-9d0c16a2b109") at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:831
        link = 0x7f529570e990
        cf = 0x21eb0d0
        open_new_session = true
#2  0x00007f52ee0fd269 in CORE_link_first_attach (context=0x2357490, conn=<optimized out>, link=0x7f52b86f36d0, source=0x7f52c40189d0, target=0x7f52c4018ad0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:849
        router = 0x2357490
        qconn = 0x7f52d7a709d0
        qlink = <optimized out>
#3  0x00007f52ee0f1b32 in qdr_connection_process (conn=0x29409d0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:214
        work_list = {head = 0x0, tail = <optimized out>, scratch = 0x7f52dc188cd0, size = <optimized out>}
        core = 0x2352920
        ref = <optimized out>
        link = <optimized out>
        event_count = 1
        work = 0x7f52dc188cd0
#4  0x00007f52ee0dd298 in writable_handler (container=0x20137b0, container=0x20137b0, conn=<optimized out>, qd_conn=0x7f52d7a709d0) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
        nt = <optimized out>
        event_count = 0
        nt_item = 0x235bca0
#5  handler (handler_context=0x20137b0, conn_context=0x21eb0b0, event=<optimized out>, qd_conn=0x7f52d7a709d0) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
        conn_context = 0x21eb0b0
        event = QD_CONN_EVENT_WRITABLE
        qd_conn = 0x7f52d7a709d0
        handler_context = 0x20137b0
        container = 0x20137b0
        conn = 0x2367ab0
#6  0x00007f52ee101deb in process_connector (cxtr=<optimized out>, qd_server=0x2367ab0) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
        qd_conn = <optimized out>
        collector = 0x7f52cdd0e7f0
        event = <optimized out>
        ctx = 0x7f52d7a709d0
        events = <optimized out>
        passes = <optimized out>
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
        work_done = 1
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x2367ab0
#8  0x00007f52edc58e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#9  0x00007f52ecf8c34d in clone () from /lib64/libc.so.6
No symbol table info available.

Comment 12 Pavel Moravec 2018-10-19 10:43:29 UTC
Technically, the root cause of the bug is evident:

/usr/src/debug/qpid-dispatch-0.8.0/src/container.c :

qd_link_t *qd_link(qd_node_t *node, qd_connection_t *conn, qd_direction_t dir, const char* name)
{
    qd_link_t *link = new_qd_link_t();
    const qd_server_config_t * cf = qd_connection_config(conn);

    bool open_new_session = true;

    //
    // A route container connection is the one that is made to the broker.
    // We don't want to create one session for every link we create when communicating with a broker because the
    // you can only have a maximum of 32,768 sessions per connection. There is a danger of us running out of sessions
    // if the number of sessions exceed 32,768 (one link per session)
    // To avoid this problem, if this is a route container connection, we will store the session in the route-container connection object
    // and every new link will be created in that session.
    //
    //
    if (conn->is_route_container) {
        // dont close the session with link because all links in the route-container case are sharing a single session.
        link->close_sess_with_link = false;
        // Use the pn_session if it is already available in the connection and set is_conn_route_container to false since we don't want to open a new session.
        if (conn->pn_sess) {
            link->pn_sess = conn->pn_sess;
            open_new_session = false;
        }
        else {
            // pn_session is not available in the connection. Create a new one.
            link->pn_sess = pn_session(qd_connection_pn(conn));
            conn->pn_sess = link->pn_sess;
        }
    }
    else {
        // We will still continue having one session per link in the non-route container cases.
        link->close_sess_with_link = true;
        link->pn_sess = pn_session(qd_connection_pn(conn));
    }

    // TODO - Do I have to worry about the incoming capacity in the case of route containers now that there is only one session?
    pn_session_set_incoming_capacity(link->pn_sess, cf->incoming_capacity);


What happened:
1) the call "link->pn_sess = pn_session(qd_connection_pn(conn));" raised an error about >32k sessions - it is weird that the qdrouterd on the Capsule had 32k clients when it managed just 2-3k Content Hosts
2) so link->pn_sess was set to NULL
3) the call "pn_session_set_incoming_capacity(link->pn_sess, cf->incoming_capacity);" didn't expect this - there should be some "if (link->pn_sess)" test before it (see the sketch below)


So there is some bug in qdrouterd, but for normal usage of Sat/Caps, it should not pop up.

I tried creating & dropping consumers in endless loops, just to ensure the sessions are recycled, and indeed the sessions (channel IDs in AMQP) returned by proton _are_ recycled.

So the only way this could be hit is, imho, having >32k goferd clients connected to the Capsule (or some scenario causing a session leak).


Does this really happen?

Comment 13 Pavel Moravec 2018-10-19 20:12:41 UTC
.. and the reproducer for this particular segfault is really "create >32k link routes via qdrouterd to qpidd" - mimicked by having so many consumers / queue receivers connect to one qdrouterd that link-routes the address prefixes to some connector. Now I have:

Oct 19 16:08:48 hp-dl180g6-1 qdrouterd: [0x154e650]:unable to find an open available channel within limit of 32767
Oct 19 16:08:48 hp-dl180g6-1 qdrouterd: [0x154e650]:process error -2
Oct 19 16:08:48 hp-dl180g6-1 qdrouterd: [0x154e650]:pn_session: too many sessions: 32768  channel_max is 32767
Oct 19 16:08:48 hp-dl180g6-1 kernel: qdrouterd[12009]: segfault at 88 ip 00007f3879f538d0 sp 00007f38677fd258 error 6 in libqpid-proton.so.10.0.0[7f3879f35000+4b000]
Oct 19 16:08:48 hp-dl180g6-1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Oct 19 16:08:48 hp-dl180g6-1 systemd: Unit qdrouterd.service entered failed state.
Oct 19 16:08:48 hp-dl180g6-1 systemd: qdrouterd.service failed.


Hitting this bug implies that the above qdrouterd errors appear just before the segfault.

Comment 14 sbadhwar 2018-10-23 05:10:34 UTC
(In reply to Pavel Moravec from comment #12)
> [...]
> So the only way this could be hit is, imho, having >32k goferd clients
> connected to the Capsule (or some scenario causing a session leak).
> 
> Does this really happen?

Hey Pavel,

I can confirm that the bug was really happening even though no individual Capsule had anywhere near 32K clients. This could possibly be a session leak.

Comment 15 Pavel Moravec 2018-10-24 07:34:53 UTC
Smells like a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1553189 ?

I will try to reproduce that way.

Comment 16 Pavel Moravec 2018-10-25 20:29:16 UTC
preliminary results:

bz1553189 _is_ reproducible on qdrouterd 0.8.0-16 & proton-0.16.0-7
bz1553189 is _not_ reproducible on qdrouterd 0.8.0-19 & proton 0.16.0-11

It seems the older versions do have some proton session leak (either a router or a proton bug).


Preliminary conclusion: both bugs are DUPs and upgrade to qdrouterd 0.8.0-19 & proton 0.16.0-11 fixes them.

Comment 17 Pavel Moravec 2018-10-29 16:34:45 UTC
Preliminary results were .. too preliminary.

I *can* reproduce the segfault on the Capsule with the same backtrace, usually but not always preceded by the 32k session limit error. Sometimes the segfault hits on the Satellite as well.

Working on tuning the parameters to get a more reliable reproducer than "roughly once an hour". But the basic idea seems to be:

- too many link (dis)connections towards qdrouterd cause a very high load on the multiplexed TCP connections (qpidd <-> qdrouterd, optionally also on the inter-qdrouterd one)
- netstat shows a huge backlog in the receive or send queue of the given TCP connection
- at that point, qdrouterd on the Capsule thinks it still has a valid session towards qdrouterd/qpidd on the Satellite, while the client session/link it was created for is already gone
- that consumes many of the 32k sessions of the inter-qdrouterd connection
- in the extreme case, all such sessions are occupied, while most of them are just waiting to be cleared once frames from the incoming TCP socket are read
- now a new client connecting to qdrouterd triggers a new session request for the link routing towards qdrouterd on the Satellite, and it fails with the "pn_session: too many sessions" error
- qdrouterd does not react to that error and assumes a session was successfully created
- trying to work on the NULL session raises the segfault


Anyway, the above scenario should *not* occur with just 5k clients, since each client would have to hold several of these "semi-uncleared" sessions..

Comment 18 Pavel Moravec 2018-10-31 22:08:14 UTC
Tuned reproducer that hits the segfault within 5-10 minutes on average - basically follow the bz1553189 one with some tuning.

reproducer machines (beaker default password for root):
fujitsu-cx250s2-3.gsslab.rdu2.redhat.com (qpidd + all qdrouterd, 50 clients)
dell-per620-4.gsslab.rdu2.redhat.com (100 clients)

On fujitsu-cx250s2-3 to start qpidd+qdrouterds:
cd ~/bz1553116_reproducer
./start_reproducer.sh

then on both machines, start clients:
cd ~/bz1553116_reproducer
nohup ./start_clients.sh > /dev/null 2>&1 &




scripts copied here for reference only (or QE purposes once the beaker machines are gone):

mimic_goferd_consumer.py : from bz1553189 with tuning:
ROUTER_ADDRESS = "proton+amqp://fujitsu-cx250s2-3.gsslab.rdu2.redhat.com:%s"
ADDRESS = "pulp.bz1553116"
HEARTBEAT = 5
SLEEP_MIN = 1.0
SLEEP_MAX = 3.0
THREADS = 50
ADDRESSES = 10 



Main script to kick reproducer: start_reproducer.sh :
capsules=10
queues=10
ulimit -c unlimited
ulimit -n 100000

pkill -9 qpidd
pkill -9 qdrouterd
kill $(lsof | grep /root/bz1553116_reproducer/monitor.log | awk '{ print $2 }')
kill -9 $(ps aux | grep mimic | grep python | awk '{ print $2 }')
sleep 2

echo "DELETING ALL COREDUMPS (to save space)"
./reset_coredumps.sh

echo "starting and provisioning qpidd.."
# start qpidd and add queues
qpidd --config ${PWD}/qpidd.conf &
sleep 2
for i in $(seq 0 $((queues-1))); do
	qpid-config add queue pulp.bz1553116.${i}
done

echo "starting qdrouterd processes.."
# start Satellite's router
qdrouterd -c ${PWD}/qdrouterd-sat.conf &
for i in $(seq 0 $((capsules-1))); do
	qdrouterd -c ${PWD}/qdrouterd.caps0${i}.conf &
done

sed -i "s/^ADDRESSES = [[:digit:]]\+ /ADDRESSES = ${queues} /1" mimic_goferd_consumer.py
scp mimic_goferd_consumer.py dell-per620-4.gsslab.rdu2.redhat.com:${PWD}

./monitor.sh > monitor.log 2>&1 &

echo
echo "now run this command (ideally from 2 other machines):"
echo
echo "while true; do for j in \$(seq 1 100); do python mimic_goferd_consumer.py 5600 & sleep 0.09; done; sleep 5; kill \$(ps aux | grep python | grep mimic | awk '{ print \$2 }'); done"
echo ".. or .."
echo "execute run_clients.sh here and on dell-per620-4"

Comment 19 Pavel Moravec 2018-10-31 22:10:14 UTC
Ganesh / Mike, could you have a look there, please?

Above are 2 beaker systems prepared with a very reliable reproducer, free to use.

Comment 20 Mike Cressman 2018-11-08 21:26:01 UTC
We have a build with a potential fix for this:
- qpid-dispatch-0.8.0-21.el7, available in brew

If a client tries to connect to the capsule router and the number of internal sessions between the capsule and the satellite routers exceeds the maximum (32k), we now detach from the client, and it will have to try again. The connection will succeed once a session gets freed up. So there is still a session limit, but the router will not crash and the client can retry.
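
For illustration only, a sketch of the described behavior (this is not the actual 0.8.0-21 patch; "client_link" stands for the proton link of the incoming attach): when no session can be allocated, the router refuses the attach with an error condition instead of crashing, and the client is expected to retry later.

    /* Sketch of the behavior described above, not the shipped code. */
    pn_session_t *ssn = pn_session(qd_connection_pn(conn));
    if (!ssn) {
        pn_condition_t *cond = pn_link_condition(client_link);
        pn_condition_set_name(cond, "amqp:resource-limit-exceeded");
        pn_condition_set_description(cond, "no session available, try again later");
        pn_link_close(client_link);   /* detach from the client; it can retry */
    }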

In newer versions of the router (1.2+, potentially shipped in Sat 6.5) this session limit is removed.

Pavel has tested with his reproducer and it seems to behave OK, but it still needs to be confirmed with a real katello-agent / goferd (although it is not clear how feasible that is).

So, this could potentially be used as a hotfix or queued up for a dot release
in 6.3 and/or 6.4 for cases where sessions run out, causing a qdrouterd crash.

Comment 21 Pavel Moravec 2018-11-09 19:02:27 UTC
*** Bug 1553189 has been marked as a duplicate of this bug. ***

Comment 26 Mike McCune 2019-07-17 21:53:57 UTC
This fix is currently available in 6.5 and will be officially tested in 6.6. We are making a build available that should resolve this issue in 6.4.4 with the publishing of qpid-dispatch-router-0.8.0-21.el7.x86_64.rpm.

*** Satellite 6.4.4 Hotfix Available ***

1) Download qpid-dispatch-router-0.8.0-21.tar.gz from this Bugzilla to your Satellite and unpack

2) stop services:

foreman-maintain service stop

3) upgrade rpms:

rpm -Uvh qpid-dispatch*.rpm

4) resume operations

foreman-maintain service start

Comment 27 Mike McCune 2019-07-17 21:54:38 UTC
Created attachment 1591574 [details]
qpid-dispatch-router-0.8.0-21.tar.gz

Comment 30 errata-xmlrpc 2019-10-22 12:46:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3172