Description of problem:
Under not-100%-clear circumstances (quite probably after frequent goferd restarts or their link reconnects, with hundreds of established clients in the meantime), qdrouterd 0.4-20 segfaults. The backtrace suggests a problem with an orphaned link lacking its context or connection:

Program terminated with signal 11, Segmentation fault.
#0  process_handler (unused=<optimized out>, qd_conn=0x7fe3a8602fc0, container=0x1dd1800) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:518
518             ssn = qd_link->close_sess_with_link ? qd_link->pn_sess : 0;
Missing separate debuginfos, use: debuginfo-install krb5-libs-1.14.1-26.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libffi-3.0.13-18.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 openssl-libs-1.0.1e-60.el7.x86_64 zlib-1.2.7-17.el7.x86_64

(gdb) bt
#0  process_handler (unused=<optimized out>, qd_conn=0x7fe3a8602fc0, container=0x1dd1800) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:518
#1  handler (handler_context=0x1dd1800, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7fe3a8602fc0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:638
#2  0x00007fe3c120496c in process_connector (cxtr=0x7fe3ab4f5850, qd_server=0x1bfb5a0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#3  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#4  0x00007fe3c12059f0 in qd_server_run (qd=0x1a8a030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:971
#5  0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7fffa97bff2d "/etc/qpid-dispatch/qdrouterd.conf", python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
#6  0x0000000000401950 in main (argc=3, argv=0x7fffa97be378) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335

(gdb) list
513           case PN_LINK_LOCAL_DETACH:
514           case PN_LINK_LOCAL_CLOSE:
515             pn_link = pn_event_link(event);
516             if (pn_link_state(pn_link) == (PN_LOCAL_CLOSED | PN_REMOTE_CLOSED)) {
517                 qd_link = (qd_link_t*) pn_link_get_context(pn_link);
518                 ssn = qd_link->close_sess_with_link ? qd_link->pn_sess : 0;
519                 if (ssn)
520                     pn_session_close(ssn);
521                 add_link_to_free_list(&free_link_session_list, pn_link);
522                 pn_link_set_context(pn_link, 0);

(gdb) p qd_link
$1 = <optimized out>
(gdb) p pn_link->context
$2 = (pn_record_t *) 0x7fe3ad4fcd50
(gdb) p pn_link->context->size
$3 = 1
(gdb) p pn_link->context->capacity
$4 = 1
(gdb) p *(pn_link->context->fields)
$5 = {key = 0, clazz = 0x7fe3c11d6ce0 <PNI_VOID>, value = 0x0}
(gdb)

I *guess* the qd_link assignment on line 517 produces something fishy (see the source of pn_link_get_context and the gdb statements above - I *think* the pn_link context should contain more than a single void field).

Version-Release number of selected component (if applicable):
qpid-dispatch-router-0.4-20.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64

How reproducible:
??? (managed twice, very non-deterministic scenario)

Steps to Reproduce:
1. Follow bz1398377, but run the script with a) lower sleeps and slightly fewer consumers, and b) the reconnects done in bigger bursts (i.e. remove the randomness).

Actual results:
segfault with the above backtrace

Expected results:
no segfault

Additional info:
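To illustrate the suspected mechanism (this is my reading of the gdb output above, not a confirmed root cause): the pn_link context record only holds the PNI_VOID placeholder with value = 0x0, i.e. pn_link_get_context() returns NULL, and line 518 dereferences that pointer without a check. Below is a small standalone proton-c sketch, assuming the 0.9 C API, that demonstrates the pattern and the kind of NULL guard that would avoid the crash; the file and struct names are made up for the demo and this is not the actual dispatch-router patch.

/* ctx_demo.c - illustration only, not router code.
 * Shows that once the pn_link context has been cleared,
 * pn_link_get_context() returns NULL and an unguarded
 * dereference (as at container.c:518) would segfault.
 * Build e.g.: gcc -o ctx_demo ctx_demo.c -lqpid-proton
 */
#include <stdio.h>
#include <proton/connection.h>
#include <proton/session.h>
#include <proton/link.h>

typedef struct { int close_sess_with_link; } fake_qd_link_t;  /* stand-in for qd_link_t */

int main(void)
{
    pn_connection_t *conn = pn_connection();
    pn_session_t    *ssn  = pn_session(conn);
    pn_link_t       *link = pn_receiver(ssn, "demo-receiver");

    fake_qd_link_t ctx = { 1 };
    pn_link_set_context(link, &ctx);
    printf("context after set:   %p\n", pn_link_get_context(link));

    /* This mimics what the link-free path does (container.c:522). */
    pn_link_set_context(link, 0);
    printf("context after clear: %p\n", pn_link_get_context(link));

    /* At this point the unguarded pattern from container.c:517-518,
     *   qd_link = (qd_link_t*) pn_link_get_context(pn_link);
     *   ssn = qd_link->close_sess_with_link ? qd_link->pn_sess : 0;
     * would dereference NULL. A guard avoids that: */
    fake_qd_link_t *qd_link = (fake_qd_link_t*) pn_link_get_context(link);
    if (qd_link)
        printf("close_sess_with_link = %d\n", qd_link->close_sess_with_link);
    else
        printf("context is NULL - skipping, no segfault\n");

    pn_connection_free(conn);
    return 0;
}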
.. and here is the reproducer:

In Satellite (haven't tested): have >1000 content hosts with goferd and restart goferd frequently (every 10-30 seconds, say).

Outside Satellite:

1) Add some queues mimicking the pulp.agent.<UUID> ones:

for i in $(seq 0 200); do qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add queue pulp.agent.$i & done

2) Run the script below (a modification of the one from bz1398377):

for i in $(seq 1 20); do python just_consume_to_segfault.py 10.0 10 & sleep 0.95; done
for i in $(seq 1 20); do python just_consume_to_segfault.py 10.0 10 & sleep 0.95; done

(it is worth running it once on one system and once on another - it adds some load to the system)

3) Wait up to 30 minutes (I got it in 10 minutes, but it is non-deterministic, like rolling a dice).

4) segfault!

Script itself:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException, Timeout
from proton import SSLDomain, SSLException
#from proton import Message
from proton.utils import BlockingConnection
import threading
import traceback
import os, sys
import random

SSL = True
ROUTER_ADDRESS_NOSSL = "proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648"
ROUTER_ADDRESS_SSL = "amqps://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5647"
ROUTER_ADDRESS = ROUTER_ADDRESS_SSL if SSL else ROUTER_ADDRESS_NOSSL
#ROUTER_ADDRESS = "proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5648"
ADDRESS = "pulp.agent"
#ADDRESS = "queue"
HEARTBEAT = 10
SLEEP = float(sys.argv[1])
THREADS = int(sys.argv[2])
START_PERIOD = 10.0

class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS, domain=None):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        print self.address
        self.domain = domain
        self.running = True
        self.conn = None

    def connect(self):
        try:
            self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=self.domain, heartbeat=HEARTBEAT)
            self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)
        except Exception:
            self.conn = None

    def run(self):
        while self.running:
            while self.conn == None:
                self.connect()
            sleep(SLEEP)
            try:
                print "%s: reconnecting.." % self.address
                self.conn.close()
            except Exception, e:
                print e
                pass
            self.conn = None

    def stop(self):
        self.running = False

ca_certificate = '/etc/rhsm/ca/katello-server-ca.pem'
client_certificate = '/etc/pki/consumer/bundle.pem'
client_key = None
domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials(client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

random.seed()

threads = []
for i in range(THREADS):
    threads.append(ReceiverThread(i, "%s.%s" % (ADDRESS, i), domain if SSL else None))
    threads[i].start()
    sleep(START_PERIOD/THREADS)

while True:
    sleep(10)

_in = raw_input("Press Enter to exit:")
for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()
btw, I am no longer able to reproduce the segfault on a reinstalled Satellite. So the verification steps should rather be smoke-test based - run the reproducer and check that the new version of qdrouterd does not fail.
/me failed to reliably reproduce on 0.4-20. Running the original reproducer against 0.4-21, and/or running a scaled-up reproducer (more client connections, more frequent reconnections), no segfault was hit.
** HOTFIX AVAILABLE **

This set of hotfix packages is currently undergoing testing by Red Hat Quality Assurance, but is available for users who want early access. The hotfix includes fixes for bugs 1398377 and 1398536.

Instructions for hotfixing:

1) Download to your Satellite: http://people.redhat.com/~mmccune/hotfix/HOTFIX-1398377-1398536.tar.gz

2) Expand the archive: tar xvf HOTFIX-1398377-1398536.tar.gz

3) cd to the directory matching the version of the base OS you are using

4) Upgrade the RPMs: rpm -Uvh qpid-dispatch-router-0.4-21.el7sat.x86_64.rpm libqpid-dispatch-0.4-21.el7sat.x86_64.rpm
   (optionally include qpid-dispatch-tools if you get dependency errors and already have the older version of that package installed)

5) katello-service restart

Then proceed with operations.
Verified in Satellite 6.2.4 async, based on #4 and #6 as well as the no-break automation test results.

# rpm -qa | grep qpid-dispatch
libqpid-dispatch-0.4-21.el6sat.x86_64
qpid-dispatch-debuginfo-0.4-21.el6sat.x86_64
qpid-dispatch-router-0.4-21.el6sat.x86_64
qpid-dispatch-tools-0.4-21.el6sat.x86_64

# rpm -qa | grep qpid-dispatch
qpid-dispatch-debuginfo-0.4-21.el7sat.x86_64
qpid-dispatch-tools-0.4-21.el7sat.x86_64
libqpid-dispatch-0.4-21.el7sat.x86_64
qpid-dispatch-router-0.4-21.el7sat.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2855