Bug 1398536 - qdrouterd 0.4-20 segfault after frequent goferd restarts
Summary: qdrouterd 0.4-20 segfault after frequent goferd restarts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: katello-agent
Version: 6.2.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: jcallaha
URL:
Whiteboard:
Depends On:
Blocks: 1400662
Reported: 2016-11-25 07:51 UTC by Pavel Moravec
Modified: 2020-04-15 14:54 UTC (History)
7 users

Fixed In Version: qpid-dispatch-0.4-21
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1400662 (view as bug list)
Environment:
Last Closed: 2016-12-05 20:54:21 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:2855 0 normal SHIPPED_LIVE Satellite 6.2 Dispatch Router Async Errata 2016-12-06 01:53:43 UTC

Description Pavel Moravec 2016-11-25 07:51:43 UTC
Description of problem:
Under circumstances that are not 100% clear (quite probably after frequent goferd restarts or their link reconnects, with hundreds of established clients in the meantime), qdrouterd 0.4-20 segfaults. The backtrace suggests a problem with an orphaned link lacking its context or connection:

Program terminated with signal 11, Segmentation fault.
#0  process_handler (unused=<optimized out>, qd_conn=0x7fe3a8602fc0, container=0x1dd1800) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:518
518	                ssn = qd_link->close_sess_with_link ? qd_link->pn_sess : 0;
Missing separate debuginfos, use: debuginfo-install krb5-libs-1.14.1-26.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libffi-3.0.13-18.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 openssl-libs-1.0.1e-60.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  process_handler (unused=<optimized out>, qd_conn=0x7fe3a8602fc0, container=0x1dd1800) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:518
#1  handler (handler_context=0x1dd1800, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7fe3a8602fc0)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:638
#2  0x00007fe3c120496c in process_connector (cxtr=0x7fe3ab4f5850, qd_server=0x1bfb5a0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#3  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#4  0x00007fe3c12059f0 in qd_server_run (qd=0x1a8a030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:971
#5  0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7fffa97bff2d "/etc/qpid-dispatch/qdrouterd.conf", 
    python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
#6  0x0000000000401950 in main (argc=3, argv=0x7fffa97be378) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335
(gdb) list
513	        case PN_LINK_LOCAL_DETACH:
514	        case PN_LINK_LOCAL_CLOSE:
515	            pn_link = pn_event_link(event);
516	            if (pn_link_state(pn_link) == (PN_LOCAL_CLOSED | PN_REMOTE_CLOSED)) {
517	                qd_link = (qd_link_t*) pn_link_get_context(pn_link);
518	                ssn = qd_link->close_sess_with_link ? qd_link->pn_sess : 0;
519	                if (ssn)
520	                    pn_session_close(ssn);
521	                add_link_to_free_list(&free_link_session_list, pn_link);
522	                pn_link_set_context(pn_link, 0);
(gdb) p qd_link
$1 = <optimized out>
(gdb) p pn_link->context 
$2 = (pn_record_t *) 0x7fe3ad4fcd50
(gdb) p pn_link->context->size
$3 = 1
(gdb) p pn_link->context->capacity 
$4 = 1
(gdb) p *(pn_link->context->fields)
$5 = {key = 0, clazz = 0x7fe3c11d6ce0 <PNI_VOID>, value = 0x0}
(gdb) 

I *guess* the qd_link assignment on line 517 yields something fishy (see the source of pn_link_get_context and the gdb statements above - I *think* the pn_link context should hold more than a single void field).


Version-Release number of selected component (if applicable):
qpid-dispatch-router-0.4-20.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64


How reproducible:
??? (managed it twice; a very nondeterministic scenario)


Steps to Reproduce:
1. Follow bz1398377 but run the script with a) lower sleeps and slightly fewer consumers, and b) the reconnects in bigger bursts (i.e. remove the randomness)


Actual results:
segfault with the above backtrace


Expected results:
no segfault


Additional info:

Comment 3 Pavel Moravec 2016-11-25 09:28:38 UTC
.. and here is a reproducer:

In Satellite (I haven't tested this): have >1000 content hosts with goferd, and restart goferd frequently (every 10-30 seconds, say).

Outside Satellite:

1) Add some queues mimicking pulp.agent.<UUID> ones:

for i in $(seq 0 200); do qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add queue pulp.agent.$i & done


2) Run the script below (a modification of the one from bz1398377):

for i in $(seq 1 20); do python just_consume_to_segfault.py 10.0 10 & sleep 0.95; done

for i in $(seq 1 20); do python just_consume_to_segfault.py 10.0 10 & sleep 0.95; done

(it is worth running it once on one system and once on another - the script adds some load to the system)


3) Wait up to 30 minutes (I got it in 10 minutes, but it is as non-deterministic as a dice roll)


4) segfault!

Script itself:

#!/usr/bin/python

from time import sleep
from uuid import uuid4

from proton import ConnectionException, Timeout
from proton import SSLDomain, SSLException
#from proton import Message

from proton.utils import BlockingConnection

import threading
import traceback
import os, sys
import random

SSL = True
ROUTER_ADDRESS_NOSSL = "proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648"
ROUTER_ADDRESS_SSL = "amqps://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5647"
ROUTER_ADDRESS = ROUTER_ADDRESS_SSL if SSL else ROUTER_ADDRESS_NOSSL
#ROUTER_ADDRESS = "proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5648"
ADDRESS = "pulp.agent"
#ADDRESS = "queue"
HEARTBEAT = 10
SLEEP = float(sys.argv[1])
THREADS = int(sys.argv[2])
START_PERIOD = 10.0


class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS, domain=None):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        print self.address
        self.domain = domain
        self.running = True
        self.conn = None

    def connect(self):
        try:
            self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=self.domain, heartbeat=HEARTBEAT)
            self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)
        except Exception:
            self.conn = None

    def run(self):
        while self.running:
            while self.conn is None:
                self.connect()
            sleep(SLEEP)
            try:
                print "%s: reconnecting.." % self.address
                self.conn.close()
            except Exception, e:
                print e
            self.conn = None

    def stop(self):
        self.running = False

ca_certificate='/etc/rhsm/ca/katello-server-ca.pem'
client_certificate='/etc/pki/consumer/bundle.pem'
client_key=None

domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials(client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

random.seed()
threads = []
for i in range(THREADS):
  threads.append(ReceiverThread(i, "%s.%s" % (ADDRESS, i), domain if SSL else None))
  threads[i].start()
  sleep(START_PERIOD/THREADS)

# Block until the user exits; the receiver threads keep churning reconnects meanwhile.
try:
    raw_input("Press Enter to exit:")
except (EOFError, KeyboardInterrupt):
    pass

for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()

Comment 4 Pavel Moravec 2016-11-29 21:01:38 UTC
BTW, I am no longer able to reproduce the segfault on a reinstalled Satellite. So the verification steps should rather be smoke-test based - run the reproducer and check that the new version of qdrouterd doesn't fail.

Comment 6 Pavel Moravec 2016-11-30 19:19:14 UTC
/me failed to reliably reproduce on 0.4-20. Running the original reproducer against 0.4-21, and/or running a scaled-up reproducer (more client connections, more frequent reconnections), no segfault was hit.

Comment 8 Mike McCune 2016-12-01 18:34:31 UTC
** HOTFIX AVAILABLE **

This set of hotfix packages is currently undergoing testing by Red Hat Quality Assurance but is available for users who want early access. This hotfix includes fixes for bugs 1398377 and 1398536.

Instructions for hotfixing:

1) Download to your Satellite: http://people.redhat.com/~mmccune/hotfix/HOTFIX-1398377-1398536.tar.gz

2) Expand the archive: tar xvf HOTFIX-1398377-1398536.tar.gz

3) cd to the directory matching the version of the base OS you are using

4) Upgrade the RPMs:

rpm -Uvh qpid-dispatch-router-0.4-21.el7sat.x86_64.rpm libqpid-dispatch-0.4-21.el7sat.x86_64.rpm

(optionally include qpid-dispatch-tools if you get dependency errors and already have the older version of that package installed)

5) katello-service restart

Then proceed with normal operations.

Comment 9 jcallaha 2016-12-05 13:41:28 UTC
Verified in the Satellite 6.2.4 async release based on comments #4 and #6, as well as the no-break automation test results.

# rpm -qa | grep qpid-dispatch
libqpid-dispatch-0.4-21.el6sat.x86_64
qpid-dispatch-debuginfo-0.4-21.el6sat.x86_64
qpid-dispatch-router-0.4-21.el6sat.x86_64
qpid-dispatch-tools-0.4-21.el6sat.x86_64

# rpm -qa | grep qpid-dispatch
qpid-dispatch-debuginfo-0.4-21.el7sat.x86_64
qpid-dispatch-tools-0.4-21.el7sat.x86_64
libqpid-dispatch-0.4-21.el7sat.x86_64
qpid-dispatch-router-0.4-21.el7sat.x86_64

Comment 11 errata-xmlrpc 2016-12-05 20:54:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2855

