Red Hat Bugzilla – Bug 1491160
qdrouterd segfault when processing bursts of goferd requests
Last modified: 2018-06-26 06:29:19 EDT
Description of problem:
When too many client connections (from goferd) attempt to be link-routed via qdrouterd on Satellite towards qpidd, the qdrouterd on the Satellite can segfault. Restarting the process can lead to another segfault soon after, since the pending goferd clients send another burst of link attach attempts to the just-started qdrouterd as soon as they can. Anyway, I was only able to reproduce these segfaults sporadically.

Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-16.el7.x86_64
qpid-dispatch-router-0.4-22.el7sat.x86_64
libqpid-dispatch-0.4-22.el7sat.x86_64

How reproducible:
nondeterministic, from time to time

Steps to Reproduce (Satellite-free reproducer):
1. Have qpidd, one qdrouterd (named A) link-routing everything to it, and another qdrouterd (named B) link-routing everything via A (a configuration sketch follows below).
2. Have the python client I will post in the next update (just adjust CONNECTION_URL there). The script basically creates (via QMF) a queue in qpidd and tries to fetch messages from it via a consumer - and it does so concurrently in 30 threads.
3. Run:

while true; do
    if [ $((RANDOM%2)) -eq 0 ]; then
        service qdrouterd restart    # to restart the qdrouterd B
    fi
    sleep 1
    kill $(ps aux | grep simul | grep python | awk '{ print $2 }')
    for i in $(seq 1 10); do
        python simulate_first_goferd_run-threaded.py &
        sleep 0.1
    done
    date
    sleep 45
done

Actual results:
qdrouterd A segfaults from time to time, with logs like:

Wed Sep 13 09:54:36 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Wed Sep 13 09:55:00 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Wed Sep 13 09:55:06 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Wed Sep 13 09:55:08 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Wed Sep 13 09:55:34 2017 ROUTER_LS (info) Router Link Lost - link_id=0
2017-09-13T09:57:45.632091+02:00 pmoravec-sat62-rhel7 kernel: traps: qdrouterd[14234] general protection ip:7f1265e9a585 sp:7f1259878318 error:0 in libqpid-proton.so.2.0.0[7f1265e76000+4d000]
Segmentation fault (core dumped)

(i.e. the segfault happens some random time after the "Router Link Lost" events - these events are presumably caused by too many link attach requests exchanged between A and B, making the HELLO/RA messages time out)

Expected results:
No segfaults.
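For reference, the link-routing topology from step 1 can be expressed in qdrouterd configuration terms roughly as below. This is an illustrative sketch only, written in the qpid-dispatch 0.4 configuration style - router names, hosts and ports are placeholders, and the real Satellite/Capsule qdrouterd.conf additionally carries SSL/SASL settings; double-check the attribute names against the installed config before reusing it.

# Router A ("hub", in front of qpidd) - /etc/qpid-dispatch/qdrouterd.conf sketch
router {
    mode: interior
    router-id: router-a
}
listener {                 # inter-router listener for router B
    addr: 0.0.0.0
    port: 5646
    role: inter-router
}
connector {                # on-demand connection to the local qpidd broker
    name: broker
    addr: localhost
    port: 5671
    role: on-demand
    idle-timeout-seconds: 0
}
linkRoutePattern {         # link-route pulp.* (everything the clients use) to qpidd
    prefix: pulp.
    connector: broker
}

# Router B ("capsule side") sketch
router {
    mode: interior
    router-id: router-b
}
listener {                 # the goferd-like clients connect here
    addr: 0.0.0.0
    port: 5647
}
connector {                # inter-router link towards router A
    name: to-router-a
    addr: router-a.example.com
    port: 5646
    role: inter-router
}
linkRoutePattern {         # pulp.* gets link-routed via A towards qpidd
    prefix: pulp.
}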
Additional info:

Backtrace:

(gdb) bt
#0  0x00007fcc3d7fa585 in pn_connection_get_context (conn=0x6f6d70275b3d6e65) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
#1  0x00007fcc3da3ee61 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:937
#2  0x00007fcc3da4cc95 in router_link_attach_handler (context=0xedba20, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
#3  0x00007fcc3da3e11c in handle_link_open (container=<optimized out>, pn_link=0x7fcc282e4610) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:217
#4  process_handler (unused=<optimized out>, qd_conn=0x7fcc2800cb30, container=0xe80840) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:481
#5  handler (handler_context=0xe80840, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7fcc2800cb30) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:643
#6  0x00007fcc3da5095c in process_connector (cxtr=0x7fcc28010290, qd_server=0xede3d0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#8  0x00007fcc3d5c1e25 in start_thread (arg=0x7fcc311d9700) at pthread_create.c:308
#9  0x00007fcc3cb1734d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb)

(There are very few variants; once, frame #1 was already inside the qd_link_connection method.)

The connection object there is always gone while qdrouterd tries to access it:

(gdb) frame 2
#2  0x00007fcc3da4cc95 in router_link_attach_handler (context=0xedba20, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
1686        qd_connection_t *out_conn = qd_link_connection(peer_rlink->link);
(gdb) p peer_rlink->link->pn_link->session->connection
$1 = (pn_connection_t *) 0x6f6d70275b3d6e65
(gdb) p *(peer_rlink->link->pn_link->session->connection)
Cannot access memory at address 0x6f6d70275b3d6e65
(gdb)

(Note that the pointer value itself consists of printable ASCII bytes, i.e. the freed memory has apparently already been reused for string data.)

The thread is processing some event related to the connection to/from qpidd, since:

(gdb) frame 6
#6  0x00007fcc3da5095c in process_connector (cxtr=0x7fcc28010290, qd_server=0xede3d0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
398            events = qd_server->conn_handler(qd_server->conn_handler_context, ctx->context,
(gdb) p cxtr->name
$5 = "pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:5671", '\000' <repeats 136 times>, "\300\241\335<\314\177\000\000\241\f\001", '\000' <repeats 60 times>
(gdb)

The "$(hostname):5671" name is the connector to qpidd here (qpidd listens on 5671).

(Note that backtraces like this with a valid connection object appear regularly whenever there is an established link routing.)

Idea behind the reproducer and the backtrace:
- too many link attach requests congest the inter-router link, causing the HELLO/RA messages to time out and the link to drop
- meanwhile, while processing a request like the one above, the router still tries to use the just-dropped link/connection to router B
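To repeat the inspection above on a freshly captured core, something like the following can be used (a sketch: the core path is hypothetical, the frame number matches the backtrace above, and the qpid-dispatch / qpid-proton debuginfo packages have to be installed for the frames and locals to resolve):

# /path/to/core.qdrouterd is a placeholder for wherever the core lands
gdb /usr/sbin/qdrouterd /path/to/core.qdrouterd \
    -batch -ex 'bt' \
    -ex 'frame 2' \
    -ex 'p peer_rlink->link->pn_link->session->connection'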
Script simulate_first_goferd_run-threaded.py (the routers must link-route pulp.* to qpidd):

#!/usr/bin/python
from time import sleep
from uuid import uuid4
import threading

from proton import ConnectionException, Message, Timeout
from proton.utils import LinkDetached, ConnectionClosed, BlockingConnection
from proton.reactor import DynamicNodeProperties

SLEEP = 0.2
THREADS = 30
CONNECTION_URL = "proton+amqp://pmoravec-caps62-rhel7.gsslab.brq2.redhat.com:5647"


class CreateQueueAndSubscribe(threading.Thread):
    def __init__(self):
        super(CreateQueueAndSubscribe, self).__init__()
        self.qName = 'pulp.agent.FAKE.%s' % uuid4()

    def run(self):
        # create the queue via a QMF method call to the broker
        qmf_conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=None)
        qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True,
                                           options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
        qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
        print self.qName
        content = {
            "_object_id": {"_object_name": "org.apache.qpid.broker:broker:amqp-broker"},
            "_method_name": "create",
            "_arguments": {"type": "queue", "name": self.qName,
                           "properties": {"exclusive": False, "auto-delete": False, "durable": True}}
        }
        properties = {
            'qmf.opcode': '_method_request',
            'x-amqp-0-10.app-id': 'qmf2',
            'method': 'request'
        }
        request = Message(body=content,
                          reply_to=qmf_rec.remote_source.address,
                          properties=properties,
                          correlation_id=unicode(uuid4()).encode('utf-8'),
                          subject='broker')
        qmf_snd.send(request)
        qmf_rec.receive()  # ignore what we have received as QMF response
        qmf_snd.close()
        qmf_rec.close()
        qmf_conn.close()

        # keep (re)subscribing to the queue, like goferd does
        while True:
            connected = False
            while not connected:
                try:
                    conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=10)
                    rec = conn.create_receiver(self.qName, name=str(uuid4()))
                    connected = True
                except:
                    pass
            while connected:
                try:
                    impl = rec.receive(3)
                except Timeout:
                    pass
                except (LinkDetached, ConnectionClosed):
                    connected = False


threads = []
for i in range(THREADS):
    threads.append(CreateQueueAndSubscribe())
    sleep(SLEEP)
    threads[i].start()

while True:
    sleep(10)
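While the script is running, a quick sanity check that the QMF queue creation is really happening is to count the reproducer queues on the qpidd host (a sketch; add connection/SSL options to qpid-stat as needed for your broker setup):

# the count should keep climbing while the client threads run, and drop only
# when the orphaned queues get cleaned up (see the later comments)
watch -n 5 "qpid-stat -q | grep -c 'pulp.agent.FAKE'"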
A much, much more reliable reproducer:

1) In the 1st terminal, kill and respawn the python scripts mimicking goferd:

ssh root@dell-per430-14.gsslab.pek2.redhat.com (beaker password)

while true; do
    kill $(ps aux | grep simulate | grep python | awk '{ print $2 }')
    for i in $(seq 1 30); do
        python simulate_first_goferd_run-threaded.py &
    done
    sleep $((RANDOM%10+25))
done

2) In the 2nd terminal, occasionally restart qdrouterd B (the one on the Capsule / between the clients and the "hub router"):

ssh root@dell-per430-14.gsslab.pek2.redhat.com (beaker password)

while true; do
    kill $(ps aux | grep qdr | grep -v grep | awk '{ print $2 }')
    sleep 1
    /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf &
    sleep $((30+RANDOM%30))
done

3) In the 3rd terminal, run the "hub router", the one that segfaults:

ssh root@dell-per430-32.gsslab.pek2.redhat.com (beaker password)

ulimit -c unlimited
/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf

4) In the 4th terminal, clean up orphaned qpidd queues (each goferd-mimicking python client creates its own queues and doesn't clean them up):

ssh root@dell-per430-32.gsslab.pek2.redhat.com (beaker password)

while true; do
    date
    ./delete_orphaned_queues.sh | wc    # to know the number of deleted queues
    qpid-stat -q | wc                   # to know the number of remaining queues
    sleep 90
done

5) Wait up to 10 minutes (usually approx. 2-3 minutes) for the qdrouterd in the 3rd terminal to segfault.

Some observations:

I *think* the cleanup is somehow necessary - or *some* activity in qpidd that keeps the broker busy. I.e. the trigger for the segfault is a busy qpidd broker that responds with some delay (?). Just a speculation: despite the connector to the qpidd broker having idle-timeout-seconds: 0, maybe the connection still gets timed out one way or another? That would explain why a busy qpidd triggers it, and also the missing connection object (timed out)?

Side observation: from time to time, the qdrouterd on the Capsule (2nd terminal) did not finish / survived "kill <PID>". That *can* prevent segfaults of the "hub router", be aware (see the sketch below). This "not kill-able" router is in fact in a deadlock - I will describe that in a separate BZ.
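A sketch of how the respawn loop in step 2 could be hardened against that deadlocked, kill-surviving router (same PID matching as above; whether SIGKILL changes the reproducibility in any other way is untested):

pid=$(ps aux | grep qdr | grep -v grep | awk '{ print $2 }')
if [ -n "$pid" ]; then
    kill $pid
    sleep 2
    # escalate to SIGKILL if the router is stuck in the deadlock described above
    kill -0 $pid 2>/dev/null && kill -9 $pid
fi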
Reproducer script from #c3 (posted here since the machine will be returned once a day):

#!/usr/bin/python
from time import sleep
from uuid import uuid4
import threading

from proton import ConnectionException, Message, Timeout
from proton.utils import LinkDetached, ConnectionClosed, BlockingConnection
from proton.reactor import DynamicNodeProperties

SLEEP = 0.1
THREADS = 20  # was 30
CONNECTION_URL = "proton+amqp://localhost:5647"


class CreateQueueAndSubscribe(threading.Thread):
    def __init__(self):
        super(CreateQueueAndSubscribe, self).__init__()
        self.qName = 'pulp.agent.FAKE.%s' % uuid4()

    def run(self):
        # create the queue via a QMF method call, retrying until it succeeds
        created = False
        while not created:
            try:
                qmf_conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=None)
                qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True,
                                                   options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
                qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
                print self.qName
                content = {
                    "_object_id": {"_object_name": "org.apache.qpid.broker:broker:amqp-broker"},
                    "_method_name": "create",
                    "_arguments": {"type": "queue", "name": self.qName,
                                   "properties": {"exclusive": False, "auto-delete": False, "durable": True}}
                }
                properties = {
                    'qmf.opcode': '_method_request',
                    'x-amqp-0-10.app-id': 'qmf2',
                    'method': 'request'
                }
                request = Message(body=content,
                                  reply_to=qmf_rec.remote_source.address,
                                  properties=properties,
                                  correlation_id=unicode(uuid4()).encode('utf-8'),
                                  subject='broker')
                qmf_snd.send(request)
                qmf_rec.receive()  # ignore what we have received as QMF response
                qmf_snd.close()
                qmf_rec.close()
                qmf_conn.close()
                created = True
            except:
                created = False
        # end of "while not created"

        # keep (re)subscribing to the queue, like goferd does
        while True:
            connected = False
            while not connected:
                try:
                    conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=10)
                    rec = conn.create_receiver(self.qName, name=str(uuid4()))
                    connected = True
                except:
                    try:
                        sleep(0.3)
                        conn.close()
                    except:
                        pass
            while connected:
                try:
                    impl = rec.receive(3)
                except LinkDetached as e:
                    connected = False
                    print type(e)
                    try:
                        conn.close()
                    except:
                        pass
                except Timeout as e:
                    pass
                except (ConnectionClosed, ConnectionException) as e:
                    connected = False
                    print type(e)


threads = []
for i in range(THREADS):
    threads.append(CreateQueueAndSubscribe())
    sleep(SLEEP)
    threads[i].start()

while True:
    sleep(10)

(Just for completeness, delete_orphaned_queues.sh is:

for i in $(qpid-stat -q | grep FAKE | grep "0 1$" | awk '{ print $1 }'); do
    echo $i
    qpid-receive -a "$i; {delete:always}"
done
)
Improved & simplified reproducer from #3 (mainly to ensure the deadlocked router on the -14 server does not block the reproducer, and that ulimits are properly set):

1) 1st terminal on dell-per430-32.gsslab.pek2.redhat.com:
./satellite-delete-old-queues-regularly.sh

2) 2nd terminal on dell-per430-32.gsslab.pek2.redhat.com (here watch for segfaults):
./satellite-run-qdrouterd-to-segfault.sh

3) 1st terminal on dell-per430-14.gsslab.pek2.redhat.com:
./capsule-reproducer-qdrouter.sh

4) 2nd terminal on dell-per430-14.gsslab.pek2.redhat.com:
./capsule-reproducer-clients.sh

Here, watch that it prints lines like the following from time to time:

pulp.agent.FAKE.5d257d6f-fab5-442d-ab96-065b72dc5d1c
<class 'proton.utils.LinkDetached'>

or a few other exceptions. If not, something is broken, e.g. an unresponsive qdrouterd on the "capsule" / -14 server (see the check below).
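A sketch of such a responsiveness check for the capsule router (5647 is the client listener port used by the reproducer scripts; qdstat ships with the qpid-dispatch tools):

# run on the -14 server; if this hangs or errors out, the capsule qdrouterd is
# unresponsive (likely the deadlock mentioned earlier) and needs to be restarted
# before judging whether the hub router still segfaults
qdstat -b localhost:5647 -c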
Brad, If you are including Bug 1492355 in 6.2.14, you should target this bug as well, since they are both addressed in the same qpid-dispatch-0.4-27 build.
Mike, thanks for the feedback! I will update this one as well. Be aware that I am setting the TM to 6.2.14; however, prior to each z-stream release there is a triage meeting where the content for the z-stream gets selected.
VERIFIED on satellite-6.2.14-1.0.el7sat.noarch

I also tried the following:
- create a docker image with RHEL, with the katello-ca-consumer certificate and katello-agent installed (yet not registered to Satellite)
- as a startup script, run the subscription-manager registration and append a conditional loop that starts up the gofer daemon on some sort of trigger (I mounted an external dir and made the condition check for the presence of some file; a sketch of such a script is below)
- start up many containers (tried with 10, 30, 50)
- after all the containers are up and their registration has finished (verify by listing the content hosts in Satellite and checking that no more requests arrive at the /rhsm endpoint), pull the trigger (in my case, create the file) to break the waiting loop, so that goferd starts on all containers simultaneously
- observe that the number of pulp.agent* queues jumps by the number of running containers in a moment
- watch the logs for any errors - no errors detected
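For completeness, such a container startup script can be sketched as below. This is a hypothetical reconstruction, not the script actually used for the verification: the organization, activation key, mounted trigger path and the goferd invocation are placeholders/assumptions.

#!/bin/bash
# register the container as a content host (org/key are placeholders)
subscription-manager register --org="Example_Org" --activationkey="example-key"

# wait for the externally created trigger file, so that goferd starts on all
# containers at (roughly) the same moment
while [ ! -f /trigger/start-goferd ]; do
    sleep 5
done

# start the gofer daemon; "--foreground" is assumed here to keep the container's
# main process alive - on a systemd-enabled image "systemctl start goferd"
# would be used instead
exec goferd --foreground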
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:0273