Bug 1491160
| Field | Value |
| --- | --- |
| Summary | qdrouterd segfault when processing bursts of goferd requests |
| Product | Red Hat Satellite |
| Component | Qpid |
| Version | 6.2.11 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Reporter | Pavel Moravec <pmoravec> |
| Assignee | Mike Cressman <mcressma> |
| QA Contact | Roman Plevka <rplevka> |
| CC | andrew.schofield, bbuckingham, janarula, ktordeur, mcressma, omankame, pmoravec, rplevka, xdmoon |
| Keywords | Triaged |
| Target Milestone | Unspecified |
| Target Release | Unused |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | qpid-dispatch-0.4-27 |
| Clones | 1530692 (view as bug list) |
| Type | Bug |
| Last Closed | 2018-02-05 13:54:34 UTC |
Description (Pavel Moravec, 2017-09-13 08:24:18 UTC)
Script simulate_first_goferd_run-threaded.py (the routers must link-route pulp.* to qpidd):

```python
#!/usr/bin/python
from time import sleep
from uuid import uuid4
import threading

from proton import ConnectionException, Message, Timeout
from proton.utils import LinkDetached, ConnectionClosed, BlockingConnection
from proton.reactor import DynamicNodeProperties

SLEEP = 0.2
THREADS = 30
CONNECTION_URL = "proton+amqp://pmoravec-caps62-rhel7.gsslab.brq2.redhat.com:5647"

class CreateQueueAndSubscribe(threading.Thread):
    def __init__(self):
        super(CreateQueueAndSubscribe, self).__init__()
        self.qName = 'pulp.agent.FAKE.%s' % uuid4()

    def run(self):
        # Create a durable pulp.agent.* queue on the broker via a QMF request.
        qmf_conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=None)
        qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True,
                options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
        qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
        print self.qName
        content = {
            "_object_id": {"_object_name": "org.apache.qpid.broker:broker:amqp-broker"},
            "_method_name": "create",
            "_arguments": {"type": "queue", "name": self.qName,
                           "properties": {"exclusive": False, "auto-delete": False, "durable": True}}
        }
        properties = {
            'qmf.opcode': '_method_request',
            'x-amqp-0-10.app-id': 'qmf2',
            'method': 'request'
        }
        request = Message(
            body=content,
            reply_to=qmf_rec.remote_source.address,
            properties=properties,
            correlation_id=unicode(uuid4()).encode('utf-8'),
            subject='broker')
        qmf_snd.send(request)
        qmf_rec.receive()  # ignore what we have received as QMF response
        qmf_snd.close()
        qmf_rec.close()
        qmf_conn.close()

        # Subscribe to the queue and consume forever, reconnecting on failure.
        while True:
            connected = False
            while not connected:
                try:
                    conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=10)
                    rec = conn.create_receiver(self.qName, name=str(uuid4()))
                    connected = True
                except:
                    pass
            while connected:
                try:
                    impl = rec.receive(3)
                except Timeout:
                    pass
                # Note: the types must be grouped in a tuple; the original
                # "except LinkDetached, ConnectionClosed:" (Python 2 syntax)
                # catches only LinkDetached and binds it to the name
                # ConnectionClosed.
                except (LinkDetached, ConnectionClosed):
                    connected = False

threads = []
for i in range(THREADS):
    threads.append(CreateQueueAndSubscribe())
    sleep(SLEEP)
    threads[i].start()

while True:
    sleep(10)
```

A much more reliable reproducer:

1) In the 1st terminal, kill and respawn the python scripts mimicking goferd:

```
ssh root.pek2.redhat.com (beaker password)
while true; do
  kill $(ps aux | grep simulate | grep python | awk '{ print $2 }')
  for i in $(seq 1 30); do
    python simulate_first_goferd_run-threaded.py &
  done
  sleep $((RANDOM%10+25))
done
```

2) In the 2nd terminal, occasionally restart qdrouterd B (the one on the Capsule, between the clients and the "hub router"):

```
ssh root.pek2.redhat.com (beaker password)
while true; do
  kill $(ps aux | grep qdr | grep -v grep | awk '{ print $2 }')
  sleep 1
  /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf &
  sleep $((30+RANDOM%30))
done
```

3) In the 3rd terminal, run the "hub router", the one that segfaults:

```
ssh root.pek2.redhat.com (beaker password)
ulimit -c unlimited
/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
```

4) In the 4th terminal, clean up orphaned qpidd queues (each mimicked goferd client creates its own queues and does not clean them up):

```
ssh root.pek2.redhat.com (beaker password)
while true; do
  date
  ./delete_orphaned_queues.sh | wc   # number of deleted queues
  qpid-stat -q | wc                  # number of remaining queues
  sleep 90
done
```

5) Wait up to 10 minutes (usually approx. 2-3 minutes) for the qdrouterd in the 3rd terminal to segfault.

Some observations: I *think* the cleanup is somehow necessary, or at least *some* activity in qpidd that keeps the broker busy; i.e. the trigger for the segfault seems to be a busy qpidd broker that responds with some delay. Just a speculation: although the connector to the qpidd broker has idle-timeout-seconds: 0, maybe the connection is still somehow timed out anyway? That would explain both why a busy qpidd triggers it and the missing connection object (timed out)?

Side observation: from time to time, the qdrouterd on the Capsule (2nd terminal) did not finish / survived "kill <PID>". That *can* prevent segfaults of the "hub router", so be aware.
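For context, the reproducer assumes the routers link-route pulp.* to the qpidd broker and that the broker connector disables idle timeout. A sketch of the relevant qdrouterd.conf fragments for the qpid-dispatch 0.4 configuration syntax; the host, port, and connector name here are illustrative assumptions, not values taken from this report:

```
connector {
    name: broker
    host: localhost
    port: 5671
    role: on-demand
    idle-timeout-seconds: 0
}

linkRoutePattern {
    prefix: pulp.
    connector: broker
}
```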
This "not kill-able router" is in fact in a deadlock; it will be described in a separate BZ.

Reproducer script from #c3 (since the machine will be returned once a day):

```python
#!/usr/bin/python
from time import sleep
from uuid import uuid4
import threading

from proton import ConnectionException, Message, Timeout
from proton.utils import LinkDetached, ConnectionClosed, BlockingConnection
from proton.reactor import DynamicNodeProperties

SLEEP = 0.1
THREADS = 20  # was 30
CONNECTION_URL = "proton+amqp://localhost:5647"

class CreateQueueAndSubscribe(threading.Thread):
    def __init__(self):
        super(CreateQueueAndSubscribe, self).__init__()
        self.qName = 'pulp.agent.FAKE.%s' % uuid4()

    def run(self):
        # Create the durable pulp.agent.* queue via QMF, retrying until it succeeds.
        created = False
        while not created:
            try:
                qmf_conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=None)
                qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True,
                        options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
                qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
                print self.qName
                content = {
                    "_object_id": {"_object_name": "org.apache.qpid.broker:broker:amqp-broker"},
                    "_method_name": "create",
                    "_arguments": {"type": "queue", "name": self.qName,
                                   "properties": {"exclusive": False, "auto-delete": False, "durable": True}}
                }
                properties = {
                    'qmf.opcode': '_method_request',
                    'x-amqp-0-10.app-id': 'qmf2',
                    'method': 'request'
                }
                request = Message(
                    body=content,
                    reply_to=qmf_rec.remote_source.address,
                    properties=properties,
                    correlation_id=unicode(uuid4()).encode('utf-8'),
                    subject='broker')
                qmf_snd.send(request)
                qmf_rec.receive()  # ignore what we have received as QMF response
                qmf_snd.close()
                qmf_rec.close()
                qmf_conn.close()
                created = True
            except:
                created = False
        # end of while not created

        # Subscribe and consume forever, reconnecting on failure.
        while True:
            connected = False
            while not connected:
                try:
                    conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=10)
                    rec = conn.create_receiver(self.qName, name=str(uuid4()))
                    connected = True
                except:
                    try:
                        sleep(0.3)
                        conn.close()  # the original had "con.close()", an undefined name
                    except:
                        pass
            while connected:
                try:
                    impl = rec.receive(3)
                except LinkDetached as e:
                    connected = False
                    print type(e)
                    try:
                        conn.close()  # the original had "con.close()", an undefined name
                    except:
                        pass
                except Timeout as e:
                    pass
                except (ConnectionClosed, ConnectionException) as e:
                    connected = False
                    print type(e)

threads = []
for i in range(THREADS):
    threads.append(CreateQueueAndSubscribe())
    sleep(SLEEP)
    threads[i].start()

while True:
    sleep(10)
```

(Just for completeness, delete_orphaned_queues.sh is:

```
for i in $(qpid-stat -q | grep FAKE | grep "0 1$" | awk '{ print $1 }'); do
  echo $i
  qpid-receive -a "$i; {delete:always}"
done
```
)

Improved & simplified reproducer from #3 (mainly to ensure the deadlocked router on the -14 server does not block the reproducer, and that ulimits are properly set):

1) 1st terminal on dell-per430-32.gsslab.pek2.redhat.com:
   ./satellite-delete-old-queues-regularly.sh

2) 2nd terminal on dell-per430-32.gsslab.pek2.redhat.com (watch here for segfaults):
   ./satellite-run-qdrouterd-to-segfault.sh

3) 1st terminal on dell-per430-14.gsslab.pek2.redhat.com:
   ./capsule-reproducer-qdrouter.sh

4) 2nd terminal on dell-per430-14.gsslab.pek2.redhat.com:
   ./capsule-reproducer-clients.sh
   (Watch here that it prints, from time to time, lines like:
   pulp.agent.FAKE.5d257d6f-fab5-442d-ab96-065b72dc5d1c
   <class 'proton.utils.LinkDetached'>
   or a few other exceptions. If not, something is broken, e.g. an unresponsive qdrouterd on the "capsule" / -14 server.)

Brad,
if you are including Bug 1492355 in 6.2.14, you should target this bug as well, since both are addressed in the same qpid-dispatch-0.4-27 build.

Mike, thanks for the feedback! I will update this one as well. Be aware that I am setting the TM to 6.2.14; however, prior to each z-stream there is a triage meeting where the content for the z gets selected.
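A side note on the exception handling these reproducer scripts hinge on: in Python 2, which they target, the form `except LinkDetached, ConnectionClosed:` does not catch both types; it catches only LinkDetached and binds the instance to the name ConnectionClosed, so multiple types must be grouped in a tuple, as the later script does. A minimal, self-contained illustration (the exception classes here are stand-ins, not proton's):

```python
# Stand-in exception types (placeholders for proton's LinkDetached etc.).
class LinkDetached(Exception):
    pass

class ConnectionClosed(Exception):
    pass

def classify(exc):
    """Return 'caught' if the tuple handler below catches exc, else 'missed'."""
    try:
        raise exc
    except (LinkDetached, ConnectionClosed):
        # A tuple of types catches any of the listed exceptions.
        return "caught"
    except Exception:
        return "missed"

print(classify(LinkDetached()))      # caught
print(classify(ConnectionClosed()))  # caught
print(classify(ValueError()))        # missed
```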
VERIFIED on satellite-6.2.14-1.0.el7sat.noarch

I also tried the following:
- create a docker image with RHEL, with katello-consumer-ca and katello-agent installed (but not yet registered to Satellite)
- as a startup script, run subscription-manager registration and append a conditional loop that starts the gofer daemon on some sort of trigger (I mounted an external dir and made the loop check for the presence of some file)
- start up many containers (tried with 10, 30, 50)
- after all the containers are up and their registration is finished (verify by listing the content hosts in Satellite and by checking that no more requests arrive at the /rhsm endpoint), pull the trigger (in my case, create the file) to break the waiting loop, so that goferd starts on all containers simultaneously
- observe that the number of pulp.agent* queues bumps by the number of running containers in a moment
- watch the logs for any errors - no errors detected

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0273