Bug 1281947
| Summary: | qdrouterd not responding in Satellite6 | ||
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | katello-agent | Assignee: | Ted Ross <tross> |
| Status: | CLOSED ERRATA | QA Contact: | Tazim Kolhar <tkolhar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 6.1.3 | CC: | bbuckingham, bkearney, chrobert, cwelton, daobrien, j.bittner, juwu, mcressma, mmccune, pmoravec, sthirugn, tkolhar, tross |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | qpid-dispatch-0.4-11 | Doc Type: | Bug Fix |
| Doc Text: | A defect in qdrouterd caused a thread deadlock under some circumstances where connections were opening and closing simultaneously. This caused the qdrouterd process to stop responding to management queries and stop forwarding messages. This fix removes the defect such that the qpid-proton library is used properly. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-01-21 07:42:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I can confirm qpid-dispatch 0.4-11 fixes the deadlock.
Regarding the reproducer: the one above has a lower probability of hitting the bug. An improved version is:
1) Have the just_many_consumers.2.py script from #c0
2) Run the script 10 times in parallel and restart it whenever a copy stops (adjust "-lt 10" to account for other python processes running on the host):
while true; do while [ $(pgrep python | wc -l) -lt 10 ]; do python just_many_consumers.2.py & sleep 1; done; sleep 1; date; done
3) Periodically kill all the scripts. This greatly improves the reliability of the reproducer, because it makes many links drop in parallel (followed by prompt link re-establishment):
while true; do kill $(ps aux | grep just | grep python | awk '{ print $2 }'); sleep 10; sleep $(($((RANDOM))%10)); date; done
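The two shell loops above can also be sketched as a single supervisor in Python 3. This is only an illustration under assumptions: the function names (`top_up`, `kill_all`, `run_forever`) are hypothetical, the supervisor tracks only the children it started itself (the shell version counts every python process on the host via pgrep), and it starts missing copies at once rather than one per second.

```python
import random
import subprocess
import sys
import time

# Names taken from the comment above; adjust for the host being tested.
COMMAND = ["python", "just_many_consumers.2.py"]
TARGET = 10  # copies of the script to keep running


def top_up(children, command, target=TARGET):
    """Drop exited children and start new ones until `target` are running."""
    alive = [child for child in children if child.poll() is None]
    while len(alive) < target:
        alive.append(subprocess.Popen(command))
    return alive


def kill_all(children):
    """Terminate every child so that many links drop at the same moment."""
    for child in children:
        if child.poll() is None:
            child.terminate()  # SIGTERM, like the plain `kill` in the shell loop
    for child in children:
        child.wait()  # reap the process so it does not linger as a zombie
    return []


def run_forever(command=COMMAND, target=TARGET):
    """Keep `target` copies alive and kill them all every 10 to 19 seconds."""
    children = []
    next_kill = time.time() + 10 + random.randrange(10)
    while True:
        children = top_up(children, command, target)
        if time.time() >= next_kill:
            children = kill_all(children)
            next_kill = time.time() + 10 + random.randrange(10)
            sys.stdout.write(time.ctime() + "\n")
        time.sleep(1)
```

Calling `run_forever()` in the directory that holds the script starts the loop; it runs until interrupted.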
Hi,
Can you please explain step 1 and step 2?
I did not understand them. Is there a command for them?
1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"
2. Start qdrouterd with link routing queue. to the qpid router
thanks and regards,
Tazim
(In reply to Tazim Kolhar from comment #3)
> Hi,
>
> can you please provide the explanation for step 1 and step 2 :
> I did not get them . is there any command for same ?
>
> 1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"
> 2. Start qdrouterd with link routing queue. to the qpid router
>
> thanks and regards,
> Tazim

Sure.

> 1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"

Just add the queues to the running broker on Satellite:

for i in $(seq 0 9); do
    qpid-config add queue queue.${i} --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671
done

(ensure the queues are present by:
qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q | grep queue
)

> 2. Start qdrouterd with link routing queue. to the qpid router

Add to /etc/qpid-dispatch/qdrouterd.conf this section:

linkRoutePattern {
    prefix: queue.
    connector: broker
}

and restart the qdrouterd service to apply it.

The reproducer is a little bit artificial (it does not follow the steps a Satellite user would take to get qdrouterd stuck), but I suppose it is sufficient (it just mimics such activity done on Content Hosts via the reproducer script).
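For anyone scripting the setup, the queue-creation loop above could be sketched in Python 3 roughly as follows. The helper names (`add_queue_commands`, `create_queues`, `missing_queues`) are hypothetical. The `qpid-config` arguments are copied from the commands in this comment. The check assumes that each queue name appears as a whitespace-delimited token in the `qpid-stat -q` output, which is an assumption about that output format.

```python
import subprocess

# Certificate path and broker URL as used in the comment above.
CERT = "/etc/pki/katello/qpid_client_striped.crt"
BROKER = "amqps://localhost:5671"


def queue_names(prefix="queue.", count=10):
    """Return queue.0 .. queue.<count-1>."""
    return ["%s%d" % (prefix, i) for i in range(count)]


def add_queue_commands(prefix="queue.", count=10, cert=CERT, broker=BROKER):
    """Build one `qpid-config add queue` argument list per queue."""
    return [
        ["qpid-config", "add", "queue", name,
         "--ssl-certificate=" + cert, "-b", broker]
        for name in queue_names(prefix, count)
    ]


def create_queues():
    """Run the commands; raises CalledProcessError if qpid-config fails."""
    for command in add_queue_commands():
        subprocess.check_call(command)


def missing_queues(qpid_stat_output, prefix="queue.", count=10):
    """List expected queues that do not show up in `qpid-stat -q` output."""
    present = set(qpid_stat_output.split())
    return [name for name in queue_names(prefix, count) if name not in present]
```

Building the argument lists separately from running them makes it easy to print the commands for review before touching the broker.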
VERIFIED:

# rpm -qa | grep foreman
hp-dl180g6-01.rhts.eng.bos.redhat.com-foreman-proxy-1.0-1.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el7sat.noarch
foreman-vmware-1.7.2.50-1.el7sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.5-1.el7sat.noarch
foreman-selinux-1.7.2.17-1.el7sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.14-1.el7sat.noarch
foreman-ovirt-1.7.2.50-1.el7sat.noarch
foreman-1.7.2.50-1.el7sat.noarch
ruby193-rubygem-foreman_docker-1.2.0.24-1.el7sat.noarch
ruby193-rubygem-foreman-tasks-0.6.15.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_docker-0.0.3.10-1.el7sat.noarch
foreman-debug-1.7.2.50-1.el7sat.noarch
foreman-proxy-1.7.2.8-1.el7sat.noarch
hp-dl180g6-01.rhts.eng.bos.redhat.com-foreman-client-1.0-1.noarch
hp-dl180g6-01.rhts.eng.bos.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-discovery-image-3.0.5-3.el7sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el7sat.noarch
foreman-libvirt-1.7.2.50-1.el7sat.noarch
foreman-gce-1.7.2.50-1.el7sat.noarch
rubygem-hammer_cli_foreman-0.1.4.15-1.el7sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.23-1.el7sat.noarch
foreman-postgresql-1.7.2.50-1.el7sat.noarch
foreman-compute-1.7.2.50-1.el7sat.noarch
ruby193-rubygem-foreman-redhat_access-0.2.4-1.el7sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.10-1.el7sat.noarch

steps:

> 1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"

Just add the queues to the running broker on Satellite:

for i in $(seq 0 9); do
    qpid-config add queue queue.${i} --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671
done

(ensure the queues are present by:
qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q | grep queue
)

> 2. Start qdrouterd with link routing queue. to the qpid router

Add to /etc/qpid-dispatch/qdrouterd.conf this section:

linkRoutePattern {
    prefix: queue.
    connector: broker
}

and restart the qdrouterd service to apply it.

Script itself:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException, Timeout
from proton.utils import BlockingConnection
import threading
import traceback
import os

ROUTER_ADDRESS = "proton+amqp://10.34.84.156:5672"
ADDRESS = "queue"
HEARTBEAT = 2
SLEEP = 2.0
THREADS = 10

class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        print self.address
        self.running = True
        self.conn = None

    def connect(self):
        try:
            self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
            self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)
        except Exception:
            self.conn = None

    def run(self):
        while self.running:
            while self.conn == None:
                self.connect()
            sleep(SLEEP)
            try:
                print "%s: reconnecting " % self.address
                self.conn.close()
            except Exception, e:
                print e
                pass
            self.conn = None

threads = []
for i in range(THREADS):
    threads.append(ReceiverThread(i, '%s.%s' %(ADDRESS, i)))
    sleep(SLEEP/THREADS)
    threads[i].start()

while True:
    sleep(10)

4. Run that script 7 times in parallel (there is a chance it will coredump - if so re-run it 7 times again)

# qdstat -b 0.0.0.0:5647 --ssl-certificate /etc/pki/katello/qpid_router_client.crt --ssl-key /etc/pki/katello/qpid_router_client.key --ssl-trustfile /etc/pki/katello/certs/katello-default-ca.crt -g
Router Statistics
  attr           value
  ======================================================
  Mode           interior
  Area           0
  Router Id      hp-dl180g6-01.rhts.eng.bos.redhat.com
  Address Count  12
  Link Count     2
  Node Count     0

No deadlock / missing response.
Confirmed with the developers too; they replied that getting a response is the more important check, so moving this to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2016:0052
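The `qdstat -g` liveness check used in the verification above could be wrapped in a small Python 3 helper with an external timeout, so that a hung router shows up as a plain False instead of a blocked terminal. This is a sketch under assumptions: `is_responsive` and `watch` are hypothetical names, the 10-second timeout and 30-second interval are arbitrary choices, and the qdstat arguments are copied from the verification command.

```python
import subprocess
import sys
import time

# Arguments copied from the verification command in this comment.
QDSTAT = [
    "qdstat", "-b", "0.0.0.0:5647",
    "--ssl-certificate", "/etc/pki/katello/qpid_router_client.crt",
    "--ssl-key", "/etc/pki/katello/qpid_router_client.key",
    "--ssl-trustfile", "/etc/pki/katello/certs/katello-default-ca.crt",
    "-g",
]


def is_responsive(command=QDSTAT, timeout=10.0):
    """Return True if `command` exits with status 0 within `timeout` seconds.

    subprocess.run kills the child and raises TimeoutExpired on timeout.
    A missing executable still raises FileNotFoundError to the caller.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def watch(command=QDSTAT, interval=30.0, timeout=10.0):
    """Poll until the first failed check, then report it and return."""
    while is_responsive(command, timeout):
        time.sleep(interval)
    sys.stderr.write("%s: no response from router\n" % time.ctime())
```

Running `watch()` alongside the reproducer gives a timestamp for the moment the router stops answering, which helps when collecting backtraces.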
Description of problem:
qdrouterd in Satellite6 deployments has been seen not responding / apparently deadlocked, with a specific backtrace. Although a reproducer scenario utilizing Satellite bits (i.e. goferd, applying errata etc.) is unknown at the moment, there is a reproducer scenario outside Satellite6, relying on a python script, qdrouterd and qpidd, that leads to the same state. See reproducer below.

Version-Release number of selected component (if applicable):
qpid-dispatch-router-0.4-10.el7.x86_64

How reproducible:
100% after some time

Steps to Reproduce:
1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"
2. Start qdrouterd with link routing queue. to the qpid router
3. Run 7 times in parallel below script. In a nutshell, it fires 10 threads where each thread creates a consumer to one of the "queue.X" queues, disconnects, and repeats this in a loop. Running the script in parallel causes many link establishments and drops to be routed via qdrouterd in parallel.

Script itself:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException, Timeout
from proton.utils import BlockingConnection
import threading
import traceback
import os

ROUTER_ADDRESS = "proton+amqp://10.34.84.156:5672"
ADDRESS = "queue"
HEARTBEAT = 2
SLEEP = 2.0
THREADS = 10

class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        print self.address
        self.running = True
        self.conn = None

    def connect(self):
        try:
            self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
            self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)
        except Exception:
            self.conn = None

    def run(self):
        while self.running:
            while self.conn == None:
                self.connect()
            sleep(SLEEP)
            try:
                print "%s: reconnecting " % self.address
                self.conn.close()
            except Exception, e:
                print e
                pass
            self.conn = None

threads = []
for i in range(THREADS):
    threads.append(ReceiverThread(i, '%s.%s' %(ADDRESS, i)))
    sleep(SLEEP/THREADS)
    threads[i].start()

while True:
    sleep(10)

4. Run that script 7 times in parallel (there is a chance it will coredump - if so re-run it 7 times again)
5. Run qdstat -g from time to time as a liveness check of qdrouterd

Actual results:
Within 10 minutes, qdstat -g starts to time out; qdrouterd becomes idle (0% CPU) while consuming a lot of memory. No client connections are accepted.

Expected results:
No deadlock / missing response.

Additional info:
Backtraces:

Thread 4 (Thread 0x7f0faa937700 (LWP 31721)):
#0 0x00007f0fb6f65f7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f0fb6f61d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007f0fb6f61c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f0fb73e4725 in sys_mutex_lock (mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/threading.c:62
#4 0x00007f0fb73eea3e in qd_connection_invoke_deferred (conn=conn@entry=0x7f0fa4012790, call=call@entry=0x7f0fb73e9a30 <qd_router_open_routed_link>, context=0x7f0f9cd7d410) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:1135
#5 0x00007f0fb73e9973 in router_link_attach_handler (context=0x1a9a120, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1633
#6 0x00007f0fb73db325 in handle_link_open (container=<optimized out>, pn_link=0x7f0fa29bc790) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:209
#7 process_handler (unused=<optimized out>, qd_conn=0x1b32c50, container=0x1a82b10) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:390
#8 handler (handler_context=0x1a82b10, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x1b32c50) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:494
#9 0x00007f0fb73ed9dc in process_connector (cxtr=0x1bf89c0, qd_server=0x18b8b40) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#10 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#11 0x00007f0fb6f5fdf5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f0fb64bb1ad in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0faa136700 (LWP 31722)):
#0 0x00007f0fb6f65f7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f0fb6f61d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007f0fb6f61c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f0fb73e4725 in sys_mutex_lock (mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/threading.c:62
#4 0x00007f0fb73eb1f2 in qd_router_send (qd=qd@entry=0x179b030, address=address@entry=0x7f0fa005ed00, msg=msg@entry=0x7f0fa0339870) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:2141
#5 0x00007f0fb73eb468 in qd_router_send2 (qd=0x179b030, address=<optimized out>, msg=msg@entry=0x7f0fa0339870) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:2209
#6 0x00007f0fb73e5435 in qd_python_send (self=0x1af3e68, args=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/python_embedded.c:608
#7 0x00007f0fb6866b74 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8 0x00007f0fb6866930 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9 0x00007f0fb6866930 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007f0fb686818d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#11 0x00007f0fb67f5088 in function_call () from /lib64/libpython2.7.so.1.0
#12 0x00007f0fb67d0073 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00007f0fb67df075 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#14 0x00007f0fb67d0073 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007f0fb6861fd7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#16 0x00007f0fb73ec631 in qd_pyrouter_tick (router=router@entry=0x1a9a120) at /usr/src/debug/qpid-dispatch-0.4/src/router_pynode.c:706
#17 0x00007f0fb73e6ee9 in qd_router_timer_handler (context=0x1a9a120) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1914
#18 0x00007f0fb73ed367 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:490
#19 0x00007f0fb6f5fdf5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f0fb64bb1ad in clone () from /lib64/libc.so.6
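When comparing backtraces from several stuck routers, it can help to reduce each one to "which thread is waiting on a mutex, and from which function". The Python 3 sketch below does that for gdb output shaped like the backtraces above. The function names (`parse_backtrace`, `lock_waiters`) and the regular expression are illustrative assumptions, not part of any qpid tooling, and the parser only names the waiting callers; it does not identify which mutex is held by whom.

```python
import re

# Matches either a gdb thread header or a frame line such as
# "#4 0x00007f... in qd_router_send (" or "#7 process_handler (".
TOKEN_RE = re.compile(
    r"Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP \d+\)\):"
    r"|#(\d+)\s+(?:0x[0-9a-f]+ in )?(\w+) \("
)


def parse_backtrace(text):
    """Map thread number -> list of function names, innermost frame first.

    Works on one-frame-per-line output and on text flattened to one line.
    """
    threads = {}
    current = None
    for match in TOKEN_RE.finditer(text):
        if match.group(1) is not None:
            current = int(match.group(1))
            threads[current] = []
        elif current is not None:
            threads[current].append(match.group(3))
    return threads


def is_lock_frame(name):
    """True for the libpthread and sys_mutex_lock frames on top of a waiter."""
    return (
        name in ("__lll_lock_wait", "pthread_mutex_lock", "sys_mutex_lock")
        or name.startswith("_L_lock_")
    )


def lock_waiters(threads):
    """Map thread -> first non-lock function for threads blocked on a mutex."""
    waiters = {}
    for thread, frames in threads.items():
        depth = 0
        while depth < len(frames) and is_lock_frame(frames[depth]):
            depth += 1
        if "pthread_mutex_lock" in frames[:depth] and depth < len(frames):
            waiters[thread] = frames[depth]
    return waiters
```

Fed the two backtraces above, this reports thread 4 waiting in qd_connection_invoke_deferred and thread 3 waiting in qd_router_send.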