Bug 1281947 - qdrouterd not responding in Satellite6
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: katello-agent
Version: 6.1.3
Hardware: x86_64 Linux
Priority: high, Severity: high
: 6.1.6
: --
Assigned To: Ted Ross
QA Contact: Tazim Kolhar
Keywords: Triaged
Reported: 2015-11-13 16:42 EST by Pavel Moravec
Modified: 2017-02-23 14:40 EST (History)
Fixed In Version: qpid-dispatch-0.4-11
Doc Type: Bug Fix
Doc Text:
A defect in qdrouterd caused a thread deadlock under some circumstances where connections were opening and closing simultaneously. This caused the qdrouterd process to stop responding to management queries and stop forwarding messages. This fix removes the defect such that the qpid-proton library is used properly.
Last Closed: 2016-01-21 02:42:42 EST
Type: Bug

Description Pavel Moravec 2015-11-13 16:42:30 EST
Description of problem:
qdrouterd in Satellite 6 deployments has been seen not responding, apparently deadlocked, with a specific backtrace. Although no reproducer that uses Satellite components (goferd, applying errata, etc.) is known at the moment, there is a reproducer outside Satellite 6 that relies on a Python script, qdrouterd and qpidd and leads to the same state. See the reproducer below.


Version-Release number of selected component (if applicable):
qpid-dispatch-router-0.4-10.el7.x86_64


How reproducible:
100% after some time


Steps to Reproduce:
1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"
2. Start qdrouterd with link routing of the "queue." prefix to the qpid broker
3. Take the script below. In a nutshell, it starts 10 threads; each thread creates a consumer on one of the "queue.X" queues, disconnects, and repeats this in a loop. Running several copies of the script in parallel causes many link establishments and link drops to be routed via qdrouterd concurrently.

Script itself:

#!/usr/bin/python

from time import sleep
from uuid import uuid4

from proton.utils import BlockingConnection

import threading

ROUTER_ADDRESS = "proton+amqp://10.34.84.156:5672"
ADDRESS = "queue"
HEARTBEAT = 2
SLEEP = 2.0
THREADS = 10


class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        print(self.address)
        self.running = True
        self.conn = None

    def connect(self):
        # Open a connection and attach a receiver link; on any failure
        # leave self.conn unset so that run() retries.
        try:
            self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=None, heartbeat=HEARTBEAT)
            self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)
        except Exception:
            self.conn = None

    def run(self):
        # Connect, keep the link for SLEEP seconds, disconnect, repeat.
        while self.running:
            while self.conn is None:
                self.connect()
            sleep(SLEEP)
            try:
                print("%s: reconnecting " % self.address)
                self.conn.close()
            except Exception as e:
                print(e)
            self.conn = None


# One receiver thread per queue ("queue.0" .. "queue.9"), started slightly apart.
threads = []
for i in range(THREADS):
    threads.append(ReceiverThread(i, '%s.%s' % (ADDRESS, i)))
    sleep(SLEEP / THREADS)
    threads[i].start()

while True:
    sleep(10)


4. Run that script 7 times in parallel (there is a chance it will coredump; if so, start 7 copies again)
5. From time to time, run qdstat -g as a liveness check of qdrouterd
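
The periodic liveness check in step 5 could be automated along the following lines. This is only a sketch in Python 3 (the reproducer itself was written for Python 2); the function names responds and watch, the polling interval and the timeout are assumptions made for illustration, not part of the original reproducer.

```python
import subprocess
import sys
import time


def responds(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out (subprocess.run kills the child) or could not be started.
        return False
    return done.returncode == 0


def watch(cmd, interval_s, timeout_s, max_checks):
    """Poll cmd up to max_checks times.

    Returns the 1-based number of the first failed check,
    or None if every check passed.
    """
    for n in range(1, max_checks + 1):
        if not responds(cmd, timeout_s):
            sys.stderr.write("check %d failed\n" % n)
            return n
        time.sleep(interval_s)
    return None
```

For example, watch(["qdstat", "-g"], interval_s=30, timeout_s=10, max_checks=40) would poll for roughly 20 minutes and report the first check that got no answer. On a Satellite, the qdstat call most likely also needs the -b and SSL options.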


Actual results:
Within 10 minutes, qdstat -g starts to time out and qdrouterd goes idle (0% CPU) while consuming a lot of memory. No client connections are accepted.


Expected results:
No deadlock / missing response.


Additional info:
Backtraces:

Thread 4 (Thread 0x7f0faa937700 (LWP 31721)):
#0  0x00007f0fb6f65f7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f0fb6f61d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f0fb6f61c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f0fb73e4725 in sys_mutex_lock (mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/threading.c:62
#4  0x00007f0fb73eea3e in qd_connection_invoke_deferred (conn=conn@entry=0x7f0fa4012790, 
    call=call@entry=0x7f0fb73e9a30 <qd_router_open_routed_link>, context=0x7f0f9cd7d410)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:1135
#5  0x00007f0fb73e9973 in router_link_attach_handler (context=0x1a9a120, link=<optimized out>)
    at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1633
#6  0x00007f0fb73db325 in handle_link_open (container=<optimized out>, pn_link=0x7f0fa29bc790)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:209
#7  process_handler (unused=<optimized out>, qd_conn=0x1b32c50, container=0x1a82b10)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:390
#8  handler (handler_context=0x1a82b10, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, 
    qd_conn=0x1b32c50) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:494
#9  0x00007f0fb73ed9dc in process_connector (cxtr=0x1bf89c0, qd_server=0x18b8b40)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#10 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#11 0x00007f0fb6f5fdf5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f0fb64bb1ad in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0faa136700 (LWP 31722)):
#0  0x00007f0fb6f65f7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f0fb6f61d32 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f0fb6f61c38 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f0fb73e4725 in sys_mutex_lock (mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/threading.c:62
#4  0x00007f0fb73eb1f2 in qd_router_send (qd=qd@entry=0x179b030, address=address@entry=0x7f0fa005ed00, 
    msg=msg@entry=0x7f0fa0339870) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:2141
#5  0x00007f0fb73eb468 in qd_router_send2 (qd=0x179b030, address=<optimized out>, msg=msg@entry=0x7f0fa0339870)
    at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:2209
#6  0x00007f0fb73e5435 in qd_python_send (self=0x1af3e68, args=<optimized out>)
    at /usr/src/debug/qpid-dispatch-0.4/src/python_embedded.c:608
#7  0x00007f0fb6866b74 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8  0x00007f0fb6866930 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00007f0fb6866930 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007f0fb686818d in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#11 0x00007f0fb67f5088 in function_call () from /lib64/libpython2.7.so.1.0
#12 0x00007f0fb67d0073 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00007f0fb67df075 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#14 0x00007f0fb67d0073 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007f0fb6861fd7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#16 0x00007f0fb73ec631 in qd_pyrouter_tick (router=router@entry=0x1a9a120)
    at /usr/src/debug/qpid-dispatch-0.4/src/router_pynode.c:706
#17 0x00007f0fb73e6ee9 in qd_router_timer_handler (context=0x1a9a120)
    at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1914
#18 0x00007f0fb73ed367 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:490
#19 0x00007f0fb6f5fdf5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f0fb64bb1ad in clone () from /lib64/libc.so.6
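
Both backtraces above end in pthread_mutex_lock, but they do not show which thread holds which mutex. The following Python 3 sketch is therefore only a generic illustration of lock-order inversion, one common way for two threads to end up blocked like this. It is not the actual qdrouterd defect, and lock_a and lock_b are hypothetical names. Timeouts are used so that the demonstration terminates instead of hanging.

```python
import threading


def lock_order_inversion(timeout_s=0.2):
    """Run two threads that take two locks in opposite order.

    Returns {thread name: True if its second acquire timed out}.
    """
    lock_a = threading.Lock()   # hypothetical, stands for one mutex
    lock_b = threading.Lock()   # hypothetical, stands for another mutex
    barrier = threading.Barrier(2)
    blocked = {}

    def worker(name, first, second):
        with first:
            # Wait until both threads hold their first lock.
            barrier.wait()
            got = second.acquire(timeout=timeout_s)
            if got:
                second.release()
            blocked[name] = not got
            # Keep `first` held until both attempts are over.
            barrier.wait()

    t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
    t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return blocked
```

With plain blocking acquires instead of timeouts, both threads would wait forever, which from the outside looks like an idle process that no longer answers.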
Comment 2 Pavel Moravec 2015-11-20 07:45:43 EST
I can confirm qpid-dispatch 0.4-11 fixes the deadlock.

Regarding the reproducer: the one above has a lower probability of hitting the bug. An improved version is:

1) Have the just_many_consumers.2.py script from #c0
2) Run the script 10 times in parallel and restart any copy that stops (adjust "-lt 10" to account for other python processes running on the host):

while true; do while [ $(pgrep python | wc -l) -lt 10 ]; do python just_many_consumers.2.py & sleep 1; done; sleep 1; date; done

3) From time to time, kill all the scripts. This greatly improves the reliability of the reproducer, because it triggers many link drops in parallel, followed promptly by many link establishments:

while true; do kill $(ps aux | grep just | grep python | awk '{ print $2 }'); sleep 10; sleep $(($((RANDOM))%10)); date; done
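
The two shell loops above can also be combined into a single supervisor. This is a rough Python 3 sketch, not a drop-in replacement: the function names top_up, kill_all and churn are made up for illustration, and unlike the ps/grep pipeline it only kills the copies it started itself.

```python
import random
import subprocess
import sys
import time

# Script name as used in the shell loops; adjust the path as needed.
DEFAULT_CMD = ["python", "just_many_consumers.2.py"]


def top_up(procs, cmd, target, stagger_s=0.0):
    """Forget exited children, then start new ones until `target` are running."""
    procs[:] = [p for p in procs if p.poll() is None]
    while len(procs) < target:
        procs.append(subprocess.Popen(cmd))
        time.sleep(stagger_s)


def kill_all(procs):
    """Send SIGTERM to every running child, reap them, and empty the list."""
    for p in procs:
        if p.poll() is None:
            p.terminate()
    for p in procs:
        p.wait()
    del procs[:]


def churn(cmd=DEFAULT_CMD, target=10, cycles=3, stagger_s=1.0):
    """Keep `target` copies alive, killing all of them every 10 to 19 seconds."""
    procs = []
    for _ in range(cycles):
        deadline = time.monotonic() + 10 + random.randrange(10)
        while time.monotonic() < deadline:
            top_up(procs, cmd, target, stagger_s)
            time.sleep(1)
        kill_all(procs)
        sys.stdout.write(time.ctime() + "\n")
```

Calling churn() with the defaults runs three kill cycles. A larger cycles value keeps it going longer while qdstat -g is checked from another terminal.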
Comment 3 Tazim Kolhar 2016-01-08 05:45:11 EST
Hi,

    Could you please explain step 1 and step 2?
    I did not understand them. Is there a command for them?

      1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"
      2. Start qdrouterd with link routing of the "queue." prefix to the qpid broker

thanks and regards,
Tazim
Comment 4 Pavel Moravec 2016-01-08 06:21:49 EST
(In reply to Tazim Kolhar from comment #3)
> Hi,
> 
>     Could you please explain step 1 and step 2?
>     I did not understand them. Is there a command for them?
> 
>       1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"
>       2. Start qdrouterd with link routing of the "queue." prefix to the qpid broker
> 
> thanks and regards,
> Tazim

Sure.

>       1. Start qpid broker with queues "queue.0", "queue.1" to "queue.9"

Just add the queues to the running broker on Satellite:

for i in $(seq 0 9); do
  qpid-config add queue queue.${i} --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671
done

(ensure the queues are present by:

qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q | grep queue

)
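
If the queue setup needs to be scripted from Python rather than from the shell, the same qpid-config invocations can be generated as argument lists. This is a sketch only: the helper names add_queue_commands and run_all are hypothetical, and the certificate path and broker URL are simply the ones used in the loop above.

```python
import subprocess

CERT = "/etc/pki/katello/qpid_client_striped.crt"
BROKER = "amqps://localhost:5671"


def add_queue_commands(prefix="queue", count=10, cert=CERT, broker=BROKER):
    """Build one `qpid-config add queue` argv per queue prefix.0 .. prefix.(count-1)."""
    return [
        ["qpid-config", "add", "queue", "%s.%d" % (prefix, i),
         "--ssl-certificate=" + cert, "-b", broker]
        for i in range(count)
    ]


def run_all(commands, runner=subprocess.check_call):
    """Run each argv in order; check_call raises on the first non-zero exit."""
    for argv in commands:
        runner(argv)
```

run_all(add_queue_commands()) would then create queue.0 to queue.9. Passing runner=print instead shows the commands without executing them.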

>       2. Start qdrouterd with link routing of the "queue." prefix to the qpid broker

Add this section to /etc/qpid-dispatch/qdrouterd.conf:

linkRoutePattern {
    prefix: queue.
    connector: broker
}

and restart the qdrouterd service to apply it.
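
To double-check that the section really is in the configuration before restarting, something like the following Python 3 sketch could be used. It assumes the simple "name { key: value }" layout shown above and does not handle comments or nested braces. The function name link_route_prefixes is made up for illustration.

```python
import re

SECTION = re.compile(r"linkRoutePattern\s*\{([^}]*)\}")
PREFIX = re.compile(r"^\s*prefix\s*:\s*(\S+)\s*$", re.MULTILINE)


def link_route_prefixes(conf_text):
    """Return the prefix value of each linkRoutePattern section in conf_text."""
    prefixes = []
    for body in SECTION.findall(conf_text):
        match = PREFIX.search(body)
        if match:
            prefixes.append(match.group(1))
    return prefixes
```

Reading /etc/qpid-dispatch/qdrouterd.conf and passing its text to link_route_prefixes should yield a list that contains "queue." once the section has been added.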


The reproducer is a little bit artificial (it does not follow the steps a Satellite user would take to get qdrouterd stuck), but I suppose it is sufficient: the reproducer script just mimics the kind of activity that Content Hosts generate.
Comment 5 Tazim Kolhar 2016-01-18 10:47:41 EST
VERIFIED:

# rpm -qa | grep foreman
hp-dl180g6-01.rhts.eng.bos.redhat.com-foreman-proxy-1.0-1.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el7sat.noarch
foreman-vmware-1.7.2.50-1.el7sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.5-1.el7sat.noarch
foreman-selinux-1.7.2.17-1.el7sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.14-1.el7sat.noarch
foreman-ovirt-1.7.2.50-1.el7sat.noarch
foreman-1.7.2.50-1.el7sat.noarch
ruby193-rubygem-foreman_docker-1.2.0.24-1.el7sat.noarch
ruby193-rubygem-foreman-tasks-0.6.15.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_docker-0.0.3.10-1.el7sat.noarch
foreman-debug-1.7.2.50-1.el7sat.noarch
foreman-proxy-1.7.2.8-1.el7sat.noarch
hp-dl180g6-01.rhts.eng.bos.redhat.com-foreman-client-1.0-1.noarch
hp-dl180g6-01.rhts.eng.bos.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-discovery-image-3.0.5-3.el7sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el7sat.noarch
foreman-libvirt-1.7.2.50-1.el7sat.noarch
foreman-gce-1.7.2.50-1.el7sat.noarch
rubygem-hammer_cli_foreman-0.1.4.15-1.el7sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.23-1.el7sat.noarch
foreman-postgresql-1.7.2.50-1.el7sat.noarch
foreman-compute-1.7.2.50-1.el7sat.noarch
ruby193-rubygem-foreman-redhat_access-0.2.4-1.el7sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.10-1.el7sat.noarch


steps:
Steps 1 and 2 were set up exactly as described in comment 4 (queues queue.0 to queue.9 added with qpid-config, linkRoutePattern section added to /etc/qpid-dispatch/qdrouterd.conf, qdrouterd service restarted). The reproducer script from the Description was used unchanged.

4. Run that script 7 times in parallel (there is a chance it will coredump - if so re-run it 7 times again)

# qdstat -b 0.0.0.0:5647 --ssl-certificate /etc/pki/katello/qpid_router_client.crt --ssl-key /etc/pki/katello/qpid_router_client.key --ssl-trustfile /etc/pki/katello/certs/katello-default-ca.crt -g
Router Statistics
  attr           value
  ======================================================
  Mode           interior
  Area           0
  Router Id      hp-dl180g6-01.rhts.eng.bos.redhat.com
  Address Count  12
  Link Count     2
  Node Count     0

No deadlock / missing response.
Confirmed with the developers too; their reply was that getting a response is the more important check.

Moving it to Verified.
Comment 10 errata-xmlrpc 2016-01-21 02:42:42 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0052
