Bug 672270 - clustered qpidd can enter unresponsive state
Summary: clustered qpidd can enter unresponsive state
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: ---
Assignee: messaging-bugs
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-01-24 16:40 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:13 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-25 07:51:44 UTC
Target Upstream Version:


Attachments (Terms of Use)
The issue reproducer and logs (178.78 KB, application/x-tbz)
2011-01-24 16:45 UTC, Frantisek Reznicek
no flags Details

Description Frantisek Reznicek 2011-01-24 16:40:21 UTC
Description of problem:

While reproducing the bug 667428 I triggered very frequently state when qpidd broker process is running, but no client can be served. (all get 'Connection refused')

The scenario which triggers this problem is bug 667428 looped by shell script (see attachement).

Once the testing script finishes looping one broker on port 10000 should be up and running. (the testing script launches brokers at ports 5672, then 10000, then launches transactional java sender and kills the head (first/5672) broker)

The process backtrace from the unresponsive:

(gdb)   4 Thread 0x424c6940 (LWP 27463)  0x000000311740b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 0x42ec7940 (LWP 27465)  0x000000311740b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 0x438c8940 (LWP 27467)  0x00000031168d44b8 in epoll_wait ()
   from /lib64/libc.so.6
* 1 Thread 0x2b745ec5efe0 (LWP 27457)  0x00000031168d44b8 in epoll_wait ()
   from /lib64/libc.so.6
(gdb)
Thread 4 (Thread 0x424c6940 (LWP 27463)):
#0  0x000000311740b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002b745e9c75ba in qpid::sys::Timer::run (this=0xffffffffffffff92)
    at ../include/qpid/sys/posix/Condition.h:69
#2  0x00002b745e8e91aa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x166a4014) at qpid/sys/posix/Thread.cpp:35
#3  0x000000311740673d in start_thread () from /lib64/libpthread.so.0
#4  0x00000031168d40cd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x42ec7940 (LWP 27465)):
#0  0x000000311740b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002b745e9c75ba in qpid::sys::Timer::run (this=0xffffffffffffff92)
    at ../include/qpid/sys/posix/Condition.h:69
#2  0x00002b745e8e91aa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x166d3494) at qpid/sys/posix/Thread.cpp:35
#3  0x000000311740673d in start_thread () from /lib64/libpthread.so.0
#4  0x00000031168d40cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x438c8940 (LWP 27467)):
#0  0x00000031168d44b8 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b745e8f24c1 in qpid::sys::Poller::wait (this=0x1669d4f0,
    timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x00002b745e8f2f57 in qpid::sys::Poller::run (this=0x1669d4f0)
    at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x00002b745e8e91aa in qpid::sys::(anonymous namespace)::runRunnable (p=0x8)
    at qpid/sys/posix/Thread.cpp:35
#4  0x000000311740673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000031168d40cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b745ec5efe0 (LWP 27457)):
#0  0x00000031168d44b8 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b745e8f24c1 in qpid::sys::Poller::wait (this=0x1669d4f0,
    timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x00002b745e8f2f57 in qpid::sys::Poller::run (this=0x1669d4f0)
    at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x00002b745e433462 in qpid::broker::Broker::run (
    this=<value optimized out>) at qpid/broker/Broker.cpp:344
#4  0x0000000000406acc in QpiddBroker::execute (this=<value optimized out>,
    options=0x1669c790) at posix/QpiddBroker.cpp:187
#5  0x000000000040559f in main (argc=9, argv=0x7fff34af8c78) at qpidd.cpp:80
(gdb) Detaching from program: /usr/sbin/qpidd, process 27457
(gdb) quit


This issue is reliably reproduced on RHEL 5.6 x86_64 (KVM).


Version-Release number of selected component (if applicable):
openais-0.80.6-28.el5
python-qmf-0.7.946106-14.el5
python-qpid-0.7.946106-15.el5
qmf-0.7.946106-27.el5
qmf-devel-0.7.946106-27.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5


How reproducible:
~ 50%

Steps to Reproduce:
1. configure qpidd, configure openais
2. run the attached reproducer ./bz667428.sh 3
3. ps auxw | grep qpidd
  
Actual results:
qpidd reaches easily unresponsive state.

Expected results:
qpidd should not reach unresponsive state.

Additional info:

[root@dhcp-26-233 bz667428]# ps auxw | grep qpidd
root     27457  0.4  4.2 207016 16128 pts/2    Sl   17:22   0:00 qpidd --auth=no --truncate yes --cluster-name=bz667428_at_dhcp-26-233.brq.redhat.com --log-enable=debug+ -p 10000 --data-dir=./dd10000
root     27673  0.0  0.1  61228   748 pts/2    R+   17:24   0:00 grep qpidd
[root@dhcp-26-233 bz667428]# qpid-config
[root@dhcp-26-233 bz667428]# netstat -nlp | grep qpidd
tcp        0      0 0.0.0.0:10000               0.0.0.0:*                   LISTEN      27457/qpidd
Failed: error: (111, 'Connection refused')
[root@dhcp-26-233 bz667428]# qpid-cluster
Failed: error - (111, 'Connection refused')
[root@dhcp-26-233 bz667428]# dump_process 27457
[root@dhcp-26-233 bz667428]# cat dump_pid_27457.20110124_172448.qpidd
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
(gdb) Attaching to process 27457
Reading symbols from /usr/sbin/qpidd...Reading symbols from /usr/lib/debug/usr/sbin/qpidd.debug...done.
(gdb) Using memory regions provided by the target.
There are no memory regions defined.
(gdb) From                To                  Syms Read   Shared Object Library
0x00002b745e3bf760  0x00002b745e5508b8  Yes (*)     /usr/lib64/libqpidbroker.so.3
0x00002b745e8d4680  0x00002b745e9d3458  Yes (*)     /usr/lib64/libqpidcommon.so.3
0x0000003132807820  0x0000003132814e58  Yes (*)     /usr/lib64/libqpidtypes.so.1
0x0000003c78a10aa0  0x0000003c78a2dae8  Yes (*)     /usr/lib64/libboost_program_options.so.2
0x0000003c78e04810  0x0000003c78e0cff8  Yes (*)     /usr/lib64/libboost_filesystem.so.2
0x000000311ac01500  0x000000311ac02918  Yes (*)     /lib64/libuuid.so.1
0x0000003116c00e10  0x0000003116c01a08  Yes (*)     /lib64/libdl.so.2
0x0000003118002220  0x0000003118005d98  Yes (*)     /lib64/librt.so.1
0x0000003c79c046e0  0x0000003c79c13be8  Yes (*)     /usr/lib64/libsasl2.so.2
0x000000311b84f430  0x000000311b8c30b8  Yes (*)     /usr/lib64/libstdc++.so.6
0x0000003117003e60  0x0000003117043e38  Yes (*)     /lib64/libm.so.6
0x0000003119401e50  0x000000311940b018  Yes (*)     /lib64/libgcc_s.so.1
0x000000311681d780  0x000000311690a3a8  Yes (*)     /lib64/libc.so.6
0x0000003116400a70  0x00000031164166ce  Yes (*)     /lib64/ld-linux-x86-64.so.2
0x00000031174051f0  0x0000003117410258  Yes (*)     /lib64/libpthread.so.0
0x0000003c792032a0  0x0000003c7920e2d8  Yes (*)     /lib64/libresolv.so.2
0x0000003119c009f0  0x0000003119c06918  Yes (*)     /lib64/libcrypt.so.1
0x00002b745eca47f0  0x00002b745ed14388  Yes (*)     /usr/lib64/qpid/daemon/cluster.so
0x0000003c78601430  0x0000003c78603538  Yes (*)     /usr/lib64/openais/libcpg.so.2
0x00002b745ef4f110  0x00002b745ef51b98  Yes (*)     /usr/lib64/libcman.so.2
0x00002b745f191350  0x00002b745f205428  Yes (*)     /usr/lib64/libqpidclient.so.3
0x00002b745f436830  0x00002b745f440608  Yes (*)     /usr/lib64/qpid/daemon/xml.so
0x00002b745f7c9070  0x00002b745f94a758  Yes (*)     /usr/lib64/libxerces-c.so.28
0x00002b745fddf090  0x00002b745ff6ab28  Yes (*)     /usr/lib64/libxqilla.so.3
0x00002b74602b0f20  0x00002b74602d1318  Yes (*)     /usr/lib64/qpid/daemon/acl.so
0x00002b74604e45f0  0x00002b74604ec838  Yes (*)     /usr/lib64/qpid/daemon/ssl.so
0x00002b746070f640  0x00002b7460726098  Yes (*)     /usr/lib64/libsslcommon.so.3
0x000000311b018310  0x000000311b0f7288  Yes (*)     /usr/lib64/libnss3.so
0x0000003117c08540  0x0000003117c2b868  Yes (*)     /usr/lib64/libssl3.so
0x00002b7460939eb0  0x00002b7460958568  Yes (*)     /usr/lib64/libnspr4.so
0x000000311a008500  0x000000311a012e18  Yes (*)     /usr/lib64/libnssutil3.so
0x000000311a801370  0x000000311a802968  Yes (*)     /usr/lib64/libplc4.so
0x000000311a400e40  0x000000311a401c78  Yes (*)     /usr/lib64/libplds4.so
0x0000003117801fd0  0x000000311780cac8  Yes (*)     /usr/lib64/libz.so.1
0x00002b7460b6d5e0  0x00002b7460b70b78  Yes (*)     /usr/lib64/qpid/daemon/watchdog.so
0x00002b7460d78680  0x00002b7460d7cc48  Yes (*)     /usr/lib64/qpid/daemon/replication_exchange.so
0x00002b7460fb7d70  0x00002b7461045528  Yes (*)     /usr/lib64/qpid/daemon/msgstore.so
0x00002b74612ab5d0  0x00002b7461364428  Yes (*)     /usr/lib64/libdb_cxx-4.3.so
0x00002b746158f510  0x00002b746158f6d1  Yes (*)     /usr/lib64/libaio.so.1
0x00002b7461796b70  0x00002b746179c898  Yes (*)     /usr/lib64/qpid/daemon/replicating_listener.so
(*): Shared library is missing debugging information.
(gdb)   4 Thread 0x424c6940 (LWP 27463)  0x000000311740b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 0x42ec7940 (LWP 27465)  0x000000311740b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 0x438c8940 (LWP 27467)  0x00000031168d44b8 in epoll_wait ()
   from /lib64/libc.so.6
* 1 Thread 0x2b745ec5efe0 (LWP 27457)  0x00000031168d44b8 in epoll_wait ()
   from /lib64/libc.so.6
(gdb)
Thread 4 (Thread 0x424c6940 (LWP 27463)):
#0  0x000000311740b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002b745e9c75ba in qpid::sys::Timer::run (this=0xffffffffffffff92)
    at ../include/qpid/sys/posix/Condition.h:69
#2  0x00002b745e8e91aa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x166a4014) at qpid/sys/posix/Thread.cpp:35
#3  0x000000311740673d in start_thread () from /lib64/libpthread.so.0
#4  0x00000031168d40cd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x42ec7940 (LWP 27465)):
#0  0x000000311740b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002b745e9c75ba in qpid::sys::Timer::run (this=0xffffffffffffff92)
    at ../include/qpid/sys/posix/Condition.h:69
#2  0x00002b745e8e91aa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x166d3494) at qpid/sys/posix/Thread.cpp:35
#3  0x000000311740673d in start_thread () from /lib64/libpthread.so.0
#4  0x00000031168d40cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x438c8940 (LWP 27467)):
#0  0x00000031168d44b8 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b745e8f24c1 in qpid::sys::Poller::wait (this=0x1669d4f0,
    timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x00002b745e8f2f57 in qpid::sys::Poller::run (this=0x1669d4f0)
    at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x00002b745e8e91aa in qpid::sys::(anonymous namespace)::runRunnable (p=0x8)
    at qpid/sys/posix/Thread.cpp:35
#4  0x000000311740673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000031168d40cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b745ec5efe0 (LWP 27457)):
#0  0x00000031168d44b8 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b745e8f24c1 in qpid::sys::Poller::wait (this=0x1669d4f0,
    timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x00002b745e8f2f57 in qpid::sys::Poller::run (this=0x1669d4f0)
    at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x00002b745e433462 in qpid::broker::Broker::run (
    this=<value optimized out>) at qpid/broker/Broker.cpp:344
#4  0x0000000000406acc in QpiddBroker::execute (this=<value optimized out>,
    options=0x1669c790) at posix/QpiddBroker.cpp:187
#5  0x000000000040559f in main (argc=9, argv=0x7fff34af8c78) at qpidd.cpp:80
(gdb) Detaching from program: /usr/sbin/qpidd, process 27457
(gdb) quit



/etc/qpidd.conf:
cluster-mechanism=ANONYMOUS
cluster-name=freznice-virt-cluster
#log-enable=info+
log-enable=trace+
log-to-file=/tmp/qpidd.log
mgmt-pub-interval=1

Comment 1 Frantisek Reznicek 2011-01-24 16:45:54 UTC
Created attachment 474987 [details]
The issue reproducer and logs

This is the script which reproduces the problem.

Dependencies:
  openais configured and running.
  java-openjdk 1.6 installed main + devel
  qpid-java-* pkgs installed
  python-qpid installed
How to run:
  ./bz667428 3

Then do:
ps auxw | grep qpidd
netstat -nlp | grep qpidd
qpid-config
qpid-cluster

and you should see something like this:

[root@dhcp-26-233 bz667428]# ps auxw | grep qpidd
root     27457  0.4  4.2 207016 16128 pts/2    Sl   17:22   0:00 qpidd --auth=no --truncate yes --cluster-name=bz667428_at_dhcp-26-233.brq.redhat.com --log-enable=debug+ -p 10000 --data-dir=./dd10000
root     27673  0.0  0.1  61228   748 pts/2    R+   17:24   0:00 grep qpidd
[root@dhcp-26-233 bz667428]# netstat -nlp | grep qpidd
tcp        0      0 0.0.0.0:10000               0.0.0.0:*                   LISTEN      27457/qpidd
[root@dhcp-26-233 bz667428]# qpid-config
Failed: error: (111, 'Connection refused')
[root@dhcp-26-233 bz667428]# qpid-cluster
Failed: error - (111, 'Connection refused')

Comment 2 Alan Conway 2011-01-24 20:52:34 UTC
In comment 1, you can see from the ps and output that the broker that is still running is listening on port 10000. There is no broker listening on the default port 5672. So it is normal to get "connection refused" when running qpid-config or qpid-cluster with no arguments, as they will try to connect to 5672. However these work fine for me:

qpid-config -a localhost:10000
qpid-cluster -a localhost:10000

I repeated about 15 times and didn't see a non-responsiveness problem. Am I missing something?

Comment 3 Frantisek Reznicek 2011-01-25 07:51:44 UTC
I apologize, this is indeed correct and I concur.

The defect has no content and I'm going to close it as NOT-A-BUG.


Note You need to log in before you can comment on or make changes to this bug.