Description of problem:
Core dump when run as described below.

Version-Release number of selected component (if applicable):
revision 732768

How reproducible:
easy

Steps to Reproduce:
In an svn build, cd cpp/src/tests and:

[aconway@mrg10 tests]$ pkill qpidd ; ../qpidd --cluster-name foobar --cluster-read-max 1 --load-module ../.libs/cluster.so&
[aconway@mrg10 tests]$ while ./perftest --count 10 --npubs 10 --nsubs 10 -s ; do true ; done

Usually happens in one or two iterations.

Actual results:
[1]+  Aborted  (core dumped) ../qpidd --cluster-name foobar --cluster-read-max 1 --load-module ../.libs/cluster.so

Info from core:

(gdb) where
#0  0x0000003994c30155 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003994c31bf0 in *__GI_abort () at abort.c:88
#2  0x00000039a46becc4 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3  0x00000039a46bce36 in std::set_unexpected () from /usr/lib64/libstdc++.so.6
#4  0x00000039a46bce63 in std::terminate () from /usr/lib64/libstdc++.so.6
#5  0x00000039a46bcf4a in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00002b02b225e217 in ~ScopedLock (this=<value optimized out>) at qpid/sys/posix/Mutex.h:120
#7  0x00002b02b225c67d in qpid::sys::DispatchHandle::processEvent (this=0x2aaaac0c6e98, type=qpid::sys::Poller::WRITABLE) at qpid/sys/DispatchHandle.cpp:353
#8  0x00002b02b2217943 in qpid::sys::Poller::run (this=0x6437b80) at ./qpid/sys/Poller.h:121
#9  0x00002b02b220e7da in qpid::sys::(anonymous namespace)::runRunnable () at /usr/include/boost/any.hpp:46
#10 0x00000039958062f7 in start_thread (arg=<value optimized out>) at pthread_create.c:296
#11 0x0000003994cd1b6d in clone () from /lib64/libc.so.6
(gdb) f 7
#7  0x00002b02b225c67d in qpid::sys::DispatchHandle::processEvent (this=0x2aaaac0c6e98, type=qpid::sys::Poller::WRITABLE) at qpid/sys/DispatchHandle.cpp:353
Current language: auto; currently c++
(gdb) print type
$1 = qpid::sys::Poller::WRITABLE
(gdb) print state
$2 = qpid::sys::DispatchHandle::DELAYED_DELETE
(gdb)

Expected results:
Test passes.

Additional info:
The key factors in reproducing this are a low value of --cluster-read-max (1 above) and a large number of connections, given by high nsub/npub parameters. You still get the crash with --cluster-read-max 2; if you push it up to 10, things run reliably. Also, if you remove nsub/npub so there are just 2 connections, we don't see it.

It looks like this is caused by the cluster calling giveReadCredit concurrently with the IO layer closing the connection. There is locking in cluster/OutputInterceptor.cpp intended to ensure that the real output handler is never called after we've received a closed() call from the network layer, but it appears there's still a race.
Life cycle fix for pollableQueue committed in revision 736398. Not the only issue here.
The underlying issue here turned out to be a subtle problem in the lifecycle behaviour of the DispatchHandle. Essentially, it was unsafe to delete it from outside an IO callback state. This situation pretty much never arose without the cluster code.
I can confirm the issue is not fixed, because I can see it on RHEL 5.2 x86_64 running the script below on packages:

[root@dhcp-lab-200 bz479326]# rpm -qa | egrep '(qpidd|openais)'
openais-devel-0.80.3-22.el5_3.3
qpidd-devel-0.5.752581-1.el5
openais-0.80.3-22.el5_3.3
qpidd-0.5.752581-1.el5
qpidd-rdma-0.5.752581-1.el5
qpidd-ssl-0.5.752581-1.el5
qpidd-acl-0.5.752581-1.el5
qpidd-xml-0.5.752581-1.el5
qpidd-cluster-0.5.752581-1.el5

#!/bin/bash
service openais restart
ulimit -c unlimited
cluster_so=$(rpm -ql qpidd-cluster 2>/dev/null | head -1)
qpidd --auth no --log-enable debug+ --cluster-name foobar --cluster-read-max 1 \
  --no-module-dir --no-data-dir --load-module ${cluster_so} >qpidd.log 2>&1 &
netstat -nlp | grep qpidd
sleep 5
for ((i=0;i<10;i++)); do
  perftest --count 10 --npubs 10 --nsubs 10 -s >>perftest.log
  echo -n "$?"
  pid=$(netstat -nlp | grep qpidd | awk '{print $NF}' | awk -F/ '{print $(NF-1)}')
  if [ -z "${pid}" ]; then
    break
  fi
done
echo
pid=$(netstat -nlp | grep qpidd | awk '{print $NF}' | awk -F/ '{print $(NF-1)}')
if [ -n "${pid}" ]; then
  kill %1
  sleep 5
fi
pid=$(netstat -nlp | grep qpidd | awk '{print $NF}' | awk -F/ '{print $(NF-1)}')
if [ -n "${pid}" ]; then
  echo -n "killing ..."
  kill ${pid}
fi
echo "exiting"
cat perftest.log
# eof

which results in a core dump (backtrace included):

[root@dhcp-lab-200 bz479326]# ./run.sh
Stopping OpenAIS daemon (aisexec): [ OK ]
Starting OpenAIS daemon (aisexec): [ OK ]
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
Error in shutdown: Connection closed
2009-mar-16 15:32:36 warning Connection closed
2009-mar-16 15:32:36 warning Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
Error in shutdown: Connection closed
./run.sh: line 30:  2823 Aborted  (core dumped) qpidd --auth no --log-enable debug+ --cluster-name foobar --cluster-read-max 1 --no-module-dir --no-data-dir --load-module ${cluster_so} > qpidd.log 2>&1
0
exiting
2408.44 28.9542 2270.51 2.21729
3320.2 21.9707 1665.36 1.62633
Connection closed
Connection refused: localhost:5672 (qpid/sys/posix/Socket.cpp:162)
Connection refused: localhost:5672 (qpid/sys/posix/Socket.cpp:162)
Connection refused: localhost:5672 (qpid/sys/posix/Socket.cpp:162)
Connection refused: localhost:5672 (qpid/sys/posix/Socket.cpp:162)
Connection refused: localhost:5672 (qpid/sys/posix/Socket.cpp:162)
Connection refused: localhost:5672 (qpid/sys/posix/Socket.cpp:162)
Connection refused: localhost:5672 (qpid/sys/posix/Socket.cpp:162)
1256.62 37.4777 2133.65 2.08364
Connection refused: localhost:5672 (qpid/sys/posix/Socket.cpp:162)
1143.96 35.388 2994.51 2.92432

[root@dhcp-lab-200 bz479326]# rpm -q qpidd
qpidd-0.5.752581-1.el5
[root@dhcp-lab-200 bz479326]# gdb `which qpidd` core.2 core.2674 core.2823
[root@dhcp-lab-200 bz479326]# gdb `which qpidd` core.2674
GNU gdb Red Hat Linux (6.5-37.el5_2.2rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".
...
Core was generated by `qpidd --auth no --log-enable debug+ --cluster-name foobar --cluster-read-max 1'.
Program terminated with signal 6, Aborted.
#0  0x000000350b030155 in raise () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 12 (process 2674):
#0  0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003549972e8d in qpid::sys::Poller::wait () from /usr/lib64/libqpidcommon.so.0
#2  0x0000003549973c67 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#3  0x0000003549eccb86 in qpid::broker::Broker::run () from /usr/lib64/libqpidbroker.so.0
#4  0x0000000000406948 in qpid::log::Options::~Options ()
#5  0x0000000000405438 in __cxa_pure_virtual ()
#6  0x000000350b01d8b4 in __libc_start_main () from /lib64/libc.so.6
#7  0x0000000000404eb9 in __cxa_pure_virtual ()
#8  0x00007fff6f2b4ad8 in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 11 (process 2677):
#0  0x000000350bc0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003549f88e6f in qpid::broker::Timer::run () from /usr/lib64/libqpidbroker.so.0
#2  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#3  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#4  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 10 (process 2678):
#0  0x000000350bc0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003549f88e6f in qpid::broker::Timer::run () from /usr/lib64/libqpidbroker.so.0
#2  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#3  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#4  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 9 (process 2679):
#0  0x000000350bc0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003549f88e6f in qpid::broker::Timer::run () from /usr/lib64/libqpidbroker.so.0
#2  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#3  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#4  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 8 (process 2681):
#0  0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003549972e8d in qpid::sys::Poller::wait () from /usr/lib64/libqpidcommon.so.0
#2  0x0000003549973c67 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#3  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#4  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 7 (process 2682):
#0  0x000000350bc0c999 in __lll_mutex_unlock_wake () from /lib64/libpthread.so.0
#1  0x000000350bc09a59 in _L_mutex_unlock_59 () from /lib64/libpthread.so.0
#2  0x000000350bc0971b in __pthread_mutex_unlock_usercnt () from /lib64/libpthread.so.0
#3  0x000000350b105da2 in dl_iterate_phdr () from /lib64/libc.so.6
#4  0x000000350fc0a626 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#5  0x000000350fc075b5 in _Unwind_GetIPInfo () from /lib64/libgcc_s.so.1
#6  0x000000350fc08ee9 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#7  0x000000350fc09071 in _Unwind_Resume_or_Rethrow () from /lib64/libgcc_s.so.1
#8  0x00000035108bced8 in __cxa_rethrow () from /usr/lib64/libstdc++.so.6
#9  0x00000035108bec5e in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#10 0x00000035108bce36 in std::set_unexpected () from /usr/lib64/libstdc++.so.6
#11 0x00000035108bce63 in std::terminate () from /usr/lib64/libstdc++.so.6
#12 0x00000035108bcf4a in __cxa_throw () from /usr/lib64/libstdc++.so.6
#13 0x00000035499c0eee in qpid::sys::ScopedLock<qpid::sys::Mutex>::~ScopedLock () from /usr/lib64/libqpidcommon.so.0
#14 0x00000035499c07b6 in qpid::sys::DispatchHandle::processEvent () from /usr/lib64/libqpidcommon.so.0
#15 0x0000003549973c93 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#16 0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#17 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#18 0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 6 (process 2683):
#0  0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003549972e8d in qpid::sys::Poller::wait () from /usr/lib64/libqpidcommon.so.0
#2  0x0000003549973c67 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#3  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#4  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 5 (process 2684):
#0  0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003549972e8d in qpid::sys::Poller::wait () from /usr/lib64/libqpidcommon.so.0
#2  0x0000003549973c67 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#3  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#4  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 4 (process 2685):
#0  0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003549972e8d in qpid::sys::Poller::wait () from /usr/lib64/libqpidcommon.so.0
#2  0x0000003549973c67 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#3  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#4  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 3 (process 2686):
#0  0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003549972e8d in qpid::sys::Poller::wait () from /usr/lib64/libqpidcommon.so.0
#2  0x0000003549973c67 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#3  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#4  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 2 (process 2688):
#0  0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003549972e8d in qpid::sys::Poller::wait () from /usr/lib64/libqpidcommon.so.0
#2  0x0000003549973c67 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#3  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#4  0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x000000350b0d1b6d in clone () from /lib64/libc.so.6

Thread 1 (process 2687):
#0  0x000000350b030155 in raise () from /lib64/libc.so.6
#1  0x000000350b031bf0 in abort () from /lib64/libc.so.6
#2  0x00000035108bec9f in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3  0x00000035108bce36 in std::set_unexpected () from /usr/lib64/libstdc++.so.6
#4  0x00000035108bce63 in std::terminate () from /usr/lib64/libstdc++.so.6
#5  0x00000035108bcf4a in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00000035499c0eee in qpid::sys::ScopedLock<qpid::sys::Mutex>::~ScopedLock () from /usr/lib64/libqpidcommon.so.0
#7  0x00000035499c07b6 in qpid::sys::DispatchHandle::processEvent () from /usr/lib64/libqpidcommon.so.0
#8  0x0000003549973c93 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#9  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#10 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#11 0x000000350b0d1b6d in clone () from /lib64/libc.so.6

[root@dhcp-lab-200 bz479326]# gdb `which qpidd` core.2674
GNU gdb Red Hat Linux (6.5-37.el5_2.2rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /usr/lib64/libqpidbroker.so.0...(no debugging symbols found)...done.
...
found)...done.
Loaded symbols for /usr/lib64/libibverbs.so.1
Core was generated by `qpidd --auth no --log-enable debug+ --cluster-name foobar --cluster-read-max 1'.
Program terminated with signal 6, Aborted.
#0  0x000000350b030155 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x000000350b030155 in raise () from /lib64/libc.so.6
#1  0x000000350b031bf0 in abort () from /lib64/libc.so.6
#2  0x00000035108bec9f in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3  0x00000035108bce36 in std::set_unexpected () from /usr/lib64/libstdc++.so.6
#4  0x00000035108bce63 in std::terminate () from /usr/lib64/libstdc++.so.6
#5  0x00000035108bcf4a in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00000035499c0eee in qpid::sys::ScopedLock<qpid::sys::Mutex>::~ScopedLock () from /usr/lib64/libqpidcommon.so.0
#7  0x00000035499c07b6 in qpid::sys::DispatchHandle::processEvent () from /usr/lib64/libqpidcommon.so.0
#8  0x0000003549973c93 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#9  0x000000354996ac4a in qpid::sys::AbsTime::AbsTime () from /usr/lib64/libqpidcommon.so.0
#10 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0
#11 0x000000350b0d1b6d in clone () from /lib64/libc.so.6
The issue from comment #4 is addressed as separate bug 490457. Bug 490457 blocks this bug.
The issue has been fixed; tested on RHEL 5.5 i386 / x86_64 (as it involves --cluster-read-max) on packages:

python-qmf-0.7.946106-11.el5
python-qpid-0.7.946106-13.el5
qmf-*0.7.946106-12.el5
qpid-cpp-*0.7.946106-12.el5
qpid-java-*-0.7.946106-7.el5
qpid-tools-0.7.946106-9.el5
sesame-0.7.3918-6.el5

Reproduction (qpid*-0.4.732838-1.el5 vs qpid-cpp-*0.7.946106-12.el5):

# message store has to not be loaded (i.e. --no-data-dir)
qpidd --cluster-name foo --cluster-read-max 1 --no-data-dir --auth no &> /dev/null &
while perftest --count 10 --npubs 10 --nsubs 10 -s ; do true ; done

-> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Due to incorrect implementation of the DispatchHandle, a race condition may have occurred, causing a broker in a cluster to terminate unexpectedly with a core dump. With this update, the underlying code has been modified to prevent such event, and a clustered broker no longer crashes.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Due to incorrect implementation of the DispatchHandle, a race condition may have occurred, causing a broker in a cluster to terminate unexpectedly with a core dump. With this update, the underlying code has been modified to prevent such event, and a clustered broker no longer crashes.
+An error in the 'DispatchHandle' function may have caused a race condition to occur occur, causing a broker in a cluster to terminate unexpectedly with a core dump. With this update, the underlying code has been modified to prevent such event, and a clustered broker no longer crashes.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-An error in the 'DispatchHandle' function may have caused a race condition to occur occur, causing a broker in a cluster to terminate unexpectedly with a core dump. With this update, the underlying code has been modified to prevent such event, and a clustered broker no longer crashes.
+An error in the 'DispatchHandle' class implementation may have caused a race condition to occur occur, causing a broker in a cluster to terminate unexpectedly with a core dump. With this update, the underlying code has been modified to prevent such event, and a clustered broker no longer crashes.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-An error in the 'DispatchHandle' class implementation may have caused a race condition to occur occur, causing a broker in a cluster to terminate unexpectedly with a core dump. With this update, the underlying code has been modified to prevent such event, and a clustered broker no longer crashes.
+An error in the 'DispatchHandle' class implementation may have caused a race condition to occur, causing a broker in a cluster to terminate unexpectedly with a core dump. With this update, the underlying code has been modified to prevent such event, and a clustered broker no longer crashes.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html