Hide Forgot
Description of problem: qpid high level API client (modified qpid-send, see it attached) very rarely terminates during reconnection / failover when multiple sender clients are reconnecting / failing over at the same moment. There was detected single case when one of the qpid-send clients exited during reconnection following way: ERROR:10.34.37.124:'~/mrg_cluster_tests/qpid-send -b "10.34.37.124:5672" --failover-updates --content-string "msg_%06d" --content-inde x-offset 0 --connection-options="{ reconnect: True, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address "te st_cluster_failover_NtoN_rs_mq_0_0; {create: sender, delete:receiver, node:{ durable: False}}" --durable no -m 58609', result:?, dur.:1.90 [11:07:25] DEBUG:stdout: [11:07:25] DEBUG:stderr: 2012-02-18 05:07:16 warning Connection [10.34.37.124:36179-10.34.37.124:5672] closed 2012-02-18 05:07:16 warning Connection [10.34.37.124:36183-10.34.37.124:5672] closed pure virtual method called terminate called without an active exception The 'pure virtual method called' highlights that client has tried to call virtual function/method from a constructor or destructor which is fatal condition. I believe that probability of above described case is high only during concurrent reconnect of many clients. No interesting entries in broker logs. Version-Release number of selected component (if applicable): cman-2.0.115-96.el5 cman-devel-2.0.115-96.el5 cman-devel-2.0.115-96.el5 openais-0.80.6-36.el5 openais-debuginfo-0.80.6-36.el5 openais-devel-0.80.6-36.el5 python-qpid-0.14-1.el5 python-qpid-qmf-0.14-2.el5 python-saslwrapper-0.10-4.el5 qpid-cpp-client-0.14-5.el5 qpid-cpp-client-devel-0.14-5.el5 qpid-cpp-client-devel-docs-0.14-5.el5 qpid-cpp-client-rdma-0.14-5.el5 qpid-cpp-client-ssl-0.14-5.el5 qpid-cpp-mrg-debuginfo-0.14-5.el5 qpid-cpp-server-0.14-5.el5 qpid-cpp-server-cluster-0.14-5.el5 qpid-cpp-server-devel-0.14-5.el5 qpid-cpp-server-rdma-0.14-5.el5 qpid-cpp-server-ssl-0.14-5.el5 qpid-cpp-server-store-0.14-5.el5 qpid-cpp-server-xml-0.14-5.el5 qpid-java-client-0.14-1.el5 qpid-java-common-0.14-1.el5 qpid-java-example-0.14-1.el5 qpid-qmf-0.14-2.el5 qpid-qmf-debuginfo-0.14-2.el5 qpid-qmf-devel-0.14-2.el5 qpid-tests-0.14-1.el5 qpid-tools-0.14-1.el5 rgmanager-2.0.52-28.el5 rh-qpid-cpp-tests-0.14-5.el5 ruby-qpid-qmf-0.14-2.el5 ruby-saslwrapper-0.10-4.el5 saslwrapper-0.10-4.el5 saslwrapper-debuginfo-0.10-4.el5 saslwrapper-devel-0.10-4.el5 sesame-1.0-2.el5 sesame-debuginfo-1.0-2.el5 How reproducible: < 1 % Steps to Reproduce: 1. following cluster_test ctests.py run on 4 node RHEL 5.7 x86_64 cluster 2. Error found in test_cluster_failover_NtoN_rs Actual results: Sending client stopped reconnect due to an unknow error Expected results: Sending client should always reconnect. Additional info: Test log: [11:07:25] DEBUG:test_cluster_failover_NtoN_rs(): all (4) sender[s] finished (0.0 sec[s]) [11:07:25] ERROR:10.34.37.124:'~/mrg_cluster_tests/qpid-send -b "10.34.37.124:5672" --failover-updates --content-string "msg_%06d" --content-inde x-offset 0 --connection-options="{ reconnect: True, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address "te st_cluster_failover_NtoN_rs_mq_0_0; {create: sender, delete:receiver, node:{ durable: False}}" --durable no -m 58609', result:-1, dur.:1.90 [11:07:25] DEBUG:stdout: [11:07:25] DEBUG:stderr: 2012-02-18 05:07:16 warning Connection [10.34.37.124:36179-10.34.37.124:5672] closed 2012-02-18 05:07:16 warning Connection [10.34.37.124:36183-10.34.37.124:5672] closed pure virtual method called terminate called without an active exception [11:07:25] WARNING:10.34.37.124:'~/mrg_cluster_tests/qpid-send -b "10.34.37.124:5672" --failover-updates --content-string "msg_%06d" --content-in dex-offset 0 --connection-options="{ reconnect: True, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address " test_cluster_failover_NtoN_rs_mq_0_1; {create: sender, delete:receiver, node:{ durable: False}}" --durable no -m 58609', result:0, dur.:4.81 [11:07:25] DEBUG:stderr: 2012-02-18 05:07:16 warning Connection [10.34.37.124:36181-10.34.37.124:5672] closed 2012-02-18 05:07:16 warning Connection [10.34.37.124:36186-10.34.37.124:5672] closed 2012-02-18 05:07:16 warning Connect failed: Connection refused 2012-02-18 05:07:16 warning Connection closed 2012-02-18 05:07:16 warning Connect failed: Connection refused 2012-02-18 05:07:16 warning Connection closed [11:07:25] WARNING:10.34.37.124:'~/mrg_cluster_tests/qpid-send -b "10.34.37.124:5672" --failover-updates --content-string "msg_%06d" --content-in dex-offset 0 --connection-options="{ reconnect: True, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address " test_cluster_failover_NtoN_rs_mq_0_2; {create: sender, delete:receiver, node:{ durable: False}}" --durable no -m 58609', result:0, dur.:5.71 [11:07:25] DEBUG:stderr: 2012-02-18 05:07:16 warning Connection [10.34.37.124:36182-10.34.37.124:5672] closed 2012-02-18 05:07:16 warning Connection [10.34.37.124:36184-10.34.37.124:5672] closed 2012-02-18 05:07:16 warning Connect failed: Connection refused 2012-02-18 05:07:16 warning Connection closed 2012-02-18 05:07:16 warning Connect failed: Connection refused 2012-02-18 05:07:16 warning Connection closed [11:07:25] WARNING:10.34.37.124:'~/mrg_cluster_tests/qpid-send -b "10.34.37.124:5672" --failover-updates --content-string "msg_%06d" --content-in dex-offset 0 --connection-options="{ reconnect: True, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address " test_cluster_failover_NtoN_rs_mq_0_3; {create: sender, delete:receiver, node:{ durable: False}}" --durable no -m 58609', result:0, dur.:4.70 [11:07:25] DEBUG:stderr: 2012-02-18 05:07:16 warning Connection [10.34.37.124:36180-10.34.37.124:5672] closed 2012-02-18 05:07:16 warning Connection [10.34.37.124:36185-10.34.37.124:5672] closed 2012-02-18 05:07:16 warning Connect failed: Connection refused 2012-02-18 05:07:16 warning Connection closed 2012-02-18 05:07:16 warning Connect failed: Connection refused 2012-02-18 05:07:16 warning Connection closed [11:07:25] DEBUG:QMF obj. query corresponding to queues test_cluster_failover_NtoN_rs_mq_0_%d started [11:10:46] DEBUG:test_cluster_failover_NtoN_rs(): waiting for queue QMF data timeouted (to:200) [11:10:46] DEBUG:QMF object corresponding to queue test_cluster_failover_NtoN_rs_mq_0_0: org.apache.qpid.broker:queue[0-32-1-0-org.apache.qpid.br oker:queue:test_cluster_failover_NtoN_rs_mq_0_0] org.apache.qpid.broker:queue:test_cluster_failover_NtoN_rs_mq_0_0 [11:10:46] ERROR:localhost:'Queue test_cluster_failover_NtoN_rs_mq_0_0 has msg-depth 23338 expected >= 58609', result:False, dur.:-1.00 [11:10:47] DEBUG:QMF object corresponding to queue test_cluster_failover_NtoN_rs_mq_0_1: org.apache.qpid.broker:queue[0-32-1-0-org.apache.qpid.br oker:queue:test_cluster_failover_NtoN_rs_mq_0_1] org.apache.qpid.broker:queue:test_cluster_failover_NtoN_rs_mq_0_1 [11:10:47] INFO:localhost:'Queue test_cluster_failover_NtoN_rs_mq_0_1 has msg-depth 58609 expected >= 58609', result:True, dur.:-1.00 [11:10:48] DEBUG:QMF object corresponding to queue test_cluster_failover_NtoN_rs_mq_0_2: org.apache.qpid.broker:queue[0-32-1-0-org.apache.qpid.br oker:queue:test_cluster_failover_NtoN_rs_mq_0_2] org.apache.qpid.broker:queue:test_cluster_failover_NtoN_rs_mq_0_2 [11:10:48] INFO:localhost:'Queue test_cluster_failover_NtoN_rs_mq_0_2 has msg-depth 58783 expected >= 58609', result:True, dur.:-1.00 [11:10:49] DEBUG:QMF object corresponding to queue test_cluster_failover_NtoN_rs_mq_0_3: org.apache.qpid.broker:queue[0-32-1-0-org.apache.qpid.br oker:queue:test_cluster_failover_NtoN_rs_mq_0_3] org.apache.qpid.broker:queue:test_cluster_failover_NtoN_rs_mq_0_3 [11:10:49] INFO:localhost:'Queue test_cluster_failover_NtoN_rs_mq_0_3 has msg-depth 58609 expected >= 58609', result:True, dur.:-1.00 [11:10:53] DEBUG:Cluster state: 4 of 4 up and running
Similar case detected also on RHEL 6.2 x86_64 4 node cluster where qpid-receive aborted in similar single sender / receiver reconnect case (test_cluster_failover_1to1_rs): [12:52:55] INFO:localhost:'Cluster is fully up (4 of 4 up)', result:True, dur.:-1.00 [12:53:05] DEBUG:test_cluster_failover_1to1_rs(): all (1) receiver[s] finished (0.0 sec[s]) [12:53:05] ERROR:10.34.37.181:'~/mrg_cluster_tests/qpid-receive -b "10.34.37.181:5672" --failover-updates --connection-options="{ reconnect: Tru e, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address "test_cluster_failover_1to1_rs_mq_0_0 ; {create: sen der, delete:receiver, node:{ durable: False }}" > /tmp/qpid-receive.0.log', result:134, dur.:1.92 [12:53:05] DEBUG:stdout: [12:53:05] DEBUG:stderr: 2012-02-18 06:52:57 warning Connection [10.34.37.181:34773-10.34.37.181:5672] closed 2012-02-18 06:52:57 warning Connection [10.34.37.181:34774-10.34.37.181:5672] closed pure virtual method called terminate called without an active exception bash: line 1: 21789 Aborted (core dumped) ~/mrg_cluster_tests/qpid-receive -b "10.34.37.181:5672" --failover-updates --connecti on-options="{ reconnect: True, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address "test_cluster_failover_1 to1_rs_mq_0_0 ; {create: sender, delete:receiver, node:{ durable: False }}" > /tmp/qpid-receive.0.log [12:53:06] INFO:10.34.37.178:'pidof qpid-send && kill `pidof qpid-send`', result:1, dur.:0.17 [12:53:06] INFO:10.34.37.181:'pidof qpid-receive && kill `pidof qpid-receive`', result:1, dur.:0.13 [12:53:07] INFO:10.34.37.181:'sort -u /tmp/qpid-receive.0.log|egrep "^msg_[0-9]{6}$"|wc -l', result:0, dur.:0.23 [12:53:07] DEBUG:stdout: 36777 [12:53:07] ERROR:localhost:'Receiver 0 data check (36777 exp. 98589)', result:False, dur.:-1.00 Although above listing shows that core file should be available, it is unfortunately not the case. Using packages: corosync-1.4.1-4.el6.x86_64 corosynclib-1.4.1-4.el6.x86_64 python-qpid-0.14-3.el6.noarch python-qpid-qmf-0.14-4.el6.x86_64 python-saslwrapper-0.10-2.el6.x86_64 qpid-cpp-client-0.14-6.el6.x86_64 qpid-cpp-client-devel-0.14-6.el6.x86_64 qpid-cpp-client-devel-docs-0.14-6.el6.noarch qpid-cpp-client-rdma-0.14-6.el6.x86_64 qpid-cpp-client-ssl-0.14-6.el6.x86_64 qpid-cpp-debuginfo-0.14-6.el6.x86_64 qpid-cpp-server-0.14-6.el6.x86_64 qpid-cpp-server-cluster-0.14-6.el6.x86_64 qpid-cpp-server-devel-0.14-6.el6.x86_64 qpid-cpp-server-rdma-0.14-6.el6.x86_64 qpid-cpp-server-ssl-0.14-6.el6.x86_64 qpid-cpp-server-store-0.14-6.el6.x86_64 qpid-cpp-server-xml-0.14-6.el6.x86_64 qpid-java-client-0.14-1.el6.noarch qpid-java-common-0.14-1.el6.noarch qpid-java-example-0.14-1.el6.noarch qpid-qmf-0.14-4.el6.x86_64 qpid-qmf-debuginfo-0.14-4.el6.x86_64 qpid-qmf-devel-0.14-4.el6.x86_64 qpid-tests-0.14-1.el6.noarch qpid-tools-0.14-1.el6.noarch rh-qpid-cpp-tests-0.14-6.el6.x86_64 ruby-qpid-qmf-0.14-4.el6.x86_64 ruby-saslwrapper-0.10-2.el6.x86_64 saslwrapper-0.10-2.el6.x86_64 saslwrapper-debuginfo-0.10-2.el6.x86_64 saslwrapper-devel-0.10-2.el6.x86_64 sesame-1.0-2.el6.x86_64 sesame-debuginfo-1.0-2.el6.x86_64 Raising reproducibility to 10% (triggered after 19 hours of testing) I think one of the latest patches should cause this behavior as I did not see this on previous rpmset.
Gordon, please do a brief assessment.
Do we have a backtrace from any of the previous failures? That would help pinpoint the issue.
(In reply to comment #4) > Do we have a backtrace from any of the previous failures? That would help > pinpoint the issue. Not yet, I'm re-running the test now to be able to provide that info.
Requested data available from cluster_test (rhel5.7 cluster, 4 nodes, approx. 4 hours to trigger) Test log: ... [14:14:21] ERROR:10.34.37.124:'~/mrg_cluster_tests/qpid-send -b "10.34.37.124:5672" --failover-updates --content-string "msg_%06d" --content-inde x-offset 0 --connection-options="{ reconnect: True, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address "te st_cluster_failover_NtoN_rs_mq_0_3; {create: sender, delete:receiver, node:{ durable: False}}" --durable no -m 90072', result:-1, dur.:1.09 [14:14:21] DEBUG:stdout: [14:14:21] DEBUG:stderr: 2012-02-22 08:14:10 warning Connection [10.34.37.124:54236-10.34.37.124:5672] closed 2012-02-22 08:14:10 warning Connection [10.34.37.124:54237-10.34.37.124:5672] closed pure virtual method called terminate called without an active exception Abort analysis: [root@dhcp-37-124 mrg_cluster_tests]# ls -lat /root/core.* -rw------- 1 root root 45199360 Feb 22 08:14 /root/core.17786 [root@dhcp-37-124 mrg_cluster_tests]# gdb ./qpid-send /root/core.17786 GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-42.el5) ... Core was generated by `/root/mrg_cluster_tests/qpid-send -b 10.34.37.124:5672 --failover-updates --con'. Program terminated with signal 6, Aborted. #0 0x0000003a28030265 in raise () from /lib64/libc.so.6 (gdb) info threads 5 Thread 0x2b2ef8330660 (LWP 17786) 0x0000003a2880af59 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 4 Thread 17800 0x0000003a28079b70 in strlen () from /lib64/libc.so.6 3 Thread 17801 0x0000003a2880af59 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 2 Thread 17831 0x0000003a280d4971 in clone () from /lib64/libc.so.6 * 1 Thread 0x41219940 (LWP 17827) 0x0000003a28030265 in raise () from /lib64/libc.so.6 (gdb) thread apply all bt Thread 5 (Thread 0x2b2ef8330660 (LWP 17786)): #0 0x0000003a2880af59 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000003a2bc9dee9 in wait (this=0x2aaaac016948, desired=Traceback (most recent call last): File "/usr/share/gdb/python/libstdcxx/v6/printers.py", line 322, in to_string return '%s with %d elements' % (self.typename, OverflowError: __len__() should return 0 <= outcome < 2**31 Cannot access memory at address 0x1638 ) at ../include/qpid/sys/posix/Condition.h:63 #2 wait (this=0x2aaaac016948, desired=Traceback (most recent call last): File "/usr/share/gdb/python/libstdcxx/v6/printers.py", line 322, in to_string return '%s with %d elements' % (self.typename, OverflowError: __len__() should return 0 <= outcome < 2**31 Cannot access memory at address 0x1638 ) at ../include/qpid/sys/Monitor.h:41 #3 qpid::client::StateManager::waitFor (this=0x2aaaac016948, desired=Traceback (most recent call last): File "/usr/share/gdb/python/libstdcxx/v6/printers.py", line 322, in to_string return '%s with %d elements' % (self.typename, OverflowError: __len__() should return 0 <= outcome < 2**31 Cannot access memory at address 0x1638 ) at qpid/client/StateManager.cpp:51 #4 0x0000003a2bc584c9 in qpid::client::ConnectionHandler::waitForOpen (this=0x2aaaac0168d8) at qpid/client/ConnectionHandler.cpp:145 #5 0x0000003a2bc66a38 in qpid::client::ConnectionImpl::open (this=0x2aaaac0167f0) at qpid/client/ConnectionImpl.cpp:284 #6 0x0000003a2bc5639b in qpid::client::Connection::open (this=0x12ce7ae0, settings=...) at qpid/client/Connection.cpp:125 #7 0x0000003a2bc56908 in qpid::client::Connection::open (this=0x12ce7ae0, url=..., settings=...) at qpid/client/Connection.cpp:81 #8 0x00002b2ef80e8b19 in qpid::client::amqp0_10::ConnectionImpl::tryConnect (this=0x12ce7a20) at qpid/client/amqp0_10/ConnectionImpl.cpp:280 #9 0x00002b2ef80e94df in qpid::client::amqp0_10::ConnectionImpl::connect (this=0x12ce7a20, started=...) at qpid/client/amqp0_10/ConnectionImpl.cpp:250 #10 0x00002b2ef80e98ef in qpid::client::amqp0_10::ConnectionImpl::open (this=0x12ce7a20) at qpid/client/amqp0_10/ConnectionImpl.cpp:224 #11 0x00002b2ef80e482c in qpid::client::amqp0_10::ConnectionImpl::reopen (this=0x2aaaac01697c) at qpid/client/amqp0_10/ConnectionImpl.cpp:235 #12 0x00002b2ef810c14e in qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::SenderImpl::Send> (this=0x12d360c0, f=<value optimized out>) at qpid/client/amqp0_10/SessionImpl.h:102 #13 0x00002b2ef8109d21 in qpid::client::amqp0_10::SenderImpl::send (this=0x12d378f0, message=<value optimized out>, sync=false) at qpid/client/amqp0_10/SenderImpl.cpp:44 #14 0x000000000040fe0a in main () Thread 4 (Thread 17800): #0 0x0000003a28079b70 in strlen () from /lib64/libc.so.6 #1 0x0000003a2a89c500 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /usr/lib64/libstdc++.so.6 #2 0x0000003a2992f3f5 in qpid::sys::strError (err=111) at qpid/sys/posix/StrError.cpp:33 #3 0x0000003a29923384 in qpid::sys::posix::AsynchConnector::connComplete (this=0x2aaaac017460, h=...) at qpid/sys/posix/AsynchIO.cpp:217 #4 0x0000003a29a0b13a in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() ( this=0x3a28120eae, a0=...) at /usr/include/boost/function/function_template.hpp:576 #5 0x0000003a29a0a438 in qpid::sys::DispatchHandle::processEvent (this=0x2aaaac017468, type=DISCONNECTED) at qpid/sys/DispatchHandle.cpp:291 #6 0x0000003a29934ed4 in process (this=0x12ce8330) at qpid/sys/Poller.h:131 #7 qpid::sys::Poller::run (this=0x12ce8330) at qpid/sys/epoll/EpollPoller.cpp:524 #8 0x0000003a2992bfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x3a28120eae) at qpid/sys/posix/Thread.cpp:35 #9 0x0000003a2880677d in start_thread () from /lib64/libpthread.so.0 #10 0x0000003a280d49ad in clone () from /lib64/libc.so.6 Thread 3 (Thread 17801): #0 0x0000003a2880af59 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00002b2ef80e9697 in acquire (this=0x12ce7a20) at ../include/qpid/sys/posix/Condition.h:63 #2 lock (this=0x12ce7a20) at qpid/sys/Semaphore.h:32 #3 ScopedLock (this=0x12ce7a20) at ../include/qpid/sys/Mutex.h:33 #4 qpid::client::amqp0_10::ConnectionImpl::open (this=0x12ce7a20) at qpid/client/amqp0_10/ConnectionImpl.cpp:222 #5 0x00002b2ef80e482c in qpid::client::amqp0_10::ConnectionImpl::reopen (this=0x12ce7a7c) at qpid/client/amqp0_10/ConnectionImpl.cpp:235 #6 0x00002b2ef80f7605 in qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch> (this=0x12d34cc0, f=<value optimized out>) at qpid/client/amqp0_10/SessionImpl.h:102 #7 0x00002b2ef80f65ac in qpid::client::amqp0_10::ReceiverImpl::fetch (this=0x12ce8f70, message=<value optimized out>, timeout=...) at qpid/client/amqp0_10/ReceiverImpl.cpp:71 #8 0x00002b2ef80ccd9d in qpid::messaging::Receiver::fetch (this=<value optimized out>, message=..., timeout=...) at qpid/messaging/Receiver.cpp:38 #9 0x00002b2ef80ce399 in qpid::messaging::FailoverUpdatesImpl::run (this=0x12d349b0) at qpid/messaging/FailoverUpdates.cpp:63 #10 0x0000003a2992bfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x12ce7a7c) at qpid/sys/posix/Thread.cpp:35 #11 0x0000003a2880677d in start_thread () from /lib64/libpthread.so.0 #12 0x0000003a280d49ad in clone () from /lib64/libc.so.6 Thread 2 (Thread 17831): #0 0x0000003a280d4971 in clone () from /lib64/libc.so.6 #1 0x0000003a288066a0 in ?? () from /lib64/libpthread.so.0 #2 0x00000000433d1940 in ?? () #3 0x0000000000000000 in ?? () Thread 1 (Thread 0x41219940 (LWP 17827)): #0 0x0000003a28030265 in raise () from /lib64/libc.so.6 #1 0x0000003a28031d10 in abort () from /lib64/libc.so.6 #2 0x0000003a2a8bed14 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #3 0x0000003a2a8bce16 in ?? () from /usr/lib64/libstdc++.so.6 #4 0x0000003a2a8bce43 in std::terminate() () from /usr/lib64/libstdc++.so.6 #5 0x0000003a2a8bd34f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6 #6 0x0000003a29929c3a in boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::Socket const&, std::allocator<boost::function_base> >::operator() (this=0x0, a0=..., a1=...) at /usr/include/boost/function/function_template.hpp:576 #7 0x0000003a29a0b13a in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() (this=0x0, a0=...) at /usr/include/boost/function/function_template.hpp:576 #8 0x0000003a29a0a438 in qpid::sys::DispatchHandle::processEvent (this=0x12ce9818, type=DISCONNECTED) at qpid/sys/DispatchHandle.cpp:291 #9 0x0000003a29934ed4 in process (this=0x12ce8330) at qpid/sys/Poller.h:131 #10 qpid::sys::Poller::run (this=0x12ce8330) at qpid/sys/epoll/EpollPoller.cpp:524 #11 0x0000003a2992bfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x457a) at qpid/sys/posix/Thread.cpp:35 #12 0x0000003a2880677d in start_thread () from /lib64/libpthread.so.0 #13 0x0000003a280d49ad in clone () from /lib64/libc.so.6 (gdb) quit
Thanks Frantisek! Justin: Its an IO thread that aborts, and its at a fairly low level. There may be some way of working around it in the messaging API impl, but my guess is that this is some case we have missed so far in the correct cleanup of things at the TCPConnector/AsynchIO level.
This issue is from time to time seen in cluster test on qpid-cpp*-0.14-14.el6_2 + qpid-qmf-*0.14-6.el6_2... Test raw log: [11:55:44] ERROR:10.34.37.179:'~/mrg_cluster_tests/qpid-receive -b "10.34.37.179:5672" --failover-updates --connection-options="{ reconnect: Tru e, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address "test_cluster_failover_NtoN_rs_mq_0_2 ; {create: sen der, delete:receiver, node:{ durable: False }}" > /tmp/qpid-receive.2.log', result:134, dur.:2.57 [11:55:44] DEBUG:stdout: [11:55:44] DEBUG:stderr: 2012-04-07 05:55:33 warning Connection [10.34.37.179:38765-10.34.37.179:5672] closed 2012-04-07 05:55:33 warning Connection [10.34.37.179:38769-10.34.37.179:5672] closed pure virtual method called terminate called without an active exception bash: line 1: 27904 Aborted (core dumped) ~/mrg_cluster_tests/qpid-receive -b "10.34.37.179:5672" --failover-updates --connecti on-options="{ reconnect: True, reconnect_limit: 10, reconnect_timeout: 10, username: guest, password:guest }" --address "test_cluster_failover_N toN_rs_mq_0_2 ; {create: sender, delete:receiver, node:{ durable: False }}" > /tmp/qpid-receive.2.log Core file analysis: Core was generated by `/root/mrg_cluster_tests/qpid-receive -b 10.34.37.179:5672 --failover-updates --'. Program terminated with signal 6, Aborted. #0 0x0000003c64432885 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install boost-filesystem-1.41.0-11.el6_1.2.x86_64 boost-program-options-1.41.0-11.el6_1.2.x86_64 boost-system-1.41.0-11.el6_1.2.x86_64 cyrus-sasl-gssapi-2.1.23-13.el6.x86_64 cyrus-sasl-lib-2.1.23-13.el6.x86_64 cyrus-sasl-md5-2.1.23-13.el6.x86_64 cyrus-sasl-plain-2.1.23-13.el6.x86_64 db4-4.7.25-16.el6.x86_64 glibc-2.12-1.47.el6_2.9.x86_64 keyutils-libs-1.4-3.el6.x86_64 krb5-libs-1.9-22.el6_2.1.x86_64 libcom_err-1.41.12-11.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 libibverbs-1.1.5-3.el6.x86_64 librdmacm-1.0.14.1-3.el6.x86_64 libselinux-2.0.94-5.2.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 libuuid-2.17.2-12.4.el6.x86_64 nspr-4.8.9-3.el6_2.x86_64 nss-3.13.1-7.el6_2.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.13.1-3.el6_2.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64 (gdb) info threads 5 Thread 0x7f4d01314700 (LWP 27971) 0x0000003c644e2877 in mprotect () from /lib64/libc.so.6 4 Thread 0x7f4d02716700 (LWP 27937) 0x0000003c64c0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 3 Thread 0x7f4d044da700 (LWP 27928) 0x0000003c644e62c3 in epoll_wait () from /lib64/libc.so.6 2 Thread 0x7f4d053e07a0 (LWP 27904) 0x0000003c64c0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 * 1 Thread 0x7f4d01d15700 (LWP 27966) 0x0000003c64432885 in raise () from /lib64/libc.so.6 (gdb) t a a bt Thread 5 (Thread 0x7f4d01314700 (LWP 27971)): #0 0x0000003c644e2877 in mprotect () from /lib64/libc.so.6 #1 0x0000003c6447922f in _int_malloc () from /lib64/libc.so.6 #2 0x0000003c644797d1 in malloc () from /lib64/libc.so.6 #3 0x0000003c67cbd0bd in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6 #4 0x0000003c67cbd1d9 in operator new[](unsigned long) () from /usr/lib64/libstdc++.so.6 #5 0x00000031172936f7 in Buff (this=0x7f4cf4002880, aio_=<value optimized out>) at qpid/client/TCPConnector.cpp:50 #6 qpid::client::TCPConnector::start (this=0x7f4cf4002880, aio_=<value optimized out>) at qpid/client/TCPConnector.cpp:121 #7 0x000000311729424e in qpid::client::TCPConnector::connected (this=0x7f4cf4002880) at qpid/client/TCPConnector.cpp:113 #8 0x000000311693d2ff in operator() (this=0x7f4cf4002c30, h=...) at /usr/include/boost/function/function_template.hpp:1013 #9 qpid::sys::posix::AsynchConnector::connComplete (this=0x7f4cf4002c30, h=...) at qpid/sys/posix/AsynchIO.cpp:200 #10 0x0000003116a09503 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() ( this=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013 #11 0x0000003116a063ce in qpid::sys::DispatchHandle::processEvent (this=0x7f4cf4002c38, type=qpid::sys::Poller::WRITABLE) at qpid/sys/DispatchHandle.cpp:287 #12 0x0000003116948d4d in process (this=0xe2cca0) at qpid/sys/Poller.h:131 #13 qpid::sys::Poller::run (this=0xe2cca0) at qpid/sys/epoll/EpollPoller.cpp:524 #14 0x000000311694066a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35 #15 0x0000003c64c077f1 in start_thread () from /lib64/libpthread.so.0 #16 0x0000003c644e5ccd in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x7f4d02716700 (LWP 27937)): #0 0x0000003c64c0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f4d0543bd70 in qpid::sys::Condition::wait (this=<value optimized out>, mutex=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63 #2 0x000000311728b17b in wait (this=0x7f4cf40021d8, desired=std::set with 3 elements = {...}) at ../include/qpid/sys/Monitor.h:41 #3 qpid::client::StateManager::waitFor (this=0x7f4cf40021d8, desired=std::set with 3 elements = {...}) at qpid/client/StateManager.cpp:51 #4 0x000000311725e1dc in qpid::client::ConnectionHandler::waitForOpen (this=0x7f4cf4002168) at qpid/client/ConnectionHandler.cpp:145 #5 0x0000003117265e6d in qpid::client::ConnectionImpl::open (this=0x7f4cf4002080) at qpid/client/ConnectionImpl.cpp:284 #6 0x000000311725c0a0 in qpid::client::Connection::open (this=0xe2dbc0, settings=<value optimized out>) at qpid/client/Connection.cpp:125 #7 0x000000311725ca8e in qpid::client::Connection::open (this=0xe2dbc0, url=..., settings=...) at qpid/client/Connection.cpp:81 #8 0x00007f4d05438f13 in qpid::client::amqp0_10::ConnectionImpl::tryConnect (this=0xe2db00) at qpid/client/amqp0_10/ConnectionImpl.cpp:280 #9 0x00007f4d054397ed in qpid::client::amqp0_10::ConnectionImpl::connect (this=0xe2db00, started=...) at qpid/client/amqp0_10/ConnectionImpl.cpp:250 #10 0x00007f4d0543a423 in qpid::client::amqp0_10::ConnectionImpl::open (this=0xe2db00) at qpid/client/amqp0_10/ConnectionImpl.cpp:224 #11 0x00007f4d05436e16 in qpid::client::amqp0_10::ConnectionImpl::reopen (this=<value optimized out>) at qpid/client/amqp0_10/ConnectionImpl.cpp:235 #12 0x00007f4d05443324 in qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch> (this=0xe2d1e0, f=<value optimized out>) at qpid/client/amqp0_10/SessionImpl.h:102 #13 0x00007f4d0544206c in qpid::client::amqp0_10::ReceiverImpl::fetch (this=0xe2d900, message=<value optimized out>, timeout=<value optimized out>) at qpid/client/amqp0_10/ReceiverImpl.cpp:71 #14 0x00007f4d05427649 in qpid::messaging::FailoverUpdatesImpl::run (this=0xe2ceb0) at qpid/messaging/FailoverUpdates.cpp:63 #15 0x000000311694066a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35 #16 0x0000003c64c077f1 in start_thread () from /lib64/libpthread.so.0 #17 0x0000003c644e5ccd in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x7f4d044da700 (LWP 27928)): #0 0x0000003c644e62c3 in epoll_wait () from /lib64/libc.so.6 #1 0x000000311694867a in qpid::sys::Poller::wait (this=0xe2cca0, timeout=...) at qpid/sys/epoll/EpollPoller.cpp:568 #2 0x0000003116948d27 in qpid::sys::Poller::run (this=0xe2cca0) at qpid/sys/epoll/EpollPoller.cpp:520 #3 0x000000311694066a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35 #4 0x0000003c64c077f1 in start_thread () from /lib64/libpthread.so.0 #5 0x0000003c644e5ccd in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7f4d053e07a0 (LWP 27904)): #0 0x0000003c64c0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f4d0543bd70 in qpid::sys::Condition::wait (this=<value optimized out>, mutex=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63 #2 0x00007f4d0543a3eb in wait (this=0xe2db00) at ../include/qpid/sys/Monitor.h:41 #3 acquire (this=0xe2db00) at qpid/sys/Semaphore.h:50 #4 lock (this=0xe2db00) at qpid/sys/Semaphore.h:32 #5 ScopedLock (this=0xe2db00) at ../include/qpid/sys/Mutex.h:33 #6 qpid::client::amqp0_10::ConnectionImpl::open (this=0xe2db00) at qpid/client/amqp0_10/ConnectionImpl.cpp:222 #7 0x00007f4d05436e16 in qpid::client::amqp0_10::ConnectionImpl::reopen (this=<value optimized out>) at qpid/client/amqp0_10/ConnectionImpl.cpp:235 #8 0x00007f4d05443324 in qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch> (this=0xe2e840, f=<value optimized out>) at qpid/client/amqp0_10/SessionImpl.h:102 #9 0x00007f4d0544206c in qpid::client::amqp0_10::ReceiverImpl::fetch (this=0xe2f950, message=<value optimized out>, timeout=<value optimized out>) at qpid/client/amqp0_10/ReceiverImpl.cpp:71 #10 0x000000000040d71c in main () Thread 1 (Thread 0x7f4d01d15700 (LWP 27966)): #0 0x0000003c64432885 in raise () from /lib64/libc.so.6 #1 0x0000003c64434065 in abort () from /lib64/libc.so.6 #2 0x0000003c67cbea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #3 0x0000003c67cbcc06 in ?? () from /usr/lib64/libstdc++.so.6 #4 0x0000003c67cbcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6 #5 0x0000003c67cbd55f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6 #6 0x000000311693b1b6 in operator() (this=0x7f4cfc000980, h=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013 #7 qpid::sys::posix::AsynchIO::close (this=0x7f4cfc000980, h=<value optimized out>) at qpid/sys/posix/AsynchIO.cpp:597 #8 0x000000311693c88f in qpid::sys::posix::AsynchIO::disconnected (this=0x7f4cfc000980, h=...) at qpid/sys/posix/AsynchIO.cpp:587 #9 0x0000003116a09503 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() ( this=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013 #10 0x0000003116a06385 in qpid::sys::DispatchHandle::processEvent (this=0x7f4cfc000988, type=qpid::sys::Poller::DISCONNECTED) at qpid/sys/DispatchHandle.cpp:291 #11 0x0000003116948d4d in process (this=0xe2cca0) at qpid/sys/Poller.h:131 #12 qpid::sys::Poller::run (this=0xe2cca0) at qpid/sys/epoll/EpollPoller.cpp:524 #13 0x000000311694066a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35 #14 0x0000003c64c077f1 in start_thread () from /lib64/libpthread.so.0 #15 0x0000003c644e5ccd in clone () from /lib64/libc.so.6
This looks like another instance of BZ883469. I'll close this bug as duplicate in favour of that bug, as that bug is active. *** This bug has been marked as a duplicate of bug 883469 ***