Description of problem: Custom unit test custom_py_units.queue.LLAPITests. \ test_queue_nonexclusive_timed_autodelete_w_msgs_on_sclose triggers clustered broker abort[s] in qpid::broker::ConnectionHandler::handle() -> operator () -> __cxa_pure_virtual () -> terminate () ... Thread 1 (Thread 0x43b28940 (LWP 16777)): #0 0x000000312b230265 in raise () from /lib64/libc.so.6 #1 0x000000312b231d10 in abort () from /lib64/libc.so.6 #2 0x000000312fabed14 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #3 0x000000312fabce16 in ?? () from /usr/lib64/libstdc++.so.6 #4 0x000000312fabce43 in std::terminate() () from /usr/lib64/libstdc++.so.6 #5 0x000000312fabd34f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6 #6 0x000000312e35cee1 in operator() (this=0x6bfb1e0, frame=...) at qpid/framing/Handler.h:42 #7 qpid::broker::ConnectionHandler::handle (this=0x6bfb1e0, frame=...) at qpid/broker/ConnectionHandler.cpp:87 #8 0x000000312e3520a8 in qpid::broker::Connection::received (this=0x6bfb000, frame=...) at qpid/broker/Connection.cpp:159 #9 0x00002b0543e56941 in qpid::cluster::Connection::deliveredFrame ( this=0x6bf7850, f=...) at qpid/cluster/Connection.cpp:248 #10 0x00002b0543e24ad9 in qpid::cluster::Cluster::processFrame ( this=0x2aed1b0, e=..., l=<value optimized out>) at qpid/cluster/Cluster.cpp:548 ... Unfortunately there is no log file available atm from above aborts. I believe broker ends with "pure virtual method called" indicated as: #4 0x000000312fabce43 in std::terminate() () from /usr/lib64/libstdc++.so.6 #5 0x000000312fabd34f in __cxa_pure_virtual () This abort is fatal condition and should be fixed. See unit test code attached below. Version-Release number of selected component (if applicable): python-qpid-0.14-4.el5 python-qpid-qmf-0.14-3.el5 qpid-cpp-client-0.14-7.el5 qpid-cpp-client-devel-0.14-7.el5 qpid-cpp-client-devel-docs-0.14-7.el5 qpid-cpp-client-rdma-0.14-7.el5 qpid-cpp-client-ssl-0.14-7.el5 qpid-cpp-mrg-debuginfo-0.14-7.el5 qpid-cpp-server-0.14-7.el5 qpid-cpp-server-cluster-0.14-7.el5 qpid-cpp-server-devel-0.14-7.el5 qpid-cpp-server-rdma-0.14-7.el5 qpid-cpp-server-ssl-0.14-7.el5 qpid-cpp-server-store-0.14-7.el5 qpid-cpp-server-xml-0.14-7.el5 qpid-java-client-0.14-3.el5 qpid-java-common-0.14-3.el5 qpid-java-example-0.14-3.el5 qpid-qmf-0.14-3.el5 qpid-qmf-debuginfo-0.14-3.el5 qpid-qmf-devel-0.14-3.el5 qpid-tests-0.14-1.el5 qpid-tools-0.14-1.el5 rh-qpid-cpp-tests-0.14-7.el5 ruby-qpid-qmf-0.14-3.el5 How reproducible: 80% Steps to Reproduce: 1. set-up clustered broker 2. run test_queue_nonexclusive_timed_autodelete_w_msgs_on_sclose unit Actual results: Clustered broker[s] abort. Expected results: Clustered broker[s] should survive the test. Additional info: Aborts were detected on both RHEL5.7 and 5.8
Created attachment 567681 [details] The full backtrace and unit test code Attachment consist of: - unit test code (test_queue_nonexclusive_timed_autodelete_w_msgs_on_sclose.py) - full backtrace of two detected aborts (bz800028.txt)
Created attachment 568653 [details] clustered brokers logs Reproduced on RHEL5.8 x86_64 on the latest 0.14-10 packages, attaching broker logs. broker logs indicates that both the clustered brokers are aborted with following exception at the same time: 2012-03-08 09:31:21 error Unexpected exception: Task already exists with name DelayedAutoDeletion (qpid/cluster/ClusterTimer.cpp:71) # stat -c %x%y%z core.3671 core.3672 2012-03-08 09:31:21.000000000 -05002012-03-08 09:31:21.000000000 -05002012-03-08 09:31:21.000000000 -0500 2012-03-08 09:31:21.000000000 -05002012-03-08 09:31:21.000000000 -05002012-03-08 09:31:21.000000000 -0500
Root cause for this crash appears to be in ClusterTimer::add(), which throws an exception should two timers be created with the same name. Upstream JIRA created, with a workaround patch that "fixes" the auto delete queue crash: https://issues.apache.org/jira/browse/QPID-3896
Created attachment 570656 [details] clustered brokers logs retested on RHEL6 i686/x86_64, broker crash is fixed, but running the test causes all cluster nodes except one (that one remain standing) to exit the cluster with the following cause: critical Error delivering frames: Cluster timer wakeup non-existent task DelayedAutoDeletion:642922c3-6cb2-4c6b-b332-d2929ab95c34 (qpid/cluster/ClusterTimer.cpp:112) Waiting for qpid-qmf rebuild for RHEL5 retest.
The above mentioned issue is probably the root cause of this new crash, moving back to assigned. -> ASSIGNED
Updated the upstream with a new patch that should actually work in a cluster. See https://issues.apache.org/jira/secure/attachment/12518923/qpid-3896.patch
Retested on RHEL6 i686 on qpid-0.14-12 with the patch from Comment 10 applied, the issue seems to be fixed. Test from Comment 1 is passing with clustered brokers, no broker failure detected. Waiting for packages for overall retest.
Fix posted upstream trunk: http://svn.apache.org/viewvc?view=rev&rev=1303035
This issue has been fixed for rhel5, verified on rhel5.8 i386 & x86_64 - qpid-cpp-mrg-0.14-14.el5. Waiting for rhel6 packages for retest.
Issue fixed, tested on rhel5.8 / 6.2 i/x on packages: qpid-cpp-mrg-0.14-14.el5 qpid-cpp-0.14-14.el6_2 -> VERIFIED