Created attachment 451398 [details]
core dump from qpidd

Description of problem:
Clustered qpidd broker crashes during shutdown in ObjectId(...) -> memcpy(). See the core dump in the attachment.

Version-Release number of selected component (if applicable):
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-mrg-debuginfo-0.7.946106-17.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-cpp-client-devel-0.7.946106-17.el5

How reproducible:
low (about 2%)

Steps to Reproduce:
1. Set up a cluster with one node.
2. Start perftest against it.
3. Add a second node to the cluster and keep restarting it in a loop (see attached script).

Actual results:
qpidd crashes.

Expected results:
qpidd does not crash.
Created attachment 451400 [details]
reproducing script for 2nd node restart
# stat core.12762
  File: `core.12762'
  Size: 344858624   Blocks: 14840   IO Block: 4096   regular file
Device: fd00h/64768d   Inode: 3112981   Links: 1
Access: (0600/-rw-------)   Uid: ( 0/ root)   Gid: ( 0/ root)
Access: 2010-10-04 10:06:21.000000000 -0400
Modify: 2010-10-04 10:06:21.000000000 -0400
Change: 2010-10-04 10:06:21.000000000 -0400

qpidd2.log:
2010-10-04 10:06:18 info cluster(10.16.64.195:12762 CATCHUP) member update: 10.16.64.195:11320(member) 10.16.64.195:12762(member) frameSeq=25849970
2010-10-04 10:06:18 notice cluster(10.16.64.195:12762 READY) caught up.
2010-10-04 10:06:21 notice Shut down
2010-10-04 10:06:26 info Loaded Module: /usr/lib/qpid/daemon/xml.so
2010-10-04 10:06:26 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so

The segfault happens at the exact time of the "Shut down" notice in the log file. It is reproducible on RHEL5/i386 only.
I think this one is gone. On latest trunk -- rev 1057786 -- I have seen no failure after 300 repetitions of the second broker being added and then shut down (while perftest was running, pointed at the first broker). The same original broker is still running, and no core files were produced. If the bug was being seen about 2% of the time, then there is roughly 1 chance in 500 of this no-show happening accidentally.
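As a quick sanity check of that estimate (assuming an independent ~2% failure probability per iteration):

  P(no crash in 300 runs) = (1 - 0.02)^300 ≈ e^(-6.06) ≈ 2.3 x 10^-3

i.e. roughly 1 in 430, the same order of magnitude as the "1 chance in 500" quoted above.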
Possibly fixed by http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commit;h=af12fa7cae14821a12b0a3199ab59ca146cac7a7; marking modified for verification.
It looks like this bug is absent from the current build.

Re-tested on an i386 machine running RHEL5 (the arch mentioned above), using the latest dash-17 build, installed from RPMs. The broker was configured with the following modules: cluster, msgstore, xml.

While running a perftest (sending 2 billion messages) against the original cluster broker, I used the QE script attached above to start and stop another broker in the same cluster 1000 times.

Result: the secondary broker always started and shut down normally, and the original broker was still running at the end of the test. No core files were produced. If the original odds of failure were about 2%, then there is essentially no chance that this result happened accidentally (about 1.7 chances in 1 billion).
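The same back-of-the-envelope calculation for 1000 iterations, again assuming an independent ~2% failure probability per run:

  P(no crash in 1000 runs) = (1 - 0.02)^1000 ≈ e^(-20.2) ≈ 1.7 x 10^-9

which matches the "1.7 chances in 1 billion" figure above.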
In the above comment, when I said "dash-17" I meant "dash-27". Off-by-ten error.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
There was a race condition in the cluster code that allowed a cluster connection to be deleted before it was fully initialized (before its codec was constructed). This could occasionally cause a clustered broker to crash if it was shut down soon after being added to its cluster. By adding a new broker to a cluster under load and then shutting it down, the crash could be observed about 2% of the time. Alan prevented the race by delaying the addition of the connection to the Cluster's map until the connection's codec is constructed. I was then able to run a loop that added a new broker to a cluster under load and then shut it down 1000 times with no crash.
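As an illustration only, here is a minimal C++ sketch of the general publish-before-initialization race described above and of the ordering fix. The names (Cluster, Connection, Codec, addConnection*) are hypothetical placeholders and do not reflect the actual qpid cluster source.

// Hypothetical illustration of the race described above; these class
// names are placeholders, not the real qpid classes.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Codec {};                              // frame encode/decode state

struct Connection {
    std::unique_ptr<Codec> codec;             // built after the Connection itself
    void initCodec() { codec.reset(new Codec); }
};

class Cluster {
    std::mutex lock;
    std::map<std::string, std::shared_ptr<Connection> > connections;
public:
    // Racy ordering: the connection is visible to other threads (e.g. a
    // shutdown path that erases and destroys map entries) before its
    // codec exists.
    void addConnectionRacy(const std::string& id,
                           const std::shared_ptr<Connection>& c) {
        { std::lock_guard<std::mutex> g(lock); connections[id] = c; }
        c->initCodec();                       // may run after shutdown has torn c down
    }

    // Ordering matching the fix described above: finish initialization,
    // then publish the connection in the map.
    void addConnectionSafe(const std::string& id,
                           const std::shared_ptr<Connection>& c) {
        c->initCodec();                       // codec exists first
        std::lock_guard<std::mutex> g(lock);
        connections[id] = c;                  // only now visible to shutdown
    }
};

The point of the fix is simply the ordering: nothing else in the system can see (and therefore delete) the connection until it is fully constructed.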
The defect is under heavy test; so far all tests indicate the issue is gone. Unfortunately the scenario is the same as bug 509796, which does not appear to be fixed, and that makes me wonder whether I am testing enough. Update soon.
The issue has been fixed: extremely long runs on multiple machines no longer trigger it.

Tested on RHEL 5.6 i386 / x86_64 with the following packages:
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-rdma-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-rdma-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html