Created attachment 423905 [details]
Test logs from a failed run

Description of problem:

The cluster_tests.test_management test is failing:

cluster_tests.LongTests.test_management ............................................. fail
Error during test:
  Traceback (most recent call last):
    File "/home/remote/aconway/qpid2/qpid/dbg/src/tests/python/commands/qpid-python-test", line 311, in run
      phase()
    File "/home/remote/aconway/qpid2/qpid/cpp/src/tests/cluster_tests.py", line 291, in test_management
      for b in cluster[alive:]: b.ready() # Check if a broker crashed.
    File "/home/remote/aconway/qpid2/qpid/dbg/src/tests/python/qpid/brokertest.py", line 393, in ready
      except: raise RethrownException(
  RethrownException: Broker cluster1-0 failed ready test:
cluster1-0: 2010-06-14 12:20:06 debug cluster(20.0.100.32:2063 LEFT/error) local close of replicated connection 20.0.100.32:2063-4(local)
cluster1-0: 2010-06-14 12:20:06 debug cluster(20.0.100.32:2063 LEFT/error) deleted connection: 20.0.100.32:2063-4(local)
cluster1-0: 2010-06-14 12:20:06 debug Shutting down CPG
cluster1-0: 2010-06-14 12:20:06 notice Shut down

Version-Release number of selected component (if applicable):
Trunk r954471

How reproducible:
every time

Steps to Reproduce:
1. cd qpid/cpp/src/tests
2. source test_env.sh
3. run_cluster_tests *.test_management -DDURATION=2

Actual results:
fail

Expected results:
pass
Additional information:

This can be reproduced simply by starting up a two-node cluster and running the command "qpid-stat -b" against one of the cluster nodes. The connected node will fail with the following log:

2010-06-15 08:52:09 error Execution exception: invalid-argument: anonymous.dhcp-100-18-254.bos.redhat.com.29971.3: confirmed < (45+0) but only sent < (44+0) (qpid/SessionState.cpp:151)
2010-06-15 08:52:09 critical cluster(127.0.0.1:29929 READY/error) local error 587 did not occur on member 127.0.0.1:29949: invalid-argument: anonymous.dhcp-100-18-254.bos.redhat.com.29971.3: confirmed < (45+0) but only sent < (44+0) (qpid/SessionState.cpp:151)
2010-06-15 08:52:09 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: anonymous.dhcp-100-18-254.bos.redhat.com.29971.3: confirmed < (45+0) but only sent < (44+0) (qpid/SessionState.cpp:151) (qpid/cluster/ErrorCheck.cpp:89)
2010-06-15 08:52:09 notice cluster(127.0.0.1:29929 LEFT/error) leaving cluster TED
2010-06-15 08:52:09 notice Shut down
Even more information:

This problem is introduced at the client level. If you revert qpid/extras/qmf/src/py/qmf/console.py to Subversion rev 953702, the problem goes away. The important difference in the console client code is that the newer version (the one that triggers the crash) applies flow-control back-pressure on multiple subscriptions. Is it possible that credit balances for flow control are not being handled uniformly by the nodes in the cluster?
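
To make the question about credit balances concrete, here is a toy Python 3 model of how uneven credit could produce the "confirmed < (45+0) but only sent < (44+0)" error on one member only. This is not the broker code and not a diagnosis of the actual bug: the SenderState class, the replay() helper and the member names are all hypothetical, and the only thing taken from the logs above is the wording of the error message. It assumes each member independently tracks how many commands it has sent on the replicated session and rejects a confirmation that goes beyond that count.

```python
class SenderState:
    """Hypothetical stand-in for one member's view of a session's send side."""

    def __init__(self):
        self.sent = 0  # number of commands this member believes it has sent

    def send(self, count):
        self.sent += count

    def confirm(self, upto):
        # Reject a confirmation for commands this member never sent.
        # The message text imitates the one seen in the broker log.
        if upto > self.sent:
            raise ValueError(
                "confirmed < (%d+0) but only sent < (%d+0)" % (upto, self.sent))


def replay(credit_by_member, pending, confirm_upto):
    """Apply the same client confirmation on every member.

    credit_by_member maps a member name to the credit that member believes
    the subscription has (an assumption made for illustration). Each member
    sends min(credit, pending) messages, then receives the same confirmation.
    Returns {member: error string, or None if the member accepted it}.
    """
    errors = {}
    for member, credit in credit_by_member.items():
        state = SenderState()
        state.send(min(credit, pending))
        try:
            state.confirm(confirm_upto)
            errors[member] = None
        except ValueError as exc:
            errors[member] = str(exc)
    return errors


if __name__ == "__main__":
    # Members disagree on credit by one; the client confirms 45 commands.
    result = replay({"member-a": 44, "member-b": 45},
                    pending=50, confirm_upto=45)
    for member, error in sorted(result.items()):
        print(member, "->", error or "ok")
```

In this model, a one-message disagreement in credit is enough for the same confirmation to be an error on one member and valid on the other, which is the kind of mismatch the "local error did not occur on all cluster members" check reports.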
Fixed in r955370 and in the MRG 1.3 release repo:
http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=c8e4559e0a26efe70e3a462f8e49a4bd55ba46a2
Tested: the bug appears on 752581 and does not appear on 946106.

The fix has been validated on RHEL 5.5 i386 / x86_64, but not on RHEL4, because there are no clustering packages for it:

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u
openais-0.80.6-16.el5_5.1
openais-debuginfo-0.80.6-16.el5_5.1
python-qpid-0.7.946106-1.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-1.el5
qpid-cpp-server-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
rhm-docs-0.7.946106-1.el5

->VERIFIED