Description of problem: Cluster nodes exit with: 2009-may-08 17:58:23 error Execution exception: invalid-argument: anonymous.qmfc-39c687e3-a29e-4a40-9904-03af96b93e04: confirmed < (128+0) but only sent < (87+0) (qpid/SessionState.cpp:151) 2009-may-08 17:58:23 critical 192.168.0.2:5766(READY/error) Error 280 did not occur on 192.168.0.2:5745 2009-may-08 17:58:23 error Error delivering frames: Aborted by local failure that did not occur on all replicas 2009-may-08 17:58:23 notice 192.168.0.2:5766(LEFT/error) leaving cluster grs_laptop 2009-may-08 17:58:23 notice Shut down How reproducible: Reliably, but with default settings takes a very long time. Steps to Reproduce: 1. start several cluster nodes 2. run the attached ping client against one of them Actual results: 2009-may-08 17:58:23 error Execution exception: invalid-argument: anonymous.qmfc-39c687e3-a29e-4a40-9904-03af96b93e04: confirmed < (128+0) but only sent < (87+0) (qpid/SessionState.cpp:151) 2009-may-08 17:58:23 critical 192.168.0.2:5766(READY/error) Error 280 did not occur on 192.168.0.2:5745 2009-may-08 17:58:23 error Error delivering frames: Aborted by local failure that did not occur on all replicas 2009-may-08 17:58:23 notice 192.168.0.2:5766(LEFT/error) leaving cluster grs_laptop 2009-may-08 17:58:23 notice Shut down Expected results: Should keep going indefinately Additional info: For development, radically reducing the value of SPONTANEOUS_REQUEST_INTERVAL in qpid/SessionState.cpp speeds up the occurrence of the issue.
Created attachment 343433 [details] ping client
Gsim provided a patch to reproduce this problem more quickly. Completion criteria: test passes repeatedly.
Created attachment 343584 [details] Patch to cause the error to occur more frequently. Gordon Sim wrote: > I had to actually modify the broker code to get a faster occurrence of the bug, specifically lowering the SPONTANEOUS_REQUEST_INTERVAL in the session state drastically. > > To reproduce, apply the patch and recompile, start two broker nodes and then run the patched ping example (in its current form it needs one of the nodes to be listening on port 5672). > > I find that the second node (i.e. the one the ping client is not connected to) then exits in a fairly short time. >
Created attachment 343626 [details] Proposed fix. Attached patch appears to fix the issue for me, I can't reproduce it anymore. gsim can you verify this patch in your test environment before I commit?
Created attachment 343826 [details] Fixed issue in previous fix. Previous patch caused problems in other tests. This one looks good - passes make check on 2 boxes as well as the reproducer case.
I can confirm that patch appears to prevent the problem occuring.
Committed fix for this specific issue in r774809 but we need a larger fix for cluster+management, see bug 501015
Undid commit in r775182 This fix is incorrect. The timer will go off in each member, and each one will send a response message which is replicated, resulting in a response from each member being enqueued rather than a single response. It's also heading in the wrong direction, its making management more replicated but the better solution will probably be to make management full *un*replicated, see bug 501015
Created attachment 344195 [details] better reproducer This is a better reproducing test case. It will issue continuous queries against a set of nodes. To use specify all the nodes to query in host:port format, e.g: ./thrasher localhost:5672 localhost:5673 localhost:5674 localhost:5675 Declaring lots of queues before doing this will also speed up the error.
Believed fixed in qpidd-0.5.752581-6.el5
Created attachment 345421 [details] Fix for 1.1.2 The attached patch is the fix that was applied for qpidd-0.5.752581-6.el5.
The issue has been fixed, validated on RHEL 5.3 i386 / x86_64 on packages: [root@intel-greencity-01 bz499872]# rpm -qa | egrep '(qpid|openais)' | sort -u openais-0.80.3-22.el5_3.7 openais-debuginfo-0.80.3-22.el5_3.7 openais-devel-0.80.3-22.el5_3.7 python-qpid-0.5.752581-1.el5 qpidc-0.5.752581-10.el5 qpidc-debuginfo-0.5.752581-10.el5 qpidc-devel-0.5.752581-10.el5 qpidc-perftest-0.5.752581-10.el5 qpidc-rdma-0.5.752581-10.el5 qpidc-ssl-0.5.752581-10.el5 qpidd-0.5.752581-10.el5 qpidd-acl-0.5.752581-10.el5 qpidd-cluster-0.5.752581-10.el5 qpidd-devel-0.5.752581-10.el5 qpid-dotnet-0.4.738274-2.el5 qpidd-rdma-0.5.752581-10.el5 qpidd-ssl-0.5.752581-10.el5 qpidd-xml-0.5.752581-10.el5 qpid-java-client-0.5.751061-4.el5 qpid-java-common-0.5.751061-4.el5 ->VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-1097.html