Bug 499872 - QMF sessions can cause cluster nodes to exit
Summary: QMF sessions can cause cluster nodes to exit
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.0
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: 1.1.2
: ---
Assignee: Alan Conway
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks: 501015
TreeView+ depends on / blocked
 
Reported: 2009-05-08 17:07 UTC by Gordon Sim
Modified: 2015-11-16 00:07 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-06-12 17:38:52 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
ping client (4.15 KB, text/x-c++src)
2009-05-11 11:49 UTC, Gordon Sim
no flags Details
Patch to cause the error to occur more frequently. (1.10 KB, patch)
2009-05-12 13:54 UTC, Alan Conway
no flags Details | Diff
Proposed fix. (19.94 KB, patch)
2009-05-12 16:45 UTC, Alan Conway
no flags Details | Diff
Fixed issue in previous fix. (17.57 KB, patch)
2009-05-13 17:43 UTC, Alan Conway
no flags Details | Diff
better reproducer (2.84 KB, text/x-c++src)
2009-05-15 16:15 UTC, Gordon Sim
no flags Details
Fix for 1.1.2 (2.14 KB, patch)
2009-05-26 09:18 UTC, Gordon Sim
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2009:1097 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging bug fixing update 2009-06-12 17:38:48 UTC

Description Gordon Sim 2009-05-08 17:07:45 UTC
Description of problem:

Cluster nodes exit with:

2009-may-08 17:58:23 error Execution exception: invalid-argument: anonymous.qmfc-39c687e3-a29e-4a40-9904-03af96b93e04: confirmed < (128+0) but only sent < (87+0) (qpid/SessionState.cpp:151)
2009-may-08 17:58:23 critical 192.168.0.2:5766(READY/error) Error 280 did not occur on 192.168.0.2:5745
2009-may-08 17:58:23 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-may-08 17:58:23 notice 192.168.0.2:5766(LEFT/error) leaving cluster grs_laptop
2009-may-08 17:58:23 notice Shut down


How reproducible:

Reliably, but with default settings takes a very long time.

Steps to Reproduce:
1. start several cluster nodes
2. run the attached ping client against one of them 
  
Actual results:

2009-may-08 17:58:23 error Execution exception: invalid-argument: anonymous.qmfc-39c687e3-a29e-4a40-9904-03af96b93e04: confirmed < (128+0) but only sent < (87+0) (qpid/SessionState.cpp:151)
2009-may-08 17:58:23 critical 192.168.0.2:5766(READY/error) Error 280 did not occur on 192.168.0.2:5745
2009-may-08 17:58:23 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-may-08 17:58:23 notice 192.168.0.2:5766(LEFT/error) leaving cluster grs_laptop
2009-may-08 17:58:23 notice Shut down


Expected results:

Should keep going indefinately

Additional info:

For development, radically reducing the value of SPONTANEOUS_REQUEST_INTERVAL in qpid/SessionState.cpp speeds up the occurrence of the issue.

Comment 1 Gordon Sim 2009-05-11 11:49:38 UTC
Created attachment 343433 [details]
ping client

Comment 2 Alan Conway 2009-05-12 13:42:07 UTC
Gsim provided a patch to reproduce this problem more quickly.

Completion criteria: test passes repeatedly.

Comment 3 Alan Conway 2009-05-12 13:54:14 UTC
Created attachment 343584 [details]
Patch to cause the error to occur more frequently.

Gordon Sim wrote:
> I had to actually modify the broker code to get a faster occurrence of the bug, specifically lowering the SPONTANEOUS_REQUEST_INTERVAL in the session state drastically.
>
> To reproduce, apply the patch and recompile, start two broker nodes and then run the patched ping example (in its current form it needs one of the nodes to be listening on port 5672).
>
> I find that the second node (i.e. the one the ping client is not connected to) then exits in a fairly short time.
>

Comment 4 Alan Conway 2009-05-12 16:45:56 UTC
Created attachment 343626 [details]
Proposed fix.

Attached patch appears to fix the issue for me, I can't reproduce it anymore. 
gsim can you verify this patch in your test environment before I commit?

Comment 5 Alan Conway 2009-05-13 17:43:38 UTC
Created attachment 343826 [details]
Fixed issue in previous fix.

Previous patch caused problems in other tests. This one looks good - passes make check on 2 boxes as well as the reproducer case.

Comment 6 Gordon Sim 2009-05-14 13:42:12 UTC
I can confirm that patch appears to prevent the problem occuring.

Comment 7 Alan Conway 2009-05-15 14:25:40 UTC
Committed fix for this specific issue in r774809 but we need a larger fix for cluster+management, see bug 501015

Comment 8 Alan Conway 2009-05-15 15:12:44 UTC
Undid commit in r775182

This fix is incorrect. The timer will go off in each member, and each one will send a response message which is replicated, resulting in a response from each member being enqueued rather than a single response.

It's also heading in the wrong direction, its making management more replicated but the better solution will probably be to make management full *un*replicated, see bug 501015

Comment 9 Gordon Sim 2009-05-15 16:15:42 UTC
Created attachment 344195 [details]
better reproducer

This is a better reproducing test case. It will issue continuous queries against a set of nodes. To use specify all the nodes to query in host:port format, e.g:

   ./thrasher localhost:5672 localhost:5673 localhost:5674 localhost:5675

Declaring lots of queues before doing this will also speed up the error.

Comment 10 Gordon Sim 2009-05-15 17:14:44 UTC
Believed fixed in qpidd-0.5.752581-6.el5

Comment 11 Gordon Sim 2009-05-26 09:18:22 UTC
Created attachment 345421 [details]
Fix for 1.1.2

The attached patch is the fix that was applied for qpidd-0.5.752581-6.el5.

Comment 12 Frantisek Reznicek 2009-06-01 07:03:31 UTC
The issue has been fixed, validated on RHEL 5.3 i386 / x86_64 on packages:
[root@intel-greencity-01 bz499872]# rpm -qa | egrep '(qpid|openais)' | sort -u
openais-0.80.3-22.el5_3.7
openais-debuginfo-0.80.3-22.el5_3.7
openais-devel-0.80.3-22.el5_3.7
python-qpid-0.5.752581-1.el5
qpidc-0.5.752581-10.el5
qpidc-debuginfo-0.5.752581-10.el5
qpidc-devel-0.5.752581-10.el5
qpidc-perftest-0.5.752581-10.el5
qpidc-rdma-0.5.752581-10.el5
qpidc-ssl-0.5.752581-10.el5
qpidd-0.5.752581-10.el5
qpidd-acl-0.5.752581-10.el5
qpidd-cluster-0.5.752581-10.el5
qpidd-devel-0.5.752581-10.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-10.el5
qpidd-ssl-0.5.752581-10.el5
qpidd-xml-0.5.752581-10.el5
qpid-java-client-0.5.751061-4.el5
qpid-java-common-0.5.751061-4.el5

->VERIFIED

Comment 14 errata-xmlrpc 2009-06-12 17:38:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1097.html


Note You need to log in before you can comment on or make changes to this bug.