Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 499872

Summary:

QMF sessions can cause cluster nodes to exit

Product:

Red Hat Enterprise MRG

Reporter:

Gordon Sim <gsim>

Component:

qpid-cpp

Assignee:

Alan Conway <aconway>

Status:

CLOSED ERRATA

QA Contact:

Frantisek Reznicek <freznice>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

1.0

CC:

esammons, freznice

Target Milestone:

1.1.2

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-06-12 17:38:52 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

501015

Attachments:

Description	Flags
ping client	none
Patch to cause the error to occur more frequently.	none
Proposed fix.	none
Fixed issue in previous fix.	none
better reproducer	none
Fix for 1.1.2	none

Description Gordon Sim 2009-05-08 17:07:45 UTC

Description of problem:

Cluster nodes exit with:

2009-may-08 17:58:23 error Execution exception: invalid-argument: anonymous.qmfc-39c687e3-a29e-4a40-9904-03af96b93e04: confirmed < (128+0) but only sent < (87+0) (qpid/SessionState.cpp:151)
2009-may-08 17:58:23 critical 192.168.0.2:5766(READY/error) Error 280 did not occur on 192.168.0.2:5745
2009-may-08 17:58:23 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-may-08 17:58:23 notice 192.168.0.2:5766(LEFT/error) leaving cluster grs_laptop
2009-may-08 17:58:23 notice Shut down


How reproducible:

Reliably, but with default settings takes a very long time.

Steps to Reproduce:
1. start several cluster nodes
2. run the attached ping client against one of them 
  
Actual results:

2009-may-08 17:58:23 error Execution exception: invalid-argument: anonymous.qmfc-39c687e3-a29e-4a40-9904-03af96b93e04: confirmed < (128+0) but only sent < (87+0) (qpid/SessionState.cpp:151)
2009-may-08 17:58:23 critical 192.168.0.2:5766(READY/error) Error 280 did not occur on 192.168.0.2:5745
2009-may-08 17:58:23 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-may-08 17:58:23 notice 192.168.0.2:5766(LEFT/error) leaving cluster grs_laptop
2009-may-08 17:58:23 notice Shut down


Expected results:

Should keep going indefinately

Additional info:

For development, radically reducing the value of SPONTANEOUS_REQUEST_INTERVAL in qpid/SessionState.cpp speeds up the occurrence of the issue.

Comment 1 Gordon Sim 2009-05-11 11:49:38 UTC

Created attachment 343433 [details]
ping client

Comment 2 Alan Conway 2009-05-12 13:42:07 UTC

Gsim provided a patch to reproduce this problem more quickly.

Completion criteria: test passes repeatedly.

Comment 3 Alan Conway 2009-05-12 13:54:14 UTC

Created attachment 343584 [details]
Patch to cause the error to occur more frequently.

Gordon Sim wrote:
> I had to actually modify the broker code to get a faster occurrence of the bug, specifically lowering the SPONTANEOUS_REQUEST_INTERVAL in the session state drastically.
>
> To reproduce, apply the patch and recompile, start two broker nodes and then run the patched ping example (in its current form it needs one of the nodes to be listening on port 5672).
>
> I find that the second node (i.e. the one the ping client is not connected to) then exits in a fairly short time.
>

Comment 4 Alan Conway 2009-05-12 16:45:56 UTC

Created attachment 343626 [details]
Proposed fix.

Attached patch appears to fix the issue for me, I can't reproduce it anymore. 
gsim can you verify this patch in your test environment before I commit?

Comment 5 Alan Conway 2009-05-13 17:43:38 UTC

Created attachment 343826 [details]
Fixed issue in previous fix.

Previous patch caused problems in other tests. This one looks good - passes make check on 2 boxes as well as the reproducer case.

Comment 6 Gordon Sim 2009-05-14 13:42:12 UTC

I can confirm that patch appears to prevent the problem occuring.

Comment 7 Alan Conway 2009-05-15 14:25:40 UTC

Committed fix for this specific issue in r774809 but we need a larger fix for cluster+management, see bug 501015

Comment 8 Alan Conway 2009-05-15 15:12:44 UTC

Undid commit in r775182

This fix is incorrect. The timer will go off in each member, and each one will send a response message which is replicated, resulting in a response from each member being enqueued rather than a single response.

It's also heading in the wrong direction, its making management more replicated but the better solution will probably be to make management full *un*replicated, see bug 501015

Comment 9 Gordon Sim 2009-05-15 16:15:42 UTC

Created attachment 344195 [details]
better reproducer

This is a better reproducing test case. It will issue continuous queries against a set of nodes. To use specify all the nodes to query in host:port format, e.g:

   ./thrasher localhost:5672 localhost:5673 localhost:5674 localhost:5675

Declaring lots of queues before doing this will also speed up the error.

Comment 10 Gordon Sim 2009-05-15 17:14:44 UTC

Believed fixed in qpidd-0.5.752581-6.el5

Comment 11 Gordon Sim 2009-05-26 09:18:22 UTC

Created attachment 345421 [details]
Fix for 1.1.2

The attached patch is the fix that was applied for qpidd-0.5.752581-6.el5.

Comment 12 Frantisek Reznicek 2009-06-01 07:03:31 UTC

The issue has been fixed, validated on RHEL 5.3 i386 / x86_64 on packages:
[root@intel-greencity-01 bz499872]# rpm -qa | egrep '(qpid|openais)' | sort -u
openais-0.80.3-22.el5_3.7
openais-debuginfo-0.80.3-22.el5_3.7
openais-devel-0.80.3-22.el5_3.7
python-qpid-0.5.752581-1.el5
qpidc-0.5.752581-10.el5
qpidc-debuginfo-0.5.752581-10.el5
qpidc-devel-0.5.752581-10.el5
qpidc-perftest-0.5.752581-10.el5
qpidc-rdma-0.5.752581-10.el5
qpidc-ssl-0.5.752581-10.el5
qpidd-0.5.752581-10.el5
qpidd-acl-0.5.752581-10.el5
qpidd-cluster-0.5.752581-10.el5
qpidd-devel-0.5.752581-10.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-10.el5
qpidd-ssl-0.5.752581-10.el5
qpidd-xml-0.5.752581-10.el5
qpid-java-client-0.5.751061-4.el5
qpid-java-common-0.5.751061-4.el5

->VERIFIED

Comment 14 errata-xmlrpc 2009-06-12 17:38:52 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1097.html