Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 533431

Summary: inconsistent message positions on node that joins cluster when unacknowledged messages exist
Product: Red Hat Enterprise MRG Reporter: Gordon Sim <gsim>
Component: qpid-cpp Assignee: Gordon Sim <gsim>
Status: CLOSED ERRATA QA Contact: Frantisek Reznicek <freznice>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 1.1.6 CC: esammons, freznice, iboverma, lbrindle
Target Milestone: 1.2   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Messaging bug fix
C: If there were messages that had been delivered from a queue but not acknowledged when a new node joined, the position of those unacked messages was incorrect on the new node.
C: The new node would exit with an error.
F: The position field of messages in the list of delivery records is now set correctly.
R: Browsing the queue from either node now shows the same results, and new nodes no longer exit unexpectedly.
If there were messages that had been delivered from a queue but not acknowledged when a new node joined, the position of those unacked messages was incorrect on the new node. This resulted in the new node exiting with an error. The position field of messages in the list of delivery records is now set correctly. Browsing the queue from either node now shows the same results, and new nodes no longer exit unexpectedly.
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-12-03 09:15:47 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 527551    
Attachments:
Description Flags
acquirer program referenced in step 9 of reproducer none

Description Gordon Sim 2009-11-06 18:50:43 UTC
Created attachment 367863
acquirer program referenced in step 9 of reproducer

Description of problem:

If there are messages that have been delivered from a queue but not acknowledged at the point that a new node joins, the position of those unacked messages is incorrect on the new node. This can end up causing the second node to exit with an error.

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-30.el5
qpidd-cluster-0.5.752581-30.el5

How reproducible:

100%

Steps to Reproduce:
1. Start a one-node cluster.
2. qpid-config add queue test-queue
3. for m in b1 b2 b3; do echo $m; done | sender
4. receiver --ack-frequency 5 --credit-window 3 (leave it running, or run it in the background)
6. Start a second node for the cluster.
7. for m in a1 a2 a3; do echo $m; done | sender
8. Kill the receiver started in step 4.
9. Run the attached acquirer client.
  
Actual results:

Second node exits with something like:

2009-nov-06 13:40:47 error Execution exception: invalid-argument: anonymous.66d117b6-49d7-4b0c-9c3f-a561b2ba9f07: confirmed < (7+0) but only sent < (4+0) (qpid/SessionState.cpp:163)
2009-nov-06 13:40:47 error 10.16.44.222:30367(READY/error) channel error 623 on 10.16.44.222:30294-9(shadow): invalid-argument: anonymous.66d117b6-49d7-4b0c-9c3f-a561b2ba9f07: confirmed < (7+0) but only sent < (4+0) (qpid/SessionState.cpp:163) (unresolved: 10.16.44.222:30294 10.16.44.222:30367 )
2009-nov-06 13:40:47 critical 10.16.44.222:30367(READY/error) error 623 did not occur on 10.16.44.222:30294
2009-nov-06 13:40:47 error Error delivering frames: Error 623 did not occur on all members (qpid/cluster/ErrorCheck.cpp:90)
2009-nov-06 13:40:47 notice 10.16.44.222:30367(LEFT/error) leaving cluster grs-test-mrg15
2009-nov-06 13:40:47 notice Shut down

Also, if before step 9 you run the receiver in browse mode against each node, the two nodes give different results.
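
Reading the log above: one member raised error 623 (a confirmation for more commands than it had sent) while the other member did not, and the member that raised the error then left the cluster. A minimal Python sketch of that behaviour follows. It is a toy model and not the qpid-cpp implementation; the function names are hypothetical, and the "sent" count of 7 on the first node is an assumption made only so that the first node accepts the confirmation.

```python
class InvalidArgument(Exception):
    """Stand-in for the invalid-argument execution exception in the log."""


def check_confirmed(sent_up_to, confirmed_up_to):
    """Reject a confirmation that covers commands this member never sent."""
    if confirmed_up_to > sent_up_to:
        raise InvalidArgument(
            f"confirmed < ({confirmed_up_to}+0) but only sent < ({sent_up_to}+0)"
        )


def members_that_must_leave(sent_by_member, confirmed_up_to):
    """Apply the same confirmation on every member and compare outcomes.

    If the error occurs on some members but not on all of them, the
    members that saw the error are treated as inconsistent and leave.
    """
    failed = set()
    for member, sent_up_to in sent_by_member.items():
        try:
            check_confirmed(sent_up_to, confirmed_up_to)
        except InvalidArgument:
            failed.add(member)
    # Error on every member, or on none: the members still agree.
    if len(failed) in (0, len(sent_by_member)):
        return set()
    return failed


# Assumed per-member state: the first node has sent up to 7, the new node only 4.
state = {"10.16.44.222:30294": 7, "10.16.44.222:30367": 4}
leaving = members_that_must_leave(state, confirmed_up_to=7)
```

In this sketch `leaving` contains only `10.16.44.222:30367`, which matches the member that logs "leaving cluster" above.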

Expected results:

No errors. Browsing the queue from either node (prior to step 9) should show the same results.

Additional info:

The issue is that the position field of messages in the list of delivery records is not being correctly set. That position is used in browsing so the inconsistency causes messages to be delivered on one node but not the other.
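
To make the mechanism concrete, here is a small Python toy model of the reproducer. It is only a sketch under stated assumptions and not the broker code: `ToyNode`, `join_cluster` and `run_scenario` are hypothetical names, an unset position is modelled as 0, and browsing is modelled as "return every message whose position is beyond the browser's cursor".

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Toy cluster member. Hypothetical model, not the qpid-cpp broker."""
    queue: list = field(default_factory=list)    # (position, payload) still enqueued
    unacked: list = field(default_factory=list)  # (position, payload) delivered, not acked
    next_position: int = 1

    def publish(self, payload):
        self.queue.append((self.next_position, payload))
        self.next_position += 1

    def deliver_all(self):
        """Hand every queued message to a receiver that has not acked yet."""
        self.unacked.extend(self.queue)
        self.queue.clear()

    def requeue_unacked(self):
        """The receiver went away: put its unacked messages back on the queue."""
        self.queue.extend(self.unacked)
        self.unacked.clear()

    def browse(self, cursor=0):
        """Return payloads whose position lies beyond the browser's cursor."""
        ordered = sorted(self.queue, key=lambda entry: entry[0])
        return [payload for position, payload in ordered if position > cursor]


def join_cluster(existing, copy_positions):
    """Build the joining node's state from an existing member's state."""
    new = ToyNode(queue=list(existing.queue), next_position=existing.next_position)
    for position, payload in existing.unacked:
        # Assumption: a delivery record whose position is not set is modelled as 0.
        new.unacked.append((position if copy_positions else 0, payload))
    return new


def run_scenario(copy_positions):
    first = ToyNode()
    for message in ("b1", "b2", "b3"):
        first.publish(message)
    first.deliver_all()                           # steps 3-4: b1..b3 delivered, unacked
    second = join_cluster(first, copy_positions)  # step 6: second node joins
    for message in ("a1", "a2", "a3"):            # step 7: replicated to both nodes
        first.publish(message)
        second.publish(message)
    first.requeue_unacked()                       # step 8: receiver killed
    second.requeue_unacked()
    return first.browse(), second.browse()
```

In this model, `run_scenario(copy_positions=False)` returns different browse results for the two nodes (the requeued b1..b3 are skipped on the new node), while `run_scenario(copy_positions=True)` returns the same list for both.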

Comment 1 Gordon Sim 2009-11-09 14:51:37 UTC
Fixed on trunk by r834026.

Comment 3 Irina Boverman 2009-11-09 19:12:43 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Corrected problem with positioning of unacked messages on the new node. If there were messages that have been delivered from a queue but not
acknowledged at the point that a new node joins, the position of those unacked messages was incorrect on the new node and could have caused second node to exit with an error.

Comment 5 Frantisek Reznicek 2009-11-16 15:37:04 UTC
Reproduced on -29, working on semi-automated test to be sure this issue is gone.

Comment 6 Frantisek Reznicek 2009-11-18 10:29:56 UTC
The issue has been fixed on RHEL 5.4 i386 / x86_64 on packages:
[root@mrg-qe-02 bz533361]# rpm -qa | grep -E '(qpid|opena|rhm|qmf)' | sort -u
condor-qmf-plugins-7.4.1-0.5.el5
openais-0.80.6-8.el5_4.1
openais-debuginfo-0.80.6-8.el5_4.1
openais-devel-0.80.6-8.el5_4.1
python-qpid-0.5.758389-2.el5
qmf-0.5.752581-34.el5
qmf-devel-0.5.752581-34.el5
qpidc-0.5.752581-34.el5
qpidc-debuginfo-0.5.752581-34.el5
qpidc-devel-0.5.752581-34.el5
qpidc-perftest-0.5.752581-34.el5
qpidc-rdma-0.5.752581-34.el5
qpidc-ssl-0.5.752581-34.el5
qpidd-0.5.752581-34.el5
qpidd-acl-0.5.752581-34.el5
qpidd-cluster-0.5.752581-34.el5
qpidd-devel-0.5.752581-34.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-34.el5
qpidd-ssl-0.5.752581-34.el5
qpidd-xml-0.5.752581-34.el5
qpid-java-client-0.5.751061-9.el5
qpid-java-common-0.5.751061-9.el5
rhm-0.5.3206-27.el5
rhm-debuginfo-0.5.3206-27.el5
rhm-docs-0.5.756148-1.el5
rh-qpid-tests-0.5.752581-34.el5

->VERIFIED


P.S. The node no longer leaves the cluster, and browsing gives the same result on both nodes.

Comment 7 Lana Brindley 2009-11-23 06:42:55 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,2 +1,10 @@
-Corrected problem with positioning of unacked messages on the new node. If there were messages that have been delivered from a queue but not
+Messaging bug fix
-acknowledged at the point that a new node joins, the position of those unacked messages was incorrect on the new node and could have caused second node to exit with an error.
+
+C: If there are messages that have been delivered from a queue but not acknowledged, when a new node joins the position of those unacked messages were incorrect on the new node.
+C: The new node would exit with an error.
+F: The position field of messages in the list of delivery records is now being correctly set.  
+R: Browsing the queue from either node now shows
+the same results, and new nodes are no longer unexpectedly exiting.
+
+If there were messages that had been delivered from a queue but not acknowledged, when a new node joined, the position of those unacked messages were incorrect on the new node. This resulted in the new node exiting with an error. The position field of messages in the list of delivery records is now being correctly set. Browsing the queue from either node now shows
+the same results, and new nodes are no longer unexpectedly exiting.

Comment 8 errata-xmlrpc 2009-12-03 09:15:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html