Bug 533431
| Summary: | inconsistent message positions on node that joins cluster when unacknowledged messages exist | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Gordon Sim <gsim> | ||||
| Component: | qpid-cpp | Assignee: | Gordon Sim <gsim> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 1.1.6 | CC: | esammons, freznice, iboverma, lbrindle | ||||
| Target Milestone: | 1.2 | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: |
Messaging bug fix
C: If messages had been delivered from a queue but not acknowledged when a new node joined, the position of those unacked messages was incorrect on the new node.
C: The new node would exit with an error.
F: The position field of messages in the list of delivery records is now being correctly set.
R: Browsing the queue from either node now shows the same results, and new nodes are no longer unexpectedly exiting.
If there were messages that had been delivered from a queue but not acknowledged when a new node joined, the position of those unacked messages was incorrect on the new node. This resulted in the new node exiting with an error. The position field of messages in the list of delivery records is now being correctly set. Browsing the queue from either node now shows the same results, and new nodes are no longer unexpectedly exiting.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2009-12-03 09:15:47 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 527551 | ||||||
| Attachments: |
|
||||||
Fixed on trunk by r834026.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Corrected problem with positioning of unacked messages on the new node. If there were messages that have been delivered from a queue but not acknowledged at the point that a new node joins, the position of those unacked messages was incorrect on the new node and could have caused second node to exit with an error.

Reproduced on -29, working on semi-automated test to be sure this issue is gone.

The issue has been fixed on RHEL 5.4 i386 / x86_64 on packages:

[root@mrg-qe-02 bz533361]# rpm -qa | grep -E '(qpid|opena|rhm|qmf)' | sort -u
condor-qmf-plugins-7.4.1-0.5.el5
openais-0.80.6-8.el5_4.1
openais-debuginfo-0.80.6-8.el5_4.1
openais-devel-0.80.6-8.el5_4.1
python-qpid-0.5.758389-2.el5
qmf-0.5.752581-34.el5
qmf-devel-0.5.752581-34.el5
qpidc-0.5.752581-34.el5
qpidc-debuginfo-0.5.752581-34.el5
qpidc-devel-0.5.752581-34.el5
qpidc-perftest-0.5.752581-34.el5
qpidc-rdma-0.5.752581-34.el5
qpidc-ssl-0.5.752581-34.el5
qpidd-0.5.752581-34.el5
qpidd-acl-0.5.752581-34.el5
qpidd-cluster-0.5.752581-34.el5
qpidd-devel-0.5.752581-34.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-34.el5
qpidd-ssl-0.5.752581-34.el5
qpidd-xml-0.5.752581-34.el5
qpid-java-client-0.5.751061-9.el5
qpid-java-common-0.5.751061-9.el5
rhm-0.5.3206-27.el5
rhm-debuginfo-0.5.3206-27.el5
rhm-docs-0.5.756148-1.el5
rh-qpid-tests-0.5.752581-34.el5

-> VERIFIED

P.S. No node leaves the cluster anymore, and browsing no longer gives different results.

Release note updated.

Diffed Contents:
@@ -1,2 +1,10 @@
-Corrected problem with positioning of unacked messages on the new node. If there were messages that have been delivered from a queue but not
+Messaging bug fix
-acknowledged at the point that a new node joins, the position of those unacked messages was incorrect on the new node and could have caused second node to exit with an error.
+
+C: If there are messages that have been delivered from a queue but not acknowledged, when a new node joins the position of those unacked messages were incorrect on the new node.
+C: The new node would exit with an error.
+F: The position field of messages in the list of delivery records is now being correctly set.
+R: Browsing the queue from either node now shows
+the same results, and new nodes are no longer unexpectedly exiting.
+
+If there were messages that had been delivered from a queue but not acknowledged, when a new node joined, the position of those unacked messages were incorrect on the new node. This resulted in the new node exiting with an error. The position field of messages in the list of delivery records is now being correctly set. Browsing the queue from either node now shows
+the same results, and new nodes are no longer unexpectedly exiting.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html
|
Created attachment 367863
acquirer program referenced in step 9 of reproducer

Description of problem:
If there are messages that have been delivered from a queue but not acknowledged at the point that a new node joins, the position of those unacked messages is incorrect on the new node. This can end up causing the second node to exit with an error.

Version-Release number of selected component (if applicable):
qpidd-0.5.752581-30.el5
qpidd-cluster-0.5.752581-30.el5

How reproducible:
100%

Steps to Reproduce:
1. start one node cluster
2. qpid-config add queue test-queue
3. for m in b1 b2 b3; do echo $m; done | sender
4. receiver --ack-frequency 5 --credit-window 3 (leave it running or background it)
6. start second node for the cluster
7. for m in a1 a2 a3; do echo $m; done | sender
8. kill receiver started in 4.
9. run attached acquirer client

Actual results:
Second node exits with something like:

2009-nov-06 13:40:47 error Execution exception: invalid-argument: anonymous.66d117b6-49d7-4b0c-9c3f-a561b2ba9f07: confirmed < (7+0) but only sent < (4+0) (qpid/SessionState.cpp:163)
2009-nov-06 13:40:47 error 10.16.44.222:30367(READY/error) channel error 623 on 10.16.44.222:30294-9(shadow): invalid-argument: anonymous.66d117b6-49d7-4b0c-9c3f-a561b2ba9f07: confirmed < (7+0) but only sent < (4+0) (qpid/SessionState.cpp:163) (unresolved: 10.16.44.222:30294 10.16.44.222:30367 )
2009-nov-06 13:40:47 critical 10.16.44.222:30367(READY/error) error 623 did not occur on 10.16.44.222:30294
2009-nov-06 13:40:47 error Error delivering frames: Error 623 did not occur on all members (qpid/cluster/ErrorCheck.cpp:90)
2009-nov-06 13:40:47 notice 10.16.44.222:30367(LEFT/error) leaving cluster grs-test-mrg15
2009-nov-06 13:40:47 notice Shut down

Also, if you run the receiver in browse mode against each node before step 9, the two nodes give different results.

Expected results:
No errors. Browsing the queue from either node (prior to step 9) should show the same results.

Additional info:
The issue is that the position field of messages in the list of delivery records is not being correctly set. That position is used in browsing, so the inconsistency causes messages to be delivered on one node but not the other.
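
To illustrate why a wrong position in a delivery record makes browsing diverge between nodes, here is a toy Python model. It is only a sketch and not qpid-cpp code: the Message class, the copy_to_new_node and browse helpers, the cursor-based browse logic, and the placeholder position 0 used for the buggy case are all assumptions made for illustration. Only the message names b1..b3 and a1..a3 come from the reproducer.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    position: int  # slot in the queue; positions start at 1 in this toy model
    payload: str


def copy_to_new_node(unacked, keep_position):
    """Copy delivery-record messages to a joining node (hypothetical helper).

    keep_position=False mimics the bug: the position is left at 0, an assumed
    placeholder value, instead of being copied from the existing node.
    """
    return [m if keep_position else replace(m, position=0) for m in unacked]


def browse(queue, released):
    """Browse a queue after unacked messages were released back to it.

    The browser keeps a cursor and only accepts messages whose position lies
    beyond it, so a message carrying a bogus position can be skipped.
    """
    cursor = 0
    seen = []
    for m in sorted(queue + released, key=lambda m: m.position):
        if m.position > cursor:
            seen.append(m.payload)
            cursor = m.position
    return seen


# b1..b3 were delivered but not acknowledged; a1..a3 arrived afterwards.
unacked = [Message(1, "b1"), Message(2, "b2"), Message(3, "b3")]
queue = [Message(4, "a1"), Message(5, "a2"), Message(6, "a3")]

first_node = browse(queue, unacked)
new_node_buggy = browse(queue, copy_to_new_node(unacked, keep_position=False))
new_node_fixed = browse(queue, copy_to_new_node(unacked, keep_position=True))

print("first node:      ", first_node)
print("new node (buggy):", new_node_buggy)
print("new node (fixed):", new_node_fixed)
```

In this model the first node sees all six messages, while the new node with unset positions sees only a1..a3. Once the positions are copied correctly, both nodes agree.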