Bug 654872 - MRG clustered node fails with invalid-argument error
Summary: MRG clustered node fails with invalid-argument error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: 1.3.2-RC2
Target Release: ---
Assignee: Alan Conway
QA Contact: Petr Matousek
URL:
Whiteboard:
Duplicates: 658198
Depends On: 648927 655078 655141 669343 669452
Blocks:
 
Reported: 2010-11-18 22:11 UTC by Mike Cressman
Modified: 2018-11-26 19:30 UTC

Fixed In Version: qpid-cpp-mrg-0.7.946106-27
Doc Type: Bug Fix
Doc Text:
Under certain circumstances, using a management console such as qpid-tool or cumin in a clustered environment could lead to inconsistencies in management queues. When this happened, a broker in such a cluster could shut down with an "invalid-argument" error. This update ensures that management actions in a cluster are replicated properly, and using a management console no longer causes cluster brokers to shut down.
Clone Of:
Environment:
Last Closed: 2011-02-15 12:10:40 UTC
Target Upstream Version:
Embargoed:


Attachments
qpidd log file (2.83 MB, application/x-gzip)
2010-11-18 22:11 UTC, Mike Cressman


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0217 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update 2011-02-15 12:10:15 UTC

Description Mike Cressman 2010-11-18 22:11:28 UTC
Created attachment 461406
qpidd log file

Description of problem:
Clustered MRG (2 nodes) - second node keeps failing and leaving the cluster.  It can be restarted and rejoins ok, but gets the same error in about an hour.

From log:
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 error Execution exception: invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154)
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 critical cluster(192.168.239.2:25461 READY/error) local error 7960557 did not occur on member 192.168.239.1:4877: invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154)
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 notice cluster(192.168.239.2:25461 LEFT/error) leaving cluster ProdCluster
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 notice Shut down
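For anyone scanning a large qpidd log for this failure, the following is a minimal Python sketch that pulls the session name and the two counters out of the "confirmed ... but only sent" message quoted above. The regular expression is derived only from the log lines in this report, the function name is hypothetical, and reading the first number in each "(N+0)" pair as a counter is an assumption.

```python
import re

# Matches the SessionState error text as it appears in the log excerpt above.
# Assumption: in "(931639+0)" the first number is the counter of interest.
CONFIRM_ERROR = re.compile(
    r"invalid-argument: (?P<session>\S+): "
    r"confirmed < \((?P<confirmed>\d+)\+\d+\) "
    r"but only sent < \((?P<sent>\d+)\+\d+\)"
)


def parse_confirm_error(line):
    """Return (session, confirmed, sent), or None if the line does not match."""
    match = CONFIRM_ERROR.search(line)
    if match is None:
        return None
    return (
        match.group("session"),
        int(match.group("confirmed")),
        int(match.group("sent")),
    )


sample = (
    "Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 error "
    "Execution exception: invalid-argument: anonymous.mrg01.3873.1: "
    "confirmed < (931639+0) but only sent < (931638+0) "
    "(qpid/SessionState.cpp:154)"
)

parsed = parse_confirm_error(sample)
if parsed is not None:
    session, confirmed, sent = parsed
    # In the excerpt above the confirmed counter is one ahead of the sent counter.
    print(session, confirmed - sent)
```

Applied to the attached log, this would show whether the failing session and the size of the gap are the same on every occurrence.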


Version-Release number of selected component (if applicable):
MRG 1.3 (qpid-cpp-server-0.7.946106-17)

How reproducible:
Not sure -- it seems pretty consistent in customer environment

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Production environment, always the same node that gets the error.

Comment 3 Alan Conway 2010-11-19 15:00:36 UTC
This might be a duplicate of Bug 655078 or Bug 648927

Comment 7 Alan Conway 2010-12-08 20:12:47 UTC
*** Bug 658198 has been marked as a duplicate of this bug. ***

Comment 12 Alan Conway 2011-01-13 17:10:44 UTC
Partial fix upstream: 
r1056378 - QPID-2982: Improved cluster/management logging and automated test for log consistency.
r1058664 - QPID-2982: Fix discrepancy in management object and deleted object counts.

Fixes one possible cause of this bug, still testing for complete fix.

Comment 14 Alan Conway 2011-01-20 14:13:49 UTC
Committed partial fix r1061308

    Bug 654872, QPID-3007: Batch management messages by count, not size.

    QMF V1 management messages were being batched by accumulating up to a
    certain total size of data. Since management messages may have
    different sizes on brokers in a cluster, this was leading to
    inconsistencies.

    This patch batches V1 messages by count rather than by size, similar
    to V2 messages.
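To illustrate the reasoning in the commit message, here is a small Python sketch (not the qpid-cpp code, which is C++). It assumes two brokers emit the same six management messages but encode them with different sizes. The size values, the thresholds and the function names are all made up for the example.

```python
def batch_by_size(sizes, max_bytes):
    """Group message indices, flushing once the accumulated size reaches max_bytes."""
    batches, current, total = [], [], 0
    for index, size in enumerate(sizes):
        current.append(index)
        total += size
        if total >= max_bytes:
            batches.append(current)
            current, total = [], 0
    if current:
        batches.append(current)
    return batches


def batch_by_count(sizes, max_count):
    """Group message indices into batches of at most max_count messages."""
    return [
        list(range(start, min(start + max_count, len(sizes))))
        for start in range(0, len(sizes), max_count)
    ]


# Hypothetical encoded sizes of the same six messages on two cluster members.
broker_a = [100, 100, 100, 100, 100, 100]
broker_b = [150, 150, 100, 100, 100, 100]

# Size-based batching puts the batch boundaries in different places.
print(batch_by_size(broker_a, 300))  # [[0, 1, 2], [3, 4, 5]]
print(batch_by_size(broker_b, 300))  # [[0, 1], [2, 3, 4], [5]]

# Count-based batching depends only on the number of messages, so both agree.
print(batch_by_count(broker_a, 3))   # [[0, 1, 2], [3, 4, 5]]
print(batch_by_count(broker_b, 3))   # [[0, 1, 2], [3, 4, 5]]
```

With size-based batching the two members would send a different number of batches for the same management activity, which is the kind of inconsistency the patch description refers to.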

Comment 15 Alan Conway 2011-01-20 14:59:14 UTC
The following set of commits addresses this issue (svn revisions in parentheses):

- Bug 654872 - MRG clustered node fails with invalid-argument - fix object counts (1058664)
- Bug 654872 - MRG clustered node fails with invalid-argument error - verify log consistency. (1056378)
- Bug 662765 - Management broker ID should be the same for members of a cluster. (1049566)

- Bug 669452 - Creating a route and using management tools can crash cluster members (1060568)
- Bug 654872, QPID-3007: Batch management messages by count, not size. (1061308)
- Bug 669343 - Inconsistency in management object ids due to disambiguation (1060401)

Comment 17 Mike Cressman 2011-01-24 14:36:44 UTC
In build for 1.3.2 RC 2

Comment 18 Alan Conway 2011-01-26 19:52:54 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Using a management console (e.g. qpid-tool or cumin) with a cluster could lead to inconsistencies on management queues.
Consequence: A broker in a cluster could shut down with an "invalid-argument" error
Fix: management actions are replicated properly in the cluster.
Result: Using management console should not cause cluster brokers to shut down.

Comment 20 Petr Matousek 2011-02-01 13:22:15 UTC
The issue has been fixed, tested on RHEL 5.6 i386 / x86_64 on packages:
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5

VERIFIED
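A rough Python sketch for checking whether a given qpid-cpp package string is at or past the fixed build named in "Fixed In Version" (0.7.946106-27). It assumes the name-version-release layout used in the list above and compares only the leading integer of the release, which is a simplification and not RPM's version comparison algorithm. The function names are hypothetical.

```python
FIXED_VERSION = "0.7.946106"
FIXED_RELEASE = 27  # taken from "Fixed In Version: qpid-cpp-mrg-0.7.946106-27"


def split_nvr(nvr):
    """Split 'name-version-release' on its last two hyphens."""
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release


def has_fix(nvr):
    """Return True/False for qpid-cpp packages of the fixed version, else None."""
    name, version, release = split_nvr(nvr)
    if not name.startswith("qpid-cpp-") or version != FIXED_VERSION:
        return None  # out of scope for this naive check
    # Simplification: compare only the leading integer of the release field.
    return int(release.split(".")[0]) >= FIXED_RELEASE


for package in (
    "qpid-cpp-server-0.7.946106-17",           # build from the original report
    "qpid-cpp-server-cluster-0.7.946106-27.el5",
    "python-qpid-0.7.946106-15.el5",
):
    print(package, has_fix(package))
```

The package strings could come from the output of `rpm -qa` on the broker host.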

Comment 21 Jaromir Hradilek 2011-02-08 16:25:51 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1 @@
-Cause: Using a management console (e.g. qpid-tool or cumin) with a cluster could lead to inconsistencies on management queues.
+Under certain circumstances, using a management console such as qpid-tool or cumin in a clustered environment could lead to inconsistencies in management queues. When this happened, a broker in such a cluster could shut down with an "invalid-argument" error. This update ensures that management actions in a cluster are replicated properly, and using a management console no longer causes cluster brokers to shut down.
-Consequence: A broker in a cluster could shut down with an "invalid-argument" error
-Fix: management actions are replicated properly in the cluster.
-Result: Using management console should not cause cluster brokers to shut down.

Comment 22 errata-xmlrpc 2011-02-15 12:10:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html

