Created attachment 461406
qpidd log file

Description of problem:
Clustered MRG (2 nodes) - second node keeps failing and leaving the cluster. It can be restarted and rejoins ok, but gets the same error in about an hour.

From log:
Nov 1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 error Execution exception: invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154)
Nov 1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 critical cluster(192.168.239.2:25461 READY/error) local error 7960557 did not occur on member 192.168.239.1:4877: invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154)
Nov 1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
Nov 1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 notice cluster(192.168.239.2:25461 LEFT/error) leaving cluster ProdCluster
Nov 1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 notice Shut down

Version-Release number of selected component (if applicable):
MRG 1.3 (qpid-cpp-server-0.7.946106-17)

How reproducible:
Not sure -- it seems pretty consistent in customer environment

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Production environment, always the same node that gets the error.
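For anyone triaging similar logs, the session name and the two counters can be pulled out of the "invalid-argument" line with a small script. This is only a sketch: the regular expression is inferred from the single log line quoted above, not from any documented qpidd log format, and `parse_session_error` is a hypothetical helper name.

```python
import re

# Pattern inferred from the log line in this report (an assumption, not a
# documented qpidd format): "<session>: confirmed < (N+0) but only sent < (M+0)".
PATTERN = re.compile(
    r"invalid-argument: (?P<session>\S+): "
    r"confirmed < \((?P<confirmed>\d+)\+\d+\) "
    r"but only sent < \((?P<sent>\d+)\+\d+\)"
)


def parse_session_error(line):
    """Return (session, confirmed, sent) or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return (
        match.group("session"),
        int(match.group("confirmed")),
        int(match.group("sent")),
    )


# The first error line from the report, copied verbatim.
line = (
    "Nov 1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 error "
    "Execution exception: invalid-argument: anonymous.mrg01.3873.1: "
    "confirmed < (931639+0) but only sent < (931638+0) "
    "(qpid/SessionState.cpp:154)"
)

session, confirmed, sent = parse_session_error(line)
# The failing broker saw a confirmation for one more command than it had sent.
print(session, confirmed - sent)
```

In the quoted line the confirmed counter is exactly one ahead of the sent counter.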
This might be a duplicate of Bug 655078 or Bug 648927
*** Bug 658198 has been marked as a duplicate of this bug. ***
Partial fix upstream:
r1056378 - QPID-2982: Improved cluster/management logging and automated test for log consistency.
r1058664 - QPID-2982: Fix discrepancy in management object and deleted object counts.

Fixes one possible cause of this bug, still testing for complete fix.
Committed partial fix r1061308

Bug 654872, QPID-3007: Batch management messages by count, not size.

QMF V1 management messages were being batched by accumulating up to a certain total size of data. Since management messages may have different sizes on brokers in a cluster, this was leading to inconsistencies. This patch batches V1 messages by count rather than by size, similar to V2 messages.
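To illustrate why size-based batching can diverge between cluster members while count-based batching does not, here is a minimal sketch. It is not the Qpid code: `batch_by_size`, `batch_by_count`, the byte limit and the message sizes are all made up for the example, and it only assumes that two brokers encode the same sequence of logical updates with slightly different lengths.

```python
def batch_by_size(messages, max_bytes):
    """Start a new batch whenever adding a message would exceed max_bytes."""
    batches, current, used = [], [], 0
    for msg in messages:
        if current and used + len(msg) > max_bytes:
            batches.append(current)
            current, used = [], 0
        current.append(msg)
        used += len(msg)
    if current:
        batches.append(current)
    return batches


def batch_by_count(messages, max_count):
    """Cut a batch after every max_count messages, regardless of size."""
    return [messages[i:i + max_count] for i in range(0, len(messages), max_count)]


def shape(batches):
    """Number of messages in each batch, i.e. where the batch boundaries fall."""
    return [len(batch) for batch in batches]


# Hypothetical data: the same four updates, but the second one is two bytes
# longer on broker B than on broker A.
broker_a = [b"x" * 10, b"x" * 10, b"x" * 10, b"x" * 10]
broker_b = [b"x" * 10, b"x" * 12, b"x" * 10, b"x" * 10]

print(shape(batch_by_size(broker_a, 20)))   # [2, 2]
print(shape(batch_by_size(broker_b, 20)))   # [1, 1, 2]
print(shape(batch_by_count(broker_a, 2)))   # [2, 2]
print(shape(batch_by_count(broker_b, 2)))   # [2, 2]
```

With the size limit, broker B sends three batches where broker A sends two, so the two brokers no longer emit the same sequence of messages. With the count limit, the batch boundaries are identical on both brokers.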
The following commits address this issue (svn revisions in parentheses):

- Bug 654872 - MRG clustered node fails with invalid-argument - fix object counts (1058664)
- Bug 654872 - MRG clustered node fails with invalid-argument error - verify log consistency. (1056378)
- Bug 662765 - Management broker ID should be the same for members of a cluster. (1049566)
- Bug 669452 - Creating a route and using management tools can crash cluster members (1060568)
- Bug 654872, QPID-3007: Batch management messages by count, not size. (1061308)
- Bug 669343 - Inconsistency in management object ids due to disambiguation (1060401)
In build for 1.3.2 RC 2
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Using a management console (e.g. qpid-tool or cumin) with a cluster could lead to inconsistencies on management queues.
Consequence: A broker in a cluster could shut down with an "invalid-argument" error
Fix: management actions are replicated properly in the cluster.
Result: Using management console should not cause cluster brokers to shut down.
The issue has been fixed, tested on RHEL 5.6 i386 / x86_64 on packages:

python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5

VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-Cause: Using a management console (e.g. qpid-tool or cumin) with a cluster could lead to inconsistencies on management queues.
+Under certain circumstances, using a management console such as qpid-tool or cumin in a clustered environment could lead to inconsistencies in management queues. When this happened, a broker in such cluster could shut down with an "invalid-argument" error. This update ensures that management actions in a cluster are replicated properly, and using a management console no longer causes cluster brokers to shut down.
-Consequence: A broker in a cluster could shut down with an "invalid-argument" error
-Fix: management actions are replicated properly in the cluster.
-Result: Using management console should not cause cluster brokers to shut down.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html