654872 – MRG clustered node fails with invalid-argument error

Summary: MRG clustered node fails with invalid-argument error

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	1.3.2-RC2
Target Release:	---
Assignee:	Alan Conway
QA Contact:	Petr Matousek
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	658198
Depends On:	648927 655078 655141 669343 669452
Blocks:

Reported:	2010-11-18 22:11 UTC by Mike Cressman
Modified:	2018-11-26 19:30 UTC
CC List:	8 users
Fixed In Version:	qpid-cpp-mrg-0.7.946106-27
Doc Type:	Bug Fix
Doc Text:	Under certain circumstances, using a management console such as qpid-tool or cumin in a clustered environment could lead to inconsistencies in management queues. When this happened, a broker in such a cluster could shut down with an "invalid-argument" error. This update ensures that management actions in a cluster are replicated properly, and using a management console no longer causes cluster brokers to shut down.
Clone Of:
Environment:
Last Closed:	2011-02-15 12:10:40 UTC
Target Upstream Version:
Embargoed:

Attachments
qpidd log file (2.83 MB, application/x-gzip) 2010-11-18 22:11 UTC, Mike Cressman	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0217	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update	2011-02-15 12:10:15 UTC

Description Mike Cressman 2010-11-18 22:11:28 UTC

Created attachment 461406
qpidd log file

Description of problem:
Clustered MRG (2 nodes) - second node keeps failing and leaving the cluster.  It can be restarted and rejoins ok, but gets the same error in about an hour.

From log:
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 error Execution exception: invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154)
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 critical cluster(192.168.239.2:25461 READY/error) local error 7960557 did not occur on member 192.168.239.1:4877: invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154)
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: anonymous.mrg01.3873.1: confirmed < (931639+0) but only sent < (931638+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 notice cluster(192.168.239.2:25461 LEFT/error) leaving cluster ProdCluster
Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 notice Shut down
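For anyone triaging similar logs, here is a minimal sketch that pulls the session name and the two counters out of the "confirmed ... but only sent ..." message. The helper name parse_confirm_error and the regular expression are illustrative assumptions based only on the log excerpt above; they are not part of qpid. The "+0" suffix is assumed to be an offset within the command and is ignored.

```python
import re

# Pattern modelled on the error text in the log excerpt above.
# Hypothetical helper for log triage; not a qpid API.
PATTERN = re.compile(
    r"invalid-argument: (?P<session>\S+): "
    r"confirmed < \((?P<confirmed>\d+)\+\d+\) "
    r"but only sent < \((?P<sent>\d+)\+\d+\)"
)


def parse_confirm_error(line):
    """Return (session, confirmed, sent), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return (
        match.group("session"),
        int(match.group("confirmed")),
        int(match.group("sent")),
    )


# First error line from the log excerpt in this report.
line = (
    "Nov  1 17:16:22 mrg02 qpidd[25461]: 2010-11-01 17:16:22 error "
    "Execution exception: invalid-argument: anonymous.mrg01.3873.1: "
    "confirmed < (931639+0) but only sent < (931638+0) "
    "(qpid/SessionState.cpp:154)"
)

session, confirmed, sent = parse_confirm_error(line)
print(session, "confirmed exceeds sent by", confirmed - sent)
```

In the excerpt the confirmed counter is one ahead of the sent counter, which is consistent with the failing node having recorded one fewer sent command on that session than its peer.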


Version-Release number of selected component (if applicable):
MRG 1.3 (qpid-cpp-server-0.7.946106-17)

How reproducible:
Not sure -- it seems pretty consistent in customer environment

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Production environment, always the same node that gets the error.

Comment 3 Alan Conway 2010-11-19 15:00:36 UTC

This might be a duplicate of Bug 655078 or Bug 648927

Comment 7 Alan Conway 2010-12-08 20:12:47 UTC

*** Bug 658198 has been marked as a duplicate of this bug. ***

Comment 12 Alan Conway 2011-01-13 17:10:44 UTC

Partial fix upstream: 
r1056378 - QPID-2982: Improved cluster/management logging and automated test for log consistency.
r1058664 - QPID-2982: Fix discrepancy in management object and deleted object counts.

Fixes one possible cause of this bug, still testing for complete fix.

Comment 14 Alan Conway 2011-01-20 14:13:49 UTC

Committed partial fix r1061308

    Bug 654872, QPID-3007: Batch management messages by count, not size.

    QMF V1 management messages were being batched by accumulating up to a
    certain total size of data. Since management messages may have
    different sizes on brokers in a cluster, this was leading to
    inconsistencies.

    This patch batches V1 messages by count rather than by size, similar
    to V2 messages.
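The effect described in this commit message can be illustrated with a small sketch. It is not the broker's code: the function names, message sizes, and limits below are invented for illustration. It only shows why a size threshold can split the same sequence of management messages differently on two brokers, while a count threshold cannot.

```python
def batch_by_size(sizes, max_bytes):
    """Group message sizes into batches, starting a new batch whenever
    adding the next message would push the total past max_bytes."""
    batches, current, total = [], [], 0
    for size in sizes:
        if current and total + size > max_bytes:
            batches.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        batches.append(current)
    return batches


def batch_by_count(sizes, max_count):
    """Group message sizes into batches of at most max_count messages."""
    return [sizes[i:i + max_count] for i in range(0, len(sizes), max_count)]


def batch_lengths(batches):
    """Number of messages in each batch."""
    return [len(batch) for batch in batches]


# The same four logical management messages on two brokers. The second
# message is assumed to encode to a slightly larger size on broker B.
broker_a = [40, 40, 40, 40]
broker_b = [40, 45, 40, 40]

# Size-based batching: the batch boundaries differ between the brokers.
print(batch_lengths(batch_by_size(broker_a, 120)))
print(batch_lengths(batch_by_size(broker_b, 120)))

# Count-based batching: the boundaries are identical on both brokers.
print(batch_lengths(batch_by_count(broker_a, 2)))
print(batch_lengths(batch_by_count(broker_b, 2)))
```

With size-based batching the two brokers group the same messages differently, which is the kind of inconsistency the commit message refers to. With count-based batching the grouping depends only on how many messages there are, so it is the same on every member.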

Comment 15 Alan Conway 2011-01-20 14:59:14 UTC

The following set of commits address this issue, svn revisions in ():

- Bug 654872 - MRG clustered node fails with invalid-argument - fix object counts (1058664)
- Bug 654872 - MRG clustered node fails with invalid-argument error - verify log consistency. (1056378)
- Bug 662765 - Management broker ID should be the same for members of a cluster. (1049566)

- Bug 669452 - Creating a route and using management tools can crash cluster members (1060568)
- Bug 654872, QPID-3007: Batch management messages by count, not size. (1061308)
- Bug 669343 - Inconsistency in management object ids due to disambiguation (1060401)

Comment 17 Mike Cressman 2011-01-24 14:36:44 UTC

In build for 1.3.2 RC 2

Comment 18 Alan Conway 2011-01-26 19:52:54 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Using a management console (e.g. qpid-tool or cumin) with a cluster could lead to inconsistencies on management queues.
Consequence: A broker in a cluster could shut down with an "invalid-argument" error
Fix: management actions are replicated properly in the cluster.
Result: Using management console should not cause cluster brokers to shut down.

Comment 20 Petr Matousek 2011-02-01 13:22:15 UTC

The issue has been fixed, tested on RHEL 5.6 i386 / x86_64 on packages:
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5

VERIFIED
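A rough way to check whether a given qpid-cpp package string is at or above the fixed release (0.7.946106-27, per the Fixed In Version field) is sketched below. The helper has_fix and its parsing are assumptions for illustration. It compares only the integer release number within the same upstream version, which is a simplification of real RPM version comparison, and it applies only to the qpid-cpp packages, since the python-qpid, qpid-java and qpid-tools packages listed above use different release numbers.

```python
import re

# Values taken from this report's "Fixed In Version" field.
FIXED_VERSION = "0.7.946106"
FIXED_RELEASE = 27

# name-version-release, with an optional ".dist" suffix such as ".el5".
NVR = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>\d+)(?:\.(?P<dist>[^-]+))?$"
)


def has_fix(nvr):
    """Return True if a qpid-cpp package string is at or above the fixed release.

    Hypothetical helper: handles only the upstream version named in this report.
    """
    match = NVR.match(nvr)
    if match is None:
        raise ValueError("unrecognised package string: %r" % nvr)
    if match.group("version") != FIXED_VERSION:
        raise ValueError("this sketch only compares releases of " + FIXED_VERSION)
    return int(match.group("release")) >= FIXED_RELEASE


# Package reported as affected, and package verified as fixed.
print(has_fix("qpid-cpp-server-0.7.946106-17"))
print(has_fix("qpid-cpp-server-0.7.946106-27.el5"))
```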

Comment 21 Jaromir Hradilek 2011-02-08 16:25:51 UTC

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1 @@
-Cause: Using a management console (e.g. qpid-tool or cumin) with a cluster could lead to inconsistencies on management queues.
+Under certain circumstances, using a management console such as qpid-tool or cumin in a clustered environment could lead to inconsistencies in management queues. When this happened, a broker in such cluster could shut down with an "invalid-argument" error. This update ensures that management actions in a cluster are replicated properly, and using a management console no longer causes cluster brokers to shut down.
-Consequence: A broker in a cluster could shut down with an "invalid-argument" error
-Fix: management actions are replicated properly in the cluster.
-Result: Using management console should not cause cluster brokers to shut down.

Comment 22 errata-xmlrpc 2011-02-15 12:10:40 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html
