Bug 557243 - Cluster recovery of persistent queues can crash nodes with Execution exception
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: 1.3
Target Release: ---
Assignee: Alan Conway
QA Contact: ppecka
URL:
Whiteboard:
Depends On:
Blocks: 556351
 
Reported: 2010-01-20 19:44 UTC by Kim van der Riet
Modified: 2010-10-20 11:28 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-20 11:28:37 UTC
Target Upstream Version:
Embargoed:


Attachments
Description of how to reproduce bug (5.48 KB, text/plain)
2010-01-20 19:44 UTC, Kim van der Riet

Description Kim van der Riet 2010-01-20 19:44:45 UTC
Created attachment 385771 [details]
Description of how to reproduce bug

A cluster can behave badly after recovering all of its nodes from persistent stores. In particular, the following errors are seen on one or more nodes, which then leave the cluster:

2010-01-20 14:09:15 error Execution exception: invalid-argument: anonymous.f06a1d50-05ae-401f-bfa7-286758fab447: confirmed < (5+0) but only sent < (4+0) (qpid/SessionState.cpp:151)
2010-01-20 14:09:15 critical cluster(10.16.16.49:15910 READY/error) local error 699 did not occur on member 10.16.16.49:15961: invalid-argument: anonymous.f06a1d50-05ae-401f-bfa7-286758fab447: confirmed < (5+0) but only sent < (4+0) (qpid/SessionState.cpp:151)
2010-01-20 14:09:15 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: anonymous.f06a1d50-05ae-401f-bfa7-286758fab447: confirmed < (5+0) but only sent < (4+0) (qpid/SessionState.cpp:151) (qpid/cluster/ErrorCheck.cpp:89)
2010-01-20 14:09:15 notice cluster(10.16.16.49:15910 LEFT/error) leaving cluster clusterX
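
The invariant behind this message is that a peer can never confirm more commands than the session has actually sent; after recovery the nodes evidently disagree about the send point, so the check fires. A minimal sketch of that check, using hypothetical simplified types (the real one lives in qpid/SessionState.cpp:151):

// Sketch only: SessionPoint and the exception type here are simplified
// stand-ins, not the real qpid classes.
#include <cstdint>
#include <iostream>
#include <sstream>
#include <stdexcept>

struct SessionPoint {
    std::uint32_t command;  // id of the next command
    std::uint64_t offset;   // byte offset within a partially sent command
};

void checkConfirmed(const SessionPoint& confirmed, const SessionPoint& sent) {
    // A confirmation may never run past what was actually sent.
    if (confirmed.command > sent.command) {
        std::ostringstream msg;
        msg << "invalid-argument: confirmed < (" << confirmed.command << "+"
            << confirmed.offset << ") but only sent < (" << sent.command
            << "+" << sent.offset << ")";
        throw std::runtime_error(msg.str());
    }
}

int main() {
    try {
        checkConfirmed({5, 0}, {4, 0});  // the mismatch from the log above
    } catch (const std::exception& e) {
        std::cout << "error Execution exception: " << e.what() << "\n";
    }
}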

Revisions:
Qpid: 900860
Store: 3809

The method for reproduction is described in the attached file.

Comment 1 Kim van der Riet 2010-02-02 16:35:15 UTC
While testing on r905680, I see that the behavior of this bug has changed slightly. Now, when the three nodes are restarted after the shutdown, the second and third nodes fail with an error:

error Exchange already created: nfl.scores (MessageStoreImpl.cpp:529)

during the cluster catch-up.

Comment 2 Alan Conway 2010-02-02 19:46:25 UTC
The issue here is a broker with a clean store trying to join a cluster that is already running.

The decision to push or recover the store is made early (in Cluster::Cluster), based only on whether the store is clean (recover) or dirty (push). This happens before initMapCompleted, when the broker learns the disposition of the other cluster members and their stores.

So if a broker with a clean store joins a running cluster, it has already allowed the store to recover by the time it discovers that there are already active cluster members. The current code attempts an update in this case, which fails as above.

At the very least there should be a better error message, along the lines of "clean store can't join cluster, delete my store", but what we really want in this scenario is for the broker to ditch its store and join.

We can probably break the initialStatus negotiation into two phases (a sketch follows this list):
1. In the ctor, wait until we have status from all _currently_ running brokers and make the push/recover decision based on that.
2. After the ctor, continue to full completion, i.e. wait for the N brokers given by --cluster-size=N.
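
A minimal sketch of what that two-phase decision could look like (hypothetical standalone types, not the actual qpid::cluster interfaces):

// Sketch only: names and types here are illustrative.
#include <cassert>
#include <cstddef>

enum class StoreState  { Clean, Dirty };
enum class StoreAction { Recover, Push };  // recover from disk vs. take an update

// Phase 1 (in the ctor): once status has arrived from every *currently*
// running broker, decide what to do with the local store.
StoreAction decideStoreAction(StoreState local, std::size_t activeMembers) {
    if (activeMembers > 0)
        return StoreAction::Push;  // cluster already running: ditch the local
                                   // store and take an update, even if clean
    return local == StoreState::Clean ? StoreAction::Recover
                                      : StoreAction::Push;
}

// Phase 2 (after the ctor): wait for the full expected membership before
// declaring initialization complete.
bool initComplete(std::size_t membersReported, std::size_t clusterSize) {
    return membersReported >= clusterSize;  // clusterSize from --cluster-size=N
}

int main() {
    // The failing case in this bug: a clean store joining a running
    // cluster must take an update, not recover from disk.
    assert(decideStoreAction(StoreState::Clean, 2) == StoreAction::Push);
    // The first broker up with a clean store may recover from it.
    assert(decideStoreAction(StoreState::Clean, 0) == StoreAction::Recover);
    assert(!initComplete(1, 3) && initComplete(3, 3));
    return 0;
}

The point of the split is that the push/recover choice waits for phase 1 instead of being taken in the ctor from the local store state alone.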

There's also a need for docs/release notes to explain the use of --cluster-size for persistent clusters, and perhaps we should enforce --cluster-size > 1 for persistent brokers.

Comment 3 jrd 2010-03-08 16:49:16 UTC
Alan, where did we end up on this one?

Comment 4 Alan Conway 2010-03-08 19:48:29 UTC
I haven't done anything on this beyond what it says in comment 2. Why is it marked MODIFIED?

Comment 5 Alan Conway 2010-03-09 17:38:02 UTC
The fix for this is to drop the --cluster-size option. The only function it serves is to allow multiple clean brokers to recover from their stores rather than having the first broker recover and the rest get an update. This isn't an important enough optimization to justify the extra configuration complexity.

Comment 6 Alan Conway 2010-03-09 20:18:54 UTC
Comment 5 is not correct; we do need --cluster-size. Comment 2 has the right solution.

Comment 7 Alan Conway 2010-03-10 19:16:19 UTC
The binding to <unknown> seems to be independent of the persistent cluster start-up issue; it has been assigned to bug 572221.

Comment 8 Alan Conway 2010-03-12 20:12:25 UTC
Fixed in r922412

Comment 9 ppecka 2010-05-17 15:13:55 UTC
Bug reproduced and the fix verified on RHEL 5.5 (i386/x86_64):

# rpm -qa | grep -E '(ais|qpid)'
qpid-cpp-client-0.7.935473-1.el5
qpid-cpp-server-xml-0.7.935473-1.el5
qpid-tools-0.7.934605-2.el5
openais-0.80.6-16.el5_5.1
qpid-cpp-server-ssl-0.7.935473-1.el5
qpid-cpp-server-cluster-0.7.935473-1.el5
qpid-cpp-client-ssl-0.7.935473-1.el5
qpid-java-common-0.7.934605-1.el5
qpid-java-client-0.7.934605-1.el5
qpid-cpp-server-store-0.7.935473-1.el5
qpid-cpp-server-0.7.935473-1.el5
python-qpid-0.7.938298-1.el5

--> VERIFIED

Opened a new bug for clarification of the "cluster-size" option:
https://bugzilla.redhat.com/show_bug.cgi?id=592995

