Bug 619759 - cluster failover issues

Summary: cluster failover issues

Keywords:
Status:	CLOSED DUPLICATE of bug 620418
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	beta
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	1.3
Target Release:	---
Assignee:	Alan Conway
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-07-30 13:16 UTC by Graham Biswell
Modified:	2014-08-15 01:43 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-08-03 15:40:55 UTC
Target Upstream Version:
Embargoed:

Attachments
logs & conf files for both nodes (3.60 MB, application/x-tar) 2010-07-30 13:23 UTC, Graham Biswell

Description Graham Biswell 2010-07-30 13:16:59 UTC

Testing clustering with the 1.3 beta:

- Applications and brokers all stopped.
- Start both brokers.
- Perform a couple of failovers to verify that nodes leave and join the cluster successfully.
- Start our application suite.
- Shut down one broker.
- Some apps fail over successfully, some do not (theory: those that use durable topic subscriptions do not survive).
- Attempt to start the stopped broker. Cluster rejoin fails.
- Try once more; same failure.
- Shut down the application suite, except for a single monitoring app.
- Start the stopped broker. This time it successfully rejoins the cluster.
- Perform a few more failovers between the two brokers (checking connectivity via the monitoring app).
- Shut down both brokers.

Between them, the applications (approx. 20) use most features of qpid: fanout exchanges, LVQs, ring queues, durable topics, direct queues. All clients are Java apps.

Comment 1 Graham Biswell 2010-07-30 13:23:13 UTC

Created attachment 435551
logs & conf files for both nodes

Comment 2 Gordon Sim 2010-07-30 13:35:37 UTC

There are errors in the logs relating to locked exclusive queues, which I believe (as suggested above) relate to durable subscriptions from the JMS client. The sessions owning these queues are not detached at the point the clients fail over.

E.g. the first error in the log for amqb02 is at line 11912 (22:27:08), but the session owning that queue doesn't get detached until line 69705 (22:30:49).
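
To put a number on that gap, the two timestamps can be subtracted directly. This is only a quick check, assuming both times fall on the same day in the amqb02 log:

```python
from datetime import datetime

# Times quoted above from the amqb02 log; same-day is an assumption.
fmt = "%H:%M:%S"
first_error = datetime.strptime("22:27:08", fmt)
session_detached = datetime.strptime("22:30:49", fmt)

gap = session_detached - first_error
gap_seconds = gap.total_seconds()
print(gap_seconds)  # 221.0, i.e. 3 minutes 41 seconds
```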

The failure to join appears to be down to an inconsistent error during update:

E.g. in the other log (for amqb01):

2010-07-29 22:32:11 critical cluster(10.34.22.64:26810 CATCHUP/error) local error 34184 did not occur on member 10.34.22.65:4830: resource-locked: Cannot grant exclusive access to queue _admin (qpid/broker/SessionAdapter.cpp:399)
2010-07-29 22:32:11 debug Exception constructed: local error did not occur on all cluster members : resource-locked: Cannot grant exclusive access to queue _admin (qpid/broker/SessionAdapter.cpp:399) (qpid/cluster/ErrorCheck.cpp:89)
2010-07-29 22:32:11 critical Error delivering frames: local error did not occur on all cluster members : resource-locked: Cannot grant exclusive access to queue _admin (qpid/broker/SessionAdapter.cpp:399) (qpid/cluster/ErrorCheck.cpp:89)
2010-07-29 22:32:11 notice cluster(10.34.22.64:26810 LEFT/error) leaving cluster intg1
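
For anyone triaging similar logs, lines like the first one above can be picked out mechanically. The following is a minimal sketch, assuming the line layout seen in this excerpt (timestamp, level, `cluster(<member> <state>)`, then the message). The regular expression and the function name `find_inconsistent_errors` are illustrative only and are not part of qpid.

```python
import re

# Pattern inferred from the excerpt above; not an official qpid log grammar.
INCONSISTENT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) critical "
    r"cluster\((?P<local>\S+) (?P<state>[^)]+)\) "
    r"local error (?P<seq>\d+) did not occur on member (?P<member>\S+): "
    r"(?P<error>.*)$"
)

def find_inconsistent_errors(lines):
    """Return one dict per 'local error ... did not occur on member' line."""
    hits = []
    for line in lines:
        m = INCONSISTENT.match(line.strip())
        if m:
            hits.append(m.groupdict())
    return hits

# Two lines copied from the amqb01 excerpt; only the first should match.
sample = [
    "2010-07-29 22:32:11 critical cluster(10.34.22.64:26810 CATCHUP/error) "
    "local error 34184 did not occur on member 10.34.22.65:4830: "
    "resource-locked: Cannot grant exclusive access to queue _admin "
    "(qpid/broker/SessionAdapter.cpp:399)",
    "2010-07-29 22:32:11 notice cluster(10.34.22.64:26810 LEFT/error) "
    "leaving cluster intg1",
]

hits = find_inconsistent_errors(sample)
for hit in hits:
    print(hit["ts"], hit["state"], hit["member"], hit["error"])
```

To scan a whole file, pass the open file object as `lines`; the path to the broker log depends on the local configuration.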

Comment 3 Gordon Sim 2010-07-30 17:44:04 UTC

I think failover is not relevant to the lines I indicated at the start of the last comment. The errors appear to be the result of an attempt to use the same durable subscription ids, and they occur before the node is shut down.
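
The effect of two subscribers using the same durable subscription id can be illustrated with a toy model. This is a sketch only: it assumes, as the log messages suggest, that each durable subscription is backed by a queue held exclusively by the owning session. The names `ToyBroker` and `ResourceLocked` are hypothetical and this is not qpid code.

```python
class ResourceLocked(Exception):
    """Hypothetical stand-in for the broker's resource-locked error."""

class ToyBroker:
    def __init__(self):
        # queue name -> session currently holding it exclusively
        self.owners = {}

    def subscribe_exclusive(self, session, queue):
        owner = self.owners.get(queue)
        if owner is not None and owner != session:
            raise ResourceLocked(
                f"Cannot grant exclusive access to queue {queue}")
        self.owners[queue] = session

    def detach(self, session):
        # Release every queue held by the detaching session.
        self.owners = {q: s for q, s in self.owners.items() if s != session}

broker = ToyBroker()
broker.subscribe_exclusive("session-A", "_admin")

rejected = False
try:
    # A second session reusing the same subscription id maps to the same queue.
    broker.subscribe_exclusive("session-B", "_admin")
except ResourceLocked as e:
    rejected = True
    print("rejected:", e)

broker.detach("session-A")
broker.subscribe_exclusive("session-B", "_admin")  # succeeds once A has detached
```

In this model no failover is needed to produce the error: a second subscriber with the same id is enough while the first session is still attached.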

Comment 4 Alan Conway 2010-08-03 15:40:55 UTC

Although the symptoms are different, the cause is the same as bug 620418.

*** This bug has been marked as a duplicate of bug 620418 ***
