Bug 619759 - cluster failover issues
Summary: cluster failover issues
Status: CLOSED DUPLICATE of bug 620418
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: beta
Hardware: All
OS: Linux
Target Milestone: 1.3
Assignee: Alan Conway
QA Contact: MRG Quality Engineering
Depends On:
Reported: 2010-07-30 13:16 UTC by Graham Biswell
Modified: 2014-08-15 01:43 UTC
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Last Closed: 2010-08-03 15:40:55 UTC

Attachments
logs & conf files for both nodes (3.60 MB, application/x-tar)
2010-07-30 13:23 UTC, Graham Biswell

Description Graham Biswell 2010-07-30 13:16:59 UTC
Testing clustering with the 1.3 beta ...

- Applications and brokers all stopped.
- Start both brokers.
- Perform a couple of failovers to verify nodes leave and join the cluster successfully.
- Start our application suite.
- Shut down one broker.
- Some apps fail over successfully, some do not (theory: those that use durable topic subscriptions do not survive).
- Attempt to start the stopped broker. Cluster rejoin fails.
- Try once more; same failure.
- Shut down the application suite, except for a single monitoring app.
- Start the stopped broker. This time it successfully rejoins the cluster.
- Perform a few more failovers between the two brokers (checking connectivity via the monitoring app).
- Shut down both brokers.

Across the applications (approx. 20) we use most features of qpid: fanout exchanges, LVQs, ring queues, durable topics, and direct queues. All clients are Java apps.

Comment 1 Graham Biswell 2010-07-30 13:23:13 UTC
Created attachment 435551 [details]
logs & conf files for both nodes

Comment 2 Gordon Sim 2010-07-30 13:35:37 UTC
There are errors in the logs relating to locked exclusive queues, which I believe (as suggested above) relate to durable subscriptions from the JMS client. The sessions owning these queues are not detached at the point the clients fail over.

E.g. the first error in the log for amqb02 is at line 11912 (22:27:08); the session owning that queue doesn't get detached until line 69705 (22:30:49).
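The mechanism Gordon describes can be sketched in a few lines (Python; purely illustrative, with hypothetical class and method names, not the actual qpid-cpp implementation): a durable topic subscription is backed by an exclusive queue, so a failed-over client that tries to re-subscribe before the broker has detached the old session gets refused with resource-locked.

```python
class ResourceLocked(Exception):
    """Stand-in for AMQP's resource-locked error (as seen in the logs)."""

class ExclusiveQueue:
    """Toy model of the exclusive queue backing a durable topic subscription."""

    def __init__(self, name):
        self.name = name
        self.owner = None  # session currently holding exclusive access

    def attach(self, session):
        # Only one session may hold the queue at a time.
        if self.owner is not None and self.owner != session:
            raise ResourceLocked(
                f"Cannot grant exclusive access to queue {self.name}")
        self.owner = session

    def detach(self, session):
        if self.owner == session:
            self.owner = None

queue = ExclusiveQueue("sub-queue")
queue.attach("session-before-failover")
try:
    # Client has failed over and re-subscribes, but the broker has not
    # yet detached the old session owning the queue.
    queue.attach("session-after-failover")
except ResourceLocked as e:
    print("rejected:", e)
queue.detach("session-before-failover")  # old session finally detached
queue.attach("session-after-failover")   # now the re-subscribe succeeds
```

In this model the window between the client failing over and the broker detaching the old session (lines 11912 to 69705 in the amqb02 log) is exactly when re-subscription attempts fail.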

The failure to rejoin appears to be due to an inconsistent error during update:

E.g. in the other log (for amqb01):

2010-07-29 22:32:11 critical cluster( CATCHUP/error) local error 34184 did not occur on member resource-locked: Cannot grant exclusive access to queue _admin@amq.topic_98760bd9-072f-4b25-91c3-42c4ab65a169 (qpid/broker/SessionAdapter.cpp:399)
2010-07-29 22:32:11 debug Exception constructed: local error did not occur on all cluster members : resource-locked: Cannot grant exclusive access to queue _admin@amq.topic_98760bd9-072f-4b25-91c3-42c4ab65a169 (qpid/broker/SessionAdapter.cpp:399) (qpid/cluster/ErrorCheck.cpp:89)
2010-07-29 22:32:11 critical Error delivering frames: local error did not occur on all cluster members : resource-locked: Cannot grant exclusive access to queue _admin@amq.topic_98760bd9-072f-4b25-91c3-42c4ab65a169 (qpid/broker/SessionAdapter.cpp:399) (qpid/cluster/ErrorCheck.cpp:89)
2010-07-29 22:32:11 notice cluster( LEFT/error) leaving cluster intg1
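Reading these log lines, the rejoin fails because the cluster's error consistency check fires: a node that raises an error locally compares it against what the other members saw, and if they disagree the replicated state has diverged, so the node leaves the cluster rather than continue. A minimal sketch of such a check (illustrative only; names are assumptions and this is not the real ErrorCheck.cpp logic):

```python
class InconsistentError(Exception):
    """Raised when a local error did not occur on all cluster members."""

def check_error(local_error, member_reports):
    """Compare a locally raised error against what each other member saw.

    member_reports maps member id -> error string (None if no error).
    If any member did not see the same error, replicated state has
    diverged and this node must leave the cluster.
    """
    disagreeing = [m for m, err in member_reports.items() if err != local_error]
    if disagreeing:
        raise InconsistentError(
            f"local error did not occur on all cluster members: {local_error}")
    return True

# The joining node raised resource-locked during catch-up, but the
# established member (amqb02) saw no error for the same event:
try:
    check_error("resource-locked: Cannot grant exclusive access",
                {"amqb02": None})
except InconsistentError as e:
    print("leaving cluster:", e)
```

This matches the log sequence above: the CATCHUP-state node reports "local error ... did not occur on member", then leaves the cluster.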

Comment 3 Gordon Sim 2010-07-30 17:44:04 UTC
I think failover is not relevant to the lines I indicated at the start of the last comment. The errors appear to be the result of an attempt to use the same durable subscription ids, and they occur before the node is shut down.

Comment 4 Alan Conway 2010-08-03 15:40:55 UTC
Although the symptoms are different, the cause is the same as in bug 620418.

*** This bug has been marked as a duplicate of bug 620418 ***
