Bug 619759

Summary:

cluster failover issues

Product:

Red Hat Enterprise MRG

Reporter:

Graham Biswell <gbiswell>

Component:

qpid-cpp

Assignee:

Alan Conway <aconway>

Status:

CLOSED DUPLICATE

QA Contact:

MRG Quality Engineering <mrgqe-bugs>

Severity:

medium

Priority:

urgent

Version:

beta

CC:

aconway, gcooper, gsim

Target Milestone:

1.3

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2010-08-03 15:40:55 UTC

Attachments:

Description	Flags
logs & conf files for both nodes	none

Description Graham Biswell 2010-07-30 13:16:59 UTC

Testing clustering with the 1.3 beta ...

- Applications and brokers all stopped.
- Start both brokers.
- Perform a couple of failovers to verify that nodes leave and join the cluster successfully.
- Start our application suite.
- Shut down one broker.
- Some apps fail over successfully, some do not (theory: those that use durable topic subscriptions do not survive).
- Attempt to start the stopped broker. Cluster rejoin fails.
- Try once more, same failure.
- Shut down the application suite, except for a single monitoring app.
- Start the stopped broker. This time it successfully rejoins the cluster.
- Perform a few more failovers between the two brokers (checking connectivity via the monitoring app).
- Shut down both brokers.

Between the applications (approx. 20) we make use of most features of qpid: fanout exchanges, LVQs, ring queues, durable topics, direct queues. All clients are Java apps.

Comment 1 Graham Biswell 2010-07-30 13:23:13 UTC

Created attachment 435551
logs & conf files for both nodes

Comment 2 Gordon Sim 2010-07-30 13:35:37 UTC

There are errors in the logs relating to locked exclusive queues, which I believe (as suggested above) relate to durable subscriptions from the JMS client. The sessions owning these queues are not detached at the point the clients fail over.

E.g. the first error in the log for amqb02 is at line 11912 (22:27:08), but the session owning that queue doesn't get detached until line 69705 (22:30:49).

The failure to join appears to be down to an inconsistent error during update:

E.g. in the other log (for amqb01):

2010-07-29 22:32:11 critical cluster(10.34.22.64:26810 CATCHUP/error) local error 34184 did not occur on member 10.34.22.65:4830: resource-locked: Cannot grant exclusive access to queue _admin (qpid/broker/SessionAdapter.cpp:399)
2010-07-29 22:32:11 debug Exception constructed: local error did not occur on all cluster members : resource-locked: Cannot grant exclusive access to queue _admin (qpid/broker/SessionAdapter.cpp:399) (qpid/cluster/ErrorCheck.cpp:89)
2010-07-29 22:32:11 critical Error delivering frames: local error did not occur on all cluster members : resource-locked: Cannot grant exclusive access to queue _admin (qpid/broker/SessionAdapter.cpp:399) (qpid/cluster/ErrorCheck.cpp:89)
2010-07-29 22:32:11 notice cluster(10.34.22.64:26810 LEFT/error) leaving cluster intg1
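
Lines like the first one above can be pulled out of a large broker log mechanically. The following is a minimal sketch, assuming the log format matches the four lines quoted here; the regular expression and the function name find_inconsistent_errors are illustrative and not part of qpid.

```python
import re

# Pattern derived only from the excerpt above (an assumption, not a qpid spec):
# "<timestamp> critical cluster(<local> <STATE>/error) local error <n>
#  did not occur on member <member>: <error text>"
INCONSISTENT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) critical "
    r"cluster\((?P<local>[\d.]+:\d+) (?P<state>[A-Z]+)/error\) "
    r"local error (?P<seq>\d+) did not occur on member "
    r"(?P<member>[\d.]+:\d+): (?P<error>.*)$"
)


def find_inconsistent_errors(lines):
    """Return one dict per log line reporting an error not seen on another member."""
    hits = []
    for line in lines:
        match = INCONSISTENT.match(line.strip())
        if match:
            hits.append(match.groupdict())
    return hits


# Two lines copied from the amqb01 excerpt; only the first should match.
SAMPLE = [
    "2010-07-29 22:32:11 critical cluster(10.34.22.64:26810 CATCHUP/error) "
    "local error 34184 did not occur on member 10.34.22.65:4830: "
    "resource-locked: Cannot grant exclusive access to queue _admin "
    "(qpid/broker/SessionAdapter.cpp:399)",
    "2010-07-29 22:32:11 notice cluster(10.34.22.64:26810 LEFT/error) "
    "leaving cluster intg1",
]

for hit in find_inconsistent_errors(SAMPLE):
    print(hit["ts"], hit["state"], hit["member"], hit["error"])
```

To scan a real log, pass an open file object instead of SAMPLE, since the function only iterates over lines.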

Comment 3 Gordon Sim 2010-07-30 17:44:04 UTC

I think failover is not relevant to the lines I indicated at the start of my previous comment. The errors appear to result from an attempt to use the same durable subscription ids, and they occur before the node is shut down.
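
As a rough illustration of how a reused subscription id can produce a resource-locked error, here is a toy model in Python. It is not qpid's implementation: the ToyBroker class, the session names, and the idea that each durable subscription id maps to one exclusively owned queue are assumptions made for the sketch. Only the queue name _admin and the error text come from the log above.

```python
class ResourceLocked(Exception):
    """Raised when a queue is already exclusively owned by another session."""


class ToyBroker:
    """Toy model of exclusive queue ownership (hypothetical, not qpid code)."""

    def __init__(self):
        self.owners = {}  # queue name -> id of the session that owns it

    def declare_exclusive(self, queue, session):
        # The first session to declare the queue becomes its owner;
        # the same session may re-declare it, any other session is refused.
        owner = self.owners.setdefault(queue, session)
        if owner != session:
            raise ResourceLocked(
                f"Cannot grant exclusive access to queue {queue}"
            )

    def detach(self, session):
        # Detaching a session releases every queue it owned.
        self.owners = {q: s for q, s in self.owners.items() if s != session}


broker = ToyBroker()
broker.declare_exclusive("_admin", "session-A")  # first subscriber owns the queue

try:
    # A second session reusing the same subscription id hits the lock.
    broker.declare_exclusive("_admin", "session-B")
except ResourceLocked as err:
    print("resource-locked:", err)

broker.detach("session-A")                       # owner goes away
broker.declare_exclusive("_admin", "session-B")  # now succeeds
```

In this model the second declare fails for as long as the first session stays attached, with no failover involved.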

Comment 4 Alan Conway 2010-08-03 15:40:55 UTC

Although the symptoms are different, the cause is the same as bug 620418.

*** This bug has been marked as a duplicate of bug 620418 ***