Bug 970657

Summary: Failover and/or relocation of qpidd-primary service should be limited to ready brokers only
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: Development
Hardware: All
OS: Linux
Status: NEW
Severity: medium
Priority: medium
Keywords: Improvement
Type: Bug
Doc Type: Enhancement
Target Milestone: ---
Target Release: ---
Reporter: Pavel Moravec <pmoravec>
Assignee: messaging-bugs <messaging-bugs>
QA Contact: Messaging QE <messaging-qe-bugs>
CC: jross

Description Pavel Moravec 2013-06-04 14:00:53 UTC
Description of problem:
(for non-qpid people, some background is in the next comment)

In a newHA cluster, rgmanager is used to maintain exactly one qpidd-primary service cluster-wide. One limitation of this is that rgmanager can migrate or automatically fail over the qpidd-primary service to a node whose broker is not in the ready state; see e.g. the two scenarios in "Steps to Reproduce". As a consequence, this can lead to losing the primary forever, or at least to redundant repeated failovers of the primary to inappropriate cluster node(s).

This has to be either warned about in the manuals or, better, prevented in the implementation.


Version-Release number of selected component (if applicable):
qpid-cpp-*-0.18-14 (new HA as tech.preview)
rgmanager-3.0.12.1-17


How reproducible:
100%


Steps to Reproduce:
Two independent scenarios (rough command sketches follow below):
A) With an ordered failover domain for the qpidd-primary service, restart rgmanager on the highest-priority node.

B) Manually relocate qpidd-primary to a node whose qpidd broker is not in the ready state.
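
A minimal command-line sketch of both scenarios, assuming a 3-node cluster with hypothetical node names node1..node3 (node1 being the highest-priority member of the ordered domain), the primary service named qpidd-primary-service, and qpid-ha's status subcommand reporting the local broker's HA state:

  # Scenario A: bounce rgmanager on the highest-priority node of the ordered domain
  [root@node1 ~]# service rgmanager stop     # qpidd-primary fails over to node2
  [root@node1 ~]# service rgmanager start    # rgmanager relocates qpidd-primary back to node1
                                             # while node1's broker is still catching up

  # Scenario B: manually relocate qpidd-primary to a node whose broker is not ready
  [root@node3 ~]# qpid-ha status             # local broker state, e.g. "catchup"
  [root@node1 ~]# clusvcadm -r qpidd-primary-service -m node3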


Actual results:
Scenario A): Stopping rgmanager causes qpidd-primary to fail over to the 2nd node. Once rgmanager is up again on the 1st node, it starts the local qpidd broker (which comes up in a catchup state) and, concurrently, the 2nd node starts failing qpidd-primary over to the 1st node. The broker on the 1st node is still in a waiting/catchup state, so the promotion fails and qpidd-primary must be failed over again (possibly back to the 2nd node). By then the broker on the 2nd node is in a waiting/catchup state as well (stopping qpidd-primary in fact stops the qpidd broker too), so its promotion fails and qpidd-primary fails back to the 1st node, and so on.
(Sometimes rgmanager decides to fail qpidd-primary over to the 3rd node and we get lucky, but that is not guaranteed.)

Scenario B): Stopping qpidd-primary also stops (or restarts) the current primary broker. The attempt to promote the not-ready broker on the target node fails and the qpidd-primary service is failed over yet again, hopefully to a third node whose broker happens to be in the ready state. Either way, failing over to a broker that is not ready is pointless.
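
One way to observe the ping-pong is sketched below; clustat shows which node currently owns the service, and qpid-ha is assumed to accept a -b <host> option for querying a remote broker's HA state:

  [root@node1 ~]# clustat                    # current owner of qpidd-primary-service
  [root@node1 ~]# qpid-ha -b node1 status    # e.g. "catchup" -> promotion on node1 will fail
  [root@node1 ~]# qpid-ha -b node2 status    # "catchup" as well, so the service bounces back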


Expected results:
Scenario A): qpidd-primary should not end up in an endless loop of failing failovers.

Scenario B): qpidd-primary should never be failed over (or relocated) to a broker that is not in the ready state.


Additional info:
The possible resolution below probably won't work (per the clusterHA engineer I discussed this with):
- modify "service qpidd status" so that it returns success only if qpidd runs and, in addition:
  - the qpid-ha module is not loaded, or
  - the qpid-ha module is loaded AND qpid-ha status reports the ready state
  (a rough sketch of such a check follows below)
- mark the qpidd-primary service as dependent on the qpidd service (so that qpidd-primary can run only where qpidd status is OK)

Then rgmanager should be clever enough not to migrate or fail over the qpidd-primary service to a node where the prerequisites are not met.

BUT: per clusterHA, such a dependency across different failover domains is not allowed, and rgmanager is not clever enough to prevent the failover :-/
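
For reference, the status-check half of that idea would have looked roughly like the sketch below (a hypothetical helper script, not the shipped /etc/init.d/qpidd; it assumes qpid-ha status fails when the HA module is not loaded and otherwise prints a single state word such as "ready"):

  #!/bin/sh
  # Succeed only if the broker runs and, where the HA module is loaded, is "ready".
  service qpidd status >/dev/null 2>&1 || exit 1   # broker process is not running at all
  state=$(qpid-ha status 2>/dev/null) || exit 0    # qpid-ha failed => HA module not loaded
  [ "$state" = "ready" ]                           # HA loaded: require the ready state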

Comment 1 Pavel Moravec 2013-06-04 14:08:58 UTC
Background for non-qpid people: active-passive clustering in qpid uses clusterHA as follows:

- There is a (usually 3-node) cluster with no shared disks, running 3 identical copies of the qpid broker, one on each node (i.e. each in a restricted failover domain of a single node). These are the "base services".

- Having "primary service" that optionally keeps VIP and promotes the broker on the node where it runs as "primary", i.e. the active. Other qpid brokers are passive.

- When a new "base service" / qpid broker is starting, it needs some time to catch up / align with the primary broker. During this time it cannot be promoted to primary (such a promotion fails).

- The primary service has a failover domain spanning the whole cluster - any node can run it, but there can be only one primary broker at any time.

- Without the primary service running (i.e. without a broker successfully promoted to primary), no client can connect to the qpid cluster at all. So losing the primary service forever is a big pain.
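
A rough cluster.conf sketch of that layout (node, service and resource names are hypothetical; the <resources> section and fencing are elided):

  <rm>
    <failoverdomains>
      <!-- one restricted, single-node domain per base qpidd service -->
      <failoverdomain name="node1-domain" restricted="1">
        <failoverdomainnode name="node1"/>
      </failoverdomain>
      <!-- ... the same for node2 and node3 ... -->
      <!-- the primary service may run on any node -->
      <failoverdomain name="primary-domain">
        <failoverdomainnode name="node1"/>
        <failoverdomainnode name="node2"/>
        <failoverdomainnode name="node3"/>
      </failoverdomain>
    </failoverdomains>
    <service name="node1-qpidd-service" domain="node1-domain" recovery="restart">
      <script ref="qpidd"/>
    </service>
    <!-- ... the same for node2 and node3 ... -->
    <service name="qpidd-primary-service" domain="primary-domain" recovery="relocate">
      <!-- an optional VIP resource would go here -->
      <script ref="qpidd-primary"/>
    </service>
  </rm>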

Comment 2 Alan Conway 2013-06-05 19:29:55 UTC
Brokers in the catchup or joining state will refuse promotion; the idea is that rgmanager will try to promote brokers until it finds the ready one.

The problem with an ordered domain appears to be that rgmanager tries to relocate qpidd-primary as soon as the top-priority node comes online, in two steps:
1. stop the current primary qpidd.
2. start and promote qpidd on the top-priority host.

Now step 2 will almost certainly fail because qpidd won't have time to reach the ready state before the promotion attempt. As a result of step 1 we have also killed a potential backup and reset it to joining, so if there is any problem with the third broker it is game over.

A workaround would be to set nofailback on the ordered domain: https://fedorahosted.org/cluster/wiki/FailoverDomains.
For now we should document as a requirement that we only support ordered domains with nofailback.
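
In cluster.conf terms, that means marking the ordered domain roughly like this (names hypothetical):

  <failoverdomain name="primary-domain" ordered="1" nofailback="1">
    <failoverdomainnode name="node1" priority="1"/>
    <failoverdomainnode name="node2" priority="2"/>
    <failoverdomainnode name="node3" priority="3"/>
  </failoverdomain>

With nofailback set, rgmanager leaves qpidd-primary where it is when a higher-priority node rejoins instead of immediately relocating it to that node's not-yet-ready broker.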

We can improve things a bit by allowing primary brokers to be "demoted". If we deliberately stop the primary, not due to a failure (e.g. rgmanager relocating the service), then that broker should switch into a backup role while keeping all its queues and messages intact.

That still won't solve the failback problem though. To do that we need a way to delay relocating the primary from a lower-priority node until the broker on the higher-priority node has completed its catch-up and is ready to take over. I'm not aware of a way to do this in rgmanager.

Comment 3 Pavel Moravec 2013-06-06 11:24:28 UTC
> A workaround would be to set nofailback on the ordered domain:
> https://fedorahosted.org/cluster/wiki/FailoverDomains.
> For now we should document as a requirement that we only support
> ordered domains with nofailback.

bz971368 created for documentation changes that must be in the Vienna release, as I found another configuration gotcha not covered here (the primary service having recovery="restart").