Bug 971368

Summary: Docs: Limitations to cluster.conf setup for newHA
Product: Red Hat Enterprise MRG Reporter: Pavel Moravec <pmoravec>
Component: Messaging_Installation_and_Configuration_Guide Assignee: Jared MORGAN <jmorgan>
Status: CLOSED CURRENTRELEASE QA Contact: Frantisek Reznicek <freznice>
Severity: high Docs Contact:
Priority: high    
Version: 2.3 CC: esammons, freznice, mmurray
Target Milestone: 3.0   
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-01-22 15:28:13 UTC Type: Bug

Description Pavel Moravec 2013-06-06 11:21:55 UTC
Description of problem:
Based on https://bugzilla.redhat.com/show_bug.cgi?id=970657#c2, cluster.conf must meet a few requirements for active-passive qpid clusters to work properly.

In particular:
1) The qpidd-primary service cannot be manually relocated to a node where the qpid broker is not in the ready state (that is, the broker is stopped, or is in the catchup or joining state). Such a relocation will fail.

2) When using ordered failover domains, use the nofailback option (nofailback="1"). This prevents the following situation:
- the highest-priority node is joining the cluster and starting the qpidd service
- the qpidd service on that node is in the catchup or joining state
- rgmanager tries to relocate qpidd-primary to this node, which restarts the qpidd broker on the 2nd node, where qpidd-primary was running
- the relocation fails because qpidd on node1 isn't ready, so rgmanager tries to relocate to the 2nd node
- the broker on the 2nd node is in the joining state, so the qpidd-primary service fails to start
- rgmanager tries to relocate to the 1st node again, closing this infinite loop

3) The recovery policy of the primary service has to be "relocate", not "restart", because stopping qpidd-primary currently means stopping / restarting the qpidd broker as well. The newly started broker won't be in the ready state when rgmanager attempts to start the qpidd-primary service on it.
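
For illustration, points 2) and 3) could look roughly like this in the rgmanager section of cluster.conf. This is a sketch only: the node names, the domain name, the priorities and the script path are assumptions made for the example and are not taken from this report or from the guide.

```xml
<!-- Sketch only: names, priorities and paths below are illustrative assumptions. -->
<rm>
  <failoverdomains>
    <!-- Point 2: an ordered failover domain with failback disabled -->
    <failoverdomain name="qpidd-primary-domain" ordered="1" nofailback="1">
      <failoverdomainnode name="node1.example.com" priority="1"/>
      <failoverdomainnode name="node2.example.com" priority="2"/>
    </failoverdomain>
  </failoverdomains>

  <resources>
    <script name="qpidd-primary" file="/etc/init.d/qpidd-primary"/>
  </resources>

  <!-- Point 3: recover the primary service by relocating it, not restarting it -->
  <service name="qpidd-primary-service" domain="qpidd-primary-domain" recovery="relocate">
    <script ref="qpidd-primary"/>
  </service>
</rm>
```

With nofailback="1", rgmanager should not move qpidd-primary-service back to node1 just because node1 rejoined the cluster, which is the trigger for the loop described in 2).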

Comment 1 Joshua Wulf 2013-09-23 06:55:29 UTC
Added notes about the first two here:

http://deathstar1.usersys.redhat.com:3000/builds/18173-Messaging_Installation_and_Configuration_Guide/#Limitations_in_HA_in_MRG_3

With the third one, about relocate vs restart, currently

http://deathstar1.usersys.redhat.com:3000/builds/18173-Messaging_Installation_and_Configuration_Guide/#Configure_rgmanager

has restart in step 9 for the individual nodes, and relocate in step 10 for the primary service.

Comment 2 Frantisek Reznicek 2013-12-05 14:56:48 UTC
1), 3) are ok.

2) The wording is not optimal; see the proposed change below:

Failback with ordered domains can cause an infinite failover loop under certain conditions. To avoid this, when using ordered domains use nofailback=1.

Replace with (when talking about a domain, it always has to be [cluster] failover-domain):

Failback with cluster ordered failover-domains (cluster.conf 'ordered=1') can cause an infinite failover loop under certain conditions. To avoid this use cluster ordered failover-domains with nofailback=1 parameter.

-> ASSIGNED

Comment 3 Joshua Wulf 2013-12-12 05:02:02 UTC
Changed to: 

"Failback with cluster ordered failover-domains ('ordered=1' in cluster.conf) can cause an infinite failover loop under certain conditions. To avoid this, use cluster ordered failover-domains with nofailback=1 specified in cluster.conf."

http://deathstar1.usersys.redhat.com:3000/builds/18173-Messaging_Installation_and_Configuration_Guide/#Limitations_in_HA_in_MRG_3

Comment 4 Frantisek Reznicek 2013-12-17 12:26:10 UTC
Thanks for your change, I'm satisfied now.

-> VERIFIED