Description of problem: Based on https://bugzilla.redhat.com/show_bug.cgi?id=970657#c2, there are a few requirements on cluster.conf for active-passive qpid clusters to work properly. In particular:
1) The qpidd-primary service cannot be manually relocated to a node where the qpid broker is not in ready state (that is, the broker is stopped, or is in catchup or joining state). Such a relocation is certain to fail.
2) When using ordered failover domains, use the nofailback option (nofailback="1"). This prevents the following situation:
- the highest-priority node is joining the cluster and starting the qpidd service
- the qpidd service on that node is in catchup or joining state
- rgmanager tries to relocate qpidd-primary to this node (which restarts the qpidd broker on the 2nd node, where qpidd-primary was running)
- the relocation fails because qpidd on node1 isn't ready, so rgmanager tries to relocate back to the 2nd node
- the broker on the 2nd node is in joining state, so the qpidd-primary service fails to start
- rgmanager tries to relocate to the 1st node again, closing the infinite loop
3) The recovery policy of the primary service has to be "relocate", not "restart". Currently, stopping qpidd-primary means stopping / restarting the qpidd broker as well, so a newly started broker won't be in ready state when rgmanager attempts to start the qpidd-primary service.
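Requirements 2) and 3) can be checked mechanically against a cluster.conf. The following is only a minimal sketch, assuming the usual rgmanager layout (failoverdomain elements carrying ordered / nofailback attributes, service elements carrying a recovery attribute). The function check_cluster_conf, the sample fragment, and all node, domain and service names in it are hypothetical and used for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical cluster.conf fragment; node, domain and service names are made up.
SAMPLE_CONF = """\
<cluster name="qpid-ha" config_version="1">
  <rm>
    <failoverdomains>
      <failoverdomain name="primary-domain" ordered="1" restricted="1">
        <failoverdomainnode name="node1.example.com" priority="1"/>
        <failoverdomainnode name="node2.example.com" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <service name="qpidd-primary-service" domain="primary-domain" recovery="restart"/>
  </rm>
</cluster>
"""


def check_cluster_conf(xml_text, primary_service="qpidd-primary-service"):
    """Return a list of warnings for requirements 2) and 3) of this report."""
    root = ET.fromstring(xml_text)
    warnings = []
    # Requirement 2: an ordered failover domain should also set nofailback="1".
    for domain in root.iter("failoverdomain"):
        if domain.get("ordered") == "1" and domain.get("nofailback") != "1":
            warnings.append(
                'failover domain "%s" is ordered but lacks nofailback="1"'
                % domain.get("name")
            )
    # Requirement 3: the primary service should use recovery="relocate".
    for service in root.iter("service"):
        if service.get("name") != primary_service:
            continue
        recovery = service.get("recovery")
        if recovery != "relocate":
            warnings.append(
                'service "%s" uses recovery=%r, expected "relocate"'
                % (primary_service, recovery)
            )
    return warnings


if __name__ == "__main__":
    for warning in check_cluster_conf(SAMPLE_CONF):
        print("WARNING:", warning)
```

The sample fragment deliberately violates both requirements, so the sketch reports one warning for the failover domain and one for the service. Requirement 1) concerns the runtime state of the broker and cannot be checked from cluster.conf alone.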
Added notes about the first two here: http://deathstar1.usersys.redhat.com:3000/builds/18173-Messaging_Installation_and_Configuration_Guide/#Limitations_in_HA_in_MRG_3
As for the third one (relocate vs. restart), http://deathstar1.usersys.redhat.com:3000/builds/18173-Messaging_Installation_and_Configuration_Guide/#Configure_rgmanager currently has restart for the individual nodes in step 9, and relocate for the primary service in step 10.
1) and 3) are OK. The wording of 2) is not optimal; see the proposed change below.
Current text:
Failback with ordered domains can cause an infinite failover loop under certain conditions. To avoid this, when using ordered domains use nofailback=1.
Replace with (when talking about a domain, it always has to be called a [cluster] failover-domain):
Failback with cluster ordered failover-domains (cluster.conf 'ordered=1') can cause an infinite failover loop under certain conditions. To avoid this, use cluster ordered failover-domains with the nofailback=1 parameter.
-> ASSIGNED
Changed to: "Failback with cluster ordered failover-domains ('ordered=1' in cluster.conf) can cause an infinite failover loop under certain conditions. To avoid this, use cluster ordered failover-domains with nofailback=1 specified in cluster.conf." http://deathstar1.usersys.redhat.com:3000/builds/18173-Messaging_Installation_and_Configuration_Guide/#Limitations_in_HA_in_MRG_3
Thanks for your change, I'm satisfied now. -> VERIFIED