Bug 831648

Summary: rgmanager prefers 2 nodes in a 3-node cluster
Product: Red Hat Enterprise Linux 6
Reporter: Martin Kudlej <mkudlej>
Component: rgmanager
Assignee: Ryan McCabe <rmccabe>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: low
Priority: low
Version: 6.3
CC: cluster-maint, jkortus, jpokorny, mjuricek
Target Milestone: rc   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: rgmanager-3.0.12.1-15.el6
Doc Type: Bug Fix
Last Closed: 2013-02-21 10:18:19 UTC
Type: Bug
Attachments: Proposed patch

Description Martin Kudlej 2012-06-13 14:06:00 UTC
Description of problem:
I have a 3-node cluster (cluster configuration below). I periodically kill the Condor agents on all nodes. rgmanager prefers to start the Condor agents only on 2 nodes (node02 and node03). rgmanager starts a Condor agent on the first node (node01) in only about 1 out of 50 attempts.

Version-Release number of selected component (if applicable):
pacemaker-cluster-libs-1.1.7-6.el6.x86_64
modcluster-0.16.2-18.el6.x86_64
lvm2-cluster-2.02.95-10.el6.x86_64
fence-virt-0.2.3-9.el6.x86_64
clusterlib-3.0.12.1-32.el6.x86_64
cman-3.0.12.1-32.el6.x86_64
rgmanager-3.0.12.1-12.el6.x86_64
condor-cluster-resource-agent-7.6.5-0.15.el6.x86_64
cluster-glue-libs-1.0.5-6.el6.x86_64
fence-agents-3.1.5-17.el6.x86_64
cluster-glue-1.0.5-6.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install clustering + Condor
2. Set up the cluster configuration below
3. Periodically kill the Condor agents on all nodes

Expected results:
rgmanager does not prefer any particular node when starting a recovering agent.

Additional info:
$ cat /etc/cluster/cluster.conf

<?xml version="1.0"?>
<cluster config_version="86" name="HACondorCluster">
        <fence_daemon post_join_delay="30"/>
        <fence_xvmd debug="10" multicast_interface="eth1"/>
        <clusternodes>
                <clusternode name="xulqrxy-node01" nodeid="1">
                        <fence>
                                <method name="virt-fenc">
                                        <device domain="test" name="fenc"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xulqrxy-node02" nodeid="2">
                        <fence>
                                <method name="virt-fenc">
                                        <device domain="test" name="fenc"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xulqrxy-node03" nodeid="3">
                        <fence>
                                <method name="virt-fenc">
                                        <device domain="test" name="fenc"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1">
                <multicast addr="224.0.0.1"/>
        </cman>
        <fencedevices>
                <fencedevice agent="fence_xvm" name="fenc"/>
        </fencedevices>
        <rm>   
                <failoverdomains>
                        <failoverdomain name="Schedd hasched1 Failover Domain" nofailback="0" ordered="0" restricted="1">
                                <failoverdomainnode name="xulqrxy-node01"/>
                                <failoverdomainnode name="xulqrxy-node02"/>
                                <failoverdomainnode name="xulqrxy-node03"/>
                        </failoverdomain>
                        <failoverdomain name="Schedd hasched2 Failover Domain" nofailback="0" ordered="0" restricted="1">
                                <failoverdomainnode name="xulqrxy-node01"/>
                                <failoverdomainnode name="xulqrxy-node02"/>
                                <failoverdomainnode name="xulqrxy-node03"/>
                        </failoverdomain>
                        <failoverdomain name="Schedd hasched3 Failover Domain" nofailback="0" ordered="0" restricted="1">
                                <failoverdomainnode name="xulqrxy-node01"/>
                                <failoverdomainnode name="xulqrxy-node02"/>
                                <failoverdomainnode name="xulqrxy-node03"/>
                        </failoverdomain>
                        <failoverdomain name="Schedd hasched4 Failover Domain" nofailback="0" ordered="0" restricted="1">
                                <failoverdomainnode name="xulqrxy-node01"/>
                                <failoverdomainnode name="xulqrxy-node02"/>
                                <failoverdomainnode name="xulqrxy-node03"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
                
                <service autostart="1" domain="Schedd hasched1 Failover Domain" name="HA Schedd hasched1" recovery="relocate">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej1" name="Job Queue for hasched1" options="rw,soft">  
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched1" type="schedd"/>
                        </netfs>
                </service>
                <service autostart="1" domain="Schedd hasched2 Failover Domain" name="HA Schedd hasched2" recovery="relocate">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej2" name="Job Queue for hasched2" options="rw,soft">
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched2" type="schedd"/>
                        </netfs>
                </service>
                <service autostart="1" domain="Schedd hasched3 Failover Domain" name="HA Schedd hasched3" recovery="relocate">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej3" name="Job Queue for hasched3" options="rw,soft">
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched3" type="schedd"/>
                        </netfs>
                </service>
                <service autostart="1" domain="Schedd hasched4 Failover Domain" name="HA Schedd hasched4" recovery="relocate">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej4" name="Job Queue for hasched4" options="rw,soft">
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched4" type="schedd"/>
                        </netfs>
                </service>
        </rm>
        <logging debug="on"/>
</cluster>

Comment 1 Jan Pokorný [poki] 2012-08-02 14:04:55 UTC
Admittedly, the "Red Hat Cluster Suite" product in Bugzilla is tempting,
but it is no longer in use (it no longer has a standalone position).

As per the packages, flipping this to RHEL 6 -- rgmanager.

Comment 3 Lon Hohberger 2012-08-06 19:00:32 UTC
Yes, without ordering and priorities, rgmanager will not prefer any node.  Consider this configuration instead:

<?xml version="1.0"?>
<cluster config_version="86" name="HACondorCluster">
        <fence_daemon post_join_delay="30"/>
        <fence_xvmd debug="10" multicast_interface="eth1"/>
        <clusternodes>
                <clusternode name="xulqrxy-node01" nodeid="1">
                        <fence>
                                <method name="virt-fenc">
                                        <device domain="test" name="fenc"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xulqrxy-node02" nodeid="2">
                        <fence>
                                <method name="virt-fenc">
                                        <device domain="test" name="fenc"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xulqrxy-node03" nodeid="3">
                        <fence>
                                <method name="virt-fenc">
                                        <device domain="test" name="fenc"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1">
                <multicast addr="224.0.0.1"/>
        </cman>
        <fencedevices>
                <fencedevice agent="fence_xvm" name="fenc"/>
        </fencedevices>
        <rm>   
                <failoverdomains>
                        <failoverdomain name="Schedd hasched1 Failover Domain" nofailback="0" ordered="1" restricted="0">
                                <failoverdomainnode name="xulqrxy-node01" priority="1"/>
                                <failoverdomainnode name="xulqrxy-node02" priority="2"/>
                                <failoverdomainnode name="xulqrxy-node03" priority="3"/>
                        </failoverdomain>
                        <failoverdomain name="Schedd hasched2 Failover Domain" nofailback="0" ordered="1" restricted="0">
                                <failoverdomainnode name="xulqrxy-node01" priority="3"/>
                                <failoverdomainnode name="xulqrxy-node02" priority="1"/>
                                <failoverdomainnode name="xulqrxy-node03" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="Schedd hasched3 Failover Domain" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="xulqrxy-node01" priority="2"/>
                                <failoverdomainnode name="xulqrxy-node02" priority="3"/>
                                <failoverdomainnode name="xulqrxy-node03" priority="1"/>
                        </failoverdomain>

                </failoverdomains>
                <resources/>
                
                <service autostart="1" domain="Schedd hasched1 Failover Domain" name="HA Schedd hasched1" recovery="relocate">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej1" name="Job Queue for hasched1" options="rw,soft">  
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched1" type="schedd"/>
                        </netfs>
                </service>
                <service autostart="1" domain="Schedd hasched2 Failover Domain" name="HA Schedd hasched2" recovery="relocate">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej2" name="Job Queue for hasched2" options="rw,soft">
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched2" type="schedd"/>
                        </netfs>
                </service>
                <service autostart="1" domain="Schedd hasched3 Failover Domain" name="HA Schedd hasched3" recovery="relocate">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej3" name="Job Queue for hasched3" options="rw,soft">
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched3" type="schedd"/>
                        </netfs>
                </service>
                <service autostart="1" name="HA Schedd hasched4" recovery="relocate">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej4" name="Job Queue for hasched4" options="rw,soft">
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched4" type="schedd"/>
                        </netfs>
                </service>
        </rm>
        <logging debug="on"/>
</cluster>


If required, you can also make rgmanager more "random" about node placement for "HA Schedd hasched4" by adding central_processing="1" to the <rm> tag. [NOTE: To do this, stop rgmanager on all hosts, add the attribute, then restart rgmanager on all hosts.]
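
As a rough sketch of that edit (only the opening <rm> tag changes; the failover domains and services stay exactly as in the configuration above):

        <rm central_processing="1">
                <failoverdomains>
                        ...
                </failoverdomains>
                ...
        </rm>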

Comment 4 Lon Hohberger 2012-08-06 19:01:11 UTC
Oops, that third domain doesn't need 'restricted="1"' (it doesn't matter that it's there, but it's not necessary).
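
That is, the third domain could simply be declared without the restricted attribute (a sketch; the node list and priorities are unchanged from the configuration above):

                        <failoverdomain name="Schedd hasched3 Failover Domain" nofailback="0" ordered="1">
                                <failoverdomainnode name="xulqrxy-node01" priority="2"/>
                                <failoverdomainnode name="xulqrxy-node02" priority="3"/>
                                <failoverdomainnode name="xulqrxy-node03" priority="1"/>
                        </failoverdomain>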

Comment 6 Lon Hohberger 2012-08-06 20:51:19 UTC
Let me know if this improves things.

Comment 7 Lon Hohberger 2012-08-06 20:52:18 UTC
Oh, also, 'recovery' should probably be 'restart', not 'relocate'.
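
For example, using the first service from the configuration above (a sketch; only the recovery attribute differs):

                <service autostart="1" domain="Schedd hasched1 Failover Domain" name="HA Schedd hasched1" recovery="restart">
                        <netfs export="/mnt/qa" force_unmount="on" host="nest.test.redhat.com" mountpoint="/mnt/qa/MRG/cluster_mkudlej1" name="Job Queue for hasched1" options="rw,soft">
                                <condor __independent_subtree="1" __max_restarts="3" __restart_expire_time="300" name="hasched1" type="schedd"/>
                        </netfs>
                </service>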

Comment 8 Jaroslav Kortus 2012-08-13 10:32:54 UTC
I've tested this with very simple configuration:
  <rm>
    <resources>
      <script name="servicescript" file="/bin/true"/>
    </resources>
    <service autostart="1" name="service1" recovery="relocate">
        <script ref="servicescript"/>
    </service>
  </rm>

And here are the results:
$ clustat
Cluster Status for STSRHTS16926 @ Mon Aug 13 05:18:39 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 c2-node01                                                           1 Online, Local, rgmanager
 c2-node02                                                           2 Online, rgmanager
 c2-node03                                                           3 Online, rgmanager

(rgmanager running on all 3 nodes)


# for i in `seq 1 500`; do echo "Iteration $i"; clusvcadm -r service1 ;  clustat -x | gxpp '//group/@owner' >> owners.txt; done
[...]
# sort owners.txt | uniq -c
    250 c2-node02
    250 c2-node03

It's clear that node01 was ignored in all 500 attempts. The expectation is to get roughly 1/3 of the relocations on node01.

Comment 9 Jaroslav Kortus 2012-08-13 11:50:05 UTC
It seems that the service always jumps between the two nodes with the highest node IDs. If I set nodeid="6" for c2-node01, I get the same behaviour (the two highest IDs win).

The same applies to a 5-node cluster (node04 and node05 hold the service all the time).

I missed that this is actually a RHEL 6 bug, so the tests were performed on RHEL 5 (rgmanager-2.0.52-34.el5), but I assume not much has changed in this regard :).

Comment 11 Ryan McCabe 2012-10-15 13:25:02 UTC
Created attachment 627416 [details]
Proposed patch

Comment 15 errata-xmlrpc 2013-02-21 10:18:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0409.html