Bug 131073

Summary:	Services do not migrate to their preferred host in a 2-member cluster, when both hosts come up almost simultaneously
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Satya Prakash Tripathi <sptripathi78>
Component:	clumanager	Assignee:	Lon Hohberger <lhh>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Cluster QE <mspqa-list>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3	CC:	cluster-maint
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:	1.2.28	Doc Type:	Bug Fix
Last Closed:	2005-11-22 13:48:13 UTC	Type:	---

Description Satya Prakash Tripathi 2004-08-27 10:04:25 UTC

Description of problem:

When we reboot both hosts of a 2-member cluster simultaneously (with the
tie-breaker IP in use), the services sometimes *NEVER* migrate to their
preferred host when both hosts come up at almost the same time. Instead,
they all settle on one of the two hosts.

I am not sure whether this is a problem with my cluster.xml file.
The file is included in the "Additional info" section.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes


Steps to Reproduce:

1. Reboot both hosts of a 2-member cluster simultaneously.
2. When they come up, all the services run on one host and do not
migrate to their preferred host, even though that host has joined the
cluster successfully. We observed this for a long time, and the cluster
seems to have decided to run all services on one host FOREVER
(this happens only sometimes).

  
Actual results:
Services NEVER relocate to their preferred host.


Expected results:
Services should eventually relocate to their preferred host.


Additional info:

We are using clumanager-1.2.16-1 with the following configuration details:

* 2-member cluster configuration 
* soft-quorum forced permanently, using cludb
* IP-based tie-breaker in use

* cluster.xml file:

<?xml version="1.0"?>
<cluconfig version="3.0">
  <cluster alias_ip="192.168.0.140" timestamp="1091659302" 
name="rhas0.something.com" config_viewnumber="2"/>
  <clumembd broadcast="yes" interval="750000" loglevel="7" 
multicast="yes" multicast_ipaddress="225.0.0.11" tko_count="20"/>
  <clulockd loglevel="7"/>
  <cluquorumd sametimenetup="4" checkinterval="15" pinginterval="2" 
loglevel="7" sametimenetdown="2" powerswitchtimeout="120" 
tiebreaker_ip="192.168.0.252" allow_soft="yes"/>
  <clurmtabd loglevel="7" pollinterval="4"/>
  <clusvcmgrd loglevel="7"/>
  <members>
    <member id="1" name="host02" watchdog="no">
    </member>
    <member id="0" name="host01" watchdog="no">
    </member>
  </members>
  <sharedstate driver="libsharedraw.so" type="raw" 
rawprimary="/dev/raw/raw1" rawshadow="/dev/raw/raw2"/>
  <services>
    <service id="0" relocateonpreferrednodeboot="yes" name="svc0" 
checkinterval="0" preferrednode="host01" 
userscript="/opt/cluster/svc0.sh" failoverdomain="host01domain">
      <service_ipaddresses>
        <service_ipaddress id="0" broadcast="192.168.0.255" 
ipaddress="192.168.0.143"/>
        <service_ipaddress id="1" broadcast="10.51.8.255" 
netmask="255.255.255.0" ipaddress="10.51.8.143"/>
      </service_ipaddresses>
      <device id="0" sharename="None" name="/dev/sde2">
        <mount fstype="ext3" options="rw,nosuid" 
mountpoint="/cluster/arch" forceunmount="yes"/>
        <nfsexport id="0" name="/cluster/arch">
          <client id="0" name="192.168.0.141" 
options="rw,no_root_squash"/>
          ...
          ......    
        </nfsexport>
      </device>
      <device id="1" sharename="None" name="/dev/sdc2">
        <mount fstype="ext3" options="rw,nosuid" 
mountpoint="/cluster/xfs" forceunmount="yes"/>
        <nfsexport id="0" name="/cluster/xfs">
          <client id="0" name="192.168.0.141" 
options="rw,no_root_squash"/>
          ...
          ..... 
        </nfsexport>
      </device>
      <device id="3" sharename="None" name="/dev/sde1">
        <mount fstype="ext3" options="rw,nosuid" 
mountpoint="/cluster/amap" forceunmount="yes"/>
      </device>
      <device id="4" sharename="None" name="/dev/sdc1">
        <mount fstype="ext3" options="rw,nosuid" 
mountpoint="/cluster/db" forceunmount="yes"/>
      </device>
    </service>
    <service id="1" preferrednode="host02" name="svc1" checkinterval="0" 
relocateonpreferrednodeboot="yes" userscript="/opt/cluster/svc1.sh" 
failoverdomain="host02domain">
      <service_ipaddresses>
        <service_ipaddress id="0" broadcast="10.51.8.255" 
netmask="255.255.255.0" ipaddress="10.51.8.146"/>
        <service_ipaddress id="1" broadcast="192.168.0.255" 
netmask="255.255.255.0" ipaddress="192.168.0.146"/>
      </service_ipaddresses>
      <device id="0" sharename="None" name="/dev/sde5">
        <mount fstype="ext3" options="rw,nosuid" 
mountpoint="/cluster/cdr/cdrcp01" forceunmount="yes"/>
        <nfsexport id="0" name="/cluster/cdr/cdrcp01">
          <client id="0" name="192.168.0.141" 
options="rw,no_root_squash"/>
          ...
          ....
        </nfsexport>
      </device>
      <device id="1" sharename="None" name="/dev/sdd1">
        <mount fstype="ext3" options="rw,nosuid" 
mountpoint="/cluster/log" forceunmount="yes"/>
        <nfsexport id="0" name="/cluster/log">
          <client id="0" name="192.168.0.141" 
options="rw,no_root_squash"/>
          ...
          .....
        </nfsexport>
      </device>
      <device id="2" sharename="None" name="/dev/sde6">
        <mount fstype="ext3" options="rw,nosuid" 
mountpoint="/cluster/cdr/cdrcp02" forceunmount="yes"/>
        <nfsexport id="0" name="/cluster/cdr/cdrcp02">
          <client id="0" name="192.168.0.141" 
options="rw,no_root_squash"/>
          ...
          ....  
        </nfsexport>
      </device>
    </service>
    <service id="2" preferrednode="host01" 
userscript="/opt/cluster/svc2.sh" name="svc2" checkinterval="15" 
relocateonpreferrednodeboot="yes" failoverdomain="host01domain">
    </service>
    <service id="3" preferrednode="host02" 
userscript="/opt/cluster/svc3.sh" name="svc3" checkinterval="15" 
relocateonpreferrednodeboot="yes" failoverdomain="host02domain">
    </service>
  </services>
  <failoverdomains>
    <failoverdomain id="0" name="host01domain" restricted="no">
      <failoverdomainnode id="0" name="host01"/>
    </failoverdomain>
    <failoverdomain id="1" name="host02domain" restricted="no">
      <failoverdomainnode id="1" name="host02"/>
    </failoverdomain>
  </failoverdomains>
</cluconfig>
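
For readers who want to compare a configuration like the one above against where services actually run, here is a minimal Python sketch. It is not part of clumanager. It reads only the `name`, `preferrednode`, and `relocateonpreferrednodeboot` attributes that appear in the file above; the `SAMPLE` string is a trimmed-down stand-in for that file, and the `owners` mapping (service name to the host currently running it) is a hypothetical input that would have to be collected by hand.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the cluster.xml above (only the attributes used here).
SAMPLE = """<?xml version="1.0"?>
<cluconfig version="3.0">
  <services>
    <service id="0" name="svc0" preferrednode="host01" relocateonpreferrednodeboot="yes"/>
    <service id="1" name="svc1" preferrednode="host02" relocateonpreferrednodeboot="yes"/>
  </services>
</cluconfig>"""


def preferred_placement(xml_text):
    """Map each service name to (preferred node, relocate-on-boot flag)."""
    root = ET.fromstring(xml_text)
    placement = {}
    for svc in root.iter("service"):
        placement[svc.get("name")] = (
            svc.get("preferrednode"),
            svc.get("relocateonpreferrednodeboot") == "yes",
        )
    return placement


def misplaced(placement, owners):
    """Return services that should relocate but are not on their preferred node.

    `owners` is a hypothetical mapping: service name -> host running it now.
    """
    return sorted(
        name
        for name, (node, relocate) in placement.items()
        if relocate and node is not None and owners.get(name) != node
    )


if __name__ == "__main__":
    placement = preferred_placement(SAMPLE)
    # The situation described in this report: everything ended up on host01.
    print(misplaced(placement, {"svc0": "host01", "svc1": "host01"}))
```

With both services on host01, the sketch reports svc1 as the one that has not moved to its preferred host.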

Comment 1 Lon Hohberger 2004-08-27 14:35:57 UTC

Interesting; this works for me.  There's a lot of noise from the
translation from RHEL 2.1, but it looks like your configuration is
correct.

Does it occur without soft quorum enabled?

Comment 3 Lon Hohberger 2004-08-27 19:56:35 UTC

You could apply this patch, which fixes performance and other
problems with the IP tiebreaker (it is a fix to the previous patch):

http://bugzilla.redhat.com/bugzilla/attachment.cgi?id=103180&action=view

Comment 4 Satya Prakash Tripathi 2004-08-30 13:36:02 UTC

Responding to your comment #1:

We have only a two-member cluster, so if I don't force soft quorum
(either with cluforce at runtime or permanently with cludb), the
clusters on the two hosts don't form a quorum when the hosts come up.

As I understand it, that's because when host1 comes up, it finds that
only 1 of the 2 members is up, which is not a majority (it's exactly
half, not more than half).
Sadly, by the time host2 comes up, it finds that host1's cluster is
inactive, and host2 cannot form a quorum by itself either, for the
reason stated above.
That's why we need to enable soft quorum, so that one host with an
active cluster can form a quorum.
If you know of any other or better workaround for this problem, please
suggest it!
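
The arithmetic above can be written out as a small Python sketch. This is a simplified model for illustration only, not clumanager code: the function names are hypothetical, and treating soft quorum as "exactly half of the members is enough" is an assumption made here to match the two-member case described in this comment.

```python
def has_majority(members_up, members_total):
    """Strict majority: more than half of the configured members are up."""
    return 2 * members_up > members_total


def has_quorum(members_up, members_total, allow_soft=False):
    """Quorum check with an optional, simplified soft-quorum rule.

    Assumption for illustration: with allow_soft set, exactly half of the
    members (but at least one) is accepted as a quorum.
    """
    if has_majority(members_up, members_total):
        return True
    return allow_soft and members_up > 0 and 2 * members_up == members_total


if __name__ == "__main__":
    # One host of a two-member cluster is up: half, so no strict majority.
    print(has_quorum(1, 2))
    # The same situation with soft quorum forced.
    print(has_quorum(1, 2, allow_soft=True))
```

Under this model, a lone host in a two-member cluster cannot form a quorum unless soft quorum is allowed, which is the behaviour described above.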

Responding to your comment #3:

I've applied this patch. I will update this bug with the results after
some experiments.

Thanks.

Comment 5 Lon Hohberger 2004-10-01 15:03:56 UTC

As a side note, I really think this patch should fix the problem that
caused "soft quorum" to be disabled by default.

Comment 6 Lon Hohberger 2005-11-21 21:39:40 UTC

Hi Satya,

Any luck with the applied patches?

Comment 7 Satya Prakash Tripathi 2005-11-22 07:11:47 UTC

Hi Lon,

Sorry for the long silence. It does work for us.
No new occurrences of this problem have been reported,
so you can close this bug.
Thanks a lot!