Bug 836429

Summary: hosta can never rejoin the cluster because of "reservation conflict" unless reboot -f hosta.
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.2
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Target Milestone: rc
Reporter: davidyangyi <davidyangyi>
Assignee: Ryan O'Hara <rohara>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint, davidyangyi, dyasny, fdinitto, lhh, rpeterso, sdake, teigland
Doc Type: Bug Fix
Type: Bug
Last Closed: 2012-06-30 13:36:38 UTC

Description davidyangyi 2012-06-29 05:21:26 UTC
Description of problem:
I have an RHCS cluster on RHEL 6.2, using an iSCSI LUN as the fence device (SCSI reservations). When I unplug the network cable of the master (hosta), the slave (hostb) fences it successfully and becomes the master, while hosta still thinks it is the master.
When I plug the cable back into hosta, hosta can never rejoin the cluster because of a "reservation conflict" unless I run reboot -f on hosta.
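
For reference, the registrations and the reservation that fence_scsi places on the shared LUN can be inspected with sg_persist. A minimal sketch, assuming the shared LUN is the /dev/mapper/mpathap1 device referenced in the cluster.conf below (exact output will differ):

sg_persist --in --read-keys --device=/dev/mapper/mpathap1          # registered keys, one per unfenced node
sg_persist --in --read-reservation --device=/dev/mapper/mpathap1   # current reservation holder and type

A node whose key has been preempted by fencing gets "reservation conflict" errors on writes to the LUN until its key is re-registered (unfencing).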

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. hosta is the master, hostb is the slave
2. use an iSCSI LUN as the fence device (fence_scsi)
3. unplug hosta's network cable and wait for hostb to become the master
4. plug hosta's cable back in
  
Actual results:


Expected results:


Additional info:

Comment 1 davidyangyi 2012-06-29 05:23:24 UTC
cluster.conf

<?xml version="1.0"?>
<cluster config_version="41" name="bcec_img">
        <clusternodes>
                <clusternode name="hosta" nodeid="1">
                        <fence>
                                <method name="scsi">
                                        <device name="scsifence"/>
                                </method>
                        </fence>
                        <unfence>
                                <device action="on" name="scsifence"/>
                        </unfence>
                </clusternode>
                <clusternode name="hostb" nodeid="2">
                        <fence>
                                <method name="scsi">
                                        <device name="scsifence"/>
                                </method>
                        </fence>
                        <unfence>
                                <device action="on" name="scsifence"/>
                        </unfence>
                </clusternode>
        </clusternodes>
        <cman broadcast="yes" expected_votes="1" two_node="1"/>
        <rm>
                <resources>
                        <ip address="172.16.200.31/25" monitor_link="on" sleeptime="10"/>
                        <fs device="/dev/mapper/mpathap1" fsid="7094" mountpoint="/bcec_images" name="bcec_imgages" quick_status="on"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="bcec_img" nofailback="1" ordered="0" restricted="0">
                                <failoverdomainnode name="hosta" priority="2"/>
                                <failoverdomainnode name="hostb" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <service domain="bcec_img" exclusive="1" name="bcec_img" recovery="relocate">
                        <ip ref="172.16.200.31/25"/>
                </service>
        </rm>
        <fence_daemon post_join_delay="25"/>
        <fencedevices>
                <fencedevice agent="fence_scsi" name="scsifence"/>
        </fencedevices>
        <logging debug="on">
                <logging_daemon debug="on" name="rgmanager"/>
                <logging_daemon debug="on" name="corosync"/>
                <logging_daemon debug="on" name="fenced"/>
                <logging_daemon debug="on" name="dlm_controld"/>
                <logging_daemon debug="on" name="corosync" subsys="CMAN"/>
        </logging>
</cluster>
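
For completeness, a configuration like this can be sanity-checked before it is used; a minimal sketch, assuming the standard RHEL 6 cluster packages are installed:

ccs_config_validate    # validates /etc/cluster/cluster.conf against the cluster schema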

Comment 3 Steven Dake 2012-06-29 18:54:56 UTC
If the problem exists, it is in the base cluster, not corosync. Reassigning to cluster for triage.

Comment 4 Fabio Massimo Di Nitto 2012-06-30 04:01:46 UTC
Reassigning to the fence_scsi maintainer to confirm, but it sounds like the correct behaviour to me.

Comment 5 Fabio Massimo Di Nitto 2012-06-30 04:03:13 UTC
If anything, cman has to be restarted in order to perform "unfencing":

plug the fiber back
restart cman -> unfencing
node rejoins the cluster.
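
For illustration, on a RHEL 6 node that would look roughly like the following; a minimal sketch only, assuming the stock init scripts (the cman start sequence is what performs the unfencing step):

# on hosta, after reconnecting the cable
service cman restart     # stop and start the cluster manager; startup re-registers the node's SCSI key
fence_node -U hosta      # or trigger unfencing for the node explicitly
cman_tool nodes          # confirm both nodes are listed as cluster members again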

Comment 6 davidyangyi 2012-06-30 13:15:40 UTC
Thank you.
To clarify, I pulled out the ethernet cable and plugged it back in, not the fiber to the array.

But I can't just restart cman, or hosta removes hostb from the cluster.

Here is the fenced.log on hostb
Jun 26 22:51:25 fenced daemon node 1 stateful merge
Jun 26 22:51:25 fenced daemon node 1 kill due to stateful merge
Jun 26 22:51:25 fenced telling cman to remove nodeid 1 from cluster
Jun 26 22:51:25 fenced daemon cpg_dispatch error 2
Jun 26 22:51:25 fenced cluster is down, exiting
Jun 26 22:51:25 fenced daemon cpg_dispatch error 2


Here is the fenced.log on hosta
Jun 26 22:42:48 fenced daemon node 2 stateful merge
Jun 26 22:42:48 fenced daemon node 2 kill due to stateful merge
Jun 26 22:42:48 fenced telling cman to remove nodeid 2 from cluster
Jun 26 22:42:48 fenced daemon_member 2 zero proto
Jun 26 22:43:04 fenced cluster node 2 removed seq 1224
Jun 26 22:43:04 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.1
Jun 26 22:43:04 fenced daemon node 1 max 1.1.1.0 run 1.1.1.1
Jun 26 22:43:04 fenced daemon node 1 join 1340720582 left 0 local quorum 1340720582
Jun 26 22:43:04 fenced fenced:daemon conf 1 0 1 memb 1 join left 2
Jun 26 22:43:04 fenced fenced:daemon ring 1:1224 1 memb 1

Both nodes think they are the master, and each tries to remove the other from the cluster.

In this situation, I have to reboot hosta so that it rejoins the cluster as the slave.
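
Each node's view of the fence domain can be checked directly with the cluster tools; a small sketch (output omitted here):

fence_tool ls      # fence domain membership as seen by the local node
cman_tool nodes    # cluster membership as seen by the local node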

Comment 7 Fabio Massimo Di Nitto 2012-06-30 13:36:38 UTC
This is absolutely normal behaviour again.

You cannot plug the cable back without stopping the cluster on the failed node first.

Since you are running RHEL 6.2, I strongly recommend that you contact GSS, who will point you towards the correct documentation on how to use and administer a cluster.
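
In other words, the recovery has to start with the cluster stack stopped on the failed node. A minimal sketch of that sequence on hosta, assuming the standard RHEL 6 service scripts:

# on hosta, while the cable is still unplugged
service rgmanager stop
service cman stop
# reconnect the ethernet cable, then bring the stack back up (cman start performs unfencing)
service cman start
service rgmanager start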

Comment 8 davidyangyi 2012-07-01 02:23:05 UTC
Fabio Massimo Di Nitto

Thank you. How can I contact GSS?

Comment 9 Fabio Massimo Di Nitto 2012-07-01 06:15:35 UTC
GSS is Red Hat Global Support Services. You can find information on how to open a ticket on the www.redhat.com website.