Bug 836429 - hosta can never rejoin the cluster because of "reservation conflict" unless reboot -f hosta.
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Ryan O'Hara
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2012-06-29 01:21 EDT by davidyangyi
Modified: 2012-07-01 02:15 EDT
CC List: 9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-30 09:36:38 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description davidyangyi 2012-06-29 01:21:26 EDT
Description of problem:
I have a RHEL 6.2 RHCS cluster using an iSCSI LUN as the fence device (SCSI persistent reservations). When I unplug the network cable of the master (hosta), the slave (hostb) fences it successfully and becomes the master, but hosta still thinks it is the master.
When I plug the cable back into hosta, hosta can never rejoin the cluster because of a "reservation conflict" unless I run reboot -f on hosta.
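
For reference, the conflict can be seen directly on the shared LUN: fence_scsi uses SCSI-3 persistent reservations, and fencing removes the victim node's registration key, so the victim's writes are rejected with "reservation conflict" until the node re-registers (unfencing). A minimal sketch of how to inspect this with sg3_utils, assuming the shared LUN appears as /dev/sdb (the device name here is hypothetical):

# List the registration keys on the LUN; the fenced node's key will be missing.
sg_persist --in --read-keys --device=/dev/sdb

# Show the active reservation and the key that holds it.
sg_persist --in --read-reservation --device=/dev/sdb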

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. hosta is the master, hostb is the slave.
2. Use an iSCSI LUN as the fence device.
3. Unplug hosta's network cable and wait for hostb to become the master.
4. Plug hosta's cable back in.
  
Actual results:


Expected results:


Additional info:
Comment 1 davidyangyi 2012-06-29 01:23:24 EDT
cluster.conf

<?xml version="1.0"?>
<cluster config_version="41" name="bcec_img">
        <clusternodes>
                <clusternode name="hosta" nodeid="1">
                        <fence>
                                <method name="scsi">
                                        <device name="scsifence"/>
                                </method>
                        </fence>
                        <unfence>
                                <device action="on" name="scsifence"/>
                        </unfence>
                </clusternode>
                <clusternode name="hostb" nodeid="2">
                        <fence>
                                <method name="scsi">
                                        <device name="scsifence"/>
                                </method>
                        </fence>
                        <unfence>
                                <device action="on" name="scsifence"/>
                        </unfence>
                </clusternode>
        </clusternodes>
        <cman broadcast="yes" expected_votes="1" two_node="1"/>
        <rm>
                <resources>
                        <ip address="172.16.200.31/25" monitor_link="on" sleeptime="10"/>
                        <fs device="/dev/mapper/mpathap1" fsid="7094" mountpoint="/bcec_images" name="bcec_imgages" quick_status="on"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="bcec_img" nofailback="1" ordered="0" restricted="0">
                                <failoverdomainnode name="hosta" priority="2"/>
                                <failoverdomainnode name="hostb" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <service domain="bcec_img" exclusive="1" name="bcec_img" recovery="relocate">
                        <ip ref="172.16.200.31/25"/>
                </service>
        </rm>
        <fence_daemon post_join_delay="25"/>
        <fencedevices>
                <fencedevice agent="fence_scsi" name="scsifence"/>
        </fencedevices>
        <logging debug="on">
                <logging_daemon debug="on" name="rgmanager"/>
                <logging_daemon debug="on" name="corosync"/>
                <logging_daemon debug="on" name="fenced"/>
                <logging_daemon debug="on" name="dlm_controld"/>
                <logging_daemon debug="on" name="corosync" subsys="CMAN"/>
        </logging>
</cluster>
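
A note on the configuration above: the fence_scsi device is declared without a devices= attribute, in which case the agent is supposed to discover the devices to operate on from the clustered LVM volumes. A node's registration can also be checked or restored by hand; a sketch, assuming the node names from this cluster.conf:

# Check whether hosta's key is still registered on the fence device(s).
fence_scsi -n hosta -o status

# Re-run the <unfence> action for hosta (re-registers its key).
fence_node -U hosta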
Comment 3 Steven Dake 2012-06-29 14:54:56 EDT
If the problem exists, it is in the base cluster, not corosync. Reassigning to cluster for triage.
Comment 4 Fabio Massimo Di Nitto 2012-06-30 00:01:46 EDT
Reassigning to the fence_scsi maintainer to confirm, but it sounds like the correct behaviour to me.
Comment 5 Fabio Massimo Di Nitto 2012-06-30 00:03:13 EDT
If anything, cman has to be restarted in order to perform "unfencing":

plug the fiber back
restart cman -> unfencing
node rejoins the cluster.
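
Spelled out as commands, the sequence on the failed node would be roughly the following; a sketch, assuming the stock RHEL 6 init scripts (rgmanager is stopped first since it depends on cman):

# On hosta, after reconnecting the cable:
service rgmanager stop
service cman stop       # leave the cluster cleanly
service cman start      # the init script performs unfencing per <unfence>
service rgmanager start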
Comment 6 davidyangyi 2012-06-30 09:15:40 EDT
Thank you.
I mean pulling the ethernet cable out and plugging it back in, not the fiber to the array.

But I can't restart cman, or hosta will remove hostb from the cluster.

Here is the fenced.log on hostb
Jun 26 22:51:25 fenced daemon node 1 stateful merge
Jun 26 22:51:25 fenced daemon node 1 kill due to stateful merge
Jun 26 22:51:25 fenced telling cman to remove nodeid 1 from cluster
Jun 26 22:51:25 fenced daemon cpg_dispatch error 2
Jun 26 22:51:25 fenced cluster is down, exiting
Jun 26 22:51:25 fenced daemon cpg_dispatch error 2


Here is the fenced.log on hosta
Jun 26 22:42:48 fenced daemon node 2 stateful merge
Jun 26 22:42:48 fenced daemon node 2 kill due to stateful merge
Jun 26 22:42:48 fenced telling cman to remove nodeid 2 from cluster
Jun 26 22:42:48 fenced daemon_member 2 zero proto
Jun 26 22:43:04 fenced cluster node 2 removed seq 1224
Jun 26 22:43:04 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.1
Jun 26 22:43:04 fenced daemon node 1 max 1.1.1.0 run 1.1.1.1
Jun 26 22:43:04 fenced daemon node 1 join 1340720582 left 0 local quorum 1340720582
Jun 26 22:43:04 fenced fenced:daemon conf 1 0 1 memb 1 join left 2
Jun 26 22:43:04 fenced fenced:daemon ring 1:1224 1 memb 1

Both nodes think they are the master, and each will remove the other from the cluster.

In this situation, I have to reboot hosta so that it rejoins the cluster as the slave.
Comment 7 Fabio Massimo Di Nitto 2012-06-30 09:36:38 EDT
This is absolutely normal behaviour again.

You cannot plug the cable back without stopping the cluster on the failed node first.

Since you are running RHEL 6.2, I strongly recommend you contact GSS, who will point you towards the correct documentation on how to use and administer a cluster.
Comment 8 davidyangyi 2012-06-30 22:23:05 EDT
Fabio Massimo Di Nitto

Thank you, how can I contact GSS?
Comment 9 Fabio Massimo Di Nitto 2012-07-01 02:15:35 EDT
GSS is Red Hat Global Support Services. You can find information on how to open a ticket on the www.redhat.com website.
