Bug 144170

Summary: Regain quorum: "fencing deferred to 4294967295"
Product: [Retired] Red Hat Cluster Suite
Reporter: Derek Anderson <danderso>
Component: fence
Assignee: Adam "mantis" Manthei <amanthei>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 4
CC: cluster-maint
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-01-11 19:37:27 UTC

Description Derek Anderson 2005-01-04 21:44:54 UTC
Description of problem:
Create quorum in a 3 node cluster (link-10,link-11,link-12).  Remove
two of the nodes (link-10,link-12) so link-11 loses quorum and goes
into "Activity blocked" mode.  Bring back link-10 with: ccsd;
cman_tool join; fence_tool join.

When link-10 joins the fence domain this appears in the messages file:
Jan  4 15:39:00 link-10 fenced[2489]: fencing deferred to 4294967295

###
### link-11's view of the inquorate cluster:
###
[root@link-11 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   X   link-10
   2    1    3   M   link-11
   3    1    3   X   link-12
[root@link-11 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2]

[root@link-11 root]# cat /proc/cluster/status
Protocol version: 4.0.1
Config version: 1
Cluster name: MILTON
Cluster ID: 4812
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 3
Total_votes: 1
Quorum: 2  Activity blocked
Active subsystems: 3
Node addresses: 192.168.44.161

###
### link-11's view of the cluster after link-10 rejoins
###
[root@link-11 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-10
   2    1    3   M   link-11
   3    1    3   X   link-12
[root@link-11 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2]

[root@link-11 root]# cat /proc/cluster/status
Protocol version: 4.0.1
Config version: 1
Cluster name: MILTON
Cluster ID: 4812
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 3
Total_votes: 2
Quorum: 2
Active subsystems: 3
Node addresses: 192.168.44.161
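
The two status dumps above are consistent with a simple-majority quorum rule. The sketch below assumes that rule (threshold = expected votes / 2 + 1); the function names are hypothetical and this is not taken from the cman source.

```c
#include <stdbool.h>

/* Assumed simple-majority rule: more than half of the expected votes.
 * With Expected_votes = 3 this yields the "Quorum: 2" shown above. */
unsigned int quorum_threshold(unsigned int expected_votes)
{
    return expected_votes / 2 + 1;
}

/* A cluster is quorate once the votes present reach the threshold. */
bool is_quorate(unsigned int total_votes, unsigned int expected_votes)
{
    return total_votes >= quorum_threshold(expected_votes);
}
```

Under this assumption, is_quorate(1, 3) is false (link-11 alone, "Activity blocked") and is_quorate(2, 3) is true (after link-10 rejoins).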

Version-Release number of selected component (if applicable):
Latest 6.1 RPMS built Wed 15 Dec 2004 01:13:08 PM CST

How reproducible:
First time this has been noticed

Steps to Reproduce:
1. 3 nodes quorate and joined to fence domain
2. Remove 2 nodes simultaneously
3. Join one of the two back to regain quorum and rejoin fence domain
  
Actual results:
fencing deferred to an unknown node number (4294967295).

Expected results:
fencing action assumed by one of the now quorate cluster members.

Additional info:

Comment 1 David Teigland 2005-01-06 15:51:38 UTC
Fixing this is as simple as printing the node name instead of the node
number.  In this situation the node number hasn't been set yet
(4294967295 == -1) but the node name is available.
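
A minimal sketch of why an unset node number of -1 shows up as 4294967295: converting -1 to an unsigned 32-bit integer wraps to 2^32 - 1. The sentinel macro and function name are hypothetical, and the assumption that the log line formats the number as unsigned is inferred from the message, not from the fenced source.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sentinel meaning "node number not assigned yet". */
#define NODEID_UNSET (-1)

/* Render a node id as an unsigned 32-bit decimal, which is how the
 * log message in this report appears to have been formatted. */
int format_nodeid(char *buf, size_t len, int nodeid)
{
    /* Conversion to uint32_t is modulo 2^32, so -1 becomes 4294967295. */
    return snprintf(buf, len, "%" PRIu32, (uint32_t)nodeid);
}
```

Calling format_nodeid(buf, sizeof buf, NODEID_UNSET) fills buf with "4294967295", matching the value in the original log line.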

Comment 2 David Teigland 2005-01-07 16:13:53 UTC
In the cases where we'd print -1, we don't actually know which node
we're deferring to; we only know it's not us.  We now print the
node name if we know it, or "prior member" if we don't.

/cvs/cluster/cluster/fence/fenced/recover.c,v  <--  recover.c
new revision: 1.10.2.1; previous revision: 1.10
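
The behaviour described in this comment can be sketched as a small fallback helper. This is an illustration only: defer_target_label is a hypothetical name, and the code is not the actual change made in recover.c.

```c
#include <stddef.h>

/* Hypothetical helper: choose what to print after "fencing deferred to".
 * Returns the node name when one is known, otherwise the generic
 * "prior member" label described in the comment above. */
const char *defer_target_label(const char *node_name)
{
    if (node_name != NULL && node_name[0] != '\0')
        return node_name;
    return "prior member";
}
```

A caller could then log with a "%s" format and defer_target_label(name), so an unknown node never reaches the log as a raw number.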


Comment 3 Derek Anderson 2005-01-11 19:37:27 UTC
Verified.  Log message now looks like:

Jan 11 13:33:32 link-12 fenced[2634]: fencing deferred to prior member

[root@link-11 root]# fenced -V
fenced 1.7. (built Jan 10 2005 16:22:11)