Bug 127021 - Fence Domain service remains in join state, fencing all cluster nodes
Summary: Fence Domain service remains in join state, fencing all cluster nodes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-06-30 18:44 UTC by Derek Anderson
Modified: 2010-01-12 02:53 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-08-26 22:00:42 UTC
Embargoed:



Description Derek Anderson 2004-06-30 18:44:59 UTC
Description of problem:
The vast majority of the time (all except once that I've noticed) the
Fence Domain service fails to start properly on all nodes.  The result
is that fencing doesn't work, and it's not possible to cleanly shut
down cman because a dependent service is running.

This is a 3-node cluster.  Run these steps on all nodes (sketched as a script below):
- modprobe dm-mod, gfs, and lock_dlm
- ccsd
- cman_tool join (wait for quorum)
- fence_tool join
- clvmd (not necessary, but illustrates the services point)
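
A minimal sketch of that startup sequence as run on each node (the
commands are the ones listed above; the comments are mine):

  modprobe dm-mod      # device-mapper
  modprobe gfs         # GFS filesystem module
  modprobe lock_dlm    # DLM locking module for GFS
  ccsd                 # cluster configuration daemon
  cman_tool join       # join the cluster; wait for quorum before continuing
  fence_tool join      # join the fence domain (starts fenced)
  clvmd                # optional, but it shows up in /proc/cluster/services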

The /proc/cluster/services then looks like this on the three nodes:
[root@link-10 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 join     
S-6,20,1
[2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]
[root@link-11 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join     
S-1,280,3
[]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 1 3]
[root@link-12 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join     
S-1,80,3
[]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]

Here's /proc/cluster/nodes:
[root@link-12 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-12.lab.msp.redhat.com
   2    1    3   M   link-10.lab.msp.redhat.com
   3    1    3   M   link-11.lab.msp.redhat.com

So it looks like node 2 is in the default fence domain and the other
two are not in a domain at all.  There is nothing printed to the log
file that I can see that would help determine what is happening here.


Version-Release number of selected component (if applicable):


How reproducible:
99%

Steps to Reproduce:
1. modprobe dm-mod, gfs, and lock_dlm
2. Start ccsd
3. Run cman_tool join and fence_tool join
4. Check /proc/cluster/services or otherwise try fencing
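
A quick way to check for the stuck state after those steps (a healthy
fence domain shows "run" in the State column; this bug leaves it in
"join"):

  cat /proc/cluster/nodes                            # all nodes should be members
  grep -A 2 "Fence Domain" /proc/cluster/services    # look at the State column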
  
Actual results:


Expected results:


Additional info:

Comment 1 Amir Guindehi 2004-06-30 22:22:08 UTC
I had the same problem until I found out the following:

Starting fence_tool with -t 120 allows the other nodes to join the
cluster (within 120 seconds) before this node joins the fence domain.
That way all nodes of the cluster are already members when the first
one joins the fence domain.
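
A sketch of that workaround (the 120-second timeout is the value
suggested above; see fence_tool(8) on your build for the exact option
syntax):

  cman_tool join            # on every node
  fence_tool join -t 120    # wait up to 120 seconds for the other members
                            # before joining the fence domain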

Comment 2 Derek Anderson 2004-07-02 14:06:42 UTC
This is not the same issue as starting fenced and then having other
nodes join the cluster.  As /proc/cluster/nodes shows, cman_tool join
has been run on each of the three nodes and we have full quorum:
[root@link-12 block]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-12.lab.msp.redhat.com
   2    1    3   M   link-11.lab.msp.redhat.com
   3    1    3   M   link-10.lab.msp.redhat.com

This may not be noticeable when the fence agent used is fence_manual;
I checked by setting my fence agent to /bin/true.  All nodes are
fenced, but since that agent doesn't physically do anything to the
machine, all nodes eventually join the fence domain.

From the fenced debug output you can see that the fence agent is
exec'd for each node in the cluster.  Raising priority on this.

[root@link-12 block]# fenced -D
Command Line Arguments:
  name = default
  debug = 1
fenced: start:
fenced:   event_id    = 1
fenced:   last_stop   = 0
fenced:   last_start  = 1
fenced:   last_finish = 0
fenced:   node_count  = 1
fenced:   start_type  = join
fenced: members:
fenced:   nodeid = 1 "link-12.lab.msp.redhat.com"
fenced: do_recovery stop 0 start 1 finish 0
fenced: our nodeid 1
fenced: add first victim 0
fenced: add first victim 0
fenced: add first victim 0
fenced: fencing node "link-10"
fenced: fencing node "link-11"
fenced: fencing node "link-12"
fenced: finish:
fenced:   event_id    = 1
fenced:   last_stop   = 0
fenced:   last_start  = 1
fenced:   last_finish = 1
fenced:   node_count  = 0


Comment 3 David Teigland 2004-07-02 14:36:29 UTC
Could you do the following (a command sketch follows this list):
- run cman_tool join on all nodes
- wait for all nodes to be members (by watching /proc/cluster/nodes)
- save /proc/cluster/nodes for me to look at
- verify that /proc/cluster/services is empty at this point
- start fenced -D on /one/ node (logging all output to a file)
- start fenced (normally is fine) on the remaining nodes
  (the first fenced -D is accumulating output throughout)
- when there is no more output from the "fenced -D" copy all
  the saved output somewhere where I can take a look along
  with the initial output from /proc/cluster/nodes
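
A sketch of that collection procedure (the log path is just an example):

  # On every node:
  cman_tool join
  cat /proc/cluster/nodes       # once all nodes show as members, save this
  cat /proc/cluster/services    # verify this is empty

  # On one node only, keeping all debug output:
  fenced -D 2>&1 | tee /tmp/fenced-debug.log

  # On the remaining nodes:
  fenced

  # When "fenced -D" produces no more output, collect /tmp/fenced-debug.log
  # along with the saved /proc/cluster/nodes output.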

Comment 4 Dean Jansa 2004-07-22 14:41:17 UTC
Now that I have installed Net::Telnet I see the same thing as Derek on my
cluster...

Comment 5 Derek Anderson 2004-07-22 15:06:01 UTC
[root@link-10 cluster]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-10.lab.msp.redhat.com
   2    1    3   M   link-11.lab.msp.redhat.com
   3    1    3   M   link-12.lab.msp.redhat.com
[root@link-10 root]# cat /proc/cluster/services

Service          Name                              GID LID State     Code
[root@link-10 root]#

FENCED INFO:
Started fenced -D on link-10, then started fenced on link-11 and link-12.

Command Line Arguments:
  name = default
  debug = 1
fenced: start:
fenced:   event_id    = 1
fenced:   last_stop   = 0
fenced:   last_start  = 1
fenced:   last_finish = 0
fenced:   node_count  = 1
fenced:   start_type  = join
fenced: members:
fenced:   nodeid = 2 "link-10.lab.msp.redhat.com"
fenced: do_recovery stop 0 start 1 finish 0
fenced: our nodeid 2
fenced: add first victim 0
fenced: add first victim 0
fenced: add first victim 0
fenced: fencing node "link-10"
fenced: fencing node "link-11"
fenced: fencing node "link-12"
fenced: finish:
fenced:   event_id    = 1
fenced:   last_stop   = 0
fenced:   last_start  = 1
fenced:   last_finish = 1
fenced:   node_count  = 0
fenced: stop:
fenced:   event_id    = 0
fenced:   last_stop   = 1
fenced:   last_start  = 1
fenced:   last_finish = 1
fenced:   node_count  = 0
fenced: start:
fenced:   event_id    = 2
fenced:   last_stop   = 1
fenced:   last_start  = 2
fenced:   last_finish = 1
fenced:   node_count  = 2
fenced:   start_type  = join
fenced: members:
fenced:   nodeid = 2 "link-10.lab.msp.redhat.com"
fenced:   nodeid = 3 "link-11.lab.msp.redhat.com"
fenced: do_recovery stop 1 start 2 finish 1
fenced: finish:
fenced:   event_id    = 2
fenced:   last_stop   = 1
fenced:   last_start  = 2
fenced:   last_finish = 2
fenced:   node_count  = 0
fenced: stop:
fenced:   event_id    = 0
fenced:   last_stop   = 2
fenced:   last_start  = 2
fenced:   last_finish = 2
fenced:   node_count  = 0
fenced: start:
fenced:   event_id    = 3
fenced:   last_stop   = 2
fenced:   last_start  = 3
fenced:   last_finish = 2
fenced:   node_count  = 3
fenced:   start_type  = join
fenced: members:
fenced:   nodeid = 2 "link-10.lab.msp.redhat.com"
fenced:   nodeid = 3 "link-11.lab.msp.redhat.com"
fenced:   nodeid = 1 "link-12.lab.msp.redhat.com"
fenced: do_recovery stop 2 start 3 finish 2
fenced: finish:
fenced:   event_id    = 3
fenced:   last_stop   = 2
fenced:   last_start  = 3
fenced:   last_finish = 3
fenced:   node_count  = 0

Comment 6 David Teigland 2004-07-22 15:41:22 UTC
On my eight new test machines I see this problem immediately. I'll
get to work on this one right away and hopefully have a fix quickly.

Comment 7 David Teigland 2004-07-23 08:52:25 UTC
This should be fixed with today's checkins.

Comment 8 Corey Marthaler 2004-08-26 22:00:42 UTC
This fix has been verified. 

The fence domain does reach the "run" state and fencing does work now.

Comment 9 Kiersten (Kerri) Anderson 2004-11-16 19:09:26 UTC
Updating version to the right level in the defects.  Sorry for the storm.

