Bug 144826

Summary: fence_tool leave: not leaving fence domain, returning success
Product: [Retired] Red Hat Cluster Suite
Reporter: Derek Anderson <danderso>
Component: fence
Assignee: David Teigland <teigland>
Status: CLOSED NEXTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 4
CC: cluster-maint
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-02-09 20:58:33 UTC

Description Derek Anderson 2005-01-11 19:49:34 UTC
Description of problem:
This started happening with today's RPMs.  On each node in the cluster
run 'ccsd; modprobe cman; cman_tool join; fence_tool join'.  Then
attempt to leave the fence domain with 'fence_tool leave'.  The
command returns 0, but the node does not leave the fence domain.  I have
to 'kill -9 <fenced_pid>' to continue shutting down the cluster.

[root@link-12 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

[root@link-12 root]# fence_tool leave
[root@link-12 root]# echo $?
0
[root@link-12 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]
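
For anyone scripting this check, here is a minimal sketch that pulls the fence domain name and state out of the services listing shown above. It assumes the row layout is exactly as in that output (label, quoted name, GID, LID, state). The function name fence_domain_state is hypothetical and is not part of fence_tool or cman.

```python
import re

# Matches a row such as:
# Fence Domain:    "default"                           1   2 run       -
# The column layout is assumed from the listing in this report.
_FENCE_ROW = re.compile(r'^Fence Domain:\s+"([^"]*)"\s+(\d+)\s+(\d+)\s+(\S+)')


def fence_domain_state(services_text):
    """Return (name, state) for the first fence domain row, or None if absent."""
    for line in services_text.splitlines():
        match = _FENCE_ROW.match(line)
        if match:
            return match.group(1), match.group(4)
    return None


sample = '''Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]
'''
print(fence_domain_state(sample))
```

Run against the listing captured after 'fence_tool leave', this still reports the "default" domain in state "run", which is the symptom described here.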

Version-Release number of selected component (if applicable):
[root@link-12 root]# fenced -V
fenced 1.7. (built Jan 10 2005 16:22:11)


Comment 1 Dean Jansa 2005-01-11 19:51:13 UTC
FWIW -- I see the same thing on my 6 node cluster...

Comment 2 David Teigland 2005-01-14 07:53:26 UTC
This is related to an old bz where we said that fence_tool join/leave
are asynchronous commands.  They return success if they are able to
initiate the join/leave, but don't wait around to see what happens.

Unfortunately, there's no good way of making the join or the leave
synchronous without a bit of work.  One quick hack is to make
fence_tool watch /proc/cluster/services to determine when the join
or leave is complete, but at that point we have "service fenced stop".
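
A rough sketch of that quick hack, written as an external poller rather than as a change to fence_tool itself. It assumes that the "Fence Domain:" row disappears from /proc/cluster/services once the leave completes, which this report does not confirm. The names wait_for_fence_leave and read_proc_services are hypothetical.

```python
import time


def read_proc_services(path="/proc/cluster/services"):
    """Read the services listing; the path is the one used in this report."""
    with open(path) as handle:
        return handle.read()


def wait_for_fence_leave(read_services, timeout=30.0, interval=1.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until no 'Fence Domain:' row is listed, or until timeout expires.

    Returns True if the row went away (assumed to mean the leave finished),
    False if it was still present when the timeout was reached.
    """
    deadline = clock() + timeout
    while True:
        if "Fence Domain:" not in read_services():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

After running 'fence_tool leave', a caller could invoke wait_for_fence_leave(read_proc_services, timeout=60) and treat a False result as a failed leave instead of relying on the command's exit status.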


Comment 3 Dean Jansa 2005-01-14 15:27:31 UTC
Dave --

I'm not sure this is what we are seeing.  I realize you exit without
waiting around, but the nodes never leave the fence domain.  So the
issue is we can't leave, not that we don't get an error.  (Although
that is an issue as well, but a different one)

If I issue a fence_tool leave on a single node, and wait for, oh 15
minutes -- I'm still in the fence domain.  No messages in the logs,
nothing on the console to indicate something is wrong.

/proc/cluster/services looks like:

Fence Domain:    "default"                           1   2 run       -
[1 5 3 4 2]

So that is the crux of this bug.

Comment 5 David Teigland 2005-01-18 11:27:59 UTC
Sorry about that, didn't look closely enough.

I believe you're seeing a bug I created on Jan 10 while fixing another
fenced bug.  I fixed it the next day, Jan 11, but not quickly enough
for it to get into the rpm builds.


Comment 6 Derek Anderson 2005-02-09 20:58:33 UTC
Fix verified, fence-1.15-7.