Description of problem:
This started happening with today's RPMs. On each node in the cluster, run 'ccsd; modprobe cman; cman_tool join; fence_tool join'. Then attempt to leave the fence domain with 'fence_tool leave'. The command returns 0, but the node does not leave the fence domain. I have to 'kill -9 <fenced_pid>' to continue shutting down the cluster.

[root@link-12 root]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2 3]
[root@link-12 root]# fence_tool leave
[root@link-12 root]# echo $?
0
[root@link-12 root]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1 2 3]

Version-Release number of selected component (if applicable):
[root@link-12 root]# fenced -V
fenced 1.7. (built Jan 10 2005 16:22:11)

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
FWIW -- I see the same thing on my 6 node cluster...
This is related to an old bz where we said that fence_tool join/leave are asynchronous commands. They return success if they are able to initiate the join/leave, but don't wait around to see what happens. Unfortunately, there's no good way of making the join or the leave synchronous without a bit of work. One quick hack is to make fence_tool watch /proc/cluster/services to determine when the join or leave is complete, but at that point we have "service fenced stop".
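The "quick hack" of watching /proc/cluster/services could look roughly like the sketch below. It is only an illustration: the function names in_fence_domain and wait_for_fence_leave are hypothetical and not part of fence_tool or fenced, and it assumes that a node which has left the fence domain no longer shows a "Fence Domain:" line in the services file.

```shell
#!/bin/sh
# Hypothetical helpers; not part of fence_tool or fenced.
# Assumption: once a node has left the fence domain, the services file
# no longer contains a line starting with "Fence Domain:".

# Return 0 while the given services file still lists a fence domain.
# A missing or unreadable file is treated as "not in the domain".
in_fence_domain() {
    grep -q '^Fence Domain:' "$1" 2>/dev/null
}

# Poll the services file once per second until the fence domain entry
# disappears or the timeout (in seconds) expires.
# Returns 0 if the node left, 1 if it is still a member at timeout.
wait_for_fence_leave() {
    services_file=${1:-/proc/cluster/services}
    timeout=${2:-30}
    elapsed=0
    while in_fence_domain "$services_file"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

A caller would run 'fence_tool leave' first and then 'wait_for_fence_leave /proc/cluster/services 60', treating a non-zero return as "still in the fence domain". Polling like this only reports that the leave did not complete; it does not make the leave itself synchronous.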
Dave -- I'm not sure this is what we are seeing. I realize you exit without waiting around, but the nodes never leave the fence domain. So the issue is that we can't leave, not that we don't get an error. (Although that is an issue as well, but a different one.)

If I issue a fence_tool leave on a single node and wait for, oh, 15 minutes -- I'm still in the fence domain. No messages in the logs, nothing on the console to indicate something is wrong. /proc/cluster/services looks like:

Fence Domain: "default" 1 2 run -
[1 5 3 4 2]

So that is the crux of this bug.
Sorry about that, didn't look closely enough. I believe you're seeing a bug I created on Jan 10 while fixing another fenced bug. I fixed it the next day, Jan 11, but not quickly enough for it to get into the rpm builds.
Fix verified, fence-1.15-7.