Description of problem:

The vast majority of the time (all except once that I've noticed) the Fence
Domain service fails to start properly on all nodes. The result is that
fencing doesn't work, and it's not possible to cleanly shut down cman because
a dependent service is running.

This is a 3 node cluster. Run these steps on all nodes:

- modprobe dm-mod, gfs, and lock_dlm
- ccsd
- cman_tool join (wait for quorum)
- fence_tool join
- clvmd (not necessary, but illustrates the services point)

/proc/cluster/services then looks like this on the three nodes:

[root@link-10 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 join      S-6,20,1
[2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]

[root@link-11 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,280,3
[]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 1 3]

[root@link-12 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,3
[]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]

Here's /proc/cluster/nodes:

[root@link-12 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-12.lab.msp.redhat.com
   2    1    3   M   link-10.lab.msp.redhat.com
   3    1    3   M   link-11.lab.msp.redhat.com

So it looks like node [2] is in the default domain and the other two are not
in a domain. There is nothing printed to the log file that I can see to help
determine what is happening here.

Version-Release number of selected component (if applicable):

How reproducible:
99%

Steps to Reproduce:
1. modprobe dm-mod, gfs, and lock_dlm
2. Start ccsd
3. Run cman_tool join and fence_tool join
4. Check /proc/cluster/services or otherwise try fencing

Actual results:

Expected results:

Additional info:
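For reference, here is the per-node startup sequence from the steps above,
spelled out as shell commands (nothing here beyond what is already listed;
waiting for quorum is a manual step):

    # load the kernel modules
    modprobe dm-mod
    modprobe gfs
    modprobe lock_dlm

    # start the cluster infrastructure
    ccsd
    cman_tool join        # then wait for quorum in /proc/cluster/nodes
    fence_tool join
    clvmd                 # not required; only illustrates the services point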
I had the same problem until I found out the following: starting fence_tool
with -t 120 allows the other nodes to join the cluster (within 120 seconds)
before this node joins the fence domain. That way all nodes of the cluster
are available when the first one joins the fence domain.
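For anyone else hitting this, the workaround as I run it looks roughly like
the following; the 120-second value is just what I picked, and the exact
option placement may differ between fence_tool versions:

    # wait up to 120 seconds for the other cluster members to show up
    # before actually joining the fence domain
    fence_tool join -t 120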
This is not the same issue as starting fenced and then having other nodes
join the cluster. As can be seen from /proc/cluster/nodes, we have cman_tool
join started on each of the three nodes and have full quorum:

[root@link-12 block]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-12.lab.msp.redhat.com
   2    1    3   M   link-11.lab.msp.redhat.com
   3    1    3   M   link-10.lab.msp.redhat.com

This may not be noticeable when the fence agent used is fence_manual; I
checked by setting my fence agent to /bin/true. All nodes are fenced, but if
that agent doesn't physically do anything to the machine, all nodes
eventually join the fence domain anyway. From the fenced debug output you
can see that the fence agent is execed for each node in the cluster. Raising
priority on this.

[root@link-12 block]# fenced -D
Command Line Arguments:
  name = default
  debug = 1
fenced: start:
fenced: event_id = 1
fenced: last_stop = 0
fenced: last_start = 1
fenced: last_finish = 0
fenced: node_count = 1
fenced: start_type = join
fenced: members:
fenced: nodeid = 1 "link-12.lab.msp.redhat.com"
fenced: do_recovery stop 0 start 1 finish 0
fenced: our nodeid 1
fenced: add first victim 0
fenced: add first victim 0
fenced: add first victim 0
fenced: fencing node "link-10"
fenced: fencing node "link-11"
fenced: fencing node "link-12"
fenced: finish:
fenced: event_id = 1
fenced: last_stop = 0
fenced: last_start = 1
fenced: last_finish = 1
fenced: node_count = 0
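For the record, the /bin/true trick above is just a dummy fence device in
cluster.conf. Roughly, the fragment on my cluster looks like the following
(node and method names are specific to my setup, and the exact schema may
vary between versions):

    <clusternode name="link-12.lab.msp.redhat.com" votes="1">
      <fence>
        <method name="single">
          <device name="dummy"/>
        </method>
      </fence>
    </clusternode>
    ...
    <fencedevices>
      <!-- "agent" that always exits 0 and never touches the machine -->
      <fencedevice name="dummy" agent="/bin/true"/>
    </fencedevices>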
Could you do the following:

- run cman_tool join on all nodes
- wait for all nodes to be members (by watching /proc/cluster/nodes)
- save /proc/cluster/nodes for me to look at
- verify that /proc/cluster/services is empty at this point
- start fenced -D on /one/ node (logging all output to a file)
- start fenced (starting it normally is fine) on the remaining nodes (the
  first fenced -D is accumulating output throughout)
- when there is no more output from the "fenced -D", copy all the saved
  output somewhere where I can take a look, along with the initial output
  from /proc/cluster/nodes

A shell sketch of these steps is below.
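One way to capture what I'm after, as a rough sketch (the file names are
only examples; adjust as you like):

    # on each node, after cman_tool join and once all nodes are members
    cat /proc/cluster/nodes > /tmp/nodes-before-fenced.txt
    cat /proc/cluster/services        # should show only the header line

    # on one node only, keep fenced in debug mode and log everything
    fenced -D > /tmp/fenced-debug.log 2>&1 &

    # on the remaining nodes
    fenced

    # when /tmp/fenced-debug.log stops growing, send it along with
    # /tmp/nodes-before-fenced.txt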
Now that I've installed Net::Telnet, I see the same thing as Derek on my
cluster...
[root@link-10 cluster]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-10.lab.msp.redhat.com
   2    1    3   M   link-11.lab.msp.redhat.com
   3    1    3   M   link-12.lab.msp.redhat.com

[root@link-10 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
[root@link-10 root]#

FENCED INFO: Started fenced -D on link-10, then started fenced on link-11
and link-12.

Command Line Arguments:
  name = default
  debug = 1
fenced: start:
fenced: event_id = 1
fenced: last_stop = 0
fenced: last_start = 1
fenced: last_finish = 0
fenced: node_count = 1
fenced: start_type = join
fenced: members:
fenced: nodeid = 2 "link-10.lab.msp.redhat.com"
fenced: do_recovery stop 0 start 1 finish 0
fenced: our nodeid 2
fenced: add first victim 0
fenced: add first victim 0
fenced: add first victim 0
fenced: fencing node "link-10"
fenced: fencing node "link-11"
fenced: fencing node "link-12"
fenced: finish:
fenced: event_id = 1
fenced: last_stop = 0
fenced: last_start = 1
fenced: last_finish = 1
fenced: node_count = 0
fenced: stop:
fenced: event_id = 0
fenced: last_stop = 1
fenced: last_start = 1
fenced: last_finish = 1
fenced: node_count = 0
fenced: start:
fenced: event_id = 2
fenced: last_stop = 1
fenced: last_start = 2
fenced: last_finish = 1
fenced: node_count = 2
fenced: start_type = join
fenced: members:
fenced: nodeid = 2 "link-10.lab.msp.redhat.com"
fenced: nodeid = 3 "link-11.lab.msp.redhat.com"
fenced: do_recovery stop 1 start 2 finish 1
fenced: finish:
fenced: event_id = 2
fenced: last_stop = 1
fenced: last_start = 2
fenced: last_finish = 2
fenced: node_count = 0
fenced: stop:
fenced: event_id = 0
fenced: last_stop = 2
fenced: last_start = 2
fenced: last_finish = 2
fenced: node_count = 0
fenced: start:
fenced: event_id = 3
fenced: last_stop = 2
fenced: last_start = 3
fenced: last_finish = 2
fenced: node_count = 3
fenced: start_type = join
fenced: members:
fenced: nodeid = 2 "link-10.lab.msp.redhat.com"
fenced: nodeid = 3 "link-11.lab.msp.redhat.com"
fenced: nodeid = 1 "link-12.lab.msp.redhat.com"
fenced: do_recovery stop 2 start 3 finish 2
fenced: finish:
fenced: event_id = 3
fenced: last_stop = 2
fenced: last_start = 3
fenced: last_finish = 3
fenced: node_count = 0
On my eight new test machines I see this problem immediately. I'll get to work on this one right away and hopefully have a fix quickly.
This should be fixed with today's checkins.
This fix has been verified. The fence domain does reach the "run" state and fencing does work now.
Updating version to the right level in the defects. Sorry for the storm.