Description of problem:
There are two nodes in my cluster with a qdisk. I executed "/etc/init.d/network stop" on node1. On node2, clustat shows everything is OK, and the log shows node1 was fenced successfully. On node1, however, I found:

[root@node1 ~]# clustat
Cluster Status for Ask8TBxVaX2qyMl0 @ Thu Dec 19 15:42:40 2013
Member Status: Quorate

 Member Name                ID   Status
 ------ ----                ---- ------
 192.168.35.11              1    Online, rgmanager
 192.168.35.12              2    Online, Local, rgmanager
 HBDEV                      0    Offline, Quorum Disk

 Service Name               Owner (Last)   State
 ------- ----               ----- ------   -----
 vm:aeVYUOcG-XYGKVH-VujB    (none)         disabled

/var/log/messages shows:

Version-Release number of selected component (if applicable):
cman-2.0.115-51

How reproducible:
always

Steps to Reproduce:
1. two-node cluster with qdisk
2. execute "/etc/init.d/network stop" on node1
3. execute "clustat" on node1

Actual results:
node1 is not fenced

Expected results:
node1 should be fenced

Additional info:
Sorry, the logs from node1 are here:

2013-12-19T15:34:29.932373+08:00 err daemon node1 qdiskd[11158]: <err> Qdisk heartbeat send message to address 239.192.103.60 failed, errno=22
2013-12-19T15:34:29.935842+08:00 err daemon node1 qdiskd[11158]: <err> Error writing node ID block 1
2013-12-19T15:34:29.935915+08:00 err daemon node1 qdiskd[11158]: <err> Error writing to quorum disk
2013-12-19T15:34:29.935949+08:00 err daemon node1 qdiskd[11158]: <err> Qdisk heartbeat send message to address 239.192.103.60 failed, errno=22
2013-12-19T15:34:29.935968+08:00 err daemon node1 qdiskd[11158]: <err> Error writing node ID block 1
2013-12-19T15:34:29.935986+08:00 err daemon node1 qdiskd[11158]: <err> Error writing to quorum disk
It's impossible to be sure without seeing the whole configuration. But if node2 said that node1 was fenced and it wasn't, I would check that your fencing configuration is correct. One of the things you must always do before deploying a cluster is to verify that the fence_node command actually fences the nodes.
Created attachment 839503 [details] my cluster configuration
Uploaded my configuration.

Logs on node2 (192.168.35.12):

2013-12-20T17:54:33.848478+08:00 info daemon h35-12 fenced[2424]: 192.168.35.11 not a cluster member after 3 sec post_fail_delay
2013-12-20T17:54:33.849866+08:00 info daemon h35-12 fenced[2424]: fencing node "192.168.35.11"
2013-12-20T17:54:34.207879+08:00 alert user h35-12 python: Fence_agent:Fence begin.
2013-12-20T17:54:35.227584+08:00 alert user h35-12 python: Fence_agent: Params from cman read target:192.168.35.11
2013-12-20T17:54:35.841740+08:00 info local6 h35-12 clurgmgrd[2484]: <info> Waiting for node #1 to be fenced
2013-12-20T17:54:37.035085+08:00 alert user h35-12 python: Fence_agent:Fence timeout:Fence finished.:['/sbin/fence_agent'] 192.168.35.11
2013-12-20T17:54:37.115505+08:00 info daemon h35-12 fenced[2424]: fence "192.168.35.11" success
2013-12-20T17:54:37.842771+08:00 info local6 h35-12 clurgmgrd[2484]: <info> Node #1 fenced; continuing

On node2:

[root@h35-12 ~]# clustat
Cluster Status for Ask8TBxVaX2qyMl0 @ Fri Dec 20 17:56:47 2013
Member Status: Quorate

 Member Name                ID   Status
 ------ ----                ---- ------
 192.168.35.11              1    Offline
 192.168.35.12              2    Online, Local, rgmanager
 HBDEV                      0    Online, Quorum Disk

 Service Name               Owner (Last)   State
 ------- ----               ----- ------   -----
 vm:bfdfLHnJ-hrcCaL-TknD    (none)         stopped

[root@h35-12 ~]# cman_tool nodes -af
Node  Sts   Inc   Joined               Name
   0   M      0   2013-12-19 19:25:22  HBDEV
   1   X    336                        192.168.35.11
       Last fenced:   2013-12-20 17:54:37 by two_nodes_device
   2   M    308   2013-12-19 19:21:02  192.168.35.12
       Addresses: 192.168.35.12

The fence_node command works, and so does unplugging the cable directly: node1 reboots when fencing happens. But when I stop the network service on node1, its cluster status still shows quorate. Node1 should reboot, or at the very least its cluster status should become inquorate, I think.
What is 'fence_agent' and where did it come from? It doesn't look like a supported Red Hat fence agent to me.
We don't have a fence device; 'fence_agent' is a script we wrote to replace one.
In that case you're on your own, I'm afraid. We can only support clusters with valid fencing. Looking at the log I can see that the 'fence agent' you have is reporting some sort of error. If it doesn't actually do anything to reboot the other system, then it's not a fence agent, and you can't expect the other node to know what's been going on. I'm going to close this bug, because your fence agent isn't fencing and that's nothing we can fix.
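For reference when writing a replacement: fenced drives an agent by piping key=value arguments, one per line, to the agent's stdin, and the agent must exit 0 only after the target node has verifiably lost power. A minimal sketch of that stdin contract follows; the key names handled (nodename/port, action/option) are illustrative, and the power-off step itself is site-specific and only a placeholder here:

```shell
#!/bin/sh
# Sketch of the stdin key=value protocol fenced uses to drive fence agents.
# The handled keys and the power step are illustrative assumptions; a real
# agent must actually power-cycle the node via its fence device.
fence_sketch() {
    node=""
    action="reboot"
    while IFS= read -r line; do
        case "$line" in
            nodename=*|port=*)  node="${line#*=}" ;;
            action=*|option=*)  action="${line#*=}" ;;
        esac
    done
    if [ -z "$node" ]; then
        echo "no target node given" >&2
        return 1
    fi
    # Placeholder: a real agent must power-cycle $node here and confirm
    # it is actually down before reporting success.
    echo "would $action node $node via site-specific power device"
    return 0
}

# fenced effectively does: printf 'key=value\n...' | /sbin/fence_agent
printf 'nodename=192.168.35.11\naction=reboot\n' | fence_sketch
```

The exit status is the whole contract: fenced treats agent exit 0 as "the node is down", so an agent that exits 0 without actually powering off the target convinces the rest of the cluster that fencing succeeded while the victim keeps running, which matches the symptoms in this report.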