Bug 1044886

Summary: Stopping the network service does not cause the node to be fenced
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.5
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Target Milestone: rc
Reporter: Michael Yang <michael199089>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint, michael199089, rpeterso, teigland
Type: Bug
Doc Type: Bug Fix
Last Closed: 2013-12-20 10:43:24 UTC
Attachments: my cluster configuration (flags: none)

Description Michael Yang 2013-12-19 07:59:31 UTC
Description of problem:
There are two nodes in my cluster with a qdisk, and I execute "/etc/init.d/network stop" on node1.

On node2, clustat shows everything is OK, and the log shows node1 has been fenced successfully.

But on node1, I found the following:
[root@node1 ~]# clustat
Cluster Status for Ask8TBxVaX2qyMl0 @ Thu Dec 19 15:42:40 2013
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 192.168.35.11                                                       1 Online, rgmanager
 192.168.35.12                                                       2 Online, Local, rgmanager
 HBDEV                                                               0 Offline, Quorum Disk

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 vm:aeVYUOcG-XYGKVH-VujB                                          (none)                                                           disabled   


/var/log/messages shows:



Version-Release number of selected component (if applicable):
cman-2.0.115-51

How reproducible:
always

Steps to Reproduce:
1. two nodes cluster with qdisk
2. execute "/etc/init.d/network stop" on node1
3. execute "clustat" on node1

Actual results:
node1 does not get fenced

Expected results:
node1 should be fenced

Additional info:

Comment 1 Michael Yang 2013-12-19 08:09:34 UTC
Sorry, the logs from node1 are here:

2013-12-19T15:34:29.932373+08:00 err daemon node1 qdiskd[11158]:  <err> Qdisk heartbeat send message to address 239.192.103.60 failed,errno=22
2013-12-19T15:34:29.935842+08:00 err daemon node1 qdiskd[11158]:  <err> Error writing node ID block 1
2013-12-19T15:34:29.935915+08:00 err daemon node1 qdiskd[11158]:  <err> Error writing to quorum disk
2013-12-19T15:34:29.935949+08:00 err daemon node1 qdiskd[11158]:  <err> Qdisk heartbeat send message to address 239.192.103.60 failed,errno=22
2013-12-19T15:34:29.935968+08:00 err daemon node1 qdiskd[11158]:  <err> Error writing node ID block 1
2013-12-19T15:34:29.935986+08:00 err daemon node1 qdiskd[11158]:  <err> Error writing to quorum disk

Comment 2 Christine Caulfield 2013-12-20 09:47:40 UTC
It's impossible to be sure without seeing the whole configuration. But if node2 said that node1 was fenced and it wasn't, I would check that your fencing configuration is correct.

One of the things you must always do before deploying a cluster is to check that the fence_node command does actually fence the nodes.
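
For example (illustrative only; the node name must match what is listed in cluster.conf, which here appears to be the IP address), you could run this from node2 and confirm that node1 really gets power-cycled:

[root@h35-12 ~]# fence_node 192.168.35.11

If that command does not actually reboot node1, then fencing is not usable, regardless of what the logs report.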

Comment 3 Michael Yang 2013-12-20 10:12:08 UTC
Created attachment 839503 [details]
my cluster configuration

Comment 4 Michael Yang 2013-12-20 10:12:32 UTC
Uploaded my configuration.

Logs on node2 (192.168.35.12):

2013-12-20T17:54:33.848478+08:00 info daemon h35-12 fenced[2424]:  192.168.35.11 not a cluster member after 3 sec post_fail_delay
2013-12-20T17:54:33.849866+08:00 info daemon h35-12 fenced[2424]:  fencing node "192.168.35.11"
2013-12-20T17:54:34.207879+08:00 alert user h35-12 python:  Fence_agent:Fence begin.
2013-12-20T17:54:35.227584+08:00 alert user h35-12 python:  Fence_agent: Params from cman read target:192.168.35.11
2013-12-20T17:54:35.841740+08:00 info local6 h35-12 clurgmgrd[2484]:  <info> Waiting for node #1 to be fenced
2013-12-20T17:54:37.035085+08:00 alert user h35-12 python:  Fence_agent:Fence timeout:Fence finished.:['/sbin/fence_agent'] 192.168.35.11
2013-12-20T17:54:37.115505+08:00 info daemon h35-12 fenced[2424]:  fence "192.168.35.11" success
2013-12-20T17:54:37.842771+08:00 info local6 h35-12 clurgmgrd[2484]:  <info> Node #1 fenced; continuing

On node2:

[root@h35-12 ~]# clustat
Cluster Status for Ask8TBxVaX2qyMl0 @ Fri Dec 20 17:56:47 2013
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 192.168.35.11                                                       1 Offline
 192.168.35.12                                                       2 Online, Local, rgmanager
 HBDEV                                                               0 Online, Quorum Disk

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 vm:bfdfLHnJ-hrcCaL-TknD                                          (none)                                                           stopped       

[root@h35-12 ~]# cman_tool nodes -af
Node  Sts   Inc   Joined               Name
   0   M      0   2013-12-19 19:25:22  HBDEV
   1   X    336                        192.168.35.11
       Last fenced:   2013-12-20 17:54:37 by two_nodes_device
   2   M    308   2013-12-19 19:21:02  192.168.35.12
       Addresses: 192.168.35.12 


The fence_node command works OK, and so does unplugging the cable directly: node1 reboots when fencing happens.

But now, when I shut down the network service on node1, its cluster status always shows quorate. Node1 should be rebooted, or at least its cluster status should become inquorate, I think.

Comment 5 Christine Caulfield 2013-12-20 10:30:48 UTC
What is 'fence_agent' and where did it come from? It doesn't look like a supported Red Hat fence agent to me.

Comment 6 Michael Yang 2013-12-20 10:38:56 UTC
We don't have a fence device; 'fence_agent' is a script we use in its place.

Comment 7 Christine Caulfield 2013-12-20 10:43:24 UTC
In that case you're on your own I'm afraid. We can only support clusters with valid fencing.

Looking at the log I can see that the 'fence agent' you have is giving some sort of error. But if it doesn't actually do anything to reboot the other system then it's not a fence agent, and you can't expect the other node to know what's been going on.
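
For what it's worth, here is a rough sketch of the contract any script has to fulfil before it can act as a fence agent (illustrative only: fenced supplies its options as name=value lines on stdin, and power_cycle_host below is a placeholder for a real out-of-band power action such as IPMI or a switched PDU, which your setup does not have):

#!/bin/sh
# Read name=value options from stdin, as fenced supplies them.
target=""
while read line; do
    case "$line" in
        port=*|nodename=*) target="${line#*=}" ;;
    esac
done
[ -n "$target" ] || exit 1

# power_cycle_host is a placeholder for a real power-cycle of the target
# host; printing a message and returning success is not fencing.
if power_cycle_host "$target"; then
    exit 0   # only report success once the node has really been cut off
fi
exit 1

Unless the script does something equivalent to that last step, the rest of the cluster will carry on believing the node was removed when it wasn't.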

I'm going to close this bug, because your fence agent isn't fencing and that's nothing we can fix.