Bug 1044886 - Node is not fenced after stopping the network service
Summary: Node is not fenced after stopping the network service
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-12-19 07:59 UTC by Michael Yang
Modified: 2013-12-20 10:43 UTC
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-12-20 10:43:24 UTC
Target Upstream Version:


Attachments (Terms of Use)
my cluster configuration (1.71 KB, application/xml)
2013-12-20 10:12 UTC, Michael Yang

Description Michael Yang 2013-12-19 07:59:31 UTC
Description of problem:
There are two nodes in my cluster with a qdisk, and I executed "/etc/init.d/network stop" on node1.

On node2, clustat shows everything is OK, and the log shows node1 was fenced successfully.

However, on node1 I see the following:
[root@node1 ~]# clustat
Cluster Status for Ask8TBxVaX2qyMl0 @ Thu Dec 19 15:42:40 2013
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 192.168.35.11                                                       1 Online, rgmanager
 192.168.35.12                                                       2 Online, Local, rgmanager
 HBDEV                                                               0 Offline, Quorum Disk

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 vm:aeVYUOcG-XYGKVH-VujB                                          (none)                                                           disabled   


/var/log/messages shows:



Version-Release number of selected component (if applicable):
cman-2.0.115-51

How reproducible:
always

Steps to Reproduce:
1. two nodes cluster with qdisk
2. execute "/etc/init.d/network stop" on node1
3. execute "clustat" on node1

Actual results:
node1 is not fenced

Expected results:
node1 should be fenced

Additional info:

Comment 1 Michael Yang 2013-12-19 08:09:34 UTC
Sorry, the logs from node1 are here:

2013-12-19T15:34:29.932373+08:00 err daemon node1 qdiskd[11158]:  <err> Qdisk heartbeat send message to address 239.192.103.60 failed,errno=22
2013-12-19T15:34:29.935842+08:00 err daemon node1 qdiskd[11158]:  <err> Error writing node ID block 1
2013-12-19T15:34:29.935915+08:00 err daemon node1 qdiskd[11158]:  <err> Error writing to quorum disk
2013-12-19T15:34:29.935949+08:00 err daemon node1 qdiskd[11158]:  <err> Qdisk heartbeat send message to address 239.192.103.60 failed,errno=22
2013-12-19T15:34:29.935968+08:00 err daemon node1 qdiskd[11158]:  <err> Error writing node ID block 1
2013-12-19T15:34:29.935986+08:00 err daemon node1 qdiskd[11158]:  <err> Error writing to quorum disk
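
For reference, errno=22 in these qdiskd messages is EINVAL on Linux. A quick sketch to confirm the mapping (the interpretation of why the multicast send returns it here, i.e. the interface going away after "network stop", is my reading, not stated in the log):

```python
import errno
import os

# The qdiskd log lines report "failed,errno=22".
# On Linux, errno 22 is EINVAL ("Invalid argument"), which sendto()/sendmsg()
# can return for a multicast destination once the interface the socket was
# using has gone away after the network service is stopped.
assert errno.EINVAL == 22
print(os.strerror(errno.EINVAL))  # Invalid argument
```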

Comment 2 Christine Caulfield 2013-12-20 09:47:40 UTC
It's impossible to be sure without seeing the whole configuration, but if node2 said that node1 was fenced and it wasn't, I would check that your fencing configuration is correct.

One of the things you must always do before deploying a cluster is check that the fence_node command actually fences the nodes.
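
As a hedged illustration of what a supported fencing setup looks like in cluster.conf (every name, address, and credential below is a placeholder, and fence_ipmilan is just one example of a real fence agent):

```xml
<!-- Hypothetical fragment: node names, IPs and credentials are placeholders -->
<clusternode name="node1" nodeid="1">
  <fence>
    <method name="1">
      <device name="ipmi-node1"/>
    </method>
  </fence>
</clusternode>
<!-- node2 is defined the same way with its own fence device -->
<fencedevices>
  <fencedevice agent="fence_ipmilan" name="ipmi-node1"
               ipaddr="10.0.0.101" login="admin" passwd="secret"/>
</fencedevices>
```

With something like this in place, running fence_node node1 from the surviving node should actually power-cycle node1.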

Comment 3 Michael Yang 2013-12-20 10:12:08 UTC
Created attachment 839503 [details]
my cluster configuration

Comment 4 Michael Yang 2013-12-20 10:12:32 UTC
Uploaded my configuration.

Logs on node2 (192.168.35.11 is the fenced node):

2013-12-20T17:54:33.848478+08:00 info daemon h35-12 fenced[2424]:  192.168.35.11 not a cluster member after 3 sec post_fail_delay
2013-12-20T17:54:33.849866+08:00 info daemon h35-12 fenced[2424]:  fencing node "192.168.35.11"
2013-12-20T17:54:34.207879+08:00 alert user h35-12 python:  Fence_agent:Fence begin.
2013-12-20T17:54:35.227584+08:00 alert user h35-12 python:  Fence_agent: Params from cman read target:192.168.35.11
2013-12-20T17:54:35.841740+08:00 info local6 h35-12 clurgmgrd[2484]:  <info> Waiting for node #1 to be fenced
2013-12-20T17:54:37.035085+08:00 alert user h35-12 python:  Fence_agent:Fence timeout:Fence finished.:['/sbin/fence_agent'] 192.168.35.11
2013-12-20T17:54:37.115505+08:00 info daemon h35-12 fenced[2424]:  fence "192.168.35.11" success
2013-12-20T17:54:37.842771+08:00 info local6 h35-12 clurgmgrd[2484]:  <info> Node #1 fenced; continuing

On node2:

[root@h35-12 ~]# clustat
Cluster Status for Ask8TBxVaX2qyMl0 @ Fri Dec 20 17:56:47 2013
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 192.168.35.11                                                       1 Offline
 192.168.35.12                                                       2 Online, Local, rgmanager
 HBDEV                                                               0 Online, Quorum Disk

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 vm:bfdfLHnJ-hrcCaL-TknD                                          (none)                                                           stopped       

[root@h35-12 ~]# cman_tool nodes -af
Node  Sts   Inc   Joined               Name
   0   M      0   2013-12-19 19:25:22  HBDEV
   1   X    336                        192.168.35.11
       Last fenced:   2013-12-20 17:54:37 by two_nodes_device
   2   M    308   2013-12-19 19:21:02  192.168.35.12
       Addresses: 192.168.35.12 


The fence_node command works fine, and so does unplugging the cable directly: node1 reboots when it is fenced.

But now, when I shut down the network service on node1, its cluster status still shows quorate. I think node1 should be rebooted, or at the very least its cluster status should become inquorate.

Comment 5 Christine Caulfield 2013-12-20 10:30:48 UTC
What is 'fence_agent' and where did it come from? It doesn't look like a supported Red Hat fence agent to me.

Comment 6 Michael Yang 2013-12-20 10:38:56 UTC
We don't have a fence device; 'fence_agent' is a script that replaces one.

Comment 7 Christine Caulfield 2013-12-20 10:43:24 UTC
In that case you're on your own I'm afraid. We can only support clusters with valid fencing.

Looking at the log I can see that the 'fence agent' you have is giving some sort of error. But if it doesn't actually do anything to reboot the other system, then it's not a fence agent, and you can't expect the other node to know what's been going on.

I'm going to close this bug, because your fence agent isn't fencing, and that's not something we can fix.
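
For anyone hitting this later: a cman/fenced fence agent receives key=value pairs on stdin and must exit 0 only after the target node has really been powered off or rebooted. A minimal sketch of that argument parsing (the helper name and sample values are illustrative, not taken from the reporter's script):

```python
def parse_fence_args(text):
    """Parse the key=value lines that fenced writes to a fence agent's stdin."""
    args = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        args[key.strip()] = value.strip()
    return args

# Example input resembling what fenced would pass for this cluster:
sample = "agent=fence_agent\nnodename=192.168.35.11\naction=reboot\n"
args = parse_fence_args(sample)
print(args["nodename"], args["action"])  # 192.168.35.11 reboot
# A real agent would now power-cycle args["nodename"] via IPMI, a PDU, etc.,
# and exit non-zero unless the power action verifiably succeeded.
```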

