Bug 504158 - fenced doesn't work with IP addresses for node names [NEEDINFO]
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: David Teigland
QA Contact: Cluster QE
Depends On:
Reported: 2009-06-04 10:19 EDT by David Teigland
Modified: 2010-10-23 05:58 EDT (History)

See Also:
Fixed In Version: cman-2_0_115-8_el5
Doc Type: Bug Fix
Last Closed: 2010-03-30 04:42:09 EDT
cward: needinfo? (teigland)

Attachments
Patch to fix (1.41 KB, patch)
2009-06-04 11:49 EDT, Christine Caulfield

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2010:0266 normal SHIPPED_LIVE cman bug fix and enhancement update 2010-03-29 08:54:44 EDT

Description David Teigland 2009-06-04 10:19:38 EDT
Description of problem:

<dct> chrissie: do you know if we intend IP's to work as node names in RHEL5?
<chrissie> yes, we do
<dct> I thought so, but I've been helping a guy on linux-cluster and it appears fencing doesn't work right with them
<chrissie> ugh
<dct> not entirely sure, he reinstalled his nodes and set them up again with names and his problem went away
<chrissie> what fence agent was he using ?
<dct> ipmi
<dct> the bug I think is in fenced/member_cman.c
<dct> name_equal()
<chrissie> is it looking for the wrong name ?
<dct> i.e name_equal() assumes names not ip's
<dct> it'll match with
<dct> because of the common 10.
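
To illustrate the failure mode suspected in the chat above: a comparison that treats everything after the first '.' as a domain suffix reduces any two addresses on the same 10.x network to "10" and reports them as equal. The following is a minimal sketch of that behavior, not the code in fenced/member_cman.c. short_name_equal() is a hypothetical helper, and the addresses used with it are made up, because the ones from the original report are not preserved here.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch of a hostname comparison that ignores the domain
 * part: "node1.example.com" and "node1" are treated as the same node.
 * Returns 1 if the names are considered equal, 0 otherwise.
 */
static int short_name_equal(const char *a, const char *b)
{
    size_t la, lb;

    assert(a != NULL && b != NULL);

    /* Exact match always wins. */
    if (strcmp(a, b) == 0)
        return 1;

    /* Otherwise compare only the part before the first '.'. */
    la = strcspn(a, ".");
    lb = strcspn(b, ".");
    return la == lb && strncmp(a, b, la) == 0;
}
```

With this comparison, short_name_equal("10.0.0.1", "10.0.0.2") returns 1 because both names shorten to "10", so a failed node can be mistaken for a surviving member.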

Report started here,

but continued off list with the following info,

It's a two-node cluster, so this is only the output from the surviving
node after I initiate a power down with ipmitool. When I run fence_node
manually after the failure, everything works fine: the fencing action is
successful and the node takes over the resource.

>- cman_tool nodes

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    820   2009-06-03 15:54:55
   2   X    844              

>- group_tool -v

# group_tool -v
type             level name       id       state node id local_done
fence            0     default    00010001 none
dlm              1     rgmanager  00010002 none

>- group_tool dump fence

[root@ipsdb01 ~]# group_tool dump fence
1244037297 our_nodeid 1 our_name
1244037297 listen 4 member 5 groupd 7
1244037324 client 3: join default
1244037324 delay post_join 0s post_fail 0s
1244037324 added 2 nodes from ccs
1244037324 setid default 65537
1244037324 start default 1 members 1 2
1244037324 do_recovery stop 0 start 1 finish 0
1244037324 finish default 1
1244039642 client 3: dump
1244055972 stop default
1244055972 start default 3 members 1
1244055972 do_recovery stop 1 start 3 finish 1
1244055972 add node 2 to list 1
1244055972 averting fence of node
1244055972 finish default 3
1244056089 client 3: dump

>- any messages in /var/log/messages

openais[4144]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[4144]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
openais[4144]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
openais[4144]: [TOTEM] entering GATHER state from 2.
openais[4144]: [TOTEM] entering GATHER state from 0.
openais[4144]: [TOTEM] Creating commit token because I am the rep.
openais[4144]: [TOTEM] Saving state aru 2a high seq received 2a
openais[4144]: [TOTEM] Storing new sequence id for ring 350
openais[4144]: [TOTEM] entering COMMIT state.
openais[4144]: [TOTEM] entering RECOVERY state.
openais[4144]: [TOTEM] position [0] member
openais[4144]: [TOTEM] previous ring seq 844 rep
openais[4144]: [TOTEM] aru 2a high delivered 2a received flag 1
openais[4144]: [TOTEM] Did not need to originate any messages in recovery.
openais[4144]: [TOTEM] Sending initial ORF token
openais[4144]: [CLM  ] New Configuration:
openais[4144]: [CLM  ]  r(0) ip(
openais[4144]: [CLM  ] Members Left:
openais[4144]: [CLM  ]  r(0) ip(
openais[4144]: [CLM  ] Members Joined:
openais[4144]: [CLM  ] New Configuration:
kernel: dlm: closing connection to node 2
fenced[4163]: not a cluster member after 0 sec post_fail_delay
openais[4144]: [CLM  ]  r(0) ip(
openais[4144]: [CLM  ] Members Left:
openais[4144]: [CLM  ] Members Joined:
openais[4144]: [SYNC ] This node is within the primary component and will provide service.
openais[4144]: [TOTEM] entering OPERATIONAL state.
openais[4144]: [CLM  ] got nodejoin message
openais[4144]: [CPG  ] got joinlist message from node 1

The problem is that the surviving node did not take over the resource
from the failed one.

This is the cluster Status in this moment:

# clustat
Cluster Status for dbcluster @ Thu May 28 14:28:36 2009
Member Status: Quorate

 Member Name                              ID   Status
 ------ ----                              ---- ------
                                             1 Online, Local, rgmanager
                                             2

 Service Name         Owner (Last)                   State
 ------- ----         ----- ------                   -----
 service:dbservices                                  starte

The node is offline but is still the owner of the service dbservices.
The takeover does not occur until I fence the node manually.

>1244055972 averting fence of node <--
>1244055972 finish default 3
>that is probably happening because you used the fence_node command?

No, I used the fence_node command five minutes after the simulated
failure. The "averting fence" message appears just after the failure.

Comment 1 David Teigland 2009-06-04 10:32:45 EDT
Left out the end of the story,

> Hm, the only other way for "averting" to happen is if it thinks the failed
> node has become a cluster member again.  I wonder if those member checks are  
> failing because you're using IP addresses for names (I thought IP's worked).
> Could you configure names in place of those IP's and try it?

I reinstalled the whole cluster and used names instead of IPs, as
you advised.

The first tests were all successful!
Comment 2 David Teigland 2009-06-04 10:45:20 EDT
I've verified the fenced bug on my own cluster.  The bug may not be obvious to people using IP addresses, because fenced just silently skips fencing a failed node.  It sounds like rgmanager is affected, though, probably because cman does not report the node as having been fenced.  When I kill the third node I see,

# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 none        
[1 2]

# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   1   M    472   2009-06-04 09:56:32
   2   M    476   2009-06-04 09:56:47
   3   X    480              
       Node has not been fenced since it went down
Comment 3 Christine Caulfield 2009-06-04 11:49:35 EDT
Created attachment 346556
Patch to fix

The attached patch causes the node check routine to exit if the node is an IP address and there hasn't been an exact string match.

This should stop (eg) matching with
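
The behavior described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: node_name_equal() and is_ip_address() are hypothetical names, using inet_pton() to detect an address is an assumption about how such a check could be written, and the addresses in the usage note are made up.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Returns 1 if s parses as an IPv4 or IPv6 address, 0 otherwise. */
static int is_ip_address(const char *s)
{
    struct in_addr v4;
    struct in6_addr v6;

    return inet_pton(AF_INET, s, &v4) == 1 ||
           inet_pton(AF_INET6, s, &v6) == 1;
}

/*
 * Hypothetical node-name comparison: an exact match is always accepted,
 * but if either name is an IP address the check exits without falling
 * back to the short-hostname comparison.
 */
static int node_name_equal(const char *a, const char *b)
{
    size_t la, lb;

    assert(a != NULL && b != NULL);

    if (strcmp(a, b) == 0)
        return 1;

    /* IP addresses must match exactly; never compare them by prefix. */
    if (is_ip_address(a) || is_ip_address(b))
        return 0;

    /* Hostnames: compare only the part before the first '.'. */
    la = strcspn(a, ".");
    lb = strcspn(b, ".");
    return la == lb && strncmp(a, b, la) == 0;
}
```

Under this sketch, node_name_equal("10.0.0.1", "10.0.0.2") returns 0, while node_name_equal("node1.example.com", "node1") still returns 1.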
Comment 4 Christine Caulfield 2009-06-04 12:10:30 EDT
Committed to RHEL5 branch:
Commit:        f27717f5ec074b5567734d09ac04746c21fcff01
fence: Allow IP addresses as node names

Comment 6 Christine Caulfield 2009-10-06 04:13:58 EDT
This is now in RHEL5.5 distcvs
Comment 8 Chris Ward 2010-02-11 05:05:23 EST
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.
Comment 9 Jaroslav Kortus 2010-03-11 11:40:51 EST
Fencing + automatic service relocation are successful with node names as IP addresses and cman-2.0.115-33.el5.
Comment 11 errata-xmlrpc 2010-03-30 04:42:09 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

