Description of problem:

<dct> chrissie: do you know if we intend IP's to work as node names in RHEL5?
<chrissie> yes, we do
<dct> I thought so, but I've been helping a guy on linux-cluster and it appears fencing doesn't work right with them
<chrissie> ugh
<dct> not entirely sure, he reinstalled his nodes and set them up again with names and his problem went away
<chrissie> what fence agent was he using ?
<dct> ipmi
<dct> the bug I think is in fenced/member_cman.c
<dct> name_equal()
<chrissie> is it looking for the wrong name ?
<dct> i.e. name_equal() assumes names, not IPs
<dct> it'll match 10.1.1.1 with 10.2.2.2
<dct> because of the common 10.

Report started here,
https://www.redhat.com/archives/linux-cluster/2009-June/msg00011.html
but continued off list with the following info:

It's a two-node cluster, so this is only the output from the surviving node after I initiate a power down with ipmitool. When I do a fence_node manually after the failure, everything works fine: the fencing action is successful and the node 10.102.10.51 takes over the resource.

>- cman_tool nodes

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    820   2009-06-03 15:54:55  10.102.10.51
   2   X    844                        10.102.10.28

>- group_tool -v

# group_tool -v
type             level name       id       state node id local_done
fence            0     default    00010001 none
[1]
dlm              1     rgmanager  00010002 none
[1]

>- group_tool dump fence

[root@ipsdb01 ~]# group_tool dump fence
1244037297 our_nodeid 1 our_name 10.102.10.51
1244037297 listen 4 member 5 groupd 7
1244037324 client 3: join default
1244037324 delay post_join 0s post_fail 0s
1244037324 added 2 nodes from ccs
1244037324 setid default 65537
1244037324 start default 1 members 1 2
1244037324 do_recovery stop 0 start 1 finish 0
1244037324 finish default 1
1244039642 client 3: dump
1244055972 stop default
1244055972 start default 3 members 1
1244055972 do_recovery stop 1 start 3 finish 1
1244055972 add node 2 to list 1
1244055972 averting fence of node 10.102.10.28
1244055972 finish default 3
1244056089 client 3: dump

>- any messages in /var/log/messages

openais[4144]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[4144]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
openais[4144]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
openais[4144]: [TOTEM] entering GATHER state from 2.
openais[4144]: [TOTEM] entering GATHER state from 0.
openais[4144]: [TOTEM] Creating commit token because I am the rep.
openais[4144]: [TOTEM] Saving state aru 2a high seq received 2a
openais[4144]: [TOTEM] Storing new sequence id for ring 350
openais[4144]: [TOTEM] entering COMMIT state.
openais[4144]: [TOTEM] entering RECOVERY state.
openais[4144]: [TOTEM] position [0] member 10.102.10.51:
openais[4144]: [TOTEM] previous ring seq 844 rep 10.102.10.28
openais[4144]: [TOTEM] aru 2a high delivered 2a received flag 1
openais[4144]: [TOTEM] Did not need to originate any messages in recovery.
openais[4144]: [TOTEM] Sending initial ORF token
openais[4144]: [CLM ] CLM CONFIGURATION CHANGE
openais[4144]: [CLM ] New Configuration:
openais[4144]: [CLM ]     r(0) ip(10.102.10.51)
openais[4144]: [CLM ] Members Left:
openais[4144]: [CLM ]     r(0) ip(10.102.10.28)
openais[4144]: [CLM ] Members Joined:
openais[4144]: [CLM ] CLM CONFIGURATION CHANGE
openais[4144]: [CLM ] New Configuration:
kernel: dlm: closing connection to node 2
fenced[4163]: 10.102.10.28 not a cluster member after 0 sec post_fail_delay
openais[4144]: [CLM ]     r(0) ip(10.102.10.51)
openais[4144]: [CLM ] Members Left:
openais[4144]: [CLM ] Members Joined:
openais[4144]: [SYNC ] This node is within the primary component and will provide service.
openais[4144]: [TOTEM] entering OPERATIONAL state.
openais[4144]: [CLM ] got nodejoin message 10.102.10.51
openais[4144]: [CPG ] got joinlist message from node 1

The problem is that the surviving node did not take over the resource from the failed one. This is the cluster status at that moment:

# clustat
Cluster Status for dbcluster @ Thu May 28 14:28:36 2009
Member Status: Quorate

 Member Name                       ID   Status
 ------ ----                       ---- ------
 10.102.10.51                         1 Online, Local, rgmanager
 10.102.10.28                         2 Offline

 Service Name              Owner (Last)     State
 ------- ----              ----- ------     -----
 service:dbservices        10.102.10.28     starte

The node 10.102.10.28 is offline but still the owner of the service dbservices. The takeover does not occur until I fence 10.102.10.28 manually.

> 1244055972 averting fence of node 10.102.10.28  <--
> 1244055972 finish default 3
>
> that is probably happening because you used the fence_node command?

No, I used the fence_node command five minutes after the simulated failure. The "averting fence" message appears right after the failure.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
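For illustration only, here is a minimal, hypothetical C sketch of the kind of comparison the IRC excerpt above suspects in fenced/member_cman.c: if name_equal() only compares the part of each name before the first '.', two different IP addresses can be treated as the same node. This is not the actual fenced source; the helper name and truncation logic are assumptions based on the description above.

#include <stdio.h>
#include <string.h>

/* Hypothetical illustration, not the fenced source: compare only the
 * portion of each name before the first '.', which is reasonable for
 * short vs. fully qualified host names but wrong for dotted-quad IPs. */
static int name_equal_prefix(const char *a, const char *b)
{
        size_t la = strcspn(a, ".");    /* length up to the first '.' */
        size_t lb = strcspn(b, ".");

        return la == lb && strncmp(a, b, la) == 0;
}

int main(void)
{
        /* intended behaviour: short name matches its FQDN */
        printf("%d\n", name_equal_prefix("node1", "node1.example.com")); /* 1 */
        /* the suspected bug: two different IPs "match" on the common "10" */
        printf("%d\n", name_equal_prefix("10.1.1.1", "10.2.2.2"));       /* 1 */
        return 0;
}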
Left out the end of the story:

> Hm, the only other way for "averting" to happen is if it thinks the failed
> node has become a cluster member again. I wonder if those member checks are
> failing because you're using IP addresses for names (I thought IP's worked).
> Could you configure names in place of those IP's and try it?

I reinstalled the whole cluster and used names instead of IPs, as advised by you. The first tests were all successful!!!
I've verified the fenced bug on my own cluster. The bug may not be obvious to people using IP addresses, because fenced just silently skips fencing a failed node. It sounds like rgmanager is affected, though, probably because cman does not report the node as having been fenced.

When I kill the third node I see:

# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 none
[1 2]

# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   1   M    472   2009-06-04 09:56:32  10.15.84.91
   2   M    476   2009-06-04 09:56:47  10.15.84.92
   3   X    480                        10.15.84.93
       Node has not been fenced since it went down
Created attachment 346556 [details]
Patch to fix

The attached patch causes the node check routine to exit if the node name is an IP address and there hasn't been an exact string match. This should stop, e.g., 10.2.1.1 matching 10.2.1.2.
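Not the committed patch itself, but a sketch of the behaviour it describes, assuming a name_equal()-style helper: if either configured node name parses as an IP address, only an exact string match counts, so a short-name/FQDN style shortcut can no longer match 10.2.1.1 against 10.2.1.2. The helper names are illustrative, not taken from the fenced source.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/* Sketch only (not the committed patch): anything that parses as an
 * IPv4 or IPv6 address is treated as an IP and must match exactly. */
static int is_ip_address(const char *name)
{
        struct in_addr a4;
        struct in6_addr a6;

        return inet_pton(AF_INET, name, &a4) == 1 ||
               inet_pton(AF_INET6, name, &a6) == 1;
}

static int name_equal(const char *a, const char *b)
{
        if (strcmp(a, b) == 0)
                return 1;               /* exact match always wins */

        if (is_ip_address(a) || is_ip_address(b))
                return 0;               /* IPs never match partially */

        /* ... an existing short-name vs. FQDN comparison would go here ... */
        return 0;
}

int main(void)
{
        printf("%d\n", name_equal("10.2.1.1", "10.2.1.2"));   /* 0: no false match */
        printf("%d\n", name_equal("10.2.1.1", "10.2.1.1"));   /* 1: exact match    */
        return 0;
}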
Committed to RHEL5 branch: Commit: f27717f5ec074b5567734d09ac04746c21fcff01 fence: Allow IP addresses as node names Also on STABLE2 & STABLE3
This is now in RHEL5.5 distcvs
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
Fencing + automatic service relocation are successful with node names as IP addresses and cman-2.0.115-33.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days