Description of problem:

<dct> chrissie: do you know if we intend IP's to work as node names in RHEL5?
<chrissie> yes, we do
<dct> I thought so, but I've been helping a guy on linux-cluster and it appears fencing doesn't work right with them
<chrissie> ugh
<dct> not entirely sure, he reinstalled his nodes and set them up again with names and his problem went away
<chrissie> what fence agent was he using ?
<dct> ipmi
<dct> the bug I think is in fenced/member_cman.c
<dct> name_equal()
<chrissie> is it looking for the wrong name ?
<dct> i.e. name_equal() assumes names, not IPs
<dct> it'll match 10.1.1.1 with 10.2.2.2
<dct> because of the common 10.

Report started here,
https://www.redhat.com/archives/linux-cluster/2009-June/msg00011.html
but continued off list with the following info:

It's a two-node cluster, so this is only the output from the surviving node after I initiate a power down with ipmitool. When I do a fence_node manually after the failure, everything works fine: the fencing action is successful and the node 10.102.10.51 takes over the resource.

>- cman_tool nodes

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    820   2009-06-03 15:54:55  10.102.10.51
   2   X    844                        10.102.10.28

>- group_tool -v

# group_tool -v
type             level name       id       state node id local_done
fence            0     default    00010001 none
[1]
dlm              1     rgmanager  00010002 none
[1]

>- group_tool dump fence

[root@ipsdb01 ~]# group_tool dump fence
1244037297 our_nodeid 1 our_name 10.102.10.51
1244037297 listen 4 member 5 groupd 7
1244037324 client 3: join default
1244037324 delay post_join 0s post_fail 0s
1244037324 added 2 nodes from ccs
1244037324 setid default 65537
1244037324 start default 1 members 1 2
1244037324 do_recovery stop 0 start 1 finish 0
1244037324 finish default 1
1244039642 client 3: dump
1244055972 stop default
1244055972 start default 3 members 1
1244055972 do_recovery stop 1 start 3 finish 1
1244055972 add node 2 to list 1
1244055972 averting fence of node 10.102.10.28
1244055972 finish default 3
1244056089 client 3: dump

>- any messages in /var/log/messages

openais[4144]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[4144]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
openais[4144]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
openais[4144]: [TOTEM] entering GATHER state from 2.
openais[4144]: [TOTEM] entering GATHER state from 0.
openais[4144]: [TOTEM] Creating commit token because I am the rep.
openais[4144]: [TOTEM] Saving state aru 2a high seq received 2a
openais[4144]: [TOTEM] Storing new sequence id for ring 350
openais[4144]: [TOTEM] entering COMMIT state.
openais[4144]: [TOTEM] entering RECOVERY state.
openais[4144]: [TOTEM] position [0] member 10.102.10.51:
openais[4144]: [TOTEM] previous ring seq 844 rep 10.102.10.28
openais[4144]: [TOTEM] aru 2a high delivered 2a received flag 1
openais[4144]: [TOTEM] Did not need to originate any messages in recovery.
openais[4144]: [TOTEM] Sending initial ORF token
openais[4144]: [CLM ] CLM CONFIGURATION CHANGE
openais[4144]: [CLM ] New Configuration:
openais[4144]: [CLM ]     r(0) ip(10.102.10.51)
openais[4144]: [CLM ] Members Left:
openais[4144]: [CLM ]     r(0) ip(10.102.10.28)
openais[4144]: [CLM ] Members Joined:
openais[4144]: [CLM ] CLM CONFIGURATION CHANGE
openais[4144]: [CLM ] New Configuration:
kernel: dlm: closing connection to node 2
fenced[4163]: 10.102.10.28 not a cluster member after 0 sec post_fail_delay
openais[4144]: [CLM ]     r(0) ip(10.102.10.51)
openais[4144]: [CLM ] Members Left:
openais[4144]: [CLM ] Members Joined:
openais[4144]: [SYNC ] This node is within the primary component and will provide service.
openais[4144]: [TOTEM] entering OPERATIONAL state.
openais[4144]: [CLM ] got nodejoin message 10.102.10.51
openais[4144]: [CPG ] got joinlist message from node 1

The problem is that the surviving node did not take over the resource from the failed one. This is the cluster status at that moment:

# clustat
Cluster Status for dbcluster @ Thu May 28 14:28:36 2009
Member Status: Quorate

 Member Name                       ID   Status
 ------ ----                       ---- ------
 10.102.10.51                         1 Online, Local, rgmanager
 10.102.10.28                         2 Offline

 Service Name              Owner (Last)     State
 ------- ----              ----- ------     -----
 service:dbservices        10.102.10.28     starte

The node 10.102.10.28 is offline but still the owner of the service dbservices. The takeover does not occur until I fence 10.102.10.28 manually.

> 1244055972 averting fence of node 10.102.10.28  <--
> 1244055972 finish default 3
>
> that is probably happening because you used the fence_node command?

No, I used the fence_node command five minutes after the simulated failure. The "averting fence" message appears right after the failure.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
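For illustration only, here is a minimal, hypothetical C sketch of the kind of comparison the IRC excerpt above suspects in fenced/member_cman.c: if name_equal() only compares the part of each name before the first '.', two different IP addresses can be treated as the same node. This is not the actual fenced source; the helper name and truncation logic are assumptions based on the description above.

#include <stdio.h>
#include <string.h>

/* Hypothetical illustration, not the fenced source: compare only the
 * portion of each name before the first '.', which is reasonable for
 * short vs. fully qualified host names but wrong for dotted-quad IPs. */
static int name_equal_prefix(const char *a, const char *b)
{
        size_t la = strcspn(a, ".");    /* length up to the first '.' */
        size_t lb = strcspn(b, ".");

        return la == lb && strncmp(a, b, la) == 0;
}

int main(void)
{
        /* intended behaviour: short name matches its FQDN */
        printf("%d\n", name_equal_prefix("node1", "node1.example.com")); /* 1 */
        /* the suspected bug: two different IPs "match" on the common "10" */
        printf("%d\n", name_equal_prefix("10.1.1.1", "10.2.2.2"));       /* 1 */
        return 0;
}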
Left out the end of the story:

> Hm, the only other way for "averting" to happen is if it thinks the failed
> node has become a cluster member again. I wonder if those member checks are
> failing because you're using IP addresses for names (I thought IP's worked).
> Could you configure names in place of those IP's and try it?

I reinstalled the whole cluster and used names instead of IPs, as advised by you. The first tests were all successful!!!
I've verified the fenced bug on my own cluster. The bug may not be obvious to people using IP addresses, because fenced just silently skips fencing a failed node. It sounds like rgmanager is affected, though, probably because cman does not report the node as having been fenced.

When I kill the third node I see:

# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 none
[1 2]

# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   1   M    472   2009-06-04 09:56:32  10.15.84.91
   2   M    476   2009-06-04 09:56:47  10.15.84.92
   3   X    480                        10.15.84.93
       Node has not been fenced since it went down
Created attachment 346556 [details]
Patch to fix

The attached patch causes the node check routine to exit if the node name is an IP address and there hasn't been an exact string match. This should stop, e.g., 10.2.1.1 matching 10.2.1.2.
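Not the committed patch itself, but a sketch of the behaviour it describes, assuming a name_equal()-style helper: if either configured node name parses as an IP address, only an exact string match counts, so a short-name/FQDN style shortcut can no longer match 10.2.1.1 against 10.2.1.2. The helper names are illustrative, not taken from the fenced source.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/* Sketch only (not the committed patch): anything that parses as an
 * IPv4 or IPv6 address is treated as an IP and must match exactly. */
static int is_ip_address(const char *name)
{
        struct in_addr a4;
        struct in6_addr a6;

        return inet_pton(AF_INET, name, &a4) == 1 ||
               inet_pton(AF_INET6, name, &a6) == 1;
}

static int name_equal(const char *a, const char *b)
{
        if (strcmp(a, b) == 0)
                return 1;               /* exact match always wins */

        if (is_ip_address(a) || is_ip_address(b))
                return 0;               /* IPs never match partially */

        /* ... an existing short-name vs. FQDN comparison would go here ... */
        return 0;
}

int main(void)
{
        printf("%d\n", name_equal("10.2.1.1", "10.2.1.2"));   /* 0: no false match */
        printf("%d\n", name_equal("10.2.1.1", "10.2.1.1"));   /* 1: exact match    */
        return 0;
}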
Committed to RHEL5 branch: Commit: f27717f5ec074b5567734d09ac04746c21fcff01 fence: Allow IP addresses as node names Also on STABLE2 & STABLE3
This is now in RHEL5.5 distcvs
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
Fencing + automatic service relocation are successful with node names as IP addresses and cman-2.0.115-33.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days