Description of problem:
If the first fence method fails, fenced doesn't use the second method to fence a node. It probably can't get the necessary information from ccs.

Relevant entries from log:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] entering GATHER state from 11.
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] Creating commit token because I am the rep.
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] entering COMMIT state.
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] entering RECOVERY state.
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] position [0] member 192.168.100.51:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] previous ring seq 164 rep 192.168.100.51
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] aru 203 high delivered 203 received flag 0
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] position [1] member 192.168.100.53:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] previous ring seq 164 rep 192.168.100.51
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] aru 203 high delivered 203 received flag 0
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] position [2] member 192.168.100.54:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] previous ring seq 164 rep 192.168.100.51
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] aru 203 high delivered 203 received flag 0
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] Did not need to originate any messages in recovery.
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] Storing new sequence id for ring ac
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] Sending initial ORF token
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] New Configuration:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] 	r(0) ip(192.168.100.51)
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] 	r(0) ip(192.168.100.53)
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] 	r(0) ip(192.168.100.54)
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] Members Left:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] 	r(0) ip(192.168.100.52)
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] Members Joined:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [SYNC ] This node is within the primary component and will provide service.
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] New Configuration:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] 	r(0) ip(192.168.100.51)
Nov 27 16:47:52 192.168.100.51 fenced[1598]: node2.clean not a cluster member after 0 sec post_fail_delay
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] 	r(0) ip(192.168.100.53)
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] 	r(0) ip(192.168.100.54)
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] Members Left:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] Members Joined:
Nov 27 16:47:52 192.168.100.51 openais[1590]: [SYNC ] This node is within the primary component and will provide service.
Nov 27 16:47:52 192.168.100.51 openais[1590]: [TOTEM] entering OPERATIONAL state.
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] got nodejoin message 192.168.100.51
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] got nodejoin message 192.168.100.53
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CLM  ] got nodejoin message 192.168.100.54
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CPG  ] got joinlist message from node 1
Nov 27 16:47:52 192.168.100.51 fenced[1598]: fencing node "node2.clean"
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CPG  ] got joinlist message from node 3
Nov 27 16:47:52 192.168.100.51 openais[1590]: [CPG  ] got joinlist message from node 4
Nov 27 16:51:32 192.168.100.51 fenced[1598]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:192.168.100.202...ipmilan: Failed to connect after 30 seconds Failed
Nov 27 16:51:32 192.168.100.51 ccsd[1584]: process_get: Invalid connection descriptor received.
Nov 27 16:51:32 192.168.100.51 ccsd[1584]: Error while processing get: Invalid request descriptor
Nov 27 16:51:32 192.168.100.51 fenced[1598]: fence "node2.clean" failed
Nov 27 16:51:40 192.168.100.51 fenced[1598]: fencing node "node2.clean"
Nov 27 16:51:40 192.168.100.51 ccsd[1584]: process_get: Invalid connection descriptor received.
Nov 27 16:51:40 192.168.100.51 ccsd[1584]: Error while processing get: Invalid request descriptor
Nov 27 16:51:40 192.168.100.51 fenced[1598]: fence "node2.clean" failed
Nov 27 16:51:46 192.168.100.51 fenced[1598]: fencing node "node2.clean"
Nov 27 16:51:46 192.168.100.51 ccsd[1584]: process_get: Invalid connection descriptor received.
Nov 27 16:51:46 192.168.100.51 ccsd[1584]: Error while processing get: Invalid request descriptor
Nov 27 16:51:46 192.168.100.51 fenced[1598]: fence "node2.clean" failed
Nov 27 16:51:53 192.168.100.51 fenced[1598]: fencing node "node2.clean"
Nov 27 16:51:53 192.168.100.51 ccsd[1584]: process_get: Invalid connection descriptor received.
Nov 27 16:51:53 192.168.100.51 ccsd[1584]: Error while processing get: Invalid request descriptor

and so on....
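While fenced is looping like this, the membership and fence-domain state can be checked from a surviving node. The commands below are just a sketch of the standard cman diagnostics on this release, not output taken from this report:

  # cluster membership as cman sees it; node2.clean should show up as not a member
  cman_tool nodes
  # openais/cman groups, including the fence domain that keeps retrying node2.clean
  group_tool ls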
cat /etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster alias="clean" config_version="15" name="clean">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="12"/>
    <cman>
        <multicast addr="224.0.0.1"/>
    </cman>
    <clusternodes>
        <clusternode name="node1.clean" nodeid="1" votes="1">
            <multicast addr="224.0.0.1" interface="eth0"/>
            <fence>
                <method name="1">
                    <device name="APC" port="1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node2.clean" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="ipmi_node2"/>
                </method>
                <method name="2">
                    <device name="APC" port="2"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node3.clean" nodeid="3" votes="1">
            <fence>
                <method name="1">
                    <device name="ipmi_node3"/>
                </method>
                <method name="2">
                    <device name="APC" port="3"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node4.clean" nodeid="4" votes="1">
            <fence>
                <method name="1">
                    <device name="ipmi_node4"/>
                </method>
                <method name="2">
                    <device name="APC" port="4"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice agent="fence_apc" ipaddr="192.168.100.250" login="apc" name="APC" passwd="apc"/>
        <fencedevice agent="fence_ipmilan" ipaddr="192.168.100.203" login="Admin" name="ipmi_node3" passwd="ipmi"/>
        <fencedevice agent="fence_ipmilan" ipaddr="192.168.100.204" login="ADMIN" name="ipmi_node4" passwd="ipmi"/>
        <fencedevice agent="fence_ipmilan" ipaddr="192.168.100.202" login="ADMIN" name="ipmi_node2" passwd="ipmi"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="clean_0" ordered="0" restricted="1">
                <failoverdomainnode name="node2.clean" priority="1"/>
                <failoverdomainnode name="node3.clean" priority="1"/>
                <failoverdomainnode name="node4.clean" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <lvm lv_name="shared_test" name="shared_clean_test" vg_name="shared_clean"/>
            <ip address="192.168.100.100" monitor_link="1"/>
            <script file="/etc/init.d/luci" name="luci"/>
        </resources>
        <service autostart="1" domain="clean_0" exclusive="0" name="luci_service" recovery="restart">
            <ip ref="192.168.100.100">
                <script ref="luci"/>
            </ip>
        </service>
    </rm>
</cluster>

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster and give one of the nodes two fence methods (fence_ipmilan and fence_apc); the ipmi method must come before the apc method.
2. Disconnect the ethernet cable from the node that has the two fence methods.
3.

Actual results:
The cluster freezes.

Expected results:
The node gets fenced using fence_apc.

Additional info:
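To rule out the APC device itself, the second method can be exercised by hand from node1 using the values from the cluster.conf above. This is only a sketch and assumes the installed fence_apc accepts the usual -a/-l/-p/-n/-o switches; check fence_apc -h if in doubt:

  # manually power-cycle outlet 2 (node2.clean) on the APC switch defined above
  fence_apc -a 192.168.100.250 -l apc -p apc -n 2 -o reboot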
This problem is caused by the long timeout of the IPMI fence agent. It's a duplicate of an older bug, so I'm closing this one.

*** This bug has been marked as a duplicate of bug 276541 ***
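For anyone hitting the same freeze before picking up the fix for bug 276541: the failing step can be reproduced by hand, and a shorter timeout makes the unreachable IPMI attempt fail fast instead of blocking fenced for minutes. This is only a sketch; it assumes the installed fence_ipmilan accepts a -t <seconds> timeout switch, which may not be true on every build, so verify with fence_ipmilan -h first:

  # retry the same IPMI reboot that fenced attempted, but give up after 10 seconds
  fence_ipmilan -a 192.168.100.202 -l ADMIN -p ipmi -o reboot -t 10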