Bug 205457 - fence_wti fails to completely power off HP DL585 server
Summary: fence_wti fails to completely power off HP DL585 server
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: fence
Version: 4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jim Parsons
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-09-06 14:10 UTC by Mark Whitehead
Modified: 2009-04-16 20:11 UTC (History)

Fixed In Version: RHBA-2007-0138
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-10 21:24:58 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2007:0138 0 normal SHIPPED_LIVE fence bug fix update 2007-05-10 21:23:14 UTC

Description Mark Whitehead 2006-09-06 14:10:30 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)

Description of problem:
We have a cluster of two DL585 servers and a single DL145 G2, with two
fencing levels: the servers' iLO interfaces and WTI power switches.

For resiliency, each server has a power connection to each of two WTI
power switches (only one for the DL145).

During testing we found that fencing a server via the power switches
proceeds in the following order:

        telnet to the switch

        check power status

        send an off command

        check power status

        If off then send an on command

For a server to be fenced, this happens on both power switches
simultaneously, so power to the server is cut and then restored.

In the case of the DL585, the servers are configured via the advanced
iLO function to reboot on restoration of power.  For the DL145 this is
configured in the system BIOS.

The problem is that when a DL585 server is killed via fencing, the
interval between power being cut and restored is long enough for the
server to die but too short for the iLO to die. Because the iLO never
loses power, it does not see the 'restoration' of power and fails to
reboot the server.  This does not happen with the DL145 because its
reboot function is configured in BIOS.

We tested manually increasing the interval between power being removed
and restored, and found that an interval of more than 3 seconds was
required for the iLO to die completely and for the server to reboot
successfully when power was restored.

We then edited a copy of the /sbin/fence_wti script to incorporate
this delay and retested the fencing, which then worked.
(We added '  sleep 5;' as line 189 in fence_wti.)

Could an update either introduce a fixed delay of 5 seconds or allow a
user-configurable delay to be selected?
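
To illustrate the request, here is a minimal Python sketch of the off/wait/on sequence described above. It is not the fence_wti code (the real agent drives the switch over telnet). FakeSwitch, reboot_outlet, and delay_seconds are hypothetical names used only for illustration, and the 5 second default simply mirrors the manual workaround.

```python
import time
from typing import Callable, List


class FakeSwitch:
    """In-memory stand-in for one power switch outlet (hypothetical)."""

    def __init__(self) -> None:
        self.state = "on"
        self.log: List[str] = []  # records the commands sent, in order

    def status(self) -> str:
        self.log.append("status")
        return self.state

    def off(self) -> None:
        self.log.append("off")
        self.state = "off"

    def on(self) -> None:
        self.log.append("on")
        self.state = "on"


def reboot_outlet(
    switch: FakeSwitch,
    delay_seconds: float = 5.0,  # assumption: mirrors the 'sleep 5;' workaround
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check status, power off, confirm off, wait, then power on.

    Returns False (leaving the outlet off) if the off command is not confirmed.
    """
    switch.status()                # check power status
    switch.off()                   # send an off command
    if switch.status() != "off":   # check power status again
        return False
    sleep(delay_seconds)           # keep power off long enough for the iLO to die
    switch.on()                    # if off, send an on command
    return True
```

Passing the sleep function as a parameter is only a convenience so that the wait can be replaced in a test; a fixed `time.sleep(delay_seconds)` would behave the same way.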

Version-Release number of selected component (if applicable):
fence-1.32.18-0.x86_64.rpm 

How reproducible:
Always


Steps to Reproduce:
1.  Disconnect the iLO connection to prevent iLO fencing
2.  Disconnect the server's network connection
3.  The cluster then fences the server via the WTI power switch

Actual Results:
The server is fenced from the cluster but does not reboot and rejoin, because it does not see a complete power disconnection: the off and on commands are issued too quickly.

Expected Results:
The power switches should cut power to the server long enough for a complete power-off (approx. 5 seconds); when power is reapplied, the iLO configuration then boots the server.

Additional info:

Comment 1 Jim Parsons 2006-09-12 13:22:28 UTC
This delay should be a configurable parameter rather than fixed. I also think it
should be added to all power switch fence agents, rather than special-casing WTI
switches. If PM and the boss both ack this request, it should take less than a
day to implement and test. I am not certain whether to include this parameter in
the UI, however. It is something of a special-case parameter, and it is nice not
to clutter the UI with params that the user may find confusing. Please comment
on my opinion regarding the UI. If the consensus is to include it there, it is
trivial to add and will take an hour at most to complete. The question is not
whether to invest the effort, but whether including it makes sense from a
usability standpoint.
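
As a rough sketch of what a configurable delay could look like on the agent side, the Python below parses name=value option lines and reads an optional delay from them. This assumes the agent receives its options in that form. The option name off_on_delay, the example keys ipaddr and port, and the default of 0 seconds (no pause unless configured) are all hypothetical and are not taken from the shipped fix.

```python
from typing import Dict, Iterable

DEFAULT_DELAY_SECONDS = 0.0  # assumption: existing behaviour unless configured


def parse_options(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'name=value' lines, skipping blank lines and '#' comments."""
    options: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that are not name=value pairs
        options[name.strip()] = value.strip()
    return options


def off_on_delay(options: Dict[str, str],
                 default: float = DEFAULT_DELAY_SECONDS) -> float:
    """Return the configured off-to-on delay in seconds (hypothetical option)."""
    raw = options.get("off_on_delay")
    if raw is None:
        return default
    delay = float(raw)  # raises ValueError if the value is not a number
    if delay < 0:
        raise ValueError("off_on_delay must not be negative")
    return delay
```

Because the lookup lives in one small function, the same parameter could be shared by every power switch agent rather than being special-cased for WTI.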

Comment 4 Jim Parsons 2007-01-31 20:01:25 UTC
Fix is in. Thanks for the work, Mark.

Comment 7 Red Hat Bugzilla 2007-05-10 21:24:59 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0138.html


