Red Hat Bugzilla – Bug 205457
fence_wti fails to completely power off HP DL585 server
Last modified: 2009-04-16 16:11:55 EDT
Description of problem:
We have a cluster of two DL585 servers and a single DL145G2, with two
fencing levels: the server ilo interfaces and WTI power switches.
For resiliency, each DL585 has a power connection to each of two WTI
power switches (the DL145 has only one).
During testing we found that a server is fenced via the power
switches in the following order:
1. telnet to the switch
2. check power status
3. send an off command
4. check power status
5. if off, send an on command
For a server to be fenced this happens to two power switches
simultaneously and the power is killed to the server then restored.
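The per-switch sequence described above can be sketched roughly as follows. This is an illustration only, not the fence_wti code: FakeSwitch, fence_via_switch and the switch names are hypothetical stand-ins, the telnet session is replaced by an in-memory object, and the two switches are handled one after the other rather than simultaneously.

```python
class FakeSwitch:
    """In-memory stand-in for one WTI power switch (hypothetical, no telnet)."""

    def __init__(self, name):
        self.name = name
        self.plug_on = True
        self.log = []  # commands received, in order

    def status(self):
        self.log.append("status")
        return "on" if self.plug_on else "off"

    def off(self):
        self.log.append("off")
        self.plug_on = False

    def on(self):
        self.log.append("on")
        self.plug_on = True


def fence_via_switch(switch):
    """Apply the observed sequence to one switch; True if power was cycled."""
    switch.status()                # check power status
    switch.off()                   # send an off command
    if switch.status() == "off":   # check power status again
        switch.on()                # if off, send an on command straight away
        return True
    return False


# A DL585 is connected to two switches, so the sequence runs against both.
switches = [FakeSwitch("wti-a"), FakeSwitch("wti-b")]
results = [fence_via_switch(s) for s in switches]
```

Note that nothing in this sequence waits between the off and the on command, so the outage lasts only as long as the second status check takes.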
In the case of the DL585, the servers are configured via the advanced
ilo function to reboot on restoration of power. This is configured in
system BIOS for the DL145.
The problem we have is when a DL585 server is killed via fencing, the
interval between power being killed and restored again is enough for
the server to die but too short for the ilo to die. Because the ilo
has not been killed it doesn't see the 'restoration' of power and
fails to reboot the server. This doesn't happen with the DL145
because the reboot function is configured in BIOS.
We have tested increasing the interval manually between the power
being removed and restored and found that an interval in excess of 3
seconds was required for the ilo to die completely and cause the
server to reboot successfully when power was restored.
We then edited a copy of the /sbin/fence_wti script to incorporate
this delay and tested the fencing which then worked ok.
(added ' sleep 5;' as line 189 in fence_wti).
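The workaround amounts to holding the power off for a few seconds before restoring it. A minimal sketch of that idea, assuming the off and on actions are available as callables; power_cycle, off_delay and the recording stand-ins are hypothetical names for illustration and are not taken from fence_wti:

```python
import time


def power_cycle(power_off, power_on, off_delay=5.0, sleep=time.sleep):
    """Cut power, keep it off for off_delay seconds, then restore it."""
    if off_delay < 0:
        raise ValueError("off_delay must be non-negative")
    power_off()
    sleep(off_delay)  # give the ilo time to lose power completely
    power_on()


# Usage with recording stand-ins instead of a real switch and a real sleep.
events = []
power_cycle(
    power_off=lambda: events.append("off"),
    power_on=lambda: events.append("on"),
    off_delay=5.0,
    sleep=lambda seconds: events.append(("sleep", seconds)),
)
```

Passing the sleep function in as a parameter keeps the example testable without actually waiting; a real agent would simply use the default.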
Can an update be made to either introduce a fixed delay of 5 seconds
or allow a user-configurable delay to be selected?
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Disconnect ilo connection to prevent ilo fencing
2. Disconnect server network connection
3. Cluster then fences server via the WTI power switch
Actual Results:
The server is fenced from the cluster but does not reboot and rejoin, because it does not see a complete power disconnection: the off and on commands are issued too quickly.
Expected Results:
The server should have its power cut via the power switches and be allowed to power off completely (approx. 5 seconds); when power is reapplied, the ilo configuration then boots the server.
This delay should be a configurable parameter rather than fixed. I also think it
should be added to all power switch fence agents, rather than special casing WTI
switches. If PM and the boss both ack this request, it should take less than a
day to implement and test. I am not certain whether to include this parameter in
the UI, however. This is kind of a special case parameter, and it is nice not to
clutter the UI with params that the user may find confusing. Please comment on
my opinion regarding the UI. If the consensus is to include it there, it is
trivial to add and will take an hour at most to complete -- the question is not
one of whether to invest the effort, but whether to include it from a usability
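As a rough sketch of what a configurable delay could look like, assuming the agent receives its options as name=value lines: the option name power_delay, the default of 5 seconds and both function names below are hypothetical and chosen only for illustration, not the names used in the eventual fix.

```python
DEFAULT_POWER_DELAY = 5.0  # seconds; mirrors the manual workaround in this report


def parse_options(text):
    """Parse 'name=value' lines into a dict, skipping blanks and '#' comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got: {line!r}")
        options[name.strip()] = value.strip()
    return options


def get_power_delay(options):
    """Return the delay in seconds, falling back to the default if unset."""
    delay = float(options.get("power_delay", DEFAULT_POWER_DELAY))
    if delay < 0:
        raise ValueError("power_delay must be non-negative")
    return delay


configured = get_power_delay(parse_options("plug=3\npower_delay=8\n"))
defaulted = get_power_delay(parse_options("plug=3\n"))
```

Keeping a default means existing configurations keep working unchanged, which fits the concern about not exposing a confusing parameter to every user.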
Fix is in. Thanks for the work, Mark.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.