From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)

Description of problem:
We have a cluster of two DL585 and a single DL145G2 with two fencing levels, using the server iLO interfaces and WTI power switches. For resiliency, the servers have a power connection to each of two WTI power switches (one in the DL145 case).

During testing we have found that a server is fenced via the power switches in the following order:
1. telnet to the switch
2. check power status
3. send an off command
4. check power status
5. if off, send an on command

For a server to be fenced, this happens to two power switches simultaneously, and the power to the server is killed and then restored. In the case of the DL585, the servers are configured via the advanced iLO function to reboot on restoration of power. This is configured in system BIOS for the DL145.

The problem we have is that when a DL585 server is killed via fencing, the interval between power being killed and restored is long enough for the server to die but too short for the iLO to die. Because the iLO has not been killed, it doesn't see the 'restoration' of power and fails to reboot the server. This doesn't happen with the DL145 because the reboot function is configured in BIOS.

We have tested manually increasing the interval between the power being removed and restored, and found that an interval in excess of 3 seconds was required for the iLO to die completely and for the server to reboot successfully when power was restored. We then edited a copy of the /sbin/fence_wti script to incorporate this delay and tested the fencing, which then worked ok (added ' sleep 5;' as line 189 in fence_wti).

Can an update be made to either introduce a fixed time delay of 5 seconds or allow a user-configurable delay to be selected?

Version-Release number of selected component (if applicable):
fence-1.32.18-0.x86_64.rpm

How reproducible:
Always

Steps to Reproduce:
1. Disconnect iLO connection to prevent iLO fencing
2. Disconnect server network connection
3. Cluster then fences server via the WTI power switch

Actual Results:
The server is fenced from the cluster but does not reboot and rejoin, as it does not see a complete power disconnection because the off and on commands are issued too quickly.

Expected Results:
The server should have power cut via the power switches and be allowed a complete power off (approx 5 seconds); then, when power is reapplied, the iLO configuration is set to boot the server.

Additional info:
This delay should be a configurable parameter rather than fixed. I also think it should be added to all power switch fence agents, rather than special-casing WTI switches. If PM and the boss both ack this request, it should take less than a day to implement and test.

I am not certain whether to include this parameter in the UI, however. This is something of a special-case parameter, and it is nice not to clutter the UI with params that the user may find confusing. Please comment on my opinion regarding the UI. If the consensus is to include it there, it is trivial to add and will take an hour at most to complete; the question is not whether to invest the effort, but whether including it makes sense from a usability standpoint.
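For discussion, here is a rough sketch of the off/on sequence from the report with the delay made a parameter. It is written in Python for illustration only and is not the fence_wti code; the names FakeSwitch, power_cycle and off_delay are hypothetical, and the 5 second default simply mirrors the value the reporter tested with.

```python
import time


class FakeSwitch:
    """Hypothetical stand-in for a power switch session (not the WTI protocol)."""

    def __init__(self):
        self.state = "on"
        self.log = []  # records the order of commands sent

    def status(self, plug):
        self.log.append("status")
        return self.state

    def off(self, plug):
        self.log.append("off")
        self.state = "off"

    def on(self, plug):
        self.log.append("on")
        self.state = "on"


def power_cycle(switch, plug, off_delay=5.0, sleep=time.sleep):
    """Check status, send off, check status, wait off_delay seconds, send on.

    off_delay is the assumed user-configurable parameter, in seconds.
    Returns False if the plug did not report "off" after the off command.
    """
    if off_delay < 0:
        raise ValueError("off_delay must be non-negative")
    switch.status(plug)
    switch.off(plug)
    if switch.status(plug) != "off":
        return False  # power was not cut, so do not restore or report success
    sleep(off_delay)  # keep power off long enough for the iLO to lose power
    switch.on(plug)
    return True
```

Because the wait lives in one shared routine rather than in a single agent, any power switch agent that performs the same off/on cycle could reuse it with its own off_delay value.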
Fix is in. Thanks for the work, Mark.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0138.html