Bug 507514
Summary: | fence_ilo fails to reboot, possibly timing problem with ilo2 1.70 | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Michael Kearey <mkearey>
Component: | cman | Assignee: | Marek Grac <mgrac>
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Priority: | urgent
Version: | 5.3 | CC: | akarlsso, clasohm, cluster-maint, cmarcant, cward, dejohnso, djuran, edamato, gkeegan, james.brown, jkortus, jwest, mgrac, pbiswas, pep, richard.f.dawson, richard.w.jerrido, robert.lawton, tao
Target Milestone: | rc | Keywords: | ZStream
Target Release: | 5.5 | Hardware: | All
OS: | Linux | Fixed In Version: | cman-2.0.115-19.el5
Doc Type: | Bug Fix | Last Closed: | 2010-03-30 08:38:14 UTC
Bug Blocks: | 533379 | |
Description
Michael Kearey
2009-06-23 04:03:36 UTC
Unfortunately, I am currently unable to reproduce it, but can you try increasing POWER_TIMEOUT in fencing.py? If that works, I will make these *_TIMEOUT values more configurable (e.g. via a command-line option).

We've been affected by this issue, so we've tested against our BL685 G5 blade servers with iLO firmware versions 1.60, 1.70, 1.77 and 1.78. The results were identical on both nodes.

Stock, with no changes to fencing.py:
1.60 - Node powered off and powered back on.
1.70 - Node is powered off and does not power back on.
1.77 - Node is powered off and does not power back on.
1.78 - Node is powered off and does not power back on.

With the time.sleep(5) changes to /usr/lib/fence/fencing.py recommended by Michael Kearey:
1.60 - Node powered off and powered back on.
1.70 - Node powered off and powered back on.
1.77 - Node powered off and powered back on.
1.78 - Node powered off and powered back on.

With POWER_TIMEOUT=30 in fencing.py as recommended by Marek Grac (default is 20):
1.60 - Node powered off and powered back on.
1.70 - Node powered off and powered back on. However, fence_ilo reports "Timed out waiting to power ON".
1.77 - Node powered off and powered back on. However, fence_ilo reports "Timed out waiting to power ON".
1.78 - Node powered off and powered back on. However, fence_ilo reports "Timed out waiting to power ON".

It would appear that any firmware from 1.70 onward is affected by this issue.

Tested on iLO2 with 1.70: 5 of 10 attempts were successful (same with sleep). But it looks like SSH access works as expected again (v1.70, v1.79). So I modified the fence agent for ilo_mp to also support iLO2 with v1.70+ [older versions do not recognize stop -f /system1 and were able to do only a graceful shutdown]. My tests were successful, but it would be better if you could test that too. As the power-wait option is used, you need a new fencing library, and you will have to set PYTHONPATH to the directory containing it. The fencing agent and fencing library are attached below.

Created attachment 364731 [details]
Fence agent for iLO2 with firmware 1.70+
Created attachment 364732 [details]
Fencing library
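For readers following the timing discussion above, here is a minimal sketch of the kind of status-polling loop that a power timeout governs. It is not the code from fencing.py: `wait_for_power_status`, `get_status` and the `FakeIlo` class are hypothetical names used for illustration, and only the default of 20 seconds comes from this report.

```python
import time

POWER_TIMEOUT = 20  # seconds; the default value quoted in this report


def wait_for_power_status(get_status, wanted, timeout=POWER_TIMEOUT,
                          interval=1.0, sleep=time.sleep):
    """Poll get_status() until it returns `wanted` or `timeout` seconds pass.

    Returns True if the wanted status was seen, False on timeout.
    """
    waited = 0.0
    while True:
        if get_status() == wanted:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


class FakeIlo:
    """Hypothetical stand-in for a management board that is slow to power on."""

    def __init__(self, polls_until_on):
        self.polls_until_on = polls_until_on
        self.polls = 0

    def status(self):
        # Report "off" for the first polls_until_on polls, then "on".
        self.polls += 1
        return "on" if self.polls > self.polls_until_on else "off"
```

Under these assumptions, a board that accepts the power-on command but reports the new status late would make the loop return False even though the machine does come up, which would be consistent with the "Timed out waiting to power ON" messages reported above.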
From IT 356169:

As I mentioned to you on our conference call, we've got a bit of an issue with fence_ilo and our HP BL460 blades. Although it accomplishes its primary objective of ensuring that the node is not accessing cluster resources, we really do need it to power the node back on. I get the following messages if I run fence_ilo manually:

Timed out waiting to power ON
Success: Rebooted

This problem seems similar to BZ#507514. In that bugzilla, someone suggests introducing a 5-second sleep in /usr/lib/fence/fencing.py as a workaround. I can confirm that this workaround does work in our environment. The bugzilla suggested that this problem was caused by version 1.70 of the iLO2 firmware; however, we're experiencing the same issue with 1.77. Perhaps it affects every firmware since 1.70.

The publicly viewable bugzilla entries do not appear to show any progress on this issue by Red Hat. Can I please have an update on this? We're planning on rolling out a large number of Xen clusters on HP blades in the coming months. We are very keen to see this issue resolved sooner rather than later.

In the upstream code we've developed a generic interface for all fence agents to allow you to specify various timeouts on the fence agent command line. This functionality will be backported to the RHEL5 fence agents in order to resolve this issue. Marek, can you give an update on the ETA of doing the backport?

@Debbie: The patch to add timeouts is ready and can be deployed in RHEL5 (use --power-wait <N> for command-line tests; N is in seconds). But in our test environment it does not help at all (tested with v1.70 and v1.77). A different fence agent, originally developed for iLO MP, works for us [included as an attachment in this bz]; if you can, please test it. If the timeout works for you, then our timeout options will be enough and I can add them (but without default values that will work for you). A test build can be ready on Friday morning (US time).
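As a rough illustration of what a power-wait setting does, the sketch below pauses for a configurable number of seconds after each power command before continuing. Only the option name --power-wait and its unit (seconds) come from this report; `reboot`, `FakeDevice` and the way the delay is applied are assumptions, not the RHEL5 implementation.

```python
import time


def reboot(device, power_wait=0, sleep=time.sleep):
    """Power a device off and then on, pausing power_wait seconds after each command."""
    device.set_power("off")
    if power_wait > 0:
        sleep(power_wait)  # give the management board time to settle
    device.set_power("on")
    if power_wait > 0:
        sleep(power_wait)
    return device.state


class FakeDevice:
    """Hypothetical device that records the power commands it receives."""

    def __init__(self):
        self.state = "on"
        self.commands = []

    def set_power(self, state):
        self.commands.append(state)
        self.state = state
```

A fixed delay of this kind is simpler than polling for status, but it only helps if the board's problem is that commands arrive too close together.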
There is a new option --retry-on=N (retry_on on stdin) for HP iLO. N specifies the number of attempts (to send the power ON signal and wait for status). During our tests we found that 3 is enough, so it is set as the default. Timeout options are also available; please consult http://sources.redhat.com/cluster/wiki/FenceTiming for details. The alternative fence agent should work too, and it is possible that it is faster at powering on the machine (the time to shut down the node looks similar).

http://git.fedorahosted.org/git/fence-agents.git?p=fence-agents.git;a=commit;h=8acc1a69d6695bc5d3a86f21f60910186053bac1

Created attachment 368159 [details]
Fix traceback when using any SNMP agent
The main problem was a forgotten "self" in fencing_snmp, introduced by commit:
Author: Marek 'marx' Grac <mgrac>
Date: Fri Oct 9 13:36:25 2009 +0200
fencing: Timeout options added
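To illustrate the class of error described above (this is not the actual fencing_snmp code), the sketch below shows one way a forgotten `self` can surface: reading an attribute without it raises a NameError when the method is called, which appears as a traceback. The class and attribute names are hypothetical.

```python
class BrokenAgent:
    """Hypothetical class whose method forgets `self` when reading an attribute."""

    def __init__(self, timeout):
        self.timeout = timeout

    def describe(self):
        # Bug: `timeout` is neither a local nor a global name here,
        # so this line raises NameError at call time.
        return "timeout=%d" % timeout


class FixedAgent(BrokenAgent):
    def describe(self):
        # Fix: read the attribute through `self`.
        return "timeout=%d" % self.timeout


def try_describe(agent):
    """Return the description, or the name of the exception that was raised."""
    try:
        return agent.describe()
    except NameError as exc:
        return type(exc).__name__
```

Because Python resolves names only when a line runs, a mistake like this goes unnoticed until the affected code path is exercised.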
Fixed in fence_snmp_traceback.patch (build cman-2.0.115-17.el5)

Fix problems introduced by --retry-on

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days