Bug 507514 - fence_ilo fails to reboot, possibly timing problem with ilo2 1.70 [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.3
Hardware: All Linux
Priority: urgent Severity: medium
Target Milestone: rc
Target Release: 5.5
Assigned To: Marek Grac
QA Contact: Cluster QE
Keywords: ZStream
Depends On:
Blocks: 533379
Reported: 2009-06-23 00:03 EDT by Michael Kearey
Modified: 2016-04-26 12:19 EDT (History)
Fixed In Version: cman-2.0.115-19.el5
Doc Type: Bug Fix
Last Closed: 2010-03-30 04:38:14 EDT
cward: needinfo? (mkearey)


Attachments
Fence agent for iLO2 with firmware 1.70+ (1.73 KB, application/octet-stream)
2009-10-14 05:45 EDT, Marek Grac
Fencing library (26.03 KB, application/octet-stream)
2009-10-14 05:45 EDT, Marek Grac
Fix traceback when using any SNMP agent (991 bytes, patch)
2009-11-09 05:16 EST, Jan Friesse
Description Michael Kearey 2009-06-23 00:03:36 EDT
Description of problem:
There appears to be a subtle timing issue with the fence_ilo agent when used on an HP iLO2 with firmware 1.70.

For example, on one system the fencing works: the action -o reboot results in the system being rebooted. On another system, action="reboot" does not work: the system does not power on.

Version-Release number of selected component (if applicable):

cman-2.0.98-1.el5_3.1
ilo2 details:

   FIRMWARE_VERSION = "1.70"
   FIRMWARE_DATE    = "Dec 02 2008"
   MANAGEMENT_PROCESSOR    = "iLO2"


How reproducible:
 50% - One customer reported a problem using iLO2 firmware version 1.77, rolled back the firmware to 1.70 and is happy that it works. Another customer is using 1.70 firmware, and it fails to power on after a fence action -o reboot. They rolled back to 1.60 and have their fencing working now.

Steps to Reproduce:
1. Run HP server system with ilo2 
2. Issue fence_ilo -v -o reboot -a <ip address> -l <ilo username> -p <ilo_password> -z addressing the ilo2 interface

  
Actual results:

The system is powered down but does not power up again. The logs from the iLO2 devices show:

1) <SERVER_INFO MODE = "read"><GET_HOST_POWER_STATUS/>
Since the server is ON, it gets powered off:
2) <SERVER_INFO MODE = "write"><HOLD_PWR_BTN TOGGLE="yes" />
The agent then reads the status, which is OFF, so it tries to power the server on again with:
3) <SERVER_INFO MODE = "write"><HOLD_PWR_BTN TOGGLE="yes" />
After 20 attempts the wait times out, yet the fence_ilo command still appears to report success.

Expected results:

fence_ilo should be able to reliably power the system off and then on when the reboot action is sent.


Additional info:

Editing /usr/lib/fence/fencing.py to introduce a time.sleep(5) after the power OFF is issued and wait_power_status has completed, like so, appears to resolve the problem:

	elif options["-o"] == "reboot":
		if status != "off":
			options["-o"] = "off"
			set_power_fn(tn, options)
			if wait_power_status(tn, options, get_power_fn) == 0:
				fail(EC_WAITING_OFF)
		time.sleep(5)
		options["-o"] = "on"
		set_power_fn(tn, options)
		if wait_power_status(tn, options, get_power_fn) == 0:
			sys.stderr.write('Timed out waiting to power ON\n')
		print "Success: Rebooted"
	elif options["-o"] == "status":
		print "Status: " + status.upper()



The only problem is that, placed here, the delay applies to all fence agents.
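
One way to keep such a delay out of the other agents is to make it a per-call parameter that defaults to zero. The following is only a sketch under assumptions: reboot, set_power, get_power and power_wait are hypothetical names chosen for illustration and are not the actual fencing.py interface.

```python
import time

def reboot(set_power, get_power, power_wait=0, timeout=20, sleep=time.sleep):
    """Power off, optionally pause, then power on.

    set_power(state) and get_power() are hypothetical callables standing in
    for an agent's power functions. power_wait is the pause (in seconds)
    between a confirmed OFF and the ON request; 0 means no pause.
    Returns True if the node was seen ON before the timeout.
    """
    def wait_for(state):
        # Poll once per second, at most `timeout` times.
        for _ in range(timeout):
            if get_power() == state:
                return True
            sleep(1)
        return False

    if get_power() != "off":
        set_power("off")
        if not wait_for("off"):
            raise RuntimeError("timed out waiting to power OFF")
    if power_wait > 0:
        sleep(power_wait)  # only agents that ask for a delay pay for it
    set_power("on")
    return wait_for("on")
```

With this shape, only the agent that needs the pause would pass a non-zero power_wait; every other caller keeps the old timing.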

I assume we could instead place the time.sleep() in fence_ilo so that it is applied only to iLO2 firmware 1.70, like so:

cat /sbin/fence_ilo
...,
import sys, re, pexpect, socket, time
...,
def set_power_status(conn, options):
        conn.send("<LOGIN USER_LOGIN = \"" + options["-l"] + "\"" + \
                " PASSWORD = \"" + options["-p"] + "\">\r\n")
        conn.send("<SERVER_INFO MODE = \"write\">")

        if options.has_key("fw_processor") and options["fw_processor"] == "iLO2":
                if options["fw_version"] > 1.29:
                        conn.send("<HOLD_PWR_BTN TOGGLE=\"yes\" />\r\n")
                        ## Introduce delay for ilo2 1.70 firmware
                        if options["fw_version"] == 1.70:
                                 time.sleep(5)         ## DELAY 5 sec
                else:
                        conn.send("<HOLD_PWR_BTN />\r\n")
        elif options["-r"] < 2.21:
                conn.send("<SET_HOST_POWER HOST_POWER = \"" + options["-o"] + "\" />\r\n")
        else:
                if options["-o"] == "off":
                        conn.send("<HOLD_PWR_BTN/>\r\n")
                else:
                        conn.send("<PRESS_PWR_BTN/>\r\n")
        conn.send("</SERVER_INFO></LOGIN>\r\n")

        return
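
The check above compares the firmware version as a float and matches 1.70 exactly, so it would not cover later releases. A minimal sketch of a range check on the version string follows; parse_fw_version and needs_power_delay are hypothetical helpers, and the 1.70 threshold is an assumption taken from the reports in this bug, not something HP has confirmed.

```python
def parse_fw_version(text):
    """Turn a firmware string such as "1.70" into a tuple like (1, 70).

    Assumes the two-digit minor format shown in this report ("1.70");
    a string written as "1.7" would parse as (1, 7) and compare lower.
    """
    return tuple(int(part) for part in text.strip().split("."))

def needs_power_delay(processor, fw_version):
    """Return True for iLO2 firmware at or above the assumed threshold 1.70."""
    return processor == "iLO2" and parse_fw_version(fw_version) >= (1, 70)
```

Comparing tuples of integers avoids float equality and lets one condition cover 1.70 and anything newer.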


Note that I have not tested this, as I had difficulty getting my hands on an iLO2-equipped system and downgrading the firmware.
Comment 2 Marek Grac 2009-07-13 07:43:25 EDT
Unfortunately I am currently unable to reproduce it, but can you try increasing POWER_TIMEOUT in fencing.py?

If that works, I will make these *_TIMEOUT values more configurable (e.g. via a command-line option).
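
For readers unfamiliar with what the timeout governs, here is a rough sketch of a status-polling loop with a configurable limit. It is not the real wait_power_status from fencing.py; the signature, the one-second poll interval and the power_timeout parameter are assumptions made for illustration.

```python
import time

def wait_power_status(get_power, wanted, power_timeout=20, sleep=time.sleep):
    """Poll get_power() once per second until it returns `wanted`.

    get_power is a hypothetical callable returning "on" or "off".
    Returns True if the wanted state was seen within power_timeout polls,
    False otherwise.
    """
    for _ in range(power_timeout):
        if get_power() == wanted:
            return True
        sleep(1)
    return False
```

A device that takes longer than the limit to report the new state makes the loop return False even if the power request itself eventually succeeds.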
Comment 3 Rich Jerrido 2009-08-08 11:18:20 EDT
We've been affected by this issue, so we tested our BL685 G5 blade servers with iLO firmware versions 1.60, 1.70, 1.77 and 1.78. The results were identical on both nodes.

Stock with no changes to fencing.py:

1.60 - Node powered off and powered back on.
1.70 - Node is powered off and does not power back on.
1.77 - Node is powered off and does not power back on.
1.78 - Node is powered off and does not power back on.


With the time.sleep(5) change to /usr/lib/fence/fencing.py recommended by Michael Kearey:
1.60 - Node powered off and powered back on.
1.70 - Node powered off and powered back on.
1.77 - Node powered off and powered back on.
1.78 - Node powered off and powered back on.


With POWER_TIMEOUT=30 in fencing.py as recommended by Marek Grac (Default is 20):

1.60 - Node powered off and powered back on.
1.70 - Node powered off and powered back on.
	However, fence_ilo reports "Timed out waiting to power ON"
1.77 - Node powered off and powered back on.
	However, fence_ilo reports "Timed out waiting to power ON"
1.78 - Node powered off and powered back on.
	However, fence_ilo reports "Timed out waiting to power ON"

It would appear that any firmware version from 1.70 onward is affected by this issue.
Comment 9 Marek Grac 2009-10-14 05:44:17 EDT
Tested on iLO2 with 1.70: 5 of 10 attempts were successful (same with the sleep).

However, it looks like ssh access works as expected again (v1.70, v1.79). So I modified the fence agent for ilo_mp to also support iLO2 with v1.70+ [older versions do not recognize stop -f /system1 and were only able to do a graceful shutdown]. My tests were successful, but it would be better if you could test that too.

As the power-wait option is used, you need a new fencing library, and you will have to set PYTHONPATH to the directory containing it. The fencing agent and fencing library are attached below.
Comment 10 Marek Grac 2009-10-14 05:45:08 EDT
Created attachment 364731 [details]
Fence agent for iLO2 with firmware 1.70+
Comment 11 Marek Grac 2009-10-14 05:45:44 EDT
Created attachment 364732 [details]
Fencing library
Comment 14 Debbie Johnson 2009-10-20 14:33:52 EDT
From IT 356169:

As I mentioned to you on our conference call - we've got a bit of an issue with fence_ilo and our HP BL460 blades.

Although it accomplishes its primary objective of ensuring that the node is not accessing cluster resources, we do really need it to power the node back on.

I get the following messages if I run fence_ilo manually:

Timed out waiting to power ON
Success: Rebooted

This problem seems similar to bugzilla BZ#507514.

In that bugzilla, someone suggests introducing a 5 second sleep in /usr/lib/fence/fencing.py as a workaround.

I can confirm that this workaround does work in our environment.

The bugzilla suggested that this problem was caused by version 1.70 of the iLO2 firmware; however, we're experiencing the same issue with 1.77. Perhaps it is every firmware since 1.70.

The publicly viewable bugzilla entries do not appear to show any progress on this issue by Red Hat. Can I please have an update on this?

We're planning on rolling out a large number of Xen clusters on HP blades in the coming months. We are very keen to see this issue resolved sooner rather than later.
Comment 15 Perry Myers 2009-10-20 17:06:59 EDT
In the upstream code we've developed a generic interface for all fence agents to allow you to specify various timeouts on the fence agent command line.  This functionality will be backported to the RHEL5 fence agents in order to resolve this issue.

Marek, can you give an update on the ETA of doing the backport?
Comment 19 Marek Grac 2009-10-22 05:12:18 EDT
@Debbie:

The patch to add timeouts is ready and can be deployed in RHEL5 (use --power-wait <N> for command-line tests; N is in seconds). But in our test environment it does not help at all (tested with v1.70 and v1.77). A different fence agent, originally developed for iLO MP, works for us [included as an attachment in this bz]; please test it if you can.

If the timeout works for you, then our timeout options will be enough and I can add them (but without default values that will work for you).

A test build can be ready on Friday morning (US time).
Comment 21 Marek Grac 2009-11-06 07:49:00 EST
There is a new option --retry-on=N (retry_on on stdin) for HP iLO. <N> specifies the number of attempts (to send the power ON signal and wait for status). During our tests we found that 3 is enough, so it is set as the default.
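
The retry behaviour described above can be pictured roughly as follows. This is a sketch under assumptions, not the code from the commit: power_on_with_retry, set_power and get_power are hypothetical names, and only the default of 3 attempts is taken from the comment.

```python
import time

def power_on_with_retry(set_power, get_power, retry_on=3, power_timeout=20,
                        sleep=time.sleep):
    """Send power ON and wait for status, repeating up to retry_on times.

    set_power and get_power are hypothetical callables standing in for the
    agent's power functions. Returns the number of the attempt that
    succeeded (1-based), or 0 if the node never reported ON.
    """
    for attempt in range(1, retry_on + 1):
        set_power("on")
        # Wait for the status to change before sending the signal again.
        for _ in range(power_timeout):
            if get_power() == "on":
                return attempt
            sleep(1)
    return 0
```

A device that ignores the first ON request would then be caught by the second attempt instead of being left powered off.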

Timeout options are also available, please consult http://sources.redhat.com/cluster/wiki/FenceTiming for details.

The alternative fence agent should work too, and it may be faster at powering the machine ON (the time to shut down the node looks similar).


http://git.fedorahosted.org/git/fence-agents.git?p=fence-agents.git;a=commit;h=8acc1a69d6695bc5d3a86f21f60910186053bac1
Comment 23 Jan Friesse 2009-11-09 05:16:33 EST
Created attachment 368159 [details]
Fix traceback when using any SNMP agent

The main problem was a forgotten "self" in fencing_snmp, introduced by commit:

Author: Marek 'marx' Grac <mgrac@redhat.com>
Date:   Fri Oct 9 13:36:25 2009 +0200
fencing: Timeout options added
Comment 24 Marek Grac 2009-11-09 08:22:28 EST
Fixed in fence_snmp_traceback.patch (build cman-2.0.115-17.el5)
Comment 25 Marek Grac 2009-11-11 09:23:16 EST
Fix problems introduced by --retry-on
Comment 33 Chris Ward 2010-02-11 05:04:09 EST
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.
Comment 38 errata-xmlrpc 2010-03-30 04:38:14 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
