Bug 1093742 - System is not power on after a fencing operation (ILO3).
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.5.0
Assigned To: Eli Mesika
QA Contact: sefi litmanovich
Whiteboard: infra
Keywords: ZStream
Depends On:
Blocks: 1114618 rhev3.5beta 1156165
Reported: 2014-05-02 10:16 EDT by Roman Hodain
Modified: 2016-02-10 14:33 EST (History)

See Also:
Fixed In Version: vt1.3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1114618 (view as bug list)
Environment:
Last Closed: 2015-02-11 13:01:08 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1117603 None None None Never
oVirt gerrit 29383 master MERGED core: handle fence agent power wait param on stop Never
Red Hat Product Errata RHSA-2015:0158 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.5.0 2015-02-11 17:38:50 EST

Description Roman Hodain 2014-05-02 10:16:40 EDT
Description of problem:
	When iLO is used for power management, the system is not powered on
automatically after fencing. This is caused by the default parameter
power_wait=4. With this value the system can be considered powered off after
only 4 seconds: it seems to be off but is still running, so the subsequent
"on" operation does not trigger a start.
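
The symptom described above can be illustrated with a toy model. This is only a sketch of the reported behaviour, not of real iLO firmware: the class `ToyPowerDevice`, its methods, the logical clock, and the rule that "on" is ignored while the machine is still running are all assumptions made for illustration.

```python
class ToyPowerDevice:
    """Hypothetical device: the machine keeps running for a while after 'off'."""

    def __init__(self, shutdown_seconds=4):
        self.shutdown_seconds = shutdown_seconds
        self.now = 0                  # logical clock, in seconds
        self.running = True           # real power state of the machine
        self.off_requested_at = None  # time of the pending 'off' request, if any

    def tick(self, seconds):
        """Advance the clock and complete a pending shutdown if it is due."""
        self.now += seconds
        if (self.off_requested_at is not None
                and self.now - self.off_requested_at >= self.shutdown_seconds):
            self.running = False
            self.off_requested_at = None

    def power_off(self):
        if self.running:
            self.off_requested_at = self.now

    def power_on(self):
        # Assumption: 'on' is a no-op while the machine is still running.
        if not self.running:
            self.running = True
            return True
        return False


def restart(device, wait_before_on):
    """Issue 'off', wait, issue 'on', and report whether the machine ends up running."""
    device.power_off()
    device.tick(wait_before_on)
    device.power_on()
    device.tick(10)  # give any pending shutdown time to finish
    return device.running
```

In this model, `restart(ToyPowerDevice(4), 1)` leaves the machine down because "on" arrives while it is still running, whereas `restart(ToyPowerDevice(4), 5)` brings it back up.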

	
Version-Release number of selected component (if applicable):
	RHEVM 3.3

How reproducible:
	100% in lightly loaded environments

Steps to Reproduce:
	1. Configure the default power management for iLO.
	2. Use iptables to block port 54321 on the hypervisor in order to
	   trigger fencing.

Actual results:
	System remains down

Expected results:
	System is powered off and then on again
Comment 1 Eli Mesika 2014-05-04 03:23:08 EDT
(In reply to Roman Hodain from comment #0)
> Description of problem:
> 	When iLO is used for power management. The system is not powered on
> automatically. This is caused by the default parameter power_wait=4. This can
> cause the system to be powered off after 4 seconds. The system seems to be
> off
> but it is still running and the "on" operation will not trigger the start
> operation because of that.

We use the status command to poll the host status, and reboot is actually implemented as:

call Stop command
wait for status 'off'
call Start command
wait for status 'on'

So, IMO, with iLO and power_wait=4 the status command should return 'on' during these 4 seconds.

Setting needinfo on Marek Grac to examine whether this is a bug in the fence-agents package.
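
The stop/wait/start/wait sequence above can be sketched as follows. This is a minimal illustration, not the engine's actual code: `restart_host`, `wait_for_status`, the retry count, and the polling interval are hypothetical names and values, and the fence operations are passed in as plain callables.

```python
import time


def wait_for_status(get_status, expected, retries, interval, sleep=time.sleep):
    """Poll get_status() until it returns `expected`; give up after `retries` polls."""
    for _ in range(retries):
        if get_status() == expected:
            return True
        sleep(interval)
    return False


def restart_host(stop, start, get_status, retries=10, interval=1.0, sleep=time.sleep):
    """Restart a host using only stop, start, and status, as described above."""
    stop()
    if not wait_for_status(get_status, "off", retries, interval, sleep):
        return False  # never start a host that was not confirmed off
    start()
    return wait_for_status(get_status, "on", retries, interval, sleep)
```

The point of the sketch is that the start command is only issued once a status poll has actually returned 'off', so a status that is reported too early or too late directly affects the outcome.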
Comment 2 Marek Grac 2014-05-05 03:20:11 EDT
Fencing process for reboot looks like:

1) wait for X seconds (--delay)
2) get status
3) call 'off' command
4) wait for X seconds (--power-wait)
5) wait until power status is changed (--power-timeout)
6) call 'on' command
7) wait for X seconds (--power-wait)
8) wait until power status is changed (--power-timeout)
9) problem in step (8) is not critical so fencing is considered to be successful 
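
The numbered flow above can be sketched roughly as follows. This is not the fence-agents source: the function name `fence_reboot`, the callables it receives, and the one-second polling step are assumptions; only the parameter names mirror the `--delay`, `--power-wait`, and `--power-timeout` options mentioned in the steps.

```python
import time


def fence_reboot(get_status, power_off, power_on, delay=0, power_wait=0,
                 power_timeout=20, sleep=time.sleep):
    """Sketch of the reboot flow; returns True when fencing counts as successful."""

    def wait_for(expected):
        sleep(power_wait)                # steps 4 and 7: fixed wait
        for _ in range(power_timeout):   # steps 5 and 8: poll until status changes
            if get_status() == expected:
                return True
            sleep(1)                     # assumed polling step of one second
        return False

    sleep(delay)                         # step 1
    get_status()                         # step 2
    power_off()                          # step 3
    if not wait_for("off"):
        return False                     # host never confirmed off: fencing failed
    power_on()                           # step 6
    wait_for("on")                       # steps 7 and 8
    return True                          # step 9: a failure in step 8 is not critical
```

Note that under step 9 the function reports success even when the host never comes back on, which matches a "system remains down" outcome after an otherwise successful fence.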


If you can test the fence agent directly, that would be great, because currently I do not have any information that would be helpful.
Comment 3 Eli Mesika 2014-05-05 04:07:24 EDT
(In reply to Marek Grac from comment #2)
>Fencing process for reboot looks like:
>1) wait for X seconds (--delay)
>2) get status
>3) call 'off' command
>4) wait for X seconds (--power-wait)
>5) wait until power status is changed (--power-timeout)

Marek, we are not using 'reboot' since it's a fire-and-forget operation, and we have to make sure that the host is down before attempting to run its VMs on another host. (We poll the host status, so during a restart we cannot tell whether the host was ever 'off'.)

So, in oVirt a reboot is done as I described in comment 1, using only the stop and start commands.

So the question is:
If I use the iLO card and call the stop command at time T(0) with power_wait=4, I understand that calling status at T(5) will return 'off', but what does the device return as status from T(1) to T(4)?
From what you described in comment 2, I understand that the system will return 'off' immediately (step 3 in your flow), while the machine is actually off only in step 5. Does that mean that a start command issued after step 3 and before step 5 will be ignored (as documented in the bug description)?
Comment 4 Marek Grac 2014-05-05 04:11:43 EDT
@Eli:

Fence agents do not trust those devices either, so 'reboot' is done as off, check status, on, check status. IMHO there is no need to do the same on the other side too.

--
with power_wait=4;

T(0): stop
T(1,2,3,4): sleep / do nothing
T(5): check if machine is 'off'
T(6): ...
T(5+ --power-timeout): check if machine is 'off'
T(5+ --power-timeout + 1): operation was not successful

Take a look at:
https://fedorahosted.org/cluster/wiki/FenceTiming
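
The timeline above can be turned into a small calculation. This is a sketch under two assumptions: that status is checked once per second, and that the first check always happens one second after the power_wait sleep ends, as the T(5) entry suggests for power_wait=4. The function names are hypothetical.

```python
def off_check_times(power_wait, power_timeout):
    """Seconds after 'stop' at which the agent would check for status 'off'."""
    first = power_wait + 1  # T(1)..T(power_wait) are spent sleeping
    return list(range(first, first + power_timeout + 1))


def failure_time(power_wait, power_timeout):
    """Second at which the operation is declared unsuccessful if 'off' is never seen."""
    return power_wait + 1 + power_timeout + 1
```

For example, with power_wait=4 and a power timeout of 3 seconds, the checks fall at T(5) through T(8) and the operation is given up at T(9).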
Comment 7 Marek Grac 2014-05-05 08:22:51 EDT
@Eli:

Fence agents do not finish an on/off operation unless the status has been validated, so running the 'status' action again should not be required. If it is, then it should be fixed on the fence agents side.
Comment 9 Eli Mesika 2014-05-20 03:20:44 EDT
Please attach the full relevant engine and vdsm logs. I need to see the whole flow to investigate this.
Comment 16 Eli Mesika 2014-06-19 09:26:18 EDT
Please specify the exact ILO details being used
Comment 17 Roman Hodain 2014-06-20 02:13:22 EDT
(In reply to Eli Mesika from comment #16)
> Please specify the exact ILO details being used

Hi,

Firmware Version 1.28
Comment 31 Marek Grac 2014-06-26 04:33:57 EDT
@Eli: completely right

Regarding FenceExecutor: you do not have to implement this flow yourself; you can leave it to the fence agent. A code review of that part should be simple enough, and since this is a part the cluster already relies on, it can save you some complexity.
Comment 35 sefi litmanovich 2014-08-25 08:24:25 EDT
Verified with ovirt-engine-3.5.0-0.0.master.20140821064931.gitb794d66.el6.noarch.

Tested with the ipmilan fence agent.
Added the agent's credentials to the host's power management configuration.
Added power_wait=20 in options.

Issued a fence STOP command.
After the STOP action was executed:
2014-08-25 15:09:20,333 INFO  [org.ovirt.engine.core.bll.StopVdsCommand] (org.ovirt.thread.pool-8-thread-12) [79a7668e] Waiting for vds rose07.qa.lab.tlv.redhat.com to stop

Waited for 20 + 5 seconds, and then the status command was issued and received status = off.
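
The "20 + 5 seconds" observed above can be sketched as a delay computed from the agent options. This is an illustration only, not the engine's implementation: the function names are hypothetical, the comma-separated key=value format of the options string is an assumption, and the base delay of 5 seconds is simply taken from the wait observed in this verification.

```python
def parse_fence_options(options):
    """Parse an options string such as 'lanplus=1,power_wait=20' into a dict."""
    result = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key.strip()] = value.strip() if sep else ""
    return result


def seconds_before_status_poll(options, base_delay=5):
    """Delay before the first status poll after a stop: power_wait plus a base delay."""
    raw = parse_fence_options(options).get("power_wait", "0")
    try:
        power_wait = int(raw)
    except ValueError:
        power_wait = 0  # ignore a malformed value in this sketch
    return max(power_wait, 0) + base_delay
```

With the option used in this test, `seconds_before_status_poll("power_wait=20")` gives 25.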
Comment 37 errata-xmlrpc 2015-02-11 13:01:08 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html
