Bug 1093742 - System is not power on after a fencing operation (ILO3).
Summary: System is not power on after a fencing operation (ILO3).
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.5.0
Assignee: Eli Mesika
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On:
Blocks: 1114618 rhev3.5beta 1156165
Reported: 2014-05-02 14:16 UTC by Roman Hodain
Modified: 2019-04-28 09:55 UTC (History)

Fixed In Version: vt1.3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1114618
Environment:
Last Closed: 2015-02-11 18:01:08 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:



Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) 1117603 | 0 | None | None | None | Never
Red Hat Product Errata RHSA-2015:0158 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Virtualization Manager 3.5.0 | 2015-02-11 22:38:50 UTC
oVirt gerrit 29383 | 0 | master | MERGED | core: handle fence agent power wait param on stop | Never

Description Roman Hodain 2014-05-02 14:16:40 UTC
Description of problem:
	When iLO is used for power management, the system is not powered on
automatically. This is caused by the default parameter power_wait=4. This can
cause the system to be powered off after 4 seconds. The system seems to be off
but it is still running, so the "on" operation will not trigger the start
operation.

	
Version-Release number of selected component (if applicable):
	RHEVM 3.3

How reproducible:
	100% in lightly loaded environments

Steps to Reproduce:
	1. Configure the default power management for iLO.
	2. Use iptables to block port 54321 on the hypervisor in order to
	   trigger fencing.

Actual results:
	System remains down

Expected results:
	System is powered off and then on again

Comment 1 Eli Mesika 2014-05-04 07:23:08 UTC
(In reply to Roman Hodain from comment #0)
> Description of problem:
> 	When iLO is used for power management. The system is not powered on
> automatically. This is caused by the default parameter power_wait=4. This can
> cause the system to be powered off after 4 seconds. The system seems to be
> off
> but it is still running and the "on" operation will not trigger the start
> operation because of that.

We are using the status command to poll the host status, and reboot is actually implemented as:

call Stop command
wait for status 'off'
call Start command
wait for status 'on'
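
The four steps above can be sketched roughly as follows. This is only an illustration, not the actual ovirt-engine code: `restart_host`, `wait_for_status`, `FakePowerCard`, and the retry and interval values are hypothetical names and numbers chosen for the example.

```python
import time


def wait_for_status(get_status, wanted, retries=10, interval=1.0, sleep=time.sleep):
    """Poll get_status() until it returns `wanted`; True on success."""
    for _ in range(retries):
        if get_status() == wanted:
            return True
        sleep(interval)
    return False


def restart_host(card, sleep=time.sleep):
    """Restart built only from stop/start plus status polling."""
    card.stop()                                            # call Stop command
    if not wait_for_status(card.status, "off", sleep=sleep):
        return False                                       # host never reported 'off'
    card.start()                                           # call Start command
    return wait_for_status(card.status, "on", sleep=sleep)


class FakePowerCard:
    """Hypothetical stand-in for a power-management card (assumption)."""

    def __init__(self):
        self.state = "on"
        self.calls = []

    def stop(self):
        self.calls.append("stop")
        self.state = "off"

    def start(self):
        self.calls.append("start")
        self.state = "on"

    def status(self):
        self.calls.append("status")
        return self.state
```

With a card that changes state immediately, the call order is stop, status, start, status. A card that reports 'off' before the machine is really down is the case this bug is about.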

So, IMO, with ILO, if we use power_wait=4 then status should return 'on' during these 4 seconds.

Setting needinfo on Marek G to examine whether this is a bug in the fence-agents package.

Comment 2 Marek Grac 2014-05-05 07:20:11 UTC
Fencing process for reboot looks like:

1) wait for X seconds (--delay)
2) get status
3) call 'off' command
4) wait for X seconds (--power-wait)
5) wait until power status is changed (--power-timeout)
6) call 'on' command
7) wait for X seconds (--power-wait)
8) wait until power status is changed (--power-timeout)
9) a problem in step (8) is not critical, so fencing is considered to be successful
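
A rough sketch of the sequence above, for illustration only. It is not the fence-agents source: `fence_reboot`, `wait_until`, and `FakeDevice` are hypothetical names, the one-second polling interval is an assumption, and treating a failure in step 5 as a failed fencing is inferred from the rest of this thread.

```python
import time


def wait_until(get_status, wanted, power_timeout, sleep):
    """Check the status once per second, at most power_timeout times."""
    for _ in range(power_timeout):
        if get_status() == wanted:
            return True
        sleep(1)
    return False


def fence_reboot(dev, delay=0, power_wait=4, power_timeout=20, sleep=time.sleep):
    sleep(delay)                                          # 1) --delay
    dev.status()                                          # 2) get status
    dev.off()                                             # 3) call 'off'
    sleep(power_wait)                                     # 4) --power-wait
    if not wait_until(dev.status, "off", power_timeout, sleep):
        return False                                      # 5) never went off
    dev.on()                                              # 6) call 'on'
    sleep(power_wait)                                     # 7) --power-wait
    wait_until(dev.status, "on", power_timeout, sleep)    # 8) result ignored
    return True                                           # 9) still a success


class FakeDevice:
    """Hypothetical device; the flags simulate ignored commands (assumption)."""

    def __init__(self, off_works=True, on_works=True):
        self.state = "on"
        self.off_works = off_works
        self.on_works = on_works

    def status(self):
        return self.state

    def off(self):
        if self.off_works:
            self.state = "off"

    def on(self):
        if self.on_works:
            self.state = "on"
```

Note that a device which ignores the 'on' command still yields a successful fencing here, which matches step 9 and the symptom in the description (host stays down).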


If you can test the fence agent directly, that would be great, because currently I do not have any information that would be helpful.

Comment 3 Eli Mesika 2014-05-05 08:07:24 UTC
(In reply to Marek Grac from comment #2)
>Fencing process for reboot looks like:
>1) wait for X seconds (--delay)
>2) get status
>3) call 'off' command
>4) wait for X seconds (--power-wait)
>5) wait until power status is changed (--power-timeout)

Marek, we are not using 'reboot' since it's a fire-and-forget operation and we have to make sure that the host is down before attempting to run its VMs on another host. (We are polling the host status, so on restart we cannot tell whether the host was 'off'...)

So, in oVirt a reboot is done as I described in comment 1, using only the stop and start commands.

So the question is:
If I use the ILO card and call the stop command at time T(0) with power_wait=4, I understand that calling status at T(5) will return 'off', but what does the device return as status from T(1) to T(4)?
From what you described in comment 2, I understand that the system will return 'off' immediately (step 3 in your flow) while the machine is off only in step 5. Does that mean that if a start command is issued after step 3 and before step 5, it will be ignored (as documented in the bug description)?

Comment 4 Marek Grac 2014-05-05 08:11:43 UTC
@Eli:

fence agents do not trust those devices either, so 'reboot' is done as off, check status, on, check status. IMHO, there is no need to do the same on the other side too.

--
with power_wait=4;

T(0): stop
T(1,2,3,4): sleep / do nothing
T(5): check if machine is 'off'
T(6): ...
T(5+ --power-timeout): check if machine is 'off'
T(5+ --power-timeout + 1): operation was not successful
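
The arithmetic behind this timeline can be written out as a small helper. `off_timeline` is a hypothetical name, and the power_timeout value of 20 in the usage line is only an example value.

```python
def off_timeline(power_wait, power_timeout):
    """Return (first_check, last_check, give_up) in seconds after 'stop' at T(0)."""
    first_check = power_wait + 1             # T(5) when power_wait=4
    last_check = first_check + power_timeout # T(5 + power_timeout)
    give_up = last_check + 1                 # operation reported as not successful
    return first_check, last_check, give_up


# Example: power_wait=4 and an assumed power_timeout of 20 seconds.
print(off_timeline(4, 20))
```

So with power_wait=4 no status check happens during T(1) to T(4); the first one is at T(5).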

Take a look at:
https://fedorahosted.org/cluster/wiki/FenceTiming

Comment 7 Marek Grac 2014-05-05 12:22:51 UTC
@Eli:

fence agents do not finish the on/off operation if the status is not validated. So running the 'status' action again should not be required. If it is, then it should be fixed on the fence agents side.

Comment 9 Eli Mesika 2014-05-20 07:20:44 UTC
Please attach the full relevant engine & vdsm logs; I need to see the whole flow to investigate this.

Comment 16 Eli Mesika 2014-06-19 13:26:18 UTC
Please specify the exact ILO details being used

Comment 17 Roman Hodain 2014-06-20 06:13:22 UTC
(In reply to Eli Mesika from comment #16)
> Please specify the exact ILO details being used

Hi,

Firmware Version 1.28

Comment 31 Marek Grac 2014-06-26 08:33:57 UTC
@Eli: completely right

Regarding FenceExecutor: you do not have to implement this flow and can leave that to the fence agent. Doing a code review of that part should be simple enough, and as this is a part the cluster relies on, it can save you some complexity.

Comment 35 sefi litmanovich 2014-08-25 12:24:25 UTC
Verified with ovirt-engine-3.5.0-0.0.master.20140821064931.gitb794d66.el6.noarch.

tested with ipmilan fence agent.
added agent's credentials to host's power management configurations.
added power_wait=20 in options.

issued fence STOP command.
after STOP action was executed:
2014-08-25 15:09:20,333 INFO  [org.ovirt.engine.core.bll.StopVdsCommand] (org.ovirt.thread.pool-8-thread-12) [79a7668e] Waiting for vds rose07.qa.lab.tlv.redhat.com to stop

waited for 20 + 5 seconds and then status command was issued and received status = off.

Comment 37 errata-xmlrpc 2015-02-11 18:01:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html

