Description of problem:
When iLO is used for power management, the system is not powered on
automatically. This is caused by the default parameter power_wait=4. This can
cause the system to be powered off after 4 seconds. The system seems to be off
but it is still running and the "on" operation will not trigger the start
operation because of that.
Version-Release number of selected component (if applicable):
How reproducible:
100% in not very loaded environments
Steps to Reproduce:
1. Configure the default power management for iLO.
2. Use iptables to block port 54321 on the hypervisor in order to
trigger the fencing.
Actual results:
System remains down
Expected results:
System is powered off and then on again
(In reply to Roman Hodain from comment #0)
> Description of problem:
> When iLO is used for power management. The system is not powered on
> automatically. This is caused by the default parameter power_wait=4. This can
> cause the system to be powered off after 4 seconds. The system seems to be off
> but it is still running and the "on" operation will not trigger the start
> operation because of that.
We use the status command to poll the host status, and reboot is actually implemented as:
1) call Stop command
2) wait for status 'off'
3) call Start command
4) wait for status 'on'
So, IMO, with iLO and power_wait=4, status should return 'on' during these 4 seconds.
Putting needinfo on Marek G to examine whether this is a BZ in the fence-agents package.
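The stop/poll/start/poll sequence described in this comment can be sketched roughly as follows. This is only an illustration, not the engine's actual code: `restart_host`, `wait_for_status`, the callables passed in, and the retry/interval values are all hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str], wanted: str,
                    retries: int, interval: float,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_status() until it returns `wanted` or the retries run out."""
    for _ in range(retries):
        if get_status() == wanted:
            return True
        sleep(interval)
    return False


def restart_host(stop: Callable[[], None], start: Callable[[], None],
                 get_status: Callable[[], str],
                 retries: int = 10, interval: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Restart as: Stop -> wait for 'off' -> Start -> wait for 'on'."""
    stop()
    if not wait_for_status(get_status, "off", retries, interval, sleep):
        # Host was never confirmed down, so do not issue Start.
        return False
    start()
    return wait_for_status(get_status, "on", retries, interval, sleep)
```

The point of the sketch is that Start is only issued after 'off' has been observed, so a status that reports 'off' too early would make the Start arrive while the host is still running.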
Fencing process for reboot looks like:
1) wait for X seconds (--delay)
2) get status
3) call 'off' command
4) wait for X seconds (--power-wait)
5) wait until power status is changed (--power-timeout)
6) call 'on' command
7) wait for X seconds (--power-wait)
8) wait until power status is changed (--power-timeout)
9) problem in step (8) is not critical so fencing is considered to be successful
If you can test the fence agent directly, that would be great, because currently I do not have any information that would be helpful.
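As a rough illustration of steps 1) to 9) above, the flow could look like the sketch below. It is not the fence-agents source: `device` is a hypothetical object with `command()` and `status()` methods, the one-second polling interval is an assumption, and the default values are placeholders apart from power_wait=4, which is the value named in this bug.

```python
import time


def set_power(device, action, wanted, power_wait, power_timeout, sleep):
    """Send the command, wait power_wait seconds, then poll for the new status."""
    device.command(action)
    sleep(power_wait)
    for _ in range(power_timeout):
        if device.status() == wanted:
            return True
        sleep(1)  # assumed polling interval of one second
    return False


def fence_reboot(device, delay=0, power_wait=4, power_timeout=20,
                 sleep=time.sleep):
    """Reboot as off + verify, then on + verify (steps 1-9 of the comment)."""
    sleep(delay)                      # step 1
    device.status()                   # step 2
    if not set_power(device, "off", "off", power_wait, power_timeout, sleep):
        return False                  # steps 3-5 failed: host not confirmed off
    # Steps 6-8; per step 9 a failure here is not critical.
    set_power(device, "on", "on", power_wait, power_timeout, sleep)
    return True
```

Passing `sleep` in as a parameter is only there so the flow can be exercised against a fake device without real waiting.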
(In reply to Marek Grac from comment #2)
>Fencing process for reboot looks like:
>1) wait for X seconds (--delay)
>2) get status
>3) call 'off' command
>4) wait for X seconds (--power-wait)
>5) wait until power status is changed (--power-timeout)
Marek, we are not using 'reboot' since it's a fire-and-forget operation and we have to make sure that the host is down before attempting to run its VMs on another host. (We poll the host status, so with a restart we cannot tell whether the host was ever 'off'.)
So, in oVirt a reboot is done as I described in comment 1, using only the stop and start commands.
So the question is :
If I use the iLO card and call the stop command at time T(0) with power_wait=4, I understand that calling status at T(5) will return 'off', but what does the device return as status from T(1) to T(4)?
From what you described in comment 2, I understand that the system will return 'off' immediately (step 3 in your flow) while the machine is actually off only in step 5. Does that mean that a start command issued after step 3 and before step 5 will be ignored (as documented in the bug description)?
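To make the timing question concrete, here is a toy model of the scenario being asked about. It is purely hypothetical and does not describe real iLO firmware: `ToyDevice`, its `shutdown_latency`, and both behaviours marked as assumptions in the comments are invented for illustration.

```python
class ToyDevice:
    """Hypothetical power controller with a virtual clock (seconds)."""

    def __init__(self, shutdown_latency: int):
        self.shutdown_latency = shutdown_latency
        self.now = 0
        self.running = True
        self.off_done_at = None  # time at which a pending 'off' completes

    def tick(self, seconds: int) -> None:
        """Advance the virtual clock and complete a pending shutdown."""
        self.now += seconds
        if self.off_done_at is not None and self.now >= self.off_done_at:
            self.running = False
            self.off_done_at = None

    def off(self) -> None:
        if self.running and self.off_done_at is None:
            self.off_done_at = self.now + self.shutdown_latency
            self.tick(0)  # a zero-latency shutdown completes at once

    def on(self) -> None:
        # Assumption: 'on' is ignored while the host is still running.
        if not self.running:
            self.running = True

    def status(self) -> str:
        # Assumption: 'off' is reported as soon as shutdown is requested.
        if not self.running or self.off_done_at is not None:
            return "off"
        return "on"


def stop_then_start(device: ToyDevice, wait: int) -> bool:
    """Issue 'off', wait `wait` seconds, issue 'on'; report the final state."""
    device.off()
    device.tick(wait)
    device.on()
    device.tick(device.shutdown_latency)  # let any pending shutdown finish
    return device.running
```

Under these assumptions, a wait shorter than the shutdown latency leaves the host down (the 'on' arrives while it is still running), while a longer wait brings it back up.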
Fence agents do not trust those devices either, so 'reboot' is done as off, check status, on, check status. IMHO there is no need to do the same on the other side too.
T(1,2,3,4): sleep / do nothing
T(5): check if machine is 'off'
T(5+ --power-timeout): check if machine is 'off'
T(5+ --power-timeout + 1): operation was not successful
Take a look at:
Fence agents do not finish an on/off operation if the status is not validated, so running the 'status' action again should not be required. If it is, then it should be fixed on the fence agents side.
Please attach the full relevant engine & vdsm logs; I need to see the whole flow to investigate this.
Please specify the exact ILO details being used
(In reply to Eli Mesika from comment #16)
> Please specify the exact ILO details being used
Firmware Version 1.28
@Eli: completely right
Re FenceExecutor: you do not have to implement this flow yourself; you can leave that to the fence agent. A code review of that part should be simple enough, and since this is a part the cluster relies on, it can save you some complexity.
Verified with ovirt-engine-3.5.0-0.0.master.20140821064931.gitb794d66.el6.noarch.
tested with ipmilan fence agent.
added agent's credentials to host's power management configurations.
added power_wait=20 in options.
issued fence STOP command.
after STOP action was executed:
2014-08-25 15:09:20,333 INFO [org.ovirt.engine.core.bll.StopVdsCommand] (org.ovirt.thread.pool-8-thread-12) [79a7668e] Waiting for vds rose07.qa.lab.tlv.redhat.com to stop
waited for 20 + 5 seconds and then status command was issued and received status = off.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.