Bug 1005756 - stopping the engine service while fencing is in progress might result in powered-off hosts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Eli Mesika
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On:
Blocks:
Reported: 2013-09-09 10:43 UTC by Oved Ourfali
Modified: 2016-02-10 19:35 UTC (History)

Fixed In Version: ovirt-engine-3.5.0_beta
Clone Of:
Environment:
Last Closed: 2014-10-17 12:24:33 UTC
oVirt Team: Infra
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 28305 | 0 | master | MERGED | core: start PM enabled hosts after engine restart | Never

Description Oved Ourfali 2013-09-09 10:43:36 UTC
Description of problem:

If the engine service is stopped while a host is being fenced, the host might remain powered off until it is powered up manually.

Comment 1 Eli Mesika 2014-04-02 20:21:41 UTC
(In reply to Oved Ourfali from comment #0)
> Description of problem:
> 
> If during host fencing the engine service is stopped, the host might remain
> in powered-off status until manually powering it up.

Relevant mainly to hosted engine.

Comment 2 sefi litmanovich 2014-09-18 15:54:03 UTC
Tried to verify according to the following scenario:

1. Installed hosted engine on HOST A with hosted-engine --deploy.
2. Added a second host, HOST B, also with hosted-engine --deploy.
3. On the VM running the engine, set PMHealthCheck=True and PMHealthCheckIntervalInSec=60 using engine-config.
4. Restarted the engine.
5. Configured HOST A (which is the master and runs the VM) with working PM agent credentials and tested them to validate.
6. Brought the service network down on HOST A.

Result:

After some time, HOST B became the master and the VM was restarted on it, as expected from hosted-engine HA. After the engine started again, HOST A was identified as non-responsive and the audit log mentioned the grace period until the next fence action (120 sec). This recurs every 120 seconds and HOST A is not fenced at any point.

Comment 3 Eli Mesika 2014-09-20 19:36:59 UTC
Please consult with me; I don't think you are testing this feature correctly.

This feature is meant to resolve the case in which the engine was restarted EXACTLY after the host was stopped and BEFORE it was started again. That means the host status in vds_dynamic at that point should be 'reboot', so when the engine is restarted it will try to start this host after a configurable quiet time (5 min).

From your explanation above I think you missed that point.
Please recheck, or provide a vds_dynamic capture for Host A from before the engine was restarted.
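
For illustration, the startup behaviour described in this comment could be sketched roughly as below. This is a hypothetical Python sketch, not the engine's actual code: the `Host` class, the `hosts_to_start` function and the `QUIET_TIME_SEC` constant are invented names, and the 300-second value only mirrors the 5 minutes mentioned above.

```python
from dataclasses import dataclass

# Hypothetical constant: mirrors the "5 min" quiet time mentioned in the comment.
QUIET_TIME_SEC = 300


@dataclass
class Host:
    name: str
    status: str       # status as it would be recorded in vds_dynamic, e.g. "reboot"
    pm_enabled: bool  # whether a power management agent is configured


def hosts_to_start(hosts, seconds_since_engine_start):
    """Return the names of hosts the engine should try to power on."""
    # Nothing is started while the engine is still inside its quiet time.
    if seconds_since_engine_start < QUIET_TIME_SEC:
        return []
    # Only PM-enabled hosts that were left in 'reboot' status are candidates.
    return [h.name for h in hosts if h.pm_enabled and h.status == "reboot"]


hosts = [
    Host("host-a", "reboot", True),
    Host("host-b", "up", True),
    Host("host-c", "reboot", False),
]
print(hosts_to_start(hosts, 60))
print(hosts_to_start(hosts, 301))
```

Under these assumptions, a host that was merely non-responsive (not in 'reboot' status) before the engine restart is never selected, which is the distinction this comment is drawing.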

Comment 4 sefi litmanovich 2014-10-02 07:42:49 UTC
Attaching the conclusions from our email correspondence:

With two hosts supporting hosted-engine, Host A (with PM configured) and Host B, when we block communication to Host A:

* Engine is restarted on Host B BEFORE the non-responding treatment event starts.
* Engine runs on Host B.
* Engine on Host B ignores Host A being non-responsive, since it is within the first 5 min after engine startup.

Result: Host A is never restarted although it has PM configured.

Please advise how this should be resolved.

Comment 5 Doron Fediuck 2014-10-05 08:34:38 UTC
As replied elsewhere, the fencing procedures should take other
scenarios into account, including engine crash. If we have an
indication in the DB you can use it to fence post quiet time.
Thus even if an un-hosted engine crashes you can resume such actions.
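
One possible reading of this suggestion, as a rough Python sketch: the decision is driven by the status persisted in the DB, and a fence that falls inside the quiet time is deferred rather than dropped. Everything here is an assumption made for illustration (the `fence_decision` function, the `"non_responsive"` status string and the 300-second `QUIET_TIME_SEC`); it is not the engine's actual logic.

```python
# Hypothetical constant: the 5-minute quiet time after engine startup.
QUIET_TIME_SEC = 300


def fence_decision(db_status, pm_enabled, seconds_since_engine_start):
    """Decide what to do with a host, based on its persisted DB status."""
    # Hosts without power management, or not marked non-responsive, are left alone.
    if not pm_enabled or db_status != "non_responsive":
        return "none"
    # Inside the quiet time the fence is postponed, not forgotten.
    if seconds_since_engine_start < QUIET_TIME_SEC:
        remaining = QUIET_TIME_SEC - seconds_since_engine_start
        return f"defer {remaining}s"
    return "fence"


print(fence_decision("non_responsive", True, 100))
print(fence_decision("non_responsive", True, 300))
print(fence_decision("non_responsive", False, 300))
```

Because the input is the DB state rather than in-memory state, the same decision could be resumed after an engine crash, hosted or not.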

Comment 6 sefi litmanovich 2014-10-14 14:23:44 UTC
This is clear, but as I understood from my correspondence with emesika, this bz should be verified using hosted-engine, since the most likely use case is a host running a hosted engine being fenced due to, e.g., a network problem, with the engine service stopping as well and starting again on a second host. This is where I hit the problem described above.

If you want this bz to be verified based on a different scenario, please let me know; I'll do that and open a separate bz for the hosted engine problem.

Comment 7 Sandro Bonazzola 2014-10-17 12:24:33 UTC
oVirt 3.5 has been released and should include the fix for this issue.

Comment 8 Eli Mesika 2014-10-22 08:30:16 UTC
The bug is relevant mainly to hosted engine, so it should be reopened if it failed QA.

