Bug 1005756

Summary: stopping the engine service while fencing is in progress might result in powered-off hosts
Product: [Retired] oVirt Reporter: Oved Ourfali <oourfali>
Component: ovirt-engine-core Assignee: Eli Mesika <emesika>
Status: CLOSED CURRENTRELEASE QA Contact: sefi litmanovich <slitmano>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.3 CC: bugs, dfediuck, emesika, gklein, iheim, rbalakri, slitmano, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: ovirt-engine-3.5.0_beta Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-17 12:24:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Oved Ourfali 2013-09-09 10:43:36 UTC
Description of problem:

If during host fencing the engine service is stopped, the host might remain in powered-off status until manually powering it up.

Comment 1 Eli Mesika 2014-04-02 20:21:41 UTC
(In reply to Oved Ourfali from comment #0)
> Description of problem:
> 
> If during host fencing the engine service is stopped, the host might remain
> in powered-off status until manually powering it up.

Relevant mainly to hosted engine.

Comment 2 sefi litmanovich 2014-09-18 15:54:03 UTC
Tried to verify according to the following scenario:

1. Installed hosted engine on HOST A with hosted-engine --deploy.
2. Added a second host, HOST B, also with hosted-engine --deploy.
3. On the VM running the engine, set PMHealthCheck=True and PMHealthCheckIntervalInSec=60 using engine-config.
4. Restarted the engine.
5. Configured HOST A (which is the master and is running the VM) with working PM agent credentials and ran the test to validate them.
6. Brought the service network down on HOST A.
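
Steps 3 and 4 can be scripted. The sketch below only builds the command lines and does not run them. The option names are copied from step 3 as written and have not been checked against the engine's option list; the `ovirt-engine` service name and the `build_commands` helper are assumptions made for illustration.

```python
def build_commands(options):
    """Return argv lists that set each option with engine-config, then restart the engine."""
    commands = [["engine-config", "-s", f"{key}={value}"] for key, value in options.items()]
    # Step 4: restart the engine (service name is an assumption).
    commands.append(["systemctl", "restart", "ovirt-engine"])
    return commands


if __name__ == "__main__":
    # Option names as written in step 3 (unverified).
    step3 = {"PMHealthCheck": "True", "PMHealthCheckIntervalInSec": "60"}
    for argv in build_commands(step3):
        print(" ".join(argv))
```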

Result:

After some time, HOST B became the master and the VM was restarted on it by hosted-engine HA. Once the engine was running again, HOST A was identified as non-responsive and the audit log reported the grace period until the next fence action (120 sec). This message recurs every 120 seconds, and HOST A is never fenced.

Comment 3 Eli Mesika 2014-09-20 19:36:59 UTC
Please consult with me, I don't think you are testing this feature correctly.

This feature is meant to resolve the case where the engine was restarted EXACTLY after the host was stopped and BEFORE it was started again. That means the host status in vds_dynamic at that point should be 'reboot', so when the engine is restarted it will try to start this host after a configurable quiet time (5 min).

From your explanation above, I think you missed that point.
Please recheck, or provide a vds_dynamic capture for Host A from before the engine was restarted.
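
For illustration, the behaviour described in this comment can be modelled as below. This is a minimal sketch and not the engine's code: the `Host` class, the `hosts_to_start` function, and the use of plain strings for the status are assumptions, and only the 'reboot' status and the 5 min quiet time come from the comment.

```python
from dataclasses import dataclass

QUIET_TIME_SEC = 5 * 60  # the comment gives 5 min as the (configurable) quiet time


@dataclass
class Host:
    name: str
    status: str  # simplified stand-in for the host status kept in vds_dynamic


def hosts_to_start(hosts, seconds_since_engine_start, quiet_time_sec=QUIET_TIME_SEC):
    """Return the hosts the engine should power on after an interrupted restart."""
    if seconds_since_engine_start < quiet_time_sec:
        return []  # still inside the quiet time: take no action yet
    # 'reboot' means the stop was issued but the start never happened.
    return [host for host in hosts if host.status == "reboot"]


if __name__ == "__main__":
    hosts = [Host("Host A", "reboot"), Host("Host B", "up")]
    print(hosts_to_start(hosts, seconds_since_engine_start=60))
    print(hosts_to_start(hosts, seconds_since_engine_start=300))
```

Under this model, a host whose status was anything other than 'reboot' when the engine went down is not picked up by the recovery step.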

Comment 4 sefi litmanovich 2014-10-02 07:42:49 UTC
Attaching our conclusions from our email correspondence:

It seems that the following happens when we have two hosts supporting hosted-engine, Host A (with PM configured) and Host B, and we block communication to Host A:

* The engine is restarted on Host B BEFORE the non-responding treatment event starts.
* The engine runs on Host B.
* The engine on Host B ignores the non-responding Host A, since it is within 5 min of engine startup.

Result: Host A is never restarted, although it has PM configured.
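
One way to read the three bullets above is as a per-host decision that depends on the host status and on the time since engine startup. The sketch below is a hypothetical model of that reading, not the engine's logic: the `engine_action` function, the status strings other than 'reboot', and the returned action names are assumptions.

```python
QUIET_TIME_SEC = 5 * 60  # the 5 min startup window mentioned in the comments


def engine_action(host_status, seconds_since_engine_start):
    """Hypothetical decision for one host after an engine restart."""
    in_quiet_time = seconds_since_engine_start < QUIET_TIME_SEC
    if host_status == "reboot":
        # Case the fix targets: a stop was issued before the engine went down.
        return "wait" if in_quiet_time else "start host"
    if host_status == "non-responsive":
        # Case from this comment: fencing had not started before the restart.
        return "ignore" if in_quiet_time else "non-responding treatment"
    return "none"


if __name__ == "__main__":
    for seconds in (60, 600):
        print(seconds, engine_action("non-responsive", seconds))
```

In this model Host A is only ignored during the first 5 minutes. The comment reports that it is never restarted, which the sketch does not explain.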

Please advise how this should be resolved.

Comment 5 Doron Fediuck 2014-10-05 08:34:38 UTC
As replied elsewhere, the fencing procedures should take other
scenarios into account, including engine crash. If we have an
indication in the DB you can use it to fence post quiet time.
Thus even if an un-hosted engine crashes you can resume such actions.
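
The idea of keeping an indication in the DB and acting on it after the quiet time can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: it uses an in-memory SQLite database in place of the engine DB, and the `pending_fence` table and all function names are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store that holds the 'restart in progress' indication."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS pending_fence (host TEXT PRIMARY KEY)")
    return db


def fence_restart(db, host, stop, start):
    """Power-cycle a host, recording the intent before the stop is issued."""
    db.execute("INSERT OR REPLACE INTO pending_fence VALUES (?)", (host,))
    db.commit()
    stop(host)
    start(host)  # not completed if the engine stops or crashes at this point
    db.execute("DELETE FROM pending_fence WHERE host = ?", (host,))
    db.commit()


def resume_after_quiet_time(db, start):
    """Call once the startup quiet time has passed: finish interrupted restarts."""
    hosts = [row[0] for row in db.execute("SELECT host FROM pending_fence ORDER BY host")]
    for host in hosts:
        start(host)
        db.execute("DELETE FROM pending_fence WHERE host = ?", (host,))
    db.commit()
    return hosts
```

Because the marker is written before the stop and removed only after the start, a restart that was cut short leaves a record that the next engine instance can find, whether or not the engine is hosted.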

Comment 6 sefi litmanovich 2014-10-14 14:23:44 UTC
This is clear, but as I understood from my correspondence with emesika, this bz should be verified using hosted-engine, since the most likely use case is one where a host running a hosted engine is fenced (e.g. due to a network problem), the engine service is stopped as well, and it starts again on a second host. This is where I hit the problem described above.

If you want this bz to be verified based on a different scenario, please let me know so I'll do that and open a different bz for the hosted engine problem.

Comment 7 Sandro Bonazzola 2014-10-17 12:24:33 UTC
oVirt 3.5 has been released and should include the fix for this issue.

Comment 8 Eli Mesika 2014-10-22 08:30:16 UTC
The bug is relevant mainly to hosted engine, so it should be reopened if it fails QA.