Bug 1005756 - stopping the engine service while fencing is in progress might result in powered-off hosts
Status: CLOSED CURRENTRELEASE
Product: oVirt
Classification: Community
Component: ovirt-engine-core
Version: 3.3
Hardware: Unspecified, OS: Unspecified
Priority: unspecified, Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assigned To: Eli Mesika
QA Contact: sefi litmanovich
Whiteboard: infra
Depends On:
Blocks:
Reported: 2013-09-09 06:43 EDT by Oved Ourfali
Modified: 2016-02-10 14:35 EST

See Also:
Fixed In Version: ovirt-engine-3.5.0_beta
Doc Type: Bug Fix
Last Closed: 2014-10-17 08:24:33 EDT
Type: Bug
oVirt Team: Infra
External Trackers
Tracker: oVirt gerrit
ID: 28305
Priority: master
Status: MERGED
Summary: core: start PM enabled hosts after engine restart
Last Updated: Never

Description Oved Ourfali 2013-09-09 06:43:36 EDT
Description of problem:

If the engine service is stopped while a host is being fenced, the host might remain powered off until it is powered up manually.
Comment 1 Eli Mesika 2014-04-02 16:21:41 EDT
(In reply to Oved Ourfali from comment #0)
> Description of problem:
> 
> If during host fencing the engine service is stopped, the host might remain
> in powered-off status until manually powering it up.

Relevant mainly to hosted engine.
Comment 2 sefi litmanovich 2014-09-18 11:54:03 EDT
Tried to verify according to the following scenario:

1. Installed hosted engine on HOST A with hosted-engine --deploy.
2. Added a second host, HOST B, also with hosted-engine --deploy.
3. On the VM running the engine, enabled PMHealthCheck=True and PMHealthCheckIntervalInSec=60 using engine-config.
4. Restarted the engine.
5. Configured HOST A (which is the master and runs the VM) with working PM agent credentials and tested them to validate.
6. Took the service network down on HOST A.

result:

After some time, HOST B became the master and the VM was restarted on it according to hosted-engine HA. After the engine started again, HOST A was identified as non-responsive and the audit log mentioned the grace period until the next fence action (120 sec). This recurs every 120 seconds, and HOST A is not fenced at any point.
Comment 3 Eli Mesika 2014-09-20 15:36:59 EDT
Please consult with me, I don't think you are testing this feature correctly.

This feature is meant to resolve the case in which the engine was restarted EXACTLY after the host was stopped and BEFORE it was started again. That means the host status in vds_dynamic at that point should be 'reboot', so when the engine is restarted it will try to start this host after a configurable quiet time (5 min).

From your explanation above, I think you missed that point.
Please recheck, or provide a vds_dynamic capture for Host A from before the engine was restarted.
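
To make the condition described above concrete, here is a minimal sketch in Python. It is not the engine's actual code: the HostRecord type, the status strings and the function name are hypothetical, and the 300-second default simply mirrors the "configurable quiet time (5 min)" mentioned in this comment. It only shows which hosts would qualify for a power-on attempt after an engine restart.

```python
from dataclasses import dataclass

# Assumption: taken from the "configurable quiet time (5 min)" in the comment.
QUIET_TIME_SEC = 5 * 60


@dataclass
class HostRecord:
    """Hypothetical stand-in for a row of vds_dynamic plus its PM setting."""
    name: str
    status: str       # illustrative values: "up", "reboot", "non_responsive"
    pm_enabled: bool


def hosts_to_start_after_restart(hosts, seconds_since_engine_start,
                                 quiet_time_sec=QUIET_TIME_SEC):
    """Return the names of hosts the engine should try to power on.

    Nothing is selected while the quiet time is still running. Afterwards,
    only PM-enabled hosts whose stored status is 'reboot' are selected,
    i.e. hosts that were stopped by fencing but never started again.
    """
    if seconds_since_engine_start < quiet_time_sec:
        return []
    return [h.name for h in hosts if h.status == "reboot" and h.pm_enabled]


if __name__ == "__main__":
    hosts = [
        HostRecord("host-a", "reboot", True),
        HostRecord("host-b", "up", True),
        HostRecord("host-c", "reboot", False),
    ]
    print(hosts_to_start_after_restart(hosts, 60))    # []
    print(hosts_to_start_after_restart(hosts, 300))   # ['host-a']
```

A host that was never stopped by the fence operation would not be in 'reboot' status and so would not be picked up by this check.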
Comment 4 sefi litmanovich 2014-10-02 03:42:49 EDT
Attaching our conclusions from our email correspondence:

It seems that when we have two hosts supporting hosted-engine, Host A (with PM configured) and Host B, and we block communication to Host A:

* Engine is restarted on Host B BEFORE the non-responding treatment event starts.
* Engine runs on Host B.
* Engine on Host B ignores Host A's non-responding state because it is within the first 5 minutes after engine startup.

Result: Host A is never restarted although it has PM configured.

Please advise how this should be resolved.
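
The scenario above can be sketched as a small decision function. This is an illustration only, assuming hypothetical status strings and action names; it is not the engine's implementation. It models the two points made in this thread: the startup recovery applies only to PM-enabled hosts recorded as 'reboot', and non-responding treatment is skipped during the first 5 minutes after engine startup.

```python
# Assumption: 5-minute startup quiet time, as stated in the comments.
QUIET_TIME_SEC = 5 * 60


def startup_action(status, pm_enabled, seconds_since_engine_start,
                   quiet_time_sec=QUIET_TIME_SEC):
    """Return an illustrative action name for one host after engine startup.

    All status and action strings here are hypothetical labels.
    """
    in_quiet_time = seconds_since_engine_start < quiet_time_sec
    if status == "reboot" and pm_enabled:
        # Host was stopped by fencing before the restart: start it once
        # the quiet time has passed.
        return "wait" if in_quiet_time else "start_host"
    if status == "non_responsive":
        # Fencing never began before the restart, so the host is not
        # recorded as 'reboot'; treatment is skipped during quiet time.
        return "ignored_in_quiet_time" if in_quiet_time else "non_responding_treatment"
    return "none"


if __name__ == "__main__":
    # Host A in the scenario: communication blocked, engine restarted on
    # Host B before non-responding treatment began.
    print(startup_action("non_responsive", True, 60))   # ignored_in_quiet_time
    # The case the fix targets: host already stopped by fencing.
    print(startup_action("reboot", True, 60))           # wait
    print(startup_action("reboot", True, 300))          # start_host
```

In this model Host A never enters the 'reboot' branch, which matches the observation that the startup recovery does not power it on.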
Comment 5 Doron Fediuck 2014-10-05 04:34:38 EDT
As replied elsewhere, the fencing procedures should take other
scenarios into account, including engine crash. If we have an
indication in the DB you can use it to fence post quiet time.
Thus even if an un-hosted engine crashes you can resume such actions.
Comment 6 sefi litmanovich 2014-10-14 10:23:44 EDT
This is clear, but as I understood from my correspondence with emesika, this bz should be verified using hosted-engine. The most likely use case is a scenario where a host running a hosted engine is fenced (e.g. due to a network problem), the engine service is stopped as well, and the engine starts again on a second host. This is where I get the problem mentioned above.

If you want this bz to be verified based on a different scenario, please let me know so I'll do that and open a different bz for the hosted engine problem.
Comment 7 Sandro Bonazzola 2014-10-17 08:24:33 EDT
oVirt 3.5 has been released and should include the fix for this issue.
Comment 8 Eli Mesika 2014-10-22 04:30:16 EDT
The bug is relevant mainly to hosted engine, so it should be reopened if it fails QA.
