Red Hat Bugzilla – Bug 1005756
stopping the engine service while fencing is in progress might result in powered-off hosts
Last modified: 2016-02-10 14:35:16 EST
Description of problem:
If during host fencing the engine service is stopped, the host might remain in powered-off status until manually powering it up.
(In reply to Oved Ourfali from comment #0)
> Description of problem:
> If during host fencing the engine service is stopped, the host might remain
> in powered-off status until manually powering it up.
Relevant mainly to hosted engine.
Tried to verify according to the following scenario:
1. Installed hosted engine on HOST A with hosted-engine --deploy.
2. Added a second host, HOST B, also with hosted-engine --deploy.
3. On the VM running the engine, enabled PMHealthCheck=True and PMHealthCheckIntervalInSec=60 using engine-config.
4. Restarted the engine.
5. Configured HOST A (which is master and running the VM) with working PM agent credentials and tested them to validate.
6. Brought the service network down on HOST A.
After some time, HOST B became the master and the VM was restarted on it by hosted-engine HA. After the engine started again, HOST A was identified as non-responsive and the audit log mentioned the grace period until the next fence action (120 sec). This recurs every 120 seconds, and HOST A is never fenced.
Please consult with me, I don't think you are testing this feature correctly.
This feature claims to resolve the issue in the case where the engine was restarted EXACTLY after the host was stopped and BEFORE it was started again. That means the host status in vds_dynamic at that point should be 'reboot', so when the engine is restarted it will try to start this host after a configurable quiet time (5 min).
From your explanation above, I think you missed that point.
Please recheck, or provide a vds_dynamic capture for Host A from before the engine was restarted.
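For reference, the recovery behavior described in this comment can be sketched roughly as below. This is an illustration only, not the engine's actual code: the function name hosts_to_start_after_restart, the dict-based host records, and the fixed 5-minute default are assumptions made for the example. Only the 'reboot' status in vds_dynamic and the configurable quiet time come from the comment itself.

```python
from datetime import datetime, timedelta

# Assumed default; the comment above says the quiet time is configurable (5 min).
QUIET_TIME = timedelta(minutes=5)


def hosts_to_start_after_restart(hosts, engine_start, now, quiet_time=QUIET_TIME):
    """Return names of hosts the engine should try to power on after a restart.

    `hosts` is a list of dicts with hypothetical keys 'name' and 'status',
    standing in for rows of vds_dynamic.
    """
    if now - engine_start < quiet_time:
        return []  # still inside the quiet time: take no action yet
    # A host left in 'reboot' was stopped by fencing but never started again.
    return [h["name"] for h in hosts if h["status"] == "reboot"]


hosts = [
    {"name": "host_a", "status": "reboot"},
    {"name": "host_b", "status": "up"},
]
start = datetime(2014, 1, 1, 12, 0, 0)
print(hosts_to_start_after_restart(hosts, start, start + timedelta(minutes=1)))
print(hosts_to_start_after_restart(hosts, start, start + timedelta(minutes=6)))
```

In this sketch a host is only picked up if its stored status is 'reboot' when the engine comes back, which is why the vds_dynamic capture requested above matters for verification.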
Attaching our conclusions from our email correspondence:
It seems that when we have two hosts supporting hosted-engine, Host A (with PM configured) and Host B, and we block communication to Host A:
* The engine is restarted on Host B BEFORE the non-responding treatment event starts.
* The engine runs on Host B.
* The engine on Host B ignores the non-responding Host A, since it is within 5 min of engine startup.
Result: Host A is never restarted, although it has a configured PM.
Please advise how this should be resolved.
As replied elsewhere, the fencing procedures should take other
scenarios into account, including engine crash. If we have an
indication in the DB you can use it to fence post quiet time.
Thus even if an un-hosted engine crashes you can resume such actions.
This is clear, but as I understood from my correspondence with emesika, this bz should be verified using hosted-engine, since the most likely use case is a scenario where a host running a hosted engine is fenced due to, e.g., a network problem, and the engine service is stopped as well and starts again on a second host. This is where I get the above-mentioned problem.
If you want this bz to be verified based on a different scenario, please let me know, and I'll do that and open a different bz for the hosted engine problem.
oVirt 3.5 has been released and should include the fix for this issue.
The bug is relevant mainly to hosted engine, so it should be reopened if it fails QA.