Description of problem:

After engine startup there is an interval during which fencing is disabled. It is controlled by the DisableFenceAtStartupInSec option, which defaults to 5 minutes. It can be changed using engine-config -s DisableFenceAtStartupInSec, but please do so with caution.

Why do we have such a timeout? It prevents a fencing storm, which could happen during power issues in the whole DC: when both the engine and the hosts are started at the same time, large hosts may take a long time to come up and for VDSM to start communicating with the engine. The engine is usually started first, and without this interval it would start fencing hosts that are still booting.

Another thing: if we cannot properly fence a host, we cannot determine whether there is merely a communication issue between the engine and the host, so we cannot restart its HA VMs on another host. The only thing we can do is offer the manual "Mark host as rebooted" option to the administrator. If the administrator executes this option, we try to restart the HA VMs on a different host ASAP, because the admin has taken responsibility for verifying that the VMs are really not running.

When the engine is started, the following fencing-related actions are taken:
1. Get the status of all hosts from the DB and schedule Non Responsive Treatment after the DisableFenceAtStartupInSec timeout has passed.
2. Try to communicate with all hosts and refresh their status.

If some host becomes Non Responsive during the DisableFenceAtStartupInSec interval, we skip fencing, and the administrator will see a message in the Events tab that the host is Non Responsive but fencing is disabled due to the startup interval. The administrator therefore has to take care of such a host manually.

Now what happened in your case:
1. The hosted engine VM is running on host1 together with other VMs.
2. The status of host1 and host2 is Up.
3. You kill/shut down host1 -> the hosted engine VM is also shut down -> no engine is running to detect the issue with host1 and change its status to Non Responsive.
4. In the meantime the hosted engine VM is started on host2 -> it reads host statuses from the DB, but all hosts are Up -> it tries to communicate with host1, but it is unreachable -> so it changes host1's status to Non Responsive and starts Non Responsive Treatment for host1 -> Non Responsive Treatment is aborted because the engine is still within the DisableFenceAtStartupInSec interval.

Version-Release number of selected component (if applicable):
oVirt 3.5.3

How reproducible:
100%

Steps to Reproduce:
1. Install hosted engine into a 2-host environment (host1 and host2)
2. Run an HA VM and the hosted engine VM on host1
3. Crash host1

Actual results:
The hosted engine VM is restarted on host2, but the HA VMs from host1 are not, because host1 is not fenced.

Expected results:
The hosted engine VM and the HA VMs are restarted on host2.

Additional info:
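The startup-interval decision described above can be sketched as follows. This is an illustrative Python model, not the actual engine code (the engine is Java, and the class and method names here are hypothetical); it only captures the behaviour described in this report: within DisableFenceAtStartupInSec a Non Responsive host is not fenced, and a host without power management can only be handled via the manual confirm option.

```python
import time

DISABLE_FENCE_AT_STARTUP_IN_SEC = 300  # default: 5 minutes

class EngineModel:
    """Hypothetical model of the fencing decision after engine startup."""

    def __init__(self, startup_time=None):
        # Remember when the engine came up; the grace period is measured from here.
        self.startup_time = time.time() if startup_time is None else startup_time

    def in_startup_interval(self, now=None):
        now = time.time() if now is None else now
        return (now - self.startup_time) < DISABLE_FENCE_AT_STARTUP_IN_SEC

    def on_host_non_responsive(self, host, now=None):
        # Within the startup interval fencing is skipped and only an
        # event is logged; the admin must handle the host manually.
        if self.in_startup_interval(now):
            return "skip fencing: startup interval, handle host manually"
        # Without power management the engine cannot fence at all and
        # offers the manual "Confirm Host Has Been Rebooted" option.
        if not host.get("has_power_management"):
            return "manual: confirm host has been rebooted"
        return "fence host and restart its HA VMs on another host"

engine = EngineModel(startup_time=0)
host1 = {"name": "host1", "has_power_management": False}

print(engine.on_host_non_responsive(host1, now=60))   # still inside the interval
print(engine.on_host_non_responsive(host1, now=400))  # interval over, but no PM
```

This models why the report's scenario fails: the freshly restarted engine on host2 is always inside its own startup interval when it first notices that host1 is down, so Non Responsive Treatment is aborted.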
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED state, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
*** Bug 1303897 has been marked as a duplicate of this bug. ***
We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to verify it until today :-(
(In reply to Martin Perina from comment #9) > We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to > verify it until today :-( We will probably have a build next week if this is indeed urgent.
So changing back to 3.6.3
Not yet in 3.6.3 branch
Verified on rhevm-backend-3.6.3.2-0.1.el6.noarch

Scenario 1:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host running the HE VM
3) Power off that host (host does not have PM)
4) Wait until the HE VM starts on the second host; the first host drops to Non Responsive state and the HA VMs to Unknown state
5) Confirm host reboot on the first host
6) HA VMs start on the second host
PASS

Scenario 2:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host running the HE VM
3) Power off that host (host has PM)
4) I see that the engine tries to fence the host, but it fails:

2016-02-25 15:33:34,353 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-34) [] Failed to run Fence script on vds 'hosted_engine_2'.
2016-02-25 15:33:34,398 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine_2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

I will verify this one, but I opened https://bugzilla.redhat.com/show_bug.cgi?id=1312039 for the connect-to-PM issue.