Description of problem:

Based on a recent discussion on the mailing list ("[Users] two node ovirt cluster with HA"), an RFE to reduce the strictness of power fencing for VM high availability has been proposed. I see two scenarios:

* Power management reports that the host is OFF:
1) The non-responsive treatment should be modified to check the host status via the PM agent.
2) If the host is off, HA VMs will attempt to run on another host as soon as possible.
3) The host status should be set to DOWN.
4) No attempt will be made to restart vdsm (soft fencing) or reboot the host (hard fencing).

My second scenario is quite risky and probably not ideal:

* The power management device and the host both become unreachable, while the VM disks are on shared storage that is still accessible by the engine:
1) If the above criteria are met and the disk images the VM uses have not been read or written within some timeout X, we can assume the VM is no longer running.
2) We could then release the old sanlock lease and bring the VMs up on a new host.

I believe this second scenario is very risky and applies only in rare cases. An example would be a network split; on top of gluster-backed storage this would lead to very bad split-brain data corruption.
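The first scenario's decision flow can be sketched in Python. This is only an illustration of the proposed behavior, not engine code; `pm_agent`, `vm_scheduler`, and their methods are hypothetical names, and the PM agent is assumed to answer "on" or "off".

```python
from enum import Enum

class HostStatus(Enum):
    UP = "up"
    NON_RESPONSIVE = "non_responsive"
    DOWN = "down"

def handle_non_responsive_host(pm_agent, host, vm_scheduler):
    """Sketch of scenario 1: consult the PM agent before any fencing.

    pm_agent.power_status(host) is assumed to return "on" or "off".
    """
    if pm_agent.power_status(host) == "off":
        # Host is confirmed powered off: skip soft fencing (vdsm
        # restart) and hard fencing (reboot) entirely.
        host.status = HostStatus.DOWN
        # Restart HA VMs on other hosts as soon as possible.
        for vm in host.vms:
            if vm.highly_available:
                vm_scheduler.restart_elsewhere(vm)
        return "host_down_ha_vms_restarted"
    # Host reports "on" (or status is unknown): fall back to the normal
    # non-responsive treatment (soft fencing, then hard fencing).
    return "proceed_with_fencing"
```

The key point of the sketch is that a confirmed "off" answer from the PM agent short-circuits the fencing escalation and unblocks HA VM restart immediately.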
This is a bug and not an RFE, so setting it properly.
Verified with rhevm-3.5.0-0.10.master.el6ev.noarch.

Test flow:
1) Have 2 hosts (one with PM configured) in a cluster, connected to an NFS storage domain - all up.
2) Create an HA VM running on host_with_pm.
3) Connect to host_with_pm (via ssh or the fence agent) and shut it down.
4) Host state moves to connecting -> not_responsive_down.
5) Verified in engine.log that there was no SshSoftFencing task.
6) After the grace period the host is rebooted (the fence STATUS action returns an "off" status, then a START action is issued).
7) The HA VM migrates successfully to the second host.
8) host_with_pm moves to non-responsive and eventually to the up state.
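Step 5 above checks engine.log for the absence of an SSH soft-fencing task. A minimal sketch of that check, assuming the log is read as plain text lines and matching on the "SshSoftFencing" substring mentioned in the flow (the exact wording in a real engine.log may differ):

```python
def soft_fencing_attempted(log_lines):
    """Return True if any engine.log line records an SSH soft-fencing task.

    Matches on the "SshSoftFencing" marker from the verification flow;
    real engine.log messages may phrase this differently.
    """
    return any("SshSoftFencing" in line for line in log_lines)

# Hypothetical log excerpt: a host confirmed off goes straight to the
# fence STATUS/START actions, so no soft-fencing entry should appear.
sample_log = [
    "INFO  Host host_with_pm is not responding.",
    "INFO  Executing power management status on host host_with_pm.",
    "INFO  Power management start was executed on host host_with_pm.",
]
print(soft_fencing_attempted(sample_log))  # → False
```

In practice one would stream the real log file into this check rather than a hard-coded list.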
oVirt 3.5 has been released and should include the fix for this issue.