Bug 1058737

Summary: Restart HA VMs when power management reports host is in powered off state
Product: [Retired] oVirt Reporter: Andrew Lau <andrew>
Component: ovirt-engine-core Assignee: Eli Mesika <emesika>
Status: CLOSED CURRENTRELEASE QA Contact: sefi litmanovich <slitmano>
Severity: high Docs Contact:
Priority: unspecified    
Version: unspecified CC: emesika, gklein, iheim, josh, oourfali, rbalakri, s.kieske, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: ovirt-3.5.0-alpha2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-17 12:20:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Andrew Lau 2014-01-28 13:09:15 UTC
Description of problem:
Based on recent discussion on the mailing list:

[Users] two node ovirt cluster with HA

An RFE has been proposed to reduce the strictness of power fencing for VM high availability.

I see two scenarios:

* Power management reports that the host is powered off; in that case (a rough sketch follows this list):
1) The non-responsive treatment should be modified to check the host's power status via the PM agent
2) If the host is off, HA VMs will be restarted on another host ASAP
3) The host status should be set to DOWN
4) No attempt will be made to restart vdsm (soft fencing) or reboot the host (hard fencing)
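A rough, self-contained sketch of how this treatment could look (Python, purely illustrative; Host, PMAgent and the helper names are hypothetical stand-ins, not actual ovirt-engine-core classes):

# Hedged sketch of scenario 1: if the PM agent already reports the host as
# powered off, skip soft/hard fencing, mark the host Down and restart its
# HA VMs elsewhere.  All names here are illustrative, not real engine APIs.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vm:
    name: str
    highly_available: bool


@dataclass
class Host:
    name: str
    status: str = "NonResponsive"
    vms: List[Vm] = field(default_factory=list)


class PMAgent:
    """Stand-in for a fence agent; status() mimics the fence STATUS action."""
    def __init__(self, power_state: str):
        self.power_state = power_state

    def status(self, host: Host) -> str:
        return self.power_state  # "on" or "off"


def handle_non_responsive_host(host: Host, pm: PMAgent) -> None:
    if pm.status(host) == "off":
        # Host is confirmed powered off: mark it Down, skip SSH soft
        # fencing and the hard restart, and rerun HA VMs on another host.
        host.status = "Down"
        for vm in host.vms:
            if vm.highly_available:
                print(f"restarting HA VM {vm.name} on another host")
    else:
        # Otherwise fall back to the usual non-responsive treatment
        # (soft fencing, then hard fencing).
        print(f"{host.name} not confirmed off: run standard fencing flow")


if __name__ == "__main__":
    host = Host("host_with_pm", vms=[Vm("ha_vm", True), Vm("plain_vm", False)])
    handle_non_responsive_host(host, PMAgent("off"))
    print(host.status)  # -> Down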

My second scenario is quite risky and probably not ideal:

* The power management device and the host both become unreachable, but the VM's disks live on shared storage that the engine can still access (sketched below):
1) If the above criteria are met and the disk images the VM uses have not been accessed or written to within some timeout X, we can assume that the VM is no longer running.
2) We could then release the old sanlock lease and bring the VMs up on a new host?
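A minimal sketch of the idle-image check, assuming the engine can stat the disk image files on the shared storage directly; the path handling and the timeout value are illustrative only:

# Hedged sketch of scenario 2: treat a VM as "probably not running" if none
# of its disk images on the still-reachable shared storage have been
# modified within a timeout.  Paths and the timeout are illustrative.
import os
import time

IDLE_TIMEOUT_SECONDS = 300  # the "X timeout" from the description


def images_idle(image_paths, now=None) -> bool:
    """True if no disk image was modified within IDLE_TIMEOUT_SECONDS."""
    now = now if now is not None else time.time()
    return all(now - os.path.getmtime(path) > IDLE_TIMEOUT_SECONDS
               for path in image_paths)


def may_recover_elsewhere(image_paths) -> bool:
    if images_idle(image_paths):
        # Only at this point would the old sanlock lease be force-released
        # and the VM started on another host -- exactly the step that is
        # unsafe under a network split (see the risk note below).
        return True
    return False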

I believe this second scenario is very risky and only applies in rare cases. An example would be a network split; on top of Gluster-backed storage, however, this could lead to severe split-brain data corruption.

Comment 1 Oved Ourfali 2014-06-29 08:55:06 UTC
This is a bug and not an RFE, so setting it properly.

Comment 2 sefi litmanovich 2014-09-10 11:55:08 UTC
Verified with rhevm-3.5.0-0.10.master.el6ev.noarch.

test flow:

1) have 2 hosts in a cluster (1 with PM configured), connected to an NFS storage domain - all up.
2) create an HA VM running on host_with_pm.
3) connect to host_with_pm (via ssh or the fence agent) and shut it down.
4) the host state moves to connecting -> non responsive -> down.
5) verified in engine.log that there was no SshSoftFencing task (a quick check for this is sketched after this list).
6) after the grace period the host is rebooted (the fence STATUS action returns an off status, then a START action is issued).
7) the HA VM migrates successfully to the second host.
8) host_with_pm moves to non-responsive and eventually to the up state.
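For step 5, a small helper along these lines can confirm the absence of SSH soft fencing in the log (the log path is the usual oVirt default and may differ on other setups):

# Scan engine.log for SSH soft-fencing activity (step 5 above).
# /var/log/ovirt-engine/engine.log is the usual default; adjust if needed.
ENGINE_LOG = "/var/log/ovirt-engine/engine.log"


def soft_fencing_entries(log_path: str = ENGINE_LOG):
    with open(log_path) as log:
        return [line.rstrip() for line in log if "SshSoftFencing" in line]


if __name__ == "__main__":
    hits = soft_fencing_entries()
    if hits:
        print("SSH soft fencing was attempted:")
        for line in hits:
            print(" ", line)
    else:
        print("no SshSoftFencing entries found (expected with this fix)")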

Comment 3 Sandro Bonazzola 2014-10-17 12:20:39 UTC
oVirt 3.5 has been released and should include the fix for this issue.