Bug 1058737 - Restart HA VMs when power management reports host is in powered off state
Summary: Restart HA VMs when power management reports host is in powered off state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Eli Mesika
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2014-01-28 13:09 UTC by Andrew Lau
Modified: 2016-02-10 19:31 UTC
CC List: 8 users

Fixed In Version: ovirt-3.5.0-alpha2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-17 12:20:39 UTC
oVirt Team: Infra
Embargoed:


Attachments


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 27985 | 0 | master | MERGED | core: [RFE] Restart HA VMs ASAP | Never

Description Andrew Lau 2014-01-28 13:09:15 UTC
Description of problem:
Based on recent discussion on the mailing list:

[Users] two node ovirt cluster with HA

An [RFE] has been proposed to reduce the strictness of power fencing in relation to VM high availability.

I see two scenarios:

* Power Management reports that the host is OFF; then (a sketch of this flow follows the list):
1) The non-responsive treatment should be modified to check the host's power status via the PM agent.
2) If the host is off, HA VMs will attempt to run on another host ASAP.
3) The host status should be set to DOWN.
4) No attempt will be made to restart vdsm (soft fencing) or reboot the host (hard fencing).
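
To make the first scenario concrete, here is a minimal Java sketch of the proposed non-responsive treatment. It is only an illustration under assumed types: FenceAgent, HaVmScheduler, Host and their methods are hypothetical stand-ins, not the actual ovirt-engine-core classes.

import java.util.List;

/** Hypothetical power-status values reported by a fence agent. */
enum PowerStatus { ON, OFF, UNKNOWN }

/** Hypothetical host states used by this sketch. */
enum HostStatus { UP, NON_RESPONSIVE, DOWN }

/** Minimal stand-ins for the engine objects involved in the flow. */
interface FenceAgent {
    PowerStatus getPowerStatus(String hostId);   // "STATUS" action of the PM agent
}

interface HaVmScheduler {
    List<String> listHaVmsOn(String hostId);     // HA VMs last seen on the host
    void restartOnAnotherHost(String vmId);      // schedule the VM elsewhere
}

class Host {
    final String id;
    HostStatus status = HostStatus.UP;
    Host(String id) { this.id = id; }
}

/** Sketch of the non-responsive treatment proposed above. */
class NonResponsiveTreatment {

    private final FenceAgent fenceAgent;
    private final HaVmScheduler scheduler;

    NonResponsiveTreatment(FenceAgent fenceAgent, HaVmScheduler scheduler) {
        this.fenceAgent = fenceAgent;
        this.scheduler = scheduler;
    }

    void onHostNonResponsive(Host host) {
        // 1) Ask the PM agent for the real power state of the host.
        PowerStatus power = fenceAgent.getPowerStatus(host.id);

        if (power == PowerStatus.OFF) {
            // 3) The host is known to be powered off: mark it Down ...
            host.status = HostStatus.DOWN;
            // 2) ... and restart its HA VMs on another host as soon as possible.
            for (String vmId : scheduler.listHaVmsOn(host.id)) {
                scheduler.restartOnAnotherHost(vmId);
            }
            // 4) Skip soft fencing (vdsm restart) and hard fencing (host reboot).
            return;
        }

        // Power state is ON or UNKNOWN: fall back to the usual
        // soft-fencing / hard-fencing escalation (not shown here).
        host.status = HostStatus.NON_RESPONSIVE;
    }
}

The key point is the early return when the agent reports OFF: HA VMs are rescheduled immediately and the soft/hard fencing escalation is skipped entirely.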

My second scenario is quite risky and probably not ideal:

* If the Power Management device and the host both become unreachable, and the VM disks are on shared storage that is still accessible by the engine:
1) If the above criteria are met and the disk images the VM uses have not been accessed or written to within a timeout of X, we can assume that the VM is no longer running (a rough sketch of this check appears below).
2) We could then remove the old sanlock lease and bring the VMs up on a new host?

I believe this second scenario is very risky and would apply only in rare cases. An example would be a network split; on top of Gluster-backed storage, this could lead to very bad split-brain data corruption.
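
For illustration only, here is a minimal Java sketch of the timeout check from this second scenario, assuming the engine could inspect the image files' modification times directly. The paths, the X timeout value, and the class name are made up, and removing the sanlock lease is not shown.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/**
 * Illustrative check for the second scenario: if none of a VM's disk
 * images has been written to within the given timeout, assume the VM
 * is no longer running. Paths and timeout are hypothetical.
 */
class StaleVmCheck {

    static boolean looksDown(List<Path> diskImages, Duration timeout) throws IOException {
        Instant cutoff = Instant.now().minus(timeout);
        for (Path image : diskImages) {
            Instant lastWrite = Files.getLastModifiedTime(image).toInstant();
            if (lastWrite.isAfter(cutoff)) {
                // The image was written recently, so the VM may still be alive.
                return false;
            }
        }
        // No image was touched within the timeout window.
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Example invocation: image paths would normally come from the VM's disk list.
        List<Path> images = args.length > 0 ? List.of(Path.of(args[0])) : List.of();
        System.out.println("VM looks down: " + looksDown(images, Duration.ofMinutes(5)));
    }
}

Note that this check alone says nothing about the network-split case described above, which is exactly where it would give a false "down" answer.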

Comment 1 Oved Ourfali 2014-06-29 08:55:06 UTC
This is a bug and not an RFE, so setting it accordingly.

Comment 2 sefi litmanovich 2014-09-10 11:55:08 UTC
Verified with rhevm-3.5.0-0.10.master.el6ev.noarch.

Test flow:

1) Have 2 hosts in a cluster (one with power management configured), connected to an NFS storage domain; all up.
2) Create an HA VM running on host_with_pm.
3) Connect to host_with_pm (via SSH or the fence agent) and shut it down.
4) The host state moves to Connecting -> Non Responsive -> Down.
5) Verified in engine.log that there was no SshSoftFencing task (a sketch of this check follows the list).
6) After the grace period the host is rebooted (the fence STATUS action reports "off", then a START action is issued).
7) The HA VM is successfully restarted on the second host.
8) host_with_pm moves to Non Responsive and eventually to Up.
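
As a small aid for step 5, here is a Java sketch that scans engine.log for soft-fencing activity. Only the "SshSoftFencing" string comes from the comment above; the default log path and everything else are assumptions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Checks whether engine.log contains any soft-fencing activity (test-flow step 5). */
class SoftFencingLogCheck {
    public static void main(String[] args) throws IOException {
        // Assumed default engine log location; pass a different path as an argument if needed.
        Path log = Path.of(args.length > 0 ? args[0] : "/var/log/ovirt-engine/engine.log");
        try (Stream<String> lines = Files.lines(log)) {
            boolean softFencingSeen = lines.anyMatch(line -> line.contains("SshSoftFencing"));
            System.out.println(softFencingSeen
                    ? "SshSoftFencing task found - host was soft-fenced"
                    : "No SshSoftFencing task found - soft fencing was skipped as expected");
        }
    }
}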

Comment 3 Sandro Bonazzola 2014-10-17 12:20:39 UTC
oVirt 3.5 has been released and should include the fix for this issue.

