Bug 1058737 - Restart HA VMs when power management reports host is in powered off state
Summary: Restart HA VMs when power management reports host is in powered off state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Eli Mesika
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2014-01-28 13:09 UTC by Andrew Lau
Modified: 2016-02-10 19:31 UTC
CC List: 8 users

Fixed In Version: ovirt-3.5.0-alpha2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-17 12:20:39 UTC
oVirt Team: Infra
Embargoed:


Attachments


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 27985 | 0 | master | MERGED | core: [RFE] Restart HA VMs ASAP | Never

Description Andrew Lau 2014-01-28 13:09:15 UTC
Description of problem:
Based on recent discussion on the mailing list:

[Users] two node ovirt cluster with HA

An [RFE] has been proposed to reduce the strictness of power fencing in relation to VM high availability.

I see two scenarios:

* Power Management reports that the host is OFF; then (a sketch of this flow follows the list):
1) The non-responsive treatment should be modified to check the host's power status via the PM agent.
2) If the host is off, HA VMs will attempt to run on another host ASAP.
3) The host status should be set to DOWN.
4) No attempt will be made to restart vdsm (soft fencing) or reboot the host (hard fencing).
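
To make the first scenario concrete, here is a minimal Java sketch of the proposed non-responsive treatment. It is only an illustration under assumed types: FenceAgent, HaVmScheduler, Host and their methods are hypothetical stand-ins, not the actual ovirt-engine-core classes.

import java.util.List;

/** Hypothetical power-status values reported by a fence agent. */
enum PowerStatus { ON, OFF, UNKNOWN }

/** Hypothetical host states used by this sketch. */
enum HostStatus { UP, NON_RESPONSIVE, DOWN }

/** Minimal stand-ins for the engine objects involved in the flow. */
interface FenceAgent {
    PowerStatus getPowerStatus(String hostId);   // "STATUS" action of the PM agent
}

interface HaVmScheduler {
    List<String> listHaVmsOn(String hostId);     // HA VMs last seen on the host
    void restartOnAnotherHost(String vmId);      // schedule the VM elsewhere
}

class Host {
    final String id;
    HostStatus status = HostStatus.UP;
    Host(String id) { this.id = id; }
}

/** Sketch of the non-responsive treatment proposed above. */
class NonResponsiveTreatment {

    private final FenceAgent fenceAgent;
    private final HaVmScheduler scheduler;

    NonResponsiveTreatment(FenceAgent fenceAgent, HaVmScheduler scheduler) {
        this.fenceAgent = fenceAgent;
        this.scheduler = scheduler;
    }

    void onHostNonResponsive(Host host) {
        // 1) Ask the PM agent for the real power state of the host.
        PowerStatus power = fenceAgent.getPowerStatus(host.id);

        if (power == PowerStatus.OFF) {
            // 3) The host is known to be powered off: mark it Down ...
            host.status = HostStatus.DOWN;
            // 2) ... and restart its HA VMs on another host as soon as possible.
            for (String vmId : scheduler.listHaVmsOn(host.id)) {
                scheduler.restartOnAnotherHost(vmId);
            }
            // 4) Skip soft fencing (vdsm restart) and hard fencing (host reboot).
            return;
        }

        // Power state is ON or UNKNOWN: fall back to the usual
        // soft-fencing / hard-fencing escalation (not shown here).
        host.status = HostStatus.NON_RESPONSIVE;
    }
}

The key point is the early return when the agent reports OFF: HA VMs are rescheduled immediately and the soft/hard fencing escalation is skipped entirely.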

My second scenario is quite risky and probably not ideal:

* If the Power Management device and the host both become unreachable, and the VM disks are on shared storage that is still accessible by the engine:
1) If the above criteria are met and the disk images the VM uses have not been accessed or written to within a timeout of X, we can assume that the VM is no longer running (a rough sketch of this check appears below).
2) We could then remove the old sanlock lease and bring the VMs up on a new host?

I believe this second scenario is very risky and would apply only in rare cases. An example would be a network split; on top of Gluster-backed storage, this could lead to very bad split-brain data corruption.
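
For illustration only, here is a minimal Java sketch of the timeout check from this second scenario, assuming the engine could inspect the image files' modification times directly. The paths, the X timeout value, and the class name are made up, and removing the sanlock lease is not shown.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/**
 * Illustrative check for the second scenario: if none of a VM's disk
 * images has been written to within the given timeout, assume the VM
 * is no longer running. Paths and timeout are hypothetical.
 */
class StaleVmCheck {

    static boolean looksDown(List<Path> diskImages, Duration timeout) throws IOException {
        Instant cutoff = Instant.now().minus(timeout);
        for (Path image : diskImages) {
            Instant lastWrite = Files.getLastModifiedTime(image).toInstant();
            if (lastWrite.isAfter(cutoff)) {
                // The image was written recently, so the VM may still be alive.
                return false;
            }
        }
        // No image was touched within the timeout window.
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Example invocation: image paths would normally come from the VM's disk list.
        List<Path> images = args.length > 0 ? List.of(Path.of(args[0])) : List.of();
        System.out.println("VM looks down: " + looksDown(images, Duration.ofMinutes(5)));
    }
}

Note that this check alone says nothing about the network-split case described above, which is exactly where it would give a false "down" answer.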

Comment 1 Oved Ourfali 2014-06-29 08:55:06 UTC
This is a bug and not an RFE, so setting it accordingly.

Comment 2 sefi litmanovich 2014-09-10 11:55:08 UTC
Verified with rhevm-3.5.0-0.10.master.el6ev.noarch.

Test flow:

1) Have 2 hosts in a cluster (one with power management configured), connected to an NFS storage domain; all up.
2) Create an HA VM running on host_with_pm.
3) Connect to host_with_pm (via SSH or the fence agent) and shut it down.
4) The host state moves to Connecting -> Non Responsive -> Down.
5) Verified in engine.log that there was no SshSoftFencing task (a sketch of this check follows the list).
6) After the grace period the host is rebooted (the fence STATUS action reports "off", then a START action is issued).
7) The HA VM is successfully restarted on the second host.
8) host_with_pm moves to Non Responsive and eventually to Up.
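
As a small aid for step 5, here is a Java sketch that scans engine.log for soft-fencing activity. Only the "SshSoftFencing" string comes from the comment above; the default log path and everything else are assumptions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Checks whether engine.log contains any soft-fencing activity (test-flow step 5). */
class SoftFencingLogCheck {
    public static void main(String[] args) throws IOException {
        // Assumed default engine log location; pass a different path as an argument if needed.
        Path log = Path.of(args.length > 0 ? args[0] : "/var/log/ovirt-engine/engine.log");
        try (Stream<String> lines = Files.lines(log)) {
            boolean softFencingSeen = lines.anyMatch(line -> line.contains("SshSoftFencing"));
            System.out.println(softFencingSeen
                    ? "SshSoftFencing task found - host was soft-fenced"
                    : "No SshSoftFencing task found - soft fencing was skipped as expected");
        }
    }
}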

Comment 3 Sandro Bonazzola 2014-10-17 12:20:39 UTC
oVirt 3.5 has been released and should include the fix for this issue.

