Bug 1317429

Summary: [RFE] Improve HA failover, so that even when power fencing is not available, automatic HA will work without manual confirmation on host rebooted.
Product: [oVirt] ovirt-engine Reporter: Yaniv Lavi <ylavi>
Component: RFEs Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE QA Contact: Lilach Zitnitski <lzitnits>
Severity: high Docs Contact:
Priority: urgent    
Version: 3.6.0 CC: ahadas, amureini, bgraveno, bhaubeck, bugs, dmoessne, ederevea, gklein, j.bittner, jentrena, kgoldbla, lpeer, lyarwood, meverett, michal.skrivanek, mkalinin, mtessun, nsoffer, pablo.iranzo, pep, ratamir, rbinkhor, scohen, sherold, sputhenp, sraje, teigland, tnisan, ykaul, ylavi
Target Milestone: ovirt-4.1.0-beta Keywords: FutureFeature
Target Release: --- Flags: rule-engine: ovirt-4.1+
rule-engine: exception+
ylavi: priority_rfe_tracking+
gklein: testing_plan_complete+
ylavi: planning_ack+
amureini: devel_ack+
ratamir: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
URL: http://www.ovirt.org/develop/release-management/features/storage/vm-leases/
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
This update adds the ability to acquire a lease per virtual machine on shared storage, without attaching the lease to a disk. This adds the capability to avoid split-brain, and avoid starting a virtual machine on another host if the original host becomes non-responsive, therefore improving virtual machine high availability.
Story Points: ---
Clone Of: 804272 Environment:
Last Closed: 2017-02-15 15:05:22 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1406765, 1410320, 1412230, 1415488    
Bug Blocks: 804272, 1421432    

Description Yaniv Lavi 2016-03-14 09:03:40 UTC
Improve HA failover so that, even when power fencing is not available, automatic HA works without manual confirmation that the host was rebooted. We need to provide a way to restart VMs and move the SPM role to a running server in case power fencing fails.

Power fencing can fail for various reasons:
1. A power outage leaves the iLO/DRAC (or equivalent) unreachable
2. A network outage also makes the power fencing device unreachable
3. Strange system failures that also affect the power fencing device
4. Misconfiguration of, e.g., firewalls

All of these should lead to the VMs running on other hypervisors afterwards, so that they are reachable again. Therefore we need to make sure that the host previously running the VM has no chance of reaching the storage anymore, and as such cannot do any harm to the data.
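
The requirement above (a VM may be restarted elsewhere only once the previous host can no longer touch its storage) can be illustrated with a lease-style guard. The following is a minimal sketch in Python, not the oVirt implementation: the LeaseStore class, the try_failover function and the 80-second lifetime are all hypothetical names and values chosen for illustration, and an in-memory dict stands in for a lease kept on shared storage.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LeaseStore:
    """Toy in-memory stand-in for per-VM leases on shared storage (hypothetical)."""

    ttl: float = 80.0  # assumed lease lifetime in seconds, illustrative only
    # Maps VM name -> (holding host, time of last renewal).
    leases: Dict[str, Tuple[str, float]] = field(default_factory=dict, init=False)

    def acquire(self, vm: str, host: str, now: float) -> bool:
        """Grant the lease unless another host holds an unexpired one."""
        current = self.leases.get(vm)
        if current is not None:
            owner, renewed_at = current
            if owner != host and now - renewed_at < self.ttl:
                return False
        self.leases[vm] = (host, now)
        return True

    def renew(self, vm: str, host: str, now: float) -> bool:
        """Renew only if this host still holds an unexpired lease."""
        current = self.leases.get(vm)
        if current is None or current[0] != host:
            return False
        if now - current[1] >= self.ttl:
            # Lease expired: in this model the host must stop the VM.
            return False
        self.leases[vm] = (host, now)
        return True


def try_failover(store: LeaseStore, vm: str, new_host: str, now: float) -> bool:
    """Start the VM on new_host only if the lease can be taken over."""
    return store.acquire(vm, new_host, now)
```

In this model no fencing device is consulted: a host that loses access to storage cannot renew, so its lease expires and another host may take over, while a host that is merely unreachable by the engine but still renewing keeps the VM.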

Comment 1 Allon Mureinik 2016-03-14 12:16:18 UTC
We need to finalize the design; marking that we haven't completed it yet. Once the design is finalized, we can devel ack/nack accordingly.

Comment 2 Sandro Bonazzola 2016-05-02 10:09:37 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 4 Nir Soffer 2016-11-23 12:25:28 UTC
Here is the storage-side feature page:
http://www.ovirt.org/develop/release-management/features/storage/vm-leases/

On top of this there is the virt-side feature page (in review):
https://github.com/oVirt/ovirt-site/pull/586

Comment 6 Nir Soffer 2016-12-01 17:29:50 UTC
We are not finished yet, moving back to POST.

Comment 7 Yaniv Lavi 2017-01-04 16:28:22 UTC
Arik, can you please open a blocking bug on API for the feature?

Comment 8 Tal Nisan 2017-01-18 11:33:11 UTC
The REST API bug was opened and has already been resolved.

Comment 9 Nir Soffer 2017-01-22 14:56:45 UTC
Added a patch to require the libvirt version that supports VM leases.

Moving back to POST until this patch is merged (should be quick).