Bug 1386444

Summary: [RFE] Introduce HA timeout for VMs in Paused state due to Unreachable Storage
Product: Red Hat Enterprise Virtualization Manager Reporter: Germano Veit Michel <gveitmic>
Component: RFEs Assignee: Rob Young <royoung>
Status: CLOSED DUPLICATE QA Contact:
Severity: medium Docs Contact:
Priority: high    
Version: 4.0.3 CC: aperotti, dfediuck, lsurette, mgoldboi, mkalinin, rbalakri, srevivo, ykaul
Target Milestone: ovirt-4.3.0 Keywords: FutureFeature
Target Release: --- Flags: lsvaty: testing_plan_complete-
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-04 14:25:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1230788    
Bug Blocks: 1417161    

Description Germano Veit Michel 2016-10-19 01:06:08 UTC
Consider the following scenario:
1. VM is running in a Host
2. Host loses Access to the Storage Domain that contains the VM disk
3. VM is paused, goes to Non Operational State (NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)
4. VM will not be automatically migrated from the Host (MIGRATE_PAUSED_EIO_VM_IS_NOT_SUPPORTED)
5. VM is stuck on the host indefinitely, until the host is manually fenced or the storage recovers.

What is suggested:
- An OPTIONAL parameter in the VM HA configuration that specifies the maximum amount of time an HA VM can remain in Paused state due to NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE. This allows the administrator to set a time threshold for how long HA VMs wait for the SD to recover. When this threshold is crossed, the Engine tries to kill the VM; if successful, it gets started again by the already implemented HA mechanism, now on an Operational Host with a reachable storage domain.
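
To make the suggested check concrete, here is a minimal sketch of the timeout decision. It is only an illustration of the idea: every name in it (PausedVm, ha_paused_timeout, should_kill) is hypothetical and not taken from the Engine code, and the real Engine would also have to issue the kill and hand the VM back to the existing HA restart logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# All names here are hypothetical and for illustration only;
# none of them come from the oVirt Engine code base.


@dataclass
class PausedVm:
    name: str
    is_ha: bool
    # When the VM entered Paused state because its storage became unreachable.
    paused_since: datetime
    # The proposed OPTIONAL parameter; None means "not configured".
    ha_paused_timeout: Optional[timedelta] = None


def should_kill(vm: PausedVm, now: datetime) -> bool:
    """Return True once an HA VM has stayed paused for at least its timeout."""
    if not vm.is_ha or vm.ha_paused_timeout is None:
        # Parameter not set (or VM is not HA): keep today's behaviour
        # and leave the VM paused until the storage recovers.
        return False
    return now - vm.paused_since >= vm.ha_paused_timeout
```

With a 5 minute timeout, for example, should_kill would stay False for the first minutes of the outage and turn True once the VM has been paused for 5 minutes. Leaving the parameter unset keeps the current behaviour.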

In the current implementation the Admin needs to manually power off all the VMs and start them somewhere else, or even fence the host manually. Granted, powering off VMs this way might be risky, but in some situations it's the only way forward.

I am not sure what the status of http://www.ovirt.org/develop/release-management/features/storage/sanlock-fencing/ and https://bugzilla.redhat.com/show_bug.cgi?id=1317429 is, but I believe the request provided here is simple enough and does not require any bigger change to the existing code or storage formats, while covering a fairly useful use case. However, I am not sure whether it would be duplicate effort, as theoretically this could also be covered by the mechanisms discussed in that BZ.

Comment 1 Germano Veit Michel 2016-10-19 01:15:59 UTC
(In reply to Germano Veit Michel from comment #0)
> 3. VM is paused, goes to Non Operational State
> (NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)

Should read:

VM is paused, HOST goes to Non Operational State
(NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)

Comment 2 Marina Kalinin 2016-10-31 14:53:01 UTC
This RFE may help here; it is scheduled for 4.1:
https://bugzilla.redhat.com/show_bug.cgi?id=1379771

Comment 3 Doron Fediuck 2017-08-07 10:34:54 UTC
This will be resolved by bug #1230788, where HA VMs can be started in a 'kill' mode, which means that instead of pausing, the VM will die and restart elsewhere.

Comment 4 Yaniv Kaul 2017-12-03 20:56:18 UTC
Doron, why can't we close this as dup?

Comment 5 Doron Fediuck 2017-12-04 14:25:51 UTC

*** This bug has been marked as a duplicate of bug 1230788 ***