Bug 1386444 - [RFE] Introduce HA timeout for VMs in Paused state due to Unreachable Storage
Summary: [RFE] Introduce HA timeout for VMs in Paused state due to Unreachable Storage
Keywords:
Status: CLOSED DUPLICATE of bug 1230788
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: RFEs
Version: 4.0.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Rob Young
QA Contact:
URL:
Whiteboard:
Depends On: rhv_turn_off_autoresume_of_paused_VMs
Blocks: 1417161
Reported: 2016-10-19 01:06 UTC by Germano Veit Michel
Modified: 2021-06-10 11:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-04 14:25:51 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1230788 0 urgent CLOSED [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out ... 2021-09-09 11:40:56 UTC
Red Hat Knowledge Base (Solution) 2749481 0 None None None 2016-11-04 04:07:34 UTC

Internal Links: 1230788

Description Germano Veit Michel 2016-10-19 01:06:08 UTC
Consider the following scenario:
1. VM is running in a Host
2. Host loses Access to the Storage Domain that contains the VM disk
3. VM is paused, goes to Non Operational State (NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)
4. VM will not be automatically migrated from the Host (MIGRATE_PAUSED_EIO_VM_IS_NOT_SUPPORTED)
5. VM is stuck in the host forever, until manually fenced or Storage Recovers.

What is suggested:
- An OPTIONAL parameter in the VM HA configuration that specifies the maximum amount of time an HA VM can remain in Paused state due to NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE. This allows the administrator to specify how long HA VMs can wait for the SD to recover. When this threshold is crossed, the Engine tries to kill the VM; if successful, the VM gets started again by the already implemented HA mechanism, now on an Operational Host with a reachable storage domain.
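
To make the suggested behaviour concrete, here is a minimal sketch of the threshold check in Python. It is only an illustration of the idea, not Engine code: PausedVm, ha_pause_timeout, should_kill and vms_to_kill are hypothetical names invented for this example, and the real Engine data model may look quite different.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical constant mirroring NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE.
STORAGE_DOMAIN_UNREACHABLE = "STORAGE_DOMAIN_UNREACHABLE"


@dataclass
class PausedVm:
    """Hypothetical view of a paused VM; not an oVirt Engine class."""
    name: str
    is_ha: bool
    host_non_operational_reason: str
    paused_since: float                        # seconds on any monotonic clock
    ha_pause_timeout: Optional[float] = None   # None = option not set, wait forever


def should_kill(vm: PausedVm, now: float) -> bool:
    """Return True once the optional HA pause threshold has been crossed."""
    if not vm.is_ha or vm.ha_pause_timeout is None:
        return False  # the parameter is optional; unset keeps today's behaviour
    if vm.host_non_operational_reason != STORAGE_DOMAIN_UNREACHABLE:
        return False  # only the unreachable-storage case is covered
    return now - vm.paused_since > vm.ha_pause_timeout


def vms_to_kill(vms: List[PausedVm], now: float) -> List[str]:
    """Names of VMs the Engine would try to kill so that HA restarts them."""
    return [vm.name for vm in vms if should_kill(vm, now)]
```

In this sketch, leaving the parameter unset keeps the current behaviour (the VM stays paused until storage recovers or the host is fenced), which matches the request that the option be OPTIONAL.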

In the current implementation, the Admin needs to manually power off all the VMs and start them somewhere else, or even fence the host manually. Granted, powering off VMs this way might be risky, but in some situations it's the only way forward.

I am not sure what the status is of http://www.ovirt.org/develop/release-management/features/storage/sanlock-fencing/ and https://bugzilla.redhat.com/show_bug.cgi?id=1317429, but I believe the request provided here is simple enough and does not require any bigger change to the existing code or storage formats, while covering a fairly useful use case. However, I am not sure whether it would be duplicate effort, as theoretically this could be covered by the mechanisms discussed in that BZ as well.

Comment 1 Germano Veit Michel 2016-10-19 01:15:59 UTC
(In reply to Germano Veit Michel from comment #0)
> 3. VM is paused, goes to Non Operational State
> (NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)

Should read:

VM is paused, HOST goes to Non Operational State
(NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)

Comment 2 Marina Kalinin 2016-10-31 14:53:01 UTC
This RFE may help here, scheduled for 4.1:
https://bugzilla.redhat.com/show_bug.cgi?id=1379771

Comment 3 Doron Fediuck 2017-08-07 10:34:54 UTC
This will be resolved by bug #1230788, where HA VMs can be started in a 'kill' mode, which means instead of pausing, it'll die and restart elsewhere.

Comment 4 Yaniv Kaul 2017-12-03 20:56:18 UTC
Doron, why can't we close this as dup?

Comment 5 Doron Fediuck 2017-12-04 14:25:51 UTC

*** This bug has been marked as a duplicate of bug 1230788 ***

