Consider the following scenario:

1. VM is running in a Host
2. Host loses access to the Storage Domain that contains the VM disk
3. VM is paused, goes to Non Operational State (NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)
4. VM will not be automatically migrated from the Host (MIGRATE_PAUSED_EIO_VM_IS_NOT_SUPPORTED)
5. VM is stuck in the host forever, until manually fenced or Storage recovers.

What is suggested:

- An OPTIONAL parameter in VM HA configuration that specifies the maximum amount of time an HA VM can remain in Paused state due to NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE.

This is to allow the administrator to specify a time threshold HA VMs can wait for the SD to recover. When this threshold is crossed, the Engine tries to kill the VM. If successful, the VM gets started again by the already implemented HA mechanism, now on an Operational Host with a reachable storage domain.

In the current implementation, the Admin needs to manually power off all the VMs and start them somewhere else, or even fence the host manually. Granted, powering off VMs this way might be risky, but in some situations it's the only way forward.

I am not sure what the status of http://www.ovirt.org/develop/release-management/features/storage/sanlock-fencing/ and https://bugzilla.redhat.com/show_bug.cgi?id=1317429 is, but I believe the request provided here is simple enough and does not require any bigger change to the existing code or storage formats, while covering a fairly useful use case. However, I am not sure whether it would be duplicate effort, as theoretically this could be covered by the mechanisms discussed in that BZ as well.
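To make the requested behaviour concrete, here is a rough sketch of the threshold check in Python. It is not Engine code: the names `HaVm`, `paused_since`, `max_paused_seconds`, `should_kill` and `vms_to_kill` are all hypothetical and only illustrate the decision "paused longer than the configured limit, so try to kill and let the existing HA mechanism restart it".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HaVm:
    """Hypothetical view of an HA VM; not an actual Engine class."""
    name: str
    # Time (in seconds) at which the VM was paused because the storage
    # domain became unreachable; None if the VM is not paused for that reason.
    paused_since: Optional[float]
    # The OPTIONAL parameter proposed above. None means "not configured",
    # which keeps today's behaviour (the VM stays paused indefinitely).
    max_paused_seconds: Optional[float] = None


def should_kill(vm: HaVm, now: float) -> bool:
    """Return True once the VM has been paused longer than its threshold."""
    if vm.max_paused_seconds is None or vm.paused_since is None:
        return False
    return now - vm.paused_since > vm.max_paused_seconds


def vms_to_kill(vms: List[HaVm], now: float) -> List[str]:
    """Names of the VMs the Engine would try to kill on this check."""
    return [vm.name for vm in vms if should_kill(vm, now)]
```

For example, with a 300 second threshold, a VM paused at t=0 would be selected at t=350, while a VM without the parameter set would never be selected. Restarting the killed VM on another host is assumed to be left to the existing HA mechanism, as described above.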
(In reply to Germano Veit Michel from comment #0)
> 3. VM is paused, goes to Non Operational State
> (NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)

Should read:
VM is paused, HOST goes to Non Operational State (NonOperationalReason.STORAGE_DOMAIN_UNREACHABLE)
This RFE, scheduled for 4.1, may help here: https://bugzilla.redhat.com/show_bug.cgi?id=1379771
This will be resolved by bug #1230788, where HA VMs can be started in a 'kill' mode, which means that instead of pausing, the VM will die and restart elsewhere.
Doron, why can't we close this as dup?
*** This bug has been marked as a duplicate of bug 1230788 ***