Bug 1317450 - (oVirt_turn_off_autoresume_of_paused_VMs) [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time) [NEEDINFO]
Status: POST
Product: ovirt-engine
Classification: oVirt
Component: RFEs
Version: 3.6.0
Hardware: All  OS: Linux
Priority: urgent  Severity: medium
Target Milestone: ovirt-4.2.0
Assigned To: Milan Zamazal
QA Contact: Elad
Keywords: FutureFeature
Duplicates: 1467893
Depends On: ovirt_refactor_disk_class_hierarchy
Blocks: rhv_turn_off_autoresume_of_paused_VMs 1417161
 
Reported: 2016-03-14 05:41 EDT by Yaniv Lavi (Dary)
Modified: 2017-10-19 08:59 EDT (History)
21 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: rhv_turn_off_autoresume_of_paused_VMs
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
michal.skrivanek: needinfo? (ylavi)
ylavi: ovirt-4.2?
ylavi: exception?
acanan: testing_plan_complete-
ylavi: planning_ack+
michal.skrivanek: devel_ack+
ylavi: testing_ack?


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3105171 None None None 2017-10-02 02:18 EDT
oVirt gerrit 82394 master POST core, webadmin: added support for resume behavior 2017-10-19 11:13 EDT
oVirt gerrit 82532 master MERGED virt: Auto-resume VMs only conditionally 2017-10-12 06:32 EDT
oVirt gerrit 82533 master MERGED virt: Track pause time 2017-10-12 06:32 EDT
oVirt gerrit 82534 master MERGED virt: Kill VMs paused for too long 2017-10-12 06:32 EDT
oVirt gerrit 82570 master MERGED Add Resume Behavior to VmBase 2017-10-19 08:36 EDT
oVirt gerrit 82571 master POST restapi: add Resume Behavior to VmBase 2017-10-12 05:23 EDT
oVirt gerrit 82977 model_4.2 MERGED Add Resume Behavior to VmBase 2017-10-19 08:36 EDT
oVirt gerrit 82982 master MERGED restapi: Update to model 4.2.22 2017-10-19 10:47 EDT

Description Yaniv Lavi (Dary) 2016-03-14 05:41:11 EDT
Make automatic resume of VMs paused due to I/O error configurable. Engine should be able to set that VMs paused as a result of an I/O error will not be resumed automatically once the storage domain recovers.

If VMs are resumed automatically (in an uncontrolled way) when the error condition in the storage domain is resolved, this will cause unexpected and/or undesired effects in their application. For example, resumed VMs don't have their clock in sync after they resume, which would cause significant issues for the application.

The admin needs to be able to configure the engine not to automatically resume VMs that were paused as a result of problems with the storage.
Comment 1 Allon Mureinik 2016-03-16 05:41:44 EDT
The correct way to represent this property (regardless of how it's later displayed to the user, UX-wise) is per vm-disk relationship. Thus, the refactoring described in bug 1142762 should be done first.

Removing the devel-ack+ until that's done, and then we should re-evaluate according to the timeframes.
Comment 5 Allon Mureinik 2016-07-17 11:15:37 EDT
Seeing as we don't have a sensible way to differentiate between EIO and ENOSPC at the moment, implementing something like this would be very risky with regard to thin provisioning.

Let's push it out to 4.1 and do it properly.
Comment 8 Yaniv Kaul 2017-06-06 15:02:48 EDT
(In reply to Allon Mureinik from comment #5)
> Seeing as we don't have a sensible way to differentiate between EIO and
> ENOSPC at the moment, implementing something like this would be very risky
> wrt thing provisioning.

Is there a bug open on the ability to differentiate between the two?
Comment 20 Yaniv Lavi (Dary) 2017-06-21 04:37:23 EDT
The request from users was to be able to control the resume behavior, so that if a VM was paused for more than X amount of time, you can set it to:
- stay paused and let the admin handle it,
- turn off the VM, or
- do a clean restart.

This applies to all three options; at its core it moves this decision to the management layer instead of letting VDSM just decide to resume. Even if X amount of time hasn't passed, the engine should probably resume the VM if the storage issue was resolved.
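The three requested policies and the "paused for more than X" condition could be modeled roughly like this (a minimal sketch; `PausePolicy` and `action_for` are hypothetical names, not the actual oVirt configuration schema):

```python
# Hypothetical sketch of the per-VM pause policy described above;
# names are illustrative, not real oVirt/vdsm API.
from enum import Enum

class PausePolicy(Enum):
    LEAVE_PAUSED = "leave_paused"   # stay paused, let the admin handle it
    TURN_OFF = "turn_off"           # power the VM off after the timeout
    CLEAN_RESTART = "clean_restart" # shut down and start the VM again

def action_for(policy, paused_seconds, timeout_seconds):
    """Apply the chosen policy once a VM has been paused longer than X."""
    if paused_seconds <= timeout_seconds:
        # Within the grace period: management may still decide to resume
        # the VM if the storage issue is resolved.
        return None
    return policy
```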
Comment 23 Nir Soffer 2017-06-21 15:45:03 EDT
Based on comment 21 and comment 22, I think we can do this:

engine:
- add an option like "Resume VM after I/O errors", defaulting to true,
  keeping the current behavior
- pass the option to vdsm when starting a vm

vdsm:
- when a vm is paused, keep the reason (e.g. ENOSPC) and the drive (e.g. vda);
  the failure should probably be kept in the drive object, so we can handle
  failures of multiple drives.
- when we resume a vm after extending a disk, resume it only if it was paused
  because of ENOSPC on a thin provisioned drive, or if resume after I/O error
  is enabled.
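The vdsm-side decision sketched in this comment could look roughly like the following (illustrative only; `Drive`, `pause_reason`, and `should_auto_resume` are assumed names, not the real vdsm code):

```python
# Hypothetical sketch of the conditional auto-resume check proposed above.
# Class and attribute names are assumptions, not actual vdsm API.

ENOSPC = "enospc"   # paused because a thin drive ran out of space
EIO = "eio"         # paused because of a generic I/O error

class Drive:
    def __init__(self, name, thin_provisioned, pause_reason=None):
        self.name = name
        self.thin_provisioned = thin_provisioned
        # The failure is kept on the drive object, so failures of
        # multiple drives can be tracked independently.
        self.pause_reason = pause_reason

def should_auto_resume(drives, resume_on_io_error):
    """Return True if a paused VM may be resumed automatically.

    Resume only if every failed drive was paused because of ENOSPC on a
    thin provisioned drive (fixed by extending the disk), or if the
    engine-side "Resume VM after I/O errors" option is enabled.
    """
    for drive in drives:
        if drive.pause_reason is None:
            continue  # this drive did not cause the pause
        if drive.pause_reason == ENOSPC and drive.thin_provisioned:
            continue  # extending the disk resolves this; safe to resume
        if not resume_on_io_error:
            return False  # admin opted out of auto-resume on I/O errors
    return True
```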
Comment 24 Yaniv Lavi (Dary) 2017-06-25 10:36:03 EDT
(In reply to Nir Soffer from comment #23)
> Based on comment 21 and comment 22, I think we can do this:
> 
> engine:
> - add an option like "Resume VM after I/O errors", defaulting to true,
>   keeping the current behavior
> - pass the option to vdsm when starting a vm
> 
> vdsm:
> - when a vm is paused, keep the reason (e.g. ENOSPC, and drive e.g. vda)
>   the failure should probably kept in the drive object, so we can handle
>   failures of multiple drives.
> - when we resume a vm after extending a disk, resume it only if it was paused
>   because of ENOSPC on a thin provisioned drive, or if the resume after I/O
>   error is enabled.

Please review comment 20.
We might not want to allow the VM to resume, but instead take a different action, such as restarting or shutting down automatically.
Comment 25 Michal Skrivanek 2017-09-19 09:50:59 EDT
based on a discussion with the storage team I'm moving this under virt.
Revised implementation proposal:
add a "resume behavior" property to the VM with the following options:
- auto resume - the default; current behavior for non-HAwL VMs (other than "HA with lease", so both plain HA and no HA).
- leave paused - new behavior for non-HAwL VMs to bypass the autoresume on the vdsm side and just leave VMs in the paused state indefinitely.
- kill - the only option for HAwL VMs, addressing also bug 1467893 and the possible disk corruption scenario there. Kills the VM when it is unpaused if the predefined interval has passed. The interval could be set similar to the sanlock lease expiration interval, 80s; it needs to be more than 0 to tolerate short hiccups and must not be longer than the time it takes the engine to move the VM to Unknown, which would trigger a restart elsewhere (~5 mins). The unpause operation happens either automatically due to the built-in autoresume code in vdsm, or it is checked and the VM killed during VM recovery in case of a vdsm restart. The "kill" option for non-HAwL VMs is probably not very useful, but we can keep it as a possibility.

vdsm changes are confined to the autoresume code (and recovery); the configuration can be passed via XML metadata, and the time when the VM was paused can be stored there as well and checked on unpause.

This should cover both this RFE and the problematic behavior in bug 1467893 _and_ would be feasible for 4.2.
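The pause-time tracking and kill-after-timeout check from this proposal could be sketched as follows (names, the 80s constant, and the metadata handling are assumptions drawn from this comment, not the actual vdsm implementation):

```python
# Illustrative sketch of the "kill if paused too long" behavior proposed
# above; not the real vdsm code.
import time

KILL_TIMEOUT = 80  # seconds; comparable to the sanlock lease expiration

class PausedVm:
    def __init__(self, resume_behavior):
        # resume_behavior: "auto_resume", "leave_paused", or "kill",
        # passed from the engine (e.g. via XML metadata).
        self.resume_behavior = resume_behavior
        self.pause_time = None

    def on_pause(self, now=None):
        # Record when the VM was paused; in vdsm this could be stored in
        # the domain XML metadata so it survives a vdsm restart.
        self.pause_time = now if now is not None else time.monotonic()

    def on_unpause(self, now=None):
        """Decide what to do when the VM is about to be unpaused."""
        now = now if now is not None else time.monotonic()
        if self.resume_behavior == "leave_paused":
            return "stay_paused"
        if (self.resume_behavior == "kill"
                and self.pause_time is not None
                and now - self.pause_time > KILL_TIMEOUT):
            # Paused longer than the interval: kill, since the engine may
            # already have moved the VM to Unknown and restarted it elsewhere.
            return "kill"
        return "resume"
```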
Comment 26 Nir Soffer 2017-09-19 10:02:17 EDT
(In reply to Yaniv Lavi (Dary) from comment #20)

> The request from users was to be able to control the resume so that if a VM
> was paused for more than X amount of time, you will be able to set it to:
> - Stay paused and let admin handle,

We can do this for IOError, but not for ENOSPC errors, otherwise thin
provisioning will not be possible.

> - Turn off the VM.

We can do this.

> - Clean restart.

Restarting is not possible for an HA vm; the vm may have been started
on another host.

For HA VMs we must kill the vm if it could not be resumed after some
timeout, since engine may try to move the vm to another host - see comment 25.

> This is true for all three of these options and at its core is moving this
> decision to the management and not let VDSM just decide to resume it. Even
> if X amount of time didn't pass, the engine should probably resume the VM if
> the storage issue was resolved.

Resuming the vm should be done on the Vdsm side; otherwise, when the engine is down, VMs may stay paused. So we cannot move pause handling to the management.

What we can do is use a distributed database like etcd to share the state in the
cluster with all the hosts. In this case vdsm can know that a vm moved from
one host to another and can do the right thing.
Comment 27 Martin Tessun 2017-09-22 10:40:57 EDT
(In reply to Michal Skrivanek from comment #25)
> based on a discussion with storage team I'm moving this under virt. 
> Revised implementation proposal:
> add "resume behavior" property to VM with following behavior
> - auto resume - default, current behavior for non-HAwL VMs(other than "HA
> with lease", so both plain HA and no HA).
> - leave paused - new behavior for non-HAwL VMs to bypass the autoresume on
> vdsm side and just leave VMs in the paused state indefinitely
> - kill - the only option for HAwL VMs, addressing also bug 1467893 and the
> possible disk corruption scenario there. Kills the VM when it is unpaused if
> the predefined interval passed. The interval could be set similar to sanlock
> lease expiration interval, 80s, needs to be more than 0 to tolerate short
> time hiccups and must not be longer than time it takes engine to move VM to
> Unknown which would trigger a restart elsewhere(~5 mins). The unpause
> operation happens either automatically due to the built-in autoresume code
> in vdsm or it is checked and kiled in VM recovery in case of vdsm restart.
> "kill" option for non-HAwL VMs is probably not very useful, but we can keep
> it as a possibility.
> 

Looks good to me. I would also keep the "kill" option for non-HAwL VMs (maybe even triggering a start in case the killed VM is an HA VM).

> vdsm changes are confined to the autoresume code (and recovery), the
> configuration can be pased via xml metadata, as well as the time when the VM
> was paused can be stored there and checked on unpause.
> 
> This should cover both this RFE and the problematic behavior in bug 1467893
> _and_ would be feasible for 4.2

Sounds good to me.
Comment 28 Tomas Jelinek 2017-10-02 02:18:06 EDT
*** Bug 1467893 has been marked as a duplicate of this bug. ***
