Bug 1317450 (oVirt_turn_off_autoresume_of_paused_VMs) - [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time)
Summary: [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: oVirt_turn_off_autoresume_of_paused_VMs
Product: ovirt-engine
Classification: oVirt
Component: RFEs
Version: 3.6.0
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Milan Zamazal
QA Contact: Polina
URL:
Whiteboard:
Duplicates: 1467893 (view as bug list)
Depends On: ovirt_refactor_disk_class_hierarchy
Blocks: rhv_turn_off_autoresume_of_paused_VMs 1417161
 
Reported: 2016-03-14 09:41 UTC by Yaniv Lavi
Modified: 2020-08-13 08:24 UTC
CC List: 24 users

Fixed In Version:
Clone Of: rhv_turn_off_autoresume_of_paused_VMs
Environment:
Last Closed: 2018-05-04 10:45:57 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: exception+
pagranat: testing_plan_complete+
ylavi: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+


Attachments: (none)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1540548 0 high CLOSED [RFE] Automatically restart HA VMs paused due to I/O Error 2021-09-09 13:09:55 UTC
Red Hat Knowledge Base (Solution) 3105171 0 None None None 2017-10-02 06:18:05 UTC
oVirt gerrit 82394 0 master MERGED core, webadmin: added support for resume behavior 2020-06-17 12:52:25 UTC
oVirt gerrit 82532 0 master MERGED virt: Auto-resume VMs only conditionally 2020-06-17 12:52:25 UTC
oVirt gerrit 82533 0 master MERGED virt: Track pause time 2020-06-17 12:52:24 UTC
oVirt gerrit 82534 0 master MERGED virt: Kill VMs paused for too long 2020-06-17 12:52:24 UTC
oVirt gerrit 82570 0 master MERGED Add Resume Behavior to VmBase 2020-06-17 12:52:24 UTC
oVirt gerrit 82571 0 master MERGED restapi: add Resume Behavior to VmBase 2020-06-17 12:52:24 UTC
oVirt gerrit 82977 0 model_4.2 MERGED Add Resume Behavior to VmBase 2020-06-17 12:52:24 UTC
oVirt gerrit 82982 0 master MERGED restapi: Update to model 4.2.22 2020-06-17 12:52:24 UTC
oVirt gerrit 83029 0 master MERGED vm: add support to force the resume behavior 2020-06-17 12:52:23 UTC
oVirt gerrit 83030 0 master ABANDONED virt: set resume behavior when extending volume 2020-06-17 12:52:23 UTC
oVirt gerrit 83031 0 master ABANDONED virt: always auto resume in thin provision flow 2020-06-17 12:52:22 UTC
oVirt gerrit 83032 0 master ABANDONED virt: always autoresume when starting LSM 2020-06-17 12:52:22 UTC
oVirt gerrit 83033 0 master ABANDONED virt: always autoresume when ending live merge 2020-06-17 12:52:22 UTC
oVirt gerrit 83082 0 master MERGED Update to model 4.2.23 2020-06-17 12:52:22 UTC
oVirt gerrit 83319 0 master MERGED virt: Always resume VMs paused due to ENOSPC 2020-06-17 12:52:22 UTC

Internal Links: 1540548

Description Yaniv Lavi 2016-03-14 09:41:11 UTC
Make automatic resume of VMs paused due to I/O error configurable. The engine should be able to specify that VMs paused as a result of an I/O error are not resumed automatically once the storage domain recovers.

If VMs are resumed automatically (in an uncontrolled way) when the error condition in the storage domain is resolved, this will cause unexpected and/or undesired effects in their application. For example, resumed VMs don't have their clock in sync after they resume, which would cause significant issues for the application.

The admin needs to be able to configure the engine not to automatically resume VMs that were paused as a result of storage problems.

Comment 1 Allon Mureinik 2016-03-16 09:41:44 UTC
The correct way to represent this property (regardless of how it's later displayed to the user, UX-wise) is per vm-disk relationship. Thus, the refactoring described in bug 1142762 should be done first.

Removing the devel-ack+ until that's done, and then we should re-evaluate according to the timeframes.

Comment 5 Allon Mureinik 2016-07-17 15:15:37 UTC
Seeing as we don't have a sensible way to differentiate between EIO and ENOSPC at the moment, implementing something like this would be very risky with respect to thin provisioning.

Let's push it out to 4.1 and do it properly.

Comment 8 Yaniv Kaul 2017-06-06 19:02:48 UTC
(In reply to Allon Mureinik from comment #5)
> Seeing as we don't have a sensible way to differentiate between EIO and
> ENOSPC at the moment, implementing something like this would be very risky
> wrt thin provisioning.

Is there a bug open on the ability to differentiate between the two?

Comment 20 Yaniv Lavi 2017-06-21 08:37:23 UTC
The request from users was to be able to control the resume behavior, so that if a VM was paused for more than X amount of time, you can set it to:
- stay paused and let the admin handle it,
- turn the VM off, or
- do a clean restart.

This applies to all three options; at its core it means moving this decision to the management instead of letting VDSM just decide to resume the VM. Even if X amount of time hasn't passed, the engine should probably resume the VM if the storage issue was resolved.
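As a rough, purely illustrative sketch of such a policy (hypothetical Python names; this is neither the engine nor the VDSM code), the decision could look like this:

    from enum import Enum

    class ResumePolicy(Enum):
        LEAVE_PAUSED = "leave_paused"   # stay paused, let the admin handle it
        KILL = "kill"                   # turn off the VM
        RESTART = "restart"             # clean restart

    def handle_paused_vm(policy, paused_seconds, max_pause_seconds, storage_ok):
        """Decide what to do with a VM paused on an I/O error."""
        if paused_seconds < max_pause_seconds:
            # Within the grace period: resume as soon as storage recovers.
            return "resume" if storage_ok else "wait"
        if policy is ResumePolicy.LEAVE_PAUSED:
            return "wait"
        if policy is ResumePolicy.KILL:
            return "destroy"
        return "destroy_and_restart"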

Comment 23 Nir Soffer 2017-06-21 19:45:03 UTC
Based on comment 21 and comment 22, I think we can do this:

engine:
- add an option like "Resume VM after I/O errors", defaulting to true,
  keeping the current behavior
- pass the option to vdsm when starting a vm

vdsm:
- when a vm is paused, keep the reason (e.g. ENOSPC) and the drive (e.g. vda);
  the failure should probably be kept in the drive object, so we can handle
  failures of multiple drives.
- when we resume a vm after extending a disk, resume it only if it was paused
  because of ENOSPC on a thin provisioned drive, or if the "resume after I/O error"
  option is enabled.
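
A minimal sketch of the vdsm-side check described above (hypothetical class and attribute names, not the actual Vdsm code):

    import errno

    class Drive:
        def __init__(self, name, thin_provisioned):
            self.name = name                    # e.g. "vda"
            self.thin_provisioned = thin_provisioned
            self.pause_errno = None             # e.g. errno.ENOSPC, recorded on pause

    def should_resume(drive, resume_after_io_error):
        # Always resume when the pause was ENOSPC on a thin provisioned drive
        # (the disk was just extended); otherwise honour the engine option.
        if drive.pause_errno == errno.ENOSPC and drive.thin_provisioned:
            return True
        return resume_after_io_error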

Comment 24 Yaniv Lavi 2017-06-25 14:36:03 UTC
(In reply to Nir Soffer from comment #23)
> Based on comment 21 and comment 22, I think we can do this:
> 
> engine:
> - add an option like "Resume VM after I/O errors", defaulting to true,
>   keeping the current behavior
> - pass the option to vdsm when starting a vm
> 
> vdsm:
> - when a vm is paused, keep the reason (e.g. ENOSPC, and drive e.g. vda)
>   the failure should probably be kept in the drive object, so we can handle
>   failures of multiple drives.
> - when we resume a vm after extending a disk, resume it only if it was paused
>   because of ENOSPC on a thin provisioned drive, or if the resume after I/O
> error
>   is enabled.

Please review comment 20. 
We might not want to allow the VM to resume, but rather handle it differently, e.g. restart or shut it down automatically.

Comment 25 Michal Skrivanek 2017-09-19 13:50:59 UTC
based on a discussion with storage team I'm moving this under virt. 
Revised implementation proposal:
add "resume behavior" property to VM with following behavior
- auto resume - default, current behavior for non-HAwL VMs(other than "HA with lease", so both plain HA and no HA).
- leave paused - new behavior for non-HAwL VMs to bypass the autoresume on vdsm side and just leave VMs in the paused state indefinitely
- kill - the only option for HAwL VMs, addressing also bug 1467893 and the possible disk corruption scenario there. Kills the VM when it is unpaused if the predefined interval passed. The interval could be set similar to sanlock lease expiration interval, 80s, needs to be more than 0 to tolerate short time hiccups and must not be longer than time it takes engine to move VM to Unknown which would trigger a restart elsewhere(~5 mins). The unpause operation happens either automatically due to the built-in autoresume code in vdsm or it is checked and kiled in VM recovery in case of vdsm restart. "kill" option for non-HAwL VMs is probably not very useful, but we can keep it as a possibility.

vdsm changes are confined to the autoresume code (and recovery), the configuration can be pased via xml metadata, as well as the time when the VM was paused can be stored there and checked on unpause.

This should cover both this RFE and the problematic behavior in bug 1467893 _and_ would be feasible for 4.2
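
A minimal sketch of the proposed "kill" behavior on the vdsm side (hypothetical helper names; as proposed above, the real values would live in the domain XML metadata):

    import time

    KILL_INTERVAL = 80  # seconds, similar to the sanlock lease expiration interval

    def on_pause(metadata):
        metadata["pause_time"] = time.time()

    def on_unpause_or_recovery(metadata, resume_behavior, resume_vm, destroy_vm):
        paused_for = time.time() - metadata.get("pause_time", time.time())
        if resume_behavior == "kill" and paused_for > KILL_INTERVAL:
            destroy_vm()   # the VM may already have been restarted elsewhere by HA
        elif resume_behavior == "auto_resume":
            resume_vm()
        # "leave_paused": do nothing, the admin decides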

Comment 26 Nir Soffer 2017-09-19 14:02:17 UTC
(In reply to Yaniv Lavi (Dary) from comment #20)

> The request from users was to be able to control the resume so that if a VM
> was paused for more than X amount of time, you will be able to set it to:
> - Stay paused and let admin handle,

We can do this for IOError, but not for ENOSPC errors, otherwise thin
provisioning will not be possible.

> - Turn off the VM.

We can do this.

> - Clean restart.

Restarting is not possible for an HA VM, since the VM may have been started
on another host.

For HA VMs we must kill the VM if it could not be resumed within some
timeout, since the engine may try to move the VM to another host - see comment 25.

> This is true for all three of these options and at its core is moving this
> decision to the management and not let VDSM just decide to resume it. Even
> if X amount of time didn't pass, the engine should probably resume the VM if
> the storage issue was resolved.

Resuming the VM should be done on the Vdsm side; otherwise VMs that get paused
while the engine is down would not be resumed. So we cannot move pause handling
to the management.

What we can do is use a distributed database like etcd to share the state in the
cluster with all the hosts. In that case Vdsm can know that a VM was moved from
one host to another and can do the right thing.

Comment 27 Martin Tessun 2017-09-22 14:40:57 UTC
(In reply to Michal Skrivanek from comment #25)
> based on a discussion with storage team I'm moving this under virt. 
> Revised implementation proposal:
> add "resume behavior" property to VM with following behavior
> - auto resume - default, current behavior for non-HAwL VMs(other than "HA
> with lease", so both plain HA and no HA).
> - leave paused - new behavior for non-HAwL VMs to bypass the autoresume on
> vdsm side and just leave VMs in the paused state indefinitely
> - kill - the only option for HAwL VMs, addressing also bug 1467893 and the
> possible disk corruption scenario there. Kills the VM when it is unpaused if
> the predefined interval passed. The interval could be set similar to sanlock
> lease expiration interval, 80s, needs to be more than 0 to tolerate short
> time hiccups and must not be longer than time it takes engine to move VM to
> Unknown which would trigger a restart elsewhere(~5 mins). The unpause
> operation happens either automatically due to the built-in autoresume code
> in vdsm or it is checked and killed in VM recovery in case of vdsm restart.
> "kill" option for non-HAwL VMs is probably not very useful, but we can keep
> it as a possibility.
> 

Looks good to me. I would also keep the "kill" option for non-HAwL VMs (maybe even triggering a start in case the killed VM is an HA VM).

> vdsm changes are confined to the autoresume code (and recovery), the
> configuration can be passed via xml metadata, as well as the time when the VM
> was paused can be stored there and checked on unpause.
> 
> This should cover both this RFE and the problematic behavior in bug 1467893
> _and_ would be feasible for 4.2

Sounds good to me.

Comment 28 Tomas Jelinek 2017-10-02 06:18:06 UTC
*** Bug 1467893 has been marked as a duplicate of this bug. ***

Comment 29 Yaniv Lavi 2017-10-29 17:01:34 UTC
Looks good to me as well.

Comment 33 Polina 2018-05-02 13:24:17 UTC
Summary for verification on rhv-release-4.2.3-4-001.noarch:

The RFE is verified on Gluster, FC, iSCSI, and NFS storage.

1. On iSCSI and Gluster the I/O error pause was created by blocking the storage connection with an iptables DROP rule.
2. On FC - by making the LUN path faulty (e.g. echo "offline" > /sys/block/sdd/device/state).
3. On NFS I managed to get the I/O error pause by changing the Retransmissions & Timeout parameters of the SD.
Here are the steps:
   1. Put the SD in maintenance(by Data Center)
   2. Open Storage Domains/Manage Domain/Custom Connection Parameters 
   3. Change the following parameters:
	Retransmissions (#) = 2
	Timeout (deciseconds) = 1 (i.e.10 sec)
   4. Activate the SD. 
   5. Run the VM associated with this SD.

The behavior of NFS VMs has been tested in this setup.

Comment 34 Sandro Bonazzola 2018-05-04 10:45:57 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

