Bug 1317450 (oVirt_turn_off_autoresume_of_paused_VMs) - [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time)
Summary: [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: oVirt_turn_off_autoresume_of_paused_VMs
Product: ovirt-engine
Classification: oVirt
Component: RFEs
Version: 3.6.0
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Milan Zamazal
QA Contact: Polina
URL:
Whiteboard:
Duplicates: 1467893 (view as bug list)
Depends On: ovirt_refactor_disk_class_hierarchy
Blocks: rhv_turn_off_autoresume_of_paused_VMs 1417161
 
Reported: 2016-03-14 09:41 UTC by Yaniv Lavi
Modified: 2020-08-13 08:24 UTC
CC List: 24 users

Fixed In Version:
Clone Of: rhv_turn_off_autoresume_of_paused_VMs
Environment:
Last Closed: 2018-05-04 10:45:57 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: exception+
pagranat: testing_plan_complete+
ylavi: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+


Attachments: (none)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1540548 0 high CLOSED [RFE] Automatically restart HA VMs paused due to I/O Error 2021-09-09 13:09:55 UTC
Red Hat Knowledge Base (Solution) 3105171 0 None None None 2017-10-02 06:18:05 UTC
oVirt gerrit 82394 0 master MERGED core, webadmin: added support for resume behavior 2020-06-17 12:52:25 UTC
oVirt gerrit 82532 0 master MERGED virt: Auto-resume VMs only conditionally 2020-06-17 12:52:25 UTC
oVirt gerrit 82533 0 master MERGED virt: Track pause time 2020-06-17 12:52:24 UTC
oVirt gerrit 82534 0 master MERGED virt: Kill VMs paused for too long 2020-06-17 12:52:24 UTC
oVirt gerrit 82570 0 master MERGED Add Resume Behavior to VmBase 2020-06-17 12:52:24 UTC
oVirt gerrit 82571 0 master MERGED restapi: add Resume Behavior to VmBase 2020-06-17 12:52:24 UTC
oVirt gerrit 82977 0 model_4.2 MERGED Add Resume Behavior to VmBase 2020-06-17 12:52:24 UTC
oVirt gerrit 82982 0 master MERGED restapi: Update to model 4.2.22 2020-06-17 12:52:24 UTC
oVirt gerrit 83029 0 master MERGED vm: add support to force the resume behavior 2020-06-17 12:52:23 UTC
oVirt gerrit 83030 0 master ABANDONED virt: set resume behavior when extending volume 2020-06-17 12:52:23 UTC
oVirt gerrit 83031 0 master ABANDONED virt: always auto resume in thin provision flow 2020-06-17 12:52:22 UTC
oVirt gerrit 83032 0 master ABANDONED virt: always autoresume when starting LSM 2020-06-17 12:52:22 UTC
oVirt gerrit 83033 0 master ABANDONED virt: always autoresume when ending live merge 2020-06-17 12:52:22 UTC
oVirt gerrit 83082 0 master MERGED Update to model 4.2.23 2020-06-17 12:52:22 UTC
oVirt gerrit 83319 0 master MERGED virt: Always resume VMs paused due to ENOSPC 2020-06-17 12:52:22 UTC

Internal Links: 1540548

Description Yaniv Lavi 2016-03-14 09:41:11 UTC
Make automatic resume of VMs paused due to I/O error configurable. The engine should be able to specify that VMs paused as a result of an I/O error are not resumed automatically once the storage domain recovers.

If VMs are resumed automatically (in an uncontrolled way) when the error condition in the storage domain is resolved, this will cause unexpected and/or undesired effects in their application. For example, resumed VMs don't have their clock in sync after they resume, which would cause significant issues for the application.

The admin needs to be able to configure the engine not to automatically resume VMs that were paused as a result of storage problems.

Comment 1 Allon Mureinik 2016-03-16 09:41:44 UTC
The correct way to represent this property (regardless of how it's later displayed to the user, UX-wise) is per vm-disk relationship. Thus, the refactoring described in bug 1142762 should be done first.

Removing the devel-ack+ until that's done, and then we should re-evaluate according to the timeframes.

Comment 5 Allon Mureinik 2016-07-17 15:15:37 UTC
Seeing as we don't have a sensible way to differentiate between EIO and ENOSPC at the moment, implementing something like this would be very risky with respect to thin provisioning.

Let's push it out to 4.1 and do it properly.

Comment 8 Yaniv Kaul 2017-06-06 19:02:48 UTC
(In reply to Allon Mureinik from comment #5)
> Seeing as we don't have a sensible way to differentiate between EIO and
> ENOSPC at the moment, implementing something like this would be very risky
> wrt thin provisioning.

Is there a bug open on the ability to differentiate between the two?

Comment 20 Yaniv Lavi 2017-06-21 08:37:23 UTC
The request from users was to be able to control the resume behavior, so that if a VM was paused for more than X amount of time, you can set it to:
- stay paused and let the admin handle it,
- turn the VM off, or
- do a clean restart.

This applies to all three options; at its core it means moving this decision to the management instead of letting VDSM just decide to resume the VM. Even if X amount of time hasn't passed, the engine should probably resume the VM if the storage issue was resolved.
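As a rough, purely illustrative sketch of such a policy (hypothetical Python names; this is neither the engine nor the VDSM code), the decision could look like this:

    from enum import Enum

    class ResumePolicy(Enum):
        LEAVE_PAUSED = "leave_paused"   # stay paused, let the admin handle it
        KILL = "kill"                   # turn off the VM
        RESTART = "restart"             # clean restart

    def handle_paused_vm(policy, paused_seconds, max_pause_seconds, storage_ok):
        """Decide what to do with a VM paused on an I/O error."""
        if paused_seconds < max_pause_seconds:
            # Within the grace period: resume as soon as storage recovers.
            return "resume" if storage_ok else "wait"
        if policy is ResumePolicy.LEAVE_PAUSED:
            return "wait"
        if policy is ResumePolicy.KILL:
            return "destroy"
        return "destroy_and_restart"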

Comment 23 Nir Soffer 2017-06-21 19:45:03 UTC
Based on comment 21 and comment 22, I think we can do this:

engine:
- add an option like "Resume VM after I/O errors", defaulting to true,
  keeping the current behavior
- pass the option to vdsm when starting a vm

vdsm:
- when a vm is paused, keep the reason (e.g. ENOSPC) and the drive (e.g. vda);
  the failure should probably be kept in the drive object, so we can handle
  failures of multiple drives.
- when we resume a vm after extending a disk, resume it only if it was paused
  because of ENOSPC on a thin provisioned drive, or if the "resume after I/O error"
  option is enabled.
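
A minimal sketch of the vdsm-side check described above (hypothetical class and attribute names, not the actual Vdsm code):

    import errno

    class Drive:
        def __init__(self, name, thin_provisioned):
            self.name = name                    # e.g. "vda"
            self.thin_provisioned = thin_provisioned
            self.pause_errno = None             # e.g. errno.ENOSPC, recorded on pause

    def should_resume(drive, resume_after_io_error):
        # Always resume when the pause was ENOSPC on a thin provisioned drive
        # (the disk was just extended); otherwise honour the engine option.
        if drive.pause_errno == errno.ENOSPC and drive.thin_provisioned:
            return True
        return resume_after_io_error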

Comment 24 Yaniv Lavi 2017-06-25 14:36:03 UTC
(In reply to Nir Soffer from comment #23)
> Based on comment 21 and comment 22, I think we can do this:
> 
> engine:
> - add an option like "Resume VM after I/O errors", defaulting to true,
>   keeping the current behavior
> - pass the option to vdsm when starting a vm
> 
> vdsm:
> - when a vm is paused, keep the reason (e.g. ENOSPC, and drive e.g. vda)
>   the failure should probably be kept in the drive object, so we can handle
>   failures of multiple drives.
> - when we resume a vm after extending a disk, resume it only if it was paused
>   because of ENOSPC on a thin provisioned drive, or if the resume after I/O
> error
>   is enabled.

Please review comment 20. 
We might not want to allow the VM to resume, but rather handle it differently, e.g. restart or shut it down automatically.

Comment 25 Michal Skrivanek 2017-09-19 13:50:59 UTC
based on a discussion with storage team I'm moving this under virt. 
Revised implementation proposal:
add "resume behavior" property to VM with following behavior
- auto resume - default, current behavior for non-HAwL VMs(other than "HA with lease", so both plain HA and no HA).
- leave paused - new behavior for non-HAwL VMs to bypass the autoresume on vdsm side and just leave VMs in the paused state indefinitely
- kill - the only option for HAwL VMs, addressing also bug 1467893 and the possible disk corruption scenario there. Kills the VM when it is unpaused if the predefined interval passed. The interval could be set similar to sanlock lease expiration interval, 80s, needs to be more than 0 to tolerate short time hiccups and must not be longer than time it takes engine to move VM to Unknown which would trigger a restart elsewhere(~5 mins). The unpause operation happens either automatically due to the built-in autoresume code in vdsm or it is checked and kiled in VM recovery in case of vdsm restart. "kill" option for non-HAwL VMs is probably not very useful, but we can keep it as a possibility.

vdsm changes are confined to the autoresume code (and recovery), the configuration can be pased via xml metadata, as well as the time when the VM was paused can be stored there and checked on unpause.

This should cover both this RFE and the problematic behavior in bug 1467893 _and_ would be feasible for 4.2
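
A minimal sketch of the proposed "kill" behavior on the vdsm side (hypothetical helper names; as proposed above, the real values would live in the domain XML metadata):

    import time

    KILL_INTERVAL = 80  # seconds, similar to the sanlock lease expiration interval

    def on_pause(metadata):
        metadata["pause_time"] = time.time()

    def on_unpause_or_recovery(metadata, resume_behavior, resume_vm, destroy_vm):
        paused_for = time.time() - metadata.get("pause_time", time.time())
        if resume_behavior == "kill" and paused_for > KILL_INTERVAL:
            destroy_vm()   # the VM may already have been restarted elsewhere by HA
        elif resume_behavior == "auto_resume":
            resume_vm()
        # "leave_paused": do nothing, the admin decides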

Comment 26 Nir Soffer 2017-09-19 14:02:17 UTC
(In reply to Yaniv Lavi (Dary) from comment #20)

> The request from users was to be able to control the resume so that if a VM
> was paused for more than X amount of time, you will be able to set it to:
> - Stay paused and let admin handle,

We can do this for IOError, but not for ENOSPC errors, otherwise thin
provisioning will not be possible.

> - Turn off the VM.

We can do this.

> - Clean restart.

Restarting is not possible for an HA VM, since the VM may have been started
on another host.

For HA VMs we must kill the VM if it could not be resumed within some
timeout, since the engine may try to move the VM to another host - see comment 25.

> This is true for all three of these options and at its core is moving this
> decision to the management and not let VDSM just decide to resume it. Even
> if X amount of time didn't pass, the engine should probably resume the VM if
> the storage issue was resolved.

Resuming the VM should be done on the Vdsm side; otherwise VMs that get paused
while the engine is down would not be resumed. So we cannot move pause handling
to the management.

What we can do is use a distributed database like etcd to share the state in the
cluster with all the hosts. In that case Vdsm can know that a VM was moved from
one host to another and can do the right thing.

Comment 27 Martin Tessun 2017-09-22 14:40:57 UTC
(In reply to Michal Skrivanek from comment #25)
> based on a discussion with storage team I'm moving this under virt. 
> Revised implementation proposal:
> add "resume behavior" property to VM with following behavior
> - auto resume - default, current behavior for non-HAwL VMs(other than "HA
> with lease", so both plain HA and no HA).
> - leave paused - new behavior for non-HAwL VMs to bypass the autoresume on
> vdsm side and just leave VMs in the paused state indefinitely
> - kill - the only option for HAwL VMs, addressing also bug 1467893 and the
> possible disk corruption scenario there. Kills the VM when it is unpaused if
> the predefined interval passed. The interval could be set similar to sanlock
> lease expiration interval, 80s, needs to be more than 0 to tolerate short
> time hiccups and must not be longer than time it takes engine to move VM to
> Unknown which would trigger a restart elsewhere(~5 mins). The unpause
> operation happens either automatically due to the built-in autoresume code
> in vdsm or it is checked and killed in VM recovery in case of vdsm restart.
> "kill" option for non-HAwL VMs is probably not very useful, but we can keep
> it as a possibility.
> 

Looks good to me. I would also keep the "kill" option for non-HAwL VMs (maybe even triggering a start in case the killed VM is an HA VM).

> vdsm changes are confined to the autoresume code (and recovery), the
> configuration can be passed via xml metadata, as well as the time when the VM
> was paused can be stored there and checked on unpause.
> 
> This should cover both this RFE and the problematic behavior in bug 1467893
> _and_ would be feasible for 4.2

Sounds good to me.

Comment 28 Tomas Jelinek 2017-10-02 06:18:06 UTC
*** Bug 1467893 has been marked as a duplicate of this bug. ***

Comment 29 Yaniv Lavi 2017-10-29 17:01:34 UTC
Looks good to me as well.

Comment 33 Polina 2018-05-02 13:24:17 UTC
Summary for verification on rhv-release-4.2.3-4-001.noarch:

The RFE is verified on Gluster, FC, iSCSI, and NFS storage.

1. On iSCSI and Gluster the I/O error pause was created by blocking the storage connection with an iptables DROP rule.
2. On FC - by making the LUN path faulty (e.g. echo "offline" > /sys/block/sdd/device/state).
3. On NFS I managed to get the I/O error pause by changing the Retransmissions & Timeout parameters of the SD.
Here are the steps:
   1. Put the SD in maintenance(by Data Center)
   2. Open Storage Domains/Manage Domain/Custom Connection Parameters 
   3. Change the following parameters:
	Retransmissions (#) = 2
	Timeout (deciseconds) = 1 (i.e.10 sec)
   4. Activate the SD. 
   5. Run the VM associated with this SD.

The behavior of NFS VMs has been tested in this setup.

Comment 34 Sandro Bonazzola 2018-05-04 10:45:57 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

