Bug 1317450 (oVirt_turn_off_autoresume_of_paused_VMs)

Summary: [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time)
Product: [oVirt] ovirt-engine Reporter: Yaniv Lavi <ylavi>
Component: RFEs    Assignee: Milan Zamazal <mzamazal>
Status: CLOSED CURRENTRELEASE QA Contact: Polina <pagranat>
Severity: medium Docs Contact:
Priority: urgent    
Version: 3.6.0    CC: acanan, amureini, aperotti, apinnick, bugs, danken, dfediuck, ebenahar, fgarciad, jcoscia, jentrena, kwolf, lpeer, lsurette, michal.skrivanek, mkalinin, mtessun, nsoffer, rbalakri, shipatil, srevivo, tjelinek, tnisan, ylavi
Target Milestone: ovirt-4.2.0    Keywords: FutureFeature
Target Release: 4.2.0    Flags: rule-engine: ovirt-4.2+
rule-engine: exception+
pagranat: testing_plan_complete+
ylavi: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Previously, when a virtual machine was paused due to an IO error, the only resume policy was "Auto Resume", which resumed the virtual machine. "Auto Resume" was problematic because it could interfere with custom HA solutions. In the current release, the "Kill" and "Leave Paused" resume policies have been added. "Leave Paused" was introduced for users who prefer this option because they have their own HA implementation. "Kill" allows virtual machines with a lease to automatically restart on another host in the event of an irrecoverable outage. The speed of IO error reporting depends on the underlying storage protocol. On FC storage, IO errors are generally detected quickly, while on NFS mounts with typical default settings, they may not be detected for several minutes.
Story Points: ---
Clone Of: rhv_turn_off_autoresume_of_paused_VMs Environment:
Last Closed: 2018-05-04 10:45:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1142762    
Bug Blocks: 1230788, 1417161    

Description Yaniv Lavi 2016-03-14 09:41:11 UTC
Make automatic resume of VMs paused due to an I/O error configurable. The engine should be able to specify that VMs paused as a result of an I/O error are not resumed automatically once the storage domain recovers.

If VMs are resumed automatically (in an uncontrolled way) when the error condition in the storage domain is resolved, this can cause unexpected and/or undesired effects in their applications. For example, resumed VMs don't have their clocks in sync after they resume, which can cause significant issues for the application.

The admin needs to be able to configure the engine not to automatically resume VMs that were paused as a result of storage problems.

Comment 1 Allon Mureinik 2016-03-16 09:41:44 UTC
The correct way to represent this property (regardless of how it is later displayed to the user, UX-wise) is per VM-disk relationship. Thus, the refactoring described in bug 1142762 should be done first.

Removing the devel-ack+ until that's done, and then we should re-evaluate according to the timeframes.

Comment 5 Allon Mureinik 2016-07-17 15:15:37 UTC
Seeing as we don't have a sensible way to differentiate between EIO and ENOSPC at the moment, implementing something like this would be very risky wrt thin provisioning.

Let's push it out to 4.1 and do it properly.

Comment 8 Yaniv Kaul 2017-06-06 19:02:48 UTC
(In reply to Allon Mureinik from comment #5)
> Seeing as we don't have a sensible way to differentiate between EIO and
> ENOSPC at the moment, implementing something like this would be very risky
> wrt thin provisioning.

Is there a bug open on the ability to differentiate between the two?

Comment 20 Yaniv Lavi 2017-06-21 08:37:23 UTC
The request from users was to be able to control the resume behavior, so that if a VM was paused for more than X amount of time, you would be able to set it to:
- Stay paused and let the admin handle it.
- Turn off the VM.
- Clean restart.

This applies to all three of these options; at its core, it moves this decision to the management instead of letting VDSM just decide to resume. Even if X amount of time hasn't passed, the engine should probably resume the VM if the storage issue was resolved (see the sketch below).
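
To make the requested knobs concrete, here is a minimal Python sketch (all names are invented for illustration; this is not the eventual oVirt property):

    # Hypothetical per-VM policy illustrating the request above; the field
    # and value names are invented, not the actual engine configuration.
    from dataclasses import dataclass

    @dataclass
    class PausedVmPolicy:
        # What to do once the VM has been paused longer than max_paused_seconds.
        action: str = "stay_paused"    # or "turn_off" or "clean_restart"
        max_paused_seconds: int = 300  # the "X amount of time" from the request

        def decide(self, paused_seconds, storage_recovered):
            # Before the timeout, resume as soon as the storage issue is resolved.
            if paused_seconds <= self.max_paused_seconds:
                return "resume" if storage_recovered else "wait"
            return self.action

    # Example: a VM paused for 10 minutes with the storage still broken.
    print(PausedVmPolicy(action="turn_off").decide(600, storage_recovered=False))  # -> turn_off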

Comment 23 Nir Soffer 2017-06-21 19:45:03 UTC
Based on comment 21 and comment 22, I think we can do this:

engine:
- add an option like "Resume VM after I/O errors", defaulting to true,
  keeping the current behavior
- pass the option to vdsm when starting a vm

vdsm:
- when a VM is paused, keep the reason (e.g. ENOSPC) and the drive (e.g. vda);
  the failure should probably be kept in the drive object, so we can handle
  failures of multiple drives.
- when we resume a VM after extending a disk, resume it only if it was paused
  because of ENOSPC on a thin-provisioned drive, or if resuming after I/O errors
  is enabled.
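
A minimal Python sketch of this proposed logic (not actual vdsm code; the class and attribute names are hypothetical):

    # Sketch only: keep the pause reason per drive and resume conditionally
    # after a disk extend. All names are hypothetical.
    class Drive:
        def __init__(self, name, is_thin):
            self.name = name            # e.g. "vda"
            self.is_thin = is_thin      # thin-provisioned drive
            self.pause_reason = None    # e.g. "ENOSPC" or "EIO", set on pause

    class Vm:
        def __init__(self, drives, resume_on_io_error=True):
            self.drives = drives
            # Option passed from the engine when the VM is started
            # ("Resume VM after I/O errors"); True keeps the current behavior.
            self.resume_on_io_error = resume_on_io_error

        def on_pause(self, drive_name, reason):
            # Record the failure on the drive object so failures of multiple
            # drives can be handled independently.
            for drive in self.drives:
                if drive.name == drive_name:
                    drive.pause_reason = reason

        def may_resume_after_extend(self, drive):
            # Resume only if the pause was ENOSPC on a thin-provisioned drive,
            # or if resuming after I/O errors is enabled for this VM.
            if drive.pause_reason == "ENOSPC" and drive.is_thin:
                return True
            return self.resume_on_io_error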

Comment 24 Yaniv Lavi 2017-06-25 14:36:03 UTC
(In reply to Nir Soffer from comment #23)
> Based on comment 21 and comment 22, I think we can do this:
> 
> engine:
> - add an option like "Resume VM after I/O errors", defaulting to true,
>   keeping the current behavior
> - pass the option to vdsm when starting a vm
> 
> vdsm:
> - when a VM is paused, keep the reason (e.g. ENOSPC) and the drive (e.g. vda);
>   the failure should probably be kept in the drive object, so we can handle
>   failures of multiple drives.
> - when we resume a VM after extending a disk, resume it only if it was paused
>   because of ENOSPC on a thin-provisioned drive, or if resuming after I/O
>   errors is enabled.

Please review comment 20.
We might not want to allow the VM to resume, but rather handle it differently, such as restarting or shutting it down automatically.

Comment 25 Michal Skrivanek 2017-09-19 13:50:59 UTC
Based on a discussion with the storage team, I'm moving this under virt.
Revised implementation proposal:
add a "resume behavior" property to the VM with the following behavior:
- auto resume - default, the current behavior for non-HAwL VMs (other than "HA with lease", so both plain HA and no HA).
- leave paused - new behavior for non-HAwL VMs to bypass the autoresume on the vdsm side and just leave the VM in the paused state indefinitely.
- kill - the only option for HAwL VMs, also addressing bug 1467893 and the possible disk corruption scenario there. Kills the VM when it is unpaused if the predefined interval has passed. The interval could be set similar to the sanlock lease expiration interval, 80s; it needs to be more than 0 to tolerate short hiccups and must not be longer than the time it takes the engine to move the VM to Unknown, which would trigger a restart elsewhere (~5 mins). The unpause operation happens either automatically due to the built-in autoresume code in vdsm, or it is checked and the VM is killed during VM recovery in case of a vdsm restart. The "kill" option for non-HAwL VMs is probably not very useful, but we can keep it as a possibility.

vdsm changes are confined to the autoresume code (and recovery); the configuration can be passed via XML metadata, and the time when the VM was paused can be stored there and checked on unpause (see the sketch below).

This should cover both this RFE and the problematic behavior in bug 1467893 _and_ would be feasible for 4.2.
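
A minimal Python sketch of the unpause-time check described above (the metadata keys and helper names are hypothetical, not the actual vdsm implementation):

    # Sketch only: decide what to do when a paused VM is about to be resumed,
    # either by the built-in autoresume code or during VM recovery after a
    # vdsm restart. Metadata keys and constants are hypothetical.
    import time

    KILL_TIMEOUT = 80  # seconds, similar to the sanlock lease expiration interval

    AUTO_RESUME = "auto_resume"
    LEAVE_PAUSED = "leave_paused"
    KILL = "kill"

    def choose_unpause_action(metadata, now=None):
        now = now if now is not None else time.monotonic()
        behavior = metadata.get("resumeBehavior", AUTO_RESUME)
        paused_at = metadata.get("pauseTime")  # stored when the VM was paused

        if behavior == LEAVE_PAUSED:
            return "leave paused"  # bypass autoresume entirely
        if behavior == KILL and paused_at is not None:
            if now - paused_at > KILL_TIMEOUT:
                return "kill"      # the engine may already restart it elsewhere
        return "resume"            # default: the current auto-resume behavior

    # Example: an HAwL VM that has been paused for 2 minutes.
    print(choose_unpause_action(
        {"resumeBehavior": KILL, "pauseTime": time.monotonic() - 120}))  # -> kill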

Comment 26 Nir Soffer 2017-09-19 14:02:17 UTC
(In reply to Yaniv Lavi (Dary) from comment #20)

> The request from users was to be able to control the resume so that if a VM
> was paused for more than X amount of time, you will be able to set it to:
> - Stay paused and let admin handle,

We can do this for IOError, but not for ENOSPC errors, otherwise thin
provisioning will not be possible.

> - Turn off the VM.

We can do this.

> - Clean restart.

Restarting is not possible for an HA VM; the VM may have been started
on another host.

For HA VMs we must kill the VM if it could not be resumed after some
timeout, since the engine may try to move the VM to another host - see comment 25.

> This is true for all three of these options and at its core is moving this
> decision to the management and not let VDSM just decide to resume it. Even
> if X amount of time didn't pass, the engine should probably resume the VM if
> the storage issue was resolved.

Resuming the VM should be done on the Vdsm side, otherwise when the engine is down
VMs may remain paused. So we cannot move pause handling to the management.

What we can do is use a distributed database like etcd to share the state in the
cluster with all the hosts. In this case vdsm can know that a VM moved from
one host to another and can do the right thing.
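
For reference, the pause reason can be distinguished at the libvirt level via the I/O error reason event, where "enospc" marks the thin-provisioning case. A minimal libvirt-python sketch (illustrative only, not the actual vdsm event handling):

    # Sketch: listen for I/O error events and separate ENOSPC from other
    # I/O errors. Requires the libvirt-python bindings.
    import libvirt

    def on_io_error(conn, dom, src_path, dev_alias, action, reason, opaque):
        # libvirt reports the reason as a string; "enospc" indicates a
        # thin-provisioning out-of-space condition, anything else is a
        # plain I/O error that the resume policy would apply to.
        if reason == "enospc":
            print("%s/%s paused on ENOSPC: extend the drive, then resume"
                  % (dom.name(), dev_alias))
        else:
            print("%s/%s paused on I/O error: apply the configured resume policy"
                  % (dom.name(), dev_alias))

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open("qemu:///system")
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON, on_io_error, None)
    while True:
        libvirt.virEventRunDefaultImpl()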

Comment 27 Martin Tessun 2017-09-22 14:40:57 UTC
(In reply to Michal Skrivanek from comment #25)
> based on a discussion with storage team I'm moving this under virt. 
> Revised implementation proposal:
> add "resume behavior" property to VM with following behavior
> - auto resume - default, current behavior for non-HAwL VMs (other than "HA
> with lease", so both plain HA and no HA).
> - leave paused - new behavior for non-HAwL VMs to bypass the autoresume on
> vdsm side and just leave VMs in the paused state indefinitely
> - kill - the only option for HAwL VMs, addressing also bug 1467893 and the
> possible disk corruption scenario there. Kills the VM when it is unpaused if
> the predefined interval passed. The interval could be set similar to sanlock
> lease expiration interval, 80s, needs to be more than 0 to tolerate short
> time hiccups and must not be longer than time it takes engine to move VM to
> Unknown which would trigger a restart elsewhere(~5 mins). The unpause
> operation happens either automatically due to the built-in autoresume code
> in vdsm or it is checked and killed in VM recovery in case of vdsm restart.
> "kill" option for non-HAwL VMs is probably not very useful, but we can keep
> it as a possibility.
> 

Looks good to me. I would also keep the "kill" option for non-HAwL VMs (maybe even triggering a start in case the killed VM is an HA VM).

> vdsm changes are confined to the autoresume code (and recovery), the
> configuration can be passed via xml metadata, as well as the time when the VM
> was paused can be stored there and checked on unpause.
> 
> This should cover both this RFE and the problematic behavior in bug 1467893
> _and_ would be feasible for 4.2

Sounds good to me.

Comment 28 Tomas Jelinek 2017-10-02 06:18:06 UTC
*** Bug 1467893 has been marked as a duplicate of this bug. ***

Comment 29 Yaniv Lavi 2017-10-29 17:01:34 UTC
Looks good to me as well.

Comment 33 Polina 2018-05-02 13:24:17 UTC
Summary for verification on rhv-release-4.2.3-4-001.noarch:

The RFE is verified on Gluster, FC, iSCSI, and NFS storage.

1. On iSCSI and Gluster, the I/O pause was created by adding a DROP rule with the iptables command.
2. On FC, by making a LUN path faulty (e.g. echo "offline" > /sys/block/sdd/device/state).
3. On NFS, I managed to get the I/O error pause by changing the Retransmissions & Timeout parameters for the SD.
Here are the steps:
   1. Put the SD in maintenance (via the Data Center).
   2. Open Storage Domains / Manage Domain / Custom Connection Parameters.
   3. Change the following parameters:
	Retransmissions (#) = 2
	Timeout (deciseconds) = 1 (i.e. 0.1 sec)
   4. Activate the SD. 
   5. Run the VM associated with this SD.

The behavior of NFS VMs has been tested in this setup.

Comment 34 Sandro Bonazzola 2018-05-04 10:45:57 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.