Bug 1317450 (oVirt_turn_off_autoresume_of_paused_VMs)
Summary: | [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time) | ||
---|---|---|---|
Product: | [oVirt] ovirt-engine | Reporter: | Yaniv Lavi <ylavi> |
Component: | RFEs | Assignee: | Milan Zamazal <mzamazal> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Polina <pagranat> |
Severity: | medium | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.6.0 | CC: | acanan, amureini, aperotti, apinnick, bugs, danken, dfediuck, ebenahar, fgarciad, jcoscia, jentrena, kwolf, lpeer, lsurette, michal.skrivanek, mkalinin, mtessun, nsoffer, rbalakri, shipatil, srevivo, tjelinek, tnisan, ylavi |
Target Milestone: | ovirt-4.2.0 | Keywords: | FutureFeature |
Target Release: | 4.2.0 | Flags: | rule-engine: ovirt-4.2+, rule-engine: exception+, pagranat: testing_plan_complete+, ylavi: planning_ack+, michal.skrivanek: devel_ack+, mavital: testing_ack+ |
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: |
Previously, when a virtual machine was paused due to an I/O error, the only resume policy was "Auto Resume", which resumed the virtual machine. "Auto Resume" was problematic because it could interfere with custom HA solutions. In the current release, the "Kill" and "Leave Paused" resume policies have been added. "Leave Paused" was introduced for users who prefer to leave the virtual machine paused because they have their own HA implementation. "Kill" allows virtual machines with a lease to be restarted automatically on another host in the event of an irrecoverable outage.
The speed of I/O error reporting depends on the underlying storage protocol. On FC storage, I/O errors are generally detected quickly, while on NFS mounts with typical default settings, they may not be detected for several minutes.
|
Story Points: | --- |
Clone Of: | rhv_turn_off_autoresume_of_paused_VMs | Environment: | |
Last Closed: | 2018-05-04 10:45:57 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1142762 | ||
Bug Blocks: | 1230788, 1417161 |
Description
Yaniv Lavi
2016-03-14 09:41:11 UTC
The correct way to represent this property (regardless of how it's later displayed to the user, UX-wise) is per vm-disk relationship. Thus, the refactoring described in bug 1142762 should be done first. Removing the devel-ack+ until that's done; then we should re-evaluate according to the timeframes.

Seeing as we don't have a sensible way to differentiate between EIO and ENOSPC at the moment, implementing something like this would be very risky wrt thin provisioning. Let's push it out to 4.1 and do it properly.

(In reply to Allon Mureinik from comment #5)
> Seeing as we don't have a sensible way to differentiate between EIO and
> ENOSPC at the moment, implementing something like this would be very risky
> wrt thin provisioning.

Is there a bug open on the ability to differentiate between the two?

The request from users was to be able to control the resume, so that if a VM was paused for more than X amount of time, you will be able to set it to:
- Stay paused and let the admin handle it.
- Turn off the VM.
- Clean restart.
This is true for all three of these options, and at its core it moves this decision to the management instead of letting VDSM just decide to resume the VM. Even if X amount of time didn't pass, the engine should probably resume the VM if the storage issue was resolved.

Based on comment 21 and comment 22, I think we can do this:

engine:
- add an option like "Resume VM after I/O errors", defaulting to true, keeping the current behavior
- pass the option to vdsm when starting a vm

vdsm:
- when a vm is paused, keep the reason (e.g. ENOSPC, and drive e.g. vda); the failure should probably be kept in the drive object, so we can handle failures of multiple drives.
- when we resume a vm after extending a disk, resume it only if it was paused because of ENOSPC on a thin provisioned drive, or if resume after I/O error is enabled.
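[Editorial note: to make the proposal in comment 23 concrete, here is a minimal sketch of the vdsm-side bookkeeping it describes. All names (PauseReason, Drive.pause_reason, should_resume_after_extend, resume_on_io_error) are hypothetical illustrations, not actual vdsm code.]

```python
# Illustrative sketch only -- not actual vdsm code. It models the proposal in
# comment 23: remember why each drive paused the VM, and after extending a
# disk auto-resume only when that is safe.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PauseReason(Enum):
    ENOSPC = "enospc"   # thin-provisioned drive ran out of allocated space
    EIO = "eio"         # genuine I/O error reported by the storage layer


@dataclass
class Drive:
    name: str                                    # e.g. "vda"
    is_thin_provisioned: bool
    pause_reason: Optional[PauseReason] = None   # set when this drive paused the VM


def should_resume_after_extend(drive: Drive, resume_on_io_error: bool) -> bool:
    """Decide whether the VM may be auto-resumed after extending `drive`.

    `resume_on_io_error` stands for the hypothetical engine option
    "Resume VM after I/O errors" passed to vdsm when the VM is started.
    """
    if drive.pause_reason is PauseReason.ENOSPC and drive.is_thin_provisioned:
        # Extending the volume removed the cause of the pause; always resume.
        return True
    # For real I/O errors, resume only if the policy allows it.
    return resume_on_io_error


if __name__ == "__main__":
    vda = Drive("vda", is_thin_provisioned=True, pause_reason=PauseReason.ENOSPC)
    vdb = Drive("vdb", is_thin_provisioned=False, pause_reason=PauseReason.EIO)
    print(should_resume_after_extend(vda, resume_on_io_error=False))  # True
    print(should_resume_after_extend(vdb, resume_on_io_error=False))  # False
```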
(In reply to Nir Soffer from comment #23)
> Based on comment 21 and comment 22, I think we can do this:
>
> engine:
> - add an option like "Resume VM after I/O errors", defaulting to true,
>   keeping the current behavior
> - pass the option to vdsm when starting a vm
>
> vdsm:
> - when a vm is paused, keep the reason (e.g. ENOSPC, and drive e.g. vda);
>   the failure should probably be kept in the drive object, so we can handle
>   failures of multiple drives.
> - when we resume a vm after extending a disk, resume it only if it was paused
>   because of ENOSPC on a thin provisioned drive, or if resume after I/O error
>   is enabled.

Please review comment 20. We might not want to allow the VM to resume, but take a different action, such as restarting it or shutting it down automatically.

Based on a discussion with the storage team I'm moving this under virt.

Revised implementation proposal: add a "resume behavior" property to the VM with the following values:
- auto resume - default, the current behavior for non-HAwL VMs (other than "HA with lease", so both plain HA and no HA).
- leave paused - new behavior for non-HAwL VMs, to bypass the autoresume on the vdsm side and just leave the VM in the paused state indefinitely.
- kill - the only option for HAwL VMs, also addressing bug 1467893 and the possible disk corruption scenario there. Kills the VM when it is unpaused if the predefined interval has passed. The interval could be set similarly to the sanlock lease expiration interval, 80s; it needs to be more than 0 to tolerate short hiccups and must not be longer than the time it takes the engine to move the VM to Unknown, which would trigger a restart elsewhere (~5 mins).

The unpause operation happens either automatically, due to the built-in autoresume code in vdsm, or it is checked and the VM killed during VM recovery in case of a vdsm restart. The "kill" option for non-HAwL VMs is probably not very useful, but we can keep it as a possibility.

vdsm changes are confined to the autoresume code (and recovery); the configuration can be passed via XML metadata, and the time when the VM was paused can be stored there as well and checked on unpause.

This should cover both this RFE and the problematic behavior in bug 1467893 _and_ would be feasible for 4.2.

(In reply to Yaniv Lavi (Dary) from comment #20)
> The request from users was to be able to control the resume so that if a VM
> was paused for more than X amount of time, you will be able to set it to:
> - Stay paused and let admin handle,

We can do this for IOError, but not for ENOSPC errors, otherwise thin provisioning will not be possible.

> - Turn off the VM.

We can do this.

> - Clean restart.

Restarting is not possible for an HA VM; the VM may have been started on another host. For HA VMs we must kill the VM if it could not be resumed after some timeout, since the engine may try to move the VM to another host - see comment 25.

> This is true for all three of these options and at its core is moving this
> decision to the management and not let VDSM just decide to resume it. Even
> if X amount of time didn't pass, the engine should probably resume the VM if
> the storage issue was resolved.

Resuming the VM should be done on the Vdsm side; otherwise, when the engine is down, paused VMs would stay paused. So we cannot move pause handling to the management. What we could do is use a distributed database like etcd to share the state in the cluster with all the hosts. In that case vdsm could know that a VM was moved from one host to another and do the right thing.

(In reply to Michal Skrivanek from comment #25)
> based on a discussion with storage team I'm moving this under virt.
> Revised implementation proposal:
> add "resume behavior" property to VM with following behavior
> - auto resume - default, current behavior for non-HAwL VMs (other than "HA
>   with lease", so both plain HA and no HA).
> - leave paused - new behavior for non-HAwL VMs to bypass the autoresume on
>   vdsm side and just leave VMs in the paused state indefinitely
> - kill - the only option for HAwL VMs, addressing also bug 1467893 and the
>   possible disk corruption scenario there. Kills the VM when it is unpaused if
>   the predefined interval passed. The interval could be set similar to sanlock
>   lease expiration interval, 80s, needs to be more than 0 to tolerate short
>   time hiccups and must not be longer than time it takes engine to move VM to
>   Unknown which would trigger a restart elsewhere (~5 mins). The unpause
>   operation happens either automatically due to the built-in autoresume code
>   in vdsm or it is checked and killed in VM recovery in case of vdsm restart.
>   "kill" option for non-HAwL VMs is probably not very useful, but we can keep
>   it as a possibility.

Looks good to me. I would also keep the "kill" option for non-HAwL VMs (maybe even triggering a start in case the killed VM is an HA VM).

> vdsm changes are confined to the autoresume code (and recovery), the
> configuration can be passed via xml metadata, as well as the time when the VM
> was paused can be stored there and checked on unpause.
>
> This should cover both this RFE and the problematic behavior in bug 1467893
> _and_ would be feasible for 4.2

Sounds good to me.

*** Bug 1467893 has been marked as a duplicate of this bug. ***
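[Editorial note: a minimal sketch of the timeout check behind the "kill" policy proposed in comment 25, as it would run in vdsm's autoresume/recovery path. The names (ResumeBehavior, action_on_unpause, PAUSE_TIMEOUT_SECONDS) are hypothetical illustrations; per the proposal, the real behavior and the pause timestamp would be carried in the VM's XML metadata.]

```python
# Illustrative sketch only -- not the actual vdsm implementation. It models the
# "kill" policy from comment 25: keep the resume behavior and the pause
# timestamp with the VM, and on unpause/recovery kill the VM instead of
# resuming it once the grace interval has expired.
import time
from enum import Enum
from typing import Optional

# Grace interval discussed in comment 25: similar to the sanlock lease
# expiration interval (80s), and well below the ~5 min after which the engine
# would declare the VM Unknown and possibly restart it elsewhere.
PAUSE_TIMEOUT_SECONDS = 80


class ResumeBehavior(Enum):
    AUTO_RESUME = "auto_resume"
    LEAVE_PAUSED = "leave_paused"
    KILL = "kill"


def action_on_unpause(behavior: ResumeBehavior, paused_at: float,
                      now: Optional[float] = None) -> str:
    """Return the action to take when vdsm would normally auto-resume the VM."""
    now = time.monotonic() if now is None else now
    if behavior is ResumeBehavior.LEAVE_PAUSED:
        return "leave-paused"   # never auto-resume; the admin handles it
    if behavior is ResumeBehavior.KILL and now - paused_at > PAUSE_TIMEOUT_SECONDS:
        return "kill"           # HA-with-lease VM: let the engine restart it elsewhere
    return "resume"             # within the grace interval, or auto-resume policy


if __name__ == "__main__":
    start = time.monotonic()
    print(action_on_unpause(ResumeBehavior.KILL, paused_at=start, now=start + 10))   # resume
    print(action_on_unpause(ResumeBehavior.KILL, paused_at=start, now=start + 120))  # kill
    print(action_on_unpause(ResumeBehavior.LEAVE_PAUSED, paused_at=start))           # leave-paused
```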
Looks good to me as well.

Summary for verification on rhv-release-4.2.3-4-001.noarch:

The RFE is verified on Gluster, FC, iSCSI and NFS storage.
1. On iSCSI and Gluster, the I/O error pause was created by dropping traffic to the storage with an iptables rule (a sketch of such a rule appears at the end of this report).
2. On FC, by making a LUN path faulty (e.g. echo "offline" > /sys/block/sdd/device/state).
3. On NFS, I succeeded in getting an I/O error pause by changing the Retransmissions & Timeout parameters of the storage domain. Here are the steps:
   1. Put the SD in maintenance (via the Data Center).
   2. Open Storage Domains / Manage Domain / Custom Connection Parameters.
   3. Change the following parameters: Retransmissions (#) = 2, Timeout (deciseconds) = 1.
   4. Activate the SD.
   5. Run the VM associated with this SD.
The behavior of NFS VMs has been tested in this setup.

This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in the oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
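[Editorial note: for step 1 of the verification summary above, a helper like the following could be used to induce and clear the I/O error pause by blocking traffic to the storage server with iptables. This is only an illustration of the kind of rule meant there; the storage address is a placeholder and the exact chain/ports depend on the environment. Run as root on the host running the VM.]

```python
# Illustration of verification step 1: block/unblock traffic to the storage
# server so that guest I/O starts failing and QEMU pauses the VM.
# STORAGE_ADDR is a placeholder (RFC 5737 documentation address).
import subprocess

STORAGE_ADDR = "192.0.2.10"  # placeholder: iSCSI portal / Gluster server address


def set_storage_blocked(blocked: bool) -> None:
    """Add (-A) or delete (-D) an iptables rule dropping traffic to the storage."""
    action = "-A" if blocked else "-D"
    subprocess.run(
        ["iptables", action, "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP"],
        check=True,
    )


if __name__ == "__main__":
    set_storage_blocked(True)    # the VM should eventually pause with an I/O error
    input("Press Enter to restore storage connectivity...")
    set_storage_blocked(False)   # observe the configured resume policy taking effect
```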