Bug 1230788 (rhv_turn_off_autoresume_of_paused_VMs)
Summary: [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time)

Product: Red Hat Enterprise Virtualization Manager
Reporter: Julio Entrena Perez <jentrena>
Component: ovirt-engine
Assignee: Michal Skrivanek <michal.skrivanek>
Status: CLOSED ERRATA
QA Contact: Polina <pagranat>
Severity: high
Docs Contact:
Priority: urgent
Version: 3.4.5
CC: acanan, ahoness, aperotti, danken, dcadzow, dfediuck, eheftman, fromani, gveitmic, jentrena, jhardy, lbopf, lpeer, lsurette, mavital, mgoldboi, michal.skrivanek, mkalinin, nsoffer, oscardalmauroig, pagranat, ratamir, rbalakri, Rhev-m-bugs, rmcswain, srevivo, tjelinek, tnisan, ykaul, ylavi
Target Milestone: ovirt-4.2.0
Keywords: FutureFeature
Target Release: ---
Flags: mavital: needinfo+, mavital: testing_plan_complete+
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature:
Previously, if a VM was paused due to an I/O error, there was no way to configure what should happen once the storage issue was fixed. The only behavior was "auto resume", which resumed the VM. This feature adds two more options, configurable per VM: "Kill" and "Leave Paused".
Reason:
"Auto resume" combined with an HA VM that uses a VM lease can lead to split brain. It can also interfere with custom HA solutions.
Result:
The user can now configure one of three resume policies per VM:
- auto resume (previously the only behavior)
- leave paused
- kill
Story Points: ---
Clone Of:
Clones: oVirt_turn_off_autoresume_of_paused_VMs (view as bug list)
Environment:
Last Closed: 2018-05-15 17:36:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1317450, 1481022
Bug Blocks: 1541529, 1386444, 1417161, 1460513, 1545980
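As a rough illustration of what the three resume policies in the Doc Text mean for a VM paused on an I/O error, here is a minimal shell sketch; the function and action strings are hypothetical, not the engine's actual code or identifiers:

```shell
# Hypothetical sketch: decide what happens to a VM that is paused on an
# I/O error once its storage becomes healthy again, per resume policy.
resume_action() {  # $1 = resume policy configured on the VM
  case "$1" in
    auto_resume)  echo "resume the VM" ;;        # the pre-RFE default behavior
    leave_paused) echo "leave the VM paused" ;;  # admin decides what to do
    kill)         echo "power the VM off" ;;     # lets HA logic restart it safely
    *)            echo "unknown policy" >&2; return 1 ;;
  esac
}

resume_action auto_resume
resume_action kill
```

The "kill" option matters for HA VMs with leases: powering the VM off on the original host removes the risk of it resuming after the HA engine has already restarted it elsewhere.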
Description
Julio Entrena Perez, 2015-06-11 14:26:21 UTC
We should consider hosted engine for this RFE, as a VM which will need to be resumed regardless of the config, or make the configuration on SD level, which means the HE SD will not be using it.

Is time sync the problem here? If so, we can add a guest agent verb to explicitly sync time after resume. If there are more/other issues, then we can extend the existing error_policy/propagateErrors parameter.

*** Bug 1206317 has been marked as a duplicate of this bug. ***

Note the special case of HA VMs discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1467893#c33; see upstream bug 1317450 for more details.

Could you please add a feature page? Thank you.

Design as per https://bugzilla.redhat.com/show_bug.cgi?id=1317450#c25

The bot doesn't seem to work; this is already being tested.

*** Bug 1386444 has been marked as a duplicate of this bug. ***

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [No relevant external trackers attached] For more info please contact: rhv-devops

Added depends on 1481022. The RFE could not be verified for all kinds of storage because of a precondition problem: no I/O error VM pause occurs when blocking NFS/gluster storage. Tested only for iSCSI.

Hi Michal,
Is this ready to be documented? I can see that only iSCSI was tested. Also, is there a feature page? Thanks!

(In reply to Emma Heftman from comment #42)
> Hi Michal
> Is this ready to be documented? I can see that only iscsi was tested.
> Also is there a feature page?
> Thanks!

Well, this RFE is complete, but it may make sense to take bug 1540548 into account too; a comprehensive description of HA VMs is currently on review in https://github.com/oVirt/ovirt-site/pull/1530

(In reply to Polina from comment #40)
> added depends on 1481022. The RFE could not be verified for all kinds of
> storages because of precondition problem - no I/O Error VM Pause when
> blocking NFS/gluster storage. Tested only for iscsi

Polina, how about FC? The main customer behind this RFE is using FC storage and we would like to make sure the solution works right for them.

(In reply to Marina from comment #44)
> Polina, how about FC?
> The main customer behind this RFE is using FC storage and we would like to
> make sure the solution works right for them.

Hi Marina, this feature was not tested with FC. I'll try to get an environment with FC storage today and test this. Will update you asap.

Hi Marina,
The feature was tested successfully on a Fibre Channel storage domain, on the latest build: rhv-release-4.2.1-3-001.noarch & RHEL 7.5.

Just to summarize: the feature was successfully tested on two kinds of storage: iSCSI and Fibre Channel. On NFS and Gluster SDs there is a problem with the setup (precondition) for the tests: the VM is not paused due to I/O error while the nfs/gluster storage is blocked. The problem is described in detail in https://bugzilla.redhat.com/show_bug.cgi?id=1481022.

For rhvm-4.2.3-0.1.el7.noarch, libvirt-3.9.0-14.el7_5.2.x86_64: the feature is verified for Gluster storage. For NFS, please see https://bugzilla.redhat.com/show_bug.cgi?id=1481022#c58

Summary for verification on rhv-release-4.2.3-4-001.noarch: the bug is verified on Gluster, FC, iSCSI, and NFS storage.
1. On iSCSI and Gluster, the I/O error pause was created by adding a DROP rule with an iptables command.
2. On FC, by making a LUN path faulty (e.g. echo "offline" > /sys/block/sdd/device/state).
3. On NFS, the I/O error pause was created by changing the /etc/exports file on the NFS server while a write was in progress on the VM.
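The three fault-injection methods in the verification summary can be sketched as shell commands. The storage server address (192.0.2.10), block device (sdd), and export path below are placeholders, and the exact commands used during verification may have differed; all of these require root on the host or NFS server and will disrupt storage traffic:

```shell
# 1. iSCSI / Gluster: block traffic to the storage server so guest writes
#    eventually fail with EIO (192.0.2.10 is a placeholder address).
iptables -A OUTPUT -d 192.0.2.10 -j DROP
# ...and restore the path afterwards:
iptables -D OUTPUT -d 192.0.2.10 -j DROP

# 2. FC: mark a LUN path faulty (sdd is a placeholder device name).
echo offline > /sys/block/sdd/device/state
# ...bring the path back:
echo running > /sys/block/sdd/device/state

# 3. NFS: comment out the export on the NFS server while the guest is
#    writing, then re-read /etc/exports (the path is a placeholder).
sed -i 's|^/exports/rhv_data|#&|' /etc/exports
exportfs -ra
```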
Given the limitations we have with NFS, this looks good enough. It would still be great if you can reduce the timeout parameters for NFS mounts so we can check I/O error reporting before the host gets fenced, but I think that's tracked in another related bug.

For NFS I succeeded in getting an I/O error pause by changing the Retransmissions & Timeout parameters for the SD. Here are the steps:
1. Put the SD in maintenance (via Data Center).
2. Open Storage Domains / Manage Domain / Custom Connection Parameters.
3. Change the following parameters: Retransmissions (#) = 2, Timeout (deciseconds) = 1
4. Activate the SD.
5. Run the VM associated with this SD.
The behavior of NFS VMs has been tested in this setup, so I can verify. Please confirm.

That's good enough, but it needs to be noted in documentation.

Verified on rhv-release-4.2.3-4-001.noarch (see comments 51-54).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488
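The Retransmissions and Timeout fields in Custom Connection Parameters correspond to the standard NFS mount options retrans and timeo. As a rough sketch of the mount the host ends up performing under these settings (server name and paths are placeholders, and the soft option is an assumption about how the storage domain is mounted):

```shell
# timeo is in deciseconds, so timeo=1 waits ~0.1 s before each retry; with
# retrans=2 a soft mount gives up after the retries and returns EIO to the
# guest instead of hanging until the host is fenced.
mount -t nfs -o soft,retrans=2,timeo=1 \
    nfs-server.example.com:/exports/rhv_data /rhev/data-center/mnt/nfs-test
```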