Bug 1230788 (rhv_turn_off_autoresume_of_paused_VMs) - [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time)
Keywords:
Status: CLOSED ERRATA
Alias: rhv_turn_off_autoresume_of_paused_VMs
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.4.5
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Michal Skrivanek
QA Contact: Polina
URL:
Whiteboard:
Duplicates: 1206317 1386444 (view as bug list)
Depends On: oVirt_turn_off_autoresume_of_paused_VMs 1481022
Blocks: 1417161 1541529 1386444 1460513 1545980
TreeView+ depends on / blocked
 
Reported: 2015-06-11 14:26 UTC by Julio Entrena Perez
Modified: 2021-09-09 11:39 UTC (History)
30 users (show)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Previously, if a VM was paused due to an I/O error, there was no way to configure what should happen once the storage was fixed. The only behavior was "auto resume", which resumed the VM. This feature adds two more options, configurable per VM: "Kill" and "Leave Paused". Reason: "auto resume" combined with a highly available VM using a VM lease could lead to a split brain; it can also interfere with custom HA solutions. Result: The user can now configure three resume policies per VM: auto resume (previously the only behavior), leave paused, and kill.
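As a sketch of how the per-VM policy from the Doc Text could be set programmatically (assuming the oVirt 4.2 REST API property name storage_error_resume_behaviour; the engine URL, VM id, and credentials below are placeholders):

```shell
# Hedged sketch: set the per-VM resume policy over the oVirt REST API.
# "storage_error_resume_behaviour" is assumed to accept auto_resume,
# leave_paused, or kill; engine URL, VM id, and credentials are placeholders.
curl -s -k -u 'admin@internal:password' \
  -H 'Content-Type: application/xml' \
  -X PUT 'https://engine.example.com/ovirt-engine/api/vms/123' \
  -d '<vm><storage_error_resume_behaviour>leave_paused</storage_error_resume_behaviour></vm>'
```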
Clone Of:
Clones: oVirt_turn_off_autoresume_of_paused_VMs (view as bug list)
Environment:
Last Closed: 2018-05-15 17:36:24 UTC
oVirt Team: Virt
Target Upstream Version:
mavital: needinfo+
mavital: testing_plan_complete+




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1386444 0 high CLOSED [RFE] Introduce HA timeout for VMs in Paused state due to Unreachable Storage 2021-06-10 11:42:07 UTC
Red Hat Bugzilla 1467893 0 urgent CLOSED VM lease missing after host loses storage connection 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1540548 0 high CLOSED [RFE] Automatically restart HA VMs paused due to I/O Error 2021-09-09 13:09:55 UTC
Red Hat Knowledge Base (Solution) 2128511 0 None None None 2016-01-18 15:32:50 UTC
Red Hat Knowledge Base (Solution) 2749481 0 None None None 2017-12-04 14:25:50 UTC
Red Hat Product Errata RHEA-2018:1488 0 None None None 2018-05-15 17:38:04 UTC

Internal Links: 1386444 1467893 1540548

Description Julio Entrena Perez 2015-06-11 14:26:21 UTC
> 1. Proposed title of this feature request  

Make automatic resume of VMs paused due to I/O error configurable.
      
> 3. What is the nature and description of the request?

Customer needs to be able to instruct RHEV that VMs paused as a result of an I/O error should not be resumed automatically once the storage domain recovers.

According to bug 1036358 VMs paused as a result of a problem in the storage domain should be resumed automatically once the problem is resolved.
      
> 4. Why does the customer need this? (List the business requirements here)  

If VMs are resumed automatically (in an uncontrolled way) when the error condition in the storage domain is resolved, this will cause unexpected and/or undesired effects in their application.
For example, resumed VMs don't have their clock in sync after they resume, which would cause significant issues for the customer's application.

Customer needs to be able to configure RHEV not to automatically resume VMs that paused as a result of problems with the storage.

Comment 3 Doron Fediuck 2015-06-14 14:00:02 UTC
We should consider Hosted Engine for this RFE, as its VM will need to be resumed regardless of the configuration; alternatively we could make the configuration at the SD level, which means the HE SD would simply not use it.

Comment 4 Michal Skrivanek 2015-06-15 10:02:31 UTC
Is time sync the problem here? If so, we can add a guest agent verb to explicitly sync time after resume.

If there are more/other issues, we can extend the existing error_policy/propagateErrors parameter.

Comment 19 Yaniv Lavi 2017-02-13 22:59:21 UTC
*** Bug 1206317 has been marked as a duplicate of this bug. ***

Comment 22 Michal Skrivanek 2017-08-02 06:13:44 UTC
note the special case of HA VMs discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1467893#c33

Comment 26 Michal Skrivanek 2017-09-19 13:56:38 UTC
see upstream bug 1317450 for more details

Comment 27 Polina 2017-10-03 10:43:50 UTC
could you please add feature page?

thank you

Comment 28 Michal Skrivanek 2017-10-16 06:41:40 UTC
design as per https://bugzilla.redhat.com/show_bug.cgi?id=1317450#c25

Comment 31 Michal Skrivanek 2017-11-24 10:17:47 UTC
bot doesn't seem to work, this is already being tested

Comment 35 Doron Fediuck 2017-12-04 14:25:51 UTC
*** Bug 1386444 has been marked as a duplicate of this bug. ***

Comment 38 RHV bug bot 2017-12-06 16:17:48 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops@redhat.com

Comment 39 RHV bug bot 2017-12-12 21:16:17 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops@redhat.com

Comment 40 Polina 2018-01-16 13:41:44 UTC
Added "depends on" 1481022. The RFE could not be verified for all kinds of storage because of a precondition problem: no I/O error VM pause occurs when blocking NFS/Gluster storage. Tested only for iSCSI.

Comment 42 Emma Heftman 2018-02-20 12:31:21 UTC
Hi Michal
Is this ready to be documented? I can see that only iscsi was tested.
Also is there a feature page?
Thanks!

Comment 43 Michal Skrivanek 2018-02-21 09:02:36 UTC
(In reply to Emma Heftman from comment #42)
> Hi Michal
> Is this ready to be documented? I can see that only iscsi was tested.
> Also is there a feature page?
> Thanks!

Well, this RFE is complete, but it may make sense to take bug 1540548 into account too, and a comprehensive description of HA VMs currently on review in https://github.com/oVirt/ovirt-site/pull/1530

Comment 44 Marina Kalinin 2018-02-21 14:26:36 UTC
(In reply to Polina from comment #40)
> added depends on 1481022. The RFE could not be verified for all kinds of
> storages because of precondition problem - no I/O Error VM Pause when
> blocking NFS/gluster storage. Tested only for iscsi

Polina, how about FC?
The main customer behind this RFE is using FC storage and we would like to make sure the solution works right for them.

Comment 45 Polina 2018-02-22 07:43:00 UTC
(In reply to Marina from comment #44)
> (In reply to Polina from comment #40)
> > added depends on 1481022. The RFE could not be verified for all kinds of
> > storages because of precondition problem - no I/O Error VM Pause when
> > blocking NFS/gluster storage. Tested only for iscsi
> 
> Polina, how about FC?
> The main customer behind this RFE is using FC storage and we would like to
> make sure the solution works right for them.

Hi Marina, this feature was not tested with FC. I'll try to get an environment with FC storage today and test it. Will update you ASAP.

Comment 46 Polina 2018-02-26 14:03:56 UTC
Hi Marina,

The feature was tested successfully on a Fibre Channel storage domain, on the latest build:
rhv-release-4.2.1-3-001.noarch & RHEL 7.5

Comment 47 Polina 2018-02-26 14:50:09 UTC
Just to summarize:
The feature was successfully tested on two kinds of storage: iSCSI and Fibre Channel.
On NFS and Gluster SDs there is a problem with the test setup (precondition):
the VM is not paused due to an I/O error while the NFS/Gluster storage is blocked. The problem is described in detail in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1481022.

Comment 48 Polina 2018-04-15 16:24:24 UTC
for rhvm-4.2.3-0.1.el7.noarch, libvirt-3.9.0-14.el7_5.2.x86_64:

The feature is verified for Gluster storage.

NFS - please see https://bugzilla.redhat.com/show_bug.cgi?id=1481022#c58

Comment 51 Polina 2018-04-30 07:44:36 UTC
Summary for verification on rhv-release-4.2.3-4-001.noarch:

The bug is verified on Gluster, FC, iSCSI, and NFS storage.

1. On iSCSI and Gluster, the I/O error pause was created by dropping storage traffic with an iptables rule.
2. On FC, by making a LUN path faulty (e.g. echo "offline" > /sys/block/sdd/device/state).
3. On NFS, the I/O error pause was created by changing the /etc/exports file on the NFS server while the VM was writing.
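The three fault-injection techniques above can be sketched as shell commands (the storage server IP, block device, and export path are placeholders; run these only on a disposable test host):

```shell
# 1. iSCSI / Gluster: drop outgoing traffic to the storage server (placeholder IP).
iptables -A OUTPUT -d 10.0.0.50 -j DROP

# 2. FC: force one LUN path offline (placeholder block device).
echo offline > /sys/block/sdd/device/state

# 3. NFS: comment out the export while the guest is writing, then reload exports.
sed -i 's|^/exports/data .*|# &|' /etc/exports
exportfs -ra
```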

Comment 52 Michal Skrivanek 2018-04-30 11:53:41 UTC
Given the limitations we have with NFS this looks good enough. It would still be great if you could reduce the timeout parameters for NFS mounts so we can check I/O error reporting before the host gets fenced, but I think that's tracked in another related bug.

Comment 53 Polina 2018-05-01 09:17:19 UTC
For NFS I managed to get the VM to pause on an I/O error by changing the Retransmissions & Timeout parameters for the SD.
Here are the steps:
   1. Put the SD into maintenance (via Data Center).
   2. Open Storage Domains / Manage Domain / Custom Connection Parameters.
   3. Change the following parameters:
	Retransmissions (#) = 2
	Timeout (deciseconds) = 1 (i.e. 0.1 s)
   4. Activate the SD.
   5. Run the VM associated with this SD.

The behavior of NFS VMs has been tested in this setup.
So, I can verify; please confirm.
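For reference, the two UI parameters correspond to the standard NFS mount options retrans and timeo; a hypothetical mount line (server name, export path, and mount point are placeholders) would look like:

```shell
# retrans=2: retry each request twice; timeo=1: 1 decisecond (0.1 s) per try,
# so I/O errors surface quickly instead of hanging. Paths are placeholders.
mount -t nfs -o retrans=2,timeo=1 nfs-server.example.com:/exports/data /mnt/sd
```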

Comment 54 Michal Skrivanek 2018-05-02 11:18:37 UTC
that's good enough, but needs to be noted in documentation

Comment 55 Polina 2018-05-02 12:05:51 UTC
Verified on rhv-release-4.2.3-4-001.noarch (see comments 51-54).

Comment 58 errata-xmlrpc 2018-05-15 17:36:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 59 Franta Kust 2019-05-16 13:09:08 UTC
BZ<2>Jira Resync

