Bug 912723 - [RFE] Need to improve the retry mechanism of highly available VMs
Summary: [RFE] Need to improve the retry mechanism of highly available VMs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: RFEs
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-4.4.1
Target Release: ---
Assignee: Andrej Krejcir
QA Contact: Polina
URL:
Whiteboard:
Duplicates: 1653389
Depends On:
Blocks: 1670339
 
Reported: 2013-02-19 14:01 UTC by Barak Dagan
Modified: 2023-10-06 17:26 UTC (History)
CC: 17 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-08 08:27:17 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
mtessun: planning_ack+
pm-rhel: devel_ack+
pm-rhel: testing_ack+




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44186 0 None None None 2021-12-10 14:27:56 UTC
oVirt gerrit 102785 0 'None' MERGED core: Don't stop trying to start HA VMs 2021-02-01 22:30:42 UTC

Description Barak Dagan 2013-02-19 14:01:20 UTC
Description of problem:
When a highly available VM goes down due to an environmental issue (such as storage maintenance), there is only one attempt to restart the VM, which will probably fail. I suggest that the backend retry starting the VM periodically.
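
A minimal sketch of what the suggested periodic retry could look like, using a plain Java scheduler. The class and method names (HaVmRestarter, tryStartVm) are hypothetical; this is not the actual engine code:

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  // Sketch: instead of a single restart attempt, keep retrying a highly
  // available VM on a fixed schedule until it comes up.
  public class HaVmRestarter {

      private final ScheduledExecutorService scheduler =
              Executors.newSingleThreadScheduledExecutor();

      // Placeholder for the real backend call; returns true if the VM started.
      private boolean tryStartVm(String vmId) {
          System.out.println("Attempting to start VM " + vmId);
          return false;
      }

      public void restartPeriodically(String vmId, long intervalSeconds) {
          scheduler.scheduleWithFixedDelay(() -> {
              if (tryStartVm(vmId)) {
                  scheduler.shutdown(); // stop retrying once the VM is up
              }
          }, 0, intervalSeconds, TimeUnit.SECONDS);
      }

      public static void main(String[] args) {
          new HaVmRestarter().restartPeriodically("my-ha-vm", 30);
      }
  }

The interval and the stop condition would of course need the refinement discussed in the comments below.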

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1.
2.
3.
  
Actual results:
	
2013-Feb-19, VM ... was restarted on Host ...
2013-Feb-19, Failed to run VM ... on Host ... .
2013-Feb-19, VM ... is down. Exit message: Bad volume specification {'index': '0', 'domainID': 'd5d0dfe6-c2b9-429e-96c2-c1cd247ac828', 'reqsize': '0', 'format': 'raw', 'boot': 'true', 'volumeID': '2ca7c670-13b0-4a9b-8a9b-e37d0ba7e1ad', 'apparentsize': '32212254720', 'imageID': '6e94cfbd-4510-4403-896e-029bf97816bd', 'readonly': False, 'iface': 'virtio', 'truesize': '32212254720', 'poolID': '132859ec-ef83-4c01-b411-0cea6d3e1ed6', 'device': 'disk', 'shared': False, 'propagateErrors': 'off', 'type': 'disk', 'if': 'virtio'}.
2013-Feb-19, VM ... was started by ... (Host: ...).
	
Expected results:


Additional info:

Comment 1 Doron Fediuck 2013-02-24 14:07:09 UTC
Barak,
we need some more info here,
i.e., how do we know when this is a temporary issue and when it isn't?
For example, if you have 30 HA VMs and storage goes down, you will get a lot of noise, and initiating restart operations may end up marking hosts as problematic (even when the issue is with the storage or the network). So this needs to be refined to the cases we'd like to handle.
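
One way to read this concern: any retry mechanism probably needs a cap on the number of attempts and a backoff between them, so that a cluster-wide storage outage does not turn into a restart storm. A rough sketch with made-up names (BoundedRetryPolicy), assuming exponential backoff and a fixed attempt limit:

  // Sketch: cap retries and back off between attempts so repeated failures
  // do not flood the cluster with restart operations.
  public class BoundedRetryPolicy {

      private final int maxAttempts;
      private final long baseDelayMillis;

      public BoundedRetryPolicy(int maxAttempts, long baseDelayMillis) {
          this.maxAttempts = maxAttempts;
          this.baseDelayMillis = baseDelayMillis;
      }

      // Delay before the given attempt (1-based), doubling each time.
      public long delayBeforeAttempt(int attempt) {
          return baseDelayMillis * (1L << Math.min(attempt - 1, 10));
      }

      public boolean shouldRetry(int attemptsSoFar) {
          return attemptsSoFar < maxAttempts;
      }

      public static void main(String[] args) {
          BoundedRetryPolicy policy = new BoundedRetryPolicy(5, 10_000);
          for (int attempt = 1; policy.shouldRetry(attempt - 1); attempt++) {
              System.out.printf("attempt %d after %d ms%n",
                      attempt, policy.delayBeforeAttempt(attempt));
          }
      }
  }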

Comment 2 Barak Dagan 2013-02-24 15:07:14 UTC
Doron, 

I see what you mean, but a single try defeats the point of high availability.
If the VMs are down for a long period, the administrator will be able to tell whether the issue is with a host or with something else, and will need to restart the system.

Today the admin needs to restart the VM anyway; my suggestion reduces how often human intervention is needed.

This is a general idea; I'm sure you guys can work out the details.

Simon, can you add your thoughts on the subject?

Comment 3 Simon Grinberg 2013-02-25 16:47:44 UTC
(In reply to comment #2)

> Simon, can you add your thoughts about the subject ?

I agree we should improve the mechanism to something better than a single retry and then stopping. On the other hand, Doron is right: we had too many issues with the host recovery mechanism when it retried periodically without checking that the underlying issue had been resolved.

I'm moving this to the future milestone as a placeholder for discussion of this issue, but we need to think carefully about any mechanism we select.
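
As an illustration of retrying only after the underlying problem looks resolved (rather than blindly on a timer), a small sketch follows; the precondition and startVm suppliers are placeholders, not engine APIs:

  import java.util.function.BooleanSupplier;

  // Sketch: run the start action only when a health precondition (e.g.
  // "the storage domain is active again") holds, up to maxAttempts times.
  public class GuardedRetry {

      public static boolean retryWhenHealthy(BooleanSupplier precondition,
                                             BooleanSupplier startVm,
                                             int maxAttempts,
                                             long pollMillis) throws InterruptedException {
          for (int i = 0; i < maxAttempts; i++) {
              if (precondition.getAsBoolean() && startVm.getAsBoolean()) {
                  return true;
              }
              Thread.sleep(pollMillis);
          }
          return false;
      }

      public static void main(String[] args) throws InterruptedException {
          boolean started = retryWhenHealthy(
                  () -> true,   // pretend the storage domain is back up
                  () -> true,   // pretend the VM starts successfully
                  3, 1000);
          System.out.println("VM started: " + started);
      }
  }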

Comment 9 Red Hat Bugzilla Rules Engine 2015-10-20 15:38:14 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 14 Yaniv Kaul 2017-11-16 12:55:14 UTC
Milan - how's the new policy feature affecting this request?

Comment 15 Milan Zamazal 2017-11-16 13:59:55 UTC
The new "kill" autoresume policy for HA VMs with leases should help make VM restarts safer. But being involved just in the Vdsm part, I lack proper insight, Engine developers should know better about the exact motivations and utilization.

Comment 16 Yaniv Kaul 2017-11-16 18:53:45 UTC
(In reply to Milan Zamazal from comment #15)
> The new "kill" autoresume policy for HA VMs with leases should help make VM
> restarts safer. But since I'm involved only in the Vdsm part, I lack proper
> insight; the Engine developers should know better about the exact motivations
> and usage.

What's the status on the engine side?

Comment 17 Tomas Jelinek 2017-11-22 11:16:32 UTC
(In reply to Yaniv Kaul from comment #16)
> (In reply to Milan Zamazal from comment #15)
> > The new "kill" autoresume policy for HA VMs with leases should help make VM
> > restarts safer. But since I'm involved only in the Vdsm part, I lack proper
> > insight; the Engine developers should know better about the exact motivations
> > and usage.
> 
> What's the status on the engine side?

For HA VMs with a lease, the only allowed policy is "KILL", which means the engine will try to restart the VM.
So it will help in this case.

However, if the VM is an HA VM without a lease, you can also set the "Resume" policy (which is the default); in that case this policy will not help and the original behavior will be preserved.
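
A small sketch of the rule described above, with made-up type names (ResumeBehaviourRule, ResumeBehaviour); the actual engine validation may differ:

  // Sketch: an HA VM that has a VM lease must use the KILL resume behaviour,
  // while an HA VM without a lease may keep the default, in which case the
  // single-retry behaviour this bug is about still applies.
  public class ResumeBehaviourRule {

      enum ResumeBehaviour { AUTO_RESUME, LEAVE_PAUSED, KILL }

      static boolean isAllowed(boolean highlyAvailable, boolean hasLease,
                               ResumeBehaviour behaviour) {
          if (highlyAvailable && hasLease) {
              return behaviour == ResumeBehaviour.KILL;
          }
          return true; // without a lease any behaviour is allowed
      }

      public static void main(String[] args) {
          System.out.println(isAllowed(true, true, ResumeBehaviour.AUTO_RESUME)); // false
          System.out.println(isAllowed(true, true, ResumeBehaviour.KILL));        // true
          System.out.println(isAllowed(true, false, ResumeBehaviour.AUTO_RESUME)); // true
      }
  }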

Comment 18 Polina 2020-02-03 09:43:24 UTC
Verified on ovirt-engine-4.4.0-0.17.master.el7.noarch
according to https://polarion.engineering.redhat.com/polarion/redirect/project/RHEVM3/workitem?id=RHEVM-26910

Comment 19 Ryan Barry 2020-03-09 22:07:24 UTC
*** Bug 1653389 has been marked as a duplicate of this bug. ***

Comment 20 Sandro Bonazzola 2020-07-08 08:27:17 UTC
This bug is included in the oVirt 4.4.1 release, published on July 8th, 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

