The HA logic should be adjusted to the latest RHV features and take storage access into consideration. Currently the logic is quite simple: by default it tries to start the VM 10 times at 30-second intervals (configured by MaxNumOfTriesToRunFailedAutoStartVm and RetryToRunAutoStartVmIntervalInSeconds). As a result, in some cases HA VMs are not restarted even though the RHV environment is up and healthy again after a longer outage.

Instead, HA VMs could be restarted in a more sophisticated way, e.g. an attempt could be triggered only when it makes sense: when there are enough resources on the cluster level and the storage dependencies are available. What is really odd is that we spend a restart attempt even when validation fails:

WARN  [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler2) [782bf594] Validation of action 'RunVm' failed for user SYSTEM. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,SCHEDULING_NO_HOSTS
2018-11-18 08:19:08,723+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [782bf594] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Restart of the Highly Available VM x01001100 failed.

and eventually:

ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [25701382] EVENT_ID: EXCEEDED_MAXIMUM_NUM_OF_RESTART_HA_VM_ATTEMPTS(9,605), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Highly Available VM x01001100 could not be restarted automatically, exceeded the maximum number of attempts.
ADDITIONAL INFO: A very poor workaround would be to restart without any limit, but that is not possible; the best one can do is set MaxNumOfTriesToRunFailedAutoStartVm to some enormous int value.

In /bll/src/main/java/org/ovirt/engine/core/bll/AutoStartVmsRunner.java the value MaxNumOfTriesToRunFailedAutoStartVm is taken from ConfigValues and assigned as an integer to MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM:

    private static final int MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM =
            Config.<Integer> getValue(ConfigValues.MaxNumOfTriesToRunFailedAutoStartVm);

Later, MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM is taken into consideration when scheduling the next try to restart the HA VM:

    boolean scheduleNextTimeToRun(Date timeToRunTheVm) {
        this.timeToRunTheVm = timeToRunTheVm;
        return ++numOfRuns < MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM;
    }

It is an ordinary comparison without any conditionals. The loop iterates over the elements of a CopyOnWriteArraySet; I don't see any way to make it run ad infinitum.
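To illustrate, the retry accounting boils down to a plain counter like the sketch below (a minimal standalone class for demonstration, not the actual engine code; the class name is hypothetical and the limit is hard-coded to the default of 10 instead of coming from ConfigValues):

    // Minimal sketch of the retry accounting in AutoStartVmsRunner.
    public class AutoStartRetrySketch {
        private static final int MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM = 10;

        private int numOfRuns;

        // Returns true while another attempt is allowed, false once the
        // budget is exhausted -- a bare counter, with no condition on
        // cluster resources or storage availability.
        boolean scheduleNextTimeToRun() {
            return ++numOfRuns < MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM;
        }

        public static void main(String[] args) {
            AutoStartRetrySketch vm = new AutoStartRetrySketch();
            int reschedules = 0;
            while (vm.scheduleNextTimeToRun()) {
                reschedules++;
            }
            System.out.println("reschedules granted: " + reschedules);
        }
    }

Note that every call counts against the budget, which mirrors the problem above: an attempt is consumed even when validation fails immediately.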
Hey Olimp - There are a number of related bugs. The closest is probably https://bugzilla.redhat.com/show_bug.cgi?id=844083

The best option here would be a sliding window for restarts, since our only alternative is essentially infinite restarts (the engine can't be aware of every possible failure case from libvirt/qemu -- only that it failed). In theory, we could also watch network/storage and restart, but it's difficult to make guarantees without unlimited restarts. Unlimited restarts may be acceptable as an interim solution, but they risk flooding the logs by failing over and over again.
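For the record, a sliding window could look roughly like this (a hypothetical sketch, not engine code; class and method names are invented): allow at most N attempts within a rolling time window, so attempts become available again once old failures age out, instead of a lifetime cap that is exhausted forever.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical sliding-window limiter: at most maxAttempts restarts
    // per windowMillis, but never a permanent give-up.
    public class SlidingWindowRestartLimiter {
        private final int maxAttempts;
        private final long windowMillis;
        private final Deque<Long> attemptTimes = new ArrayDeque<>();

        public SlidingWindowRestartLimiter(int maxAttempts, long windowMillis) {
            this.maxAttempts = maxAttempts;
            this.windowMillis = windowMillis;
        }

        // Returns true if a restart may be attempted now, recording the attempt;
        // false means "throttled until the window slides", not "gave up".
        public synchronized boolean tryAttempt(long nowMillis) {
            // Drop attempts that fell out of the window.
            while (!attemptTimes.isEmpty()
                    && nowMillis - attemptTimes.peekFirst() >= windowMillis) {
                attemptTimes.pollFirst();
            }
            if (attemptTimes.size() >= maxAttempts) {
                return false;
            }
            attemptTimes.addLast(nowMillis);
            return true;
        }

        public static void main(String[] args) {
            // 3 attempts per 60-second window.
            SlidingWindowRestartLimiter limiter =
                    new SlidingWindowRestartLimiter(3, 60_000);
            System.out.println(limiter.tryAttempt(0));      // true
            System.out.println(limiter.tryAttempt(1_000));  // true
            System.out.println(limiter.tryAttempt(2_000));  // true
            System.out.println(limiter.tryAttempt(3_000));  // false: window full
            System.out.println(limiter.tryAttempt(61_000)); // true: window slid
        }
    }

With something like this the engine would keep retrying after a long outage ends, while still bounding the log noise within any single window.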
https://bugzilla.redhat.com/show_bug.cgi?id=844083 is not public :)
The gist of that RFE is "if there's an environment failure, engine should re-start VMs which were powered on when the environment comes back up"
(In reply to Olimp Bockowski from comment #0)
> This leads to the situation when in some cases HA VMs are not restarted
> when RHV environment is up and healthy after a longer outage.
>
> Instead, HA VMs could be restarted in a more sophisticated way, e.g. the
> attempt could be triggered when there is sense to do it, e.g. are enough
> resources on a cluster level and storage dependencies are available.

Attaching another ticket. In this one the SD holding the VM leases had a long outage. The VMs failed and HA tried to restart them, but gave up after a few attempts. When the SD with the leases later became valid again, the VMs were not restarted and the user had to start them manually.
*** This bug has been marked as a duplicate of bug 912723 ***