Bug 1653389

Summary: [RFE] more efficient HA logic
Product: Red Hat Enterprise Virtualization Manager
Reporter: Olimp Bockowski <obockows>
Component: ovirt-engine
Assignee: Nobody <nobody>
Status: CLOSED DUPLICATE
QA Contact: meital avital <mavital>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.1.9
CC: gveitmic, klaas, rbarry, Rhev-m-bugs
Target Milestone: ovirt-4.4.0
Keywords: FutureFeature
Target Release: ---
Flags: lsvaty: testing_plan_complete-
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-09 22:07:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Olimp Bockowski 2018-11-26 17:18:13 UTC
The HA logic should be adjusted to the latest RHV features and should take storage access into consideration.

Currently, the logic is quite simple: it tries to start the VM, by default 10 times at 30-second intervals (configured by MaxNumOfTriesToRunFailedAutoStartVm and RetryToRunAutoStartVmIntervalInSeconds). This leads to situations where, after a longer outage, HA VMs are not restarted even though the RHV environment is back up and healthy.

Instead, HA VMs could be restarted in a more sophisticated way: an attempt could be triggered only when it makes sense, e.g. when there are enough resources at the cluster level and the storage dependencies are available.
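To illustrate the proposal, here is a minimal sketch of a policy that consumes one of the limited restart attempts only when the environment could plausibly run the VM (class and method names are invented for illustration, not oVirt API):

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch: a failed environment check (e.g. no cluster capacity,
 * storage domain down) reschedules the VM without burning a restart attempt.
 */
class GatedRestartPolicy {
    private final int maxAttempts;
    private int usedAttempts;

    GatedRestartPolicy(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /**
     * @param environmentReady e.g. "cluster has capacity AND storage is available"
     * @return true if a real restart attempt should be made now
     */
    boolean tryConsumeAttempt(BooleanSupplier environmentReady) {
        if (!environmentReady.getAsBoolean()) {
            return false;           // reschedule later, do NOT consume an attempt
        }
        if (usedAttempts >= maxAttempts) {
            return false;           // give up: real attempts exhausted
        }
        usedAttempts++;
        return true;
    }

    int usedAttempts() {
        return usedAttempts;
    }
}
```

With such gating, a long storage outage would not exhaust the budget, while genuine start failures on a healthy environment still would.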

What is really odd is that we consume a restart attempt even when validation fails:

WARN  [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler2) [782bf594] Validation of action 'RunVm' failed for user SYSTEM. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,SCHEDULING_NO_HOSTS
2018-11-18 08:19:08,723+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [782bf594] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Restart of the Highly Available VM x01001100 failed.
and eventually
ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [25701382] EVENT_ID: EXCEEDED_MAXIMUM_NUM_OF_RESTART_HA_VM_ATTEMPTS(9,605), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Highly Available VM x01001100 could not be restarted automatically, exceeded the maximum number of attempts.

ADDITIONAL INFO:
A very poor workaround would be to retry without any limit, but that is not possible; the best one can do is set MaxNumOfTriesToRunFailedAutoStartVm to some enormous integer value.

/bll/src/main/java/org/ovirt/engine/core/bll/AutoStartVmsRunner.java

The value MaxNumOfTriesToRunFailedAutoStartVm is read from ConfigValues and assigned as an integer to MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM:

...
private static final int MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM =
                Config.<Integer> getValue(ConfigValues.MaxNumOfTriesToRunFailedAutoStartVm);
...

Later, MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM is consulted when scheduling the next attempt to restart the HA VM:


        boolean scheduleNextTimeToRun(Date timeToRunTheVm) {
            this.timeToRunTheVm = timeToRunTheVm;
            return ++numOfRuns < MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM;
        }

This is a plain comparison with no further conditions. The loop iterates over the elements of a CopyOnWriteArraySet; I don't see any way to make it retry ad infinitum.
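The counting above can be modeled in a few lines (an illustrative model mirroring the snippet, not the real oVirt class): every scheduling call increments numOfRuns regardless of why the previous attempt failed, so validation failures alone can exhaust the whole budget.

```java
/** Minimal model of the attempt counting shown above (not the real oVirt class). */
class AutoStartAttemptCounter {
    private final int maxTries;
    private int numOfRuns;

    AutoStartAttemptCounter(int maxTries) {
        this.maxTries = maxTries;
    }

    /** Mirrors scheduleNextTimeToRun(): counts every retry, regardless of cause. */
    boolean scheduleNextTry() {
        return ++numOfRuns < maxTries;
    }
}
```

With the default maxTries of 10, ten failed validations in a row (e.g. SCHEDULING_NO_HOSTS during an outage) are enough to give up on the VM permanently.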

Comment 1 Ryan Barry 2018-11-27 13:47:04 UTC
Hey Olimp -

There are a number of related bugs. The closest is probably https://bugzilla.redhat.com/show_bug.cgi?id=844083

The best case here would be to have a sliding window for restarts, since our options otherwise are essentially infinite restarts (the engine can't be aware of every possible failure case from libvirt/qemu -- only that it failed).

In theory, we could also watch network/storage and restart, but it's difficult to make guarantees without unlimited restarts.

Unlimited restarts may be acceptable as an interim solution, but they risk flooding the logs with repeated failures.
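The sliding-window idea could look roughly like this (a sketch under assumptions; the class and its methods are invented, not oVirt code): allow a restart only if fewer than N restarts happened within the last W milliseconds, so a long outage never permanently exhausts the budget, while a crash loop is still throttled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/**
 * Hypothetical sliding-window restart limiter: at most maxRestarts
 * within any windowMillis period. Not oVirt code.
 */
class SlidingWindowRestartLimiter {
    private final int maxRestarts;
    private final long windowMillis;
    private final LongSupplier clock;                  // injectable for testing
    private final Deque<Long> restartTimes = new ArrayDeque<>();

    SlidingWindowRestartLimiter(int maxRestarts, long windowMillis, LongSupplier clock) {
        this.maxRestarts = maxRestarts;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Returns true and records the restart if it is allowed right now. */
    synchronized boolean tryRestart() {
        long now = clock.getAsLong();
        // Drop restarts that have fallen out of the window.
        while (!restartTimes.isEmpty() && now - restartTimes.peekFirst() >= windowMillis) {
            restartTimes.pollFirst();
        }
        if (restartTimes.size() >= maxRestarts) {
            return false;                              // throttled: too many recent restarts
        }
        restartTimes.addLast(now);
        return true;
    }
}
```

Unlike a fixed attempt counter, this never gives up for good: once the environment stays healthy long enough for old attempts to age out of the window, restarts are permitted again.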

Comment 2 Klaas Demter 2018-11-27 14:52:07 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=844083 is not public :)

Comment 3 Ryan Barry 2018-11-27 15:05:25 UTC
The gist of that RFE is "if there's an environment failure, engine should re-start VMs which were powered on when the environment comes back up"

Comment 4 Germano Veit Michel 2019-02-28 23:35:54 UTC
(In reply to Olimp Bockowski from comment #0)
> This leads to the situation when in
> some cases HA VMs are not restarted when RHV environment is up and healthy
> after a longer outage.
> 
> Instead, HA VMs could be restarted in a more sophisticated way, e.g. the
> attempt could be triggered when there is sense to do it, e.g. are enough
> resources on a cluster level and storage dependencies are available.

Attaching another ticket.

On this one the SD with the VM leases had a long outage. The VMs failed and then HA tried to restart them but gave up after a few attempts.

Then later the SD with the leases became valid again, but the VMs were not restarted. The user had to start them manually.

Comment 5 Ryan Barry 2020-03-09 22:07:24 UTC

*** This bug has been marked as a duplicate of bug 912723 ***