The HA logic should be adjusted to the latest RHV features and take storage access into consideration. Currently the logic is quite simple: by default it tries to start the VM 10 times at 30-second intervals (configured by MaxNumOfTriesToRunFailedAutoStartVm and RetryToRunAutoStartVmIntervalInSeconds). As a result, in some cases HA VMs are not restarted even though the RHV environment is up and healthy again after a longer outage.

Instead, HA VMs could be restarted in a more sophisticated way, e.g. an attempt could be triggered only when it makes sense: when there are enough resources on the cluster level and the storage dependencies are available. What is really odd is that we spend a restart attempt even when validation fails:

WARN  [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler2) [782bf594] Validation of action 'RunVm' failed for user SYSTEM. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,SCHEDULING_NO_HOSTS
2018-11-18 08:19:08,723+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [782bf594] EVENT_ID: HA_VM_RESTART_FAILED(9,603), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Restart of the Highly Available VM x01001100 failed.

and eventually:

ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [25701382] EVENT_ID: EXCEEDED_MAXIMUM_NUM_OF_RESTART_HA_VM_ATTEMPTS(9,605), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Highly Available VM x01001100 could not be restarted automatically, exceeded the maximum number of attempts.
ADDITIONAL INFO: A very poor workaround would be to restart without any limit, but that is not possible; the best one can do is set MaxNumOfTriesToRunFailedAutoStartVm to some enormous int value.

In /bll/src/main/java/org/ovirt/engine/core/bll/AutoStartVmsRunner.java the value MaxNumOfTriesToRunFailedAutoStartVm is taken from ConfigValues and assigned as an integer to MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM:

    private static final int MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM =
            Config.<Integer> getValue(ConfigValues.MaxNumOfTriesToRunFailedAutoStartVm);

Later, MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM is taken into consideration when scheduling the next try to restart the HA VM:

    boolean scheduleNextTimeToRun(Date timeToRunTheVm) {
        this.timeToRunTheVm = timeToRunTheVm;
        return ++numOfRuns < MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM;
    }

It is an ordinary comparison without any conditionals. The loop iterates over the elements of a CopyOnWriteArraySet; I don't see any way to make it run ad infinitum.
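To illustrate, the retry accounting boils down to a plain counter like the sketch below (a minimal standalone class for demonstration, not the actual engine code; the class name is hypothetical and the limit is hard-coded to the default of 10 instead of coming from ConfigValues):

    // Minimal sketch of the retry accounting in AutoStartVmsRunner.
    public class AutoStartRetrySketch {
        private static final int MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM = 10;

        private int numOfRuns;

        // Returns true while another attempt is allowed, false once the
        // budget is exhausted -- a bare counter, with no condition on
        // cluster resources or storage availability.
        boolean scheduleNextTimeToRun() {
            return ++numOfRuns < MAXIMUM_NUM_OF_TRIES_TO_AUTO_START_VM;
        }

        public static void main(String[] args) {
            AutoStartRetrySketch vm = new AutoStartRetrySketch();
            int reschedules = 0;
            while (vm.scheduleNextTimeToRun()) {
                reschedules++;
            }
            System.out.println("reschedules granted: " + reschedules);
        }
    }

Note that every call counts against the budget, which mirrors the problem above: an attempt is consumed even when validation fails immediately.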
Hey Olimp - There are a number of related bugs. The closest is probably https://bugzilla.redhat.com/show_bug.cgi?id=844083

The best option here would be a sliding window for restarts, since our only alternative is essentially infinite restarts (the engine can't be aware of every possible failure case from libvirt/qemu -- only that it failed). In theory, we could also watch network/storage and restart, but it's difficult to make guarantees without unlimited restarts. Unlimited restarts may be acceptable as an interim solution, but they risk flooding the logs by failing over and over again.
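For the record, a sliding window could look roughly like this (a hypothetical sketch, not engine code; class and method names are invented): allow at most N attempts within a rolling time window, so attempts become available again once old failures age out, instead of a lifetime cap that is exhausted forever.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical sliding-window limiter: at most maxAttempts restarts
    // per windowMillis, but never a permanent give-up.
    public class SlidingWindowRestartLimiter {
        private final int maxAttempts;
        private final long windowMillis;
        private final Deque<Long> attemptTimes = new ArrayDeque<>();

        public SlidingWindowRestartLimiter(int maxAttempts, long windowMillis) {
            this.maxAttempts = maxAttempts;
            this.windowMillis = windowMillis;
        }

        // Returns true if a restart may be attempted now, recording the attempt;
        // false means "throttled until the window slides", not "gave up".
        public synchronized boolean tryAttempt(long nowMillis) {
            // Drop attempts that fell out of the window.
            while (!attemptTimes.isEmpty()
                    && nowMillis - attemptTimes.peekFirst() >= windowMillis) {
                attemptTimes.pollFirst();
            }
            if (attemptTimes.size() >= maxAttempts) {
                return false;
            }
            attemptTimes.addLast(nowMillis);
            return true;
        }

        public static void main(String[] args) {
            // 3 attempts per 60-second window.
            SlidingWindowRestartLimiter limiter =
                    new SlidingWindowRestartLimiter(3, 60_000);
            System.out.println(limiter.tryAttempt(0));      // true
            System.out.println(limiter.tryAttempt(1_000));  // true
            System.out.println(limiter.tryAttempt(2_000));  // true
            System.out.println(limiter.tryAttempt(3_000));  // false: window full
            System.out.println(limiter.tryAttempt(61_000)); // true: window slid
        }
    }

With something like this the engine would keep retrying after a long outage ends, while still bounding the log noise within any single window.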
https://bugzilla.redhat.com/show_bug.cgi?id=844083 is not public :)
The gist of that RFE is "if there's an environment failure, engine should re-start VMs which were powered on when the environment comes back up"
(In reply to Olimp Bockowski from comment #0)
> This leads to the situation when in some cases HA VMs are not restarted
> when RHV environment is up and healthy after a longer outage.
>
> Instead, HA VMs could be restarted in a more sophisticated way, e.g. the
> attempt could be triggered when there is sense to do it, e.g. are enough
> resources on a cluster level and storage dependencies are available.

Attaching another ticket. In this one the SD holding the VM leases had a long outage. The VMs failed and HA tried to restart them, but gave up after a few attempts. When the SD with the leases later became valid again, the VMs were not restarted and the user had to start them manually.
*** This bug has been marked as a duplicate of bug 912723 ***