Bug 1801439 - Prioritization of restarted HA VMs is messed up - low is restarted before medium
Summary: Prioritization of restarted HA VMs is messed up - low is restarted before medium
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.4.0
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ovirt-4.4.1
: ---
Assignee: Andrej Krejcir
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks: autostart-w-engine
TreeView+ depends on / blocked
 
Reported: 2020-02-10 21:27 UTC by Polina
Modified: 2020-02-17 12:24 UTC (History)
3 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-02-17 12:24:45 UTC
oVirt Team: SLA
Embargoed:
pm-rhel: ovirt-4.4+


Attachments (Terms of Use)
engine.log attached (4.19 MB, text/plain)
2020-02-10 21:27 UTC, Polina
no flags Details
engine logs for two scenarios (246.07 KB, application/gzip)
2020-02-12 18:46 UTC, Polina
no flags Details

Description Polina 2020-02-10 21:27:23 UTC
Created attachment 1662275 [details]
engine.log attached

Description of problem:
Sometimes prioritization of restarted HA VMs is messed up -low is restarted before medium

Version-Release number of selected component (if applicable):
Sometimes prioritization of restarted HA VMs is messed up -low is restarted before the medium

How reproducible:always

Steps to Reproduce:

Scenario 1.
1. Run the following HA VMs with lease on the same host 
four HA VMs with high priority named high_1, high_2, high_3, high_4;
four HA VMs with medium priority named medium_1, medium_2, medium_3, medium_4;
four HA VMs with low priority named low_priority_1, low_priority_2, low_priority_3, low_priority_4.

2. Send poweroff with the powermgmnt to the host (VMs are down which is ok), wait for a while then start the host again.

Scenario 2.
1. the same as in scenario 1
2. block the storage on the host with iptables command 

Expected results: VMs are restarted in the order: high priority, medium, low.

Actual results:

The VMs are started, but sometimes the medium goes before the low (it is never messed with the high - all the high priority VMs are started first).

[root@compute-ge-6 ovirt-engine]# tail -f engine.log |grep "Trying to restart VM"
2020-02-09 15:27:21,080+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-89) [2a9cacc7] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_2 on Host host_mixed_1
2020-02-09 15:27:21,480+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-89) [79c46e1e] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_3 on Host host_mixed_1
2020-02-09 15:27:22,060+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-89) [2ff57f0a] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_4 on Host host_mixed_1
2020-02-09 15:28:31,578+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_4 on Host host_mixed_1
2020-02-09 15:28:31,597+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_2 on Host host_mixed_1
2020-02-09 15:28:31,615+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-47) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_3 on Host host_mixed_1
2020-02-09 15:28:32,517+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [63b1aa62] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_1 on Host host_mixed_1
2020-02-09 15:29:46,857+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_1 on Host host_mixed_1
2020-02-09 15:29:47,824+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-27) [55b821aa] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_2 on Host host_mixed_1
2020-02-09 15:31:02,135+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-30) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_2 on Host host_mixed_1
2020-02-09 15:31:03,050+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-52) [3b3fa31b] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_4 on Host host_mixed_1
2020-02-09 15:32:17,504+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-55) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_4 on Host host_mixed_1
2020-02-09 15:32:18,331+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-70) [30747637] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_1 on Host host_mixed_1
2020-02-09 15:33:32,776+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_1 on Host host_mixed_1
2020-02-09 15:33:33,703+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [3c6aa9b2] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_1 on Host host_mixed_1
2020-02-09 15:33:33,965+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [3a4b4b22] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_2 on Host host_mixed_1
2020-02-09 15:33:34,299+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [4ff0c879] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_3 on Host host_mixed_1
2020-02-09 15:33:34,559+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [a1bfcac] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_3 on Host host_mixed_1
2020-02-09 15:33:34,828+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [4f5ef5cf] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_4 on Host host_mixed_1
2020-02-09 15:34:33,908+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-7) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_1 on Host host_mixed_1
2020-02-09 15:34:49,009+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_4 on Host host_mixed_1
2020-02-09 15:34:49,026+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_3 on Host host_mixed_1
2020-02-09 15:34:49,039+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_2 on Host host_mixed_1
2020-02-09 15:34:49,062+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_3 on Host host_mixed_1


Additional info:

Comment 2 Andrej Krejcir 2020-02-11 14:25:05 UTC
The lower priority HA VMs wait for the higher priority VMs to be Up, or until a configured timeout passes (config value 'MaxTimeAutoStartBlockedOnPriority'). By default the timeout is 5 minutes.

Looking at the logs, it seems that the timeout is not long enough. For example VM 'low_priority_1' was added to the auto-start algorithm here:

2020-02-09 15:24:16,262+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-31) [] EVENT_ID: HA_VM_FAILED(9,602), Highly Available VM low_priority_1 failed. It will be restarted automatically.

And it was started 8 minutes after, here:

2020-02-09 15:32:18,205+02 INFO  [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-70) [30747637] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='ccc22896-571f-45fc-94e1-b752cf0baf5e', vmId='4d601cb7-377f-4aed-baca-dc2f07b28543', vm='VM [low_priority_1]'}), log id: 7d3317a5

Can you test this again, but set the timeout to high enough value?
Default timeout of 5 minutes may not be long enough, maybe we should increase it.

Comment 3 Polina 2020-02-12 11:50:29 UTC
Hi Andrej,

it looks like now we have 10 min, not 5 -

engine-config --get MaxTimeAutoStartBlockedOnPriority
MaxTimeAutoStartBlockedOnPriority: 600 version: general

do we want to try longer timeout ?

Comment 4 Andrej Krejcir 2020-02-12 14:50:59 UTC
Right, I was mistaken. But could you test it again with a longer timeout anyway?
I cannot reproduce it in my environment, and then we will know if the issue is caused by the timeout or not. Try setting it to a high value, like 1 day.

Comment 5 Polina 2020-02-12 18:46:12 UTC
Created attachment 1662756 [details]
engine logs for two scenarios

Tried two values - an hour and a day:
engine-config --set MaxTimeAutoStartBlockedOnPriority=3600 (engine restart)
engine-config --set MaxTimeAutoStartBlockedOnPriority=86400 (engine restart)

still, low priority goes before medium.

attached engine_log_two_scenarios_for_two_timeouts.tar.gz

[root@compute-ge-6 ovirt-engine]# grep "Trying to restart VM" engine.log
2020-02-12 19:51:18,797+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-28) [318fc02d] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_2 on Host host_mixed_1
2020-02-12 19:51:19,325+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-28) [7d54fcdf] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_3 on Host host_mixed_1
2020-02-12 19:51:19,705+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-28) [5ed24e99] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_4 on Host host_mixed_1
2020-02-12 19:52:30,771+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_4 on Host host_mixed_1
2020-02-12 19:52:30,789+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_2 on Host host_mixed_1
2020-02-12 19:52:30,805+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_3 on Host host_mixed_1
2020-02-12 19:52:32,163+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [c7d132] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_1 on Host host_mixed_1
2020-02-12 19:53:46,740+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-5) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM high_1 on Host host_mixed_1
2020-02-12 19:53:47,247+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-50) [34de16e3] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_2 on Host host_mixed_1
2020-02-12 19:54:47,874+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_2 on Host host_mixed_1
2020-02-12 19:54:48,725+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-59) [437c267c] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_4 on Host host_mixed_1
2020-02-12 19:56:03,634+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_4 on Host host_mixed_1
2020-02-12 19:56:04,971+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [2f589156] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_1 on Host host_mixed_1
2020-02-12 19:57:19,543+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-60) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_1 on Host host_mixed_1
2020-02-12 19:57:20,025+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [248a938a] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_1 on Host host_mixed_1
2020-02-12 19:57:20,392+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [3abacf4b] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_2 on Host host_mixed_1
2020-02-12 19:57:20,729+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [7341c6e8] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM medium_3 on Host host_mixed_1
2020-02-12 19:57:21,141+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [6a526936] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_3 on Host host_mixed_1
2020-02-12 19:57:21,506+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [b90fd40] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM low_priority_4 on Host host_mixed_1

Comment 6 Polina 2020-02-17 12:24:45 UTC
Closed the bz as NOTABUG since there was an error in the test setup - two VMs named as a medium were configured with low priority which messed up the results. 
Re-tested for both scenarios (host poweroff and blocking storage). The results are correct


Note You need to log in before you can comment on or make changes to this bug.