Description of the problem:
The engine processes VMs in no particular (intended) order when it tries to resolve affinity conflicts, so the VMs are processed in whatever order the DB subsystem returns them. Because the engine migrates only one VM per evaluation cycle, there is a high chance that a single VM that cannot be migrated will block the rest of the VMs in the same affinity group from being migrated.

Version-Release number of selected component (if applicable):
rhvm-4.2.8.2-0.1

How reproducible:
100%

Steps to Reproduce:
1. Create a set of VMs, some with a high amount of RAM and some with a low amount of RAM.
2. Use a hypervisor with slightly less memory than the VMs need. This is just to simulate an environment with more hypervisors in the same affinity group.
3. Observe that the engine does not migrate all the VMs even though it could, because it is blocked by a high-memory VM that it repeatedly tries to migrate to this hypervisor.

Actual results:
There are VMs that could be migrated to the hypervisor, but the engine will not try, as it is endlessly trying to migrate a VM that cannot fit there.

Expected results:
If a VM cannot be migrated, the engine should skip it and try another VM in the list. Ideally, the entire algorithm should also be improved so that there is a higher chance that all the VMs fit when more hypervisors are available.

Additional info:
The VMs are actually sorted by the number of conflicts, but in real life there are only one or two conflicts, and the majority of the VMs have the same number of conflicts (mostly just one). The issue grows with the number of VMs in the affinity group and the diversity of the VMs (small and big VMs).
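The expected behavior can be sketched as follows. This is a minimal illustration, not the actual engine code: the class, record, and method names are hypothetical, and `canFit` is reduced to a plain memory check standing in for a full scheduling run. The point is that the candidate list, sorted by conflict count, is walked until a VM that can actually be placed is found, instead of retrying the same unplaceable VM every cycle.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of "skip the VM that cannot migrate and try the next one".
// None of these names come from the real ovirt-engine sources.
public class AffinityEnforcerSketch {

    record Vm(String name, int conflicts, int memoryMb) {}

    // hostFreeMemoryMb stands in for the scheduler's placement check.
    static Optional<Vm> pickVmToMigrate(List<Vm> candidates, int hostFreeMemoryMb) {
        return candidates.stream()
                // Most conflicts first; stable sort keeps DB order among ties.
                .sorted(Comparator.comparingInt(Vm::conflicts).reversed())
                // Skip VMs that cannot fit instead of blocking on them.
                .filter(vm -> vm.memoryMb() <= hostFreeMemoryMb)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Vm> vms = List.of(
                new Vm("big", 1, 64_000),   // blocks the old behavior forever
                new Vm("small1", 1, 2_000),
                new Vm("small2", 1, 2_000));
        // Host has only 4 GB free: "big" is skipped, a small VM is chosen.
        System.out.println(pickVmToMigrate(vms, 4_000).map(Vm::name).orElse("none"));
        // prints "small1"
    }
}
```

With the reported behavior, the 64 GB VM would be retried every cycle and the small VMs would never move; skipping it lets the enforcer make progress within the same evaluation cycle.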
What is the exact configuration of VMs and hosts? How much memory do the VMs need, and how much do the hosts have? The engine checks whether a VM can be migrated, and if it cannot, it tries to migrate a different VM. This bug may be an edge case where the VM can be migrated, but the best host for it is the one where it is currently running, so it is not moved.
This issue will be solved by some of the patches that solve Bug 1651747.
The other bug is in MODIFIED.
(In reply to Andrej Krejcir from comment #6)
> This issue will be solved by some of the patches that solve Bug 1651747.

Shouldn't this be retargeted to TM 4.3.5, as 1651747 is?
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3.z': '?'}', ] For more info please contact: rhv-devops
Possible steps to verify:
1. Have 3 VMs (VM1, VM2, VM3) and 2 hosts (Host1, Host2). VM1 and VM2 are running on Host1, VM3 on Host2.
2. Create these VM-to-host affinity groups:
   - positive hard (VM3, Host2)
   - positive soft (VM1, Host1)
   - positive soft (VM2, Host1)
3. Create a positive hard VM affinity group containing all three VMs.
4. Check the engine.log.

Expected results:
The affinity rules enforcer runs every minute by default. It should try to migrate both VM1 and VM2 every time it runs, but they will not migrate. The log should contain "Running command: BalanceVmCommand" for both VMs, no more than a minute apart.

The reason for this setup is that VM1 and VM2 can be migrated, but because of their VM-to-host soft affinity, the scheduler chooses Host1 as the best host for them. As a result, when one of them is not migrated, the affinity rules enforcer tries to migrate the other one.
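The host-selection effect behind this setup can be sketched numerically. This is an illustrative model only, with made-up names and weights, not the real policy-unit code: a soft (weighted) VM-to-host affinity adds a penalty to every host outside the preferred set, so the current host wins the ranking and the VM stays put.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of why soft VM-to-host affinity keeps VM1/VM2 on Host1.
// Names and weights are illustrative, not the actual scheduler internals.
public class SoftAffinityScoreSketch {

    // Lower total score wins, mirroring a "lowest weight is best" ranking.
    static String bestHost(Map<String, Integer> baseScore,
                           Set<String> preferredHosts, int softPenalty) {
        return baseScore.entrySet().stream()
                .min(Comparator.comparingInt(e ->
                        e.getValue()
                        + (preferredHosts.contains(e.getKey()) ? 0 : softPenalty)))
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Both hosts are otherwise equally good for VM1...
        Map<String, Integer> scores = Map.of("Host1", 10, "Host2", 10);
        // ...but the soft (VM1, Host1) group penalizes Host2,
        // so Host1 is chosen and VM1 stays where it already runs.
        System.out.println(bestHost(scores, Set.of("Host1"), 5));
        // prints "Host1"
    }
}
```

Because neither VM1 nor VM2 ever leaves Host1, the hard VM affinity group stays violated, and the enforcer keeps alternating between the two VMs instead of looping on one of them, which is exactly what the log check above verifies.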
Verified according to the steps described in https://bugzilla.redhat.com/show_bug.cgi?id=1674386#c14

29d8f9d8-23a0-4028-935c-96ac9bba8c27 - VM1 (golden_env_mixed_virtio_0)
932c013e-5f5d-4ea0-8382-b132dea73c33 - VM2 (golden_env_mixed_virtio_1)
a4871027-040a-49fa-b869-673a31270f5f - VM3 (golden_env_mixed_virtio_2)

Once a minute there is a BalanceVmCommand entry in engine.log for both VM1 and VM2:

2019-07-08 18:46:07,259+03 INFO [org.ovirt.engine.core.bll.BalanceVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-93) [33e8f107] Running command: BalanceVmCommand internal: true. Entities affected : ID: 29d8f9d8-23a0-4028-935c-96ac9bba8c27
2019-07-08 18:46:07,444+03 INFO [org.ovirt.engine.core.bll.BalanceVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-93) [4051b966] Running command: BalanceVmCommand internal: true. Entities affected : ID: 932c013e-5f5d-4ea0-8382-b132dea73c33
2019-07-08 18:46:07,292+03 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.VmAffinityPolicyUnit] (EE-ManagedThreadFactory-engineScheduled-Thread-93) [33e8f107] Invalid affinity situation was detected while scheduling VMs: 'golden_env_mixed_virtio_0' (29d8f9d8-23a0-4028-935c-96ac9bba8c27). VMs belonging to the same positive enforcing affinity groups are running on more than one host.
2019-07-08 18:46:07,450+03 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.VmAffinityPolicyUnit] (EE-ManagedThreadFactory-engineScheduled-Thread-93) [4051b966] Invalid affinity situation was detected while scheduling VMs: 'golden_env_mixed_virtio_1' (932c013e-5f5d-4ea0-8382-b132dea73c33). VMs belonging to the same positive enforcing affinity groups are running on more than one host.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:2431