Bug 1583009

Summary: [RFE] Balancing does not produce ideal migrations
Product: Red Hat Enterprise Virtualization Manager    Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine    Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED ERRATA    QA Contact: Liran Rotenberg <lrotenbe>
Severity: high    Docs Contact:
Priority: high
Version: 4.2.5    CC: akrejcir, dfediuck, gveitmic, lsurette, mavital, michal.skrivanek, msivak, mtessun, rbarry, Rhev-m-bugs, sborella, srevivo
Target Milestone: ovirt-4.3.2    Keywords: FutureFeature
Target Release: 4.3.0    Flags: lrotenbe: testing_plan_complete+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.3.0_alpha Doc Type: Enhancement
Doc Text:
Previously, during high CPU usage, the balancing process would migrate only a single virtual machine that it evaluated as a good migration candidate. Now, this enhancement updates the balancing process to try migrating multiple virtual machines one by one until one of the migrations succeeds.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-05-08 12:37:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1653752    
Bug Blocks:    

Description Germano Veit Michel 2018-05-28 04:31:42 UTC
Description of problem:

Consider the following scenario:
- 3 Hosts (ovirt-h1, ovirt-h2, ovirt-h3)
- 5 VMs (VM-1, VM-2, ... VM-5)

The following Affinity Settings:
- VM-1, VM-2, VM-3 -> Soft Affinity Group, Positive
- VM-4, VM-1       -> Hard Affinity Group, Negative
- VM-4, VM-2       -> Hard Affinity Group, Negative
- VM-4, VM-3       -> Hard Affinity Group, Negative
- VM-5             -> No Setting

The VMs are running on these hosts:
ovirt-h1: VM-4
ovirt-h2: none (SPM)
ovirt-h3: VM-1, VM-2, VM-3, VM-5

At this point, run stress on VM-5 to max out its CPU. VM-5 has high CPU usage while the other VMs have low CPU usage (but are not idle).

What happens? 

VM-2 (with the affinity group setting!) is migrated to ovirt-h2. Sometimes I also got VM-1 or VM-3 migrated (until we drop below HighUtilization for CPU, then migrations stop). So most of the time we are left with:

ovirt-h1: VM-4
ovirt-h2: VM-2
ovirt-h3: VM-1, VM-3, VM-5

If ovirt-h3 stays over the HighUtilization threshold, then VM-1 or VM-3 is also migrated to ovirt-h2 and ovirt-h3 drops under the threshold. The result is also far from ideal.

ovirt-h1: VM-4
ovirt-h2: VM-2, VM-3
ovirt-h3: VM-1, VM-5

Problems:
a) Shouldn't the default weight for VmAffinityGroups (1) be higher? The default policy prioritizes CPU usage over VM affinity. But this can be tuned by the user for this specific scenario, so it is not actually a problem.
b) VM-5 was never picked by FindVmAndDestinations for migration. The code sorts by CPU usage, but the list appears to be in ascending order [1], so only the idle VMs are picked instead of VM-5, which has the highest CPU usage and is the ideal candidate for migration. So unless VM-5 produces enough HighUtilization to evacuate all VMs from the host, the cluster is left in a suboptimal state.

In summary:
-> Instead of doing a single migration - moving VM-5 to ovirt-h2 - the code seems to try to move all the low CPU usage VMs (VM-1, VM-2 and VM-3) to ovirt-h2. But it stops halfway, once the host drops below the HighUtilization threshold due to the migrations.
-> And worse, if the VmAffinityGroups weight is high, then nothing happens at all, as FindVmAndDestinations never finds VM-5 and the affinity weight prevents the migration of VM-1|2|3.

Load Balancing Setting:
evenly_distributed

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.5-1.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
Build the scenario above.

Actual results:
* VM-2 migrated to ovirt-h2 (in case VmAffinityGroup weight not raised to 5+)
* Nothing (in case VmAffinityGroup raised to 5+)

Expected results:
* VM-5 migrated to ovirt-h2
OR
* VM-1, VM-2 and VM-3 migrated to ovirt-h2

Ideally 1 migration of VM-5 to ovirt-h2, settling on this:
ovirt-h1: VM-4
ovirt-h2: VM-5
ovirt-h3: VM-1, VM-2, VM-3

Additional information:
[1] This is ascending:
> Collections.sort(migratableVmsOnHost, VmCpuUsageComparator.INSTANCE);
See, in 26 runs, only VM-1, VM-2 and VM-3 were picked. These are the lower CPU usage ones:
# grep "selected for" /var/log/ovirt-engine/engine.log | cut -d ' ' -f3,4,8 | sort | uniq -c
     12 DEBUG [org.ovirt.engine.core.bll.scheduling.utils.FindVmAndDestinations] 'VM-1'
      7 DEBUG [org.ovirt.engine.core.bll.scheduling.utils.FindVmAndDestinations] 'VM-2'
      7 DEBUG [org.ovirt.engine.core.bll.scheduling.utils.FindVmAndDestinations] 'VM-3'
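For illustration only, here is a standalone sketch (with a stand-in VmStub type, not the engine's classes) of how an ascending CPU-usage sort orders the candidates; the busy VM-5 always ends up last:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AscendingSortDemo {
    // Stand-in for the engine's VM type, used only for this illustration.
    record VmStub(String name, int cpuUsagePercent) {}

    public static void main(String[] args) {
        List<VmStub> migratableVmsOnHost = new ArrayList<>(List.of(
                new VmStub("VM-1", 2),
                new VmStub("VM-2", 3),
                new VmStub("VM-3", 2),
                new VmStub("VM-5", 95)));

        // Ascending sort by CPU usage: the most idle VMs come first, so the
        // selection loop never reaches the busy VM-5 unless every idle VM
        // is rejected first.
        migratableVmsOnHost.sort(Comparator.comparingInt(VmStub::cpuUsagePercent));

        // Prints VM-1, VM-3, VM-2, VM-5 - the busy VM is last.
        migratableVmsOnHost.forEach(vm -> System.out.println(vm.name()));
    }
}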

Comment 1 Doron Fediuck 2018-05-29 10:49:53 UTC
Hi,
I'll start with VM selection. We indeed sort the VMs according to CPU usage, with a preference for choosing the idle VMs first. The reason is that we prefer not to disturb a "busy" VM, while migrating an idle VM is more likely to succeed with no service disruption. So this is working by design and should stay this way.

As for the affinity rules: all hard affinity rules have been preserved, as they should be. Soft affinity, by design, is best effort and may be disrupted when needed. So overall the system is functioning without breaking any rules. In this case one might claim that leaving the busy VM alone to avoid disruption is the right way to go. The alternative, migrating a busy VM with a potential for migration failure and service disruption, is the worst-case scenario for users who need stability. As you can see, there is no single solution that fits everyone here. For this reason you can generate your own cluster policy and change the weights as you see fit.

Please let me know if you need anything else.

Comment 2 Germano Veit Michel 2018-05-29 23:41:06 UTC
Hi Doron,

I understand the point behind migrating the VM with lowest CPU usage, it does make sense. Thanks for that.

However, it does not do much to change the erratic behavior. Let me explain it differently.

Scenario:
Host is overloaded
1 VM with a lot of CPU usage
1 VM with medium CPU usage
3 VMs with no CPU usage, all part of an affinity group, pinned to a host, etc. - anything that ties them to the running host, either by filtering or by best score.

The current algorithm:

if host is overcommitted:
   vms = migratable_vms(host)
   sort(vms, cpu_usage)
   for vm in vms:
       dest_hosts = getValidHosts(vm)
       if dest_hosts:
           return vm,dest_hosts

1) The selected VM is always the VM with the lowest CPU usage.
2) getValidHosts is a simple check. Passing it does not mean the VM can migrate.
3) For several reasons (hard pinning (filter), soft pinning (weight), ...) the selected VM will be found to not be able to migrate (or the running host already has the best score).

Then the algorithm stops and no VMs are migrated. It runs again 1 minute later and hits the same problem, so no VMs are ever migrated. The host stays overloaded and the VMs keep providing poor service.

This loop here:

for (VM vmToMigrate : migratableVmsOnHost){
    // Check if vm not over utilize memory or CPU of destination hosts
    List<VDS> validDestinationHosts = getValidHosts(
        destinationHosts, cluster, vmToMigrate, highCpuUtilization, requiredMemory);

    if (!validDestinationHosts.isEmpty()){
        log.debug("Vm '{}' selected for migration", vmToMigrate.getName());
        return Optional.of(new Result(vmToMigrate, validDestinationHosts));
    }
}

It will always return a VM that cannot migrate on every cycle. The scheduler is stuck, doing nothing.

IMO, if the VM cannot migrate, then the loop must continue and select the VM with the next lowest CPU usage in the same cycle. Eventually it will find a VM that can migrate, even if it is the last one in the list. That way it does not get stuck in these scenarios where the VM with the lowest CPU usage will not migrate due to filtering or weights.
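A rough sketch of that idea, reusing the names from the loop quoted above (illustration only, not an actual patch): instead of returning the first VM that passes the cheap getValidHosts() check, return every candidate, still in ascending CPU-usage order, so the caller can fall back to the next one when full scheduling rejects a migration.

// Sketch only: collect all candidates instead of returning the first one,
// keeping the existing ascending CPU-usage order.
List<Result> candidates = new ArrayList<>();
for (VM vmToMigrate : migratableVmsOnHost) {
    // Same cheap memory/CPU capacity check as today
    List<VDS> validDestinationHosts = getValidHosts(
        destinationHosts, cluster, vmToMigrate, highCpuUtilization, requiredMemory);

    if (!validDestinationHosts.isEmpty()) {
        log.debug("Vm '{}' is a migration candidate", vmToMigrate.getName());
        candidates.add(new Result(vmToMigrate, validDestinationHosts));
    }
}
// The balancer can then try these one by one until a migration actually starts.
return candidates;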

What do you think?

Comment 3 Martin Sivák 2018-06-04 16:21:26 UTC
Hi Germano,

you are right, the current balancing algorithm is pretty simple and does not know how to handle this kind of situation. Add to that the AREM, which runs as a separate "balancing" thread and has similar limitations.

The issue here is that the balancing would have to check, for every candidate VM, whether it has a viable target that actually improves anything [1], and that is a pretty big change. We would have to merge AREM and balancing and run a complete scheduling pass from balancing, without executing the results, just to select the candidates, and then trigger the scheduler again to actually do something.

I was experimenting with a deterministic solution to this myself and I have one. Unfortunately, it is CPU intensive: a balancing run for 10000 VMs took about 2 seconds to compute the result.

We rejected the other option (some kind of randomization) for support purposes. We would not be able to reproduce bug reports.

This can all be implemented and improved, but what we have has been "good enough" so far and we could not justify the time and resources.


[1] The scheduler can decide that the best host for a VM is the current one.

Comment 4 Germano Veit Michel 2018-06-04 23:05:18 UTC
Hi Martin,

> I was experimenting with a deterministic solution to this myself and I have
> a solution for this. Unfortunately, it is CPU intensive - balancing run for
> 10000 VMs took about 2 seconds to compute the result.

Interesting. Well, with 10k VMs I believe there will be other problems as well ;)
And with 10k VMs, it's expected that RHV-M will have plenty of resources. 2s for 10k VMs doesn't sound that bad to me.

> This can all be implemented and improved, but what we has been "good enough"
> so far and we could not justify the time and resources.

There is a ticket attached to this BZ, so perhaps this could be considered an RFE?

Comment 5 Martin Sivák 2018-06-06 09:13:51 UTC
Yep, we can consider this to be an RFE.

Comment 8 Michal Skrivanek 2018-09-17 08:57:31 UTC
how about just adding a "last migration failure" time as primary sort-by field?

Comment 9 Germano Veit Michel 2018-09-19 05:21:15 UTC
(In reply to Michal Skrivanek from comment #8)
> how about just adding a "last migration failure" time as primary sort-by
> field?

Or keep the sorted list the way it is, and then retry the migration with the next VM in the list if the migration validation fails.
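A minimal sketch of that retry loop (hypothetical names, not the actual engine code): 'candidates' stands for the ascending-CPU list produced by FindVmAndDestinations, 'tryMigrate' stands for a full scheduling run plus migration attempt, and the accessor names on Result are assumed as well.

// Hypothetical sketch: walk the sorted candidate list and stop at the first
// VM whose migration actually passes full scheduling and starts.
for (Result candidate : candidates) {
    if (tryMigrate(candidate.getVm(), candidate.getDestinationHosts())) {
        break;  // one successful migration per balancing cycle is enough
    }
    // otherwise fall through to the VM with the next lowest CPU usage
}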

Comment 10 Germano Veit Michel 2018-10-22 02:50:11 UTC
Is this too intrusive for 4.2.z, or do we have plans to Z-Stream it once the patches are merged? Note there are tickets attached.

Comment 11 Ryan Barry 2018-11-18 16:31:50 UTC
Probably too intrusive for a Z-stream unless there's an urgent need

Comment 12 RHV bug bot 2018-12-10 15:13:26 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 13 RHV bug bot 2019-01-15 23:35:53 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 14 RHV bug bot 2019-03-05 21:22:46 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 15 Liran Rotenberg 2019-03-13 13:36:04 UTC
Verified on:
ovirt-engine-4.3.2.1-0.0.master.20190305140204.git3649df7.el7.noarch

Steps:
Tested the scenario from comment #2.

The environment has 3 hosts with 32 CPUs; the SPM and the HE VM were on host1.

1. Run 3 VMs with 1 vCPU each under hard affinity to host1.
2. Run VM-VM1 with a medium load (24 vCPUs), causing ~30% host CPU load on host1.
3. Run VM-VM2 with a high load (28 vCPUs), causing ~60% host CPU load on host1.
- total CPU load on host1 > 90%
4. Set the evenly_distributed scheduling policy on the cluster, with HighUtilization: 80%.

Results:
The HE VM (without any load) migrated away to host2.
Afterwards, VM1 migrated away.

Another test was a negative test:
On the same environment as above; the HE VM is now on host2.
1. Run 3 VMs with 1 vCPU each under hard affinity to host1, and load one VM to 100% (a total of ~3% of the host).
2. Run VM-VM1 with a medium load (24 vCPUs), causing ~30% host CPU load on host2.
3. Run VM-VM2 with a high load (28 vCPUs), causing ~60% host CPU load on host1.
- total CPU load on host1 > 63%
4. Set the evenly_distributed scheduling policy on the cluster, with HighUtilization: 55%.

Results:
As expected, host1 has > 63% load so it needs balancing. host2 already has ~30%, so moving VM2 to it is impossible, as is moving the 3 VMs (because of affinity).
host3 is free, but moving VM2 to it would cause a need to balance again.
The engine keeps trying to balance the cluster, trying a different VM each time.
The engine does not invoke a migration.

Comment 17 errata-xmlrpc 2019-05-08 12:37:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 18 Red Hat Bugzilla 2023-09-18 00:13:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days