Bug 1207255 - soft negative affinity
Summary: soft negative affinity
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: ---
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.1.0-beta
Target Release: 4.1.0
Assignee: Martin Sivák
QA Contact: Artyom
URL:
Whiteboard:
Depends On: 1306263
Blocks:
 
Reported: 2015-03-30 14:09 UTC by Kapetanakis Giannis
Modified: 2017-02-01 14:42 UTC (History)
9 users

Fixed In Version:
Clone Of:
Environment:
CentOS 6.6, ovirt-engine-3.5.1.1-1.el6.noarch
Last Closed: 2017-02-01 14:42:11 UTC
oVirt Team: SLA
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
rule-engine: devel_ack+
mavital: testing_ack+


Attachments
engine log (30.49 KB, text/plain)
2015-06-30 13:19 UTC, Kapetanakis Giannis


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 47772 0 None None None 2016-01-25 10:02:45 UTC

Description Kapetanakis Giannis 2015-03-30 14:09:32 UTC
I've defined a soft negative affinity group for two VMs.

To my understanding, if there are at least 2 nodes available in the cluster, then the VMs SHOULD start on different nodes.

This does not happen. They start on the same node.
If I make it hard then it works.

However, I don't want to make it hard, because if there is only one node available in the cluster then one VM will stay down.

Comment 1 Kapetanakis Giannis 2015-04-22 15:54:32 UTC
Any news on this?

thanks

Comment 2 Martin Sivák 2015-04-30 13:52:38 UTC
Hi,

Soft affinity only modifies one number in the weight table of all relevant hosts, so if something else was considered more important (CPU load, memory, ...), then it is possible for the VMs to start on the same host.

We would need debug logs from the actual run to be certain though.

What you can try is to open the current cluster policy and increase the factor multiplier of the affinity weight module to a bigger number to make it more important. Or you can disable the weight modules for other resources (memory, CPU, ...).
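
To make the weighting arithmetic concrete, here is a minimal Python sketch of the kind of weighted sum described above. It is not the actual engine code; the unit names, the numbers, and the "lower total score wins" convention are all illustrative assumptions.

# Illustrative sketch only: invented policy units, weights and hosts.
# Assume a lower total score = more attractive host.
host_weights = {
    "host_a": {"memory": 2048, "cpu": 50, "affinity": 1},  # affinity 1 = penalty, peer VM already here
    "host_b": {"memory": 4096, "cpu": 50, "affinity": 0},
}
factors = {"memory": 1, "cpu": 1, "affinity": 1}

def score(host):
    return sum(factors[unit] * weight for unit, weight in host_weights[host].items())

# With equal factors the big memory numbers dominate, so host_a wins
# even though the affinity unit penalizes it.
print({h: score(h) for h in host_weights})   # {'host_a': 2099, 'host_b': 4146}

# Raising the affinity factor (as suggested above) makes affinity matter:
factors["affinity"] = 10000
print({h: score(h) for h in host_weights})   # {'host_a': 12098, 'host_b': 4146}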

Comment 3 Kapetanakis Giannis 2015-04-30 14:13:29 UTC
Thanks for the reply.

You've touched another part of oVirt with which I'm not very satisfied, and that's cluster policy. oVirt seems to prefer overloading one node before starting VMs on other nodes...

I've made a copy and applied the evenly_distributed policy with the values below:
OptimalForEvenDistribution: 1
HA: 1
OptimalForHaReservation: 1
VMAffinityGroups: 10

with properties
CpuOverCommitDuration: 2
HighUtilization: 50

They still start on the same node (and not the one with the least load)...

I don't see anything interesting in engine.log. How can I enable DEBUG logging?

Comment 4 Martin Sivák 2015-05-28 10:25:11 UTC
What is the CPU load of your hosts? The evenly distributed policy uses only CPU load to balance VMs. The balancing behaves the way you describe when the VMs are doing mostly nothing and the host sees 0% CPU load.

We are introducing additional memory load factors to balancing in 3.6.
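
A tiny hypothetical illustration of that limitation (numbers invented): an idle VM adds roughly no CPU load, so a host that already runs it looks exactly as good to a CPU-only balancer as an empty host.

# Hypothetical example: CPU-only scheduling cannot tell these two hosts apart.
hosts = {
    "host_a": {"cpu_load": 0, "vms": 1},  # already runs an idle VM
    "host_b": {"cpu_load": 0, "vms": 0},
}
best = min(hosts, key=lambda h: hosts[h]["cpu_load"])  # tie, broken arbitrarily
print(best)  # can easily be host_a, putting a second VM next to the first one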

Comment 5 Kapetanakis Giannis 2015-06-02 07:44:29 UTC
CPU load is ~ 45-55% for all nodes

Comment 6 Martin Sivák 2015-06-02 09:27:06 UTC
Ah, but that seems to be quite a correct distribution.

A new VM that is not doing anything might not contribute to the load and so a second VM can end up on the same host.

Can you attach an engine.log from the oVirt engine machine? And what does the overloading of a single host look like?

Are you mass starting the VMs (multiple VMs at once)?

We need at least some details about the situation and actions to be able to reproduce this as something different from the known limitation of CPU-based scheduling.

Try giving the affinity group an even higher factor (100, for example).

Comment 7 Kapetanakis Giannis 2015-06-30 13:19:10 UTC
Created attachment 1044683 [details]
engine log

Sorry for the delay.

I've just upgraded to 3.5.3 (vdsm also latest) and had the same result.
VMAffinityGroups was set to 100 (in the policy).

Both VMs started on the same node.
CPU load was around 50% on all nodes.

Comment 8 Red Hat Bugzilla Rules Engine 2015-10-19 11:02:00 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 9 Sandro Bonazzola 2015-10-26 12:37:53 UTC
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015.
Please review this bug and, if it is not a blocker, please postpone it to a later release.
All bugs not postponed on the GA release will be automatically re-targeted to

- 3.6.1 if severity >= high
- 4.0 if severity < high

Comment 10 Red Hat Bugzilla Rules Engine 2015-11-16 14:07:42 UTC
This bug is flagged for 3.6, yet the milestone is for 4.0 version, therefore the milestone has been reset.
Please set the correct milestone or add the flag.

Comment 11 Roy Golan 2016-01-24 09:48:37 UTC
(In reply to Kapetanakis Giannis from comment #7)
> Created attachment 1044683 [details]
> engine log
> 
> Sorry for the delay.
> 
> I've just upgraded to 3.5.3 (vdsm also latest) and had the same result.
> VMAffinityGroups was set to 100 (in the policy).
> 
> Both VMs started on the same node.
> CPU load was around 50% on all nodes.


Martin, if the factor is 100 then the soft best-effort affinity isn't working right.

Comment 12 Martin Sivák 2016-01-25 10:02:46 UTC
The issue is that we do not use normalized numbers for the weight policy units, and the memory policy unit uses big numbers (megabytes) that almost always outweigh everything else (affinity uses pretty low numbers).

The solution would be to either normalize the weighting or use rank-based weighting, similar to what I did in the oVirt patchset I just attached.
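
A rough sketch of the difference between the two approaches, with invented numbers (the real change is in the attached patchset, not this code):

# Raw-value weighting vs. rank-based weighting, illustrative numbers only.
# Lower is assumed to be better throughout this sketch.
raw = {
    "memory":   {"host_a": 2048, "host_b": 4096},  # megabyte-scale numbers
    "affinity": {"host_a": 1,    "host_b": 0},     # 1 = soft affinity penalty
}

def raw_score(host):
    # Summing raw values lets the megabyte-scale memory unit drown out affinity.
    return sum(unit[host] for unit in raw.values())

def rank_score(host):
    # Rank the hosts inside each unit (0 = best) and sum the ranks instead,
    # so every unit contributes on the same scale regardless of its raw range.
    total = 0
    for unit in raw.values():
        ordering = sorted(unit, key=unit.get)
        total += ordering.index(host)
    return total

print({h: raw_score(h) for h in ("host_a", "host_b")})   # memory decides: host_a wins
print({h: rank_score(h) for h in ("host_a", "host_b")})  # units balanced: it is a tie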

Comment 13 Martin Sivák 2016-10-17 13:05:19 UTC
Assigning to a placeholder email to stop polluting our lists. We will assign it to a proper person once the bug is prioritized again.

Comment 14 Martin Sivák 2016-12-12 09:17:37 UTC
I am moving this to ON_QA with TestOnly keyword since https://gerrit.ovirt.org/#/c/67707/ is now merged and will make weighting factors behave much more predictably (there is no need for insanely high values now).

The normalization feature is documented here: http://www.ovirt.org/develop/release-management/features/sla/scheduling-weight-normalization/

The change should make it into the next 4.1 build (this week according to the current plan).
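
For readers who want a feel for what normalization buys: one simple scheme is min-max scaling each policy unit's weights to a common 0..1 range before the configured factor is applied. This is only a sketch of the general idea with invented numbers; the exact behaviour shipped in 4.1 is described on the feature page linked above.

# Sketch of per-unit min-max normalization (illustrative only, lower = better).
def normalize(weights):
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1
    return {host: (w - lo) / span for host, w in weights.items()}

memory   = normalize({"host_a": 2048, "host_b": 4096})  # -> {'host_a': 0.0, 'host_b': 1.0}
affinity = normalize({"host_a": 1,    "host_b": 0})     # -> {'host_a': 1.0, 'host_b': 0.0}
factors  = {"memory": 1, "affinity": 3}
score = {h: factors["memory"] * memory[h] + factors["affinity"] * affinity[h]
         for h in ("host_a", "host_b")}
print(score)  # {'host_a': 3.0, 'host_b': 1.0}: a factor of 3 is now enough to prefer host_b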

Comment 15 Artyom 2017-01-03 15:41:47 UTC
Verified on rhevm-4.1.0-0.3.beta2.el7.noarch

I have 3 hosts in my environment; one of the hosts has more memory and CPUs than the others.

1) Create a soft negative affinity group and add vm_1, vm_2 and vm_3
2) Start vm_1 and vm_2
3) Start vm_3 when the VmAffinityGroups factor equals 1 - vm_3 starts on the host with more CPU and memory

Host with more CPU and memory - 353417ed-25a8-4bbd-8940-df481f3b16e3

Ranking selector:
*;factor;246d5c0e-7ad0-4522-95ff-4c7f5069ac8d;;353417ed-25a8-4bbd-8940-df481f3b16e3;;bfa1dbe1-e405-4070-9840-eb4390a42e0a;
98e92667-6161-41fb-b3fa-34f820ccbc4b;1; 2;1;     2;1;     2;1
84e6ddee-ab0d-42dd-82f0-c297779db567;1; 1;1000;  1;1000;  2;1
427aed70-dae3-48ba-8fe9-a902a9d563c8;1; 2;1;     2;1;     2;1
7db4ab05-81ab-42e8-868a-aee2df483edb;1; 1;2;     2;1;     1;2
7f262d70-6cac-11e3-981f-0800200c9a66;1; 2;0;     2;0;     2;0
591cdb81-ba67-45b4-9642-e28f61a97d57;1; 2;10000; 2;10000; 2;10000
4134247a-9c58-4b9a-8593-530bb9e37c59;1; 1;359;   2;1;     0;543

Ranks of the hosts:
246d5c0e-7ad0-4522-95ff-4c7f5069ac8d - 11
353417ed-25a8-4bbd-8940-df481f3b16e3 - 13
bfa1dbe1-e405-4070-9840-eb4390a42e0a - 11

4) Stop vm_3
5) Change the VmAffinityGroups factor to 3
6) Start vm_3 - vm_3 starts on the host bfa1dbe1-e405-4070-9840-eb4390a42e0a because of the affinity

Ranking selector:
*;factor;246d5c0e-7ad0-4522-95ff-4c7f5069ac8d;;353417ed-25a8-4bbd-8940-df481f3b16e3;;bfa1dbe1-e405-4070-9840-eb4390a42e0a;
98e92667-6161-41fb-b3fa-34f820ccbc4b;1; 2;1;     2;1;     2;1
84e6ddee-ab0d-42dd-82f0-c297779db567;3; 1;1000;  1;1000;  2;1
427aed70-dae3-48ba-8fe9-a902a9d563c8;1; 2;1;     2;1;     2;1
7db4ab05-81ab-42e8-868a-aee2df483edb;1; 1;2;     2;1;     1;2
7f262d70-6cac-11e3-981f-0800200c9a66;1; 2;0;     2;0;     2;0
591cdb81-ba67-45b4-9642-e28f61a97d57;1; 2;10000; 2;10000; 2;10000
4134247a-9c58-4b9a-8593-530bb9e37c59;1; 1;359;   2;1;     0;495

Ranks of the hosts:
246d5c0e-7ad0-4522-95ff-4c7f5069ac8d - 11
353417ed-25a8-4bbd-8940-df481f3b16e3 - 13
bfa1dbe1-e405-4070-9840-eb4390a42e0a - 15
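
In case the "Ranking selector" blocks above are hard to read: each data row appears to consist of a policy-unit UUID, its configured factor, and then one "rank;weight" pair per host in the order given in the header line. A rough parsing sketch under that assumption (this is inferred from the log layout, not an official format description):

# Parse a "Ranking selector" data row, assuming the layout
# "<unit-uuid>;<factor>; <rank>;<weight>; <rank>;<weight>; ..." with one
# rank/weight pair per host in header order (an assumption, not a spec).
def parse_selector_row(header, row):
    hosts = [h for h in header.split(";")[2:] if h]
    fields = [f.strip() for f in row.split(";")]
    unit, factor = fields[0], int(fields[1])
    pairs = fields[2:]
    per_host = {host: {"rank": int(pairs[2 * i]), "weight": int(pairs[2 * i + 1])}
                for i, host in enumerate(hosts)}
    return unit, factor, per_host

header = "*;factor;246d5c0e-7ad0-4522-95ff-4c7f5069ac8d;;353417ed-25a8-4bbd-8940-df481f3b16e3;;bfa1dbe1-e405-4070-9840-eb4390a42e0a;"
row = "84e6ddee-ab0d-42dd-82f0-c297779db567;3; 1;1000;  1;1000;  2;1"
print(parse_selector_row(header, row))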

