1535175 – positive and negative affinity-groups for splitting hosts into two groups could force a migration loop of assigned VMs

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.1.6
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-4.2.4
Target Release:	---
Assignee:	Andrej Krejcir
QA Contact:	Polina
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-01-16 18:52 UTC by Steffen Froemer
Modified:	2021-03-11 19:45 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Environment:
Last Closed:	2018-05-15 17:47:24 UTC
oVirt Team:	SLA
Target Upstream Version:
Embargoed:
Flags:	lsvaty: testing_plan_complete-

Attachments
engine log (7.56 MB, text/plain) 2018-01-30 16:04 UTC, Artyom	no flags
logs (1.23 MB, application/x-gzip) 2018-04-26 11:28 UTC, Polina	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2018:1488	None	None	None	2018-05-15 17:48:17 UTC
oVirt gerrit	87411	master	POST	core: When balancing, subtract VM memory and CPU load from current host	2018-02-20 14:54:50 UTC
oVirt gerrit	90215	master	MERGED	core: Change scoring in VmToHostAffinityWeightPolicyUnit	2018-04-13 09:38:40 UTC
oVirt gerrit	90216	ovirt-engine-4.2	MERGED	core: When balancing, subtract VM memory and CPU load from current host	2018-04-15 12:28:11 UTC
oVirt gerrit	90250	ovirt-engine-4.2	MERGED	core: Change scoring in VmToHostAffinityWeightPolicyUnit	2018-04-15 12:28:16 UTC

Description Steffen Froemer 2018-01-16 18:52:41 UTC

Description of problem:
Consider the following scenario:

HOST-A1
HOST-A2
HOST-B1
HOST-B2

VM-A1
VM-A2
VM-B1
VM-B2

On cluster 'cluster-AB', the following affinity groups are defined:

VM-A1, VM-A2 should run on HOST-A1 or HOST-A2
VM-B1, VM-B2 should not run on HOST-A1 or HOST-A2

The affinity groups are defined as soft rules, so that VM-A* can temporarily run on HOST-B*.

Let's assume VM-A1 is running on HOST-B1. According to the rule set, it has to be moved to either HOST-A1 or HOST-A2.
The engine now tries to migrate it to one of these hosts. If neither host has sufficient resources to run VM-A1, the VM is migrated to HOST-B2 instead. This is not expected.

Some time later, the same thing happens again. According to the rule set, VM-A1 should run on HOST-A1 or HOST-A2, but it can't be scheduled there, for example because of memory pressure. Now it is migrated back to HOST-B1.

This is an endless loop, and it only stops once the VM is successfully migrated to a host defined in the affinity group.

Such a scenario can occur when a host needs to be switched to maintenance.
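
The loop described above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the engine's scheduling code: the `Host` class, `pick_target` and `simulate` are hypothetical names, free memory is treated as static, and the enforcement step is assumed to never keep the VM on its current host.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_mem_mb: int
    in_affinity_group: bool


def pick_target(hosts, current, vm_mem_mb):
    """Toy enforcement step: prefer affinity hosts, else any other host that fits.

    ASSUMPTION: the current host is never considered as a destination.
    That assumption is what produces the ping-pong in this model.
    """
    candidates = [
        h for h in hosts
        if h.name != current and h.free_mem_mb >= vm_mem_mb
    ]
    if not candidates:
        return None
    # Affinity hosts sort first; ties are broken by host name.
    candidates.sort(key=lambda h: (not h.in_affinity_group, h.name))
    return candidates[0].name


def simulate(hosts, start, vm_mem_mb, rounds):
    """Run the enforcement step repeatedly and record where the VM lands."""
    placement = [start]
    for _ in range(rounds):
        target = pick_target(hosts, placement[-1], vm_mem_mb)
        if target is None:
            break
        placement.append(target)
    return placement


# HOST-A* are the preferred hosts but lack memory (hypothetical numbers).
hosts = [
    Host("HOST-A1", free_mem_mb=512, in_affinity_group=True),
    Host("HOST-A2", free_mem_mb=512, in_affinity_group=True),
    Host("HOST-B1", free_mem_mb=8192, in_affinity_group=False),
    Host("HOST-B2", free_mem_mb=8192, in_affinity_group=False),
]
history = simulate(hosts, start="HOST-B1", vm_mem_mb=4096, rounds=4)
print(" -> ".join(history))
```

In this model the VM alternates between HOST-B1 and HOST-B2 for as long as HOST-A1 and HOST-A2 stay short on memory. Once one of them has room, `pick_target` selects it and the VM settles there.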


Version-Release number of selected component (if applicable):
ovirt-engine-4.1.6.2-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
Set up an environment that matches the scenario in the description.

Actual results:
The VM is migrated in a loop.

Expected results:
If the affinity-rule can't be applied, the VM should not be migrated and some kind of warning should be visible.

Additional info:

Comment 2 Martin Sivák 2018-01-17 13:40:45 UTC

Yes, this is theoretically possible.

But soft affinity has a very high priority (99x higher than most of the rules) and it should make a second non-complying host a very unattractive destination.
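
To illustrate the weighting argument, here is a minimal sketch of a weighted host score. Only the factor of 99 comes from the comment above. The unit scores, the "lower is better" convention and the `total_weight` helper are assumptions made for the example, not the engine's actual policy units.

```python
AFFINITY_FACTOR = 99  # from the comment: "99x higher than most of the rules"
DEFAULT_FACTOR = 1    # ASSUMPTION: factor of an ordinary weight rule


def total_weight(affinity_penalty, load_penalty):
    """Combine two hypothetical unit scores; a lower total is a better host."""
    return AFFINITY_FACTOR * affinity_penalty + DEFAULT_FACTOR * load_penalty


# Hypothetical unit scores: affinity penalty 0 = complies, 1 = violates.
weights = {
    "HOST-A1": total_weight(affinity_penalty=0, load_penalty=90),
    "HOST-B1": total_weight(affinity_penalty=1, load_penalty=10),
    "HOST-B2": total_weight(affinity_penalty=1, load_penalty=5),
}
best = min(weights, key=weights.get)
print(best, weights)
```

With these numbers the complying but heavily loaded HOST-A1 still wins. In this toy model, however, HOST-B1 and HOST-B2 carry the same affinity penalty, so if HOST-A1 is filtered out, the choice between the two non-complying hosts is decided by the low-priority load score alone.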

We will check the affinity enforcement logic there to make sure.

Comment 3 Steffen Froemer 2018-01-18 14:46:30 UTC

As I understand it, if a migration is started because of an affinity rule, the only possible migration targets should be the hosts allowed by the affinity-group rule set.
If these target hosts are not suitable for whatever reason, the migration/balancing action should be aborted.
There should be no exception here for soft or hard affinity groups.

Comment 4 Artyom 2018-01-30 16:03:23 UTC

Reproducible on rhvm-4.2.1.4-0.1.el7.noarch

Environment with 3 hosts (host_1, host_2, host_3)

1) Create a new soft positive VM-to-host affinity group
2) Add vm_1 and host_1 to the affinity group
3) Start the VM
4) Create CPU load on the VM
5) Put host_1 into maintenance

The affinity rule enforcement manager starts to migrate the VM from host_2 to host_3 and back.

In the log, start reading from this line:
2018-01-30 17:53:39,278+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-16) [4ea1366a] EVENT_ID: VM_MIGRATION_START_SYSTEM_INITIATED(67), Migration initiated by system (VM: golden_env_mixed_virtio_0, Source: host_mixed_1, Destination: host_mixed_3, Reason: Host preparing for maintenance).

Comment 5 Artyom 2018-01-30 16:04:11 UTC

Created attachment 1388513 [details]
engine log

Comment 8 Martin Sivák 2018-02-06 16:41:38 UTC

We should probably fix this by ignoring the cpu load of the migrated VM when computing the source load and introducing a new unit that will add a small penalty for needed migration. That should create a hysteresis window and prefer a solution where migration is not necessary.
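
The idea in this comment can be sketched as follows. It is a toy model, not the merged patch: `MIGRATION_PENALTY`, the load numbers and all function names are hypothetical, and a host's score is simply its CPU load, lower being better.

```python
MIGRATION_PENALTY = 5  # ASSUMPTION: small fixed cost for moving the VM


def observed_load(base_load, vm_load, current):
    """CPU load as measured: the current host includes the VM's own load."""
    return {
        host: load + (vm_load if host == current else 0)
        for host, load in base_load.items()
    }


def choose_host(base_load, vm_load, current, fixed):
    """Pick the host with the lowest score for the VM."""
    scores = {}
    for host, load in observed_load(base_load, vm_load, current).items():
        if not fixed:
            scores[host] = load  # naive: current host is charged for the VM
        elif host == current:
            scores[host] = load - vm_load  # ignore the VM's own load
        else:
            scores[host] = load + MIGRATION_PENALTY  # moving has a cost
    # sorted() makes tie-breaking deterministic (alphabetical).
    return min(sorted(scores), key=scores.get)


def run(base_load, vm_load, start, rounds, fixed):
    """Apply the balancing decision repeatedly and record the placements."""
    path = [start]
    for _ in range(rounds):
        path.append(choose_host(base_load, vm_load, path[-1], fixed))
    return path


base_load = {"host_2": 20, "host_3": 22}  # load without the VM (hypothetical)
naive = run(base_load, vm_load=50, start="host_2", rounds=4, fixed=False)
fixed = run(base_load, vm_load=50, start="host_2", rounds=4, fixed=True)
print("naive:", naive)
print("fixed:", fixed)
```

In the naive variant the VM's own load makes whichever host it runs on look worse than the other, so it bounces. In the fixed variant the difference between the two hosts (2) is smaller than the penalty (5), so the VM stays where it is regardless of which host it starts on. That is the hysteresis window the comment refers to.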

Comment 10 Polina 2018-04-26 11:27:41 UTC

Tested on rhv-release-4.2.3-2-001.noarch; the bug still happens.

Attached are the logs (engine, vdsm for hosts 1, 2 and 3) and an image of the Events and the VM after host 1 is put into maintenance.

Steps for verification:
Environment with three hosts - [host_mixed_1, host_mixed_2, host_mixed_3]
1. Create an affinity group on the cluster (add the VM and host_mixed_1):
        <name>group1</name>
        <hosts_rule>
            <enabled>true</enabled>
            <enforcing>false</enforcing>
            <positive>true</positive>
        </hosts_rule>
        <positive>true</positive>
        <vms_rule>
            <enabled>true</enabled>
            <enforcing>false</enforcing>
            <positive>true</positive>
        </vms_rule>

2. Run the VM on host_mixed_1.
3. Create CPU load on the VM with the dd command (dd if=/dev/zero of=/dev/null).
4. Put host_mixed_1 into maintenance.

Result: the VM is moved to host_mixed_2, then starts circulating between host_mixed_2 and host_mixed_3.
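
For scripting the verification, the affinity group definition from step 1 could be generated along these lines. This is only a sketch: the `affinity_group` root element and the helper names are assumptions, and the VM and host membership is left out because the comment does not show how it was specified.

```python
import xml.etree.ElementTree as ET


def add_rule(parent, tag, enabled, enforcing, positive):
    """Append a <hosts_rule>/<vms_rule>-style element with three boolean flags."""
    node = ET.SubElement(parent, tag)
    for name, value in (
        ("enabled", enabled),
        ("enforcing", enforcing),
        ("positive", positive),
    ):
        ET.SubElement(node, name).text = "true" if value else "false"
    return node


def build_affinity_group(name):
    """Build a soft (non-enforcing) positive group like the one in step 1."""
    group = ET.Element("affinity_group")  # ASSUMPTION: root element name
    ET.SubElement(group, "name").text = name
    add_rule(group, "hosts_rule", enabled=True, enforcing=False, positive=True)
    ET.SubElement(group, "positive").text = "true"
    add_rule(group, "vms_rule", enabled=True, enforcing=False, positive=True)
    return group


body = ET.tostring(build_affinity_group("group1"), encoding="unicode")
print(body)
```

Setting `enforcing` to `false` in both rules is what makes the group a soft rule, which is the case this bug is about.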

Comment 11 Polina 2018-04-26 11:28:13 UTC

Created attachment 1427152 [details]
logs

Comment 12 Polina 2018-04-30 07:00:59 UTC

The bug is fixed in rhv-release-4.2.3-4-001.noarch.
The verification steps are in comment 10 (1535175#c10).

Comment 16 errata-xmlrpc 2018-05-15 17:47:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 17 Franta Kust 2019-05-16 13:05:14 UTC

BZ<2>Jira Resync

Comment 18 Daniel Gur 2019-08-28 13:12:43 UTC

sync2jira

Comment 19 Daniel Gur 2019-08-28 13:16:56 UTC

sync2jira
