Description of problem: Think about following scenario HOST-A1 HOST-A2 HOST-B1 HOST-B2 VM-A1 VM-A2 VM-B1 VM-B2 On cluster 'cluster-AB' there is following affinity-group defined. VM-A1, VM-A2 should run on HOST-A1 or HOST-A2 VM-B1, VM-B2 should not run on HOST-A1 or HOST-A2 The affinity-group is defined as soft-rule, to make it possible VM-A* could run on HOST-B* temporary. Let's assume VM-A1 is running on HOST-B1. By rule-set, it's required to move it to one of the hosts HOST-A1 or HOST-A2. Now it's trying to migrate to these hosts. If these hosts, does not have sufficient resources to host the VM-A1, it will be migrated to HOST-B2. This would not be expected. Some time later, the same happen again. By rule-set the VM-A1 should run on HOST-A1 or HOST-A2, but due to, for example memory pressure, the VM can't be scheduled there. Now it's migrated to HOST-B1 again. This is an endless loop and can only be stopped by successful migration to a Host defined in affinity-group. Such scenario could happen, if a HOST needs to switch to maintenance. Version-Release number of selected component (if applicable): ovirt-engine-4.1.6.2-0.1.el7.noarch How reproducible: 100% Steps to Reproduce: Setup an environment to fulfil the scenario described in description Actual results: The VM is migrated in a loop. Expected results: If the affinity-rule can't be applied, the VM should not be migrated and some kind of warning should be visible. Additional info:
Yes, this is theoretically possible. But soft affinity has a very high priority (99x higher than most of the rules) and it should make a second non-complying host a very unattractive destination. We will check the affinity enforcement logic there to make sure.
Based on what I understand, if a migration is started, based on a affinity-rule, the only possible migration-targets should be these, based on information of the affinity-group ruleset. If these target-hosts are not suiteable to whatever reason, the migration/balancing action should be aborted. There is no exception in terms of soft- or hard-affinity groups.
Reproducible on rhvm-4.2.1.4-0.1.el7.noarch Environment with 3 hosts(host_1, host_2, host_3) 1) Create new host to VM soft positive affinity group 2) Add vm_1 and host_1 to the affinity group 3) Start the VM 4) Create CPU load on the VM 5) Put host_1 to maintenance Affinity Rule enforcement manager starts to migrate the VM from host_2 to host_3 and back. You can start to look in the log from the line 2018-01-30 17:53:39,278+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-16) [4ea1366a] EVENT_ID: VM_MIGRATION_START_SYSTEM_INITIATED(67), Migration initiated by system (VM: golden_env_mixed_virtio_0, Source: host_mixed_1, Destination: host_mixed_3, Reason: Host preparing for maintenance).
Created attachment 1388513 [details] engine log
We should probably fix this by ignoring the cpu load of the migrated VM when computing the source load and introducing a new unit that will add a small penalty for needed migration. That should create a hysteresis window and prefer a solution where migration is not necessary.
the bug tested on rhv-release-4.2.3-2-001.noarch and still happens. attached logs (engine, vdsm - host1,2,3) and image of Events and VM after the host 1 is put to maintenance. steps for verification: environment with three hosts - [host_mixed_1, host_mixed_2, host_mixed_3] 1. create on cluster affinity group (add VM and host_mixed_1): <name>group1</name> <hosts_rule> <enabled>true</enabled> <enforcing>false</enforcing> <positive>true</positive> </hosts_rule> <positive>true</positive> <vms_rule> <enabled>true</enabled> <enforcing>false</enforcing> <positive>true</positive> </vms_rule> 2. Run VM on host_mixed_1. 3. Create CPU load on VM with dd command (dd if=/dev/zero of=/dev/null). 4. Put host_mixed_1 to maintenance. Result: the VM is moved to the host_mixed_2 , then starts circulating between host_mixed_2 and host_mixed_3.
Created attachment 1427152 [details] logs
The bug is solved in rhv-release-4.2.3-4-001.noarch. The verification steps in 1535175#c10
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488
BZ<2>Jira Resync
sync2jira