Created attachment 1249923 [details]
engine log

Description of problem:
An HE VM runs on host_mixed_2 and a non-HE VM runs on host_mixed_1. After I added these two VMs to a hard positive affinity group, the engine started to show failed tasks with the message "Balancing VM HostedEngine", so I assume the affinity enforcement tries to balance the HE VM and migrate it to host_mixed_1.

Version-Release number of selected component (if applicable):
rhevm-4.1.1-0.1.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. HE VM runs on host_2
2. Start the non-HE VM on host_1 (which has more free memory than host_2)
3. Add both VMs to a hard positive affinity group

Actual results:
The engine starts to show failed tasks with the message "Balancing VM HostedEngine"

Expected results:
I believe we need to exclude the HE VM from the set of VMs eligible for migration and instead always balance the cluster with non-HE VMs.

Additional info:
Created attachment 1249924 [details]
screenshot with failed tasks
If we allow affinity for the HE VM, we should try to handle this. That said, we are not going to move the HE VM around. This means we will try to align by making the non-HE VMs join the HE VM. It is up to the admin to ensure there is sufficient capacity for such a deployment; otherwise they will get such errors, since the affinity is broken and cannot be fixed.
Target release should be placed once a package build is known to fix an issue. Since this bug is not MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
> If we allow affinity for the HE VM we should try to handle this.

We should just skip the hosted engine VM when AREM (or balancing) is figuring out which VM needs to be migrated.

> Having that said we're not going to move the HE VM around. This means
> that we will try to align by making the non-HE VMs join the HE VM.

Exactly.
(In reply to Martin Sivák from comment #4)
> > If we allow affinity for the HE VM we should try to handle this.
>
> We should just skip hosted engine VM when AREM (or balancing) is figuring
> out what VM needs to be migrated.

Skip the VM, or dismiss the affinity group it belongs to?
Do everything as usual; only the step that selects the VM to be migrated needs to skip the hosted engine VM.
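The selection step described above could look roughly like this. This is a minimal sketch with hypothetical names (CandidateSelector, chooseVmToMigrate, the Vm holder class are illustrative, not the actual oVirt engine classes); it only shows the idea of filtering the hosted engine VM out of the candidate list:

```java
import java.util.List;
import java.util.Optional;

// Sketch: when AREM picks a VM to migrate in order to restore affinity,
// skip the hosted engine VM so it is never proposed for migration.
class CandidateSelector {
    static class Vm {
        final String id;
        final boolean hostedEngine;

        Vm(String id, boolean hostedEngine) {
            this.id = id;
            this.hostedEngine = hostedEngine;
        }
    }

    // Return the first affinity-violating VM that is not the hosted engine VM.
    static Optional<Vm> chooseVmToMigrate(List<Vm> violatingVms) {
        return violatingVms.stream()
                .filter(vm -> !vm.hostedEngine)
                .findFirst();
    }
}
```

If every violating VM is the hosted engine VM, nothing is selected and the group simply stays violated, which matches the "affinity is broken and cannot be fixed" case mentioned earlier.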
Moving back to MODIFIED (maybe even POST?). This patch is merged on ovirt-4.1 but not on ovirt-4.1.1.z, which is the right branch for the ovirt-4.1.1 milestone. If this bug is still targeted to 4.1.1, please backport the patch and update the bug status; if not, please move the bug milestone to 4.1.2.
Checked on rhevm-4.1.2-0.1.el7.noarch

I am not sure that we have the desired behavior now. In my case:
1) My HE VM (VM name HostedEngine) runs on host_1
2) Start an additional VM (VM name test_vm) on host_2
3) Create a new VM-to-VM affinity group (hard, positive) and add the HostedEngine VM and test_vm to it
4) Wait some time

Result:
Nothing happens; the engine does not try to migrate test_vm to host_1 (which is what I expect). I believe this happens because AREM chooses the HostedEngine VM for migration every time:

private boolean isVmMigrationValid(Cluster cluster, VM candidateVm) {
    if (candidateVm.isHostedEngine()) {
        log.debug("VM {} is NOT a viable candidate for solving the affinity group violation situation"
                + " since its a hosted engine VM.", candidateVm.getId());
        return false;
    }
    ...

I think we just need to skip the HE VM as a candidate VM for migration before we pass the candidate VM to this method. Any thoughts?
You might be right. What happens when you use VM-to-host affinity? HE_VM and test_vm with (hard, positive) to host_1?
I tried it with the VM-to-host affinity, but the result is the same:
1) Have two VMs (HostedEngine and test_vm), both running on host_mixed_1
2) Create a new affinity group:

<affinity_group href="/ovirt-engine/api/clusters/00000002-0002-0002-0002-00000000017a/affinitygroups/d331fa8c-0047-436f-b925-e486aca1bd73" id="d331fa8c-0047-436f-b925-e486aca1bd73">
    <name>test_affinity</name>
    <link href="/ovirt-engine/api/clusters/00000002-0002-0002-0002-00000000017a/affinitygroups/d331fa8c-0047-436f-b925-e486aca1bd73/vms" rel="vms" />
    <enforcing>true</enforcing>
    <hosts_rule>
        <enabled>true</enabled>
        <enforcing>true</enforcing>
        <positive>true</positive>
    </hosts_rule>
    <vms_rule>
        <enabled>false</enabled>
        <enforcing>true</enforcing>
        <positive>false</positive>
    </vms_rule>
    <cluster href="/ovirt-engine/api/clusters/00000002-0002-0002-0002-00000000017a" id="00000002-0002-0002-0002-00000000017a" />
    <hosts>
        <host id="fca7300c-760a-45aa-aa81-a8968fb8abef" /> <!-- host_mixed_2 -->
    </hosts>
    <vms>
        <vm id="065e3895-555d-43fb-a553-4aa537963c4b" /> <!-- HostedEngine -->
        <vm id="17779151-67c0-4218-999e-70338ca39dcf" /> <!-- test_vm -->
    </vms>
</affinity_group>

Both VMs stay on the host host_mixed_1, and in the log I can see only:

2017-04-27 08:28:13,519-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-21) [519fb0a5-c741-490c-8883-45b6940b0e83] EVENT_ID: USER_UPDATED_AFFINITY_GROUP(10,352), Correlation ID: 519fb0a5-c741-490c-8883-45b6940b0e83, Call Stack: null, Custom Event ID: -1, Message: Affinity Group test_affinity was updated. (User: admin@internal-authz)
2017-04-27 08:28:32,580-04 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (DefaultQuartzScheduler5) [] Candidate host 'host_mixed_1' ('19d88d3d-7d9f-4d19-8372-b8f1086f309e') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VmToHostsAffinityGroups' (correlation id: null)
2017-04-27 08:28:32,580-04 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (DefaultQuartzScheduler5) [] Candidate host 'host_mixed_3' ('2c7be430-a43f-4a87-a393-601f999fd3e6') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VmToHostsAffinityGroups' (correlation id: null)
2017-04-27 08:29:32,880-04 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (DefaultQuartzScheduler5) [] Candidate host 'host_mixed_1' ('19d88d3d-7d9f-4d19-8372-b8f1086f309e') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VmToHostsAffinityGroups' (correlation id: null)
2017-04-27 08:29:32,880-04 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (DefaultQuartzScheduler5) [] Candidate host 'host_mixed_3' ('2c7be430-a43f-4a87-a393-601f999fd3e6') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VmToHostsAffinityGroups' (correlation id: null)
I looked a little deeper into the code and found two problematic functions:

1) protected Guid chooseCandidateHostForMigration(...)
returns the best host with VMs for balancing, but when that host runs only the HE VM, there is nothing to balance.

2) private List<Guid> findVmViolatingNegativeAg(...)

...
if (firstAssignment.containsKey(host)) {
    violatingVms.add(vm);
    violatingVms.add(firstAssignment.get(host));
} else {
    firstAssignment.put(host, vm);
}
...

If the first VM seen on the host is a regular VM, the code will not add it to violatingVms at that point, and if the second VM is the HE VM, the HE VM will be added to violatingVms.
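To illustrate the second problem, here is a heavily simplified sketch of the findVmViolatingNegativeAg logic with the hosted engine VM filtered out of the result (the class and method names, and the flat vmToHost map, are hypothetical simplifications, not the actual engine code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: in a negative affinity group, any two VMs sharing a host violate
// the rule. The first VM seen on a host is remembered in firstAssignment;
// on a second hit, both VMs are reported. Since insertion order decides
// which VM lands in firstAssignment, the HE VM can end up in the violating
// list, so it is stripped out at the end to keep it out of migration.
class NegativeAgChecker {
    static List<String> findViolatingVms(Map<String, String> vmToHost,
                                         Set<String> hostedEngineVms) {
        Map<String, String> firstAssignment = new HashMap<>();
        List<String> violating = new ArrayList<>();
        for (Map.Entry<String, String> e : vmToHost.entrySet()) {
            String vm = e.getKey();
            String host = e.getValue();
            if (firstAssignment.containsKey(host)) {
                violating.add(vm);
                violating.add(firstAssignment.get(host));
            } else {
                firstAssignment.put(host, vm);
            }
        }
        // Never propose migrating the hosted engine VM.
        violating.removeAll(hostedEngineVms);
        return violating;
    }
}
```

With test_vm and HostedEngine both on host_1, the raw pair contains the HE VM; the final filter leaves only test_vm as a migratable violator.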
In the new patch, when the hosted engine is in a positive affinity group, the AREM ignores all VMs in the group that are running on the same host as the hosted engine, and picks a VM from another host to be migrated. However, the scheduler may still decide not to migrate that VM, because it is already running on the best host.
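The behavior described in the patch can be sketched as follows. This is a simplified illustration with hypothetical names (PositiveAgBalancer, pickVmToMigrate, and the flat vmToHost map are not the actual engine classes); it only demonstrates skipping group members already co-located with the hosted engine:

```java
import java.util.Map;
import java.util.Optional;

// Sketch: for a positive affinity group containing the hosted engine VM,
// ignore every VM already running on the hosted engine's host and pick a
// candidate from another host to be migrated toward the HE VM.
class PositiveAgBalancer {
    static Optional<String> pickVmToMigrate(Map<String, String> vmToHost,
                                            String hostedEngineVm) {
        String heHost = vmToHost.get(hostedEngineVm);
        return vmToHost.entrySet().stream()
                .filter(e -> !e.getKey().equals(hostedEngineVm))
                .filter(e -> !e.getValue().equals(heHost)) // skip co-located VMs
                .map(Map.Entry::getKey)
                .findFirst();
    }
}
```

When every group member already shares the hosted engine's host, the group is satisfied and no candidate is returned; whether a returned candidate is actually migrated is still up to the scheduler, as noted above.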
Verified on rhevm-4.1.3.2-0.1.el7.noarch