1080515 – Violating hard constraint positive Affinity rule can prevent fixing the violated rule forever

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	oVirt
Classification:	Retired
Component:	ovirt-engine-core
Sub Component:
Version:	3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Martin Sivák
QA Contact:	Pavel Stehlik
Docs Contact:
URL:
Whiteboard:	sla
Depends On:
Blocks:	1080521

Reported:	2014-03-25 15:20 UTC by Martin Sivák
Modified:	2016-02-10 19:41 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Clones:	1080521
Environment:
Last Closed:	2014-10-17 12:31:46 UTC
oVirt Team:	SLA
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	26619	0	None	None	None	Never

Description Martin Sivák 2014-03-25 15:20:16 UTC

Description of problem:

It is possible to reach a state in which an affinity group violation can no longer be fixed.

Take a look at the following code snippets:

// Group all hosts for VMs with positive affinity
for (Guid id : allVmIdsPositive) {
    VM runVm = runningVMsMap.get(id);
    if (runVm != null && runVm.getRunOnVds() != null) {
         acceptableHosts.add(runVm.getRunOnVds());
    }
}

In the above snippet, allVmIdsPositive holds the IDs of the VMs that are supposed to run on the same host (positive affinity).

The acceptableHosts set then ends up containing every host that currently runs a VM from the allVmIdsPositive list. The assumption here is that it is either a single host, or an empty set if no other VM from the affinity group is running.

The following snippet checks that:

if (acceptableHosts.isEmpty()) {
    acceptableHosts.addAll(hostMap.keySet());
} else if (acceptableHosts.size() == 1 &&  
           hostMap.containsKey(acceptableHosts.iterator().next())) {
    hasPositiveConstraint = true;
    // Only one host is allowed for positive affinity, i.e. if the VM
    // contained in a positive affinity group he must run on the host
    // that all the other members are running, if the VMs spread across
    // hosts, the affinity rule isn't applied.
} else {
    ...
    return null;
}

Now focus on the last else clause. If for any reason there are VMs that

1) belong to the same positive affinity group
2) run on different hosts

then the filter returns null, meaning that no host can be used to run the VM that is currently being scheduled.
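The dead end can be reproduced outside the engine with a minimal sketch of the two snippets above. This is only an illustration: the class name, the filterHosts signature and the use of plain strings for host and VM IDs are assumptions made for this example (the real code works with Guid and VM objects), and returning the single acceptable host in the middle branch is also an assumption, since the snippet only shows hasPositiveConstraint being set there.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the filter logic quoted above.
class PositiveAffinityCurrentSketch {

    // Returns the hosts allowed for the scheduled VM, or null if none is allowed.
    static Set<String> filterHosts(Set<String> candidateHosts,
                                   List<String> groupVmIds,
                                   Map<String, String> runningVmToHost) {
        // Collect every host that currently runs a VM from the group.
        Set<String> acceptableHosts = new HashSet<>();
        for (String vmId : groupVmIds) {
            String host = runningVmToHost.get(vmId);
            if (host != null) {
                acceptableHosts.add(host);
            }
        }

        if (acceptableHosts.isEmpty()) {
            // No group member is running: any candidate host is fine.
            return new HashSet<>(candidateHosts);
        } else if (acceptableHosts.size() == 1
                && candidateHosts.contains(acceptableHosts.iterator().next())) {
            // Assumption: the single host becomes the only allowed target.
            return acceptableHosts;
        } else {
            // Group members are spread over several hosts: nothing is allowed.
            return null;
        }
    }

    public static void main(String[] args) {
        Set<String> hosts = Set.of("hostA", "hostB");
        List<String> group = List.of("vm1", "vm2", "vm3");
        // vm1 and vm2 already violate the rule by running on different hosts.
        Map<String, String> running = Map.of("vm1", "hostA", "vm2", "hostB");
        System.out.println(filterHosts(hosts, group, running)); // prints "null"
    }
}
```

With vm1 on hostA and vm2 on hostB, the call returns null for every VM of the group, so vm3 cannot be started and neither vm1 nor vm2 can be migrated to join the other.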

The same scheduling algorithm is used when the user starts a new VM, when the user tries to migrate a VM manually, and when the load balancing job tries to rebalance the cluster. In all of those cases, every VM belonging to the affinity group is prevented from running or migrating.

Now, how can this happen?

The user is free to change cluster policies, so the Affinity Filter module can be disabled while the VMs are started and enabled only afterwards.

Version-Release number of selected component (if applicable):

ovirt-engine master as of 25th of Mar 2014, 16:13 CET

Steps to Reproduce:
1. Disable the affinity modules in the cluster policy
2. Create at least 2 VMs
3. Add all VMs from step 2 to a hard constraint positive affinity group
4. Start the VMs on different hosts
5. Enable the affinity modules in the cluster policy
6. Try to fix the issue or watch the cluster as it tries to rebalance

Actual results:

The VMs are stuck on their hosts and no VM from the affinity group can be started.

Expected results:

The VMs are automatically rebalanced to run on a single host.

Additional info:

I believe the logic for hard constraint positive affinity should be changed to:

1) use any host if no VM from that group is running (already there)
2) otherwise, keep only the hosts that already run VMs from that group (instead of filtering out all hosts)
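The two rules above could look roughly like the following sketch. It is not the change that was merged for this bug; the class name, the filterHosts signature and the string IDs are assumptions made for this example, as is the choice to return an empty set (rather than null) when none of the group's hosts is among the candidates.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed hard positive affinity filtering.
class PositiveAffinityProposedSketch {

    static Set<String> filterHosts(Set<String> candidateHosts,
                                   List<String> groupVmIds,
                                   Map<String, String> runningVmToHost) {
        // Collect every host that currently runs a VM from the group.
        Set<String> hostsWithGroupVms = new HashSet<>();
        for (String vmId : groupVmIds) {
            String host = runningVmToHost.get(vmId);
            if (host != null) {
                hostsWithGroupVms.add(host);
            }
        }

        // Rule 1: no group member is running, so any candidate host may be used.
        if (hostsWithGroupVms.isEmpty()) {
            return new HashSet<>(candidateHosts);
        }

        // Rule 2: keep only the candidates that already run VMs from the group,
        // even when the group is currently spread over several hosts.
        Set<String> result = new HashSet<>(candidateHosts);
        result.retainAll(hostsWithGroupVms);
        return result;
    }

    public static void main(String[] args) {
        Set<String> hosts = Set.of("hostA", "hostB", "hostC");
        List<String> group = List.of("vm1", "vm2", "vm3");
        Map<String, String> running = Map.of("vm1", "hostA", "vm2", "hostB");
        // hostC is filtered out; hostA and hostB stay available.
        System.out.println(filterHosts(hosts, group, running));
    }
}
```

With this behavior a violated group still leaves hostA and hostB as valid targets, so a manual migration or the load balancer can move the VMs together again, while hosts that run no member of the group remain excluded.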

Comment 1 Sandro Bonazzola 2014-03-31 12:13:42 UTC

Moving to 3.4.1 since 3.4.0 has been released

Comment 2 Sandro Bonazzola 2014-05-08 13:56:27 UTC

This is an automated message.

oVirt 3.4.1 has been released.
This issue has been retargeted to 3.5.0 since it has not been marked as a high priority or high severity issue. Please retarget if needed.

Comment 3 Sandro Bonazzola 2014-10-17 12:31:46 UTC

oVirt 3.5 has been released and should include the fix for this issue.
