Bug 1080515

Summary: Violating hard constraint positive Affinity rule can prevent fixing the violated rule forever
Product: [Retired] oVirt
Reporter: Martin Sivák <msivak>
Component: ovirt-engine-core
Assignee: Martin Sivák <msivak>
Status: CLOSED CURRENTRELEASE
QA Contact: Pavel Stehlik <pstehlik>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.4
CC: bugs, gchaplik, gklein, iheim, rbalakri, sbonazzo, yeylon
Target Milestone: ---
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: sla
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1080521 (view as bug list)
Environment:
Last Closed: 2014-10-17 12:31:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1080521

Description Martin Sivák 2014-03-25 15:20:16 UTC
Description of problem:

I found out that it is possible to get into a state where it is not possible to fix an affinity group violation.

Take a look at the following code snippets:

// Group all hosts for VMs with positive affinity
for (Guid id : allVmIdsPositive) {
    VM runVm = runningVMsMap.get(id);
    if (runVm != null && runVm.getRunOnVds() != null) {
         acceptableHosts.add(runVm.getRunOnVds());
    }
}

In the snippet above, allVmIdsPositive holds the IDs of the VMs that are supposed to run on the same host (positive affinity).

The acceptableHosts set then ends up containing all hosts that currently run VMs from the allVmIdsPositive list. The assumption here is that it should be either a single host, or an empty set if no other VM from the affinity group is running.

The following snippet checks that:

if (acceptableHosts.isEmpty()) {
    acceptableHosts.addAll(hostMap.keySet());
} else if (acceptableHosts.size() == 1 &&  
           hostMap.containsKey(acceptableHosts.iterator().next())) {
    hasPositiveConstraint = true;
    // Only one host is allowed for positive affinity, i.e. if the VM
    // is contained in a positive affinity group it must run on the host
    // that all the other members are running on; if the VMs are spread
    // across hosts, the affinity rule isn't applied.
} else {
    ...
    return null;
}

Now focus on the last else clause. If for any reason there are VMs that

1) belong to the same positive affinity group
2) run on different hosts

then the filter returns null, meaning no host at all can be used to run the currently scheduled VM.

The same scheduling code is used when the user starts a new VM, when the user tries to migrate a VM manually, and when the load balancing job tries to rebalance the cluster. In all of those cases, any VM belonging to the affinity group is prevented from running or migrating.

Now, how can this happen?

The user is free to change cluster policies, so the Affinity Filter module may be disabled at first.

Version-Release number of selected component (if applicable):

ovirt-engine master as of 25th of Mar 2014, 16:13 CET

Steps to Reproduce:
1. Disable affinity modules from cluster policy
2. Create at least 2 VMs
3. Add all VMs from step 2 to a hard constraint positive affinity group
4. Start the VMs on different hosts
5. Enable the affinity modules in cluster policy
6. Try to fix the issue or watch the cluster as it tries to rebalance

Actual results:

The VMs are stuck on their hosts and no VM from the affinity group can be started.

Expected results:

The VMs are automatically rebalanced to run on a single host.

Additional info:

I believe the logic for hard constraint positive affinity should be changed to:

1) use any host if there is no VM from that group running (already there)
2) leave only the hosts that already have VMs from that group running (instead of filtering out all hosts); a rough sketch follows below
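
For illustration only, here is a minimal sketch of that logic written as a standalone helper over the same data the snippets above work with. The name filterForPositiveAffinity and its signature are mine, not the actual patch; Guid stands for the engine's org.ovirt.engine.core.compat.Guid and the sets correspond to hostMap.keySet() and acceptableHosts:

// Illustrative sketch, not the real VmAffinityFilterPolicyUnit code:
// given the candidate hosts and the hosts that already run members of the
// positive affinity group, return the hosts the scheduled VM may use.
static Set<Guid> filterForPositiveAffinity(Set<Guid> candidateHosts,
                                           Set<Guid> hostsRunningGroupVms) {
    if (hostsRunningGroupVms.isEmpty()) {
        // 1) no group member is running yet -> any candidate host is acceptable
        return candidateHosts;
    }
    // 2) group members are running, possibly spread over several hosts ->
    //    keep only the hosts that already run group members instead of
    //    rejecting every host (the current "return null" branch)
    Set<Guid> result = new HashSet<>(candidateHosts);
    result.retainAll(hostsRunningGroupVms);
    return result;
}

The intent is that, in the reproduction scenario above, new VMs from the group could still be started (on one of the hosts already used by the group) and the balancer could migrate the spread-out VMs towards a single host, instead of every scheduling attempt being blocked by the null return.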

Comment 1 Sandro Bonazzola 2014-03-31 12:13:42 UTC
Moving to 3.4.1 since 3.4.0 has been released

Comment 2 Sandro Bonazzola 2014-05-08 13:56:27 UTC
This is an automated message.

oVirt 3.4.1 has been released.
This issue has been retargeted to 3.5.0 since it has not been marked as a high priority or severity issue; please retarget if needed.

Comment 3 Sandro Bonazzola 2014-10-17 12:31:46 UTC
oVirt 3.5 has been released and should include the fix for this issue.