Bug 637281 - Slots don't partition fully when preemption is turned off
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3.2
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-09-24 19:20 UTC by Jon Thomas
Modified: 2011-02-15 12:12 UTC

Fixed In Version: condor-7.4.5-0.2
Doc Type: Bug Fix
Doc Text:
Previously, partitionable slots could not be fully utilized when preemption was disabled and GROUP_DYNAMIC_MACH_CONSTRAINT was set. With this update, trimming of startd ads for preemption is now carried out after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator correctly counts claimed slots. The negotiator now sends the proper slot counts, including claimed slots, to the inner negotiation loops when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled. This allows the negotiation to include already-claimed dynamic slots, so partitionable slots can be fully utilized.
Clone Of:
Environment:
Last Closed: 2011-02-15 12:12:08 UTC
Target Upstream Version:
Embargoed:


Attachments
move trimStartdAds to after constraint eval (2.24 KB, patch)
2010-10-07 18:57 UTC, Jon Thomas
quotaconstraintv2 patch (4.13 KB, patch)
2010-10-08 18:07 UTC, Jon Thomas


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHBA-2011:0217 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update | 2011-02-15 12:10:15 UTC

Description Jon Thomas 2010-09-24 19:20:15 UTC
With preemption off, a partitionable slot doesn't partition fully.

repro/data:

Works:
------

NUM_CPUS = 10
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 5
START =  SlotId > 2
GROUP_DYNAMIC_MACH_CONSTRAINT = State =!= "Owner" && Cpus > 0
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
#NEGOTIATOR_CONSIDER_PREEMPTION = False
#PREEMPT = False
#PREEMPTION_REQUIREMENTS = False


09/24/10 15:02:39 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3

... submit jobs to use slots

09/24/10 15:03:19 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 11 to 6

condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1 LINUX      X86_64 Owner     Idle     0.000   390  0+07:25:17
slot2 LINUX      X86_64 Owner     Idle     0.000   390  0+07:25:18
slot3 LINUX      X86_64 Unclaimed Idle     0.000   388  0+00:00:06
slot3_1@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:06
slot3_2@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:06
slot4 LINUX      X86_64 Unclaimed Idle     0.000   388  0+00:00:07
slot4_1@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:07
slot4_2@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:07
slot5 LINUX      X86_64 Unclaimed Idle     0.000   388  0+00:00:08
slot5_1@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:08
slot5_2@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:08
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX       11     2       6         3       0          0

               Total       11     2       6         3       0          0


Doesn't work:
-------------

NUM_CPUS = 10
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 5
START =  SlotId > 2
GROUP_DYNAMIC_MACH_CONSTRAINT = State =!= "Owner" && Cpus > 0
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NEGOTIATOR_CONSIDER_PREEMPTION = False
PREEMPT = False
PREEMPTION_REQUIREMENTS = False

09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3

... submit jobs to use slots

09/24/10 15:10:53 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 8 to 3

condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1 LINUX      X86_64 Owner     Idle     0.380   390  0+00:00:09
slot2 LINUX      X86_64 Owner     Idle     0.000   390  0+00:00:10
slot3 LINUX      X86_64 Unclaimed Idle     0.000   389  0+00:00:56
slot3_1@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:06
slot4 LINUX      X86_64 Unclaimed Idle     0.000   389  0+00:00:57
slot4_1@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:07
slot5 LINUX      X86_64 Unclaimed Idle     0.000   389  0+00:00:58
slot5_1@localhost. LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:08
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX        8     2       3         3       0          0

               Total        8     2       3         3       0          0
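For reference, the expected number of claimed dynamic slots in the two runs above can be worked out from the config. The sketch below is only an illustration: the helper name is hypothetical, and it assumes every job requests a single CPU, which matches the two dynamic slots per 2-CPU partitionable slot seen in the working run.

```python
# Hypothetical helper, not part of Condor: estimates how many dynamic slots
# the repro config should be able to carve out.
def expected_dynamic_slots(num_slots, cpus_per_slot, first_usable_slot_id, cpus_per_job=1):
    # START = SlotId > 2 means only slots 3..num_slots accept jobs.
    usable = [sid for sid in range(1, num_slots + 1) if sid >= first_usable_slot_id]
    # Each usable partitionable slot can be split into cpus_per_slot / cpus_per_job pieces.
    return len(usable) * (cpus_per_slot // cpus_per_job)


# NUM_SLOTS_TYPE_1 = 5, SLOT_TYPE_1 = cpus=2, START = SlotId > 2
expected = expected_dynamic_slots(num_slots=5, cpus_per_slot=2, first_usable_slot_id=3)
print(expected)  # 6
```

The working run shows 6 claimed dynamic slots, which matches this figure; the run with preemption disabled stops at 3.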

Comment 1 Jon Thomas 2010-09-24 19:41:31 UTC
Additional note: this was on a personal Condor. A customer with more than one node indicated that he thought he saw the same behavior.

BTW, this simpler config shows the same problem.

START = true
NUM_CPUS = 10
SLOT_TYPE_1 = cpus=10
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

Comment 2 Jon Thomas 2010-10-07 18:57:40 UTC
Created attachment 452182
move trimStartdAds to after constraint eval

Comment 3 Jon Thomas 2010-10-08 18:07:40 UTC
Created attachment 452400
quotaconstraintv2 patch

It looks like the bug has been around for as long as GROUP_DYNAMIC_MACH_CONSTRAINT has been used together with trimStartdAds, so it most likely exists upstream too.

The previous patch moved the trimStartdAds to after GROUP_DYNAMIC_MACH_CONSTRAINT, but that means trimStartdAds would not be called if GROUP_NAMES was not defined. New patch ensures trimStartdAds gets called for both cases.

Comment 4 Erik Erlandson 2010-10-25 23:36:17 UTC
I don't understand why this needs fixing -- trimStartdAds removes claimed (or preempting) ads from consideration.   In the event that consider-preemption is off, it seems correct for numDynGroupSlots to not include those slots either, since they won't be up for consideration.

Partitionable slots are never in the Claimed state, so disabling preemption should never affect the number of partitionable slots seen by HFS.

Comment 5 Jon Thomas 2010-10-26 13:54:16 UTC
Hi,

Did you try the repro?

The reason the fix is required is that NegotiateWithGroup expects the group quota to include claimed slots. Therefore, we have to count slots before trimming out the claimed ones.

btw, the actual value of GROUP_DYNAMIC_MACH_CONSTRAINT is mostly irrelevant here. The constraint evaluation returns Length() or some number less than Length(), but this needs to include the claimed slots.
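The ordering issue described above can be sketched in a few lines of Python. This is a toy model only, not the matchmaker.cpp code: the class and function names are hypothetical, trim_claimed merely stands in for what trimStartdAds is described as doing, and the Cpus values are assumptions chosen to resemble the failing run (8 ads, one dynamic slot carved out of each usable partitionable slot).

```python
from dataclasses import dataclass


@dataclass
class SlotAd:
    name: str
    state: str
    cpus: int


def constraint(ad):
    # Mirrors GROUP_DYNAMIC_MACH_CONSTRAINT = State =!= "Owner" && Cpus > 0
    return ad.state != "Owner" and ad.cpus > 0


def trim_claimed(ads):
    # Stand-in for trimStartdAds with preemption off: drop claimed/preempting ads.
    return [ad for ad in ads if ad.state not in ("Claimed", "Preempting")]


def count_trim_first(ads):
    # Old order: trim, then evaluate the constraint on what is left.
    return sum(1 for ad in trim_claimed(ads) if constraint(ad))


def count_then_trim(ads):
    # New order: evaluate the constraint on all ads, then trim.
    count = sum(1 for ad in ads if constraint(ad))
    return count, trim_claimed(ads)


# Assumed ad values resembling the failing run.
ads = (
    [SlotAd(f"slot{i}", "Owner", 2) for i in (1, 2)]
    + [SlotAd(f"slot{i}", "Unclaimed", 1) for i in (3, 4, 5)]
    + [SlotAd(f"slot{i}_1", "Claimed", 1) for i in (3, 4, 5)]
)

print(count_trim_first(ads))    # 3
print(count_then_trim(ads)[0])  # 6
```

With trimming first, the count handed on is 3, the same figure as in the failing log line; counting first gives 6 because the claimed dynamic slots are still included.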

Comment 6 Erik Erlandson 2010-11-19 21:21:57 UTC
Incorporated Jon's fix in devel branch:
V7_4-BZ619557-HFS-tree-structure

Comment 7 Lubos Trilety 2010-11-25 16:22:58 UTC
Where can I find following lines, in what file are they?

09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3

09/24/10 15:10:53 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 8 to 3

Comment 8 Erik Erlandson 2010-11-25 17:10:10 UTC
> Where can I find following lines, in what file are they?
> 
> 09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3

That check is in matchmaker.cpp, line 1000
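The message itself is written to the negotiator log (Comment 12 shows it in /var/log/condor/NegotiatorLog). A small sketch for pulling the before/after counts out of such lines follows; the function name and the regular expression are illustrative and are based only on the message format quoted in this report.

```python
import re

# Pattern derived from the log lines quoted in this report.
PATTERN = re.compile(
    r"GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from (\d+) to (\d+)"
)


def machine_counts(lines):
    """Return (before, after) pairs for every matching log line."""
    counts = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts.append((int(match.group(1)), int(match.group(2))))
    return counts


sample = [
    "09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3",
    "09/24/10 15:10:53 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 8 to 3",
]
print(machine_counts(sample))  # [(5, 3), (8, 3)]
```

To run it against a real log, pass an open file object instead of the sample list, for example `machine_counts(open("/var/log/condor/NegotiatorLog"))`.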

Comment 9 Lubos Trilety 2010-11-26 13:18:44 UTC
Successfully reproduced on:
$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Comment 10 Erik Erlandson 2010-12-21 16:18:38 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
Bug manifests when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is set (accounting groups are in effect).

Consequence:
Partitionable slots will not be fully utilized.

Fix:
Trimming of startd ads for preemption was moved to after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator properly counts claimed slots.

Result:
When preemption is disabled, and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled, the negotiator now sends the proper slot counts including claimed slots to the inner negotiation loops.  This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.

Comment 12 Lubos Trilety 2011-01-10 15:11:35 UTC
Tested with (version):
condor-7.4.5-0.6

Tested on:
RHEL5 x86_64,i386  - passed
RHEL4 x86_64,i386  - passed

/var/log/condor/NegotiatorLog:01/10/11 10:06:05 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 11 to 6

# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@host   LINUX      X86_64 Owner     Idle     0.130  12888  0+00:00:11
slot2@host   LINUX      X86_64 Owner     Idle     0.000  12888  0+00:00:12
slot3@host   LINUX      X86_64 Unclaimed Idle     0.000  12886  0+00:00:05
slot3_1@host LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:05
slot3_2@host LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:06
slot4@host   LINUX      X86_64 Unclaimed Idle     0.000  12886  0+00:00:07
slot4_1@host LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:07
slot4_2@host LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:07
slot5@host   LINUX      X86_64 Unclaimed Idle     0.000  12886  0+00:00:28
slot5_1@host LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:08
slot5_2@host LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:08
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX       11     2       6         3       0          0
               Total       11     2       6         3       0          0


>>> VERIFIED

Comment 13 Florian Nadge 2011-02-09 17:59:52 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,11 +1,2 @@
-Cause:
+Previously, partitionable slots could not be fully utilized when preemption was disabled and GROUP_DYNAMIC_MACH_CONSTRAINT was set.
-Bug manifests when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is set (accounting groups are in effect).
+With this update, trimming of startd ads for preemption is now carried out after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator correctly counts claimed slots. Now, the negotiator sends the proper slot counts including claimed slots to the inner negotiation loops when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled.  This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.
-
-Consequence:
-Partitionable slots will not be fully utilized.
-
-Fix:
-Trimming of startd ads for preemption was moved to after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator properly counts claimed slots.
-
-Result:
-When preemption is disabled, and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled, the negotiator now sends the proper slot counts including claimed slots to the inner negotiation loops.  This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.

Comment 14 errata-xmlrpc 2011-02-15 12:12:08 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html

