Bug 794818 - RFE: Support multiple claims from p-slots in negotiation loop
Summary: RFE: Support multiple claims from p-slots in negotiation loop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.2
Hardware: Unspecified
OS: Unspecified
high
low
Target Milestone: 2.4
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 995360
 
Reported: 2012-02-17 17:45 UTC by Erik Erlandson
Modified: 2013-10-01 16:50 UTC (History)
6 users

Fixed In Version: condor-7.8.9-0.5
Doc Type: Enhancement
Doc Text:
This release introduces support for configurable consumption policies on partitionable slots: the quantity of each resource asset (e.g. cpus, memory, disk) consumed by a match is determined by evaluating a configurable expression on the slot, usually some function of the amount requested by the matching job. These consumption policies allow a partitionable slot to emulate different resource allocation behaviors depending on the customer's use cases. They also enable the negotiator to make multiple matches against each partitionable slot per negotiation cycle, improving performance and resource utilization. Possible behaviors include emulation of static slots, support for sub-core job loads, and memory-centric allocation policies rather than the legacy cpu-centric one. Each execute node in an HTCondor pool can be configured with one or more consumption policies, allowing heterogeneous resource allocation strategies within a single pool.
Clone Of:
: 995360
Environment:
Last Closed: 2013-10-01 16:50:54 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Condor 3435 0 None None None Never
Red Hat Product Errata RHSA-2013:1294 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Grid 2.4 security update 2013-10-01 20:35:50 UTC

Description Erik Erlandson 2012-02-17 17:45:38 UTC
Description of problem:
The negotiator only matches one job to a slot per cycle, even if that slot is partitionable and could service multiple jobs.  This slows down loading of pools that use p-slots, since filling them requires multiple negotiation cycles.

Expected results:
Moving the "multi-claim" logic of
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2790

from the schedd to the negotiator would allow this functionality to work properly with resource limits such as group quotas and concurrency limits:
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2826
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2818

It would also set the stage for supporting both "maximum-spread" and "minimum-spread" policies for distributing jobs across machines (currently only maximum-spread is possible):
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2681,31

Comment 1 Erik Erlandson 2012-02-17 17:55:11 UTC
Another quirk of the negotiator's treatment of p-slots is that a submitter is charged for the entire weight of a p-slot, even though the match will most likely use only a fraction of the resources on that slot.

I think it would be a good idea to add some return information to the claim protocol: the startd can include its "charge" for the claim against a p-slot, representing its assessment of what fraction of the resources was used.

In the case of a multi-resource p-slot, I propose that the charge be the maximum of "d-slot-resource(X)/p-slot-total-resource(X)" over all resources X listed in the p-slot definition.  For example, if a job requests 1/8 of the cpus but 1/2 of the memory, the match will cost 1/2 of the p-slot's weight, since that is the larger fraction.
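
For illustration only (this charging scheme was a proposal at the time, not an existing config knob, and the attribute names below are made up), the per-match charge could be sketched as a ClassAd-style expression:
--------------------------------------------------------------
# hypothetical: charge = largest consumed fraction across the p-slot's resources
CpuFraction = RequestCpus / TotalSlotCpus
MemFraction = RequestMemory / TotalSlotMemory
MatchCharge = ifThenElse(CpuFraction > MemFraction, CpuFraction, MemFraction)
# e.g. 1/8 of the cpus and 1/2 of the memory => max(0.125, 0.5) = 0.5 of the slot weight
--------------------------------------------------------------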

Comment 4 Martin Kudlej 2013-07-18 13:13:49 UTC
Which changes mentioned in comment #1 and comment #2 are implemented for this BZ? Could you please provide an example of how to test:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2790
and
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2826

Is it possible to test https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2681 in the vanilla universe? If so, could you please provide an example of that test?

Comment 5 Erik Erlandson 2013-08-13 22:46:18 UTC
A "basic" test of consumption policies

configuration:
--------------------------------------------------------------
# spoof some cores
NUM_CPUS = 20

# declare an extensible resource for a claim-based consumption policy
MACHINE_RESOURCE_tokens = 3

# startd-wide consumption policy config
# defaults for cpus/memory/disk consumption
CONSUMPTION_POLICY = True

# startd defaults, can be overridden on a per-slot-type basis
CONSUMPTION_CPUS = ifthenelse(target.Cpus isnt undefined, quantize(target.Cpus, {1}), 1)
CONSUMPTION_MEMORY = ifthenelse(target.Memory isnt undefined, quantize(target.Memory, {1}), 1)
CONSUMPTION_DISK = ifthenelse(target.Disk isnt undefined, quantize(target.Disk, {100}), 100)
CONSUMPTION_TOKENS = ifthenelse(target.Tokens isnt undefined, target.Tokens, 0)

# defaults, can be overridden on a per-slot-type basis
SLOT_WEIGHT = Cpus
NUM_CLAIMS = 5

# slot type 1: a traditional cpu-centric policy
SLOT_TYPE_1 = cpus=5,memory=100,disk=25%,tokens=0
SLOT_TYPE_1_PARTITIONABLE = True
SLOT_TYPE_1_NUM_CLAIMS = 10
NUM_SLOTS_TYPE_1 = 1

# slot type 2: will demo/test a memory-centric policy
SLOT_TYPE_2 = cpus=5,memory=100,disk=25%,tokens=0
SLOT_TYPE_2_PARTITIONABLE = True
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2_CONSUMPTION_MEMORY = quantize(target.RequestMemory, {25})
SLOT_TYPE_2_SLOT_WEIGHT = floor(Memory / 25)

# slot type 3: a claim-based policy
# (not tied to resource such as cpu, mem, etc)
SLOT_TYPE_3 = cpus=5,memory=100,disk=25%,tokens=3
SLOT_TYPE_3_PARTITIONABLE = True
NUM_SLOTS_TYPE_3 = 1
# always consume 1 token, and none of anything else
SLOT_TYPE_3_CONSUMPTION_TOKENS = 1
SLOT_TYPE_3_CONSUMPTION_CPUS = 0
SLOT_TYPE_3_CONSUMPTION_MEMORY = 0
SLOT_TYPE_3_CONSUMPTION_DISK = 0
# define cost in terms of available tokens for serving jobs
SLOT_TYPE_3_SLOT_WEIGHT = Tokens

# slot type 4: a static-slot policy
# (always consume all resources)
SLOT_TYPE_4 = cpus=5,memory=100,disk=25%,tokens=0
SLOT_TYPE_4_PARTITIONABLE = True
NUM_SLOTS_TYPE_4 = 1
# consume all resources - emulate static slot
SLOT_TYPE_4_CONSUMPTION_CPUS = Cpus
SLOT_TYPE_4_CONSUMPTION_MEMORY = Memory
SLOT_TYPE_4_CONSUMPTION_DISK = Disk
SLOT_TYPE_4_CONSUMPTION_TOKENS = Tokens

# turn this off to demonstrate that consumption policy will handle this kind of logic
MUST_MODIFY_REQUEST_EXPRS = False

# turn off schedd-side resource splitting since we are demonstrating neg-side alternative
CLAIM_PARTITIONABLE_LEFTOVERS = False

# keep slot weights enabled for match costing
NEGOTIATOR_USE_SLOT_WEIGHTS = True

# for simplicity, turn off preemption, caching, worklife
CLAIM_WORKLIFE=0
MAXJOBRETIREMENTTIME = 3600
PREEMPT = False
RANK = 0
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False
NEGOTIATOR_MATCHLIST_CACHING = False

# verbose logging
ALL_DEBUG = D_FULLDEBUG |  D_MACHINE

# reduce daemon update latencies
NEGOTIATOR_INTERVAL = 30
SCHEDD_INTERVAL = 15

# This should induce SLOT_TYPE_1 and SLOT_TYPE_4 to go into owner state when
# their cpu assets are exhausted, which tests claim logic fix from #3792
START = (Cpus > 0) || (SlotType is "Dynamic")
--------------------------------------------------------------


After the four p-slots spin up you should see this:
--------------------------------------------------------------
$ condor_status -format "%d" SlotTypeID -format " %d" SlotWeight -format " %d" Cpus -format " %d" Memory -format " %d\n" Tokens
1 5 5 100 0
2 4 5 100 0
3 3 5 100 3
4 5 5 100 0
--------------------------------------------------------------


Submit some jobs, but BEFORE you submit, set up a watch on negotiation:
$ tail -f NegotiatorLog | grep -e 'Started Negotiation' -e 'Finished Negotiation' -e 'Successfully matched with'

The submit file for the 13 jobs looks like this:
--------------------------------------------------------------
universe = vanilla
executable = /bin/sleep
arguments = 300
request_cpus = 1
request_memory = 1
request_disk = 1
notification = never
queue 13
--------------------------------------------------------------

First, you should see the matchmaker fill the slots like so.  Each slot (slot1, slot2, ...) should negotiate *consecutively*; that is, slot1 negotiates until it is full, then slot2, and so on:

---------------------------------------------------------------------------
$ tail -f NegotiatorLog | grep -e 'Started Negotiation' -e 'Finished Negotiation' -e 'Successfully matched with'
08/13/13 15:30:17 ---------- Finished Negotiation Cycle ----------
08/13/13 15:30:37 ---------- Started Negotiation Cycle ----------
08/13/13 15:30:37       Successfully matched with slot1@localhost
08/13/13 15:30:38       Successfully matched with slot1@localhost
08/13/13 15:30:38       Successfully matched with slot1@localhost
08/13/13 15:30:38       Successfully matched with slot1@localhost
08/13/13 15:30:38       Successfully matched with slot1@localhost
08/13/13 15:30:38       Successfully matched with slot2@localhost
08/13/13 15:30:38       Successfully matched with slot2@localhost
08/13/13 15:30:38       Successfully matched with slot2@localhost
08/13/13 15:30:38       Successfully matched with slot2@localhost
08/13/13 15:30:38       Successfully matched with slot3@localhost
08/13/13 15:30:39       Successfully matched with slot3@localhost
08/13/13 15:30:39       Successfully matched with slot3@localhost
08/13/13 15:30:39       Successfully matched with slot4@localhost
08/13/13 15:30:39 ---------- Finished Negotiation Cycle ----------
---------------------------------------------------------------------------


Now re-examine the slots.  First, the p-slots should look as follows.  Note that the remaining SlotWeight values are all zero: slot type 1 has no Cpus left, slot type 2 has no Memory left, slot type 3 has no Tokens left, and slot type 4 has nothing left at all:
---------------------------------------------------------------
$ condor_status -constraint "SlotTypeID > 0" -format "%d" SlotTypeID -format " %d" SlotWeight -format " %d" Cpus -format " %d" Memory -format " %d\n" Tokens
1 0 0 95 0
2 0 1 0 0
3 0 5 100 0
4 0 0 0 0
---------------------------------------------------------------

Now examine the corresponding d-slots.  There should be five d-slots of type (-1), each with one Cpu; four d-slots of type (-2), each with Memory=25; three d-slots of type (-3), each with Tokens=1; and finally one d-slot of type (-4), with 5 Cpus, 100 Memory, and zero Tokens.  Note that for type (-4) the weight is 5:
---------------------------------------------------------------
$ condor_status -constraint "SlotTypeID < 0" -format "%d" SlotTypeID -format " %d" SlotWeight -format " %d" Cpus -format " %d" Memory -format " %d\n" Tokens | sort | uniq -c
      5 -1 1 1 1 0
      4 -2 1 1 25 0
      3 -3 1 0 0 1
      1 -4 5 5 100 0
---------------------------------------------------------------

Comment 6 Erik Erlandson 2013-08-13 22:55:16 UTC
Regarding https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2790

We regard consumption policies as an alternative to CLAIM_PARTITIONABLE_LEFTOVERS (#2790); however, consumption policies and #2790 are expected to be compatible in a mixed-pool environment.  That is, you should be able to mix startds with consumption policies and startds without them, and also set CLAIM_PARTITIONABLE_LEFTOVERS=True.

The following test scenario should hold:

-----------------------------------------------------------------------------
# spoof some cores
NUM_CPUS = 5

STARTD.ST1.STARTD_LOG = $(LOG)/Startd_1_Log
STARTD.ST1.STARTD_NAME = st1
STARTD.ST1.ADDRESS_FILE = $(LOG)/.startd_1_address
STARTD_ST1_ARGS = -f -local-name ST1
STARTD_ST1 = $(STARTD)

STARTD.ST2.STARTD_LOG = $(LOG)/Startd_2_Log
STARTD.ST2.STARTD_NAME = st2
STARTD.ST2.ADDRESS_FILE = $(LOG)/.startd_2_address
STARTD_ST2_ARGS = -f -local-name ST2
STARTD_ST2 = $(STARTD)

# master-only procd should work
USE_PROCD = FALSE
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD_ST1, STARTD_ST2

# configure an aggregate resource (p-slot) to consume
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True
# declare multiple claims for negotiator to use
# may also use global: NUM_CLAIMS
SLOT_TYPE_1_NUM_CLAIMS = 20
NUM_SLOTS_TYPE_1 = 1

# turn on schedd-side claim splitting to test with a consumption policy
CLAIM_PARTITIONABLE_LEFTOVERS = True

# turn this off to demonstrate that consumption policy will handle this kind of logic
MUST_MODIFY_REQUEST_EXPRS = False

# configure a consumption policy.   This policy is modeled on
# current 'modify-request-exprs' defaults:
# "my" is resource ad, "target" is job ad
# startd-wide consumption policy config
# defaults for cpus/memory/disk consumption
STARTD.ST2.CONSUMPTION_POLICY = True

# a consumption policy with a fixed consumption per match
STARTD.ST2.CONSUMPTION_CPUS = 2
STARTD.ST2.CONSUMPTION_MEMORY = 32
STARTD.ST2.CONSUMPTION_DISK = 128

# keep slot weights enabled for match costing
NEGOTIATOR_USE_SLOT_WEIGHTS = True

# weight used to derive match cost: W(before-consumption) - W(after-consumption)
SlotWeight = Cpus

# for simplicity, turn off preemption, caching, worklife
CLAIM_WORKLIFE=0
MAXJOBRETIREMENTTIME = 3600
PREEMPT = False
RANK = 0
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False
NEGOTIATOR_MATCHLIST_CACHING = False

# verbose logging
ALL_DEBUG = D_FULLDEBUG

NEGOTIATOR_INTERVAL = 300
SCHEDD_INTERVAL	= 15
-----------------------------------------------------------------------------


Spin up the pool and verify that there are two p-slots:

-----------------------------------------------------------------------------
$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@st1@localhos LINUX      X86_64 Unclaimed Idle      0.500 1897  0+00:00:04
slot1@st2@localhos LINUX      X86_64 Unclaimed Idle      0.460 1897  0+00:00:04
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     2     0       0         2       0          0        0

               Total     2     0       0         2       0          0        0
-----------------------------------------------------------------------------


Now submit 10 jobs:

-----------------------------------------------------------------------------
universe = vanilla
cmd = /bin/sleep
args = 300
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 10
-----------------------------------------------------------------------------


You should see three match-events in the negotiator:

-----------------------------------------------------------------------------
$ grep Matched NegotiatorLog
04/11/13 16:57:36       Matched 1.0 none.user0000@localdomain <10.0.1.3:52463> preempting none <10.0.1.3:50396> slot1@st1@localhost
04/11/13 16:57:36       Matched 1.1 none.user0000@localdomain <10.0.1.3:52463> preempting none <10.0.1.3:54095> slot1@st2@localhost
04/11/13 16:57:36       Matched 1.2 none.user0000@localdomain <10.0.1.3:52463> preempting none <10.0.1.3:54095> slot1@st2@localhost
-----------------------------------------------------------------------------

Notice that only one match can occur against the "traditional" p-slot slot1@st1@localhost, but two matches are allowed against slot1@st2@localhost, since its consumption policy consumes 2 cpus per match and it has 5 cpus.

However, the schedd can use leftover-splitting on the traditional p-slot slot1@st1@localhost, so the end result is 7 jobs running in total: five on slot1@st1 and two on slot1@st2:

-----------------------------------------------------------------------------
$ cchist condor_q RemoteHost
      5 slot1@st1@localhost
      2 slot1@st2@localhost
      3 undefined
     10 total
-----------------------------------------------------------------------------
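
As a quick sanity check of those counts (my own arithmetic, not tool output):
-----------------------------------------------------------------------------
# slot1@st2: 5 cpus, consumption policy takes 2 cpus per match
#   -> floor(5 / 2) = 2 negotiator matches = 2 running jobs
# slot1@st1: no consumption policy -> 1 negotiator match per cycle,
#   then schedd leftover-splitting fills the remaining cpus -> 5 running jobs
# total: 5 + 2 = 7 running, 3 of the 10 jobs stay idle
-----------------------------------------------------------------------------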

Comment 7 Erik Erlandson 2013-08-13 23:01:25 UTC
Regarding https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2826

Consumption policies (and negotiator multi-matching) are designed to properly respect both concurrency limits and accounting group quotas.   In general, p-slots should work *better* with accounting groups, because each match is charged only for what it uses at the time of the match.  So, for example, this bug is also fixed:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3013
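
A minimal way to exercise the concurrency-limit interaction (a sketch using the standard concurrency-limit knobs, combined with any of the consumption-policy configs above; the limit name is arbitrary):
-----------------------------------------------------------------------------
# pool config: at most 2 running jobs may hold the 'mylimit' resource
MYLIMIT_LIMIT = 2
-----------------------------------------------------------------------------
and add "concurrency_limits = mylimit" to the submit file.  Even though the negotiator can now make several matches against a single p-slot in one cycle, no more than 2 such jobs should be running at any time, and each match should only be charged for the resources its consumption policy actually takes.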

Comment 8 Erik Erlandson 2013-08-13 23:05:55 UTC
Regarding https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2681

This bug is specific to the dedicated scheduler, so the vanilla universe isn't applicable here.  It is related to the known issues with CLAIM_PARTITIONABLE_LEFTOVERS (#2790) and concurrency limits: allowing multiple matches in the scheduler causes various accounting difficulties, because that accounting has to be made visible to the scheduler, which previously did not have to be aware of it.

Consumption policies avoid these issues, because they leave the resource accounting logic where it already resided.

Comment 9 Erik Erlandson 2013-08-25 21:55:03 UTC
UPSTREAM-8.1.2-BZ794818-consumption-policies

Comment 10 Erik Erlandson 2013-08-25 23:03:08 UTC
(In reply to Erik Erlandson from comment #5)

> # startd defaults, can be overridden on a per-slot-type basis
> CONSUMPTION_CPUS = ifthenelse(target.Cpus isnt undefined,
> quantize(target.Cpus, {1}), 1)
> CONSUMPTION_MEMORY = ifthenelse(target.Memory isnt undefined,
> quantize(target.Memory, {1}), 1)
> CONSUMPTION_DISK = ifthenelse(target.Disk isnt undefined,
> quantize(target.Disk, {100}), 100)
> CONSUMPTION_TOKENS = ifthenelse(target.Tokens isnt undefined, target.Tokens,
> 0)
> 

The above aren't very well written.  They may work, but for the "wrong" reason: a job ad normally carries RequestCpus/RequestMemory/RequestDisk rather than Cpus/Memory/Disk, so the target.Cpus (etc.) tests above are presumably undefined and the expressions mostly fall through to their default values.

Proper definitions are:
CONSUMPTION_CPUS = quantize(target.RequestCpus, {1})
CONSUMPTION_MEMORY = quantize(target.RequestMemory, {1})
CONSUMPTION_DISK = quantize(target.RequestDisk, {100})
CONSUMPTION_TOKENS = ifthenelse(target.RequestTokens isnt undefined, target.RequestTokens, 0)

Comment 12 Lubos Trilety 2013-09-09 13:58:18 UTC
The negotiator doesn't print any error even when the consumption policy for some resource is set to a negative number. This leads to very strange behaviour.

e.g.
SLOT_TYPE_2 = tokens=0,cpus=5,memory=100
SLOT_TYPE_2_PARTITIONABLE = True
SLOT_TYPE_2_CONSUMPTION_memory = -47
NUM_SLOTS_TYPE_2 = 1

submit some jobs

see condor_status

# condor_status -format "%d" SlotTypeID -format " %d" SlotWeight -format " %d" Cpus -format " %d" Memory -format " %d\n" Tokens
...
2 -487 4 -22930 0
-2 490 1 23030 0
...

The memory numbers are clearly wrong.

Comment 13 Lubos Trilety 2013-09-17 12:33:02 UTC
The negotiator doesn't create all possible dynamic slots if the job's resource request is bigger than the consumption policy value.

e.g.
CONSUMPTION_cpus=1
CONSUMPTION_memory=1

NUM_SLOTS_TYPE_1=1
SLOT_TYPE_1_SLOT_WEIGHT=cpus
SLOT_TYPE_1_PARTITIONABLE=True
SLOT_TYPE_1=tokens=0,cpus=10,memory=100

Submit the following jobs:

should_transfer_files=IF_NEEDED
executable=/bin/sleep
requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
transfer_executable=False
universe=vanilla
request_cpus=3
arguments=6000
when_to_transfer_output=ON_EXIT
notification=never
queue 10

$ condor_status -format "%d" SlotTypeID -format " %d" SlotWeight -format " %d" Cpus -format " %d\n" Memory | sort | uniq -c
      8 -1 1 1 1
      1 1 2 2 92


There are still 2 cpus available and the consumption policy is set to 1, so it should match, but it doesn't, because the job requests 3 cpus.

Comment 14 Erik Erlandson 2013-09-17 19:47:20 UTC
(In reply to Lubos Trilety from comment #13)
> Negotiator doesn't match all possible dynamic slots, if the requirements are
> bigger than consumption policy.
> 

> There are still 2 cpus available consumption policy is set to 1, so it
> should match, but it doesn't match because job requires 3 cpus.


This is a bug.   It's happening because the job ads are being optimized once, on read-in, prior to the actual matchmaking logic.  RequestCpus is being replaced with the constant '3' inside the job.Requirements expression, and so the call to cp_override_requested() has no effect on the job side.   
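
For illustration, the flattening looks roughly like this (a schematic sketch, not actual ad contents):
-----------------------------------------------------------------------------
# job ad as submitted (schematic)
RequestCpus = 3
Requirements = (TARGET.Cpus >= RequestCpus) && ...

# after the read-in optimization, RequestCpus has been folded into a literal:
Requirements = (TARGET.Cpus >= 3) && ...

# so when cp_override_requested() later substitutes the consumption value (1)
# for RequestCpus, Requirements no longer references it, and the p-slot with
# only 2 cpus remaining still fails to match.
-----------------------------------------------------------------------------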

Reproducing it requires a RequestXXX set to a constant value greater than the consumption value.

The fix is to disable the job.Requirements optimization in an environment where consumption policies are in effect.

Comment 15 Lubos Trilety 2013-09-20 12:02:15 UTC
Corrected in condor-7.8.9-0.5

# cat NegotiatorLog
...
09/20/13 11:59:36 WARNING: consumption policy for Consumptionmemory on resource slot2.lab.eng.brq.redhat.com failed to evaluate to a non-negative numeric value
09/20/13 11:59:36 WARNING: Consumption for asset memory on resource slot2.lab.eng.brq.redhat.com was negative: -47
...

(In reply to Lubos Trilety from comment #12)
> Negotiator doesn't print any error even when the consumption policy for some
> resource is set to minus number. It leads to very strange behaviour.
> 
> e.g.
> SLOT_TYPE_2 = tokens=0,cpus=5,memory=100
> SLOT_TYPE_2_PARTITIONABLE = True
> SLOT_TYPE_2_CONSUMPTION_memory = -47
> NUM_SLOTS_TYPE_2 = 1
> 
> submit some jobs
> 
> see condor_status
> 
> # condor_status -format "%d" SlotTypeID -format " %d" SlotWeight -format "
> %d" Cpus -format " %d" Memory -format " %d\n" Tokens
> ...
> 2 -487 4 -22930 0
> -2 490 1 23030 0
> ...
> 
> For sure memory numbers are bad one.

Comment 16 Lubos Trilety 2013-09-20 12:50:23 UTC
It seems it doesn't work properly with:
CLAIM_PARTITIONABLE_LEFTOVERS = True

For example:
Configure two startds:
STARTD.ST1.STARTD_LOG = $(LOG)/Startd_1_Log
STARTD.ST1.STARTD_NAME = st1
STARTD.ST1.ADDRESS_FILE = $(LOG)/.startd_1_address
STARTD_ST1_ARGS = -f -local-name ST1
STARTD_ST1 = $(STARTD)

STARTD.ST2.STARTD_LOG = $(LOG)/Startd_2_Log
STARTD.ST2.STARTD_NAME = st2
STARTD.ST2.ADDRESS_FILE = $(LOG)/.startd_2_address
STARTD_ST2_ARGS = -f -local-name ST2
STARTD_ST2 = $(STARTD)

# master-only procd should work
USE_PROCD = FALSE
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD_ST1, STARTD_ST2

# set number of cpus
NUM_CPUS=40

# Configure policy
STARTD.ST1.CONSUMPTION_POLICY=True
CLAIM_PARTITIONABLE_LEFTOVERS=True

# configure slots
SLOT_TYPE_4_SLOT_WEIGHT=floor(cpus/10)
SLOT_TYPE_4_CONSUMPTION_cpus=TotalSlotcpus
SLOT_TYPE_4_CONSUMPTION_tokens=0
NUM_SLOTS_TYPE_4=1
SLOT_TYPE_4_CONSUMPTION_memory=TotalSlotmemory
SLOT_TYPE_4=tokens=0,cpus=10,memory=100
SLOT_TYPE_4_PARTITIONABLE=True

NUM_SLOTS_TYPE_3=1
SLOT_TYPE_3_CONSUMPTION_tokens=ifthenelse(target.Requesttokens isnt undefined, quantize(target.Requesttokens, {1}), 1)
SLOT_TYPE_3_CONSUMPTION_memory=0
SLOT_TYPE_3_SLOT_WEIGHT=tokens
SLOT_TYPE_3_PARTITIONABLE=True
SLOT_TYPE_3=tokens=3,cpus=10,memory=100
SLOT_TYPE_3_CONSUMPTION_cpus=0

NUM_SLOTS_TYPE_2=1
SLOT_TYPE_2=tokens=0,cpus=10,memory=100
SLOT_TYPE_2_PARTITIONABLE=True
SLOT_TYPE_2_SLOT_WEIGHT=floor(memory/33)
SLOT_TYPE_2_CONSUMPTION_memory=quantize(target.Requestmemory, {33})

NUM_SLOTS_TYPE_1=1
SLOT_TYPE_1_SLOT_WEIGHT=floor(cpus/2)
SLOT_TYPE_1_PARTITIONABLE=True
SLOT_TYPE_1=tokens=0,cpus=10,memory=100

# Default policy
CONSUMPTION_tokens=ifthenelse(target.Requesttokens isnt undefined, target.Requesttokens, 0)
CONSUMPTION_memory=quantize(target.Requestmemory, {15})
CONSUMPTION_cpus=2

# set resources
MACHINE_RESOURCE_tokens=6

# other
SCHEDD_INTERVAL=15
NEGOTIATOR_CONSIDER_PREEMPTION=False
NEGOTIATOR_USE_SLOT_WEIGHTS=True
SLOT_WEIGHT=Cpus
NEGOTIATOR_INTERVAL=30
NEGOTIATOR_MATCHLIST_CACHING=False
MUST_MODIFY_REQUEST_EXPRS=False
PREEMPT=False
RANK=0
MAXJOBRETIREMENTTIME=3600
CLAIM_WORKLIFE=0
PREEMPTION_REQUIREMENTS=False


Submit the following jobs:
universe = vanilla
cmd = /bin/sleep
args = 3000
should_transfer_files = if_needed
when_to_transfer_output = on_exit
request_cpus = 3
request_memory = 12
queue 100

Run condor_status:
# condor_status -format "%d" SlotTypeID -format " %d" Cpus -format " %d" Memory -format " %d\n" Tokens | sort | uniq -c
      1 1 1 64 0
      4 -1 2 15 0
      1 1 2 40 0
      3 -1 3 12 0
      1 2 1 64 0
      3 -2 2 33 0
      3 -2 3 12 0
      1 2 4 1 0
      3 -3 0 0 1
      1 3 10 100 0
      1 3 1 64 3
      3 -3 3 12 0
      1 4 0 0 0
      1 -4 10 100 0
      1 4 1 64 0
      3 -4 3 12 0

There is one type -1 dynamic slot missing; there should be:
      5 -1 2 15 0
      1 1 0 25 0

instead of:
      4 -1 2 15 0
      1 1 2 40 0

See logs:
# cat NegotiatorLog
...
09/20/13 12:14:25     Request 00001.00012:
09/20/13 12:14:25       Matched 1.12 test.lab.eng.brq.redhat.com <10.34.33.139:59430> preempting none <10.34.33.139:45008> slot1@st1.lab.eng.brq.redhat.com
09/20/13 12:14:25       Successfully matched with slot1@st1.lab.eng.brq.redhat.com
...

# cat SchedLog
...
09/20/13 12:14:26 (pid:13469) Starting add_shadow_birthdate(1.12)
09/20/13 12:14:26 (pid:13469) Started shadow for job 1.12 on slot1@st2.lab.eng.brq.redhat.com <10.34.33.139:56178> for test, (shadow pid = 13585)
...

It seems the schedd took that job and ran it on st2 instead of st1.

It doesn't always happen, but it does in most cases. Sometimes the issue is on a slot other than slot1.

Comment 17 Lubos Trilety 2013-09-20 12:51:28 UTC
Fixed on condor-7.8.9-0.5

(In reply to Erik Erlandson from comment #14)
> (In reply to Lubos Trilety from comment #13)
> > Negotiator doesn't match all possible dynamic slots, if the requirements are
> > bigger than consumption policy.
> > 
> 
> > There are still 2 cpus available consumption policy is set to 1, so it
> > should match, but it doesn't match because job requires 3 cpus.
> 
> 
> This is a bug.   It's happening because the job ads are being optimized
> once, on read-in, prior to the actual matchmaking logic.  RequestCpus is
> being replaced with the constant '3' inside the job.Requirements expression,
> and so the call to cp_override_requested() has no effect on the job side.   
> 
> It takes a RequestXXX set to some constant value > the consumption value to
> repro.
> 
> Fix is to disable the job.Requirements optimization in an environment where
> consumption policies are in effect.

Comment 21 Lubos Trilety 2013-09-24 13:09:44 UTC
With a consumption policy active, the p-slot state changed to Matched, and it remained Matched even after all jobs were removed.

Settings:
# set number of cpus
NUM_CPUS=10

# Configure policy
#CONSUMPTION_POLICY=True

# configure slots
NUM_SLOTS_TYPE_1=1
SLOT_TYPE_1_SLOT_WEIGHT=floor(cpus/2)
SLOT_TYPE_1_PARTITIONABLE=True
SLOT_TYPE_1=cpus=10,memory=100

# Default policy
CONSUMPTION_memory=quantize(target.Requestmemory, {15})
CONSUMPTION_cpus=2

# other
SCHEDD_INTERVAL=15
NEGOTIATOR_CONSIDER_PREEMPTION=False
NEGOTIATOR_USE_SLOT_WEIGHTS=True
SLOT_WEIGHT=Cpus
NEGOTIATOR_INTERVAL=30
NEGOTIATOR_MATCHLIST_CACHING=False
MUST_MODIFY_REQUEST_EXPRS=False
PREEMPT=False
RANK=0
MAXJOBRETIREMENTTIME=3600
CLAIM_WORKLIFE=0
PREEMPTION_REQUIREMENTS=False

Submit the following jobs:
universe = vanilla
cmd = /bin/sleep
args = 3000
requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
transfer_executable=False
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 10

See condor_status:
# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@host LINUX      X86_64 Matched   Idle     0.240    25  0+00:00:04
slot1_1@host LINUX      X86_64 Claimed   Busy     0.000    15  0+00:00:04
slot1_2@host LINUX      X86_64 Claimed   Busy     0.000    15  0+00:00:04
slot1_3@host LINUX      X86_64 Claimed   Busy     0.000    15  0+00:00:04
slot1_4@host LINUX      X86_64 Claimed   Busy     0.000    15  0+00:00:04
slot1_5@host LINUX      X86_64 Claimed   Busy     0.000    15  0+00:00:04
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        6     0       5         0       1          0
               Total        6     0       5         0       1          0

Remove all jobs:
# condor_rm -all
All jobs marked for removal.

See condor_status:
# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@host LINUX      X86_64 Matched   Idle     0.050   100  0+00:01:34
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        1     0       0         0       1          0
               Total        1     0       0         0       1          0

No other jobs can run. Submit the previous jobs again and see the results:
# condor_q


-- Submitter: host : <IP:48016> : host
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   test            9/24 13:05   0+00:00:00 I  0   0.0  sleep 3000        
...
10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended

# cat MatchLog
...
09/24/13 13:05:24 ---------- Started Negotiation Cycle ----------
09/24/13 13:05:24 Phase 1:  Obtaining ads from collector ...
09/24/13 13:05:24   Getting Scheduler, Submitter and Machine ads ...
09/24/13 13:05:24   Sorting 3 ads ...
09/24/13 13:05:24   Getting startd private ads ...
09/24/13 13:05:24 Got ads: 3 public and 1 private
09/24/13 13:05:24 Public ads include 1 submitter, 1 startd
09/24/13 13:05:24 Phase 2:  Performing accounting ...
09/24/13 13:05:24 Phase 3:  Sorting submitter ads by priority ...
09/24/13 13:05:24 Phase 4.1:  Negotiating with schedds ...
09/24/13 13:05:24   Negotiating with test@host at <IP:48016>
09/24/13 13:05:24 0 seconds so far
09/24/13 13:05:24     Request 00002.00000:
09/24/13 13:05:24       Rejected 2.0 test@host <IP:48016>: no match found
09/24/13 13:05:24     Got NO_MORE_JOBS;  done negotiating
09/24/13 13:05:24  negotiateWithGroup resources used scheddAds length 0
09/24/13 13:05:24 ---------- Finished Negotiation Cycle ----------
...

Without a consumption policy the state is Unclaimed rather than Matched, and after removing the jobs and submitting new ones, they run correctly.

Comment 26 Lubos Trilety 2013-09-25 08:29:12 UTC
Tested with:
condor-7.8.9-0.5

Tested on:
RHEL5 i386, x86_64
RHEL6 i386, x86_64

Comment 28 errata-xmlrpc 2013-10-01 16:50:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1294.html

