Bug 639244 - distribution of slots is not correct
Summary: distribution of slots is not correct
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: 1.3.2
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 528800
 
Reported: 2010-10-01 08:31 UTC by Lubos Trilety
Modified: 2011-02-15 13:01 UTC (History)
1 user

Fixed In Version: condor-7.4.5-0.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-15 13:01:24 UTC
Target Upstream Version:
Embargoed:


Attachments
Negotiator log (143.43 KB, text/plain)
2010-10-01 08:31 UTC, Lubos Trilety

Description Lubos Trilety 2010-10-01 08:31:48 UTC
Created attachment 450962 [details]
Negotiator log

Description of problem:
If a group with subgroups is defined in condor and some jobs have already been run by a user from that group (the problem also occurs when the user is not in any group), then after jobs are submitted by users from the subgroups, the subgroups may get the wrong number of slots.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.16

How reproducible:
100%

Steps to Reproduce:
1. set condor configuration to:
NEGOTIATOR_DEBUG = D_FULLDEBUG | D_MATCH
SCHEDD_INTERVAL	= 15
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NUM_CPUS = 20
GROUP_NAMES = a, b, b.q, b.r
GROUP_QUOTA_DYNAMIC_a = 0.51
GROUP_QUOTA_DYNAMIC_b = 0.4
GROUP_QUOTA_DYNAMIC_b.q = 0.5
GROUP_QUOTA_DYNAMIC_b.r = 0.25
GROUP_AUTOREGROUP_b = TRUE
GROUP_AUTOREGROUP_b.q = TRUE
GROUP_AUTOREGROUP_b.r = TRUE

2. prepare two simple submit files
$ cat prepare.file
cmd = /bin/sleep
args = 2
+AccountingGroup = "b.user"
queue
+AccountingGroup = "none.user"
queue
$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15

3. submit prepare.file
$ condor_submit prepare.file 
Submitting job(s)..
2 job(s) submitted to cluster 1.

4. wait until all jobs finish, then submit the second file
$ condor_submit test.file 
Submitting job(s)........................................
40 job(s) submitted to cluster 2.

5. see used resources
$ condor_q -run -l | grep AccountingGroup | sort | uniq -c
      2 AccountingGroup = "a.user"
     14 AccountingGroup = "b.q.user"
      4 AccountingGroup = "b.r.user"


Actual results:
"a.user": 2, "b.q.user": 14, "b.r.user": 4

Expected results:
The result should be "a.user": 2, "b.q.user": 12, "b.r.user": 6
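
For reference, a minimal sketch of how the expected distribution can be derived from the configuration in step 1. This is only an illustration of the arithmetic, not condor code; it assumes that "a.user" is held to 2 running jobs by its concurrency limit and that the remaining slots are shared by the autoregroup subgroups of "b" in proportion to their dynamic quotas:

#include <cstdio>

int main() {
    const double total_slots = 20.0;                        // NUM_CPUS

    // Nominal quotas from the GROUP_QUOTA_DYNAMIC_* settings above.
    const double quota_b   = 0.40 * total_slots;            // 8 slots for group b
    const double quota_b_q = 0.50 * quota_b;                // 4 slots for b.q
    const double quota_b_r = 0.25 * quota_b;                // 2 slots for b.r

    // "a.user" is held to 2 running jobs by its concurrency limit,
    // leaving 18 slots for the autoregroup subgroups of "b".
    const double used_by_a = 2.0;
    const double remaining = total_slots - used_by_a;       // 18 slots

    // b.q and b.r split the remainder in the same 0.5 : 0.25 ratio
    // as their dynamic quotas, i.e. 2 : 1.
    const double share_q = quota_b_q / (quota_b_q + quota_b_r);
    printf("a.user:   %g\n", used_by_a);                     // 2
    printf("b.q.user: %g\n", remaining * share_q);           // 12
    printf("b.r.user: %g\n", remaining * (1.0 - share_q));   // 6
    return 0;
}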

Additional info:
In the negotiator log it can be seen that in the first iteration "a.user" gets
2 slots (because of its concurrency limit), "b.q.user" gets 6 slots and
"b.r.user" gets 4 slots.
So only 8 slots are unused, but the log says: "Failed to match 12.800001
slots on iteration 1."

Comment 1 Lubos Trilety 2010-10-01 08:32:44 UTC
Comment from Jon Thomas moved from 629614:

It looks like what happened was that the jobs for the non-group submitter and
the jobs for group b were in the process of ending at the start of the
negotiation cycle. There were classAds for these jobs, but the values for

ad->LookupInteger(ATTR_RUNNING_JOBS, numrunning);
ad->LookupInteger(ATTR_IDLE_JOBS, numidle);

were zero. This meant numsubmits was zero. Since there was a classAd, the ad
was inserted into the list for the group.

groupArray[0].submitterAds.Insert(ad);


Later, this caused a problem with

if ( groupArray[i].submitterAds.MyLength() == 0 ) {


So I think

if ( groupArray[i].submitterAds.MyLength() == 0 ) {

should be changed to

if ( groupArray[i].numsubmits == 0 ) {

We use numsubmits in a number of places, so this should be better. Negotiation
is skipped in negotiatewithgroup based upon:

num_idle_jobs = 0;
schedd->LookupInteger(ATTR_IDLE_JOBS,num_idle_jobs);
if ( num_idle_jobs < 0 ) {
  num_idle_jobs = 0;
 }

This is where we get the "... skipped because no idle jobs" message. So I
think the later stages of negotiation rely on these same attributes. If we make
the change, negotiation would be skipped higher in the call chain and would
not add to the number of unused slots.
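
A minimal sketch of the shape of the proposed change (simplified and hypothetical; it is not the actual condor source, only names from the fragments quoted above are reused):

// Hypothetical, simplified illustration of the check discussed above.
struct GroupEntry {
    int numsubmits;     // running + idle jobs accumulated for the group
    // submitterAds list and other members omitted
};

// A submitter ad with zero running and zero idle jobs still ends up in
// submitterAds, so testing the list length can report a "non-empty" group
// that actually has no work.  Testing the accumulated job count instead
// lets the negotiator skip such a group:
static bool groupIsEmpty(const GroupEntry &g) {
    return g.numsubmits == 0;   // proposed check instead of MyLength() == 0
}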

Comment 2 Erik Erlandson 2010-11-19 21:22:43 UTC
Latest devel branch:
V7_4-BZ619557-HFS-tree-structure

Comment 3 Lubos Trilety 2011-01-20 14:37:43 UTC
In the test scenario, the following lines need to be added to the condor configuration:
CONCURRENCY_LIMIT_DEFAULT = 2
none_LIMIT = 20

Comment 4 Lubos Trilety 2011-01-20 15:17:42 UTC
Moreover, test.file has to be changed accordingly:

$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
concurrency_limits = a
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15

In the new version of condor, HFS_MAX_ALLOCATION_ROUNDS also has to be set to a number bigger than 1 (e.g. HFS_MAX_ALLOCATION_ROUNDS = 3), otherwise no slots will be used.

Comment 5 Lubos Trilety 2011-01-20 15:21:41 UTC
Tested with (version):
condor-7.4.5-0.6

Tested on:
RHEL4 i386,x86_64  - passed
RHEL5 i386,x86_64  - passed

>>> VERIFIED

