Bug 639244 - distribution of slots is not correct
Summary: distribution of slots is not correct
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: 1.3.2
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 528800
 
Reported: 2010-10-01 08:31 UTC by Lubos Trilety
Modified: 2011-02-15 13:01 UTC (History)
1 user

Fixed In Version: condor-7.4.5-0.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-15 13:01:24 UTC
Target Upstream Version:
Embargoed:


Attachments
Negotiator log (143.43 KB, text/plain)
2010-10-01 08:31 UTC, Lubos Trilety

Description Lubos Trilety 2010-10-01 08:31:48 UTC
Created attachment 450962 [details]
Negotiator log

Description of problem:
If a group with subgroups is defined in condor and some jobs have already been run by a user from that group (the problem also occurs when the user is not in any group), then after jobs are submitted by users from the subgroups, the subgroups may get the wrong number of slots.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.16

How reproducible:
100%

Steps to Reproduce:
1. set condor configuration to:
NEGOTIATOR_DEBUG = D_FULLDEBUG | D_MATCH
SCHEDD_INTERVAL	= 15
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NUM_CPUS = 20
GROUP_NAMES = a, b, b.q, b.r
GROUP_QUOTA_DYNAMIC_a = 0.51
GROUP_QUOTA_DYNAMIC_b = 0.4
GROUP_QUOTA_DYNAMIC_b.q = 0.5
GROUP_QUOTA_DYNAMIC_b.r = 0.25
GROUP_AUTOREGROUP_b = TRUE
GROUP_AUTOREGROUP_b.q = TRUE
GROUP_AUTOREGROUP_b.r = TRUE

2. prepare two simple submit files
$ cat prepare.file
cmd = /bin/sleep
args = 2
+AccountingGroup = "b.user"
queue
+AccountingGroup = "none.user"
queue
$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15

3. submit prepare.file
$ condor_submit prepare.file 
Submitting job(s)..
2 job(s) submitted to cluster 1.

4. wait until all jobs finish, then submit the second file
$ condor_submit test.file 
Submitting job(s)........................................
40 job(s) submitted to cluster 2.

5. see used resources
$ condor_q -run -l | grep AccountingGroup | sort | uniq -c
      2 AccountingGroup = "a.user"
     14 AccountingGroup = "b.q.user"
      4 AccountingGroup = "b.r.user"


Actual results:
"a.user": 2, "b.q.user": 14, "b.r.user": 4

Expected results:
The result should be "a.user": 2, "b.q.user": 12, "b.r.user": 6
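
For reference, a minimal sketch of how the expected distribution can be derived from the configuration in step 1. This is only an illustration of the arithmetic, not condor code; it assumes that "a.user" is held to 2 running jobs by its concurrency limit and that the remaining slots are shared by the autoregroup subgroups of "b" in proportion to their dynamic quotas:

#include <cstdio>

int main() {
    const double total_slots = 20.0;                        // NUM_CPUS

    // Nominal quotas from the GROUP_QUOTA_DYNAMIC_* settings above.
    const double quota_b   = 0.40 * total_slots;            // 8 slots for group b
    const double quota_b_q = 0.50 * quota_b;                // 4 slots for b.q
    const double quota_b_r = 0.25 * quota_b;                // 2 slots for b.r

    // "a.user" is held to 2 running jobs by its concurrency limit,
    // leaving 18 slots for the autoregroup subgroups of "b".
    const double used_by_a = 2.0;
    const double remaining = total_slots - used_by_a;       // 18 slots

    // b.q and b.r split the remainder in the same 0.5 : 0.25 ratio
    // as their dynamic quotas, i.e. 2 : 1.
    const double share_q = quota_b_q / (quota_b_q + quota_b_r);
    printf("a.user:   %g\n", used_by_a);                     // 2
    printf("b.q.user: %g\n", remaining * share_q);           // 12
    printf("b.r.user: %g\n", remaining * (1.0 - share_q));   // 6
    return 0;
}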

Additional info:
In the negotiator log it can be seen that in the first iteration "a.user" gets
2 slots (because of its concurrency limit), "b.q.user" gets 6 slots and
"b.r.user" gets 4 slots.
So only 8 slots are unused, but the log says: "Failed to match 12.800001
slots on iteration 1."

Comment 1 Lubos Trilety 2010-10-01 08:32:44 UTC
Comment from Jon Thomas moved from 629614:

It looks like what happened was that the jobs for the non-group submitter and
the jobs for group b were in the process of ending at the start of the
negotiation cycle. There were classAds for these jobs, but the values for

ad->LookupInteger(ATTR_RUNNING_JOBS, numrunning);
ad->LookupInteger(ATTR_IDLE_JOBS, numidle);

were zero. This meant numsubmits was zero. Since there was a classAd, the ad
was inserted into the list for the group.

groupArray[0].submitterAds.Insert(ad);


Later, this caused a problem with

if ( groupArray[i].submitterAds.MyLength() == 0 ) {


So I think

if ( groupArray[i].submitterAds.MyLength() == 0 ) {

should be changed to

if ( groupArray[i].numsubmits == 0 ) {

We use numsubmits in a number of places, so this should be better. Negotiation
is skipped in negotiatewithgroup based upon:

num_idle_jobs = 0;
schedd->LookupInteger(ATTR_IDLE_JOBS,num_idle_jobs);
if ( num_idle_jobs < 0 ) {
  num_idle_jobs = 0;
 }

This is where we get the "... skipped because no idle jobs" message. So I
think the later stages of negotiation rely on these same attributes. If we make
the change, negotiation would be skipped higher in the call chain and would
not add to the number of unused slots.
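
A minimal sketch of the shape of the proposed change (simplified and hypothetical; it is not the actual condor source, only names from the fragments quoted above are reused):

// Hypothetical, simplified illustration of the check discussed above.
struct GroupEntry {
    int numsubmits;     // running + idle jobs accumulated for the group
    // submitterAds list and other members omitted
};

// A submitter ad with zero running and zero idle jobs still ends up in
// submitterAds, so testing the list length can report a "non-empty" group
// that actually has no work.  Testing the accumulated job count instead
// lets the negotiator skip such a group:
static bool groupIsEmpty(const GroupEntry &g) {
    return g.numsubmits == 0;   // proposed check instead of MyLength() == 0
}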

Comment 2 Erik Erlandson 2010-11-19 21:22:43 UTC
Latest devel branch:
V7_4-BZ619557-HFS-tree-structure

Comment 3 Lubos Trilety 2011-01-20 14:37:43 UTC
In the test scenario, the following lines need to be added to the condor configuration:
CONCURRENCY_LIMIT_DEFAULT = 2
none_LIMIT = 20

Comment 4 Lubos Trilety 2011-01-20 15:17:42 UTC
Moreover, test.file has to be changed accordingly:

$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
concurrency_limits = a
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15

In the new version of condor, HFS_MAX_ALLOCATION_ROUNDS also has to be set to a number bigger than 1 (e.g. HFS_MAX_ALLOCATION_ROUNDS = 3), otherwise no slots will be used.

Comment 5 Lubos Trilety 2011-01-20 15:21:41 UTC
Tested with (version):
condor-7.4.5-0.6

Tested on:
RHEL4 i386,x86_64  - passed
RHEL5 i386,x86_64  - passed

>>> VERIFIED

