Bug 639244

Summary: distribution of slots is not correct
Product: Red Hat Enterprise MRG
Component: condor
Version: 1.3
Target Milestone: 1.3.2
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Reporter: Lubos Trilety <ltrilety>
Assignee: Erik Erlandson <eerlands>
QA Contact: Lubos Trilety <ltrilety>
CC: matt
Fixed In Version: condor-7.4.5-0.2
Doc Type: Bug Fix
Last Closed: 2011-02-15 13:01:24 UTC
Bug Blocks: 528800
Attachments: Negotiator log

Description Lubos Trilety 2010-10-01 08:31:48 UTC
Created attachment 450962 [details]
Negotiator log

Description of problem:
If a group with some subgroups is defined in condor and some jobs have already
run under a user from that group (the problem also appears when the user is not
in any group), then jobs subsequently submitted by users from the subgroups can
get a wrong number of slots.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.16

How reproducible:
100%

Steps to Reproduce:
1. set condor configuration to:
NEGOTIATOR_DEBUG = D_FULLDEBUG | D_MATCH
SCHEDD_INTERVAL	= 15
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NUM_CPUS = 20
GROUP_NAMES = a, b, b.q, b.r
GROUP_QUOTA_DYNAMIC_a = 0.51
GROUP_QUOTA_DYNAMIC_b = 0.4
GROUP_QUOTA_DYNAMIC_b.q = 0.5
GROUP_QUOTA_DYNAMIC_b.r = 0.25
GROUP_AUTOREGROUP_b = TRUE
GROUP_AUTOREGROUP_b.q = TRUE
GROUP_AUTOREGROUP_b.r = TRUE

2. prepare two simple submit files
$ cat prepare.file
cmd = /bin/sleep
args = 2
+AccountingGroup = "b.user"
queue
+AccountingGroup = "none.user"
queue
$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15

3. submit prepare.file
$ condor_submit prepare.file 
Submitting job(s)..
2 job(s) submitted to cluster 1.

4. wait until all jobs finish, then submit the second file
$ condor_submit test.file 
Submitting job(s)........................................
40 job(s) submitted to cluster 2.

5. see used resources
$ condor_q -run -l | grep AccountingGroup | sort | uniq -c
      2 AccountingGroup = "a.user"
     14 AccountingGroup = "b.q.user"
      4 AccountingGroup = "b.r.user"


Actual results:
"a.user": 2, "b.q.user": 14, "b.r.user": 4

Expected results:
The result should be "a.user": 2, "b.q.user": 12, "b.r.user": 6
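
For reference, a rough sketch of the arithmetic behind this expectation (this
is not condor code; it assumes the subgroup fractions apply to the parent
group's share and that autoregroup redistributes the slots group a cannot use,
because of its concurrency limit of 2, in proportion to the subgroup quotas):

// Back-of-the-envelope check only, not taken from the condor source.
#include <cstdio>

int main() {
    const double pool     = 20.0;              // NUM_CPUS
    const double quota_a  = 0.51 * pool;       // 10.2 slots
    const double quota_b  = 0.40 * pool;       //  8.0 slots
    const double quota_bq = 0.50 * quota_b;    //  4.0 slots
    const double quota_br = 0.25 * quota_b;    //  2.0 slots

    // group a is capped at 2 running jobs by its concurrency limit; the
    // other 18 slots should be shared by b.q and b.r in the ratio of their
    // quotas, 0.5 : 0.25 = 2 : 1.
    const double used_a   = 2.0;
    const double leftover = pool - used_a;                     // 18
    const double share_bq = leftover * 0.50 / (0.50 + 0.25);   // 12
    const double share_br = leftover * 0.25 / (0.50 + 0.25);   //  6

    printf("nominal quotas: a=%.1f b=%.1f b.q=%.1f b.r=%.1f\n",
           quota_a, quota_b, quota_bq, quota_br);
    printf("expected usage: a=%.0f b.q=%.0f b.r=%.0f\n",
           used_a, share_bq, share_br);
    return 0;
}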

Additional info:
In the negotiator log it can be seen that in the first iteration "a.user" gets
2 slots (because of its concurrency limit), "b.q.user" gets 6 slots and
"b.r.user" gets 4 slots.
That leaves only 8 of the 20 slots unused, but the log reports: "Failed to match
12.800001 slots on iteration 1."

Comment 1 Lubos Trilety 2010-10-01 08:32:44 UTC
Comment from Jon Thomas moved from 629614:

It looks like what happened is that the non-group jobs and the group b jobs
were in the process of ending at the start of the negotiation cycle. There were
classAds for these jobs, but the values for

ad->LookupInteger(ATTR_RUNNING_JOBS, numrunning);
ad->LookupInteger(ATTR_IDLE_JOBS, numidle);

were zero, which meant numsubmits was zero. But since a classAd existed, the ad
was still inserted into the group's list:

groupArray[0].submitterAds.Insert(ad);


Later, this caused a problem with

if ( groupArray[i].submitterAds.MyLength() == 0 ) {


So I think, 

if ( groupArray[i].submitterAds.MyLength() == 0 ) {

should be changed to

if ( groupArray[i].numsubmits == 0 ) {

We use numsubmits in a number of places, so this should be better. Negotiation
is skipped in negotiatewithgroup based upon:

num_idle_jobs = 0;
schedd->LookupInteger(ATTR_IDLE_JOBS,num_idle_jobs);
if ( num_idle_jobs < 0 ) {
  num_idle_jobs = 0;
}

This is where we get the "... skipped because no idle jobs" message. So I
think the later stages of negotiation rely on these same attributes. If we make
the change, negotiation would be skipped higher in the call chain and the group
would not add to the number of unused slots.
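
A minimal, self-contained sketch of the suggested change (illustrative only;
the struct and loop are simplified stand-ins for the negotiator's actual data
structures, not the real source):

#include <cstdio>
#include <vector>

// adCount plays the role of submitterAds.MyLength(); numsubmits is the
// accumulated ATTR_RUNNING_JOBS + ATTR_IDLE_JOBS of the group's submitters.
struct Group {
    const char* name;
    int adCount;
    int numsubmits;
};

int main() {
    // Group "b": its jobs were just finishing when the cycle started, so a
    // classAd still exists (adCount == 1) but running+idle is already zero.
    std::vector<Group> groups = { {"b", 1, 0}, {"a", 3, 12} };

    for (const Group& g : groups) {
        // old check -- the stale group "b" is NOT skipped, so its quota later
        // shows up as unmatched ("Failed to match ... slots"):
        //   if (g.adCount == 0) continue;

        // proposed check -- skip on numsubmits instead, consistent with the
        // "... skipped because no idle jobs" test done later in negotiation:
        if (g.numsubmits == 0) {
            printf("skipping group %s: no running or idle jobs\n", g.name);
            continue;
        }
        printf("negotiating for group %s\n", g.name);
    }
    return 0;
}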

Comment 2 Erik Erlandson 2010-11-19 21:22:43 UTC
Latest devel branch:
V7_4-BZ619557-HFS-tree-structure

Comment 3 Lubos Trilety 2011-01-20 14:37:43 UTC
For the test scenario, the following lines need to be added to the condor
configuration:
CONCURRENCY_LIMIT_DEFAULT = 2
none_LIMIT = 20

These limits cap the "a.user" jobs, which declare concurrency_limits = a in the
updated test.file below, at two running jobs, while the "none" limit of 20
leaves the remaining jobs effectively unconstrained on a 20-slot pool.

Comment 4 Lubos Trilety 2011-01-20 15:17:42 UTC
In addition, test.file has to be changed accordingly:

$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
concurrency_limits = a
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15

In the new version of condor, HFS_MAX_ALLOCATION_ROUNDS also has to be set to a number bigger than 1 (e.g. HFS_MAX_ALLOCATION_ROUNDS = 3), otherwise no slots will be used.

Comment 5 Lubos Trilety 2011-01-20 15:21:41 UTC
Tested with (version):
condor-7.4.5-0.6

Tested on:
RHEL4 i386,x86_64  - passed
RHEL5 i386,x86_64  - passed

>>> VERIFIED