| Summary: | preemption does not work when group quotas are in effect | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Erik Erlandson <eerlands> |
| Component: | condor | Assignee: | Erik Erlandson <eerlands> |
| Status: | CLOSED ERRATA | QA Contact: | Lubos Trilety <ltrilety> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.0 | CC: | claudiol, jthomas, ltrilety, matt, mkudlej, tstclair, whenry |
| Target Milestone: | 2.3 | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | condor-7.8.2-0.1 | Doc Type: | Bug Fix |
| Doc Text: |
Cause:
Group quota limit bookkeeping did not properly take the possibility of preemption into account.
Consequence:
Job preemption was prevented when group quotas were in effect.
Fix:
Bookkeeping logic for submitter limits was extended to include the possibility of preemption. Job ClassAds in the negotiator were enhanced to allow PREEMPTION_REQUIREMENTS to include accounting group names in the expression.
Result:
Preemption is now correctly allowed when group quotas are in effect. Preemption policies can be configured to allow preemption to respect accounting group boundaries (or ignore them if desired).
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-03-06 18:39:17 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Erik Erlandson
2011-10-21 20:00:00 UTC
TESTING:
To test the fix, use the following configuration:
MAXJOBRETIREMENTTIME = 0
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 0
PREEMPTION_REQUIREMENTS = True && (SubmitterGroup =?= RemoteGroup)
NEGOTIATOR_CONSIDER_PREEMPTION = True
NEGOTIATOR_INTERVAL = 30
NEGOTIATOR_DEBUG = D_FULLDEBUG
NUM_CPUS = 10
GROUP_NAMES = a, b
GROUP_QUOTA_a = 5
GROUP_QUOTA_b = 5
GROUP_ACCEPT_SURPLUS = FALSE
Spin up the pool. Then make sure the following priority factors are set:
$ condor_userprio -setfactor a.u1@localdomain 10
The priority factor of a.u1@localdomain was set to 10.000000
$ condor_userprio -setfactor b.u1@localdomain 10
The priority factor of b.u1@localdomain was set to 10.000000
Now submit the following jobs against "a.u1" and "b.u1":
universe = vanilla
cmd = /bin/sleep
args = 600
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="a.u1"
queue 4
+AccountingGroup="b.u1"
queue 4
Let the jobs above negotiate (all 8 jobs should run). After they negotiate, submit the following jobs for "a.u2" and "b.u2":
universe = vanilla
cmd = /bin/sleep
args = 600
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="a.u2"
queue 2
+AccountingGroup="b.u2"
queue 2
Let these jobs negotiate. "a.u2" and "b.u2" have lower priority factors than "a.u1" and "b.u1". "a.u2" should take one unclaimed slot and then preempt one job from "a.u1"; the same applies to "b.u2". The result should look like this:
$ condor_q -format "%s" AccountingGroup -format " | %s\n" JobStatus | sort | uniq -c
1 a.u1 | 1
3 a.u1 | 2
2 a.u2 | 2
1 b.u1 | 1
3 b.u1 | 2
2 b.u2 | 2
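The expected listing above can be read more easily once the JobStatus codes are decoded (1 is Idle, 2 is Running). The following is a minimal sketch of a checker, not part of the original test; the function name `summarize` and the `JOB_STATUS` table are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict

# Subset of HTCondor JobStatus codes that appear in this test.
JOB_STATUS = {1: "Idle", 2: "Running"}

def summarize(uniq_output):
    """Parse `sort | uniq -c` lines such as '3 a.u1 | 2' into
    {accounting_group: {status_name: count}}."""
    summary = defaultdict(dict)
    for line in uniq_output.strip().splitlines():
        count_and_group, status = line.split("|")
        count, group = count_and_group.split()
        code = int(status)
        # Unknown codes are kept as their numeric string.
        summary[group][JOB_STATUS.get(code, str(code))] = int(count)
    return dict(summary)

# Expected listing for the first scenario, copied from the report.
expected = """
1 a.u1 | 1
3 a.u1 | 2
2 a.u2 | 2
1 b.u1 | 1
3 b.u1 | 2
2 b.u2 | 2
"""

for group, states in sorted(summarize(expected).items()):
    print(group, states)
```

For each group, one job of the first submitter is back in the Idle state (it was preempted) and both jobs of the second submitter are Running.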
------------------------------------------
TESTING with preemption requirements False:
The test follows the same sequence as above, but with PREEMPTION_REQUIREMENTS set to False:
MAXJOBRETIREMENTTIME = 0
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 0
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = True
NEGOTIATOR_INTERVAL = 30
NEGOTIATOR_DEBUG = D_FULLDEBUG
NUM_CPUS = 10
GROUP_NAMES = a, b
GROUP_QUOTA_a = 5
GROUP_QUOTA_b = 5
GROUP_ACCEPT_SURPLUS = FALSE
Run the same test sequence as before with the modified configuration above. Now "a.u2" and "b.u2" cannot preempt any jobs, so each of them acquires only one unclaimed slot, leaving all jobs from "a.u1" and "b.u1" running:
$ condor_q -format "%s" AccountingGroup -format " | %s\n" JobStatus | sort | uniq -c
4 a.u1 | 2
1 a.u2 | 1
1 a.u2 | 2
4 b.u1 | 2
1 b.u2 | 1
1 b.u2 | 2
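The slot arithmetic behind the two group-quota scenarios can be sketched as a toy model. This is only a restatement of the expected counts under the test's numbers (10 slots, quota of 5 per group, 4 jobs already running per group, 2 new jobs per group); it is not the negotiator's actual algorithm, and the function `expected_counts` is a hypothetical helper.

```python
def expected_counts(quota, first_running, second_submitted, free_slots, allow_preempt):
    """Toy model of one accounting group.

    The second submitter first takes unclaimed slots up to the group quota,
    then (only if preemption is allowed) preempts jobs of the first submitter
    in the same group for whatever is left.
    """
    from_unclaimed = min(second_submitted, quota - first_running, free_slots)
    remaining = second_submitted - from_unclaimed
    preempted = min(remaining, first_running) if allow_preempt else 0
    return {
        "first_running": first_running - preempted,
        "first_idle": preempted,  # preempted jobs return to Idle
        "second_running": from_unclaimed + preempted,
        "second_idle": remaining - preempted,
    }

for allow_preempt in (True, False):
    free = 10 - 8  # NUM_CPUS minus the 8 jobs already running
    print("preemption allowed:", allow_preempt)
    for group in ("a", "b"):
        counts = expected_counts(5, 4, 2, free, allow_preempt)
        # Only slots taken from the unclaimed pool reduce the free count.
        free -= counts["second_running"] - counts["first_idle"]
        print(" ", group, counts)
```

With preemption allowed, the model yields 3 running and 1 idle job for the first submitter and 2 running jobs for the second, matching the first scenario. With preemption disallowed, it yields 4 running jobs for the first submitter and 1 running plus 1 idle for the second, matching the listing above.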
---------------------------------------
Testing preemption without groups:
Use the original test expression for PREEMPTION_REQUIREMENTS, but this time disable accounting groups:
MAXJOBRETIREMENTTIME = 0
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 0
PREEMPTION_REQUIREMENTS = (SubmitterGroup =?= RemoteGroup)
NEGOTIATOR_CONSIDER_PREEMPTION = True
NEGOTIATOR_INTERVAL = 30
NEGOTIATOR_DEBUG = D_FULLDEBUG
NUM_CPUS = 10
#disable acct groups:
#GROUP_NAMES = a, b
GROUP_QUOTA_a = 5
GROUP_QUOTA_b = 5
GROUP_ACCEPT_SURPLUS = FALSE
Running the same basic test sequence with the above config should result in all submitters competing against each other, in traditional non-HGQ fashion. Two of the jobs from "a.u1" and/or "b.u1" should be preempted by jobs from "a.u2" and/or "b.u2" (and all four jobs for "a.u2" and "b.u2" should run):
$ condor_q -format "%s" AccountingGroup -format " | %s\n" JobStatus | sort | uniq -c
4 a.u1 | 2
2 a.u2 | 2
2 b.u1 | 1
2 b.u1 | 2
2 b.u2 | 2
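The PREEMPTION_REQUIREMENTS expressions in these scenarios compare SubmitterGroup and RemoteGroup with the ClassAd `=?=` ("is identical to") operator, which always evaluates to True or False, even when an operand is UNDEFINED. The sketch below approximates that behaviour in Python; `Undefined`, `UNDEFINED`, and `meta_equals` are hypothetical stand-ins, and what the two attributes actually evaluate to when groups are disabled is not stated in this report.

```python
class Undefined:
    """Stand-in for the ClassAd UNDEFINED value (hypothetical helper)."""
    def __repr__(self):
        return "UNDEFINED"

UNDEFINED = Undefined()

def meta_equals(lhs, rhs):
    """Approximate ClassAd `=?=`: identical in type and value.

    Unlike `==`, the result is never UNDEFINED, and string
    comparison is case-sensitive.
    """
    if lhs is UNDEFINED or rhs is UNDEFINED:
        return lhs is rhs
    return type(lhs) is type(rhs) and lhs == rhs

# Same group: preemption requirement is satisfied.
print(meta_equals("a", "a"))
# Different groups: preemption across the group boundary is refused.
print(meta_equals("a", "b"))
# Both sides undefined still compare as identical.
print(meta_equals(UNDEFINED, UNDEFINED))
# One side undefined yields False rather than UNDEFINED.
print(meta_equals("a", UNDEFINED))
```

Under the assumption that both attributes evaluate to the same value (or are both undefined) when no groups are configured, the expression is true for every candidate match, which is consistent with all submitters competing against each other in the third scenario.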
Successfully reproduced on condor-7.6.5-0.22
results from first part of scenario:
# condor_q -format "%s" AccountingGroup -format " | %s\n" JobStatus | sort | uniq -c
4 a.u1 | 2
1 a.u2 | 1
1 a.u2 | 2
4 b.u1 | 2
1 b.u2 | 1
1 b.u2 | 2
Tested with:
condor-7.8.7-0.4
Tested on:
RHEL5 i386,x86_64
RHEL6 i386,x86_64
The scenario ran successfully.
>>> verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html