Bug 748053 - preemption does not work when group quotas are in effect
Summary: preemption does not work when group quotas are in effect
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 2.3
Target Release: ---
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-10-21 20:00 UTC by Erik Erlandson
Modified: 2013-03-06 18:39 UTC (History)

Fixed In Version: condor-7.8.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: Group quota limit bookkeeping did not properly take the possibility of preemption into account. Consequence: Job preemption was prevented when group quotas were in effect. Fix: Bookkeeping logic for submitter limits was extended to include the possibility of preemption. Job classads in the negotiator were enhanced to allow PREEMPTION_REQUIREMENTS to include accounting group names in the expression. Result: Preemption is now correctly allowed when group quotas are in effect. Preemption policies can be configured to make preemption respect accounting group boundaries (or ignore them if desired).
Clone Of:
Environment:
Last Closed: 2013-03-06 18:39:17 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Condor 2570 0 None None None Never
Red Hat Product Errata RHSA-2013:0564 0 normal SHIPPED_LIVE Low: Red Hat Enterprise MRG Grid 2.3 security update 2013-03-06 23:37:09 UTC

Description Erik Erlandson 2011-10-21 20:00:00 UTC
Description of problem:
I previously put logic in place to limit the computation of pie-left and submitter-limit when group quotas are in effect, because otherwise negotiate() does not assign slots in a way that guarantees respect of other group quotas.

However, a side effect of this logic is that it doesn't allow one submitter to preempt another as desired, when preemption is enabled.

A complete solution will involve addressing these conditions:

<eje> What we need is to allow preemption inside of accounting groups, but the inner negotiation loop does not respect accounting group boundaries -- it will grab slots in such a way that it steals slots that would need to be used by other groups to properly respect quotas

<eje> my previous solution imposed limits on submitter-limit, which properly respected quotas, but does not allow a submitter to steal slots from already-running jobs inside that group

<eje> we need a way to say "yes, you can preempt, but only preempt jobs that are running in the same acct group as you are"

<eje> So, an added condition for preemption: in addition to having a smaller priority value, and preemption-requirements evaluating to true, when HGQ is in effect we also have to have a match on accountinggroup(submitter) and accountinggroup(submitter-who-currently-has-slot-claim)

<eje> we also need logic to express constraint: "if your group G quota is Q, and number of jobs running against G is N, then you can use up to (Q-N) new slots and after that you have to attempt to preempt jobs running against G"
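
The two conditions discussed above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the negotiator's code: the `Claim` type and the names `group_of`, `may_preempt` and `plan_slots` are hypothetical, and the group is assumed to be the part of the accounting-group string before the last dot ("a.u1" belongs to group "a"). Note that, per the Doc Text, the shipped fix lets the group match be expressed in PREEMPTION_REQUIREMENTS rather than forcing it, so the `hgq_enabled` check here only mirrors the proposal in this description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Claim:
    """A slot currently claimed by a running job (hypothetical type)."""
    accounting_group: str  # e.g. "a.u1"
    priority: float        # smaller value means better priority


def group_of(accounting_group: str) -> str:
    """Assumption: the group is everything before the last dot."""
    return accounting_group.rsplit(".", 1)[0]


def may_preempt(submitter: str, submitter_priority: float, claim: Claim,
                preemption_requirements: bool, hgq_enabled: bool) -> bool:
    """Priority test, preemption requirements, plus group match under HGQ."""
    if submitter_priority >= claim.priority:
        return False
    if not preemption_requirements:
        return False
    if hgq_enabled and group_of(submitter) != group_of(claim.accounting_group):
        return False
    return True


def plan_slots(quota: int, running: int, requested: int) -> Tuple[int, int]:
    """Split a request into (new slots, preemption attempts).

    Up to quota - running slots may be taken fresh; anything beyond
    that has to come from preempting jobs already running in the group.
    """
    headroom = max(quota - running, 0)
    fresh = min(requested, headroom)
    return fresh, requested - fresh
```

For example, with a quota of 5 and four jobs already running in the group, `plan_slots(5, 4, 2)` returns `(1, 1)`: one new slot and one preemption attempt.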

Comment 1 Erik Erlandson 2011-10-21 20:03:20 UTC
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2570

Comment 2 Erik Erlandson 2012-03-03 00:28:17 UTC
TESTING:

To test the fix, use the following configuration:

MAXJOBRETIREMENTTIME = 0
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 0

PREEMPTION_REQUIREMENTS = True && (SubmitterGroup =?= RemoteGroup)
NEGOTIATOR_CONSIDER_PREEMPTION = True

NEGOTIATOR_INTERVAL = 30
NEGOTIATOR_DEBUG = D_FULLDEBUG

NUM_CPUS = 10

GROUP_NAMES = a, b
GROUP_QUOTA_a = 5
GROUP_QUOTA_b = 5
GROUP_ACCEPT_SURPLUS = FALSE
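
The `=?=` in PREEMPTION_REQUIREMENTS above is the ClassAd "is identical to" operator: it always yields true or false, never undefined, and treats two undefined operands as identical. A rough Python model of that comparison is sketched below. `UNDEFINED`, `is_identical` and `same_group` are illustrative names, not Condor APIs, and the model only covers string and undefined operands. Which values SubmitterGroup and RemoteGroup actually take in a given pool is not modelled here.

```python
class _Undefined:
    """Hypothetical stand-in for the ClassAd UNDEFINED value."""

    def __repr__(self) -> str:
        return "UNDEFINED"


UNDEFINED = _Undefined()


def is_identical(lhs, rhs) -> bool:
    """Model of ClassAd `=?=` for strings and UNDEFINED.

    Always returns a plain bool. Two UNDEFINED operands compare as
    identical; UNDEFINED against a string does not.
    """
    if lhs is UNDEFINED or rhs is UNDEFINED:
        return lhs is rhs
    return type(lhs) is type(rhs) and lhs == rhs


def same_group(submitter_group, remote_group) -> bool:
    """Mirrors `True && (SubmitterGroup =?= RemoteGroup)`."""
    return True and is_identical(submitter_group, remote_group)
```

Under this model a candidate from group "a" may only displace a claim whose group is also "a", which is the boundary the test configuration is meant to enforce.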


Spin up the pool. Then make sure the following priority factors are set:

$ condor_userprio -setfactor a.u1@localdomain 10
The priority factor of a.u1@localdomain was set to 10.000000
$ condor_userprio -setfactor b.u1@localdomain 10
The priority factor of b.u1@localdomain was set to 10.000000


Now submit the following jobs against "a.u1" and "b.u1":

universe = vanilla
cmd = /bin/sleep
args = 600
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="a.u1"
queue 4
+AccountingGroup="b.u1"
queue 4


Let the jobs above negotiate (all 8 jobs should run). After they negotiate, submit the following jobs for "a.u2" and "b.u2":

universe = vanilla
cmd = /bin/sleep
args = 600
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="a.u2"
queue 2
+AccountingGroup="b.u2"
queue 2


Let these jobs negotiate. "a.u2" and "b.u2" have lower priority factors than "a.u1" and "b.u1", and so have better priority. "a.u2" should take one unclaimed slot, and then preempt one job from "a.u1". Similarly for "b.u2". The result should look like this:

$ condor_q -format "%s" AccountingGroup -format " | %s\n" JobStatus | sort | uniq -c
      1 a.u1 | 1
      3 a.u1 | 2
      2 a.u2 | 2
      1 b.u1 | 1
      3 b.u1 | 2
      2 b.u2 | 2
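
A result like the one above can also be checked mechanically by parsing the `uniq -c` lines and comparing running jobs per group against the quotas. The helper below is a hypothetical sketch (`running_per_group` and `within_quota` are not part of Condor). It assumes JobStatus 2 means running and 1 means idle, and that the group is the part of AccountingGroup before the last dot.

```python
from collections import Counter
from typing import Dict

# Expected output from the first test scenario, as shown above.
SAMPLE = """\
      1 a.u1 | 1
      3 a.u1 | 2
      2 a.u2 | 2
      1 b.u1 | 1
      3 b.u1 | 2
      2 b.u2 | 2
"""


def running_per_group(text: str) -> Dict[str, int]:
    """Sum running jobs (JobStatus == 2) per accounting group."""
    running = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        count, rest = line.split(None, 1)  # e.g. "3", "a.u1 | 2"
        acct, status = (part.strip() for part in rest.split("|"))
        if int(status) == 2:
            group = acct.rsplit(".", 1)[0]  # "a.u1" -> "a"
            running[group] += int(count)
    return dict(running)


def within_quota(running: Dict[str, int], quotas: Dict[str, int]) -> bool:
    """True if no group runs more jobs than its quota allows."""
    return all(running.get(group, 0) <= quota
               for group, quota in quotas.items())


if __name__ == "__main__":
    counts = running_per_group(SAMPLE)
    print(counts, within_quota(counts, {"a": 5, "b": 5}))
```

For the sample above, both groups come out at exactly five running jobs, so the quota check passes.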



------------------------------------------
TESTING with preemption requirements False:

Test is same sequence as above, but with preemption requirements set to false:

MAXJOBRETIREMENTTIME = 0
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 0

PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = True

NEGOTIATOR_INTERVAL = 30
NEGOTIATOR_DEBUG = D_FULLDEBUG

NUM_CPUS = 10

GROUP_NAMES = a, b
GROUP_QUOTA_a = 5
GROUP_QUOTA_b = 5
GROUP_ACCEPT_SURPLUS = FALSE

Run the same testing sequence as before with the modified configuration above. Now "a.u2" and "b.u2" cannot preempt any jobs, so each of them will only acquire one unclaimed slot, leaving all jobs from "a.u1" and "b.u1" running:

$ condor_q -format "%s" AccountingGroup -format " | %s\n" JobStatus | sort | uniq -c
      4 a.u1 | 2
      1 a.u2 | 1
      1 a.u2 | 2
      4 b.u1 | 2
      1 b.u2 | 1
      1 b.u2 | 2



---------------------------------------
Testing preemption without groups:

Use the original test expression for preemption requirements, but this time disable groups:

MAXJOBRETIREMENTTIME = 0
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 0

PREEMPTION_REQUIREMENTS = (SubmitterGroup =?= RemoteGroup)
NEGOTIATOR_CONSIDER_PREEMPTION = True

NEGOTIATOR_INTERVAL = 30
NEGOTIATOR_DEBUG = D_FULLDEBUG

NUM_CPUS = 10

#disable acct groups:
#GROUP_NAMES = a, b
GROUP_QUOTA_a = 5
GROUP_QUOTA_b = 5
GROUP_ACCEPT_SURPLUS = FALSE

Running the same basic test sequence with the above config should result in all submitters competing against each other, in traditional non-HGQ fashion. Two of the jobs from "a.u1" and/or "b.u1" should be preempted by jobs from "a.u2" and/or "b.u2" (and all four jobs for "a.u2" and "b.u2" should run):

$ condor_q -format "%s" AccountingGroup -format " | %s\n" JobStatus | sort | uniq -c
      4 a.u1 | 2
      2 a.u2 | 2
      2 b.u1 | 1
      2 b.u1 | 2
      2 b.u2 | 2

Comment 3 Erik Erlandson 2012-03-03 00:35:17 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: 
Group quota limit bookkeeping did not properly take the possibility of preemption into account.

Consequence:
Job preemption was prevented when group quotas were in effect.

Fix:
Bookkeeping logic for submitter limits was extended to include the possibility of preemption.  Job classads in the negotiator were enhanced to allow PREEMPTION_REQUIREMENTS to include accounting group names in the expression.

Result:
Preemption is now correctly allowed when group quotas are in effect.  Preemption policies can be configured to allow preemption to respect accounting group boundaries (or ignore them if desired).

Comment 6 Lubos Trilety 2012-11-13 15:28:15 UTC
Successfully reproduced on condor-7.6.5-0.22

results from first part of scenario:
# condor_q -format "%s" AccountingGroup -format " | %s\n" JobStatus | sort | uniq -c
      4 a.u1 | 2
      1 a.u2 | 1
      1 a.u2 | 2
      4 b.u1 | 2
      1 b.u2 | 1
      1 b.u2 | 2

Comment 7 Lubos Trilety 2012-11-13 15:31:19 UTC
Tested with:
condor-7.8.7-0.4

Tested on:
RHEL5 i386,x86_64
RHEL6 i386,x86_64

The scenario ran successfully.

>>> verified

Comment 9 errata-xmlrpc 2013-03-06 18:39:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html

