Bug 712973 - negotiator overwhelmed with rejForSubmitterLimit rejections
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.0.1
Target Release: ---
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 723887
 
Reported: 2011-06-13 19:08 UTC by Jon Thomas
Modified: 2012-11-16 10:09 UTC (History)

Fixed In Version: condor-7.6.2-0.2
Doc Type: Bug Fix
Doc Text:
Cause: The negotiator attempts to fill the fractional remainder of a submitter limit, while the schedd doggedly tries every possible job cluster. Consequence: Many unproductive negotiation attempts, each rejected because the submitter limit is exceeded. Fix: A new check halts negotiation for a submitter once it is rejected for its submitter limit, provided that is safe (preemption is not considered and slot weights are not used). Result: Negotiation halts on the first such rejection, avoiding unproductive repeated rejections for submitter limits.
Clone Of:
Environment:
Last Closed: 2011-09-07 16:41:40 UTC
Target Upstream Version:
Embargoed:


Attachments
patch to bail out of negotiation when preemption off and slotweights off (2.07 KB, patch)
2011-06-13 19:08 UTC, Jon Thomas


Links
System: Red Hat Product Errata
ID: RHSA-2011:1249
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update
Last Updated: 2011-09-07 16:40:45 UTC

Description Jon Thomas 2011-06-13 19:08:42 UTC
Created attachment 504523
patch to bail out of negotiation when preemption off and slotweights off

Massive numbers of submissions rejected with rejForSubmitterLimit overwhelm the negotiator. In some cases, negotiation can be terminated after one rejection. Currently the code cycles through every submitter, lengthening the negotiation cycle and in some cases triggering a timeout.

Comment 1 Jon Thomas 2011-06-13 19:09:26 UTC
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2231

Comment 2 Matthew Farrellee 2011-06-13 20:41:38 UTC
In a configuration with many submitters in a group, this is to skip all submitters once one hits the group's limit?

Comment 3 Jon Thomas 2011-06-13 21:05:39 UTC
No, it skips remaining submitters for a particular scheddAd. A group with two different users will have two different scheddAds. 

So if groupa.user1 hits rejection, then it will skip the rest of user1 and proceed to groupa.user2.
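
The per-submitter behaviour described above can be sketched as a toy model in Python. This is not the negotiator's actual code: the function `negotiate`, its parameters, and the `halt_on_limit` flag are hypothetical names used only for illustration, and matching is reduced to counting jobs against a limit.

```python
def negotiate(submitters, limit_per_submitter, halt_on_limit):
    """Toy model of one negotiation cycle (illustrative only).

    submitters: dict mapping submitter name -> number of idle jobs offered.
    Returns a dict mapping submitter name -> (matched, rejected) counts.
    """
    results = {}
    for name, idle_jobs in submitters.items():
        matched = rejected = 0
        for _ in range(idle_jobs):
            if matched + 1 <= limit_per_submitter:
                matched += 1
            else:
                rejected += 1
                if halt_on_limit:
                    # Stop negotiating for this submitter only; the outer
                    # loop still proceeds to the next submitter.
                    break
        results[name] = (matched, rejected)
    return results


submitters = {"groupa.user1": 10, "groupa.user2": 10}
before = negotiate(submitters, 2.5, halt_on_limit=False)
after = negotiate(submitters, 2.5, halt_on_limit=True)
print(before)
print(after)
```

In this model, halting on the first rejection for groupa.user1 does not affect groupa.user2, which is still negotiated and receives its own single rejection.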

Comment 4 Erik Erlandson 2011-06-17 00:24:15 UTC
Repro/test

Repro involves fractional slots on submitter limits. Also interacts with:

NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE

Start with this config:

CLAIM_WORKLIFE = 0
NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
NEGOTIATOR_DEBUG = D_FULLDEBUG

NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE

GROUP_QUOTA_MAX_ALLOCATION_ROUNDS = 1

NEGOTIATOR_INTERVAL = 30
SCHEDD_INTERVAL = 15

NUM_CPUS = 20

GROUP_NAMES = a
GROUP_QUOTA_a = 5

GROUP_ACCEPT_SURPLUS = FALSE

Use this submit file, which creates jobs for two submitters, so that each is given a submitter limit of 2.5:

universe = vanilla
cmd = /bin/sleep
args = 60
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="a.u2.user"
queue 10
+AccountingGroup="a.u1.user"
queue 10
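
The 2.5 figure can be checked with a short Python sketch. It assumes the group quota (GROUP_QUOTA_a = 5) is split evenly between the two submitters, which is an assumption that holds only when both have equal priority; the helper name `equal_share_limits` is hypothetical.

```python
from fractions import Fraction


def equal_share_limits(group_quota, submitters):
    """Split a group quota evenly (assumes equal submitter priority)."""
    share = Fraction(group_quota, len(submitters))
    return {name: share for name in submitters}


limits = equal_share_limits(5, ["a.u1.user", "a.u2.user"])
for name, limit in limits.items():
    whole = limit.numerator // limit.denominator  # slots that fit entirely
    remainder = limit - whole                     # fractional part left over
    print(f"{name}: limit={float(limit)} whole={whole} remainder={float(remainder)}")
```

Each submitter ends up with two whole slots plus a fractional remainder of 0.5, which is the remainder the negotiator keeps trying, and failing, to fill.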

Before fix, we see lots of unproductive rejections for submitter limit. Group "a" is allocated its limit of 5:

$ tail -f NegotiatorLog | grep -e 'done negotiating' -e 'exceeded' -e 'Group a .*allocated=.*usage='
06/16/11 16:55:19 group quotas: Group a  allocated= 0  usage= 0
06/16/11 16:55:40       Rejected 1.12 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.13 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.14 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.15 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.16 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.17 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.18 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.19 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40     Got NO_MORE_JOBS;  done negotiating
06/16/11 16:55:40       Rejected 1.2 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.3 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.4 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.5 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.6 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.7 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.8 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40       Rejected 1.9 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40     Got NO_MORE_JOBS;  done negotiating
06/16/11 16:55:40 group quotas: Group a  allocated= 5  usage= 5

After fix, we see that group "a" still gets its quota of 5, but with fewer unproductive submitter limit rejections:

$ tail -f NegotiatorLog | grep -e 'done negotiating' -e 'exceeded' -e 'Group a .*allocated=.*usage='
06/16/11 16:52:39 group quotas: Group a  allocated= 0  usage= 0
06/16/11 16:52:59       Rejected 1.12 a.u1.user@localdomain <192.168.1.2:56962>: group quota exceeded
06/16/11 16:52:59     Hit submitter limit: done negotiating
06/16/11 16:53:00       Rejected 1.2 a.u2.user@localdomain <192.168.1.2:56962>: group quota exceeded
06/16/11 16:53:00     Hit submitter limit: done negotiating
06/16/11 16:53:00 group quotas: Group a  allocated= 5  usage= 5
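
To compare the two runs without reading the log by eye, the rejections can be counted per submitter. The following Python sketch assumes the NegotiatorLog line format shown in the excerpts above; the regular expression and the function name `count_rejections` are illustrative and may need adjusting for other log formats.

```python
import re
from collections import Counter

# Matches lines such as:
#   Rejected 1.12 a.u1.user@localdomain <192.168.1.2:56962>: group quota exceeded
REJECT_RE = re.compile(
    r"Rejected\s+(?P<job>\d+\.\d+)\s+(?P<submitter>\S+)\s+<[^>]+>:\s+(?P<reason>.+)$"
)


def count_rejections(lines):
    """Return a Counter mapping submitter -> number of 'Rejected' lines."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match.group("submitter")] += 1
    return counts


sample = [
    "06/16/11 16:52:39 group quotas: Group a  allocated= 0  usage= 0",
    "06/16/11 16:52:59       Rejected 1.12 a.u1.user@localdomain <192.168.1.2:56962>: group quota exceeded",
    "06/16/11 16:52:59     Hit submitter limit: done negotiating",
    "06/16/11 16:53:00       Rejected 1.2 a.u2.user@localdomain <192.168.1.2:56962>: group quota exceeded",
    "06/16/11 16:53:00     Hit submitter limit: done negotiating",
    "06/16/11 16:53:00 group quotas: Group a  allocated= 5  usage= 5",
]
counts = count_rejections(sample)
for submitter, n in sorted(counts.items()):
    print(submitter, n)
```

Run against the pre-fix excerpt, the same function would report eight rejections per submitter instead of one.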

Comment 5 Erik Erlandson 2011-06-17 00:24:15 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
Interaction of negotiator attempting to fill fractional submitter limit remainder with "NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE", in which schedd will doggedly try every possible job, no matter what.

Consequence:
Many unproductive negotiation attempts with rejection for submitter limits exceeded.

Fix:
A new check was added to halt negotiation for a submitter once it is rejected for its submitter limit, provided that is safe (preemption is not considered and slot weights are not used).

Result:
Negotiation halts on the first such rejection, avoiding unproductive repeated rejections for submitter limits.
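
The safety condition in the fix can be expressed as a small predicate. This Python sketch is illustrative and is not the patch itself: it assumes that "preemption not considered" and "no slot weights" correspond to the NEGOTIATOR_CONSIDER_PREEMPTION and NEGOTIATOR_USE_SLOT_WEIGHTS settings used in the repro config, and it assumes both default to TRUE when unset. The function name `can_halt_on_submitter_limit` is hypothetical.

```python
def can_halt_on_submitter_limit(config):
    """Return True when halting on the first submitter-limit rejection is safe.

    config: dict of configuration names to string values.
    Unset values are assumed (not confirmed by this report) to default to TRUE.
    """
    def is_true(name, default="TRUE"):
        return str(config.get(name, default)).strip().upper() == "TRUE"

    consider_preemption = is_true("NEGOTIATOR_CONSIDER_PREEMPTION")
    use_slot_weights = is_true("NEGOTIATOR_USE_SLOT_WEIGHTS")
    # Both features must be off for the early halt to apply.
    return not consider_preemption and not use_slot_weights


repro_config = {
    "NEGOTIATOR_CONSIDER_PREEMPTION": "FALSE",
    "NEGOTIATOR_USE_SLOT_WEIGHTS": "FALSE",
}
print(can_halt_on_submitter_limit(repro_config))
print(can_halt_on_submitter_limit({"NEGOTIATOR_USE_SLOT_WEIGHTS": "FALSE"}))
```

With the repro configuration the predicate is true; if either setting is left enabled, the early halt does not apply and negotiation proceeds as before.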

Comment 6 Erik Erlandson 2011-06-17 00:25:06 UTC
Fix pending upstream targeted for 7.6.2

Comment 7 Erik Erlandson 2011-06-17 19:13:14 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,5 +1,5 @@
 Cause:
-Interaction of negotiator attempting to fill fractional submitter limit remainder with "NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE", in which schedd will doggedly try every possible job, no matter what.
+Negotiator attempting to fill fractional submitter limit remainder where schedd will doggedly try every possible job cluster.
 
 Consequence:
 Many unproductive negotiation attempts with rejection for submitter limits exceeded.

Comment 8 Lubos Trilety 2011-07-20 13:27:01 UTC
Successfully reproduced on:
$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

Tested on:
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

Only two rejections present in negotiator log after submit:
07/20/11 16:23:53       Rejected 1.12 a.u1.user.lab.eng.brq.redhat.com <10.34.33.174:37086>: group quota exceeded
07/20/11 16:23:53     Hit submitter limit: done negotiating
07/20/11 16:23:54       Rejected 1.2 a.u2.user.lab.eng.brq.redhat.com <10.34.33.174:37086>: group quota exceeded
07/20/11 16:23:54     Hit submitter limit: done negotiating

>>> VERIFIED

Comment 9 errata-xmlrpc 2011-09-07 16:41:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html

