Bug 712972

Summary: negotiating groups with no submitters
Product: Red Hat Enterprise MRG
Reporter: Jon Thomas <jthomas>
Component: condor
Assignee: Erik Erlandson <eerlands>
Status: CLOSED ERRATA
QA Contact: Tomas Rusnak <trusnak>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.0
CC: jneedle, ltrilety, matt, mkudlej, trusnak, tstclair
Target Milestone: 2.0.1   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: condor-7.6.2-0.2
Doc Type: Bug Fix
Doc Text:
Cause: Logic in the negotiator that removes submitters with no idle jobs interacts with multiple round robin iterations and collector update latencies.
Consequence: On multiple round robin iterations, groups can be sent to negotiation with no submitters, which wastes effort.
Fix: A check was added to skip groups that have no submitters.
Result: Groups with no submitters no longer waste time in negotiation.
Last Closed: 2011-09-07 16:43:07 UTC
Bug Depends On:    
Bug Blocks: 723887    
Attachments:
Description: patch to skip when no submitterAds
Flags: none

Description Jon Thomas 2011-06-13 18:57:54 UTC
This is an artifact of 678590 and 706512: when round robin is used, negotiateWithGroup can be entered with no submitters attached to the group.


06/07/11 21:20:38 Group <removed> - BEGIN NEGOTIATION
06/07/11 21:20:38 Phase 3:  Sorting submitter ads by priority ...
06/07/11 21:20:38 Phase 4.1:  Negotiating with schedds ...
06/07/11 21:20:38     numSlots = 365
06/07/11 21:20:38     slotWeightTotal = 365.000000
06/07/11 21:20:38     pieLeft = 0.000
06/07/11 21:20:38     NormalFactor = 0.000000
06/07/11 21:20:38     MaxPrioValue = 0.000000
06/07/11 21:20:38     NumSubmitterAds = 0
06/07/11 21:20:38  resources used by  are 0.000000
06/07/11 21:20:38  resources used scheddused 0.000000 groupUsed 0.000000
06/07/11 21:20:38  negotiateWithGroup resources used scheddAds length 0 

upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2230
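
The symptom in the log excerpt above (a group reaching BEGIN NEGOTIATION with NumSubmitterAds = 0) can be spotted mechanically. The following is a minimal sketch, not part of the attached patch; the function name and regular expressions are hypothetical and assume the log wording shown in this report.

```python
import re

# Patterns modeled on the excerpt in this report; the exact wording in other
# Condor versions is an assumption.
BEGIN_RE = re.compile(r"Group (.+?) - BEGIN NEGOTIATION")
NUM_ADS_RE = re.compile(r"NumSubmitterAds = (\d+)")


def groups_negotiated_without_submitters(lines):
    """Return names of groups that began negotiation with NumSubmitterAds = 0."""
    empty_groups = []
    current_group = None
    for line in lines:
        begin = BEGIN_RE.search(line)
        if begin:
            # Remember which group the following NumSubmitterAds line belongs to.
            current_group = begin.group(1)
            continue
        num_ads = NUM_ADS_RE.search(line)
        if num_ads and current_group is not None:
            if int(num_ads.group(1)) == 0:
                empty_groups.append(current_group)
            current_group = None
    return empty_groups


sample = [
    "06/07/11 21:20:38 Group <removed> - BEGIN NEGOTIATION",
    "06/07/11 21:20:38 Phase 3:  Sorting submitter ads by priority ...",
    "06/07/11 21:20:38     NumSubmitterAds = 0",
]
print(groups_negotiated_without_submitters(sample))
```

Any group name printed here entered negotiateWithGroup with nothing to negotiate.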

Comment 1 Jon Thomas 2011-06-13 21:24:17 UTC
Created attachment 504551
patch to skip when no submitterAds

Comment 2 Erik Erlandson 2011-06-15 23:34:06 UTC
Test/Repro:

Using configuration:

CLAIM_WORKLIFE = 0
NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
NEGOTIATOR_DEBUG = D_FULLDEBUG

NEGOTIATOR_INTERVAL = 30
SCHEDD_INTERVAL = 15

NUM_CPUS = 10

GROUP_NAMES = a, b
GROUP_QUOTA_a = 5
GROUP_QUOTA_b = 5

GROUP_QUOTA_ROUND_ROBIN_RATE = 1

Spin up the test pool, and then follow along in the negotiator log:

$ tail -f NegotiatorLog | grep -e 'RR iter' -e 'skipping, no sub' -e 'Filtering.*no idle jobs'

Submit this job:

universe = vanilla
cmd = /bin/sleep
args = 20
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="a.u1"
queue 2

After the fix, you should first see the negotiator go through 2 RR iterations. Then, after the job completes, you should see "a.u1" removed on the first RR iteration, followed by group "a" being skipped on the subsequent iteration. Note that the negotiator may occasionally get correct, up-to-date values from the collector, in which case the fix will not be exercised; if you do not see this behavior after the job completes, try again.

[root@rorschach log]$ tail -f NegotiatorLog | grep -e 'RR iter' -e 'skipping, no sub' -e 'Filtering.*no idle jobs'
# First negotiation (two RR iterations)
06/15/11 15:24:58 group quotas: entering RR iteration n= 1
06/15/11 15:24:59 group quotas: entering RR iteration n= 2
# ...
# After job completes, you see this:
06/15/11 15:25:29 group quotas: entering RR iteration n= 1
06/15/11 15:25:29 Filtering submitter a.u1@localdomain with no idle jobs
06/15/11 15:25:29 group quotas: entering RR iteration n= 2
06/15/11 15:25:29 Group a - skipping, no submitters (usage=0)

Before the fix, you should see two RR iterations on job completion with no skipping message (unless the collector happened to have up-to-date values).
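
The manual check above can also be scripted. This is a rough sketch that assumes the message wording shown in this comment; the function name is hypothetical. It reports which RR iteration each "no submitters" skip occurred in.

```python
import re

# Message formats copied from the NegotiatorLog lines in this comment.
RR_RE = re.compile(r"entering RR iteration n= (\d+)")
SKIP_RE = re.compile(r"Group (.+?) - skipping, no submitters")


def no_submitter_skips(lines):
    """Return (rr_iteration, group) pairs for every 'no submitters' skip."""
    skips = []
    iteration = None  # stays None until an RR iteration line has been seen
    for line in lines:
        rr = RR_RE.search(line)
        if rr:
            iteration = int(rr.group(1))
            continue
        skip = SKIP_RE.search(line)
        if skip:
            skips.append((iteration, skip.group(1)))
    return skips


after_job_completes = [
    "06/15/11 15:25:29 group quotas: entering RR iteration n= 1",
    "06/15/11 15:25:29 Filtering submitter a.u1@localdomain with no idle jobs",
    "06/15/11 15:25:29 group quotas: entering RR iteration n= 2",
    "06/15/11 15:25:29 Group a - skipping, no submitters (usage=0)",
]
print(no_submitter_skips(after_job_completes))
```

An empty result on a fixed build does not by itself indicate a failure, since the skip only triggers when the collector data is stale.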

Comment 3 Erik Erlandson 2011-06-15 23:34:06 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
Logic in the negotiator that removes submitters with no idle jobs interacts with multiple round robin iterations and collector update latencies.

Consequence:
On multiple round robin iterations, groups can be sent to negotiation with no submitters, which wastes effort.

Fix:
A check was added to skip groups that have no submitters.

Result:
Groups with no submitters no longer waste time in negotiation.
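
To illustrate the cause and the fix described in this note, here is a toy model in Python. It is not the negotiator's actual code (which is C++), and every name in it is hypothetical. It only mimics the interaction: a submitter with no idle jobs is filtered out during one round robin iteration, and the added check skips the now-empty group on the next one.

```python
def negotiate_round_robin(groups, iterations):
    """Toy model of the round robin loop.

    groups maps a group name to {submitter: idle_job_count}. Returns a list
    of log-like event strings. This is an illustration only, not Condor code.
    """
    events = []
    for n in range(1, iterations + 1):
        events.append(f"RR iteration {n}")
        for name, submitters in groups.items():
            if not submitters:
                # Models the added check: no submitters left, so skip the group.
                events.append(f"Group {name} - skipping, no submitters")
                continue
            events.append(f"Group {name} - negotiating")
            # Models the filter that removes submitters with no idle jobs.
            for sub in [s for s, idle in submitters.items() if idle == 0]:
                events.append(f"Filtering submitter {sub} with no idle jobs")
                del submitters[sub]
    return events


# Stale collector data: the submitter ad is still present but has no idle jobs.
events = negotiate_round_robin({"a": {"a.u1": 0}}, iterations=2)
for event in events:
    print(event)
```

Without the `if not submitters` check, the second iteration would enter negotiation for group "a" with an empty submitter list, which is the wasted effort described under Consequence.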

Comment 4 Erik Erlandson 2011-06-17 00:26:40 UTC
Fix pending upstream, targeted for 7.6.2

Comment 6 Tomas Rusnak 2011-07-26 16:20:02 UTC
Reproduced on:

$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

NegotiatorLog:

07/26/11 18:58:25 group quotas: group= a  quota= 5  requested= 2  allocated= 2  unallocated= 0
07/26/11 18:58:25 group quotas: group= b  quota= 5  requested= 0  allocated= 0  unallocated= 0
07/26/11 18:58:25 group quotas: groups= 3  requesting= 1  served= 1  unserved= 0  slots= 10  requested= 2  allocated= 2  surplus= 8
07/26/11 18:58:25 group quotas: entering RR iteration n= 1
07/26/11 18:58:25 Group <none> - skipping, zero slots allocated
07/26/11 18:58:25 Group a - BEGIN NEGOTIATION
07/26/11 18:58:25 Phase 3:  Sorting submitter ads by priority ...
...
07/26/11 18:58:25  negotiateWithGroup resources used scheddAds length 0
07/26/11 18:58:25 Group b - skipping, zero slots allocated
07/26/11 18:58:25 group quotas: entering RR iteration n= 2
07/26/11 18:58:25 group quotas: entering RR iteration n= 2
07/26/11 18:58:25 Group <none> - skipping, zero slots allocated
07/26/11 18:58:25 Group a - BEGIN NEGOTIATION
07/26/11 18:58:25 Phase 3:  Sorting submitter ads by priority ... 
...
07/26/11 18:58:25 group quotas: group= b  quota= 5  requested= 0  allocated= 0  unallocated= 0
07/26/11 18:58:25 group quotas: groups= 3  requesting= 0  served= 0  unserved= 0  slots= 10  requested= 0  allocated= 0  surplus= 10
07/26/11 18:58:25 group quotas: entering RR iteration n= 0
07/26/11 18:58:25 Group <none> - skipping, zero slots allocated
07/26/11 18:58:25 Group a - skipping, zero slots allocated

Comment 7 Tomas Rusnak 2011-07-27 08:36:29 UTC
Retested on all supported platforms (x86 and x86_64; RHEL5 and RHEL6) with:

condor-7.6.3-0.2

# sudo -u test condor_submit groupwosubmitter.job && tail -f /var/log/condor/NegotiatorLog | grep -e 'RR iter' -e 'skipping, no sub' -e 'Filtering.*no idle jobs'
Submitting job(s)..
2 job(s) submitted to cluster 366.
07/27/11 09:27:06 group quotas: entering RR iteration n= 1
07/27/11 09:27:06 group quotas: entering RR iteration n= 2
07/27/11 09:27:37 group quotas: entering RR iteration n= 1
07/27/11 09:27:37 group quotas: entering RR iteration n= 2
07/27/11 09:27:37 Group a - skipping, no submitters (usage=0)
07/27/11 09:27:37 group quotas: entering RR iteration n= 0

Negotiation for the group with no submitters was skipped.

>>> VERIFIED

Comment 8 errata-xmlrpc 2011-09-07 16:43:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html