Bug 703905 - wrong reject "group quota exceeded"
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.0.1
Target Release: ---
Assigned To: Erik Erlandson
QA Contact: Lubos Trilety
Depends On:
Blocks: 723887
Reported: 2011-05-11 11:56 EDT by Lubos Trilety
Modified: 2012-11-16 05:17 EST
Fixed In Version: condor-7.6.2-0.2
Doc Type: Bug Fix
Doc Text:
Cause: An unnecessary check related to precision jitter caused negotiator logic to prematurely conclude submitters were at their limit. Consequence: Extra "pie-spin" iterations in the negotiator loop to fill allowed slots. Fix: Removed the unneeded check. Result: Submitters now complete negotiation with fewer pie spins.
Last Closed: 2011-09-07 12:43:01 EDT
Attachments
NegotiatorLog (142.03 KB, text/plain), 2011-05-11 11:56 EDT, Lubos Trilety


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:1249
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update
Last Updated: 2011-09-07 12:40:45 EDT

Description Lubos Trilety 2011-05-11 11:56:06 EDT
Created attachment 498326
NegotiatorLog

Description of problem:
Condor rejects a job in phase 4.1 of the negotiation cycle when the submitter is only one slot below its limit.

05/11/11 17:16:11 matchmakingAlgorithm: limit 3.000000 used 2.000000 pieLeft 1.000000
05/11/11 17:16:11 Attempting to use cached MatchList: Failed (MatchList length: 0, Autocluster: 1, Schedd Name: a.u1@hostname, Schedd Address: <IP:58040>)
05/11/11 17:16:11       Rejected 1.2 a.u1@hostname <IP:58040>: group quota exceeded


Version-Release number of selected component (if applicable):
condor-7.6.1-0.4

How reproducible:
100%

Steps to Reproduce:
1. configure condor
NUM_CPUS = 10
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5

2. submit three jobs as user a.u1
# echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=1d\n+AccountingGroup=\"a.u1\"\nqueue 3" | runuser condor -s /bin/bash -c "condor_submit"
Submitting job(s)...
3 job(s) submitted to cluster 1.

3. observe negotiator log file
  
Actual results:
There is a reject in the negotiator log file; the last job is only matched in phase 4.2.

Expected results:
No reject; all a.u1 jobs are matched in phase 4.1.

Additional info:
see attachment
Comment 1 Jon Thomas 2011-05-20 16:25:00 EDT
I think this is due to SubmitterLimitPermits, which has some historical significance. I think this is part of a bigger problem of round-off issues throughout the code.
Comment 2 Jon Thomas 2011-05-23 11:42:35 EDT
The first conditional seems to be why the pieLeft==1 issue occurs. For practical purposes, used and SlotWeight are integers, so I'm not sure why the 0.99 factor is used. From the logs, it appears that the rejections occur in phase 4.1, but then the unused pie gets used in phase 4.2 by the same user that initially got the rejection. I considered using floor instead of 0.99*, but there didn't seem to be a case that would require it.

I'm going to run some tests with:

diff -rNup condor-7.4.4.orig/src/condor_negotiator.V6/matchmaker.cpp condor-7.4.4/src/condor_negotiator.V6/matchmaker.cpp
--- condor-7.4.4.orig/src/condor_negotiator.V6/matchmaker.cpp	2011-02-09 14:47:51.000000000 -0500
+++ condor-7.4.4/src/condor_negotiator.V6/matchmaker.cpp	2011-05-23 11:27:30.000000000 -0400
@@ -3016,10 +3016,10 @@ SubmitterLimitPermits(ClassAd *candidate
 		// the use of a fudge-factor 0.99 in the following is to be
 		// generous in case of very small round-off differences
 		// that I have observed in tests
-	if((used + SlotWeight) <= 0.99*allowed) {
+	if((used + (double) SlotWeight) <= allowed) {
 		return true;
 	}
-	if( used == 0 && allowed > 0 && pieLeft >= 0.99*SlotWeight ) {
+	if( used == 0 && allowed > 0 && pieLeft >= 0.99* (double) SlotWeight ) {
 
 		// Allow user to round up once per pie spin in order to avoid
 		// "crumbs" being left behind that couldn't be taken by anyone
Comment 3 Erik Erlandson 2011-05-23 11:55:19 EDT
(In reply to comment #2)
> The first conditional seems to be why the pieleft==1 issue occurs.  For
> practical purposes, used and SlotWeight are integers, so I'm not sure why the
> 0.99 factor is used.

Condor officially allows slot weight to be any positive floating point number, so the matchmaking algorithms can't assume integer values.  (that said, there will be some way to properly guarantee convergence, regardless)
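
To illustrate the point about non-integer slot weights, here is a minimal sketch of how round-off can show up once weights are fractional. The function names are hypothetical and are not taken from matchmaker.cpp. It also shows that scaling the limit by 0.99 tightens the check instead of absorbing the jitter, and that a small absolute tolerance is one possible alternative. That tolerance is an assumption for illustration only; per the Doc Text, the shipped fix simply removed the unneeded check.

```cpp
#include <cassert>

// Hypothetical helpers, not the real SubmitterLimitPermits signature.

// Plain comparison against the full limit.
bool fitsExact(double used, double slotWeight, double allowed) {
    return used + slotWeight <= allowed;
}

// Comparison against a limit scaled by the 0.99 factor from the old code.
bool fitsScaled(double used, double slotWeight, double allowed) {
    return used + slotWeight <= 0.99 * allowed;
}

// Assumed alternative: widen the limit by a tiny absolute tolerance.
bool fitsWithTolerance(double used, double slotWeight, double allowed,
                       double eps = 1e-9) {
    return used + slotWeight <= allowed + eps;
}

// In IEEE double arithmetic 0.1 + 0.2 is slightly larger than 0.3.
void demoFractionalWeights() {
    assert(!fitsExact(0.1, 0.2, 0.3));         // round-off causes a reject
    assert(!fitsScaled(0.1, 0.2, 0.3));        // 0.99 factor does not help
    assert(fitsWithTolerance(0.1, 0.2, 0.3));  // tolerance absorbs the jitter
}
```

Calling demoFractionalWeights() from a test harness exercises all three variants on the same inputs.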
Comment 4 Jon Thomas 2011-05-23 13:04:02 EDT
Patch fixes issue in description.
Comment 5 Jon Thomas 2011-05-23 14:53:49 EDT
In the logs I've looked at there's evidence of this hitting more than just when pieLeft==1. This case hits too:

matchmakingAlgorithm: limit 171.000000 used 169.000000 pieLeft 2.000000

if((used + SlotWeight) <= 0.99*allowed) {
169+1<=0.99*171
170<=169.29
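
The arithmetic above can be checked with a small sketch of the first conditional before and after the patch from comment 2. The helper names are hypothetical stand-ins; the real SubmitterLimitPermits takes more arguments and has a second "round up" branch that is not modelled here. A slot weight of 1 is assumed for both logged cases.

```cpp
#include <cassert>

// Hypothetical stand-in for the first conditional before the patch:
// the limit is scaled by the 0.99 fudge factor.
bool permitsWithFudge(double used, double slotWeight, double allowed) {
    return (used + slotWeight) <= 0.99 * allowed;
}

// Hypothetical stand-in for the first conditional after the patch:
// the request is compared against the full limit.
bool permitsExact(double used, double slotWeight, double allowed) {
    return (used + slotWeight) <= allowed;
}

// The two cases taken from the negotiator logs in this report.
void demoLoggedCases() {
    // limit 3, used 2: 3 <= 2.97 fails, 3 <= 3 passes.
    assert(!permitsWithFudge(2.0, 1.0, 3.0));
    assert(permitsExact(2.0, 1.0, 3.0));
    // limit 171, used 169: 170 <= 169.29 fails, 170 <= 171 passes.
    assert(!permitsWithFudge(169.0, 1.0, 171.0));
    assert(permitsExact(169.0, 1.0, 171.0));
}
```

In both cases the submitter still has room under its limit, yet the scaled check rejects the job in phase 4.1 and the leftover pie is only handed out in the next spin.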
Comment 6 Erik Erlandson 2011-05-26 17:29:13 EDT
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2195
Comment 7 Erik Erlandson 2011-05-27 19:38:56 EDT
Pushed fix to fh branch: V7_6-BZ703905-extra-spin
Comment 8 Erik Erlandson 2011-05-27 19:40:11 EDT
repro/test:

Using the following config:

NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
NUM_CPUS = 10


And the following submit file:

universe = vanilla
cmd = /bin/sleep
args = 300
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="u1"
queue 5
+AccountingGroup="u2"
queue 5


Before the fix, submit the jobs and observe the extra spin (phase 4.2):

$ tail -f NegotiatorLog | grep -e 'Phase 4..:' -e 'Negotiating with.* at' -e 'submitterLimit *='
05/27/11 16:03:20 Phase 4.1:  Negotiating with schedds ...
05/27/11 16:03:20   Negotiating with u1@localdomain at <192.168.1.2:52350>
05/27/11 16:03:20     submitterLimit    = 5.000000
05/27/11 16:03:20   Negotiating with u2@localdomain at <192.168.1.2:52350>
05/27/11 16:03:20     submitterLimit    = 5.000000
05/27/11 16:03:20 Phase 4.2:  Negotiating with schedds ...
05/27/11 16:03:20   Negotiating with u1@localdomain at <192.168.1.2:52350>
05/27/11 16:03:20     submitterLimit    = 1.000000
05/27/11 16:03:21   Negotiating with u2@localdomain at <192.168.1.2:52350>
05/27/11 16:03:21     submitterLimit    = 1.000000


After the fix, the jobs negotiate in one spin:

$ tail -f NegotiatorLog | grep -e 'Phase 4..:' -e 'Negotiating with.* at' -e 'submitterLimit *='
05/27/11 16:05:54 Phase 4.1:  Negotiating with schedds ...
05/27/11 16:05:54   Negotiating with u1@localdomain at <192.168.1.2:33637>
05/27/11 16:05:54     submitterLimit    = 5.000000
05/27/11 16:05:54   Negotiating with u2@localdomain at <192.168.1.2:33637>
05/27/11 16:05:54     submitterLimit    = 5.000000
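
For an automated version of this check, the number of spins can be counted from the "Phase 4.N:" marker lines. The following is a minimal sketch under the assumption that every spin logs exactly one such marker, as in the excerpts above; countPieSpins is a hypothetical helper and not part of Condor.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: count the "Phase 4.N:" markers in NegotiatorLog lines.
int countPieSpins(const std::vector<std::string>& lines) {
    static const std::regex marker("Phase 4\\.[0-9]+:");
    int spins = 0;
    for (const std::string& line : lines) {
        if (std::regex_search(line, marker)) {
            ++spins;
        }
    }
    return spins;
}

// Marker lines copied from the two runs in this comment.
void demoSpinCount() {
    const std::vector<std::string> before = {
        "05/27/11 16:03:20 Phase 4.1:  Negotiating with schedds ...",
        "05/27/11 16:03:20     submitterLimit    = 5.000000",
        "05/27/11 16:03:20 Phase 4.2:  Negotiating with schedds ...",
    };
    const std::vector<std::string> after = {
        "05/27/11 16:05:54 Phase 4.1:  Negotiating with schedds ...",
        "05/27/11 16:05:54     submitterLimit    = 5.000000",
    };
    assert(countPieSpins(before) == 2);  // extra spin before the fix
    assert(countPieSpins(after) == 1);   // single spin after the fix
}
```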
Comment 9 Erik Erlandson 2011-05-27 19:43:44 EDT
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:
An unnecessary check related to precision jitter caused negotiator logic to prematurely conclude submitters were at their limit.

Consequence:
Extra "pie-spin" iterations in the negotiator loop to fill allowed slots.

Fix:
Removed the unneeded check.

Result:
Submitters now complete negotiation with fewer pie spins.
Comment 11 Lubos Trilety 2011-07-27 10:32:05 EDT
Test in Comment 8 successfully reproduced on:
$CondorVersion: 7.6.0 Mar 24 2011 BuildID: RH-7.6.0-0.3.el5 PRE-RELEASE-GRID $
$CondorPlatform: X86_64-Redhat_5.6 $
Comment 12 Lubos Trilety 2011-07-27 10:35:24 EDT
Tested on:
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

In both tests submitters correctly get all possible resources in the first phase of the negotiation cycle.

>>> VERIFIED
Comment 13 errata-xmlrpc 2011-09-07 12:43:01 EDT
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html