Created attachment 498326
NegotiatorLog

Description of problem:
Condor rejects a job in phase 4.1 of the negotiation cycle when only one slot remains before the submitter limit is reached.

05/11/11 17:16:11 matchmakingAlgorithm: limit 3.000000 used 2.000000 pieLeft 1.000000
05/11/11 17:16:11 Attempting to use cached MatchList: Failed (MatchList length: 0, Autocluster: 1, Schedd Name: a.u1@hostname, Schedd Address: <IP:58040>)
05/11/11 17:16:11 Rejected 1.2 a.u1@hostname <IP:58040>: group quota exceeded

Version-Release number of selected component (if applicable):
condor-7.6.1-0.4

How reproducible:
100%

Steps to Reproduce:
1. configure condor
NUM_CPUS = 10
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5

2. submit three jobs as user a.u1
# echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=1d\n+AccountingGroup=\"a.u1\"\nqueue 3" | runuser condor -s /bin/bash -c "condor_submit"
Submitting job(s)...
3 job(s) submitted to cluster 1.

3. observe negotiator log file

Actual results:
There is a reject in the negotiator log file; the last job was run in phase 4.2.

Expected results:
No reject; all a.u1 jobs are run in phase 4.1.

Additional info:
see attachment
I think this is due to SubmitterLimitPermits, which has some historical significance. I think this is part of a bigger problem of round-off issues throughout the code.
The first conditional seems to be why the pieLeft==1 issue occurs. For practical purposes, used and SlotWeight are integers, so I'm not sure why the 0.99 factor is used. From the logs, it appears that the rejections occur in phase 4.1, but then the unused pie gets used in phase 4.2 by the same user that initially got the rejection.

I considered using floor instead of 0.99*, but there didn't seem to be a case that would require it.

I'm going to run some tests with:

diff -rNup condor-7.4.4.orig/src/condor_negotiator.V6/matchmaker.cpp condor-7.4.4/src/condor_negotiator.V6/matchmaker.cpp
--- condor-7.4.4.orig/src/condor_negotiator.V6/matchmaker.cpp 2011-02-09 14:47:51.000000000 -0500
+++ condor-7.4.4/src/condor_negotiator.V6/matchmaker.cpp 2011-05-23 11:27:30.000000000 -0400
@@ -3016,10 +3016,10 @@ SubmitterLimitPermits(ClassAd *candidate
     // the use of a fudge-factor 0.99 in the following is to be
     // generous in case of very small round-off differences
     // that I have observed in tests
-    if((used + SlotWeight) <= 0.99*allowed) {
+    if((used + (double) SlotWeight) <= allowed) {
         return true;
     }
-    if( used == 0 && allowed > 0 && pieLeft >= 0.99*SlotWeight ) {
+    if( used == 0 && allowed > 0 && pieLeft >= 0.99* (double) SlotWeight ) {
         // Allow user to round up once per pie spin in order to avoid
         // "crumbs" being left behind that couldn't be taken by anyone
(In reply to comment #2)
> The first conditional seems to be why the pieleft==1 issue occurs. For
> practical purposes, used and SlotWeight are integers, so I'm not sure why
> the 0.99 factor is used.

Condor officially allows slot weight to be any positive floating-point number, so the matchmaking algorithms can't assume integer values.

(That said, there will be some way to properly guarantee convergence, regardless.)
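To illustrate the round-off concern with non-integer slot weights, here is a minimal C++ sketch. It is not Condor code: `fitsExactly` and `fitsWithTolerance` are hypothetical helpers, and the relative tolerance of 1e-9 is an assumed value picked only for illustration. With IEEE 754 doubles, 0.1 + 0.2 is slightly larger than 0.3, so an exact comparison refuses a match that fills the limit on paper.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: exact comparison, as in the patched first conditional.
bool fitsExactly(double used, double slotWeight, double allowed) {
    return (used + slotWeight) <= allowed;
}

// Hypothetical alternative (an assumption, not the fix applied in this bug):
// allow a tiny relative slack instead of shaving a full 1% off the limit.
bool fitsWithTolerance(double used, double slotWeight, double allowed,
                       double relTol = 1e-9) {
    const double slack = relTol * std::max(1.0, std::fabs(allowed));
    return (used + slotWeight) <= allowed + slack;
}
```

For example, `fitsExactly(0.1, 0.2, 0.3)` is false while `fitsWithTolerance(0.1, 0.2, 0.3)` is true. A submitter that is already at its limit, such as `fitsWithTolerance(171.0, 1.0, 171.0)`, is still refused.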
Patch fixes issue in description.
In the logs I've looked at, there's evidence of this hitting in more cases than just pieLeft==1. This case hits too:

matchmakingAlgorithm: limit 171.000000 used 169.000000 pieLeft 2.000000

if((used + SlotWeight) <= 0.99*allowed) {

169 + 1 <= 0.99 * 171
170 <= 169.29, which is false, so the match is rejected even though two slots are left.
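The arithmetic above can be checked with a small C++ sketch of the first conditional before and after the patch. The function names `permitsWithFudge` and `permitsExact` are illustrative stand-ins, not the real SubmitterLimitPermits signature, and the sketch ignores the second (round-up) conditional.

```cpp
// Illustrative stand-in for the original first conditional:
// the limit is scaled down by the 0.99 fudge factor.
bool permitsWithFudge(double used, double slotWeight, double allowed) {
    return (used + slotWeight) <= 0.99 * allowed;
}

// Illustrative stand-in for the patched first conditional:
// the limit is compared without the fudge factor.
bool permitsExact(double used, double slotWeight, double allowed) {
    return (used + slotWeight) <= allowed;
}
```

With the values from the logs, `permitsWithFudge(169.0, 1.0, 171.0)` and `permitsWithFudge(2.0, 1.0, 3.0)` are both false (170 > 169.29 and 3 > 2.97), while `permitsExact` returns true for the same inputs.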
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2195
Pushed fix to fh branch: V7_6-BZ703905-extra-spin
repro/test:

Using the following config:
NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
NUM_CPUS = 10

And the following submit file:
universe = vanilla
cmd = /bin/sleep
args = 300
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="u1"
queue 5
+AccountingGroup="u2"
queue 5

Before the fix, submit the jobs and see extra spins (phase 4.2):

$ tail -f NegotiatorLog | grep -e 'Phase 4..:' -e 'Negotiating with.* at' -e 'submitterLimit *='
05/27/11 16:03:20 Phase 4.1: Negotiating with schedds ...
05/27/11 16:03:20 Negotiating with u1@localdomain at <192.168.1.2:52350>
05/27/11 16:03:20 submitterLimit = 5.000000
05/27/11 16:03:20 Negotiating with u2@localdomain at <192.168.1.2:52350>
05/27/11 16:03:20 submitterLimit = 5.000000
05/27/11 16:03:20 Phase 4.2: Negotiating with schedds ...
05/27/11 16:03:20 Negotiating with u1@localdomain at <192.168.1.2:52350>
05/27/11 16:03:20 submitterLimit = 1.000000
05/27/11 16:03:21 Negotiating with u2@localdomain at <192.168.1.2:52350>
05/27/11 16:03:21 submitterLimit = 1.000000

After the fix, the jobs negotiate in one spin:

$ tail -f NegotiatorLog | grep -e 'Phase 4..:' -e 'Negotiating with.* at' -e 'submitterLimit *='
05/27/11 16:05:54 Phase 4.1: Negotiating with schedds ...
05/27/11 16:05:54 Negotiating with u1@localdomain at <192.168.1.2:33637>
05/27/11 16:05:54 submitterLimit = 5.000000
05/27/11 16:05:54 Negotiating with u2@localdomain at <192.168.1.2:33637>
05/27/11 16:05:54 submitterLimit = 5.000000
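The leftover submitterLimit of 1 in phase 4.2 is consistent with a simplified model of one pie spin, sketched below in C++. This is only a toy model under stated assumptions: `matchesInOneSpin` is a hypothetical function, pieLeft is assumed to equal allowed minus used, every slot has the same weight, and nothing else (preemption, other submitters, requirements) limits the matches.

```cpp
// Toy model: count how many equal-weight slots one submitter is granted
// in a single spin. useFudge selects the pre-patch (0.99*allowed) check.
int matchesInOneSpin(double allowed, double slotWeight, bool useFudge) {
    if (slotWeight <= 0.0) {
        return 0;  // guard: slot weights are expected to be positive
    }
    double used = 0.0;
    int matches = 0;
    for (;;) {
        const double pieLeft = allowed - used;  // assumption of this model
        const double ceiling = useFudge ? 0.99 * allowed : allowed;
        bool permitted = (used + slotWeight) <= ceiling;
        // Second conditional from the diff: round up once when nothing
        // has been used yet, so small "crumbs" are not left behind.
        if (!permitted && used == 0 && allowed > 0 &&
            pieLeft >= 0.99 * slotWeight) {
            permitted = true;
        }
        if (!permitted) {
            break;
        }
        used += slotWeight;
        ++matches;
    }
    return matches;
}
```

Under these assumptions, `matchesInOneSpin(5.0, 1.0, true)` yields 4, leaving one slot for a second spin, while `matchesInOneSpin(5.0, 1.0, false)` yields 5 in a single spin.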
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: An unnecessary check related to precision jitter caused negotiator logic to prematurely conclude submitters were at their limit.
Consequence: Extra "pie-spin" iterations in the negotiator loop to fill allowed slots.
Fix: Removed the unneeded check.
Result: Submitters now complete negotiation with fewer pie spins.
Test in Comment 8 successfully reproduced on:
$CondorVersion: 7.6.0 Mar 24 2011 BuildID: RH-7.6.0-0.3.el5 PRE-RELEASE-GRID $
$CondorPlatform: X86_64-Redhat_5.6 $
Tested on:
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

In both tests, submitters correctly get all possible resources in the first phase of the negotiator cycle.

>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html