Created attachment 504523
patch to bail out of negotiation when preemption off and slotweights off

Massive numbers of submissions rejected with rejForSubmitterLimit overwhelm the negotiator. In some cases, negotiation can be terminated after one rejection. Currently the code cycles through every submitter, lengthening the negotiation cycle and, in some cases, triggering a timeout.
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2231
In a configuration with many submitters in a group, is this meant to skip all submitters once one hits the group's limit?
No, it skips remaining submitters for a particular scheddAd. A group with two different users will have two different scheddAds. So if groupa.user1 hits rejection, then it will skip the rest of user1 and proceed to groupa.user2.
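The per-submitter behaviour described above can be sketched as a toy model. This is not the negotiator's code: `Submitter`, `negotiate`, `make_submitters` and the `stop_on_limit` flag are hypothetical names, and the model ignores how the fractional remainder of a limit is handed out. It only shows that the early exit ends the inner loop for one submitter while the outer loop still moves on to the next.

```python
from dataclasses import dataclass, field


@dataclass
class Submitter:
    name: str
    jobs: list            # idle job ids offered by this submitter
    limit: float          # submitter limit for the cycle (may be fractional)
    matched: int = 0
    rejected: list = field(default_factory=list)


def negotiate(submitters, stop_on_limit):
    """Toy negotiation pass; returns the total number of rejections."""
    rejections = 0
    for sub in submitters:
        for job in sub.jobs:
            # A further match is allowed only while it fits under the limit.
            if sub.matched + 1 <= sub.limit:
                sub.matched += 1
                continue
            sub.rejected.append(job)
            rejections += 1
            if stop_on_limit:
                # Skip the rest of this submitter only; the outer loop
                # continues with the next submitter.
                break
    return rejections


def make_submitters():
    # Hypothetical stand-ins for two users in group "a", limit 2.5 each.
    return [
        Submitter("a.u1.user", [f"1.{i}" for i in range(10, 20)], 2.5),
        Submitter("a.u2.user", [f"1.{i}" for i in range(0, 10)], 2.5),
    ]


if __name__ == "__main__":
    print(negotiate(make_submitters(), stop_on_limit=False))
    print(negotiate(make_submitters(), stop_on_limit=True))
```

With ten jobs per user and a limit of 2.5 each, the model rejects eight jobs per user without the early exit and one per user with it.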
Repro/test

Repro involves fractional slots on submitter limits. Also interacts with:
NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE

Start with this config:

CLAIM_WORKLIFE = 0
NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
NEGOTIATOR_DEBUG = D_FULLDEBUG
NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
GROUP_QUOTA_MAX_ALLOCATION_ROUNDS = 1
NEGOTIATOR_INTERVAL = 30
SCHEDD_INTERVAL = 15
NUM_CPUS = 20
GROUP_NAMES = a
GROUP_QUOTA_a = 5
GROUP_ACCEPT_SURPLUS = FALSE

Using this submit file, which creates jobs for two submitters, forcing each to be given submitter limit of 2.5:

universe = vanilla
cmd = /bin/sleep
args = 60
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="a.u2.user"
queue 10
+AccountingGroup="a.u1.user"
queue 10

Before fix, we see lots of unproductive rejections for submitter limit. Group "a" is allocated its limit of 5:

$ tail -f NegotiatorLog | grep -e 'done negotiating' -e 'exceeded' -e 'Group a .*allocated=.*usage='
06/16/11 16:55:19 group quotas: Group a allocated= 0 usage= 0
06/16/11 16:55:40 Rejected 1.12 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.13 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.14 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.15 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.16 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.17 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.18 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.19 a.u1.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Got NO_MORE_JOBS; done negotiating
06/16/11 16:55:40 Rejected 1.2 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.3 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.4 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.5 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.6 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.7 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.8 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Rejected 1.9 a.u2.user@localdomain <192.168.1.2:42270>: group quota exceeded
06/16/11 16:55:40 Got NO_MORE_JOBS; done negotiating
06/16/11 16:55:40 group quotas: Group a allocated= 5 usage= 5

After fix, we see that group "a" still gets its quota of 5, but with fewer unproductive submitter limit rejections:

$ tail -f NegotiatorLog | grep -e 'done negotiating' -e 'exceeded' -e 'Group a .*allocated=.*usage='
06/16/11 16:52:39 group quotas: Group a allocated= 0 usage= 0
06/16/11 16:52:59 Rejected 1.12 a.u1.user@localdomain <192.168.1.2:56962>: group quota exceeded
06/16/11 16:52:59 Hit submitter limit: done negotiating
06/16/11 16:53:00 Rejected 1.2 a.u2.user@localdomain <192.168.1.2:56962>: group quota exceeded
06/16/11 16:53:00 Hit submitter limit: done negotiating
06/16/11 16:53:00 group quotas: Group a allocated= 5 usage= 5
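When comparing before and after runs, counting rejections per submitter can be easier than reading the grep output by eye. The following is a minimal sketch that assumes the "Rejected <job> <submitter> <address>: <reason>" line shape seen in the logs above; `REJECT_RE` and `count_rejections` are hypothetical helper names, not part of Condor.

```python
import re
from collections import Counter

# Matches lines such as:
#   06/16/11 16:52:59 Rejected 1.12 a.u1.user@localdomain <192.168.1.2:56962>: group quota exceeded
REJECT_RE = re.compile(r"Rejected (\S+) (\S+) <[^>]*>: (.+)$")


def count_rejections(lines):
    """Return a Counter mapping submitter name to number of rejected jobs."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            _job_id, submitter, _reason = match.groups()
            counts[submitter] += 1
    return counts


# Sample taken from the "after fix" log above.
sample = """\
06/16/11 16:52:59 Rejected 1.12 a.u1.user@localdomain <192.168.1.2:56962>: group quota exceeded
06/16/11 16:52:59 Hit submitter limit: done negotiating
06/16/11 16:53:00 Rejected 1.2 a.u2.user@localdomain <192.168.1.2:56962>: group quota exceeded
06/16/11 16:53:00 Hit submitter limit: done negotiating
""".splitlines()

if __name__ == "__main__":
    for submitter, n in sorted(count_rejections(sample).items()):
        print(submitter, n)
```

To run it on a real log, pass an open file object for NegotiatorLog instead of `sample`.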
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:
Interaction of negotiator attempting to fill fractional submitter limit remainder with "NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE", in which schedd will doggedly try every possible job, no matter what.

Consequence:
Many unproductive negotiation attempts with rejection for submitter limits exceeded.

Fix:
A new check was added to halt negotiations for a submitter once it is rejected for submitter limit, provided it is safe (preemption not considered, and no slot weights).

Result:
Negotiation avoids unproductive repeated rejections for submitter limits, and halts on first such rejection.
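The "provided it is safe" condition in the note can be read as a check on the two knobs from the repro config. The sketch below is one interpretation, not the patch itself: `parse_config`, `knob_is_true` and `safe_to_stop_on_submitter_limit` are hypothetical names, and treating an unset knob as TRUE is a conservative choice made for this sketch.

```python
def parse_config(text):
    """Parse simple 'NAME = VALUE' lines into a dict (names upper-cased)."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        cfg[name.strip().upper()] = value.strip()
    return cfg


def knob_is_true(cfg, name):
    # Assumption for this sketch: anything other than an explicit FALSE
    # (including an unset knob) counts as TRUE, so the early exit stays off
    # unless both knobs are clearly disabled.
    return cfg.get(name, "TRUE").strip().upper() != "FALSE"


def safe_to_stop_on_submitter_limit(cfg):
    """True when preemption is not considered and slot weights are off."""
    return (not knob_is_true(cfg, "NEGOTIATOR_CONSIDER_PREEMPTION")
            and not knob_is_true(cfg, "NEGOTIATOR_USE_SLOT_WEIGHTS"))


REPRO_CONFIG = """\
NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE
"""

if __name__ == "__main__":
    print(safe_to_stop_on_submitter_limit(parse_config(REPRO_CONFIG)))
```

With the repro settings the check passes. Leaving either knob unset or TRUE makes it fail, in which case the negotiator would be expected to keep its previous behaviour.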
Fix pending upstream, targeted for 7.6.2.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,5 +1,5 @@
 Cause:
-Interaction of negotiator attempting to fill fractional submitter limit remainder with "NEGOTIATE_ALL_JOBS_IN_CLUSTER = TRUE", in which schedd will doggedly try every possible job, no matter what.
+Negotiator attempting to fill fractional submitter limit remainder where schedd will doggedly try every possible job cluster.
 
 Consequence:
 Many unproductive negotiation attempts with rejection for submitter limits exceeded.
Successfully reproduced on:
$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

Tested on:
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

Only two rejections present in negotiator log after submit:
07/20/11 16:23:53 Rejected 1.12 a.u1.user.lab.eng.brq.redhat.com <10.34.33.174:37086>: group quota exceeded
07/20/11 16:23:53 Hit submitter limit: done negotiating
07/20/11 16:23:54 Rejected 1.2 a.u2.user.lab.eng.brq.redhat.com <10.34.33.174:37086>: group quota exceeded
07/20/11 16:23:54 Hit submitter limit: done negotiating

>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html