Description of problem:

It is possible to construct a submission scenario that completely violates the group quota limit configured for HFS.

Steps to Reproduce:

# configure 20 slots, 10 to group a and 10 to group b
[condor@rorschach config.d]$ more 15.slot.config
NUM_CPUS = 20
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 0
SLOT_TYPE_2 = cpus=1
SLOT_TYPE_2_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 20
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5
GROUP_AUTOREGROUP_a = FALSE
GROUP_AUTOREGROUP_b = FALSE

# check our slots:
[eje@rorschach ~]$ svhist Machine _SlotType_ Cpus State Activity
     20 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total

# submit 9 jobs to group "a" (should be one left)
[eje@rorschach ~]$ echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u1\"\nqueue 9\n" | condor_submit
Submitting job(s).........
9 job(s) submitted to cluster 126.

[eje@rorschach ~]$ svhist Machine _SlotType_ Cpus State Activity
      9 rorschach.localdomain | X | 1 | Claimed | Busy
     11 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total

# now submit 8 more jobs: 2 jobs each from four new users, but all under group "a"
[eje@rorschach ~]$ echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u2\"\nqueue 2\n+AccountingGroup=\"a.u3\"\nqueue 2\n+AccountingGroup=\"a.u4\"\nqueue 2\n+AccountingGroup=\"a.u5\"\nqueue 2\n" | condor_submit
Submitting job(s)........
8 job(s) submitted to cluster 127.

# expecting one more slot to fill (was intending to study reject logic for submitters)
# but -- oh no -- all 8 jobs got scheduled
[eje@rorschach ~]$ !svhist
svhist Machine _SlotType_ Cpus State Activity
     17 rorschach.localdomain | X | 1 | Claimed | Busy
      3 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total

[eje@rorschach ~]$ condor_q

-- Submitter: rorschach.localdomain : <192.168.1.2:49106> : rorschach.localdomain
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 126.0   eje    10/7  16:12   0+00:09:59 R  0   0.0  sleep 600
 126.1   eje    10/7  16:12   0+00:09:58 R  0   0.0  sleep 600
 126.2   eje    10/7  16:12   0+00:09:58 R  0   0.0  sleep 600
 126.3   eje    10/7  16:12   0+00:09:58 R  0   0.0  sleep 600
 126.4   eje    10/7  16:12   0+00:09:58 R  0   0.0  sleep 600
 126.5   eje    10/7  16:12   0+00:09:58 R  0   0.0  sleep 600
 126.6   eje    10/7  16:12   0+00:09:58 R  0   0.0  sleep 600
 126.7   eje    10/7  16:12   0+00:09:58 R  0   0.0  sleep 600
 126.8   eje    10/7  16:12   0+00:09:58 R  0   0.0  sleep 600
 127.0   eje    10/7  16:16   0+00:06:34 R  0   0.0  sleep 600
 127.1   eje    10/7  16:16   0+00:06:32 R  0   0.0  sleep 600
 127.2   eje    10/7  16:16   0+00:06:34 R  0   0.0  sleep 600
 127.3   eje    10/7  16:16   0+00:06:33 R  0   0.0  sleep 600
 127.4   eje    10/7  16:16   0+00:06:34 R  0   0.0  sleep 600
 127.5   eje    10/7  16:16   0+00:06:33 R  0   0.0  sleep 600
 127.6   eje    10/7  16:16   0+00:06:33 R  0   0.0  sleep 600
 127.7   eje    10/7  16:16   0+00:06:33 R  0   0.0  sleep 600

17 jobs; 0 idle, 17 running, 0 held
[eje@rorschach ~]$

Expected results:

The system should enforce the limit of 10 slots for group a and schedule a maximum of 10 jobs from that group at any one time.
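The expected numbers in the report follow directly from the configuration. The sketch below is only an illustration of that arithmetic, not Condor code: `parse_config` and `group_slot_quotas` are hypothetical helpers, and truncating the fractional quota with `int()` is an assumption about rounding that does not matter here because 0.5 * 20 is exact.

```python
# Illustrative arithmetic only; these helpers are hypothetical, not Condor APIs.
CONFIG_TEXT = """
NUM_SLOTS_TYPE_2 = 20
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5
"""


def parse_config(text):
    """Parse simple KEY = VALUE lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def group_slot_quotas(settings, total_slots):
    """Convert fractional (dynamic) group quotas into whole slot counts."""
    names = [name.strip() for name in settings["GROUP_NAMES"].split(",")]
    return {
        # int() truncation is an assumption about how fractions are rounded.
        name: int(float(settings["GROUP_QUOTA_DYNAMIC_" + name]) * total_slots)
        for name in names
    }


settings = parse_config(CONFIG_TEXT)
total_slots = int(settings["NUM_SLOTS_TYPE_2"])
quotas = group_slot_quotas(settings, total_slots)

running_a = 9          # jobs already running for a.u1 (cluster 126)
newly_submitted = 8    # jobs from a.u2 .. a.u5 (cluster 127)
remaining_a = quotas["a"] - running_a
expected_new_running = min(newly_submitted, remaining_a)

print("quotas:", quotas)
print("slots left for group a:", remaining_a)
print("jobs from cluster 127 expected to run:", expected_new_running)
```

With these inputs, group a has room for one more job, so only one of the eight jobs in cluster 127 should have started.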
Created attachment 454216 [details]
patch

Strange. This was tested, so either the previous test didn't catch the issue (in which case it exists in the flat upstream group code too) or we had a regression. I suspect the former. One thing to note: initially I couldn't reproduce it, but later I could reproduce it consistently. In any case, here is a patch that appears to fix the problem.
Incorporated Jon's fix here: V7_4-BZ619557-HFS-tree-structure
Successfully reproduced on:
$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:
Submitting jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... )

Consequence:
The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.

Fix:
A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.

Result:
Accounting group slot limits are obeyed.
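To make the "accumulating the total number of jobs matched" wording concrete, here is a toy model in Python. It is not the negotiator's code and not the attached patch: the function names and the shape of the loop are hypothetical, and the faulty variant is just one way a missing group-wide total could let every submitter through. It is fed the numbers from the original report (quota 10, 9 jobs already running, four new submitters with 2 idle jobs each).

```python
# Toy model only; names and loop structure are assumptions, not Condor internals.
GROUP_QUOTA = 10


def negotiate_without_group_total(idle_by_submitter, group_usage, quota=GROUP_QUOTA):
    """Faulty variant: each submitter is checked against the quota on its own.

    group_usage is deliberately ignored to model the missing accumulation.
    """
    matches = {}
    for submitter, idle in idle_by_submitter.items():
        matched = 0  # per-submitter counter; no total is carried across submitters
        while matched < idle and matched < quota:
            matched += 1
        matches[submitter] = matched
    return matches


def negotiate_with_group_total(idle_by_submitter, group_usage, quota=GROUP_QUOTA):
    """Corrected variant: one running total is shared by the whole group."""
    matches = {}
    group_total = group_usage
    for submitter, idle in idle_by_submitter.items():
        matched = 0
        while matched < idle and group_total < quota:
            matched += 1
            group_total += 1  # accumulate across every submitter in the group
        matches[submitter] = matched
    return matches


idle_jobs = {"a.u2": 2, "a.u3": 2, "a.u4": 2, "a.u5": 2}
already_running = 9  # a.u1

faulty = negotiate_without_group_total(idle_jobs, already_running)
corrected = negotiate_with_group_total(idle_jobs, already_running)

print("faulty total running:", already_running + sum(faulty.values()))
print("corrected total running:", already_running + sum(corrected.values()))
```

In this model the faulty loop starts all eight new jobs, for 17 running in total as in the report, while the loop with a shared total stops at the quota of 10.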
Created attachment 473340 [details]
NegotiatorLog

After

# now submit 8 more jobs: 2 jobs each from four new users, but all under group "a"
# echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u2\"\nqueue 2\n+AccountingGroup=\"a.u3\"\nqueue 2\n+AccountingGroup=\"a.u4\"\nqueue 2\n+AccountingGroup=\"a.u5\"\nqueue 2\n" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s)........
8 job(s) submitted to cluster 2.

condor runs one job for each of the a.<user> submitters:
a.u5 - 1
a.u4 - 1
a.u3 - 1
a.u2 - 1
Together with the 9 jobs from a.u1, there are 13 running jobs for group a.

# condor_q

-- Submitter: hostname : <IP:44168> : hostname
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.1   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.2   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.3   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.4   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.5   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.6   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.7   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.8   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 2.0   condor  1/13 09:23   0+00:06:56 R  0   0.0  sleep 600
 2.1   condor  1/13 09:23   0+00:00:00 I  0   0.0  sleep 600
 2.2   condor  1/13 09:23   0+00:06:56 R  0   0.0  sleep 600
 2.3   condor  1/13 09:23   0+00:00:00 I  0   0.0  sleep 600
 2.4   condor  1/13 09:23   0+00:06:56 R  0   0.0  sleep 600
 2.5   condor  1/13 09:23   0+00:00:00 I  0   0.0  sleep 600
 2.6   condor  1/13 09:23   0+00:06:56 R  0   0.0  sleep 600
 2.7   condor  1/13 09:23   0+00:00:00 I  0   0.0  sleep 600

17 jobs; 4 idle, 13 running, 0 held

But group a has a limit of 10 resources.
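For anyone re-running this check, the running count can be tallied from the condor_q text instead of by eye. The sketch below is a hypothetical helper, not part of Condor; it assumes the column layout shown above (state in the sixth whitespace-separated field) and that every listed job belongs to group a, which is true in this test but is not visible in the condor_q output itself.

```python
from collections import Counter

# Job rows copied from the condor_q output in this comment.
CONDOR_Q_OUTPUT = """\
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.1   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.2   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.3   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.4   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.5   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.6   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.7   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 1.8   condor  1/13 09:22   0+00:08:16 R  0   0.0  sleep 600
 2.0   condor  1/13 09:23   0+00:06:56 R  0   0.0  sleep 600
 2.1   condor  1/13 09:23   0+00:00:00 I  0   0.0  sleep 600
 2.2   condor  1/13 09:23   0+00:06:56 R  0   0.0  sleep 600
 2.3   condor  1/13 09:23   0+00:00:00 I  0   0.0  sleep 600
 2.4   condor  1/13 09:23   0+00:06:56 R  0   0.0  sleep 600
 2.5   condor  1/13 09:23   0+00:00:00 I  0   0.0  sleep 600
 2.6   condor  1/13 09:23   0+00:06:56 R  0   0.0  sleep 600
 2.7   condor  1/13 09:23   0+00:00:00 I  0   0.0  sleep 600

17 jobs; 4 idle, 13 running, 0 held
"""

GROUP_A_QUOTA = 10  # 0.5 * 20 slots, from the configuration in the report


def is_job_id(token):
    """True for tokens of the form cluster.proc, e.g. '2.7'."""
    cluster, dot, proc = token.partition(".")
    return dot == "." and cluster.isdigit() and proc.isdigit()


def count_states(text):
    """Count job rows per state letter (R = running, I = idle)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Header and summary lines do not start with a cluster.proc id.
        if len(fields) >= 6 and is_job_id(fields[0]):
            counts[fields[5]] += 1
    return counts


counts = count_states(CONDOR_Q_OUTPUT)
over_quota = counts["R"] - GROUP_A_QUOTA

print("running:", counts["R"], "idle:", counts["I"])
print("jobs over the group a quota:", over_quota)
```

The tally agrees with the condor_q summary line and puts group a three jobs over its limit.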
The previous Comment 6 was tested with condor version: condor-7.4.5-0.6
Pending fix pushed to V7_4-BZ641418-group-submitter-limits
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -5,7 +5,7 @@
 The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.

 Fix:
-A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.
+A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected. Additionally, a bug in the computation of "pieLeft" was fixed so that its value correctly respects group quota limits.

 Result:
 Accounting group slot limits are obeyed.
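The note does not say how "pieLeft" was computed before or after the fix, so the following is only a guess at the general idea, written as a toy model. Everything here is an assumption: `run_cycle`, the per-submitter limit of 1 (chosen to mirror the one-job-per-submitter outcome in the NegotiatorLog comment), and the two ways of initializing the budget. The point is simply that a budget taken from the unclaimed slots alone lets each submitter take its share, while a budget capped by the group's remaining quota does not.

```python
# Toy model of a per-cycle slot budget; not the negotiator's actual pieLeft code.
def run_cycle(pie_left, submitters, per_submitter_limit=1):
    """Hand out slots to each submitter until the budget (pie_left) is used up."""
    matches = {}
    for name in submitters:
        grant = min(per_submitter_limit, pie_left)
        matches[name] = grant
        pie_left -= grant
    return matches


unclaimed_slots = 11  # from the report: 20 slots, 9 claimed by a.u1
group_quota = 10
group_usage = 9
submitters = ["a.u2", "a.u3", "a.u4", "a.u5"]

# Assumed faulty budget: only the pool of unclaimed slots is considered.
pie_left_uncapped = unclaimed_slots
# Assumed corrected budget: also capped by what the group may still use.
pie_left_capped = min(unclaimed_slots, group_quota - group_usage)

before = run_cycle(pie_left_uncapped, submitters)
after = run_cycle(pie_left_capped, submitters)

print("uncapped budget, total running:", group_usage + sum(before.values()))
print("capped budget, total running:", group_usage + sum(after.values()))
```

Under these assumptions the uncapped budget gives 13 running jobs, the count seen with condor-7.4.5-0.6, and the capped budget gives 10.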
Tested with (version):
condor-7.4.5-0.7

Tested on:
RHEL4 i386,x86_64 - passed
RHEL5 i386,x86_64 - passed

>>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,11 +1 @@
-Cause:
+Previously, the negotiator fails to properly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
-Submitting jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... )
-
-Consequence:
-The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.
-
-Fix:
-A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected. Additionally, a bug in the computation of "pieLeft" was fixed so that its value correctly respects group quota limits.
-
-Result:
-Accounting group slot limits are obeyed.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Previously, the negotiator fails to properly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
+Previously, the negotiator failed to correctly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulation of the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html