Bug 641418
| Summary: | negotiator does not enforce HFS group limits in scenarios with multiple submitters against same group |
|---|---|
| Product: | Red Hat Enterprise MRG |
| Component: | condor |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | low |
| Version: | 1.3 |
| Target Milestone: | 1.3.2 |
| Target Release: | --- |
| Hardware: | All |
| OS: | All |
| Reporter: | Erik Erlandson <eerlands> |
| Assignee: | Erik Erlandson <eerlands> |
| QA Contact: | Lubos Trilety <ltrilety> |
| CC: | claudiol, fnadge, ltrilety, matt |
| Fixed In Version: | condor-7.4.5-0.7 |
| Doc Type: | Bug Fix |
| Doc Text: | Previously, the negotiator failed to correctly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulation of the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected. |
| Last Closed: | 2011-02-15 12:16:06 UTC |
| Bug Blocks: | 641431 |
Created attachment 454216 [details]
patch
Strange. This was tested, so either the previous test didn't catch the issue (and it exists in the flat upstream group code too) or we had a regression. I suspect the former. One thing to note: initially I couldn't reproduce it, but later I was able to reproduce it consistently.
In any case, here is a patch that appears to fix the problem.
Incorporated Jon's fix here: V7_4-BZ619557-HFS-tree-structure
Successfully reproduced on:
$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause:
Submitting jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... )
Consequence:
The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.
Fix:
A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.
Result:
Accounting group slot limits are obeyed.
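As an illustration of the fix described above, here is a minimal Python sketch (not the actual HTCondor negotiator source; all names are hypothetical) of the corrected accounting: the number of jobs matched must be accumulated across all submitters in an accounting group, and each submitter's "pieLeft" must be bounded by the quota remaining for the whole group.

```python
# Hypothetical sketch of group-quota accounting in a negotiation loop.
# Names (remaining_pie, negotiate, group_matched) are illustrative only.

def remaining_pie(group_quota, group_matched, submitter_demand):
    """pieLeft for one submitter: capped by what the whole group has left."""
    group_left = max(0, group_quota - group_matched)
    return min(submitter_demand, group_left)

def negotiate(group_quota, submitters):
    """submitters: list of (name, idle_jobs) in one accounting group.
    Returns the number of matches granted to each submitter."""
    group_matched = 0  # accumulated across ALL submitters in the group;
                       # the bug was failing to carry this total correctly
    matches = {}
    for name, idle in submitters:
        pie_left = remaining_pie(group_quota, group_matched, idle)
        granted = min(idle, pie_left)
        matches[name] = granted
        group_matched += granted
    return matches

# Reproducer from this bug: a.u1 already takes 9 of group a's quota of 10,
# then a.u2..a.u5 request 2 each; only 1 more match should be granted.
m = negotiate(10, [("a.u1", 9), ("a.u2", 2), ("a.u3", 2),
                   ("a.u4", 2), ("a.u5", 2)])
print(m, sum(m.values()))  # group total stays at 10
```

With per-submitter-only accounting (the buggy behavior), each of a.u2..a.u5 would see a fresh share of the quota and all 8 extra jobs would match, which is exactly the 13-running-jobs symptom reported below.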
Created attachment 473340 [details]
NegotiatorLog
After the fix:
# now submit 8 more jobs: 2 jobs each from four new users, but all under group "a"
# echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u2\"\nqueue 2\n+AccountingGroup=\"a.u3\"\nqueue 2\n+AccountingGroup=\"a.u4\"\nqueue 2\n+AccountingGroup=\"a.u5\"\nqueue 2\n" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s)........
8 job(s) submitted to cluster 2.
Condor ran one job for each of the new a.<user> submitters:
a.u5 - 1
a.u4 - 1
a.u3 - 1
a.u2 - 1
Together with a.u1 - 9, there are 13 running jobs for group a.
# condor_q
-- Submitter: hostname : <IP:44168> : hostname
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
1.1 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
1.2 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
1.3 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
1.4 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
1.5 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
1.6 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
1.7 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
1.8 condor 1/13 09:22 0+00:08:16 R 0 0.0 sleep 600
2.0 condor 1/13 09:23 0+00:06:56 R 0 0.0 sleep 600
2.1 condor 1/13 09:23 0+00:00:00 I 0 0.0 sleep 600
2.2 condor 1/13 09:23 0+00:06:56 R 0 0.0 sleep 600
2.3 condor 1/13 09:23 0+00:00:00 I 0 0.0 sleep 600
2.4 condor 1/13 09:23 0+00:06:56 R 0 0.0 sleep 600
2.5 condor 1/13 09:23 0+00:00:00 I 0 0.0 sleep 600
2.6 condor 1/13 09:23 0+00:06:56 R 0 0.0 sleep 600
2.7 condor 1/13 09:23 0+00:00:00 I 0 0.0 sleep 600
17 jobs; 4 idle, 13 running, 0 held
But group a had a limit of 10 resources.
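The arithmetic behind the violation observed in the transcript above can be checked directly (plain Python, counts taken from the condor_q output):

```python
# Group "a" is configured with a quota of 10 slots, but after the second
# submission 9 jobs (a.u1) plus 4 jobs (one each for a.u2..a.u5) are running.
quota_a = 10
running = {"a.u1": 9, "a.u2": 1, "a.u3": 1, "a.u4": 1, "a.u5": 1}
total = sum(running.values())
print(total, total > quota_a)  # 13 running jobs, exceeding the quota
```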
The previous Comment 6 was tested with condor version: condor-7.4.5-0.6
Pending fix pushed to V7_4-BZ641418-group-submitter-limits
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -5,7 +5,7 @@
The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.
Fix:
-A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.
+A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected. Additionally, a bug in the computation of "pieLeft" was fixed so that its value correctly respects group quota limits.
Result:
Accounting group slot limits are obeyed.
Tested with (version):
condor-7.4.5-0.7
Tested on:
RHEL4 i386,x86_64 - passed
RHEL5 i386,x86_64 - passed
>>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1,11 +1 @@
-Cause:
+Previously, the negotiator fails to properly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
-Submitting jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... )
-
-Consequence:
-The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.
-
-Fix:
-A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected. Additionally, a bug in the computation of "pieLeft" was fixed so that its value correctly respects group quota limits.
-
-Result:
-Accounting group slot limits are obeyed.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1 +1 @@
-Previously, the negotiator fails to properly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
+Previously, the negotiator failed to correctly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulation of the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
An advisory has been issued which should resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0217.html
Description of problem:
It is possible to construct a submission scenario that completely violates the group quota limit configured for HFS.
Steps to Reproduce:
# configure 20 slots, 10 to group a and 10 to group b
[condor@rorschach config.d]$ more 15.slot.config
NUM_CPUS = 20
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 0
SLOT_TYPE_2 = cpus=1
SLOT_TYPE_2_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 20
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5
GROUP_AUTOREGROUP_a = FALSE
GROUP_AUTOREGROUP_b = FALSE
# check our slots:
[eje@rorschach ~]$ svhist
Machine _SlotType_ Cpus State Activity
20 rorschach.localdomain | X | 1 | Unclaimed | Idle
20 total
# submit 9 jobs to group "a" (should be one left)
[eje@rorschach ~]$ echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u1\"\nqueue 9\n" | condor_submit
Submitting job(s).........
9 job(s) submitted to cluster 126.
[eje@rorschach ~]$ svhist
Machine _SlotType_ Cpus State Activity
9 rorschach.localdomain | X | 1 | Claimed | Busy
11 rorschach.localdomain | X | 1 | Unclaimed | Idle
20 total
# now submit 8 more jobs: 2 jobs each from four new users, but all under group "a"
[eje@rorschach ~]$ echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u2\"\nqueue 2\n+AccountingGroup=\"a.u3\"\nqueue 2\n+AccountingGroup=\"a.u4\"\nqueue 2\n+AccountingGroup=\"a.u5\"\nqueue 2\n" | condor_submit
Submitting job(s)........
8 job(s) submitted to cluster 127.
# expecting one more slot to fill (was intending to study reject logic for submitters)
# but -- oh no -- all 8 jobs got scheduled
[eje@rorschach ~]$ !svhist
svhist
Machine _SlotType_ Cpus State Activity
17 rorschach.localdomain | X | 1 | Claimed | Busy
3 rorschach.localdomain | X | 1 | Unclaimed | Idle
20 total
[eje@rorschach ~]$ condor_q
-- Submitter: rorschach.localdomain : <192.168.1.2:49106> : rorschach.localdomain
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
126.0 eje 10/7 16:12 0+00:09:59 R 0 0.0 sleep 600
126.1 eje 10/7 16:12 0+00:09:58 R 0 0.0 sleep 600
126.2 eje 10/7 16:12 0+00:09:58 R 0 0.0 sleep 600
126.3 eje 10/7 16:12 0+00:09:58 R 0 0.0 sleep 600
126.4 eje 10/7 16:12 0+00:09:58 R 0 0.0 sleep 600
126.5 eje 10/7 16:12 0+00:09:58 R 0 0.0 sleep 600
126.6 eje 10/7 16:12 0+00:09:58 R 0 0.0 sleep 600
126.7 eje 10/7 16:12 0+00:09:58 R 0 0.0 sleep 600
126.8 eje 10/7 16:12 0+00:09:58 R 0 0.0 sleep 600
127.0 eje 10/7 16:16 0+00:06:34 R 0 0.0 sleep 600
127.1 eje 10/7 16:16 0+00:06:32 R 0 0.0 sleep 600
127.2 eje 10/7 16:16 0+00:06:34 R 0 0.0 sleep 600
127.3 eje 10/7 16:16 0+00:06:33 R 0 0.0 sleep 600
127.4 eje 10/7 16:16 0+00:06:34 R 0 0.0 sleep 600
127.5 eje 10/7 16:16 0+00:06:33 R 0 0.0 sleep 600
127.6 eje 10/7 16:16 0+00:06:33 R 0 0.0 sleep 600
127.7 eje 10/7 16:16 0+00:06:33 R 0 0.0 sleep 600
17 jobs; 0 idle, 17 running, 0 held
[eje@rorschach ~]$
Expected results:
System should enforce the limit of 10 slots for group a, and schedule a maximum of 10 jobs running at any one time.
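For reference, a short Python sketch of how the dynamic quotas in the reproducer's config resolve to absolute slot counts (the dict keys mirror the GROUP_QUOTA_DYNAMIC_* knobs shown above; the resolution formula is the straightforward fraction-of-pool reading of those settings):

```python
# Resolve dynamic group quotas to absolute slot counts, per the config
# in the reproducer: 20 slots, each group gets a 0.5 fraction.
NUM_CPUS = 20
GROUP_QUOTA_DYNAMIC = {"a": 0.5, "b": 0.5}

quotas = {g: int(f * NUM_CPUS) for g, f in GROUP_QUOTA_DYNAMIC.items()}
print(quotas)  # each group should be capped at 10 slots

# With 9 jobs already running under group "a", submitting 8 more under
# a.u2..a.u5 should start at most 1 more; instead all 8 ran (17 total).
assert quotas["a"] == 10
```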