Bug 641418 - negotiator does not enforce HFS group limits in scenarios with multiple submitters against same group
Summary: negotiator does not enforce HFS group limits in scenarios with multiple submitters against same group
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: All
OS: All
Priority: low
Severity: high
Target Milestone: 1.3.2
Target Release: ---
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 641431
 
Reported: 2010-10-08 16:34 UTC by Erik Erlandson
Modified: 2018-11-14 16:37 UTC
CC List: 4 users

Fixed In Version: condor-7.4.5-0.7
Doc Type: Bug Fix
Doc Text:
Previously, the negotiator failed to correctly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
Clone Of:
Environment:
Last Closed: 2011-02-15 12:16:06 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
patch (1.50 KB, patch)
2010-10-18 21:01 UTC, Jon Thomas
no flags
NegotiatorLog (860.44 KB, text/plain)
2011-01-13 14:50 UTC, Lubos Trilety
no flags


Links
System: Red Hat Product Errata
ID: RHBA-2011:0217
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update
Last Updated: 2011-02-15 12:10:15 UTC

Description Erik Erlandson 2010-10-08 16:34:22 UTC
Description of problem:
It is possible to construct a submission scenario that completely violates the group quota limits configured for hierarchical fair share (HFS).


Steps to Reproduce:
#configure 20 slots, 10 to group a and 10 to group b
[condor@rorschach config.d]$ more 15.slot.config 
NUM_CPUS = 20

SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 0

SLOT_TYPE_2 = cpus=1
SLOT_TYPE_2_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 20

GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5
GROUP_AUTOREGROUP_a = FALSE
GROUP_AUTOREGROUP_b = FALSE
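
# As a sanity check (assuming a standard HTCondor install), the parsed quotas
# can be confirmed with condor_config_val; with the config above, the three
# queries should print "a, b", "0.5", and "0.5" on separate lines:
[condor@rorschach config.d]$ condor_config_val GROUP_NAMES GROUP_QUOTA_DYNAMIC_a GROUP_QUOTA_DYNAMIC_b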


# check our slots:
[eje@rorschach ~]$ svhist Machine _SlotType_ Cpus State Activity
     20 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total
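
# (svhist is evidently a local helper that histograms ClassAd attributes;
# under that assumption, a rough equivalent using only the stock
# condor_status tool would be:)
[eje@rorschach ~]$ condor_status -format "%s | " Machine -format "%d | " Cpus -format "%s | " State -format "%s\n" Activity | sort | uniq -c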

# submit 9 jobs to group "a"  (should be one left)
[eje@rorschach ~]$ echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u1\"\nqueue 9\n" | condor_submit
Submitting job(s).........
9 job(s) submitted to cluster 126.

[eje@rorschach ~]$ svhist Machine _SlotType_ Cpus State Activity
      9 rorschach.localdomain | X | 1 | Claimed | Busy
     11 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total


# now submit 8 more jobs:  2 jobs each from four new users, but all under group "a"
[eje@rorschach ~]$ echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u2\"\nqueue 2\n+AccountingGroup=\"a.u3\"\nqueue 2\n+AccountingGroup=\"a.u4\"\nqueue 2\n+AccountingGroup=\"a.u5\"\nqueue 2\n" | condor_submit
Submitting job(s)........
8 job(s) submitted to cluster 127.

# expecting one more slot to fill (was intending to study reject logic for submitters)
# but -- oh no -- all 8 jobs got scheduled
[eje@rorschach ~]$ !svhist
svhist Machine _SlotType_ Cpus State Activity
     17 rorschach.localdomain | X | 1 | Claimed | Busy
      3 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total

[eje@rorschach ~]$ condor_q
-- Submitter: rorschach.localdomain : <192.168.1.2:49106> : rorschach.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 126.0   eje            10/7  16:12   0+00:09:59 R  0   0.0  sleep 600         
 126.1   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.2   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.3   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.4   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.5   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.6   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.7   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.8   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 127.0   eje            10/7  16:16   0+00:06:34 R  0   0.0  sleep 600         
 127.1   eje            10/7  16:16   0+00:06:32 R  0   0.0  sleep 600         
 127.2   eje            10/7  16:16   0+00:06:34 R  0   0.0  sleep 600         
 127.3   eje            10/7  16:16   0+00:06:33 R  0   0.0  sleep 600         
 127.4   eje            10/7  16:16   0+00:06:34 R  0   0.0  sleep 600         
 127.5   eje            10/7  16:16   0+00:06:33 R  0   0.0  sleep 600         
 127.6   eje            10/7  16:16   0+00:06:33 R  0   0.0  sleep 600         
 127.7   eje            10/7  16:16   0+00:06:33 R  0   0.0  sleep 600         

17 jobs; 0 idle, 17 running, 0 held
[eje@rorschach ~]$

Expected results:
The system should enforce the limit of 10 slots for group a, and schedule a maximum of 10 running jobs at any one time.
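
For reference, the quota arithmetic is: GROUP_QUOTA_DYNAMIC_a (0.5) x 20 single-cpu slots = 10 slots for group a. With 9 slots already claimed by cluster 126, only 1 of the 8 jobs in cluster 127 should have matched, leaving 7 idle.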

Comment 1 Jon Thomas 2010-10-18 21:01:24 UTC
Created attachment 454216 [details]
patch

Strange. This was tested, so either the previous test didn't catch the issue (and it exists in the flat upstream group code too) or we had a regression; I suspect the former. One thing to note: initially I couldn't reproduce it, but later I was able to consistently.

In any case, here is a patch that appears to fix the problem.

Comment 2 Erik Erlandson 2010-11-19 21:26:06 UTC
Incorporated Jon's fix here:
V7_4-BZ619557-HFS-tree-structure

Comment 3 Lubos Trilety 2010-11-30 09:27:22 UTC
Successfully reproduced on:
$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Comment 4 Erik Erlandson 2010-12-21 19:09:01 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
Submitting jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... )

Consequence:
The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.

Fix:
A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.

Result:
Accounting group slot limits are obeyed.

Comment 6 Lubos Trilety 2011-01-13 14:50:36 UTC
Created attachment 473340 [details]
NegotiatorLog

After

# now submit 8 more jobs:  2 jobs each from four new users, but all under group "a"
# echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u2\"\nqueue 2\n+AccountingGroup=\"a.u3\"\nqueue 2\n+AccountingGroup=\"a.u4\"\nqueue 2\n+AccountingGroup=\"a.u5\"\nqueue 2\n" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s)........
8 job(s) submitted to cluster 2.

condor ran one job for each of a.u2 through a.u5:
a.u5 - 1
a.u4 - 1
a.u3 - 1
a.u2 - 1

Together with the 9 jobs from a.u1, there are 13 running jobs for group a.
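
That per-group tally can be reproduced with the stock tools (AccountingGroup is set explicitly in these submits, and JobStatus == 2 means Running):

# condor_q -constraint 'JobStatus == 2' -format "%s\n" AccountingGroup | sort | uniq -c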

# condor_q
-- Submitter: hostname : <IP:44168> : hostname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.1   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.2   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.3   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.4   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.5   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.6   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.7   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.8   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   2.0   condor          1/13 09:23   0+00:06:56 R  0   0.0  sleep 600         
   2.1   condor          1/13 09:23   0+00:00:00 I  0   0.0  sleep 600         
   2.2   condor          1/13 09:23   0+00:06:56 R  0   0.0  sleep 600         
   2.3   condor          1/13 09:23   0+00:00:00 I  0   0.0  sleep 600         
   2.4   condor          1/13 09:23   0+00:06:56 R  0   0.0  sleep 600         
   2.5   condor          1/13 09:23   0+00:00:00 I  0   0.0  sleep 600         
   2.6   condor          1/13 09:23   0+00:06:56 R  0   0.0  sleep 600         
   2.7   condor          1/13 09:23   0+00:00:00 I  0   0.0  sleep 600         
17 jobs; 4 idle, 13 running, 0 held

But group a has a limit of 10 slots.
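
The quota decisions behind this are recorded in the attached NegotiatorLog; assuming the default log location and D_FULLDEBUG negotiator logging, the relevant accounting lines can be pulled out with:

# grep -i "pieLeft" /var/log/condor/NegotiatorLog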

Comment 7 Lubos Trilety 2011-01-13 14:52:29 UTC
The previous Comment 6 was tested with condor version:
condor-7.4.5-0.6

Comment 8 Erik Erlandson 2011-01-14 20:30:00 UTC
Pending fix pushed to V7_4-BZ641418-group-submitter-limits

Comment 9 Erik Erlandson 2011-01-14 20:30:00 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -5,7 +5,7 @@
 The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.
 
 Fix:
-A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.
+A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.  Additionally, a bug in the computation of "pieLeft" was fixed so that its value correctly respects group quota limits.
 
 Result:
 Accounting group slot limits are obeyed.

Comment 11 Lubos Trilety 2011-01-24 10:50:41 UTC
Tested with (version):
condor-7.4.5-0.7

Tested on:
RHEL4 i386,x86_64  - passed
RHEL5 i386,x86_64  - passed

>>> VERIFIED

Comment 12 Florian Nadge 2011-02-09 14:06:39 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,11 +1 @@
-Cause:
-Submitting jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... )
-
-Consequence:
-The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.
-
-Fix:
-A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.  Additionally, a bug in the computation of "pieLeft" was fixed so that its value correctly respects group quota limits.
-
-Result:
-Accounting group slot limits are obeyed.
+Previously, the negotiator fails to properly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.

Comment 13 Florian Nadge 2011-02-09 17:46:49 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, the negotiator fails to properly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
+Previously, the negotiator failed to correctly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.

Comment 14 errata-xmlrpc 2011-02-15 12:16:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html

