Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 641418

Summary: negotiator does not enforce HFS group limits in scenarios with multiple submitters against same group
Product: Red Hat Enterprise MRG
Reporter: Erik Erlandson <eerlands>
Component: condor
Assignee: Erik Erlandson <eerlands>
Status: CLOSED ERRATA
QA Contact: Lubos Trilety <ltrilety>
Severity: high
Priority: low
Version: 1.3
CC: claudiol, fnadge, ltrilety, matt
Target Milestone: 1.3.2
Target Release: ---
Hardware: All
OS: All
Fixed In Version: condor-7.4.5-0.7
Doc Type: Bug Fix
Doc Text:
Previously, the negotiator failed to correctly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulation of the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
Story Points: ---
Last Closed: 2011-02-15 12:16:06 UTC
Bug Blocks: 641431    
Attachments:
Description     Flags
patch           none
NegotiatorLog   none

Description Erik Erlandson 2010-10-08 16:34:22 UTC
Description of problem:
It is possible to construct a submission scenario that completely violates the group quota limit configured for HFS.


Steps to Reproduce:
# configure 20 slots: 10 to group a and 10 to group b
[condor@rorschach config.d]$ more 15.slot.config 
NUM_CPUS = 20

SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 0

SLOT_TYPE_2 = cpus=1
SLOT_TYPE_2_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 20

GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5
GROUP_AUTOREGROUP_a = FALSE
GROUP_AUTOREGROUP_b = FALSE

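For reference, the dynamic quota fractions above map to static slot counts by a simple product against the total pool. A minimal sketch of that arithmetic (`compute_quotas` is a hypothetical helper for illustration, not part of condor):

```python
# Sketch of how dynamic group quota fractions map to effective slot counts,
# assuming each quota is a fraction of the total pool. Hypothetical helper,
# not condor code.
def compute_quotas(total_slots, dynamic_quotas):
    """Return the effective slot quota per group."""
    return {group: int(total_slots * fraction)
            for group, fraction in dynamic_quotas.items()}

# The config above: 20 slots, groups a and b each at 0.5
quotas = compute_quotas(20, {"a": 0.5, "b": 0.5})
print(quotas)  # each group should be capped at 10 of the 20 slots
```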

# check our slots:
[eje@rorschach ~]$ svhist Machine _SlotType_ Cpus State Activity
     20 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total

# submit 9 jobs to group "a"  (should be one left)
[eje@rorschach ~]$ echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u1\"\nqueue 9\n" | condor_submit
Submitting job(s).........
9 job(s) submitted to cluster 126.

[eje@rorschach ~]$ svhist Machine _SlotType_ Cpus State Activity
      9 rorschach.localdomain | X | 1 | Claimed | Busy
     11 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total


# now submit 8 more jobs:  2 jobs each from four new users, but all under group "a"
[eje@rorschach ~]$ echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u2\"\nqueue 2\n+AccountingGroup=\"a.u3\"\nqueue 2\n+AccountingGroup=\"a.u4\"\nqueue 2\n+AccountingGroup=\"a.u5\"\nqueue 2\n" | condor_submit
Submitting job(s)........
8 job(s) submitted to cluster 127.

# expecting one more slot to fill (was intending to study reject logic for submitters)
# but -- oh no -- all 8 jobs got scheduled
[eje@rorschach ~]$ !svhist
svhist Machine _SlotType_ Cpus State Activity
     17 rorschach.localdomain | X | 1 | Claimed | Busy
      3 rorschach.localdomain | X | 1 | Unclaimed | Idle
     20 total

[eje@rorschach ~]$ condor_q
-- Submitter: rorschach.localdomain : <192.168.1.2:49106> : rorschach.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 126.0   eje            10/7  16:12   0+00:09:59 R  0   0.0  sleep 600         
 126.1   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.2   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.3   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.4   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.5   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.6   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.7   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 126.8   eje            10/7  16:12   0+00:09:58 R  0   0.0  sleep 600         
 127.0   eje            10/7  16:16   0+00:06:34 R  0   0.0  sleep 600         
 127.1   eje            10/7  16:16   0+00:06:32 R  0   0.0  sleep 600         
 127.2   eje            10/7  16:16   0+00:06:34 R  0   0.0  sleep 600         
 127.3   eje            10/7  16:16   0+00:06:33 R  0   0.0  sleep 600         
 127.4   eje            10/7  16:16   0+00:06:34 R  0   0.0  sleep 600         
 127.5   eje            10/7  16:16   0+00:06:33 R  0   0.0  sleep 600         
 127.6   eje            10/7  16:16   0+00:06:33 R  0   0.0  sleep 600         
 127.7   eje            10/7  16:16   0+00:06:33 R  0   0.0  sleep 600         

17 jobs; 0 idle, 17 running, 0 held
[eje@rorschach ~]$

Expected results:
The system should enforce the limit of 10 slots for group a, scheduling a maximum of 10 jobs running at any one time.

Comment 1 Jon Thomas 2010-10-18 21:01:24 UTC
Created attachment 454216 [details]
patch

Strange. This was tested, so either the previous test didn't catch the issue (in which case it exists in the flat upstream group code too) or we had a regression. I suspect the former. One thing to note: initially I couldn't reproduce it, but later I was able to do so consistently.

In any case, here is a patch that appears to fix the problem.

Comment 2 Erik Erlandson 2010-11-19 21:26:06 UTC
Incorporated Jon's fix here:
V7_4-BZ619557-HFS-tree-structure

Comment 3 Lubos Trilety 2010-11-30 09:27:22 UTC
Successfully reproduced on:
$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Comment 4 Erik Erlandson 2010-12-21 19:09:01 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
Submitting jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... )

Consequence:
The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.

Fix:
A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.

Result:
Accounting group slot limits are obeyed.
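The class of bug described above can be illustrated conceptually: if the matched-job count is tracked per submitter rather than accumulated across all submitters of the group, each new submitter effectively starts from zero and the group quota is never reached. A minimal sketch of the corrected accumulation (illustrative only, not the actual negotiator source):

```python
# Conceptual illustration of the fix described above, not actual condor code:
# the matched-job total must accumulate across every submitter in the
# accounting group so that the group quota binds for all of them together.

def negotiate(requests, group_quota):
    """requests: list of (submitter, num_jobs), all under one accounting group."""
    matched_total = 0      # correct: one counter accumulated across submitters
    scheduled = []
    for submitter, num_jobs in requests:
        for _ in range(num_jobs):
            if matched_total >= group_quota:   # enforced for the whole group
                return scheduled
            scheduled.append(submitter)
            matched_total += 1
    return scheduled

# Reproducer shape from this bug: 9 jobs from a.u1, then 2 each from a.u2..a.u5
reqs = [("a.u1", 9), ("a.u2", 2), ("a.u3", 2), ("a.u4", 2), ("a.u5", 2)]
print(len(negotiate(reqs, group_quota=10)))  # 10, not 17
```

With a per-submitter counter instead of `matched_total`, all 17 jobs would match, which is the observed failure.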

Comment 6 Lubos Trilety 2011-01-13 14:50:36 UTC
Created attachment 473340 [details]
NegotiatorLog

After
# now submit 8 more jobs:  2 jobs each from four new users, but all under group "a"
# echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=600\n+AccountingGroup=\"a.u2\"\nqueue 2\n+AccountingGroup=\"a.u3\"\nqueue 2\n+AccountingGroup=\"a.u4\"\nqueue 2\n+AccountingGroup=\"a.u5\"\nqueue 2\n" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s)........
8 job(s) submitted to cluster 2.

condor ran one job for each of the a.<user> submitters:
a.u5 - 1
a.u4 - 1
a.u3 - 1
a.u2 - 1

Together with the 9 jobs from a.u1, there are 13 running jobs for group a.

# condor_q
-- Submitter: hostname : <IP:44168> : hostname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.1   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.2   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.3   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.4   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.5   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.6   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.7   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   1.8   condor          1/13 09:22   0+00:08:16 R  0   0.0  sleep 600         
   2.0   condor          1/13 09:23   0+00:06:56 R  0   0.0  sleep 600         
   2.1   condor          1/13 09:23   0+00:00:00 I  0   0.0  sleep 600         
   2.2   condor          1/13 09:23   0+00:06:56 R  0   0.0  sleep 600         
   2.3   condor          1/13 09:23   0+00:00:00 I  0   0.0  sleep 600         
   2.4   condor          1/13 09:23   0+00:06:56 R  0   0.0  sleep 600         
   2.5   condor          1/13 09:23   0+00:00:00 I  0   0.0  sleep 600         
   2.6   condor          1/13 09:23   0+00:06:56 R  0   0.0  sleep 600         
   2.7   condor          1/13 09:23   0+00:00:00 I  0   0.0  sleep 600         
17 jobs; 4 idle, 13 running, 0 held

But group a had a limit of 10 resources.

Comment 7 Lubos Trilety 2011-01-13 14:52:29 UTC
The previous Comment 6 was tested with condor version:
condor-7.4.5-0.6

Comment 8 Erik Erlandson 2011-01-14 20:30:00 UTC
Pending fix pushed to V7_4-BZ641418-group-submitter-limits

Comment 9 Erik Erlandson 2011-01-14 20:30:00 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -5,7 +5,7 @@
 The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.
 
 Fix:
-A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.
+A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.  Additionally, a bug in the computation of "pieLeft" was fixed so that its value correctly respects group quota limits.
 
 Result:
 Accounting group slot limits are obeyed.
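The "pieLeft" part of the fix can be sketched conceptually: the remaining pie available to a group's submitters must be capped by the group's remaining quota headroom, not just by the number of idle slots. This is an illustrative sketch of that idea, not the actual condor implementation:

```python
# Conceptual sketch of the "pieLeft" fix described above (illustrative,
# not the actual negotiator code): the pie remaining for a group's
# submitters is bounded by both the idle slots and the group's unused quota.

def pie_left(idle_slots, group_quota, group_used):
    headroom = max(group_quota - group_used, 0)  # slots the group may still claim
    return min(idle_slots, headroom)

# Matching the reproducer: 11 idle slots, but group "a" already uses 9 of
# its quota of 10, so only 1 slot's worth of pie remains for a.* submitters.
print(pie_left(idle_slots=11, group_quota=10, group_used=9))  # 1
```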

Comment 11 Lubos Trilety 2011-01-24 10:50:41 UTC
Tested with (version):
condor-7.4.5-0.7

Tested on:
RHEL4 i386,x86_64  - passed
RHEL5 i386,x86_64  - passed

>>> VERIFIED

Comment 12 Florian Nadge 2011-02-09 14:06:39 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,11 +1 @@
-Cause:
+Previously, the negotiator fails to properly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
-Submitting jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... )
-
-Consequence:
-The negotiator fails to properly limit the number of jobs submitted under the accounting group in question.
-
-Fix:
-A bug in the logic for accumulating the total number of jobs matched in the negotiation loop was corrected.  Additionally, a bug in the computation of "pieLeft" was fixed so that its value correctly respects group quota limits.
-
-Result:
-Accounting group slot limits are obeyed.

Comment 13 Florian Nadge 2011-02-09 17:46:49 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, the negotiator fails to properly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulating the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.
+Previously, the negotiator failed to correctly limit the number of jobs submitted under the accounting group in question when jobs with multiple users under the same accounting group (e.g. "a.user1", "a.user2", "a.user3" ... ) were submitted. This update corrects the logic for accumulation of the total number of jobs matched in the negotiation loop and the computation of "pieLeft" so that its value correctly respects group quota limits. Now, accounting group slot limits are obeyed as expected.

Comment 14 errata-xmlrpc 2011-02-15 12:16:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html