Bug 674669 - RFE: include submitters hitting group limits in Negotiator classad stat LastNegotiationCycleSubmittersShareLimitX
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: All
OS: All
medium
low
Target Milestone: 2.0
: ---
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On: 641431
Blocks: 693778
Reported: 2011-02-02 21:13 UTC by Erik Erlandson
Modified: 2012-03-28 09:42 UTC
9 users

Fixed In Version: condor-7.5.6-0.1
Doc Type: Enhancement
Doc Text:
Cause
The LastNegotiationCycleSubmittersShareLimit<N> negotiator classad stat attribute did not include submitters hitting share limits in a group-quota scenario.

Consequence
The statistic did not count all submitters hitting limits when accounting groups were in use.

Change
Logic was added to include submitter names in the attribute when a submitter hit its limit in the context of a group quota limit.

Result
The LastNegotiationCycleSubmittersShareLimit<N> attribute now includes submitters hitting submitter limits in all cases, including group quota limits.

Release Notes Entry:
Previously, the LastNegotiationCycleSubmittersShareLimit<N> negotiator classad stat attribute did not account for a submitter reaching the share limits in a group-quota scenario. The negotiator now includes submitter names in the attribute when any submitter reaches the submitter limit, including group quota limits.
Clone Of:
Environment:
Last Closed: 2011-06-23 15:38:01 UTC
Target Upstream Version:
Embargoed:


Attachments
NegotiatorLog (142.03 KB, text/plain)
2011-05-11 15:20 UTC, Lubos Trilety


Links
System: Red Hat Product Errata
ID: RHEA-2011:0889
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Grid 2.0 Release
Last Updated: 2011-06-23 15:35:53 UTC

Description Erik Erlandson 2011-02-02 21:13:55 UTC
Description of problem:

LastNegotiationCycleSubmittersShareLimitX currently counts only submitters that specifically hit the submitter limit. This does not necessarily include submitters that run up against the accounting group limit. The statistic would be more intuitive if it counted the union of these cases.

Comment 1 Erik Erlandson 2011-02-02 21:15:38 UTC
Pushed a branch, V7_4-BZ641431-unify-share-group-limits, which updates the
semantics of LastNegotiationCycleSubmittersShareLimitX to include limits imposed
by accounting groups as well as share limits. This makes the statistic more
intuitive.

Also, I included this update upstream on the latest patch for gt#1393

Comment 3 Martin Kudlej 2011-03-07 13:21:22 UTC
What are the steps to verify this issue, please?

Comment 4 Erik Erlandson 2011-03-07 15:21:15 UTC
(In reply to comment #3)
> What are the steps to verify this issue, please?

You should be able to verify this by setting up a group with quota N and submitting N+1 jobs against it. The submitter should then show up in the LastNegotiationCycleSubmittersShareLimitX list.
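To script the check described above, something like the following Python sketch could be used. It only parses text: feed it the stdout of `condor_status -subsystem negotiator -l` and it reports which LastNegotiationCycleSubmittersShareLimit<N> attributes list a given submitter. The helper names are hypothetical, and the assumption that several names in one attribute are separated by commas or whitespace has not been checked against the negotiator source.

```python
import re
from typing import Dict, List

# Matches lines such as:  LastNegotiationCycleSubmittersShareLimit0 = "a.u1@hostname"
_ATTR_RE = re.compile(r'^LastNegotiationCycleSubmittersShareLimit(\d+)\s*=\s*"(.*)"$')


def submitters_at_share_limit(status_output: str) -> Dict[int, List[str]]:
    """Map cycle index N -> submitter names listed in LastNegotiationCycleSubmittersShareLimit<N>.

    ASSUMPTION (hypothetical helper): multiple names in one attribute are separated
    by commas and/or whitespace; adjust the split if the negotiator formats them differently.
    """
    cycles: Dict[int, List[str]] = {}
    for line in status_output.splitlines():
        m = _ATTR_RE.match(line.strip())
        if m:
            # Drop empty fragments so an attribute value of "" yields an empty list.
            cycles[int(m.group(1))] = [n for n in re.split(r"[,\s]+", m.group(2)) if n]
    return cycles


def verify(status_output: str, submitter: str) -> bool:
    """True if `submitter` appears in any retained cycle's share-limit list."""
    return any(submitter in names for names in submitters_at_share_limit(status_output).values())
```

For example, after submitting N+1 jobs against a group with quota N, `verify(output, "a.u1@hostname")` should return True if the submitter was recorded in any of the cycles the negotiator still reports.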

Comment 5 Erik Erlandson 2011-04-27 20:51:12 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
The LastNegotiationCycleSubmittersShareLimit<N> negotiator classad stat attribute did not include submitters hitting share limits in a group-quota scenario.

Consequence
The statistic did not count all submitters hitting limits when accounting groups were in use.

Change
Logic was added to include submitter names in the attribute when a submitter hit its limit in the context of a group quota limit.

Result
The LastNegotiationCycleSubmittersShareLimit<N> attribute now includes submitters hitting submitter limits in all cases, including group quota limits.

Comment 7 Lubos Trilety 2011-05-05 15:17:00 UTC
Successfully reproduced on:
$CondorVersion: 7.4.5 Feb  4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Scenario:
config file:
NUM_CPUS = 2
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5

# echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=1d\n+AccountingGroup=\"a.u3\"\nqueue\n+AccountingGroup=\"a.u1\"\nqueue\n+AccountingGroup=\"a.u2\"\nqueue\n" | runuser condor -s /bin/bash -c "condor_submit"
Submitting job(s)......
6 job(s) submitted to cluster 1.

# condor_q
-- Submitter: hostname : <IP:39251> : hostname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   condor          5/5  17:14   0+00:00:00 I  0   0.0  sleep 1d          
   1.1   condor          5/5  17:14   0+00:00:03 R  0   0.0  sleep 1d          
   1.2   condor          5/5  17:14   0+00:00:00 I  0   0.0  sleep 1d          

3 jobs; 2 idle, 1 running, 0 held

# condor_status -subsystem negotiator -l | grep SubmittersShareLimit
LastNegotiationCycleSubmittersShareLimit0 = ""
LastNegotiationCycleSubmittersShareLimit1 = ""
LastNegotiationCycleSubmittersShareLimit2 = ""

Comment 8 Lubos Trilety 2011-05-05 15:31:16 UTC
Tested on:
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

Scenario:
config file:
NUM_CPUS = 10
GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5

# echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=1d\n+AccountingGroup=\"a.u1\"\nqueue 3" | runuser condor -s /bin/bash -c "condor_submit"
Submitting job(s)...
3 job(s) submitted to cluster 1.

# cat /var/log/condor/NegotiatorLog | grep -e 'using its quota' -e 'exceeded'
05/05/11 17:23:14       Rejected 1.2 a.u1@hostname <IP:50966>: group quota exceeded
05/05/11 17:23:14 Group a is using its quota 3 - halting negotiation

# condor_status -subsystem negotiator -l | grep SubmittersShareLimit
LastNegotiationCycleSubmittersShareLimit0 = "a.u1@hostname"
LastNegotiationCycleSubmittersShareLimit1 = ""


The quota was not reached, and even so this statistic item was filled.

>>> ASSIGNED

Comment 9 Erik Erlandson 2011-05-10 22:20:02 UTC
(In reply to comment #8)

> | runuser condor -s /bin/bash -c "condor_submit"
> Submitting job(s)...
> 3 job(s) submitted to cluster 1.
> 
> # cat /var/log/condor/NegotiatorLog | grep -e 'using its quota' -e 'exceeded'
> 05/05/11 17:23:14       Rejected 1.2 a.u1@hostname <IP:50966>: group quota
> exceeded
> 05/05/11 17:23:14 Group a is using its quota 3 - halting negotiation
> 
> # condor_status -subsystem negotiator -l | grep SubmittersShareLimit
> LastNegotiationCycleSubmittersShareLimit0 = "a.u1@hostname"
> LastNegotiationCycleSubmittersShareLimit1 = ""
> 
> The quota was not reached and even than this statistic item was filled.


The behavior you're reporting above is correct:  what's happening is that the quota-assignment logic will only assign a group the quota it requests.  So group "a" was only requesting 3 slots (via "a.u1").  Therefore the negotiator assigned it a group quota of 3.  

In the negotiator loop, "a.u1" was assigned a submitter-limit of 3, which it hit when it got the 3 slots it asked for.  So, it emitted the log message 'group quota exceeded'  (a misleading log message that really means "I met my submitter limit and group-quotas are enabled") and then also tripped the *real* group quota check ("Group a is using its quota 3 - halting negotiation").

So "a.u1" did hit its submitter limit, and it is correct that it showed up in
LastNegotiationCycleSubmittersShareLimit0.
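The explanation above can be illustrated with a toy model. This is not the negotiator's code: the function names, the `min()` quota rule, and the order of the two checks inside the loop are assumptions chosen to mirror the description in this comment. It shows why a submitter that receives everything it asked for can still raise the internal flag (modelled by a hypothetical `rej_for_submitter_limit` variable standing in for rejForSubmitterLimit) that the LastNegotiationCycleSubmittersShareLimit<N> statistic is keyed off.

```python
import math
from typing import List, Tuple


def group_quota(total_slots: int, dynamic_fraction: float, requested: int) -> int:
    """Toy rule (assumption): the dynamic share, capped at what the group requests."""
    return min(math.floor(total_slots * dynamic_fraction), requested)


def negotiate_submitter(idle_jobs: int, submitter_limit: int) -> Tuple[int, bool]:
    """Match jobs one at a time; return (jobs matched, rej_for_submitter_limit).

    The limit is checked before "no more idle jobs", so reaching the limit exactly
    raises the flag even though no job was left waiting.
    """
    matched = 0
    rej_for_submitter_limit = False  # hypothetical stand-in for rejForSubmitterLimit
    while True:
        if matched >= submitter_limit:
            rej_for_submitter_limit = True
            break
        if matched >= idle_jobs:
            break
        matched += 1  # one slot matched to one idle job
    return matched, rej_for_submitter_limit


def share_limit_stat(submitter: str, rej_for_submitter_limit: bool) -> List[str]:
    """Names this toy cycle would report in LastNegotiationCycleSubmittersShareLimit<N>."""
    return [submitter] if rej_for_submitter_limit else []


# Comment 8 scenario: NUM_CPUS = 10, GROUP_QUOTA_DYNAMIC_a = 0.5, a.u1 queues 3 jobs.
quota = group_quota(total_slots=10, dynamic_fraction=0.5, requested=3)  # 3, not 5
matched, flag = negotiate_submitter(idle_jobs=3, submitter_limit=quota)
print(matched, flag, share_limit_stat("a.u1@hostname", flag))  # prints: 3 True ['a.u1@hostname']
```

Swapping the two checks would make the flag fire only when an idle job is actually left unmatched, which is closer to the reading in Comment 10; the rejection behaviour itself was split off into bug 703905.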

Comment 10 Lubos Trilety 2011-05-11 12:35:55 UTC
(In reply to comment #9)
> The behavior you're reporting above is correct:  what's happening is that the
> quota-assignment logic will only assign a group the quota it requests.  So
> group "a" was only requesting 3 slots (via "a.u1").  Therefore the negotiator
> assigned it a group quota of 3.  
> 
> In the negotiator loop, "a.u1" was assigned a submitter-limit of 3, which it
> hit when it got the 3 slots it asked for.  So, it emitted the log message
> 'group quota exceeded'  (a misleading log message that really means "I met my
> submitter limit and group-quotas are enabled") and then also tripped the *real*
> group quota check ("Group a is using its quota 3 - halting negotiation").
> 
> So, "a.u1" did hit its submitter limit, and it is correct that it showed up on 
> NegotiationCycleSubmittersShareLimit0.

Well, I understand that the submitter/group limit is set to a lower value than the possible maximum, but I don't think there should be any rejection in the negotiator log. From my point of view, a job should not be rejected while a slot is still available, and there was one.
Consequently, the statistic should be filled only if a rejection happened; at least, that is what I understood after reading Comment 4.

Comment 11 Erik Erlandson 2011-05-11 15:04:03 UTC
(In reply to comment #10)

> From my point of view the job should not be rejected if there
> is still available slot and there was one.
> Consequent the statistic should be filled only if reject happened, at least I
> thought that, after I read Comment 4.

The semantics of how the negotiator loop sets the internal "rejForSubmitterLimit" flag are not very intuitive, and the LastNegotiationCycleSubmittersShareLimit<N> stats are keyed off this flag.

I consider the negotiator statistic to be correct. However, we should consider an RFE to make the semantics of what the negotiator calls a 'rejection' more intuitive (and, closely related, to improve the misleading 'group quota exceeded' message).

Comment 13 Lubos Trilety 2011-05-11 15:59:14 UTC
Tested on:
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

The statistic is working correctly. A new bug, 703905, was created for the
problem with the rejection.

>>> VERIFIED

Comment 14 Misha H. Ali 2011-06-01 00:55:41 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -8,4 +8,8 @@
 Logic was added to include submitter names in the attribute when a submitter hit its limit in the context of a group quota limit.
 
 Result
-The LastNegotiationCycleSubmittersShareLimit<N> attribute now includes submitters hitting submitter limits in all cases, including group quota limits.
+The LastNegotiationCycleSubmittersShareLimit<N> attribute now includes submitters hitting submitter limits in all cases, including group quota limits.
+
+Release Notes Entry:
+
+Previously, the LastNegotiationCycleSubmittersShareLimit<N> negotiator classad stat attribute did not account for a submitter reaching the share limits in a group-quota scenario. The negotiator now includes submitter names in the attribute when any submitter reaches the submitter limit, including group quota limits.

Comment 15 Misha H. Ali 2011-06-06 03:30:30 UTC
Technical note can be viewed in the release notes for 2.0 at the documentation stage here:

http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2.0/html-single/MRG_Release_Notes/index.html#tabl-MRG_Release_Notes-GRID_Update_Notes-RHM_Known_Issues

Comment 16 errata-xmlrpc 2011-06-23 15:38:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

