Bug 519183 - Matchmaker code doesn't implement fair share correctly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Assignee: Jon Thomas
QA Contact: Lubos Trilety
URL:
Whiteboard:
Duplicates: 523495
Depends On:
Blocks:
 
Reported: 2009-08-25 14:50 UTC by Jon Thomas
Modified: 2018-10-20 04:22 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when using group quotas and the 'autoregroup' function, the group quotas did not grow relative to their initial proportions. This was because all users (including group users) were negotiated at the same time using prio normalization based on all user prios. With this update, group quotas scale into unused slots appropriately.
Clone Of:
Environment:
Last Closed: 2010-10-14 16:12:14 UTC


Attachments
group hfs patch (13.57 KB, patch)
2009-09-15 18:08 UTC, Jon Thomas
no flags
new patch (14.19 KB, patch)
2009-09-18 21:33 UTC, Jon Thomas
no flags
fixed a small problem created when variable names were changed (14.21 KB, patch)
2009-10-05 15:46 UTC, Jon Thomas
no flags


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 15:56:44 UTC

Description Jon Thomas 2009-08-25 14:50:49 UTC
Description of problem:

Groups get appropriate quota and users get the correct number of slots within those quotas. However, if there are unused slots and autoregroup is specified, the group quotas (and hence usage) do not grow relative to their initial proportions. This is because in the second stage of negotiation all users, including those group users, are negotiated at the same time using prio normalization based upon all user prios. 

  
Actual results:

Observed usage ranges from close to that specified by the group quotas to a flat, prio-based user-level distribution.

Expected results:

Group quotas should scale into unused slots.

Additional info:

I have a patch for this which I will attach soon.

Comment 1 Jon Thomas 2009-09-03 19:01:29 UTC
posted upstream too 
http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=712

Comment 2 Jon Thomas 2009-09-15 18:08:47 UTC
Created attachment 361117 [details]
group hfs patch

This patch fixes the issue where the quota proportions are not carried over when using AUTOREGROUP. 

Previously, AUTOREGROUP just meant that groups were scheduled against their quota and then had an additional chance to use an unused slot relative to the userprios of all users. This meant that a group with a smaller quota than another group might see higher usage than the other group.

Comment 3 Jon Thomas 2009-09-17 16:43:06 UTC
notes on negotiation behavior:

1) If no groups, negotiation occurs based upon userprios

2) Jobs from users with no accounting group or an unknown accounting group will only execute if the config provides unclaimed quota. This means the total group quota is less than the total number of slots, or the dynamic group quotas sum to less than 1.0.

example where jobs do not run:

GROUP_QUOTA_DYNAMIC_group_a = 0.50
GROUP_QUOTA_DYNAMIC_group_b = 0.30
GROUP_QUOTA_DYNAMIC_group_c = 0.20

example where jobs do run:

GROUP_QUOTA_DYNAMIC_group_a = 0.50
GROUP_QUOTA_DYNAMIC_group_b = 0.30
GROUP_QUOTA_DYNAMIC_group_c = 0.10

In this example, 10% of the slots are not designated as belonging to any group and are thus unclaimed.

3) Groups that have claimed quota but not enough submitters to fill it return the unusable portion to the unclaimed quota pool. For example, a group with a 50-slot quota and only 25 submitted jobs will return 25 slots to the unclaimed quota.

4) Groups with autoregroup set to TRUE and enough submitters to use more quota claim unclaimed quota based upon their designated quota as a percent of total slots. Hence, a group with

GROUP_QUOTA_DYNAMIC_group_a = 0.50

will consume 50% of unclaimed quota.

5) In the case where there are non-group users and unclaimed quota, non-group users claim unclaimed quota based upon unclaimed quota as a percent of total slots. Hence, in this scenario:

GROUP_QUOTA_DYNAMIC_group_a = 0.50
GROUP_QUOTA_DYNAMIC_group_b = 0.30
GROUP_QUOTA_DYNAMIC_group_c = 0.10

non-group users will claim 10% of unclaimed slots.


6) The claiming of unclaimed quota is iterative, so all unclaimed quota will be claimed if:

a) there are non-group users with enough job submissions to use more quota, or
b) groups with job submissions greater than their claimed quota have autoregroup set to TRUE.

7) If a group has no quota, users of that group become part of the pool of non-group users.
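The iterative redistribution described in these notes can be sketched as a simplified model. This is illustrative only; `distribute` and its parameters are hypothetical names, not the actual negotiator code or the attached patch:

```python
def distribute(total_slots, groups, nongroup_jobs, rounds=50):
    """Toy model of group-quota negotiation with AUTOREGROUP.

    groups: {name: (quota_fraction, jobs_submitted, autoregroup)}
    Returns {name: allocated_slots}, with "<none>" for non-group users.
    """
    alloc = {name: 0.0 for name in groups}
    alloc["<none>"] = 0.0

    # Stage 1: each group gets min(quota, demand).  Quota a group cannot
    # use is returned to the unclaimed pool; slots never designated to
    # any group are unclaimed from the start.
    designated = 0.0
    unclaimed = 0.0
    for name, (frac, jobs, _auto) in groups.items():
        quota = frac * total_slots
        designated += quota
        alloc[name] = min(quota, jobs)
        unclaimed += quota - alloc[name]
    unclaimed += total_slots - designated
    nongroup_frac = max(0.0, 1.0 - designated / total_slots)

    # Stage 2: iteratively hand out the unclaimed pool.  An autoregroup
    # group with unmet demand claims (its quota fraction) x pool;
    # non-group users claim (the undesignated fraction) x pool.
    for _ in range(rounds):
        pool = unclaimed
        if pool < 1e-9:
            break
        claimed = 0.0
        for name, (frac, jobs, auto) in groups.items():
            if auto and jobs > alloc[name]:
                grant = min(frac * pool, jobs - alloc[name])
                alloc[name] += grant
                claimed += grant
        if nongroup_jobs > alloc["<none>"]:
            grant = min(nongroup_frac * pool, nongroup_jobs - alloc["<none>"])
            alloc["<none>"] += grant
            claimed += grant
        if claimed < 1e-9:
            break
        unclaimed -= claimed
    return alloc
```

With quotas of 0.5/0.3/0.1, group_a demanding 100 jobs with autoregroup on, and non-group users demanding 100 jobs, group_a ends up claiming unclaimed slots in proportion to its 50% designated quota rather than splitting them flat per user.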

Comment 4 Jon Thomas 2009-09-18 21:33:36 UTC
Created attachment 361719 [details]
new patch

new patch that preserves old behavior for static configs

Comment 5 Jon Thomas 2009-10-05 15:46:25 UTC
Created attachment 363702 [details]
fixed a small problem created when variable names were changed

The previous patch had a small issue introduced when I changed variable names and streamlined the number-of-jobs calculation. Behavior would have been slightly different when (non-group user job count > 0) && (non-group user job count < non-group user quota), in that the unused quota would not be added to the unclaimed pool.

Comment 6 Matthew Farrellee 2009-10-05 18:32:40 UTC
*** Bug 523495 has been marked as a duplicate of this bug. ***

Comment 8 Matthew Farrellee 2009-10-26 00:48:54 UTC
Built since 7.4.0-0.5.

Comment 10 Lubos Trilety 2010-09-14 09:05:54 UTC
Configuration:
NUM_CPUS = 100
GROUP_NAMES = group0, group1
GROUP_QUOTA_DYNAMIC_group0 = 0.9
GROUP_QUOTA_DYNAMIC_group1 = 0.09
GROUP_AUTOREGROUP_group0 = FALSE
GROUP_AUTOREGROUP_group1 = TRUE

Reproduction scenario:
1. stop negotiator
# condor_off -subsystem negotiator
Sent "Kill-Daemon" command for "negotiator" to local master

2. submit two jobs
$ cat group1.submit
cmd=/bin/sleep
args=10
+AccountingGroup="group1"
queue 100
$ condor_submit group1.submit
Submitting job(s)...
100 job(s) submitted to cluster 1.
$ cat nogroup.submit
cmd=/bin/sleep
args=10
queue 100
$ condor_submit nogroup.submit
Submitting job(s)...
100 job(s) submitted to cluster 2.

3. wait a minute and start negotiator
# condor_on -subsystem negotiator
Sent "Spawn-Daemon" command for "negotiator" to local master

4. see used resources
# condor_userprio -l
LastUpdate = 1284454125
Name1 = "condor_user@hostname"
Priority1 = 0.500000
ResourcesUsed1 = 50
WeightedResourcesUsed1 = 50.000000
AccumulatedUsage1 = 0.000000
WeightedAccumulatedUsage1 = 0.000000
BeginUsageTime1 = 0
LastUsageTime1 = 0
PriorityFactor1 = 1.000000
Name2 = "group1@hostname"
Priority2 = 0.500000
ResourcesUsed2 = 50
WeightedResourcesUsed2 = 50.000000
AccumulatedUsage2 = 0.000000
WeightedAccumulatedUsage2 = 0.000000
BeginUsageTime2 = 0
LastUsageTime2 = 0
PriorityFactor2 = 1.000000
NumSubmittors = 2


As I understood from the previous comments, group1 should obtain 9% of the unused slots in the first round, while the non-group user should get only 1%; iterating, group1 should end up with 90% of the unused slots and the non-group user with only 10%. But in this scenario both of them obtain an equal number of slots. The same behaviour can be observed with more than two groups; in all cases the slots are distributed equally among all groups.
Is this expected? If not, please move this bug to ASSIGNED.

Comment 11 Jon Thomas 2010-09-14 11:43:56 UTC
Hi

The AccountingGroup for group1 is not specified correctly. The correct format is group[.subgroup].username

"group1" means you are submitting to the non-group pool because there is no ".". In this case, 50% went to one non-group user and 50% went to the other. That is expected with AccountingGroup specified as it is.
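Under the group[.subgroup].username format, a submit file that actually lands in group1 might look like this (the username portion is illustrative):

```
cmd = /bin/sleep
args = 10
+AccountingGroup = "group1.condor_user"
queue 100
```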

Comment 12 Lubos Trilety 2010-09-17 10:48:49 UTC
Tested with (version):
condor-7.4.4-0.9

Tested on:
RHEL4 i386,x86_64  - passed
RHEL5 i386,x86_64  - passed

>>> VERIFIED

Comment 13 Martin Prpič 2010-10-06 13:29:03 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, when using group quotas and the 'autoregroup' function, the group quotas did not grow relative to their initial proportions. This was because all users (including group users) were negotiated at the same time using prio normalization based on all user prios. With this update, group quotas scale into unused slots appropriately.

Comment 15 errata-xmlrpc 2010-10-14 16:12:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

