Bug 615534

Summary: HFS does not work properly with weighted slots
Product: Red Hat Enterprise MRG
Reporter: Erik Erlandson <eerlands>
Component: condor
Assignee: Erik Erlandson <eerlands>
Status: CLOSED DUPLICATE
QA Contact: Lubos Trilety <ltrilety>
Severity: high
Priority: high
Version: 1.3
CC: eerlands, iboverma, jneedle, ltoscano, ltrilety, matt, mkudlej
Target Milestone: 2.3
Keywords: Reopened
Hardware: All
OS: All
Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Last Closed: 2012-09-04 21:59:44 UTC
Bug Blocks: 850563, 528800
Attachments:
- Small patch that redefines available resources (max-allowed) at root node to be sum of slot weights.
- NegotiatorLog

Description Erik Erlandson 2010-07-16 23:57:53 UTC
Description of problem:
HFS does not play well with weighted slots. For example, when the pool has a number of partitionable slots with cpus=8, HFS will match only about 1/8 of the possible slots. This is because it currently counts available resources in unweighted space, while each slot match costs the full weight (for example, 8.0).
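
For illustration only (this is not Condor code; the pool size and weight below are hypothetical), a minimal sketch of the arithmetic: charging each match its full weight against an unweighted availability count exhausts the budget after about 1/N of the slots.

// Illustrative sketch, not Condor source. Availability is counted in
// unweighted slots while each match is charged its full SlotWeight, so
// matchmaking stops after roughly 1/N of the partitionable slots.
#include <cstdio>

int main() {
    const int numSlots  = 80;        // hypothetical pool: 80 partitionable slots
    const double weight = 8.0;       // each advertises cpus=8, so SlotWeight = 8
    double available    = numSlots;  // the bug: availability counted unweighted
    int matches = 0;

    while (available >= weight) {    // each match costs the full slot weight
        available -= weight;
        ++matches;
    }
    std::printf("matched %d of %d slots (about 1/%d)\n",
                matches, numSlots, static_cast<int>(weight));  // 10 of 80
    return 0;
}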

Initial tests of simply redefining available resources to be the sum of slot weights are promising, but more testing is needed.

I'm going to attach a patch (one-line modification) with available resources redefined in that way, as a starting point for additional testing.


Steps to Reproduce:
1. Define a pool with partitionable slots, where cpus = N.
2. Submit a large batch of jobs.
3. Observe the number of jobs matched per cycle: it will be roughly 1/N of the available slots.
  
Expected results:
HFS should match all available slots (or as close to all as possible), regardless of slot weights.

Comment 1 Erik Erlandson 2010-07-17 00:03:04 UTC
Created attachment 432522 [details]
Small patch that redefines available resources (max-allowed) at root node to be sum of slot weights.

This patch seems to produce the desired behavior in the "homogeneous" case (e.g. all slots are partitionable, with cpus = N).
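
For reference, a minimal sketch of the redefinition (illustrative only, not the attached patch): the root node's "max allowed" becomes the sum of SlotWeight over the slot ads rather than the raw ad count.

// Illustrative sketch of the redefinition, not the attached patch. Sum the
// SlotWeight attribute over the slot ads (falling back to 1.0 when it is
// missing or non-positive) instead of counting the ads themselves.
#include <cstdio>
#include <vector>

struct SlotAd { double slotWeight; };  // stand-in for a startd ClassAd

double rootMaxAllowed(const std::vector<SlotAd>& ads) {
    double total = 0.0;
    for (const SlotAd& ad : ads)
        total += (ad.slotWeight > 0.0) ? ad.slotWeight : 1.0;
    return total;
}

int main() {
    // Hypothetical mixed pool: 10 partitionable cpus=5 slots and
    // 50 static single-cpu slots.
    std::vector<SlotAd> ads;
    for (int i = 0; i < 10; ++i) ads.push_back({5.0});
    for (int i = 0; i < 50; ++i) ads.push_back({1.0});

    std::printf("unweighted count: %zu, weighted max-allowed: %.1f\n",
                ads.size(), rootMaxAllowed(ads));  // 60 vs 100.0
    return 0;
}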

My initial experiments with non-homogeneous slots are showing some starvation, but when I look at negotiator logs, it does not seem to be a problem with insufficient quotas.   It's giving me "insufficient priority", which I do not yet understand.

It may be desirable to run similar tests against empire. My tests were on a single laptop with NUM_CPUS artificially set to 100, which may have been wreaking havoc on my install.

Comment 2 Erik Erlandson 2010-07-17 00:06:00 UTC
Here is my slot config and HFS config for a heterogeneous case (note artificial setting for NUM_CPUS)
#####################
# HFS testing related parameters
#####################
NUM_CPUS=100

SLOT_TYPE_1 = cpus=5
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 10

SLOT_TYPE_2 = cpus=1
SLOT_TYPE_2_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 50

GROUP_NAMES = grp_10a, grp_10b, grp_20a, grp_20b, grp_40a

GROUP_QUOTA_DYNAMIC_grp_10a = 0.1
GROUP_QUOTA_DYNAMIC_grp_10b = 0.1
GROUP_QUOTA_DYNAMIC_grp_20a = 0.2
GROUP_QUOTA_DYNAMIC_grp_20b = 0.2
GROUP_QUOTA_DYNAMIC_grp_40a = 0.4

Comment 3 Erik Erlandson 2010-07-17 00:07:37 UTC
Here is the submit file I was using -- it over-submits (relative to 100 cpus), so I can see how it fills available slots.

universe = vanilla
cmd = /bin/sleep
args = 5m
+AccountingGroup = "grp_10a.eje"
queue 100
+AccountingGroup = "grp_10b.eje"
queue 100
+AccountingGroup = "grp_20a.eje"
queue 100
+AccountingGroup = "grp_20b.eje"
queue 100
+AccountingGroup = "grp_40a.eje"
queue 100

Comment 4 Erik Erlandson 2010-07-17 00:10:41 UTC
I've been using this one-liner to monitor histogram of accounting group usage:

% watch -n 10 'condor_q -run -l | grep AccountingGroup | sort | uniq -c'

Comment 5 Jon Thomas 2010-07-19 14:16:58 UTC
> My initial experiments with non-homogeneous slots are showing some starvation,
> but when I look at negotiator logs, it does not seem to be a problem with
> insufficient quotas.   It's giving me "insufficient priority", which I do not
> yet understand.
> 

I think the "insufficient priority" is related to preemption. My previous tests (1 yr ago or so) showed starvation too. As I recall, this was also evident in simple userprio vs userprio negotiation. So it occurred at the group level and also at the user level.  

At the time, my thought was that the issue was only solvable with a negotiator redesign that allowed the HFS algorithm to see exactly what transpired during negotiation and readjust accordingly. Other than that, perhaps sorting slots by weight and sorting groups by quota could be leveraged.

The problem with expected vs. actual HFS quotas is that even small differences compound over time. Being starved of just a handful of slots each cycle ends up being significant.

I think this needs further testing: a more complicated group scenario and also user-level testing. If the problem still exists at the user level, that may change how we view the problem at the group level.

Comment 6 Jon Thomas 2010-07-19 19:58:01 UTC
I've found that when SLOT_TYPE_1_PARTITIONABLE = TRUE, GetWeightedResourcesUsed returns the same value as GetResourcesUsed. This is the unweighted slot count. I'm not sure of the exact impact of this yet. For the first negotiation cycle, things are fine. In the second cycle, usage doesn't represent weighted usage. The problem here is that the result of the first cycle could be a group with quota=40 that was matched against five machines with 8 cpus or 40 machines with one cpu. The difference in the second cycle is negotiation for 35 more slots or none at all.
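
As an illustration of that arithmetic (not Condor code; only the numbers from the quota=40 example above are used), unweighted usage makes those two first-cycle outcomes look very different in the second cycle:

// Illustrative arithmetic only. A group with quota 40 met its quota in
// cycle 1 either with five 8-cpu machines or with forty 1-cpu machines;
// unweighted usage reporting makes cycle 2 treat the two cases differently.
#include <cstdio>

int main() {
    const double quota = 40.0;

    double weightedUsedA = 5 * 8.0,  unweightedUsedA = 5;   // five 8-cpu machines
    double weightedUsedB = 40 * 1.0, unweightedUsedB = 40;  // forty 1-cpu machines

    // What the second cycle believes is still owed to the group:
    std::printf("case A: weighted remaining %.0f, unweighted remaining %.0f\n",
                quota - weightedUsedA, quota - unweightedUsedA);  // 0 vs 35
    std::printf("case B: weighted remaining %.0f, unweighted remaining %.0f\n",
                quota - weightedUsedB, quota - unweightedUsedB);  // 0 vs 0
    return 0;
}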

This is probably solvable by changing GetWeightedResourcesUsed, but I have to look around to see what else uses this.

There is definitely starvation at the group level, and it's dependent on the pool of nodes and how they are consumed. It's somewhat of a bin-packing problem: it works well when you have a large number of small items, but not so well when you have large items and small bins, or when a whole number of items doesn't fit in the bins. In one of my simple tests, multiple group quotas were starved while a number of 8-cpu machines sat unused.
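
A toy first-fit illustration of that bin-packing effect (the quotas are hypothetical and this is not the negotiator's actual matchmaking logic): a group whose remaining quota is smaller than every remaining slot's weight gets nothing, even though whole machines remain unused.

// Toy first-fit illustration; not the negotiator's matchmaking code.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> slotWeights(5, 8.0);                       // five 8-cpu machines
    std::vector<double> groupQuotas = {4.0, 4.0, 4.0, 4.0, 24.0};  // hypothetical quotas

    for (std::size_t g = 0; g < groupQuotas.size(); ++g) {
        double remaining = groupQuotas[g];
        int taken = 0;
        for (double& w : slotWeights) {
            if (w > 0.0 && w <= remaining) {  // a group can only take a whole slot
                remaining -= w;
                w = 0.0;                      // mark the slot as consumed
                ++taken;
            }
        }
        std::printf("group %zu: quota %.0f, slots taken %d\n",
                    g, groupQuotas[g], taken);
    }
    // The quota-4 groups take nothing; only the quota-24 group can afford an
    // 8-cpu slot, and two whole machines are left unused.
    return 0;
}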

Comment 7 Jon Thomas 2010-07-19 20:02:14 UTC
> negotiation and readjust accordingly.  Other than that, maybe sort slots by
> weight and group sort by quota could be leveraged.
> 

Sorting is not going to be of any use; the matchmaker already fills bins with the largest object that will fit.

Comment 8 Jon Thomas 2010-07-22 14:54:20 UTC
Summary of results:

Slot weights disabled (with or without the patch, with or without partitionable slots) = works

Slot weights enabled, no patch = doesn't work
   -- slow ramp-up and starvation of groups/users who negotiate late in the cycle

Slot weights enabled, with the patch = doesn't work
   -- better ramp-up, but still starvation of groups/users who negotiate late in the cycle

Slot weights enabled, with the patch and with partitionable slots = doesn't work
   -- the partitionable flag causes the accountant to return unweighted usage
   -- unweighted usage breaks HFS in subsequent cycles
   -- this has the same starvation issue


There seem to be two issues here.
 a) Weighted slots and their impact on groups and users who negotiate late in the cycle.
 b) Use of partitionable slots together with weighted slots. This behavior seems completely broken and unrelated to HFS. The flag causes unweighted usage to be used in negotiation, while the negotiator is still matching against weighted slots and limiting the number of slots matched to the sum of the unweighted usage plus all the slot weights being matched. It appears that someone decided the partitionable-slot flag should unset the slot-weight flag in the accountant, but didn't adjust the negotiator code to do the same.

Comment 9 Jon Thomas 2010-07-22 18:22:25 UTC
additional info:

With partitionable slots and weighted slots, numDynGroupSlots grows in each cycle, yet untrimmedSlotWeightTotal remains the same. For example, in a test whose initial cycle has:

numDynGroupSlots 39  untrimmedSlotWeightTotal 200.000000 

One will eventually see things like:

numDynGroupSlots 124  untrimmedSlotWeightTotal 200.000000 


This means that if we use untrimmedSlotWeightTotal, we should never acquire additional slots. Yet we do, because the usage is reported incorrectly. The end result is that we use some of the additional slots, but not as many as we should.

The other question here is why a slot with 8 cpus is being partitioned when we gave away all 8 cpus in the first iteration.

In general, I find that partitionable slots are incompatible with weighted slots. It seems reasonable to have a partitionable slot in the same environment as a weighted slot, but it doesn't make much sense to partition a weighted slot.


I'm also finding some other odd behavior with partitionable slots:

- If you set

SLOT_TYPE_3 = cpus=1
SLOT_TYPE_3_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_3 = 16

you get numDynGroupSlots and untrimmedSlotWeightTotal values that are wrong in the negotiator. The single-cpu slots show up as partitioned in the counts, but then no jobs run on them because they only have a single cpu. Setting the flag to FALSE fixes this problem.

- I'm finding that this config only partitions 3 times. Within the cycle, I see the slots partition a 4th time, and numDynGroupSlots and untrimmedSlotWeightTotal reflect this; however, at negotiation these slots are not available.

SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 23

Comment 10 Jon Thomas 2010-07-23 19:02:31 UTC
Unrelated to this issue, there is a bug in the HFS accountant where string matching causes incorrect stats.

Additionally, there seems to be an issue in AddMatch and RemoveMatch.

What seems to be happening is that with partitionable slots, the accountant removes the original match against the slot and creates a new match against a renamed slot. For example, the original match is slot1@. The accountant deletes this and subtracts the slot weight, time, etc. from the statistics. It then creates a match for slot1_1@ and adds the statistics back. When it does the removal, the slot weight for slot1@ is apparently 0, and when it adds back statistics for slot1_1@, the slot weight is 1. All this despite the slot's weight actually being 8.

This is why the usage value from GetWeightedResourcesUsed in negotiation is wrong.

thoughts.. 

- Having slotweight=1 for slot1_1@ on AddMatch makes sense if the original match was for one cpu, but the original match was for 8.

- Having slotweight=0 on RemoveMatch makes sense if this number represents the slot weight that is remaining, but it's the wrong number to use if we are removing a match that was charged a weight of 8 in the previous cycle.

- A couple of things need to be fixed:
 1) don't partition a slot whose entire weight was claimed in a previous cycle;
 2) fix the usage statistics in the accountant.

Comment 11 Jon Thomas 2010-07-23 21:18:21 UTC
Correction: the slotweight in RemoveMatch is 8, not 0 (there was an I/O issue).

It looks like the generic fix for the partitionable-slot/weighted-slot issue is to not decrement pieLeft and increment usage by the slot weight when a job is matched to a partitionable slot. Because of slot weights, the matchmaker thinks it has satisfied the quota when it has really only matched a much smaller number of slots. This fix would allow negotiation to continue until the quota is reached.

Additionally, for HFS we would likely need to compute up front a total slot count that ignores slot weights for partitionable slots. This would be a number somewhere between numDynGroupSlots and untrimmedSlotWeightTotal, but it would more accurately reflect the number of slots that are available.
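
One possible reading of that proposal, as an illustrative sketch (not negotiator code; the pool below is hypothetical): count a partitionable slot as 1 and a static slot at its weight, which yields a total between numDynGroupSlots and untrimmedSlotWeightTotal.

// Illustrative sketch: count partitionable slots as 1 and static slots at
// their SlotWeight. The result lies between numDynGroupSlots (every ad
// counted as 1) and untrimmedSlotWeightTotal (every ad at full weight).
#include <cstdio>
#include <vector>

struct SlotAd {
    double slotWeight;
    bool   partitionable;
};

int main() {
    // Hypothetical pool: 5 partitionable cpus=8 slots, 10 static cpus=2 slots.
    std::vector<SlotAd> ads;
    for (int i = 0; i < 5; ++i)  ads.push_back({8.0, true});
    for (int i = 0; i < 10; ++i) ads.push_back({2.0, false});

    double weightedTotal = 0.0;   // analogue of untrimmedSlotWeightTotal
    double effectiveTotal = 0.0;  // the proposed count
    for (const SlotAd& ad : ads) {
        weightedTotal  += ad.slotWeight;
        effectiveTotal += ad.partitionable ? 1.0 : ad.slotWeight;
    }
    std::printf("ads: %zu  weighted total: %.1f  effective total: %.1f\n",
                ads.size(), weightedTotal, effectiveTotal);  // 15, 60.0, 25.0
    return 0;
}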

Currently, the code doesn't allow a partitionable slot to be used in the same way as a weighted slot. Setting the partitionable flag turns a weighted slot into a partitionable slot.  The above fixes wouldn't change this, but would fix up the matchmaking, negotiation, and accounting problems.

Re: HFS/user starvation when most slots have large slot weights: the fix is to make these slots partitionable.

A more complicated fix would be to allow a slot to be partitioned when the matchmaker cannot find a match. But this implies slots are partitionable by default.

Comment 12 Jon Thomas 2010-08-02 18:23:13 UTC
The following patch fixes up the usage stats by checking whether the slot is partitionable; the slot weight should be 1 for a partitionable slot.

The patch doesn't fix the starvation problem. 



diff -rNup condor-7.4.2.orig/src/condor_negotiator.V6/Accountant.cpp condor-7.4.2/src/condor_negotiator.V6/Accountant.cpp
--- condor-7.4.2.orig/src/condor_negotiator.V6/Accountant.cpp	2010-07-30 08:52:34.000000000 -0400
+++ condor-7.4.2/src/condor_negotiator.V6/Accountant.cpp	2010-08-02 13:24:38.000000000 -0400
@@ -1505,13 +1505,19 @@ float Accountant::GetSlotWeight(ClassAd 
 	if(!UseSlotWeights) {
 		return SlotWeight;
 	}
-
-	if(candidate->EvalFloat(SlotWeightAttr, NULL, SlotWeight) == 0 || SlotWeight<0) {
-		MyString candidateName;
-		candidate->LookupString(ATTR_NAME, candidateName);
-		dprintf(D_FULLDEBUG, "Can't get SlotWeight for '%s'; using 1.0\n", 
+	bool is_partitionable=false;
+	
+	candidate->LookupBool(ATTR_SLOT_PARTITIONABLE, is_partitionable);
+	dprintf(D_FULLDEBUG, "is_partitionable %d \n", 
+			 (int) is_partitionable);
+	if ( !is_partitionable) {
+		if(candidate->EvalFloat(SlotWeightAttr, NULL, SlotWeight) == 0 || SlotWeight<0) {
+			MyString candidateName;
+			candidate->LookupString(ATTR_NAME, candidateName);
+			dprintf(D_FULLDEBUG, "Can't get SlotWeight for '%s'; using 1.0\n", 
 				candidateName.Value());
-		SlotWeight = 1.0;
+			SlotWeight = 1.0;
+		}
 	}
 	return SlotWeight;
 }

Comment 13 Jon Thomas 2010-08-02 18:37:23 UTC
Errr, I should say it doesn't fix the "other" starvation problem. One starvation problem is due to the usage being off; this patch corrects that.

The other starvation problem is due to not being able to break up a weighted slot into smaller weighted slots. 

Neither starvation problem is specific to HFS. Regular userprio (RUP-based) negotiation uses the same usage functions. As with group quotas, no user with an RUP-derived quota < 8 will get a slot with slotweight >= 8; that user's jobs won't run in this environment.

The matrix is:

-no weighted slots, no partitionable slots: works
-no weighted slots, with partitionable slots: works
-weighted slots: starvation dependent upon quota calculation and pool of resources

Comment 14 Matthew Farrellee 2010-08-03 01:21:50 UTC
Building NEGOTIATOR_USE_SLOT_WEIGHTS = False into 7.4.4-0.8

Comment 15 Jon Thomas 2010-12-02 21:25:20 UTC
This should probably be reassigned to somebody more familiar with the code going into 2.0.

Comment 16 Erik Erlandson 2011-03-03 19:05:12 UTC
Weighted slots were enabled with HFS (HGQ) as part of the upstream inclusion process.

For Grid 2.0, what I expect is that it will pass basic smoke tests:
(a) HFS will not crash when weighted slots are enabled, and
(b) it can use weighted slots while continuing to obey its accounting group limits. So, for example, if a group's quota is 10, it might allocate up to 5 slots of weight 2, or 4 slots of weight 1 plus 3 slots of weight 2, etc.

(It has passed these basic smoke tests on my personal sandbox)
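
As a worked check of the quota arithmetic in (b), a minimal sketch (illustrative only, not the negotiator's accounting code): a match is accepted only while the weighted usage stays within the group quota.

// Minimal sketch of smoke-test (b): weighted usage must stay within the
// group quota. Illustrative only.
#include <cstdio>
#include <vector>

int main() {
    const double quota = 10.0;
    std::vector<double> offered = {2.0, 2.0, 2.0, 2.0, 2.0, 2.0};  // candidate SlotWeights

    double used = 0.0;
    int accepted = 0;
    for (double w : offered) {
        if (used + w <= quota) {   // obey the group limit in weighted terms
            used += w;
            ++accepted;
        }
    }
    std::printf("accepted %d slots, weighted usage %.1f of quota %.1f\n",
                accepted, used, quota);  // accepted 5 slots, usage 10.0
    return 0;
}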

At this time we do not guarantee any attempt at "optimal packing" of slots: using weighted slots may result in sub-optimal or poor slot usage.

Comment 17 Lubos Trilety 2011-04-18 13:49:42 UTC
Created attachment 492897 [details]
NegotiatorLog

When I use the following configuration, group grp_20a gets 5 slots of weight 5 (group grp_20a has a quota of 20).

>>> ASSIGNED

configuration:
NUM_CPUS=100

SLOT_TYPE_1 = cpus=5
SLOT_TYPE_1_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_1 = 10

SLOT_TYPE_2 = cpus=3
SLOT_TYPE_2_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 15

GROUP_NAMES = grp_10a, grp_10b, grp_20a, grp_20b, grp_40a

GROUP_QUOTA_DYNAMIC_grp_10a = 0.1
GROUP_QUOTA_DYNAMIC_grp_10b = 0.1
GROUP_QUOTA_DYNAMIC_grp_20a = 0.2
GROUP_QUOTA_DYNAMIC_grp_20b = 0.2
GROUP_QUOTA_DYNAMIC_grp_40a = 0.4


submit file:
universe = vanilla
cmd = /bin/sleep
args = 5m
+AccountingGroup = "grp_10a.eje"
queue 100
+AccountingGroup = "grp_10b.eje"
queue 100
+AccountingGroup = "grp_20a.eje"
queue 100
+AccountingGroup = "grp_20b.eje"
queue 100
+AccountingGroup = "grp_40a.eje"
queue 100

Scenario:
1) Submit the file.

2) Check the cpus used per group:
# condor_q -run -l -constraint 'AccountingGroup == "grp_10a.eje"' | grep MachineAttrCpus
MachineAttrCpus0 = 5
MachineAttrCpus0 = 5
# condor_q -run -l -constraint 'AccountingGroup == "grp_10b.eje"' | grep MachineAttrCpus
MachineAttrCpus0 = 5
MachineAttrCpus0 = 5
# condor_q -run -l -constraint 'AccountingGroup == "grp_20a.eje"' | grep MachineAttrCpus
MachineAttrCpus0 = 5
MachineAttrCpus0 = 5
MachineAttrCpus0 = 5
MachineAttrCpus0 = 5
MachineAttrCpus0 = 5
# condor_q -run -l -constraint 'AccountingGroup == "grp_20b.eje"' | grep MachineAttrCpus
MachineAttrCpus0 = 5
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
# condor_q -run -l -constraint 'AccountingGroup == "grp_40a.eje"' | grep MachineAttrCpus
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3
MachineAttrCpus0 = 3

Group grp_20a has too many cpus in use (5 slots of weight 5 = 25 against a quota of 20).

Additional information:
see negotiator log in attachment

# condor_q -run -l -constraint 'AccountingGroup == "grp_20a.eje"' | grep slot
RemoteHost = "slot5@hostname"
RemoteHost = "slot6@hostname"
RemoteHost = "slot7@hostname"
RemoteHost = "slot8@hostname"
RemoteHost = "slot9@hostname"

# condor_status -l -startd -constraint 'Cpus == 5' | grep slot
Name = "slot10@hostname"
Name = "slot1@hostname"
Name = "slot2@hostname"
Name = "slot3@hostname"
Name = "slot4@hostname"
Name = "slot5@hostname"
Name = "slot6@hostname"
Name = "slot7@hostname"
Name = "slot8@hostname"
Name = "slot9@hostname"
# condor_status -l -startd -constraint 'Cpus == 3' | grep slot
Name = "slot11@hostname"
Name = "slot12@hostname"
Name = "slot13@hostname"
Name = "slot14@hostname"
Name = "slot15@hostname"
Name = "slot16@hostname"
Name = "slot17@hostname"
Name = "slot18@hostname"
Name = "slot19@hostname"
Name = "slot20@hostname"
Name = "slot21@hostname"
Name = "slot22@hostname"
Name = "slot23@hostname"
Name = "slot24@hostname"
Name = "slot25@hostname"

Comment 18 Erik Erlandson 2011-04-18 16:31:03 UTC
> When I use following configuration, it gets 5 slots of weight 5 to group
> grp_20a (group grp_20a has quota 20)

The negotiator log indicates that slot weights are disabled:

04/18/11 09:47:44 group quotas: assigning group quotas from 25 available slots

If slot weights were enabled, it would say "... xxx available weighted slots." I didn't see it in your config, but it looks like you have NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE set somewhere.

Comment 19 Lubos Trilety 2011-04-19 07:22:18 UTC
(In reply to comment #18)
> > When I use following configuration, it gets 5 slots of weight 5 to group
> > grp_20a (group grp_20a has quota 20)
> 
> The negotiator log indicates that slot weights are disabled:
> 
> 04/18/11 09:47:44 group quotas: assigning group quotas from 25 available slots
> 
> If slot weights are enabled, it would say "... xxx available weighted slots." 
> I didn't see it in your config, but it looks like somewhere you have
> NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE

I didn't set that; it seems to be the default setting. I found it in /etc/condor/condor_config, which was not changed.

One other note: I found this sentence in the Grid User Guide:
When accounting groups are in effect, slot weights must be disabled. It is a requirement to set
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE.
It should probably be removed from the manual; otherwise this bug doesn't make much sense.

Comment 20 Erik Erlandson 2011-04-19 14:55:45 UTC
> I didn't set that, it seems to me that it's default setting. I found that in
> /etc/condor/condor_config, which was not changed.

Strange, the default in the code is 'true'. If it isn't set to FALSE somewhere, I don't see how the behavior would be disabled.

> 
> One other notice I found this sentence in Grid User Guide:
> When accounting groups are in effect, slot weights must be disabled. It is a
> requirement to set
> NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE.
> Probably it should be removed from the manual otherwise this bug doesn't make
> much sense.

This discrepancy arose from a difference of opinion with upstream policy: upstream wanted to allow users to use weighted slots, while we felt the behavior wasn't sufficiently well understood. We may want to leave our "official" documentation as-is; however, I will bring it up with Matt.

Comment 21 Erik Erlandson 2011-04-25 17:53:47 UTC
Kicking this to 2.1 -- weighted slots are supported, but we need to spend a bit of time sorting out what our Grid policy on weighted slots is going to be, without rushing it.

For 2.0, we can leave our current policy of requiring 'NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE' in place.