Bug 619557 - Convert HFS to tree based structure
Summary: Convert HFS to tree based structure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: 1.3.2
Assignee: Erik Erlandson
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-07-29 20:01 UTC by Jon Thomas
Modified: 2011-02-15 13:03 UTC
CC List: 2 users

Fixed In Version: condor-7.4.5-0.2
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-15 13:03:29 UTC
Target Upstream Version:
Embargoed:


Attachments
patch (56.35 KB, application/octet-stream)
2010-07-29 20:01 UTC, Jon Thomas
no flags
patch against condor-7_4_4-0_4_el5 (64.56 KB, patch)
2010-08-02 13:30 UTC, Jon Thomas
no flags
tree based hfs patch against condor-7_4_4-0_16_el5 (66.66 KB, patch)
2010-10-27 19:02 UTC, Jon Thomas
no flags
Functional tests used for reorg (5.22 KB, application/x-gzip)
2010-11-19 21:07 UTC, Erik Erlandson
no flags
Updated functional tests for reorg (6.41 KB, application/x-gzip)
2010-11-24 23:04 UTC, Erik Erlandson
no flags

Description Jon Thomas 2010-07-29 20:01:58 UTC
Created attachment 435385 [details]
patch

Convert HFS to a tree-based structure for better readability and better scaling performance on systems with a large number of groups.

Comment 2 Jon Thomas 2010-08-02 13:30:23 UTC
Created attachment 436022 [details]
patch against condor-7_4_4-0_4_el5

Ported the patch to condor-7_4_4-0_4_el5.

Comment 5 Jon Thomas 2010-10-27 19:02:40 UTC
Created attachment 456050 [details]
tree based hfs patch against condor-7_4_4-0_16_el5

Patch for tree-based HFS, built for 0.16.

This version also changes the structure of the tree. All submitters are now attached to leaves. This removes a bunch of duplicate code and makes the recursive functions quite a bit cleaner.
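
Roughly, the shape of the tree after this change (illustrative names only, not the actual patch code): group nodes form a tree and submitters hang only off leaf nodes, so the recursive passes never need a separate case for submitters attached to internal groups.

#include <memory>
#include <string>
#include <vector>

struct GroupNode {
    std::string name;                                   // e.g. "eng.sub1"
    double quota = 0.0;                                  // configured quota for this group
    std::vector<std::string> submitters;                 // populated only on leaf nodes
    std::vector<std::unique_ptr<GroupNode>> children;    // empty on leaf nodes

    bool is_leaf() const { return children.empty(); }
};

// With submitters attached only to leaves, a recursive pass needs just one
// case split (leaf vs internal) instead of also handling "node-local"
// submitters on internal groups.
double subtree_quota(const GroupNode& g) {
    if (g.is_leaf()) return g.quota;
    double sum = 0.0;
    for (const auto& c : g.children) sum += subtree_quota(*c);
    return sum;
}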

Comment 6 Erik Erlandson 2010-11-19 21:07:01 UTC
Created attachment 461651 [details]
Functional tests used for reorg

I pushed a new branch with tree-based reorg to fedorahosted:
V7_4-BZ619557-HFS-tree-structure

This attachment contains some new functional tests I wrote for the HFS reorg.

This re-org incorporates several RFEs and bug fixes from Jon, including BZ637281, BZ641418, BZ644436, BZ644904 (these are also annotated in the commit).

Comment 7 Jon Thomas 2010-11-19 23:29:56 UTC
fyi: the code in the new branch does not pass the HFS tests that I use. Offhand, I would guess roundoff errors are being compounded. There is some starvation at low-level groups, perhaps due to roundoff, which likely causes proportions to be off when jobs for those groups are killed.

Actually, it looks like what I experienced when I tested the whole-slot approach.

Comment 8 Erik Erlandson 2010-11-20 14:52:41 UTC
> fyi: the code in the new branch does not pass the hfs tests that I use. Off
> hand, I would guess perhaps roundoff errors are being compounded. There is some
> starvation at low-level groups, perhaps due to roundoff, which likely causes
> proportions to be off when jobs for those groups are killed. 

Can you post that test?  I'll understand the issue more clearly if I can run it.

Comment 9 Jon Thomas 2010-11-20 23:43:45 UTC
On my phone at the moment, but it was posted as the nongroupuser test in the original HFS bug. I think part of the problem is due to the use of floor(). The other part is a side effect of handing out whole slots in complex trees. In complex trees, HFS may hand out slots to a group multiple times. In the whole-slot approach, that means the leftovers from the use of floor() are given to the same subgroup multiple times, unless they are sorted, which takes away any perceived performance gains.

I lean toward the use of floating point in HFS because of these types of problems. Translating from fp to ints causes a loss of precision. Using fp for HFS and translating to ints immediately prior to negotiation ensures a group obtains a more precise fair share.
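
As a toy illustration of the precision point (made-up numbers, standalone code rather than anything in condor), flooring each fractional quota early loses slots that a single rounding at the end would preserve:

#include <cmath>
#include <cstdio>

int main() {
    // Ten subgroups whose fair shares each work out to 4.8 slots.
    const int    n        = 10;
    const double fp_quota = 4.8;

    double fp_total  = 0.0;   // carry fp through the calculation
    int    int_total = 0;     // convert to ints at every step instead
    for (int i = 0; i < n; ++i) {
        fp_total  += fp_quota;
        int_total += static_cast<int>(std::floor(fp_quota));
    }

    // Rounding once, at the end, preserves 48 slots; flooring each subgroup
    // up front hands out only 40, and the other 8 are effectively lost.
    std::printf("fp then round: %d slots, floor early: %d slots\n",
                static_cast<int>(std::lround(fp_total)), int_total);
    return 0;
}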

Comment 10 Erik Erlandson 2010-11-21 01:13:37 UTC
> Using fp for hfs and
> translating to ints immediately prior to negotiation ensures a group obtains a
> more precise fairshare.

One property of this HFS incarnation is that it does preserve fractions of quota -- when it does the floor(), any unused fraction is passed up the stack.  In fact, if a subtree can use all surplus at a level, the *only* thing that will be passed up the stack is any remaining fraction, which is a quantity < 1.   (one consequence is that at the root level, any remaining surplus should therefore be an integer)

The "rr_surplus_test" tests a fairly extreme example of that, where a single slot has to get passed around across multiple cycles, since *every* group and subgroup has quota < 1.

Although it's not throwing any fractional quota away, fractional quotas are allocated through a somewhat different path, and I'll do some more verification to make sure it's not misbehaving.
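
A toy sketch of that property, with invented types rather than the branch code: each node keeps floor() of its total (its own quota plus the fractions recovered from below) and hands the fraction, a quantity < 1, up the stack instead of dropping it.

#include <cmath>
#include <vector>

struct Grp {
    double quota = 0.0;       // fair-share quota, possibly fractional
    double allocated = 0.0;   // whole slots this node keeps
    std::vector<Grp*> children;
};

// Returns the fractional remainder passed up to the parent.
double allocate(Grp& g) {
    double recovered = 0.0;
    for (Grp* c : g.children)
        recovered += allocate(*c);
    const double total = g.quota + recovered;
    g.allocated = std::floor(total);    // keep the whole slots at this level
    return total - g.allocated;         // the fraction goes up the stack
}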

Comment 11 Jon Thomas 2010-11-21 14:22:41 UTC
One thing I noticed was that a group with 4.8 slots via the quota calc only got 4 slots. The fractional part was given to an autoregroup group. I think the problem is that over time the group will be starved of almost 20% of its quota.

Comment 12 Erik Erlandson 2010-11-21 16:46:56 UTC
> One thing I noticed was a group with 4.8 slots via quota calc only got 4 slots.
>  The fractional part was given to an autoregroup group. I think the problem is
> that over time the group will be starved of almost 20% of its quota.

It will definitely behave that way -- the "qw" sorting on the round robin was designed to pick up the remainder over time, across cycles. That's a pretty straightforward test I'll have to run on the "nongroupuser" config -- if it doesn't recover the remainder, then I need to fix it.

Comment 13 Erik Erlandson 2010-11-22 14:45:23 UTC
(In reply to comment #11)
> One thing I noticed was a group with 4.8 slots via quota calc only got 4 slots.
>  The fractional part was given to an autoregroup group. I think the problem is
> that over time the group will be starved of almost 20% of its quota.

On second thought, I agree it's just not good semantics for "quota" to share partial slots as surplus prematurely. I'm going to tweak the algorithm to operate analogously to your implementation and defer fractional slot recovery.

Comment 14 Erik Erlandson 2010-11-24 23:04:31 UTC
Created attachment 462771 [details]
Updated functional tests for reorg

Updated and added a few more functional tests.

Comment 15 Erik Erlandson 2010-11-24 23:08:00 UTC
Updated the surplus allocation to manipulate fractional quotas.  Remainders are harvested and reallocated using a round robin prioritized by last RR allocation.

branch is still: V7_4-BZ619557-HFS-tree-structure
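
Roughly, the shape of the new remainder handling (illustrative names only; the branch code differs in detail): fractional remainders are harvested into whole slots and handed out one at a time, preferring groups whose last round-robin award is oldest, so the same groups don't collect the remainder every cycle.

#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

struct RRGroup {
    std::string name;
    double remainder = 0.0;   // fractional quota left over after floor()
    long last_rr = -1;        // negotiation cycle of the last round-robin award
    int extra = 0;            // whole slots picked up from the round robin
};

void round_robin(std::vector<RRGroup>& groups, long cycle) {
    // Harvest the fractional remainders into whole slots.
    double pool = 0.0;
    for (const RRGroup& g : groups)
        pool += g.remainder;
    int slots = static_cast<int>(std::floor(pool));

    // Oldest last_rr first, so remainders rotate across groups over time.
    std::vector<RRGroup*> order;
    for (RRGroup& g : groups)
        order.push_back(&g);
    std::stable_sort(order.begin(), order.end(),
                     [](const RRGroup* a, const RRGroup* b) { return a->last_rr < b->last_rr; });

    for (int i = 0; i < slots && !order.empty(); ++i) {
        RRGroup* g = order[i % static_cast<int>(order.size())];
        g->extra += 1;
        g->last_rr = cycle;
    }
}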

Comment 16 Jon Thomas 2010-12-03 17:49:43 UTC
As far as I can tell, the only difference in the code is how remainders are handled. The terminology is different, but that's a matter of personal preference.

surplus in place of unused
requested in place of maxAllowed
hfs_construct_tree performs insert
hfs_assign_quotas performs maxallowedcalc
hfs_fairshare performs fairshare
hfs_allocate_surplus does the same thing as redistribute
hfs_recover_remainders is analogous to roundoffstage1
hfs_round_robin is analogous to roundoffstage2

The only significant difference in the code is the last stage of roundoff. In both cases, it's a submitter-based FP calculation until roundoffstage1/hfs_recover_remainders. The newer code stores values in the group tree that were previously computed on the fly. The logic isn't different until the last roundoff step. I never used a sort in roundoff because of the potential of sorting multiple times and, in doing so, not really solving the problem the sort is supposed to fix. So there doesn't seem to be any new problem solved here.

In my view, the biggest problems are:
1) a=500, b=1. For every 500 slots "a" gets, "b" should get one regardless of negotiation cycles. My thought here was to make the tree persistent and track "usage" in some way to allow this. 
2) Stale data. We can see this in simple personal condor setups, which doesn't say much for larger, more complex systems. The negotiator can either have stale submitter classAds that cause starvation, or there can be more slots available than the machine classAds indicate.
3) Overlapping pool problem. My solution was to use a weighted RR and to look at what can be done to remove loops from the negotiatewithgroup call chain. The weighted RR feeds nicely into #1 (see the sketch below).
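
A small sketch of the idea behind #1 and #3 (hypothetical names, not proposed code): keep per-group usage across negotiation cycles and always give the next slot to the group furthest below its weighted share, so with weights a=500 and b=1, "b" ends up with roughly one slot for every 500 that "a" receives, regardless of cycle boundaries.

#include <limits>
#include <map>
#include <string>

struct WeightedRR {
    std::map<std::string, double> weight;   // e.g. {"a", 500.0}, {"b", 1.0}
    std::map<std::string, double> usage;    // slots charged so far, kept across cycles

    std::string next() {
        std::string pick;
        double best = std::numeric_limits<double>::max();
        for (const auto& kv : weight) {
            const double ratio = usage[kv.first] / kv.second;   // usage relative to share
            if (ratio < best) { best = ratio; pick = kv.first; }
        }
        usage[pick] += 1.0;   // charge the slot to the chosen group
        return pick;
    }
};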

Comment 17 Erik Erlandson 2010-12-04 05:21:34 UTC
I adapted the core HFS algorithms from yours, with relatively minor logic changes, so it is definitely true that there is a correspondence between various names.

The main semantic tweak to (Simple)GroupEntry member names is that 'quota' and 'allocated' are the same for both leaf and internal groups, which factors out the duplication relating to maxAllowed/nodeMaxAllowed and quota/nodeQuota.

Algorithmic tweaks were:
1) loop convergence/halting conditions on the core surplus allocation loop are cleaned up a bit -- the need for the 'giveaway' mode is avoided
2) surplus allocations are accumulated into a working vector, and recursive calls are made once, after all accumulations are finished
3) fifo-style sorting on the groups for fractional-remainder allocations -- my motivation was to provide an easy-to-document scheme for ensuring that the same groups don't get remainders each time

Other refactoring was consolidating the logic for mapping a submitter to its accounting group into a single function (residing in the accountant). Giving the root "none" group status with the accountant also enabled some refactoring.

Regarding the overlapping-pool problem, I think your weighted-RR works at least as well as other plausible options.  An unweighted-RR might be desirable to customers, depending on their priorities, but we can easily provide that sort of alternative if we get actual customer feedback to that effect.

I've also seen stale data throwing a monkeywrench into HFS (garbage in, garbage out), although I don't see that as an HFS problem per se, so much as a classad-propagation latency problem.

Comment 18 Jon Thomas 2010-12-06 21:17:49 UTC
I wouldn't call stale data an HFS problem either, except that the customer will think it is and label it as one. It does impact HFS, but the biggest impact is throttling overall throughput. I see two distinct areas of stale data:

1) Submitter ads that require no slots hang around in the system for the 5-minute timeout. In current HFS this means the submitter is given slots based upon its configured quota. All these slots will go unused for the 5-minute timeout period.

2) The negotiator gets "claimed" slot ads, but the slots are actually idle. In current HFS, this means none of these slots are used in the cycle.

In a complex system, both of these must reduce overall throughput by some percentage.

I've been experimenting with different HFS strategies; I came across #2 when trying to avoid #1. One thing I looked at, and that may be relatively easy to change in the current setup, is moving the calls to negotiatewithgroup into the main HFS algorithm. This would give us better data with regard to surplus. If we get a rejection, the unused slots can be distributed and the group node pruned. This should help avoid the overlapping pool problem.
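
A rough sketch of that pruning idea (invented types, not actual negotiator code; it assumes rejections are observed at leaf groups, where the submitters live): after a rejection, hand the group's unused slots back to the surplus pool and prune the node so later passes stop offering it slots it cannot use.

#include <memory>
#include <string>
#include <vector>

struct HfsNode {
    std::string name;
    int allocated = 0;      // slots currently assigned to this group
    bool rejected = false;  // set when negotiation for the group is rejected
    std::vector<std::unique_ptr<HfsNode>> children;
};

void prune_rejected(HfsNode& node, int& surplus) {
    auto& kids = node.children;
    for (auto it = kids.begin(); it != kids.end();) {
        prune_rejected(**it, surplus);
        if ((*it)->rejected && (*it)->children.empty()) {
            surplus += (*it)->allocated;   // redistribute the unused slots
            it = kids.erase(it);           // prune the group node
        } else {
            ++it;
        }
    }
}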

---

I'm not sure we want to use <none>. It shows up in condor_userprio. Perhaps "ungrouped users"? Or maybe change condor_userprio to filter out <none>.

Comment 19 Erik Erlandson 2010-12-09 23:19:43 UTC
> One thing I looked at and that may be relatively easy to
> change in the current setup is moving the calls to negotiatewithgroup into the
> main HFS algorithm. This will give us better data with regard to surplus. If we
> get a rejection, the unused slots can be distributed and the group node pruned.
> This should help avoid the overlapping pool problem.

I agree; it seems possible to make more extensive use of information about rejections and rejection reasons.


> 
> I'm not sure we want to use <none>. It shows up in condor_userprio. Perhaps
> "ungrouped users"? or maybe change condor_userprio to filter out  <none>.

Changing the name to something a bit more polished might be appreciated by users. The name doesn't matter to the internals, except that a name which cannot plausibly show up in a user's config is best. "ungrouped users" would fit that, having a space in it.

