Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 619552

Summary: negotiator hfs incorrect remaining and infinite loop
Product: Red Hat Enterprise MRG
Component: condor
Version: 1.2
Target Milestone: 1.3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Jon Thomas <jthomas>
Assignee: Erik Erlandson <eerlands>
QA Contact: Lubos Trilety <ltrilety>
CC: fnadge, ltrilety, matt
Doc Type: Bug Fix
Doc Text:
Previously, the HFS algorithm was caught in an infinite loop in some situations due to a floating point precision problem. This update corrects the problems in the HFS algorithm and the issue is resolved.
Last Closed: 2010-10-14 16:06:59 UTC
Bug Blocks: 528800    
Attachments:
  patch
  reorg of fairshare() for safe loop halting conditions
  updated reorg of fairshare() with halting conditions tweaked to preserve HFS resource sharing
  Fixed typo in 'niter' check from previous patch
  Fixed yet another typo in 'niter' halting check.
  patch
  This is the test case that fails when I add in simple loop counter to current code.
  patch for the loop, roundoff, and fractional slot problem
  candidate patch based partly on jrt's patch (#436912)
  Patch that correctly handles logic of "give-away" mode.

Description Jon Thomas 2010-07-29 19:45:50 UTC
The first part:

In some situations, the HFS algorithm is caught in an infinite loop. This is due to a floating point precision problem where the number of remaining slots is a very small positive number when it should be zero. Remaining slots is calculated with:

remaining=remaining-totalgiven;

totalgiven is a sum of floating point values that should equal remaining. In some cases, floating point precision causes totalgiven to be slightly smaller than remaining. The problem also affects systems that use doubles.

It means we can end up in an infinite loop here:

while (gavesome&&remaining>0) {

The fix is to catch this condition by setting our own level of precision:

while (gavesome&&remaining>0.000001){
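
For illustration only (a standalone sketch, not condor code), here is how shares that should exactly consume remaining can come up a hair short, and why an epsilon-tolerant test halts where an exact test would not:

#include <cstdio>

int main() {
    // Hypothetical pool: one unit of slots split across ten children with 10% quota each.
    double remaining  = 1.0;
    double totalgiven = 0.0;
    for (int i = 0; i < 10; ++i)
        totalgiven += remaining * 0.1;             // each child's share is 0.1

    double residue = remaining - totalgiven;        // "should" be exactly zero
    printf("residue            = %.17g\n", residue);           // ~1.1e-16
    printf("residue > 0        -> %d\n", residue > 0);          // 1: exact test keeps looping
    printf("residue > 0.000001 -> %d\n", residue > 0.000001);   // 0: epsilon test halts
    return 0;
}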


The second part:

This is a bug where remaining is incorrectly set to 0. It changes the totalgiven calculation, which causes a small fraction (0.5 or less) of a slot to be discarded. The slot is later reclaimed by the roundoff stages, but it should be claimed in the fairshare stage.

if (remaining<0.05) {
    childshare=remaining;
    remaining=0;    <-- bug
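
A hedged sketch of the corrected handling (hypothetical variable and function names, not the actual condor patch): when the last sliver of remaining is offered to a child, deduct only what the child can actually use instead of forcing remaining to zero, so no fraction of a slot is silently discarded.

#include <algorithm>

static double giveFinalShare(double &remaining, double canonlyuse) {
    double childshare = remaining;                   // offer the child everything left
    double used = std::min(childshare, canonlyuse);  // the child may be limited by its jobs
    remaining -= used;                               // was: remaining = 0  <-- the bug
    return used;                                     // anything unused stays in 'remaining'
}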


I have this fixed in a new version of the hfs patch, but I'll have to backport it to the old version.

Comment 1 Jon Thomas 2010-07-30 18:05:10 UTC
Created attachment 435617 [details]
patch

patch to fix the precision/roundoff problem

Comment 2 Jon Thomas 2010-08-03 16:57:47 UTC
reproducer:

<jrt> It's 2 slots
<jrt> GROUP_NAMES = a1, b1, a1.a2, a1.a1, a1.a3 b1.b2, b1.b1
<jrt> GROUP_QUOTA_DYNAMIC_a1 = .5
<jrt> GROUP_QUOTA_DYNAMIC_b1 = .4999999
<jrt> GROUP_QUOTA_DYNAMIC_b1.b1 = .4
<jrt> GROUP_QUOTA_DYNAMIC_b1.b2 = .6
<jrt> GROUP_QUOTA_DYNAMIC_a1.a1 = 1
<jrt> and the job is 
<jrt> executable = ./sleep.jrt
<jrt> arguments = 6000
<jrt> universe = vanilla
<jrt> #transfer_executable = true
<jrt> should_transfer_files = YES
<jrt> when_to_transfer_output = ON_EXIT
<jrt> #output = a.out.$(cluster).$(process)
<jrt> #log = a.log.$(cluster).$(process)
<jrt> #+AccountingGroup = "jrt"
<jrt> queue 80

Comment 3 Jon Thomas 2010-08-03 17:28:16 UTC
Comment on attachment 435617 [details]
patch

Marking this as obsolete as it is not the correct patch for this particular problem.

The real issue is a very small childshare leading to a very small decrement of remaining. It's not literally an infinite loop, but it would appear as one to the system.
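
For illustration only (a bounded standalone simulation with numbers loosely modeled on the reproducers, not condor code), this shows how a near-zero quota sum makes each decrement of remaining so small that the loop appears infinite:

#include <cstdio>

int main() {
    double remaining  = 2.0;     // two slots to give away
    double totalquota = 2e-8;    // submitting groups' quotas sum to almost nothing
    long   iterations = 0;

    // Cap the loop so this demo terminates; the real loop had no such cap.
    while (remaining > 0 && iterations < 100000000L) {
        double totalgiven = remaining * totalquota;   // tiny amount handed out per pass
        remaining -= totalgiven;
        ++iterations;
    }
    // After 10^8 iterations roughly 0.27 of a slot is still left: not literally
    // an infinite loop, but the negotiator would appear hung.
    printf("after %ld iterations, remaining = %g\n", iterations, remaining);
    return 0;
}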

Comment 4 Erik Erlandson 2010-08-03 19:07:22 UTC
A minimal reproducer:

HFS config:

GROUP_NAMES = grp_a
GROUP_QUOTA_DYNAMIC_grp_a = 0.9999999


Submit (a single non-group job):

% echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=1m\nqueue\n" | condor_submit

Comment 5 Jon Thomas 2010-08-03 19:55:34 UTC
Here's another reproducer of the same problem, except at the group level. The same problem occurs in two locations (same math), which is all the more reason for the "not getting work done" small-totalgiven fix. In addition to the fix for this problem, I think we may want to consider scrubbing the config.

To reproduce, just run a job as b1.b1:

GROUP_NAMES = a1, b1, a1.a2, a1.a1, a1.a3, a1.a3.a1, a1.a3.a2, a1.a3.a3, b1.b2, b1.b1
GROUP_QUOTA_DYNAMIC_a1 = .4

GROUP_QUOTA_DYNAMIC_b1 = .4
GROUP_QUOTA_DYNAMIC_b1.b1 = .00000001
GROUP_QUOTA_DYNAMIC_b1.b2 = .00000001
NUM_CPUS=2
GROUP_AUTOREGROUP_b1 = TRUE
GROUP_AUTOREGROUP_b1.b2 = TRUE
GROUP_AUTOREGROUP_b1.b1 = TRUE

Comment 6 Erik Erlandson 2010-08-03 19:57:07 UTC
Created attachment 436370 [details]
reorg of fairshare() for safe loop halting conditions

Comment 7 Jon Thomas 2010-08-04 14:44:30 UTC
Unfortunately, the patch breaks hfs.

I ran the nongroupuser test case and saw two problems:

1) missing slots

2) Unused quota isn't staying at the subgroup level. Quota that should be used by peer groups is getting used at the upper levels.

Here's some hard-to-read data. The slots consumed by a1.a3.a2 and a1.a3.a3 should be used by a1.a3 and a1.a3.a1. With the patch, the unused slots are used by non-group users and the b1.x groups.

no patch, good:

 jrt a1.a1 a1.a2 a1.a3   b1.b1 b1.b2   a1.a3.a1 a1.a3.a2   a1.a3.a3  qa1.a1 qa1.a2 qa1.a3 qb1.b1 qb1.b2 qa1.a3.a1 qa1.a3.a2 qa1.a3.a3
   20    8    8   24/4.8    16    24      9.6       4.8       4.8 
 -------------------------------------------------------------------------
   20  -  0  -  0  -  14  -  16  -  24  -  26  -  0  -  0       -80 0 0 100 100 100 100 0 0
   20  -  0  -  0  -  14  -  16  -  24  -  26  -  0  -  0       -80 0 0 100 100 100 100 0 0
   20  -  0  -  0  -  14  -  16  -  24  -  26  -  0  -  0       -80 0 0 100 100 100 100 0 0

with patch, Bad:

  jrt a1.a1 a1.a2 a1.a3   b1.b1 b1.b2   a1.a3.a1 a1.a3.a2   a1.a3.a3  qa1.a1 qa1.a2 qa1.a3 qb1.b1 qb1.b2 qa1.a3.a1 qa1.a3.a2 qa1.a3.a3
   20    8    8   24/4.8    16    24      9.6       4.8       4.8 
 -------------------------------------------------------------------------
   26  -  0  -  0  -  10  -  16  -  28  -  19  -  0  -  0       -80 0 0 100 100 100 100 0 0
   26  -  0  -  0  -  10  -  16  -  28  -  19  -  0  -  0       -80 0 0 100 100 100 100 0 0
   26  -  0  -  0  -  10  -  16  -  29  -  19  -  0  -  0       -80 0 0 100 100 100 100 0 0
   26  -  0  -  0  -  10  -  16  -  29  -  19  -  0  -  0       -80 0 0 100 100 100 100 0 0


This is with

 if (remaining < 0.1) break;

As discussed, 0.1 is a bit large. I can test with a smaller value.

Comment 8 Jon Thomas 2010-08-04 15:27:40 UTC
I think the problem with the patch wrt hfs is that it removed the parts that consumed all of the fractional slot a node was given and subsequently returned 0 for the amount that wasn't used. This means that groupArray[index].unused doesn't equal zero in some cases.

groupArray[index].unused is used as a flag that prunes the tree. Unused>0 signifies that the subtree has unused slots and therefore can't use more. 

In any case, I'm more inclined to keep the parts that consume the fractional slots, based on the logic that the algorithm is trying to give out a computed share. Either the subtree can use the entire computed share or it can't, depending on its number of jobs.

I think a fix for this is to check totalgiven for a small value, perhaps something like:

if (totalgiven<0.001&&was_an_autoregroup==true) {
 round_robin_whole_slots;
 give_away_fraction;
 remaining=0;
}

In the repro, we are trying to give away 2 slots. If we bail out because the math hands out essentially none of them, then we need an alternate method to distribute the slots.

Comment 9 Erik Erlandson 2010-08-04 20:26:43 UTC
Created attachment 436656 [details]
updated reorg of fairshare() with halting conditions tweaked to preserve HFS resource sharing

Comment 10 Erik Erlandson 2010-08-04 20:38:08 UTC
Created attachment 436662 [details]
Fixed typo in 'niter' check from previous patch

Comment 11 Erik Erlandson 2010-08-04 20:44:22 UTC
Created attachment 436663 [details]
Fixed yet another typo in 'niter' halting check.

Comment 12 Erik Erlandson 2010-08-04 23:45:44 UTC
The latest patch still isn't behaving correctly in all cases. I don't think it returns a remaining value of exactly zero in all the same cases the current version does. Need to verify.

Comment 13 Erik Erlandson 2010-08-05 05:25:42 UTC
Discovered that just taking the current hfs code, and adding a loop iteration counter, with a dprintf output of iteration count at the end, causes one of jrt's tests to stop working ("depthtest")

That puts my test results from the recent patches in some doubt. It's possible they work, and that the changes simply triggered whatever adding that single counter triggered.

Either way, it seems likely there is a memory initialization bug somewhere in hfs, or something similar that a small code tweak could trigger. We'll probably have to track that down, since until then any code changes can't be tested with confidence.

Comment 14 Jon Thomas 2010-08-05 13:53:28 UTC
Created attachment 436855 [details]
patch

Here's a patch with a somewhat different approach. It seems to work so far, but I haven't run all the scenarios yet.

Comment 15 Erik Erlandson 2010-08-05 15:49:03 UTC
Created attachment 436891 [details]
This is the test case that fails when I add in simple loop counter to current code.

Comment 16 Jon Thomas 2010-08-05 16:57:06 UTC
Created attachment 436912 [details]
patch for the loop, roundoff, and fractional slot problem

This seems to fix the reproducer. It also fixes the fractional slot problem, which exists in the same section. The issues are somewhat related, as both involve bad remaining counts. The fractional slot problem was the reason for a more involved second stage of roundoff, which is now removed. As a side note, the second stage of roundoff had some faulty logic that fails the depthtest, which may be the problem eje was/is experiencing.

Comment 17 Erik Erlandson 2010-08-06 05:19:26 UTC
(In reply to comment #13)
> Discovered that just taking the current hfs code, and adding a loop iteration
> counter, with a dprintf output of iteration count at the end, causes one of
> jrt's tests to stop working ("depthtest")

Looks like this wasn't a bug; it was an artifact of different job groups being submitted at different times, with the first group getting allocated before the others got into the queue. These jobs do not preempt each other, so the allocation ratios remained different from the 'expected' ratio.

Changing the tests to do group submissions as a single submit does a lot to improve testing consistency.

Submitting short-duration jobs would also allow allocation ratios to 'recover', but it would also require submitting even more jobs.

Comment 18 Erik Erlandson 2010-08-06 05:28:54 UTC
Created attachment 437030 [details]
candidate patch based partly on jrt's patch (#436912)

This patch uses some fairshare loop halting logic adapted from one of my earlier patches, and also includes the roundoff and fractional slot fixes from jrt's most recent patch (#436912).

I added a loop halt on a maximum iteration threshold, as a purely defensive measure to prevent any future unanticipated long/infinite-looping corner cases from disabling the negotiator for users/customers.
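
A minimal sketch of that defensive pattern (hypothetical threshold and names; the real change is in the attached patch):

static void fairshareLoopSketch(double remaining) {
    const int maxIters = 1000;    // assumed bound, generous for any normal negotiation
    bool gavesome = true;
    int  niter    = 0;

    while (gavesome && remaining > 0.000001) {
        if (++niter > maxIters)
            break;                // defensive halt: leftovers fall through to roundoff
        // ... one fair-share pass would update 'gavesome' and 'remaining' here ...
        gavesome = false;         // placeholder so this sketch terminates
    }
}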

Comment 19 Jon Thomas 2010-08-06 14:11:05 UTC
Don't shoot the messenger, but the latest patch doesn't really fix the fractional slot problem and has a bug.

The fractional slot problem is when:

childshare=remaining;
remaining=0;
canonlyuse<childshare;

The node didn't use all of childshare, but remaining was set to 0. This means that (childshare-canonlyuse) slots are thrown out. I suppose the name is imprecise, but I observed it as fractions of slots being "lost", so I called it the fractional slot problem.

The bug is along the same lines: in the childshare = remaining case, remaining has to be set to an appropriate value. Otherwise, further down in the code or in the loop, that amount of remaining could be given out multiple times.

The other thing is that I think we will get faster convergence if we use:

factor=1.0/totalquota;

Consider 10 slots and two groups, each with a quota of 0.2.

On the first iteration, each group gets 2 slots, for a total of 4.

factor=1.0/totalquota;

gives us a factor of 1/0.4 = 2.5, while

factor=remaining / totalgiven;

gives us a factor of 6/4 = 1.5.

If you move

remaining=remaining-totalgiven;

below the calculation of factor or R, then it is 10/4, which also gives us 2.5.

I used both cases in the patch I attached, mostly because remaining could change before reaching the end of the loop.
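
A quick standalone check of the arithmetic above (plain numbers, not condor code):

#include <cstdio>

int main() {
    // 10 slots, two groups with a quota of 0.2 each.
    double remaining  = 10.0;
    double totalquota = 0.4;                       // 0.2 + 0.2
    double totalgiven = remaining * totalquota;    // first pass hands out 4 slots

    printf("1.0/totalquota                 = %g\n", 1.0 / totalquota);                      // 2.5
    printf("(remaining-totalgiven)/given   = %g\n", (remaining - totalgiven) / totalgiven); // 1.5
    printf("remaining/totalgiven (pre-sub) = %g\n", remaining / totalgiven);                // 2.5
    return 0;
}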

Comment 20 Erik Erlandson 2010-08-06 18:39:28 UTC
Created attachment 437232 [details]
Patch that correctly handles logic of "give-away" mode.

Fixed my bug in the handling of give-away mode based on jrt's explanation. Also tweaked the multiplier logic for faster convergence, as he described.

I think this version may make the difference between normal fair-share mode and give-away mode easier to follow in the code.

Comment 21 Erik Erlandson 2010-08-06 21:30:34 UTC
Pushed patch (id=437232) to branch V7_4-BZ619552-fairshare-loop-halting on FH repo.

Link to patch: https://bugzilla.redhat.com/attachment.cgi?id=437232&action=diff

Comment 22 Lubos Trilety 2010-08-27 07:57:09 UTC
Tested with (version):
condor-7.4.4-0.9

Tested on:
RHEL5 x86_64,i386  - passed
RHEL4 x86_64,i386  - passed

>>> VERIFIED

Comment 23 Florian Nadge 2010-10-07 14:02:53 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the HFS algorithm was caught in an infinite loop in some situations due to a floating point precision problem. This update corrects the problems in the HFS algorithm and the issue is resolved.

Comment 25 errata-xmlrpc 2010-10-14 16:06:59 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html