Bug 619552
| Summary: | negotiator hfs incorrect remaining and infinite loop | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Jon Thomas <jthomas> |
| Component: | condor | Assignee: | Erik Erlandson <eerlands> |
| Status: | CLOSED ERRATA | QA Contact: | Lubos Trilety <ltrilety> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.2 | CC: | fnadge, ltrilety, matt |
| Target Milestone: | 1.3 | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, the HFS algorithm was caught in an infinite loop in some situations due to a floating point precision problem. This update corrects the problems in the HFS algorithm and the issue is resolved. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-10-14 16:06:59 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 528800 | | |
| Attachments: | | | |
Created attachment 435617 [details]
patch
patch to fix the precision/roundoff problem
Reproducer (pasted from jrt). It's 2 slots.

Config:

```
GROUP_NAMES = a1, b1, a1.a2, a1.a1, a1.a3 b1.b2, b1.b1
GROUP_QUOTA_DYNAMIC_a1 = .5
GROUP_QUOTA_DYNAMIC_b1 = .4999999
GROUP_QUOTA_DYNAMIC_b1.b1 = .4
GROUP_QUOTA_DYNAMIC_b1.b2 = .6
GROUP_QUOTA_DYNAMIC_a1.a1 = 1
```

and the job is:

```
executable = ./sleep.jrt
arguments = 6000
universe = vanilla
#transfer_executable = true
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
#output = a.out.$(cluster).$(process)
#log = a.log.$(cluster).$(process)
#+AccountingGroup = "jrt"
queue 80
```

Comment on attachment 435617 [details]
patch
Marking this as obsolete as it is not the correct patch for this particular problem.
The issue is really a very small childshare leading to a very small decrement of remaining. It's not truly an infinite loop, but it would appear as one to the system.
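To make the failure mode concrete, here is a minimal standalone C++ sketch (not the condor code; the variable names are borrowed from this discussion and the quota value is just an illustration) showing how a tiny share produces a vanishingly small decrement of remaining, so a loop conditioned on remaining staying above zero effectively never finishes:

```cpp
// Standalone illustration of the failure mode (not the condor code): with a
// tiny dynamic quota, each pass decrements 'remaining' by almost nothing, so
// a loop conditioned on "remaining > 0" effectively never finishes.
#include <cstdio>

int main() {
    double remaining  = 2.0;        // two slots to hand out
    double quota      = 0.0000001;  // a very small dynamic quota
    long   iterations = 0;

    while (remaining > 0.0 && iterations < 100000000L) {
        double childshare = remaining * quota;  // vanishingly small decrement
        remaining -= childshare;
        ++iterations;
    }

    // 'remaining' decays geometrically and never reaches zero, so the loop
    // only stops because of the artificial iteration cap added here.
    printf("iterations=%ld remaining=%.12f\n", iterations, remaining);
    return 0;
}
```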
A minimal reproducer. HFS config:

```
GROUP_NAMES = grp_a
GROUP_QUOTA_DYNAMIC_grp_a = 0.9999999
```

Submit (a single non-group job):

```
% echo -e "universe=vanilla\ncmd=/bin/sleep\nargs=1m\nqueue\n" | condor_submit
```

Here's another reproducer of the same problem, except at the group level. The same problem occurs in two locations (same math), which is all the more reason for the "not getting work done" small totalgiven fix. In addition to the fix for this problem, I think we may want to consider scrubbing the config. To reproduce, just run a job as b1.b1:

```
GROUP_NAMES = a1, b1, a1.a2, a1.a1, a1.a3, a1.a3.a1, a1.a3.a2, a1.a3.a3, b1.b2, b1.b1
GROUP_QUOTA_DYNAMIC_a1 = .4
GROUP_QUOTA_DYNAMIC_b1 = .4
GROUP_QUOTA_DYNAMIC_b1.b1 = .00000001
GROUP_QUOTA_DYNAMIC_b1.b2 = .00000001
NUM_CPUS = 2
GROUP_AUTOREGROUP_b1 = TRUE
GROUP_AUTOREGROUP_b1.b2 = TRUE
GROUP_AUTOREGROUP_b1.b1 = TRUE
```

Created attachment 436370 [details]
reorg of fairshare() for safe loop halting conditions
Unfortunately, the patch breaks hfs. I ran the nongroupuser test case and saw two problems: 1) missing slots, 2) unused quota isn't staying at the subgroup level; quota that should be used by peer groups is getting used at the upper levels.

Here's some hard to read data. The slots consumed by a1.a3.a2 and a1.a3.a3 should be used by a1.a3 and a1.a3.a1. With the patch, the unused slots are used by non-group users and the b1.x groups.

no patch, good:

```
jrt a1.a1 a1.a2 a1.a3 b1.b1 b1.b2 a1.a3.a1 a1.a3.a2 a1.a3.a3 qa1.a1 qa1.a2 qa1.a3 qb1.b1 qb1.b2 qa1.a3.a1 qa1.a3.a2 qa1.a3.a3
20 8 8 24/4.8 16 24 9.6 4.8 4.8
-------------------------------------------------------------------------
20 - 0 - 0 - 14 - 16 - 24 - 26 - 0 - 0 -80 0 0 100 100 100 100 0 0
20 - 0 - 0 - 14 - 16 - 24 - 26 - 0 - 0 -80 0 0 100 100 100 100 0 0
20 - 0 - 0 - 14 - 16 - 24 - 26 - 0 - 0 -80 0 0 100 100 100 100 0 0
```

with patch, bad:

```
jrt a1.a1 a1.a2 a1.a3 b1.b1 b1.b2 a1.a3.a1 a1.a3.a2 a1.a3.a3 qa1.a1 qa1.a2 qa1.a3 qb1.b1 qb1.b2 qa1.a3.a1 qa1.a3.a2 qa1.a3.a3
20 8 8 24/4.8 16 24 9.6 4.8 4.8
-------------------------------------------------------------------------
26 - 0 - 0 - 10 - 16 - 28 - 19 - 0 - 0 -80 0 0 100 100 100 100 0 0
26 - 0 - 0 - 10 - 16 - 28 - 19 - 0 - 0 -80 0 0 100 100 100 100 0 0
26 - 0 - 0 - 10 - 16 - 29 - 19 - 0 - 0 -80 0 0 100 100 100 100 0 0
26 - 0 - 0 - 10 - 16 - 29 - 19 - 0 - 0 -80 0 0 100 100 100 100 0 0
```

This is with `if (remaining < 0.1) break;`. As discussed, 0.1 is a bit large; I can test with a smaller value.

I think the problem with the patch wrt hfs is that it removed the parts that consume all of the fractional slot a node was given and subsequently return 0 for the amount that wasn't used. This means that groupArray[index].unused doesn't equal zero in some cases.
groupArray[index].unused is used as a flag that prunes the tree. Unused>0 signifies that the subtree has unused slots and therefore can't use more.
In any case, I'm more inclined to leave in the parts that consume the fractional slots, based upon the logic that the algorithm is trying to give out a computed share: either the subtree can use the entire computed share or it can't, due to the number of jobs.
I think a fix for this is checking totalgiven for a small value, perhaps something like:

```
if (totalgiven < 0.001 && was_an_autoregroup == true) {
    round_robin_whole_slots;
    give_away_fraction;
    remaining = 0;
}
```
In the repro, we are trying to give away 2 slots. If we bail out because the math prevents us from giving any away, then we need to use an alternate method to distribute the slots.
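A rough C++ sketch of that alternate distribution follows. The GroupShare struct and the helper are hypothetical stand-ins (not the negotiator's actual data structures); the point is the shape of the fallback: hand out whole slots round-robin, then give away the remaining fraction and declare the pool exhausted.

```cpp
// Hypothetical sketch of the fallback: round-robin whole slots, then the
// fraction. Field names and structure are assumptions for illustration.
#include <vector>

struct GroupShare {
    double given;    // slots handed to this group so far
    int    waiting;  // idle jobs still waiting in this group
};

void distribute_fallback(std::vector<GroupShare>& groups, double& remaining) {
    // Hand out whole slots round-robin to groups that can still use them.
    bool gave = true;
    while (remaining >= 1.0 && gave) {
        gave = false;
        for (GroupShare& g : groups) {
            if (remaining >= 1.0 && g.waiting > 0) {
                g.given   += 1.0;
                g.waiting -= 1;
                remaining -= 1.0;
                gave = true;
            }
        }
    }
    // Give the leftover fraction to the first group that can use it, then
    // zero 'remaining' so the caller's loop terminates.
    for (GroupShare& g : groups) {
        if (g.waiting > 0) {
            g.given += remaining;
            break;
        }
    }
    remaining = 0.0;
}
```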
Created attachment 436656 [details]
updated reorg of fairshare() with halting conditions tweaked to preserve HFS resource sharing
Created attachment 436662 [details]
Fixed typo in 'niter' check from previous patch
Created attachment 436663 [details]
Fixed yet another typo in 'niter' halting check.
The latest patch still isn't behaving correctly in all cases. I don't think this patch returns a remaining value of exactly zero in all the same cases that the current version does. Need to verify.

Discovered that just taking the current hfs code and adding a loop iteration counter, with a dprintf output of the iteration count at the end, causes one of jrt's tests to stop working ("depthtest").

That puts my testing results from recent patches in some doubt. It's possible they work, but the changes triggered whatever adding that single counter triggered.

Either way, it seems likely there is a memory initialization bug somewhere in hfs, or some other similar thing that would be triggered by a small code tweak. We'll probably have to track that down, since any code changes now can't be tested with confidence.
Created attachment 436855 [details]
patch
Here's a patch with a somewhat different approach. It seems to work so far, but I haven't run all the scenarios yet.
Created attachment 436891 [details]
This is the test case that fails when I add a simple loop counter to the current code.
Created attachment 436912 [details]
patch for the loop, roundoff, and fractional slot problem
This seems to fix the reproducer. It also fixes the fractional slot problem, which exists in the same section; the issues are somewhat related, both producing bad remaining counts. The fractional slot problem was the reason for a more involved second stage of roundoff, and that is now removed. As a side note, the second stage of roundoff had some faulty logic which fails the depthtest, which may be the problem eje was/is experiencing.
(In reply to comment #13)
> Discovered that just taking the current hfs code, and adding a loop iteration
> counter, with a dprintf output of iteration count at the end, causes one of
> jrt's tests to stop working ("depthtest")

Looks like this wasn't a bug; it was an artifact of different job groups getting submitted at different times, with the first group getting allocated before the others got into the queue. These jobs do not preempt each other, so the allocation ratios remained different than the 'expected' ratio.

Changing the tests to do group submissions as a single submit does a lot to improve testing consistency. Submitting short-duration jobs would also allow allocation ratios to 'recover', but it would also require submitting even more jobs.

Created attachment 437030 [details]
candidate patch based partly on jrt's patch (#436912)
This patch uses some fairshare loop halting logic adapted from one of my earlier patches, and also includes the roundoff and fractional slot fixes from jrt's most recent patch (#436912).
I added a loop halt at a maximum iteration threshold as a purely defensive measure, to prevent any future unanticipated long or infinite looping corner case from disabling the negotiator for users/customers.
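A minimal sketch of that kind of defensive halting condition is below. The constant names, the threshold values, and the distribute_pass callback are assumptions for illustration, not the code from the actual patch.

```cpp
// Sketch of defensive halting conditions: an epsilon on 'remaining' plus a
// hard iteration cap. Names and values are assumptions, not the patch.
#include <functional>

const int    MAX_FAIRSHARE_ITERATIONS = 1000;      // hard cap, purely defensive
const double REMAINING_EPSILON        = 0.000001;  // treat anything smaller as zero

// distribute_pass performs one pass of the fair-share distribution; it should
// decrement 'remaining' and return true if it handed anything out.
void fairshare_loop(double remaining,
                    const std::function<bool(double&)>& distribute_pass) {
    bool gavesome = true;
    int  niter    = 0;
    while (gavesome && remaining > REMAINING_EPSILON) {
        gavesome = distribute_pass(remaining);
        if (++niter >= MAX_FAIRSHARE_ITERATIONS) {
            break;  // never let an unanticipated corner case hang the negotiator
        }
    }
}
```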
Don't shoot the messenger, but the latest patch doesn't really fix the fractional slot problem and it has a bug.

The fractional slot problem occurs when:

```
childshare = remaining;
remaining = 0;
canonlyuse < childshare;
```

The node didn't use all of childshare, but remaining was set to 0. This means that (childshare - canonlyuse) slots are thrown out. I suppose it's misnamed, but I observed it with fractions of slots being "lost", so I called it the fractional slot problem.

The bug is along the same lines. In the `childshare = remaining;` case, remaining has to be set to an appropriate value. Otherwise, further down in the code or in the loop, that amount of remaining could be given out multiple times.

The other thing is that I think we will have faster convergence if we use:

```
factor = 1.0 / totalquota;
```

Consider 10 slots and two groups with a quota of 0.2 each. In the first iteration each group gets 2 slots, for a total of 4. `factor = 1.0/totalquota` gives us a factor of 1/.4 = 2.5, while `factor = remaining / totalgiven` gives us a factor of 6/4 = 1.5. If you move `remaining = remaining - totalgiven;` below the calculation of factor (or R), then it's 10/4, which gives us 2.5 too. I used both cases in the patch I attached, mostly because remaining could change prior to reaching the end of the loop.

Created attachment 437232 [details]
Patch that correctly handles logic of "give-away" mode.
Fixed my bug in handling of give-away mode based on jrt's explanation. Also tweaked the multiplier logic for faster convergence as he described.
I think this version may make the difference between normal fair-share mode and give-away mode easier to follow in the code.
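For reference, here is a worked version of jrt's convergence example above (10 slots, two groups with a 0.2 quota each). The variable names follow the discussion; this only illustrates the multiplier arithmetic, not the patch itself.

```cpp
// Worked multiplier arithmetic from the convergence example: 10 slots,
// two groups with quota 0.2 each. Illustration only, not the patch code.
#include <cstdio>

int main() {
    double remaining  = 10.0;            // slots to distribute
    double totalquota = 0.2 + 0.2;       // the two groups' quotas sum to 0.4
    double totalgiven = remaining * 0.2  // first pass: 2 slots per group...
                      + remaining * 0.2; // ...for a total of 4

    double factor_a = 1.0 / totalquota;                      // 1 / 0.4 = 2.5
    double factor_b = (remaining - totalgiven) / totalgiven; // 6 / 4   = 1.5
    double factor_c = remaining / totalgiven;                // 10 / 4  = 2.5, if remaining
                                                             // is reduced only after the
                                                             // factor is computed

    printf("%.2f %.2f %.2f\n", factor_a, factor_b, factor_c); // 2.50 1.50 2.50
    return 0;
}
```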
Pushed patch (id=437232) to branch V7_4-BZ619552-fairshare-loop-halting on FH repo. Link to patch: https://bugzilla.redhat.com/attachment.cgi?id=437232&action=diff

Tested with (version):
condor-7.4.4-0.9
Tested on:
RHEL5 x86_64,i386 - passed
RHEL4 x86_64,i386 - passed
>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, the HFS algorithm was caught in an infinite loop in some situations due to a floating point precision problem. This update corrects the problems in the HFS algorithm and the issue is resolved.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html
The first part: In some situations, the HFS algorithm is caught in an infinite loop. This is due to a floating point precision problem where the number of remaining slots is a very small number when it should be zero. Remaining slots is calculated with:

```
remaining = remaining - totalgiven;
```

totalgiven is a sum of floating point values that should sum to be equal to remaining. The system precision in some cases causes totalgiven to be slightly smaller than remaining. It affects a system that uses doubles too. It means we can end up in an infinite loop here:

```
while (gavesome && remaining > 0) {
```

The fix is to catch this condition by setting our own level of precision:

```
while (gavesome && remaining > 0.000001) {
```

The second part is a bug where remaining is incorrectly set to 0. This changes the totalgiven calculation, which causes a small fraction (0.5 or less) of a slot to be discarded. This slot is later reclaimed by the roundoff stages, but it should be claimed in the fairshare stage.

```
if (remaining < 0.05) {
    childshare = remaining;
    remaining = 0;   <-- bug
```

I have this fixed in a new version of the hfs patch, but I'll have to backport it to the old version.
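Putting both parts together, here is a simplified C++ sketch using the variable names from this comment. The surrounding structure is heavily condensed (the distribution pass is only indicated by a comment) and the corrected handling of the leftover share is my reading of the fix described above, not the actual negotiator code.

```cpp
// Simplified sketch of the two fixes: an explicit precision threshold on the
// loop, and reducing 'remaining' only by what the child actually uses.
#include <algorithm>

double fairshare_fixes_sketch(double remaining, double canonlyuse) {
    // Part one: compare 'remaining' against an explicit precision threshold
    // instead of zero, so an accumulated floating point shortfall cannot keep
    // the loop alive forever.
    bool gavesome = true;
    while (gavesome && remaining > 0.000001) {
        double totalgiven = 0.0;
        // ... one distribution pass would accumulate totalgiven here ...
        gavesome  = (totalgiven > 0.0);
        remaining = remaining - totalgiven;
    }

    // Part two: when the leftover is handed to a child as its share, reduce
    // 'remaining' only by what the child can actually use, instead of
    // unconditionally setting it to zero and discarding the unused fraction.
    if (remaining < 0.05) {
        double childshare = remaining;
        double used = std::min(canonlyuse, childshare);  // the child consumes 'used'
        remaining  -= used;
    }
    return remaining;
}
```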