Bug 637281
Summary: | Slots don't partition fully when preemption is turned off | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Jon Thomas <jthomas> | ||||||
Component: | condor | Assignee: | Erik Erlandson <eerlands> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Lubos Trilety <ltrilety> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 1.3 | CC: | fnadge, ltrilety, matt | ||||||
Target Milestone: | 1.3.2 | ||||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | condor-7.4.5-0.2 | Doc Type: | Bug Fix | ||||||
Doc Text: |
Previously, partitionable slots could not be fully utilized when preemption was disabled and GROUP_DYNAMIC_MACH_CONSTRAINT was set.
With this update, trimming of startd ads for preemption is now carried out after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator correctly counts claimed slots. Now, the negotiator sends the proper slot counts including claimed slots to the inner negotiation loops when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled. This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.
|
Story Points: | --- | ||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2011-02-15 12:12:08 UTC | Type: | --- | ||||||
Attachments: |
|
Description
Jon Thomas
2010-09-24 19:20:15 UTC
additional note: This was on a personal Condor. A customer with more than one node indicated he thought he saw the same behavior.

BTW, this simpler config shows the same problem:

START = true
NUM_CPUS = 10
SLOT_TYPE_1 = cpus=10
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

Created attachment 452182
move trimStartdAds to after constraint eval

Created attachment 452400
quotaconstraintv2 patch
It looks like the bug has been around for as long as GROUP_DYNAMIC_MACH_CONSTRAINT has been used with trimStartdAds, so it appears to exist upstream too.
The previous patch moved the trimStartdAds call to after the GROUP_DYNAMIC_MACH_CONSTRAINT evaluation, but that meant trimStartdAds would not be called if GROUP_NAMES was not defined. The new patch ensures trimStartdAds gets called in both cases.
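
The ordering the patch aims for can be illustrated with a small Python sketch. This is not the matchmaker.cpp code: SlotAd, trim_startd_ads and prepare_for_negotiation are hypothetical names made up for this illustration, and the trimming rule (drop Claimed and Preempting ads when preemption is not considered) is a simplified assumption. The point is only the order of operations: count slots for the group quota first, then trim, and trim whether or not groups are defined.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SlotAd:
    """Hypothetical stand-in for a startd ad; only the fields used here."""
    name: str
    state: str  # e.g. "Unclaimed", "Claimed", "Preempting"


def trim_startd_ads(ads: Sequence[SlotAd], consider_preemption: bool) -> List[SlotAd]:
    """Simplified assumption: with preemption off, drop claimed/preempting ads."""
    if consider_preemption:
        return list(ads)
    return [ad for ad in ads if ad.state not in ("Claimed", "Preempting")]


def prepare_for_negotiation(
    ads: Sequence[SlotAd],
    group_names: Sequence[str],
    constraint: Optional[Callable[[SlotAd], bool]],
    consider_preemption: bool,
) -> Tuple[Optional[int], List[SlotAd]]:
    """Count slots for the group quota before trimming, then trim on every path."""
    num_dyn_group_slots: Optional[int] = None
    if group_names:
        # Count while claimed slots are still in the list.
        if constraint is None:
            num_dyn_group_slots = len(ads)
        else:
            num_dyn_group_slots = sum(1 for ad in ads if constraint(ad))
    # Trimming happens here regardless of whether groups are defined.
    trimmed = trim_startd_ads(ads, consider_preemption)
    return num_dyn_group_slots, trimmed


if __name__ == "__main__":
    ads = [
        SlotAd("slot1@host", "Unclaimed"),   # partitionable parent slot
        SlotAd("slot1_1@host", "Claimed"),   # dynamic slot
        SlotAd("slot1_2@host", "Claimed"),   # dynamic slot
    ]
    count, trimmed = prepare_for_negotiation(ads, ["group_a"], None, False)
    print(count, [ad.name for ad in trimmed])  # prints: 3 ['slot1@host']
```

If the count were taken after trimming instead, the same input would yield 1 rather than 3, so the quota seen by the group negotiation would exclude the already-claimed dynamic slots.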
I don't understand why this needs fixing -- trimStartdAds removes claimed (or preempting) ads from consideration. In the event that consider-preemption is off, it seems correct for numDynGroupSlots to not include those slots either, since they won't be up for consideration. partitionable slots are never in claimed state, so disabling preemption should never affect the number of partitionable slots seen by HFS.

Hi, Did you try the repro? The reason the fix is required is that NegotiateWithGroup expects the group quota to include claimed slots. Therefore, we have to count slots before trimming out the claimed ones. btw, the actual value of GROUP_DYNAMIC_MACH_CONSTRAINT is mostly irrelevant here. The constraint evaluation returns Length() or some number less than Length(), but this needs to include the claimed slots.

Incorporated Jon's fix in devel branch: V7_4-BZ619557-HFS-tree-structure

Where can I find following lines, in what file are they?
09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3
09/24/10 15:10:53 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 8 to 3

> Where can I find following lines, in what file are they?
>
> 09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine
> count from 5 to 3
That check is in matchmaker.cpp, line 1000
Successfully reproduced on:
$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:
Bug manifests when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is set (accounting groups are in effect).

Consequence:
Partitionable slots will not be fully utilized.

Fix:
Trimming of startd ads for preemption was moved to after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator properly counts claimed slots.

Result:
When preemption is disabled, and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled, the negotiator now sends the proper slot counts including claimed slots to the inner negotiation loops. This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.

Tested with (version):
condor-7.4.5-0.6
Tested on:
RHEL5 x86_64,i386 - passed
RHEL4 x86_64,i386 - passed
/var/log/condor/NegotiatorLog:01/10/11 10:06:05 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 11 to 6
# condor_status
Name          OpSys  Arch    State      Activity  LoadAv  Mem    ActvtyTime
slot1@host    LINUX  X86_64  Owner      Idle      0.130   12888  0+00:00:11
slot2@host    LINUX  X86_64  Owner      Idle      0.000   12888  0+00:00:12
slot3@host    LINUX  X86_64  Unclaimed  Idle      0.000   12886  0+00:00:05
slot3_1@host  LINUX  X86_64  Claimed    Busy      0.000   1      0+00:00:05
slot3_2@host  LINUX  X86_64  Claimed    Busy      0.000   1      0+00:00:06
slot4@host    LINUX  X86_64  Unclaimed  Idle      0.000   12886  0+00:00:07
slot4_1@host  LINUX  X86_64  Claimed    Busy      0.000   1      0+00:00:07
slot4_2@host  LINUX  X86_64  Claimed    Busy      0.000   1      0+00:00:07
slot5@host    LINUX  X86_64  Unclaimed  Idle      0.000   12886  0+00:00:28
slot5_1@host  LINUX  X86_64  Claimed    Busy      0.000   1      0+00:00:08
slot5_2@host  LINUX  X86_64  Claimed    Busy      0.000   1      0+00:00:08

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
X86_64/LINUX        11      2        6          3        0           0
       Total        11      2        6          3        0           0
>>> VERIFIED
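
When repeating this verification, the machine counts can be pulled out of the NegotiatorLog line with a short script and compared against the condor_status totals by hand. The following is a minimal sketch; the helper name parse_constraint_line is made up for this example, and the regular expression assumes the message text has exactly the form quoted in this report.

```python
import re
from typing import Optional, Tuple

# Assumes the message format quoted in this bug report.
LINE_RE = re.compile(
    r"GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count "
    r"from (\d+) to (\d+)"
)


def parse_constraint_line(line: str) -> Optional[Tuple[int, int]]:
    """Return (count_before, count_after) from a matching log line, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = (
        "01/10/11 10:06:05 GROUP_DYNAMIC_MACH_CONSTRAINT constraint "
        "reduces machine count from 11 to 6"
    )
    print(parse_constraint_line(sample))  # prints: (11, 6)
```

In the run above the first number (11) matches the Total machine count reported by condor_status, which includes the six claimed dynamic slots.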
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,11 +1,2 @@
-Cause:
+Previously, partitionable slots could not be fully utilized when the preemption was disabled and GROUP_DYNAMIC_MACH_CONSTRAINT was set.
-Bug manifests when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is set (accounting groups are in effect).
+With this update, trimming of startd ads for preemption si now carried out after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator correctly counts claimed slots. Now, the negotiator sends the proper slot counts including claimed slots to the inner negotiation loops when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled. This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.
-
-Consequence:
-Partitionable slots will not be fully utilized.
-
-Fix:
-Trimming of startd ads for preemption was moved to after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator properly counts claimed slots.
-
-Result:
-When preemption is disabled, and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled, the negotiator now sends the proper slot counts including claimed slots to the inner negotiation loops. This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html