With preemption off, a partitionable slot doesn't partition fully.

repro/data:

Works:
------
NUM_CPUS = 10
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 5
START = SlotId > 2
GROUP_DYNAMIC_MACH_CONSTRAINT = State =!= "Owner" && Cpus > 0
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
#NEGOTIATOR_CONSIDER_PREEMPTION = False
#PREEMPT = False
#PREEMPTION_REQUIREMENTS = False

09/24/10 15:02:39 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3
... submit jobs to use slots
09/24/10 15:03:19 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 11 to 6

condor_status
Name               OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime

slot1              LINUX  X86_64  Owner      Idle      0.000   390  0+07:25:17
slot2              LINUX  X86_64  Owner      Idle      0.000   390  0+07:25:18
slot3              LINUX  X86_64  Unclaimed  Idle      0.000   388  0+00:00:06
slot3_1@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:06
slot3_2@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:06
slot4              LINUX  X86_64  Unclaimed  Idle      0.000   388  0+00:00:07
slot4_1@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:07
slot4_2@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:07
slot5              LINUX  X86_64  Unclaimed  Idle      0.000   388  0+00:00:08
slot5_1@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:08
slot5_2@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:08

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting

X86_64/LINUX        11      2        6          3        0           0

       Total        11      2        6          3        0           0

Doesn't work:
-------------
NUM_CPUS = 10
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 5
START = SlotId > 2
GROUP_DYNAMIC_MACH_CONSTRAINT = State =!= "Owner" && Cpus > 0
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NEGOTIATOR_CONSIDER_PREEMPTION = False
PREEMPT = False
PREEMPTION_REQUIREMENTS = False

09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3
... submit jobs to use slots
09/24/10 15:10:53 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 8 to 3

condor_status
Name               OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime

slot1              LINUX  X86_64  Owner      Idle      0.380   390  0+00:00:09
slot2              LINUX  X86_64  Owner      Idle      0.000   390  0+00:00:10
slot3              LINUX  X86_64  Unclaimed  Idle      0.000   389  0+00:00:56
slot3_1@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:06
slot4              LINUX  X86_64  Unclaimed  Idle      0.000   389  0+00:00:57
slot4_1@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:07
slot5              LINUX  X86_64  Unclaimed  Idle      0.000   389  0+00:00:58
slot5_1@localhost. LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:08

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting

X86_64/LINUX         8      2        3          3        0           0

       Total         8      2        3          3        0           0
additional note:

This was on a personal Condor. A customer with more than one node indicated he thought he saw the same behavior.

BTW, this simpler config shows the same problem.

START = true
NUM_CPUS = 10
SLOT_TYPE_1 = cpus=10
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1
Created attachment 452182
move trimStartdAds to after constraint eval
Created attachment 452400
quotaconstraintv2 patch

It looks like the bug has been around as long as GROUP_DYNAMIC_MACH_CONSTRAINT has been used with trimStartdAds, so it looks like the bug exists upstream too.

The previous patch moved the trimStartdAds call to after the GROUP_DYNAMIC_MACH_CONSTRAINT evaluation, but that means trimStartdAds would not be called if GROUP_NAMES was not defined. The new patch ensures trimStartdAds gets called in both cases.
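For readers without the attachments, here is a rough Python sketch of the control-flow difference between the two patches as described in this comment. It is not the matchmaker.cpp code: the function names `first_patch` and `second_patch`, the dict-based ads, and the sample constraint are all hypothetical, and it assumes preemption is disabled so that trimming applies.

```python
def trim_startd_ads(ads):
    """Stand-in for trimStartdAds: drop Claimed and Preempting ads."""
    return [ad for ad in ads if ad["State"] not in ("Claimed", "Preempting")]


def count_matching(ads, constraint):
    """Number of ads that satisfy the group constraint."""
    return sum(1 for ad in ads if constraint(ad))


def first_patch(ads, group_names, constraint):
    """Trim after the constraint count, but only inside the group branch."""
    num_slots = None
    if group_names:
        num_slots = count_matching(ads, constraint)
        ads = trim_startd_ads(ads)
    return num_slots, ads


def second_patch(ads, group_names, constraint):
    """Count inside the group branch, then trim on every path."""
    num_slots = None
    if group_names:
        num_slots = count_matching(ads, constraint)
    ads = trim_startd_ads(ads)
    return num_slots, ads


# Hypothetical sample: one partitionable slot and one claimed dynamic slot.
sample_ads = [
    {"Name": "slot1", "State": "Unclaimed"},
    {"Name": "slot1_1", "State": "Claimed"},
]
not_owner = lambda ad: ad["State"] != "Owner"

no_groups_v1 = first_patch(sample_ads, [], not_owner)
no_groups_v2 = second_patch(sample_ads, [], not_owner)
with_groups_v2 = second_patch(sample_ads, ["group_a"], not_owner)
```

In this sketch, with no group names configured the first variant returns the claimed ad untrimmed, while the second variant trims it on both paths and still counts the claimed slot when groups are configured.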
I don't understand why this needs fixing -- trimStartdAds removes claimed (or preempting) ads from consideration. In the event that consider-preemption is off, it seems correct for numDynGroupSlots to not include those slots either, since they won't be up for consideration.

Partitionable slots are never in claimed state, so disabling preemption should never affect the number of partitionable slots seen by HFS.
Hi, Did you try the repro? The reason the fix is required is that NegotiateWithGroup expects the group quota to include claimed slots. Therefore, we have to count slots before trimming out the claimed ones. btw, the actual value of GROUP_DYNAMIC_MACH_CONSTRAINT is mostly irrelevant here. The constraint evaluation returns Length() or some number less than Length(), but this needs to include the claimed slots.
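To make the ordering point concrete, a minimal Python sketch of counting before versus after the trim, using ads shaped like the "Doesn't work" condor_status output. This is an illustration only, not the negotiator code: the `SlotAd` class, the function names, and the per-slot Cpus values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SlotAd:
    name: str
    state: str  # "Owner", "Unclaimed", or "Claimed"
    cpus: int   # assumed Cpus value advertised by this slot ad


def constraint(ad):
    # Mirrors GROUP_DYNAMIC_MACH_CONSTRAINT = State =!= "Owner" && Cpus > 0
    return ad.state != "Owner" and ad.cpus > 0


def trim_claimed(ads):
    # Stand-in for trimStartdAds with preemption disabled.
    return [ad for ad in ads if ad.state != "Claimed"]


def group_slot_count(ads, trim_first):
    # trim_first=True is the old order; False counts before trimming.
    if trim_first:
        ads = trim_claimed(ads)
    return sum(1 for ad in ads if constraint(ad))


# Eight ads as in the failing run; Cpus values are assumed, not logged.
ads = [
    SlotAd("slot1", "Owner", 2),
    SlotAd("slot2", "Owner", 2),
    SlotAd("slot3", "Unclaimed", 1),
    SlotAd("slot3_1", "Claimed", 1),
    SlotAd("slot4", "Unclaimed", 1),
    SlotAd("slot4_1", "Claimed", 1),
    SlotAd("slot5", "Unclaimed", 1),
    SlotAd("slot5_1", "Claimed", 1),
]

count_after_trim = group_slot_count(ads, trim_first=True)
count_before_trim = group_slot_count(ads, trim_first=False)
```

Under these assumptions the old order yields 3, matching the "from 8 to 3" log line, while counting first yields 6. If the quota has to cover the three slots already claimed, a count of 3 leaves nothing for further dynamic slots.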
Incorporated Jon's fix in devel branch: V7_4-BZ619557-HFS-tree-structure
Where can I find following lines, in what file are they?

09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 5 to 3
09/24/10 15:10:53 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 8 to 3
> Where can I find following lines, in what file are they?
>
> 09/24/10 15:09:33 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine
> count from 5 to 3

That check is in matchmaker.cpp, line 1000
Successfully reproduced on:

$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:
Bug manifests when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is set (accounting groups are in effect).

Consequence:
Partitionable slots will not be fully utilized.

Fix:
Trimming of startd ads for preemption was moved to after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator properly counts claimed slots.

Result:
When preemption is disabled, and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled, the negotiator now sends the proper slot counts including claimed slots to the inner negotiation loops. This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.
Tested with (version):
condor-7.4.5-0.6

Tested on:
RHEL5 x86_64,i386 - passed
RHEL4 x86_64,i386 - passed

/var/log/condor/NegotiatorLog:01/10/11 10:06:05 GROUP_DYNAMIC_MACH_CONSTRAINT constraint reduces machine count from 11 to 6

# condor_status
Name          OpSys  Arch    State      Activity  LoadAv  Mem    ActvtyTime

slot1@host    LINUX  X86_64  Owner      Idle      0.130   12888  0+00:00:11
slot2@host    LINUX  X86_64  Owner      Idle      0.000   12888  0+00:00:12
slot3@host    LINUX  X86_64  Unclaimed  Idle      0.000   12886  0+00:00:05
slot3_1@host  LINUX  X86_64  Claimed    Busy      0.000       1  0+00:00:05
slot3_2@host  LINUX  X86_64  Claimed    Busy      0.000       1  0+00:00:06
slot4@host    LINUX  X86_64  Unclaimed  Idle      0.000   12886  0+00:00:07
slot4_1@host  LINUX  X86_64  Claimed    Busy      0.000       1  0+00:00:07
slot4_2@host  LINUX  X86_64  Claimed    Busy      0.000       1  0+00:00:07
slot5@host    LINUX  X86_64  Unclaimed  Idle      0.000   12886  0+00:00:28
slot5_1@host  LINUX  X86_64  Claimed    Busy      0.000       1  0+00:00:08
slot5_2@host  LINUX  X86_64  Claimed    Busy      0.000       1  0+00:00:08

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting

X86_64/LINUX        11      2        6          3        0           0

       Total        11      2        6          3        0           0

>>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,11 +1,2 @@
-Cause:
+Previously, partitionable slots could not be fully utilized when preemption was disabled and GROUP_DYNAMIC_MACH_CONSTRAINT was set.
-Bug manifests when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is set (accounting groups are in effect).
+With this update, trimming of startd ads for preemption is now carried out after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator correctly counts claimed slots. Now, the negotiator sends the proper slot counts including claimed slots to the inner negotiation loops when preemption is disabled and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled. This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.
-
-Consequence:
-Partitionable slots will not be fully utilized.
-
-Fix:
-Trimming of startd ads for preemption was moved to after the constraint checking for GROUP_DYNAMIC_MACH_CONSTRAINT, so the negotiator properly counts claimed slots.
-
-Result:
-When preemption is disabled, and GROUP_DYNAMIC_MACH_CONSTRAINT is enabled, the negotiator now sends the proper slot counts including claimed slots to the inner negotiation loops. This allows the negotiation to include already-claimed dynamic slots and so partitionable slots can be fully utilized.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html