Motivation is twofold:

1) Customers have been requesting a return to the legacy behavior of the "none" group, such that it negotiates last (instead of its current behavior of being ordered in with other groups by starvation).
2) There have been some other requests that indicate a desire to explicitly control the negotiation order of accounting groups in customer-dependent ways.

The proposed solution to both of these cases is to expose an accounting group ordering expression to configuration. The expression would return a floating point number that would be used as the sort key for accounting groups.
TESTING:

Using the following configuration:

    NEGOTIATOR_DEBUG = D_FULLDEBUG
    NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
    NEGOTIATOR_INTERVAL = 30
    SCHEDD_INTERVAL = 15
    CLAIM_WORKLIFE = 0
    NUM_CPUS = 10
    # turn off round robin and multiple allocation rounds
    HFS_ROUND_ROBIN_RATE = 100000000
    HFS_MAX_ALLOCATION_ROUNDS = 1
    GROUP_NAMES = a, b
    GROUP_QUOTA_a = 5
    GROUP_QUOTA_b = 5
    GROUP_AUTOREGROUP = TRUE
    # sorts "b" before "a":
    GROUP_SORT_EXPR = ifThenElse(Name=?="a", 2, ifThenElse(Name=?="b", 1, 3.40282e+38))
    # this should trigger warnings and default to FLT_MAX:
    #GROUP_SORT_EXPR = 2+"a"

Bring up the negotiator with the above configuration, and then observe the ordering behavior for group negotiation:

    $ tail -f NegotiatorLog | grep -e WARNING -e sortkey
    12/07/11 14:44:43 Group b - sortkey= 1
    12/07/11 14:44:43 Group a - sortkey= 2
    12/07/11 14:44:43 Group <none> - sortkey= 3.40282e+38

Next, enable GROUP_SORT_EXPR = 2+"a", which defaults with warnings:

    $ tail -f NegotiatorLog | grep -e WARNING -e sortkey
    12/07/11 14:39:41 WARNING: sort expression "2+"a"" failed to evaluate to floating point for group <none> - defaulting to 3.40282e+38
    12/07/11 14:39:41 WARNING: sort expression "2+"a"" failed to evaluate to floating point for group a - defaulting to 3.40282e+38
    12/07/11 14:39:41 WARNING: sort expression "2+"a"" failed to evaluate to floating point for group b - defaulting to 3.40282e+38
    12/07/11 14:39:41 Group <none> - sortkey= 3.40282e+38
    12/07/11 14:39:41 Group a - sortkey= 3.40282e+38
    12/07/11 14:39:41 Group b - sortkey= 3.40282e+38

Disable the settings for GROUP_SORT_EXPR, and let it take on its default "starvation-order" expr.
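For anyone reasoning about the two log excerpts above, here is a minimal Python sketch of the behavior they show: each group gets a floating point sort key, groups negotiate in ascending key order, and an expression that does not yield a number falls back to FLT_MAX with a warning. This is not the negotiator's code. The names `sort_key` and `negotiation_order` are hypothetical, and plain Python callables stand in for ClassAd expressions.

```python
import math

FLT_MAX = 3.40282e+38  # the default value printed in the negotiator log above


def sort_key(group, expr):
    """Evaluate expr for a group; fall back to FLT_MAX if it is not numeric.

    expr is a plain Python callable standing in for GROUP_SORT_EXPR
    (an assumption of this sketch, not how ClassAds are evaluated).
    """
    try:
        value = expr(group)
    except Exception:
        value = None
    not_numeric = isinstance(value, bool) or not isinstance(value, (int, float))
    if not_numeric or math.isnan(value):
        print(f"WARNING: sort expression failed to evaluate to floating point "
              f"for group {group} - defaulting to {FLT_MAX}")
        return FLT_MAX
    return float(value)


def negotiation_order(groups, expr):
    """Return groups in ascending sort-key order (lowest key negotiates first)."""
    # sorted() is stable, so groups with equal keys keep their input order.
    return sorted(groups, key=lambda g: sort_key(g, expr))


# Stand-in for: ifThenElse(Name=?="a", 2, ifThenElse(Name=?="b", 1, 3.40282e+38))
b_before_a = lambda name: {"a": 2, "b": 1}.get(name, FLT_MAX)
# Stand-in for: 2+"a" (raises TypeError in Python)
bad_expr = lambda name: 2 + "a"

print(negotiation_order(["<none>", "a", "b"], b_before_a))  # ['b', 'a', '<none>']
print(negotiation_order(["<none>", "a", "b"], bad_expr))    # ['<none>', 'a', 'b']
```

When every key ties at FLT_MAX, the order in this sketch is simply the input order. Whether the negotiator's own sort is stable in the same way is not stated in this ticket.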
Submit the following file:

    cmd = /bin/sleep
    args = 60
    should_transfer_files = if_needed
    when_to_transfer_output = on_exit
    +AccountingGroup="a.user"
    queue 5
    +AccountingGroup="b.user"
    queue 10

You should see something similar to the following sequence, where prior to submission, all values are FLT_MAX, then "a" and "b" have zero starvation-ratio, as their allocation is nonzero but no jobs are yet running. Then their starvation ratios will become 1, as all allocation is filled. Then "a" will empty as its jobs complete first, then finally all ratios will return to FLT_MAX:

    $ tail -f NegotiatorLog | grep -e WARNING -e sortkey
    12/07/11 15:33:31 Group <none> - sortkey= 3.40282e+38
    12/07/11 15:33:31 Group a - sortkey= 3.40282e+38
    12/07/11 15:33:31 Group b - sortkey= 3.40282e+38
    12/07/11 15:33:51 Group a - sortkey= 0
    12/07/11 15:33:52 Group b - sortkey= 0
    12/07/11 15:33:52 Group <none> - sortkey= 3.40282e+38
    12/07/11 15:34:23 Group a - sortkey= 1
    12/07/11 15:34:23 Group b - sortkey= 1
    12/07/11 15:34:23 Group <none> - sortkey= 3.40282e+38
    12/07/11 15:34:53 Group a - sortkey= 1
    12/07/11 15:34:53 Group b - sortkey= 1
    12/07/11 15:34:53 Group <none> - sortkey= 3.40282e+38
    12/07/11 15:35:23 Group b - sortkey= 0
    12/07/11 15:35:24 Group <none> - sortkey= 3.40282e+38
    12/07/11 15:35:24 Group a - sortkey= 3.40282e+38
    12/07/11 15:35:54 Group b - sortkey= 1
    12/07/11 15:35:54 Group <none> - sortkey= 3.40282e+38
    12/07/11 15:35:54 Group a - sortkey= 3.40282e+38
    12/07/11 15:36:24 Group b - sortkey= 1
    12/07/11 15:36:24 Group <none> - sortkey= 3.40282e+38
    12/07/11 15:36:24 Group a - sortkey= 3.40282e+38
    12/07/11 15:36:55 Group <none> - sortkey= 3.40282e+38
    12/07/11 15:36:55 Group a - sortkey= 3.40282e+38
    12/07/11 15:36:55 Group b - sortkey= 3.40282e+38
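Checking a long sortkey log by eye is error-prone, so a small script may help when repeating this test. The sketch below parses lines in the format shown above and checks that, within each negotiation cycle, groups were logged in ascending key order. The helper names are hypothetical, and it assumes every group logs exactly one sortkey line per cycle, as in the output above.

```python
import re

# Matches lines such as: 12/07/11 15:35:23 Group b - sortkey= 0
SORTKEY_RE = re.compile(r"^(\S+ \S+) Group (\S+) - sortkey= (\S+)$")


def parse_sortkeys(lines):
    """Return (timestamp, group, sortkey) tuples; non-matching lines are skipped."""
    records = []
    for line in lines:
        match = SORTKEY_RE.match(line.strip())
        if match:
            records.append((match.group(1), match.group(2), float(match.group(3))))
    return records


def split_cycles(records, n_groups):
    """Chunk records into cycles, assuming n_groups sortkey lines per cycle."""
    return [records[i:i + n_groups] for i in range(0, len(records), n_groups)]


def is_ascending(cycle):
    """True if the groups in one cycle were logged in ascending key order."""
    keys = [key for _, _, key in cycle]
    return keys == sorted(keys)


sample = """\
12/07/11 15:35:23 Group b - sortkey= 0
12/07/11 15:35:24 Group <none> - sortkey= 3.40282e+38
12/07/11 15:35:24 Group a - sortkey= 3.40282e+38
12/07/11 15:35:54 Group b - sortkey= 1
12/07/11 15:35:54 Group <none> - sortkey= 3.40282e+38
12/07/11 15:35:54 Group a - sortkey= 3.40282e+38
""".splitlines()

# Three groups are logged per cycle in this test: <none>, a and b.
cycles = split_cycles(parse_sortkeys(sample), n_groups=3)
for cycle in cycles:
    print([group for _, group, _ in cycle], is_ascending(cycle))
```

Chunking by a fixed group count is used instead of grouping by timestamp because the timestamps within one cycle can straddle a second boundary (15:35:23 and 15:35:24 above).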
In the test above, you should use AccountingGroup instead of Name for the attribute here:

    # sorts "b" before "a":
    GROUP_SORT_EXPR = ifThenElse(AccountingGroup=?="a", 2, ifThenElse(AccountingGroup=?="b", 1, 3.40282e+38))
I tried the given scenario on condor-7.6.7-0.7. In starvation mode, even group '<none>' was set to zero after the jobs were submitted, and it stayed at zero for the rest of the test. According to the information provided, only groups 'a' and 'b' should be zeroed, and only for a short period before all allocation is filled.
(In reply to comment #4)
> I tried given scenario on condor-7.6.7-0.7, in the starvation mode even group
> '<none>' was set to zero after submit of the jobs. It stays to be zero for the
> rest of test. According to the provided information only groups 'a' and 'b'
> should be zeroed for small period of time before all allocation is filled.

Sorry, the problem is that the semantics of GROUP_AUTOREGROUP were altered on gt2679. Tweak the test config above to turn off autoregroup and surplus:

    GROUP_AUTOREGROUP = FALSE
    GROUP_ACCEPT_SURPLUS = FALSE

That restored the expected testing behavior.
(In reply to comment #1)
> You should see something similar to the following sequence, where prior to
> submission, all values are FLT_MAX, then "a" and "b" have zero
> starvation-ratio, as their allocation is nonzero but no jobs are yet
> running. Then their starvation ratios will become 1, as all allocation is
> filled. Then "a" will empty as its jobs complete first, then finally all
> ratios will return to FLT_MAX:

Since this ticket was originally created, the default for GROUP_SORT_EXPR has been changed to:

    ifThenElse(AccountingGroup=?="<none>",3.4e+38,ifThenElse(GroupQuota>0,GroupResourcesInUse/GroupQuota,3.3e+38))

The impact of this is that the 'ground state' for group sort keys is now zero instead of FLT_MAX, due to the change in denominator from GroupResourcesAllocated (which can be zero) to GroupQuota (which is not zero).

Since the change occurred prior to our releasing this feature, there will be no errata required from the change. For further explanation, see also:
https://bugzilla.redhat.com/show_bug.cgi?id=884629#c2

Note, GroupQuota *can* be zero if the negotiator cycles prior to any slot ads appearing in the collector, so FLT_MAX may appear briefly before startds check in with collector.
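To make the new 'ground state' concrete, the default expression quoted above can be transcribed into Python. This is only a sketch for reasoning about expected sort keys: the function name `default_sort_key` is hypothetical, the arguments mirror the attribute names in the expression, and Python's true division is assumed (the numeric types the negotiator uses for these attributes are not stated here).

```python
NONE_KEY = 3.4e+38      # constant used for the "<none>" group in the expression
NO_QUOTA_KEY = 3.3e+38  # constant used when GroupQuota is not positive


def default_sort_key(accounting_group, group_quota, group_resources_in_use):
    """Python transcription of the default GROUP_SORT_EXPR quoted above."""
    if accounting_group == "<none>":
        return NONE_KEY
    if group_quota > 0:
        return group_resources_in_use / group_quota
    return NO_QUOTA_KEY


# Quota of 5 with nothing running: the ground state is now zero.
print(default_sort_key("a", 5, 0))       # 0.0
# Quota of 5 completely filled.
print(default_sort_key("a", 5, 5))       # 1.0
# Quota not yet known (no slot ads in the collector): large constant instead.
print(default_sort_key("a", 0, 0))       # 3.3e+38
# The "<none>" group always gets the largest key, so it sorts last.
print(default_sort_key("<none>", 0, 0))  # 3.4e+38
```

Because 3.3e+38 is smaller than 3.4e+38, a named group with no quota still sorts ahead of "<none>" under this expression.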
Tested on RHEL 5.9/6.4 x i386/x86_64 with condor-7.8.8-0.4.1 and it works. -->VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html