Bug 707081
Summary: | groups are not sorted in starvation order | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Erik Erlandson <eerlands> |
Component: | condor | Assignee: | Erik Erlandson <eerlands> |
Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 2.0 | CC: | claudiol, jneedle, jthomas, matt, mkudlej, trusnak, tstclair |
Target Milestone: | 2.0.1 | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | condor-7.6.2-0.1 | Doc Type: | Bug Fix |
Doc Text: |
Cause:
Ordering of accounting groups by "starvation" (usage/allocated) was left out of the new Hierarchical Accounting Groups feature.
Consequence:
Accounting groups that fall later in the list could be starved by groups before them, due to arbitrary ordering.
Fix:
Sorting of accounting groups by starvation ratio (usage/allocated) prior to negotiation was restored.
Result:
Accounting groups are no longer starved due to arbitrary ordering.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2011-09-07 16:41:20 UTC | Type: | --- |
Bug Blocks: | 723887 |
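The ordering described in the Doc Text can be sketched as follows. This is a minimal illustration, not the negotiator's actual code: the names Group, starvation_ratio and negotiation_order are hypothetical, and the max-float sentinel for a group with nothing allocated is an assumption based on the (0/0) values printed in the logs in this report. The sample values are taken from one negotiation cycle in those logs (a at 3/5, b at 0/5, <none> at 0/0).

```python
import sys
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    usage: float      # slots the group is currently using
    allocated: float  # slots allocated to the group for this cycle


def starvation_ratio(group):
    # Assumption: a group with nothing allocated gets a max-float ratio,
    # so it sorts after every group that has an allocation.
    if group.allocated <= 0:
        return sys.float_info.max
    return group.usage / group.allocated


def negotiation_order(groups):
    # Most-starved group (lowest usage/allocated) negotiates first.
    # sorted() is stable, so groups with equal ratios keep their input order.
    return sorted(groups, key=starvation_ratio)


groups = [Group("<none>", 0, 0), Group("a", 3, 5), Group("b", 0, 5)]
order = [g.name for g in negotiation_order(groups)]
print(order)  # ['b', 'a', '<none>']
```

With a fixed order, group "a" would always be served first; sorting by the ratio lets "b" go first whenever it is using less of its allocation.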
Description
Erik Erlandson
2011-05-23 23:04:27 UTC
Fixed upstream on V7_6-branch:
https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=21979

Repro/test example is described in detail here:
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2186

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause:
Ordering of accounting groups by "starvation" (usage/allocated) was left out of the new Hierarchical Accounting Groups feature.
Consequence:
Accounting groups that fall later in the list could be starved by groups before them, due to arbitrary ordering.
Fix:
Sorting of accounting groups by starvation ratio (usage/allocated) prior to negotiation was restored.
Result:
Accounting groups are no longer starved due to arbitrary ordering.

Repro and test information:
Using this configuration:
$ cat 95.starvation_order.config
NEGOTIATOR_DEBUG = D_FULLDEBUG
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NEGOTIATOR_INTERVAL = 30
SCHEDD_INTERVAL = 15
CLAIM_WORKLIFE = 0
NUM_CPUS = 10
# turn off round robin and multiple allocation rounds
HFS_ROUND_ROBIN_RATE = 100000000
HFS_MAX_ALLOCATION_ROUNDS = 1
GROUP_NAMES = a, b
GROUP_QUOTA_a = 5
GROUP_QUOTA_b = 5
GROUP_AUTOREGROUP = TRUE

Submit this file, which generates jobs for groups "a" and "b" that compete for the same five slots (therefore setting up potential starvation) and that have randomized durations of 15-45 seconds:
$ cat starvation_order.submit
universe = vanilla
cmd = /bin/sleep
# random sleep durations from 15 to 45 seconds
arguments = $$([15 + random(46)])
# set up "a" and "b" to compete for sub-pool:
Requirements = (SlotID <= 5)
+AccountingGroup = "a.user"
queue 50
+AccountingGroup = "b.user"
queue 50

Before restoring starvation order, we see this behavior on negotiation, where group "a" always negotiates first and starves group "b":
$ tail -f NegotiatorLog | grep -e 'group quotas: Group.*allocated=.*usage= ' -e 'Round.*totals:'
05/24/11 22:28:34 group quotas: Group a allocated= 0 usage= 0
05/24/11 22:28:34 group quotas: Group b allocated= 0 usage= 0
05/24/11 22:28:34 Round 1 totals: allocated= 0 usage= 0
05/24/11 22:28:54 group quotas: Group <none> allocated= 0 usage= 0
05/24/11 22:28:54 group quotas: Group a allocated= 5 usage= 5
05/24/11 22:28:54 group quotas: Group b allocated= 5 usage= 0
05/24/11 22:28:54 Round 1 totals: allocated= 10 usage= 5
05/24/11 22:29:24 group quotas: Group <none> allocated= 0 usage= 0
05/24/11 22:29:24 group quotas: Group a allocated= 5 usage= 5
05/24/11 22:29:24 group quotas: Group b allocated= 5 usage= 0
05/24/11 22:29:24 Round 1 totals: allocated= 10 usage= 5
05/24/11 22:29:55 group quotas: Group <none> allocated= 0 usage= 0
05/24/11 22:29:55 group quotas: Group a allocated= 5 usage= 5
05/24/11 22:29:55 group quotas: Group b allocated= 5 usage= 0
05/24/11 22:29:55 Round 1 totals: allocated= 10 usage= 5
05/24/11 22:30:26 group quotas: Group <none> allocated= 0 usage= 0
05/24/11 22:30:26 group quotas: Group a allocated= 5 usage= 5
05/24/11 22:30:26 group quotas: Group b allocated= 5 usage= 0
05/24/11 22:30:26 Round 1 totals: allocated= 10 usage= 5
...

After the patch to restore starvation ordering, the groups negotiate in changing order, according to which is 'most starved', and both groups get a balanced allocation of jobs over time:
$ tail -f NegotiatorLog | grep -e 'group quotas: Group.*allocated=.*usage= ' -e 'Round.*totals:' -e 'starvation='
05/24/11 22:35:35 Group a - starvation= 0 (0/5) prio= 0.5
05/24/11 22:35:36 Group b - starvation= 0 (0/5) prio= 0.5
05/24/11 22:35:36 Group <none> - starvation= 1.79769e+308 (0/0) prio= 0.5
05/24/11 22:35:36 group quotas: Group <none> allocated= 0 usage= 0
05/24/11 22:35:36 group quotas: Group a allocated= 5 usage= 5
05/24/11 22:35:36 group quotas: Group b allocated= 5 usage= 0
05/24/11 22:35:36 Round 1 totals: allocated= 10 usage= 5
05/24/11 22:36:06 Group b - starvation= 0 (0/5) prio= 0.5
05/24/11 22:36:07 Group a - starvation= 0.6 (3/5) prio= 0.501091
05/24/11 22:36:07 Group <none> - starvation= 1.79769e+308 (0/0) prio= 0.5
05/24/11 22:36:07 group quotas: Group <none> allocated= 0 usage= 0
05/24/11 22:36:07 group quotas: Group a allocated= 5 usage= 3
05/24/11 22:36:07 group quotas: Group b allocated= 5 usage= 2
05/24/11 22:36:07 Round 1 totals: allocated= 10 usage= 5
05/24/11 22:36:37 Group a - starvation= 0.2 (1/5) prio= 0.501712
05/24/11 22:36:38 Group b - starvation= 0.4 (2/5) prio= 0.500365
05/24/11 22:36:38 Group <none> - starvation= 1.79769e+308 (0/0) prio= 0.5
05/24/11 22:36:38 group quotas: Group <none> allocated= 0 usage= 0
05/24/11 22:36:38 group quotas: Group a allocated= 5 usage= 3
05/24/11 22:36:38 group quotas: Group b allocated= 5 usage= 2
05/24/11 22:36:38 Round 1 totals: allocated= 10 usage= 5
05/24/11 22:37:08 Group b - starvation= 0.2 (1/5) prio= 0.500738
05/24/11 22:37:09 Group a - starvation= 0.4 (2/5) prio= 0.502325
05/24/11 22:37:09 Group <none> - starvation= 1.79769e+308 (0/0) prio= 0.5
05/24/11 22:37:09 group quotas: Group <none> allocated= 0 usage= 0
05/24/11 22:37:09 group quotas: Group a allocated= 5 usage= 2
05/24/11 22:37:09 group quotas: Group b allocated= 5 usage= 3
05/24/11 22:37:09 Round 1 totals: allocated= 10 usage= 5
...

Reproduced on:
$CondorVersion: 7.6.0 Mar 30 2011 BuildID: RH-7.6.0-0.4.el5 PRE-RELEASE-GRID $
$CondorPlatform: X86_64-Redhat_5.6 $
# tail -f /var/log/condor/NegotiatorLog | grep -e 'group quotas: Group.*allocated=.*usage= ' -e 'Round.*totals:'
07/25/11 17:35:56 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 17:35:56 group quotas: Group a allocated= 5 usage= 5
07/25/11 17:35:56 group quotas: Group b allocated= 5 usage= 0
07/25/11 17:35:56 Round 1 totals: allocated= 10 usage= 5
07/25/11 17:36:27 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 17:36:27 group quotas: Group a allocated= 5 usage= 5
07/25/11 17:36:27 group quotas: Group b allocated= 5 usage= 0
07/25/11 17:36:27 Round 1 totals: allocated= 10 usage= 5
07/25/11 17:36:58 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 17:36:58 group quotas: Group a allocated= 5 usage= 5
07/25/11 17:36:58 group quotas: Group b allocated= 5 usage= 0
07/25/11 17:36:58 Round 1 totals: allocated= 10 usage= 5
07/25/11 17:37:31 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 17:37:31 group quotas: Group a allocated= 5 usage= 5
07/25/11 17:37:31 group quotas: Group b allocated= 5 usage= 0
07/25/11 17:37:31 Round 1 totals: allocated= 10 usage= 5
07/25/11 17:38:02 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 17:38:02 group quotas: Group a allocated= 5 usage= 5
07/25/11 17:38:02 group quotas: Group b allocated= 5 usage= 0
07/25/11 17:38:02 Round 1 totals: allocated= 10 usage= 5
07/25/11 17:38:32 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 17:38:32 group quotas: Group a allocated= 5 usage= 5
07/25/11 17:38:32 group quotas: Group b allocated= 5 usage= 0
07/25/11 17:38:32 Round 1 totals: allocated= 10 usage= 5

Retested over all supported platforms x86,x86_64/RHEL5,RHEL6 with:
condor-7.6.3-0.2
# sudo -u test condor_submit starvation_order.submit && tail -f /var/log/condor/NegotiatorLog | grep -e 'group quotas: Group.*allocated=.*usage= ' -e 'Round.*totals:' -e 'starvation='
Submitting job(s)....................................................................................................
100 job(s) submitted to cluster 364.
07/25/11 16:19:04 Group a - starvation= 3.40282e+38 (0/0) prio= 0.5
07/25/11 16:19:04 Group b - starvation= 3.40282e+38 (0/0) prio= 0.5
07/25/11 16:19:04 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 16:19:04 group quotas: Group a allocated= 0 usage= 0
07/25/11 16:19:04 group quotas: Group b allocated= 0 usage= 0
07/25/11 16:19:04 Round 1 totals: allocated= 0 usage= 0
07/25/11 16:19:24 Group a - starvation= 0 (0/5) prio= 0.5
07/25/11 16:19:25 Group b - starvation= 0 (0/5) prio= 0.5
07/25/11 16:19:26 Group <none> - starvation= 3.40282e+38 (0/0) prio= 0.5
07/25/11 16:19:26 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 16:19:26 group quotas: Group a allocated= 5 usage= 5
07/25/11 16:19:26 group quotas: Group b allocated= 5 usage= 0
07/25/11 16:19:26 Round 1 totals: allocated= 10 usage= 5
07/25/11 16:19:56 Group b - starvation= 0 (0/5) prio= 0.5
07/25/11 16:19:57 Group a - starvation= 0.8 (4/5) prio= 0.501139
07/25/11 16:19:57 Group <none> - starvation= 3.40282e+38 (0/0) prio= 0.5
07/25/11 16:19:57 group quotas: Group <none> allocated= 0 usage= 0
07/25/11 16:19:57 group quotas: Group a allocated= 5 usage= 4
07/25/11 16:19:57 group quotas: Group b allocated= 5 usage= 1
07/25/11 16:19:57 Round 1 totals: allocated= 10 usage= 5
The jobs in the two groups are now balanced over time.
>>> VERIFIED
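The manual check above (reading the 'starvation=' lines and confirming that groups appear in ascending order within each negotiation cycle) could be automated along these lines. This is only a sketch: the regular expression is inferred from the log lines quoted in this report, the helper names are hypothetical, and a cycle is assumed to be a run of consecutive 'starvation=' lines.

```python
import re

# Pattern inferred from lines such as:
#   07/25/11 16:19:57 Group a - starvation= 0.8 (4/5) prio= 0.501139
STARVATION_RE = re.compile(r"Group (\S+) - starvation= (\S+)")


def starvation_cycles(lines):
    """Group consecutive 'starvation=' lines into one list per cycle."""
    cycles, current = [], []
    for line in lines:
        match = STARVATION_RE.search(line)
        if match:
            current.append((match.group(1), float(match.group(2))))
        elif current:
            # Any other line (e.g. 'group quotas:') ends the current cycle.
            cycles.append(current)
            current = []
    if current:
        cycles.append(current)
    return cycles


def is_starvation_ordered(cycle):
    """True if the ratios in a cycle never decrease from one line to the next."""
    ratios = [ratio for _, ratio in cycle]
    return all(x <= y for x, y in zip(ratios, ratios[1:]))


sample = """\
07/25/11 16:19:56 Group b - starvation= 0 (0/5) prio= 0.5
07/25/11 16:19:57 Group a - starvation= 0.8 (4/5) prio= 0.501139
07/25/11 16:19:57 Group <none> - starvation= 3.40282e+38 (0/0) prio= 0.5
07/25/11 16:19:57 group quotas: Group a allocated= 5 usage= 4
""".splitlines()

cycles = starvation_cycles(sample)
print(all(is_starvation_ordered(c) for c in cycles))  # True
```

In practice the input would be the lines of NegotiatorLog rather than an inline sample.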
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-1249.html