Bug 639244
| Summary: | distribution of slots is not correct | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Lubos Trilety <ltrilety> |
| Component: | condor | Assignee: | Erik Erlandson <eerlands> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Lubos Trilety <ltrilety> |
| Severity: | high | Priority: | medium |
| Version: | 1.3 | CC: | matt |
| Target Milestone: | 1.3.2 | Hardware: | All |
| OS: | Linux | Fixed In Version: | condor-7.4.5-0.2 |
| Doc Type: | Bug Fix | Last Closed: | 2011-02-15 13:01:24 UTC |
| Bug Blocks: | 528800 | Attachments: | Negotiator log (attachment 450962) |
Comment from Jon Thomas, moved from bug 629614:

It looks like the jobs for the non-group submitter and the jobs for group b were in the process of ending at the start of the negotiation cycle. ClassAds for these jobs still existed, but the values returned by

```
ad->LookupInteger(ATTR_RUNNING_JOBS, numrunning);
ad->LookupInteger(ATTR_IDLE_JOBS, numidle);
```

were zero, which meant numsubmits was zero. Since a classAd was present, the ad was still inserted into the list for the group:

```
groupArray[0].submitterAds.Insert(ad);
```

Later, this caused a problem with

```
if ( groupArray[i].submitterAds.MyLength() == 0 ) {
```

So I think

```
if ( groupArray[i].submitterAds.MyLength() == 0 ) {
```

should be changed to

```
if ( groupArray[i].numsubmits == 0 ) {
```
We use numsubmits in a number of places, so this should be better. Negotiation is skipped in negotiateWithGroup based upon:

```
num_idle_jobs = 0;
schedd->LookupInteger(ATTR_IDLE_JOBS, num_idle_jobs);
if ( num_idle_jobs < 0 ) {
    num_idle_jobs = 0;
}
```

This is where the "... skipped because no idle jobs" message comes from, so the later stages of negotiation rely on these same attributes. If we make the change, negotiation is skipped higher in the call chain and the group no longer adds to the number of unused slots.
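Below is a minimal, self-contained sketch of the failure mode and of the proposed test. The `Ad` and `GroupEntry` types are illustrative stand-ins, not Condor's real classes; only the `submitterAds`/`numsubmits` logic mirrors the snippets quoted in the comment.

```cpp
// Sketch of the failure mode and the proposed fix. Ad and GroupEntry are
// illustrative stand-ins for Condor's classAd and group-array types; only
// the submitterAds / numsubmits logic mirrors the snippets quoted above.
#include <cstdio>
#include <vector>

struct Ad { int running; int idle; };   // stand-in for a submitter classAd

struct GroupEntry {
    std::vector<Ad> submitterAds;       // every ad seen for the group
    int numsubmits = 0;                 // RunningJobs + IdleJobs summed
};

int main() {
    // A group whose only ad belongs to jobs that just finished: the ad
    // still exists, but it reports 0 running and 0 idle jobs.
    GroupEntry g;
    Ad finished{0, 0};
    g.submitterAds.push_back(finished);
    g.numsubmits += finished.running + finished.idle;   // stays 0

    // Old test: the ad list is non-empty, so the group is NOT skipped and
    // later inflates the "Failed to match ... slots" accounting.
    bool old_skip = g.submitterAds.empty();

    // Proposed test: zero submitted jobs, so the group IS skipped early.
    bool new_skip = (g.numsubmits == 0);

    std::printf("old test skips group: %d, proposed test skips group: %d\n",
                old_skip, new_skip);    // prints 0 and 1
    return 0;
}
```

Run with any C++11 compiler, it prints `old test skips group: 0, proposed test skips group: 1`, matching the diagnosis above: a stale ad defeats the length test but not the numsubmits test.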
Latest devel branch: V7_4-BZ619557-HFS-tree-structure

For the test scenario, the following lines need to be added to the condor configuration:

```
CONCURRENCY_LIMIT_DEFAULT = 2
none_LIMIT = 20
```

Moreover, test.file has to be changed accordingly:

```
$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
concurrency_limits = a
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15
```

In the new version of condor, HFS_MAX_ALLOCATION_ROUNDS also has to be set to a number bigger than 1, otherwise no slots will be used (e.g. HFS_MAX_ALLOCATION_ROUNDS = 3).

Tested with (version):
condor-7.4.5-0.6
Tested on:
RHEL4 i386,x86_64 - passed
RHEL5 i386,x86_64 - passed
>>> VERIFIED
Created attachment 450962 [details]
Negotiator log

Description of problem:
If a group with some subgroups is defined in condor and some jobs have already run under a user from that group (it also happens when the user is not in any group), then after submitting jobs as users from the subgroups, the subgroups can get a wrong number of slots.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.16

How reproducible:
100%

Steps to Reproduce:
1. set condor configuration to:

```
NEGOTIATOR_DEBUG = D_FULLDEBUG | D_MATCH
SCHEDD_INTERVAL = 15
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NUM_CPUS = 20
GROUP_NAMES = a, b, b.q, b.r
GROUP_QUOTA_DYNAMIC_a = 0.51
GROUP_QUOTA_DYNAMIC_b = 0.4
GROUP_QUOTA_DYNAMIC_b.q = 0.5
GROUP_QUOTA_DYNAMIC_b.r = 0.25
GROUP_AUTOREGROUP_b = TRUE
GROUP_AUTOREGROUP_b.q = TRUE
GROUP_AUTOREGROUP_b.r = TRUE
```

2. prepare two simple submit files:

```
$ cat prepare.file
cmd = /bin/sleep
args = 2
+AccountingGroup = "b.user"
queue
+AccountingGroup = "none.user"
queue

$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15
```

3. submit prepare.file:

```
$ condor_submit prepare.file
Submitting job(s)..
2 job(s) submitted to cluster 1.
```

4. wait until all jobs finish, then submit the second file:

```
$ condor_submit test.file
Submitting job(s)........................................
40 job(s) submitted to cluster 2.
```

5. check the used resources:

```
$ condor_q -run -l | grep AccountingGroup | sort | uniq -c
   2 AccountingGroup = "a.user"
  14 AccountingGroup = "b.q.user"
   4 AccountingGroup = "b.r.user"
```

Actual results:
"a.user": 2, "b.q.user": 14, "b.r.user": 4

Expected results:
The result should be "a.user": 2, "b.q.user": 12, "b.r.user": 6

Additional info:
The negotiator log shows that in the first iteration "a.user" gets 2 slots (because of its concurrency limit), "b.q.user" gets 6 slots, and "b.r.user" obtains 4 slots. Only 8 slots are therefore unused, yet the log reports: "Failed to match 12.800001 slots on iteration 1."
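As a sanity check on the expected numbers, here is a back-of-the-envelope calculation. This is a sketch only: it assumes the surplus left by "a.user"'s concurrency limit is redistributed to the autoregroup subgroups in proportion to their dynamic quotas, and does not reproduce Condor's actual negotiation code.

```cpp
// Sketch: expected slot distribution for the configuration above.
// Assumes plain proportional redistribution of a's unused quota to the
// GROUP_AUTOREGROUP subgroups b.q and b.r; NOT Condor's real HFS code.
#include <cstdio>

int main() {
    const double slots = 20.0;                 // NUM_CPUS = 20

    // Static shares from GROUP_QUOTA_DYNAMIC_*:
    double quota_a  = 0.51 * slots;            // 10.2 slots
    double quota_b  = 0.40 * slots;            //  8.0 slots
    double quota_bq = 0.50 * quota_b;          //  4.0 slots for b.q
    double quota_br = 0.25 * quota_b;          //  2.0 slots for b.r
    std::printf("quotas: a=%.1f b=%.1f b.q=%.1f b.r=%.1f\n",
                quota_a, quota_b, quota_bq, quota_br);

    // "a.user" can only run 2 jobs because of its concurrency limit,
    // so 18 of the 20 slots remain for the b subgroups.
    double used_a  = 2.0;
    double surplus = slots - used_a;           // 18 slots

    // With autoregroup, the surplus should be split in proportion to the
    // subgroup quotas, 4 : 2 = 2 : 1.
    double expect_bq = surplus * quota_bq / (quota_bq + quota_br);  // 12
    double expect_br = surplus * quota_br / (quota_bq + quota_br);  //  6
    std::printf("expected: a=%.0f b.q=%.0f b.r=%.0f\n",
                used_a, expect_bq, expect_br);
    return 0;
}
```

The 2:1 quota ratio between b.q (0.5 of b) and b.r (0.25 of b) is why the expected split of the remaining 18 slots is 12:6 rather than the observed 14:4.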