Created attachment 450962 [details]
Negotiator log

Description of problem:
If a group with some subgroups is defined in Condor and some jobs have
already run under a user from such a group (this happens even when the user
is not in any group), then jobs subsequently submitted by users from the
subgroups can get the wrong number of slots.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.16

How reproducible:
100%

Steps to Reproduce:
1. Set the Condor configuration to:

NEGOTIATOR_DEBUG = D_FULLDEBUG | D_MATCH
SCHEDD_INTERVAL = 15
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
NUM_CPUS = 20
GROUP_NAMES = a, b, b.q, b.r
GROUP_QUOTA_DYNAMIC_a = 0.51
GROUP_QUOTA_DYNAMIC_b = 0.4
GROUP_QUOTA_DYNAMIC_b.q = 0.5
GROUP_QUOTA_DYNAMIC_b.r = 0.25
GROUP_AUTOREGROUP_b = TRUE
GROUP_AUTOREGROUP_b.q = TRUE
GROUP_AUTOREGROUP_b.r = TRUE

2. Prepare two simple submit files:

$ cat prepare.file
cmd = /bin/sleep
args = 2
+AccountingGroup = "b.user"
queue
+AccountingGroup = "none.user"
queue

$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15

3. Submit prepare.file:

$ condor_submit prepare.file
Submitting job(s)..
2 job(s) submitted to cluster 1.

4. Wait until all jobs finish, then submit the second file:

$ condor_submit test.file
Submitting job(s)........................................
40 job(s) submitted to cluster 2.

5. Check the resources in use:

$ condor_q -run -l | grep AccountingGroup | sort | uniq -c
      2 AccountingGroup = "a.user"
     14 AccountingGroup = "b.q.user"
      4 AccountingGroup = "b.r.user"

Actual results:
"a.user": 2, "b.q.user": 14, "b.r.user": 4

Expected results:
The result should be "a.user": 2, "b.q.user": 12, "b.r.user": 6

Additional info:
In the negotiator log it can be seen that in the first iteration "a.user"
gets 2 slots (because of its concurrency limit), "b.q.user" gets 6 slots
and "b.r.user" obtains 4 slots.
So only 8 slots are left unused, yet the log reports: "Failed to match
12.800001 slots on iteration 1."
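The expected numbers can be derived from the configured dynamic quotas and
autoregroup: "a.user" is capped at 2 running jobs, and the remaining 18
slots should be shared by b.q and b.r in proportion to their quotas
(0.5 : 0.25, i.e. 2 : 1). The following is only a back-of-the-envelope
sketch of that arithmetic, not Condor's actual negotiator logic:

```python
# Illustrative model of the expected allocation; NOT Condor negotiator code.
NUM_CPUS = 20

# Snapshot of the configured dynamic quotas (fractions of the parent).
quota_b = 0.40 * NUM_CPUS       # 8 slots for group b
quota_bq = 0.50 * quota_b       # 4 slots for b.q
quota_br = 0.25 * quota_b       # 2 slots for b.r

# "a.user" is capped at 2 running jobs by its concurrency limit,
# leaving 18 slots for the autoregroup pass.
slots_a = 2
remaining = NUM_CPUS - slots_a  # 18

# With GROUP_AUTOREGROUP enabled, b.q and b.r should split the leftover
# slots in proportion to their quotas (2 : 1).
slots_bq = round(remaining * quota_bq / (quota_bq + quota_br))  # 12
slots_br = remaining - slots_bq                                 # 6

print(slots_a, slots_bq, slots_br)  # 2 12 6
```

This matches the expected result above; the actual result (14 and 4)
deviates from the quota ratio.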
Comment from Jon Thomas moved from bug 629614:

It looks like what happened was that the jobs for the nongroup and the jobs
for group b were in the process of ending at the start of the negotiation
cycle. There were classAds for these jobs, but the values from

  ad->LookupInteger(ATTR_RUNNING_JOBS, numrunning);
  ad->LookupInteger(ATTR_IDLE_JOBS, numidle);

were zero, which meant numsubmits was zero. Since there was a classAd, the
ad was still inserted into the list for the group:

  groupArray[0].submitterAds.Insert(ad);

Later, this caused a problem with

  if ( groupArray[i].submitterAds.MyLength() == 0 ) {

So I think

  if ( groupArray[i].submitterAds.MyLength() == 0 ) {

should be changed to

  if ( groupArray[i].numsubmits == 0 ) {

We use numsubmits in a number of places, so this should be better.

Negotiation is skipped in negotiateWithGroup based upon:

  num_idle_jobs = 0;
  schedd->LookupInteger(ATTR_IDLE_JOBS, num_idle_jobs);
  if ( num_idle_jobs < 0 ) {
      num_idle_jobs = 0;
  }

This is where we get the "... skipped because no idle jobs" message. The
later stages of negotiation rely on these same attributes, so with this
change negotiation would be skipped higher in the call chain and would not
add to the number of unused slots.
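The difference between the two checks can be illustrated with a toy model
(this is not Condor source; the names mirror the fields discussed above):

```python
# Toy model of why submitterAds.MyLength() and numsubmits disagree when a
# submitter ad with zero running/idle jobs is still inserted into a group.

class GroupEntry:
    def __init__(self):
        self.submitterAds = []  # stands in for the group's classAd list
        self.numsubmits = 0     # running + idle jobs summed over the ads

def insert_ad(group, numrunning, numidle):
    # The ad is inserted whenever it exists, even if both counts are zero.
    group.submitterAds.append({"running": numrunning, "idle": numidle})
    group.numsubmits += numrunning + numidle

g = GroupEntry()
insert_ad(g, 0, 0)  # jobs were ending: counts already dropped to zero

# Old check: the group looks non-empty, so slots are set aside for it.
skipped_old = len(g.submitterAds) == 0  # False -> group is negotiated
# Proposed check: no submits, so the group is skipped early.
skipped_new = g.numsubmits == 0         # True  -> group is skipped

print(skipped_old, skipped_new)  # False True
```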
Latest devel branch: V7_4-BZ619557-HFS-tree-structure
In the test scenario the following lines need to be added to the Condor
configuration:

CONCURRENCY_LIMIT_DEFAULT = 2
none_LIMIT = 20
Moreover, test.file has to be changed accordingly:

$ cat test.file
universe = vanilla
cmd = /bin/sleep
args = 5m
concurrency_limits = a
+AccountingGroup = "a.user"
queue 10
concurrency_limits = none
+AccountingGroup = "b.q.user"
queue 15
+AccountingGroup = "b.r.user"
queue 15

In the new version of Condor, HFS_MAX_ALLOCATION_ROUNDS also has to be set
to a number bigger than 1 (e.g. HFS_MAX_ALLOCATION_ROUNDS = 3), otherwise
no slots will be used.
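For what it's worth, the effect of the added limits can be modeled simply:
a named concurrency limit without an explicit <name>_LIMIT falls back to
CONCURRENCY_LIMIT_DEFAULT, so limit "a" caps the a.user jobs at 2 while
limit "none" allows 20. A minimal sketch of that lookup (not Condor code):

```python
# Toy model of how the revised configuration caps each queue statement.
CONCURRENCY_LIMIT_DEFAULT = 2
limits = {"none": 20}  # from: none_LIMIT = 20

def cap(limit_name):
    # A named limit falls back to the default when not set explicitly.
    return limits.get(limit_name, CONCURRENCY_LIMIT_DEFAULT)

print(cap("a"), cap("none"))  # 2 20
```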
Tested with (version): condor-7.4.5-0.6

Tested on:
RHEL4 i386,x86_64 - passed
RHEL5 i386,x86_64 - passed

>>> VERIFIED