Bug 637140

Summary: slots stay in claim state even when there's no job to run
Product: Red Hat Enterprise MRG
Reporter: Lubos Trilety <ltrilety>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Lubos Trilety <ltrilety>
Severity: medium
Priority: medium
Version: 1.0
CC: iboverma, jneedle, matt
Target Milestone: 2.0
Hardware: All
OS: Linux
Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text: N/A
Last Closed: 2011-06-23 15:38:55 UTC
Bug Blocks: 693778

Description Lubos Trilety 2010-09-24 12:50:20 UTC
Description of problem:
When a user submits jobs under one group and, before they finish, another user submits jobs under a different group, in some cases the first user keeps slots claimed even though that user has no jobs left to run.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.14

How reproducible:
80%

Steps to Reproduce:
1. Set the following configuration:
NUM_CPUS = 100
GROUP_NAMES = A1, A2
GROUP_QUOTA_DYNAMIC_A1 = 0.09
GROUP_QUOTA_DYNAMIC_A2 = 0.01
GROUP_AUTOREGROUP_A1 = TRUE
GROUP_AUTOREGROUP_A2 = TRUE
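
(Not part of the original steps.) With NUM_CPUS = 100, these dynamic quotas work out to 9 slots for A1 (0.09 * 100) and 1 slot for A2 (0.01 * 100); because autoregroup is enabled for both groups, either group can also compete for slots beyond its quota once the other group leaves them idle. A quick sanity check that the settings took effect, assuming a root shell on the scheduler/negotiator host:
# condor_reconfig
# condor_config_val GROUP_NAMES
# condor_config_val GROUP_QUOTA_DYNAMIC_A1
# condor_config_val GROUP_QUOTA_DYNAMIC_A2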

2. Submit 1000 short jobs with group A2
# su condor_user -c 'echo -e "cmd=/bin/sleep\nargs=2\n+AccountingGroup = \"A2.user\"\nqueue 1000" | condor_submit'
Submitting job(s)....................
1000 job(s) submitted to cluster 2.
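
(Side note.) The inline echo above is equivalent to a submit description file, which can be easier to tweak while reproducing; the file name here is only illustrative:
# cat a2_short.sub
cmd = /bin/sleep
args = 2
+AccountingGroup = "A2.user"
queue 1000
# su condor_user -c 'condor_submit a2_short.sub'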

3. Wait some time, until only about 200 jobs are left to run
# condor_q | grep jobs
221 jobs; 121 idle, 100 running, 0 held
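
(Not part of the original steps.) One simple way to poll for this point, assuming watch(1) is available:
# watch -n 10 'condor_q | grep jobs'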

4. Submit 100 long jobs with group A1
# su condor_user -c 'echo -e "cmd=/bin/sleep\nargs=1d\n+AccountingGroup = \"A1.user\"\nqueue 100" | condor_submit'
Submitting job(s)....................
100 job(s) submitted to cluster 3.

5. Wait until all jobs in group A2 finish, then check 'condor_userprio -l', 'condor_status' and 'condor_q'
# condor_q
-- Submitter: hostname : <ip_address:43186> : hostname
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   3.0   condor_user     9/24 08:01   0+00:19:32 R  0   3.9  sleep 1d
...
100 jobs; 43 idle, 57 running, 0 held
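
(Side note.) Because +AccountingGroup ends up in the job ClassAd, the queue can be filtered per group to confirm that only A1 work remains; the attribute name is assumed from the submits above:
# condor_q -constraint 'AccountingGroup == "A1.user"' | grep jobs
# condor_q -constraint 'AccountingGroup == "A2.user"' | grep jobs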

# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot100@hostname LINUX      X86_64 Claimed   Busy     0.000    20  0+00:07:03
slot10@hostname LINUX      X86_64 Claimed   Idle     0.000    20  0+00:07:29
....
             Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX      100     0     100         0       0          0
       Total      100     0     100         0       0          0
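
(Side note.) The suspicious slots are the ones that are Claimed but Idle; a sketch of how to list just those:
# condor_status -constraint 'State == "Claimed" && Activity == "Idle"' -format "%s\n" Name
Running 'condor_status -l' on one of the listed slots should then show the claim details (for example RemoteUser and ClientMachine, where defined).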

# condor_userprio -l
LastUpdate = 1285330225
Name1 = "A2.user.eng.bos.redhat.com"
Priority1 = 0.986168
ResourcesUsed1 = 43
...
Name2 = "A1"
Priority2 = 0.718069
ResourcesUsed2 = 57
...
Name3 = "A2"
Priority3 = 0.986168
ResourcesUsed3 = 43
...
Name4 = "A1.user.eng.bos.redhat.com"
Priority4 = 0.718069
ResourcesUsed4 = 57
...
NumSubmittors = 4
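
(Side note.) The key point in this output is that A2 still shows ResourcesUsed = 43 even though no A2 jobs remain in the queue, i.e. the stale claims are still charged to group A2. A compact way to pull out just these fields:
# condor_userprio -l | grep -E '^(Name|ResourcesUsed)'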

6. After a minute check it again; the results are still the same

7. run 'condor_rm -all'
# condor_rm -all
All jobs marked for removal.
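
(Not part of the original steps.) At this point the queue should report 0 jobs, which can be confirmed before looking at the slots:
# condor_q | grep jobs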

8. see 'condor_status'
# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot100@hostname LINUX      X86_64 Claimed   Busy     0.000    20  0+00:07:03
slot10@hostname LINUX      X86_64 Claimed   Idle     0.000    20  0+00:07:29
....
             Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX      100     0      43        57       0          0
       Total      100     0      43        57       0          0
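
(Side note.) To see how long the stale claims linger (per the results below they drop after 5-30 minutes), the summary can simply be polled, assuming watch(1):
# watch -n 60 'condor_status -total'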


Actual results:
Some slots remain claimed by A2.user; they only go to the unclaimed state after some time (5-30 minutes).
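
The 5-30 minute delay is in the range of the usual claim-reuse and negotiation intervals; checking the related settings may help correlate it (an assumption about which knobs matter, not a finding from this report):
# condor_config_val CLAIM_WORKLIFE
# condor_config_val NEGOTIATOR_INTERVAL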

Expected results:
Slots should be released when the user who claims them has no more jobs to run.

Additional info:
Something like the following can be found in the MatchLog:
Rejected 3.62 A1.user@hostname <ip_address:46914>: insufficient priority
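
(Side note.) These rejections can be pulled straight out of the negotiator's match log; the exact file location depends on the configured LOG directory:
# grep 'insufficient priority' "$(condor_config_val LOG)/MatchLog"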

Comment 2 Matthew Farrellee 2011-02-28 19:43:50 UTC
Please verify this is still a problem with condor 7.5.6-0.1

Comment 3 Lubos Trilety 2011-03-04 14:18:51 UTC
Will retest during validation cycle

Comment 4 Matthew Farrellee 2011-04-27 20:22:15 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
N/A

Comment 6 Lubos Trilety 2011-05-02 15:21:56 UTC
Tested on:
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

>>> VERIFIED

Comment 7 errata-xmlrpc 2011-06-23 15:38:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html