Bug 637140 - slots stay in claim state even when there's no job to run
Summary: slots stay in claim state even when there's no job to run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 2.0
Assignee: Matthew Farrellee
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 693778
 
Reported: 2010-09-24 12:50 UTC by Lubos Trilety
Modified: 2011-06-23 15:38 UTC
CC: 3 users

Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text:
N/A
Clone Of:
Environment:
Last Closed: 2011-06-23 15:38:55 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHEA-2011:0889
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Grid 2.0 Release
Last Updated: 2011-06-23 15:35:53 UTC

Description Lubos Trilety 2010-09-24 12:50:20 UTC
Description of problem:
When a user submits jobs under one accounting group and, before they finish, another user submits jobs under a different group, in some cases the first user's slots stay claimed even though that user has no jobs left to run.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.14

How reproducible:
80%

Steps to Reproduce:
1. set configuration
NUM_CPUS = 100
GROUP_NAMES = A1, A2
GROUP_QUOTA_DYNAMIC_A1 = 0.09
GROUP_QUOTA_DYNAMIC_A2 = 0.01
GROUP_AUTOREGROUP_A1 = TRUE
GROUP_AUTOREGROUP_A2 = TRUE
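
For reference, with NUM_CPUS = 100 these dynamic quotas resolve to 9 slots for group A1 (0.09 * 100) and 1 slot for group A2 (0.01 * 100); GROUP_AUTOREGROUP then lets both groups compete for the surplus slots once their quotas are used up. A quick sanity check of the effective values (a sketch; assumes the daemons have re-read the configuration, e.g. after condor_reconfig):
# condor_config_val GROUP_NAMES
# condor_config_val GROUP_QUOTA_DYNAMIC_A1 GROUP_QUOTA_DYNAMIC_A2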

2. Submit 1000 short jobs with group A2
# su condor_user -c 'echo -e "cmd=/bin/sleep\nargs=2\n+AccountingGroup = \"A2.user\"\nqueue 1000" | condor_submit'
Submitting job(s)....................
1000 job(s) submitted to cluster 2.
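
For clarity, the one-liner above just pipes an inline submit description into condor_submit. The equivalent submit description file would be (a sketch; the file name sleep_a2.sub is invented for illustration):

cmd = /bin/sleep
args = 2
+AccountingGroup = "A2.user"
queue 1000

and would be submitted with:
# su condor_user -c 'condor_submit sleep_a2.sub'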

3. wait some time, until only about 200 jobs remain in the queue
# condor_q | grep jobs
221 jobs; 121 idle, 100 running, 0 held
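
While waiting, the queue can be polled with something like (a sketch; the 10-second interval is arbitrary):
# watch -n 10 'condor_q | grep jobs'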

4. Submit 100 long jobs with group A1
# su condor_user -c 'echo -e "cmd=/bin/sleep\nargs=1d\n+AccountingGroup = \"A1.user\"\nqueue 100" | condor_submit'
Submitting job(s)....................
100 job(s) submitted to cluster 3.

5. wait until all jobs in group A2 finish, then check 'condor_userprio -l', 'condor_status' and 'condor_q'
# condor_q
-- Submitter: hostname : <ip_address:43186> : hostname
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   3.0   condor_user     9/24 08:01   0+00:19:32 R  0   3.9  sleep 1d
...
100 jobs; 43 idle, 57 running, 0 held

# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot100@hostname LINUX      X86_64 Claimed   Busy     0.000    20  0+00:07:03
slot10@hostname LINUX      X86_64 Claimed   Idle     0.000    20  0+00:07:29
....
             Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX      100     0     100         0       0          0
       Total      100     0     100         0       0          0
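
The slots that hold a claim but have nothing to do can be isolated directly; State and Activity are standard startd ClassAd attributes (a sketch):
# condor_status -constraint 'State == "Claimed" && Activity == "Idle"' -total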

# condor_userprio -l
LastUpdate = 1285330225
Name1 = "A2.user.eng.bos.redhat.com"
Priority1 = 0.986168
ResourcesUsed1 = 43
...
Name2 = "A1"
Priority2 = 0.718069
ResourcesUsed2 = 57
...
Name3 = "A2"
Priority3 = 0.986168
ResourcesUsed3 = 43
...
Name4 = "A1.user.eng.bos.redhat.com"
Priority4 = 0.718069
ResourcesUsed4 = 57
...
NumSubmittors = 4

6. after a minute check again; the results are still the same

7. run 'condor_rm -all'
# condor_rm -all
All jobs marked for removal.

8. see 'condor_status'
# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot100@hostname LINUX      X86_64 Claimed   Busy     0.000    20  0+00:07:03
slot10@hostname LINUX      X86_64 Claimed   Idle     0.000    20  0+00:07:29
....
             Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX      100     0      43        57       0          0
       Total      100     0      43        57       0          0
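
To measure how long the stale claims linger (see "Actual results" below), a simple polling loop works; this is a sketch and the 30-second interval is arbitrary:
# while condor_status -constraint 'State == "Claimed"' | grep -q '^slot'; do date; sleep 30; done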


Actual results:
Some slots remain claimed by A2.user; they only return to the unclaimed state after 5-30 minutes.

Expected results:
Slots should be released as soon as the user claiming them has no more jobs to run.

Additional info:
The MatchLog contains entries like:
Rejected 3.62 A1.user@hostname <ip_address:46914>: insufficient priority
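
Such rejections can also be pulled straight from the log; /var/log/condor is assumed here as the log directory, which is typical for the Red Hat packaging:
# grep 'insufficient priority' /var/log/condor/MatchLog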

Comment 2 Matthew Farrellee 2011-02-28 19:43:50 UTC
Please verify this is still a problem with condor 7.5.6-0.1

Comment 3 Lubos Trilety 2011-03-04 14:18:51 UTC
Will retest during validation cycle

Comment 4 Matthew Farrellee 2011-04-27 20:22:15 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
N/A

Comment 6 Lubos Trilety 2011-05-02 15:21:56 UTC
Tested on:
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

>>> VERIFIED

Comment 7 errata-xmlrpc 2011-06-23 15:38:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

