Bug 505923 - dedicated scheduler may be inappropriately reusing claims
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.1.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-06-14 18:27 UTC by Matthew Farrellee
Modified: 2010-10-14 16:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, claims were not released after a parallel universe job finished, because claim re-use did not handle concurrency limits properly. With this update, all concurrency limits of jobs can be checked.
Clone Of:
Environment:
Last Closed: 2010-10-14 16:11:35 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 15:56:44 UTC

Description Matthew Farrellee 2009-06-14 18:27:30 UTC
Description of problem:

From condor-users -

	My test pool: 1 dedicated schedd, 2 startds.
	I set a concurrency limit in the negotiator config: "license1_LIMIT=2".
	Then I submitted 3 parallel jobs, each requesting 2 slots ("machine_count = 2"):
		first: concurrency_limits=license1:3
		second: concurrency_limits=license1:2
		third: concurrency_limits=license1:3

	The first job could not run because it exceeded the concurrency limit, so I removed it from the schedd. The second job then started to run, but after the second job completed, the third job started running!
	With NEGOTIATOR_DEBUG set to D_FULLDEBUG, the logs showed something wrong: after the second job completed, the SCHEDD did not communicate with the NEGOTIATOR, and the concurrency limits of the jobs could not be checked.
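
	For reference, the three submissions described above can be written out as submit description files. The following is a minimal sketch that renders them from Python. The helper name `render_submit`, the executable (`/bin/sleep`) and its argument are assumptions and do not come from the original report; `universe`, `machine_count`, `concurrency_limits` and `queue` are ordinary submit commands.

```python
# Hypothetical helper: renders submit descriptions for the repro case.
# The executable and its argument are placeholders, not from the report.

def render_submit(limit_spec, machine_count=2,
                  executable="/bin/sleep", arguments="60"):
    """Return a parallel universe submit description as text."""
    lines = [
        "universe = parallel",
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"machine_count = {machine_count}",
        f"concurrency_limits = {limit_spec}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# The three jobs from the report, in submission order.
REPRO_LIMITS = ["license1:3", "license1:2", "license1:3"]
REPRO_SUBMITS = [render_submit(spec) for spec in REPRO_LIMITS]

if __name__ == "__main__":
    for text in REPRO_SUBMITS:
        print(text)
```

	Each rendered text could be saved to a file and passed to condor_submit, with "license1_LIMIT=2" set in the negotiator configuration as described above.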


Version-Release number of selected component (if applicable):

condor 7.2

Comment 1 Erik Erlandson 2010-05-21 22:07:09 UTC
Claims do not appear to be released after a parallel universe job finishes.  After my parallel job completed, my slots remained in the 'claimed' state.   These claims blocked execution of a non-parallel job, but the slots were reusable by another parallel job.

[eje@rorschach ~]$ condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1              LINUX      X86_64 Claimed   Idle     0.360   951  0+00:01:13
slot2              LINUX      X86_64 Claimed   Idle     0.000   951  0+00:01:14

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     2     0       2         0       0          0        0

               Total     2     0       2         0       0          0        0
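
A check for lingering claims like the one above can be scripted. The following is only a sketch: the function name `slot_states` is hypothetical, and it assumes that each slot line begins with a name starting with "slot" and that the State column is the fourth whitespace-separated field, as in the output shown here.

```python
# Sketch: count slots per state from condor_status text output.
# Assumes slot lines start with "slot" and State is the fourth field.
from collections import Counter


def slot_states(status_text):
    """Return a Counter mapping slot state to number of slots."""
    states = Counter()
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0].startswith("slot"):
            states[fields[3]] += 1
    return states


SAMPLE = """\
slot1 LINUX      X86_64 Claimed   Idle     0.360   951  0+00:01:13
slot2 LINUX      X86_64 Claimed   Idle     0.000   951  0+00:01:14
"""

if __name__ == "__main__":
    print(slot_states(SAMPLE)["Claimed"])
```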

Comment 2 Erik Erlandson 2010-05-25 22:47:43 UTC
(In reply to comment #1)
> Claims do not appear to be released after a parallel universe job finishes. 
> After my parallel job completed, my slots remained in 'claimed' state.

This behavior is intended, and governed by config parameter UNUSED_CLAIM_TIMEOUT.

The problem seems to be that claim re-use is not properly handling concurrency limits.  In the repro example, the third job should not be eligible, since it exceeds the concurrency limit.
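
To illustrate the eligibility decision, here is a toy model in Python. It is not the Condor source: the function names (`parse_limits`, `fits_limits`, `can_reuse_claim`) and the `check_limits` switch are hypothetical, and treating limits with no configured cap as unconstrained is a simplifying assumption of this sketch.

```python
# Toy model of the eligibility decision; not the actual Condor code.

def parse_limits(spec):
    """Parse 'name:count,other' into {'name': count, 'other': 1}."""
    wanted = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, count = item.partition(":")
        wanted[name.strip()] = int(count) if count else 1
    return wanted


def fits_limits(spec, configured, in_use):
    """True if the requested units fit under every configured limit."""
    for name, count in parse_limits(spec).items():
        if name not in configured:
            continue  # assumption: unconfigured limits are not enforced here
        if in_use.get(name, 0) + count > configured[name]:
            return False
    return True


def can_reuse_claim(spec, configured, in_use, check_limits):
    """Decide whether an idle claim may be handed to the next job.

    check_limits=False mimics the reported behaviour (claim reused without
    consulting limits); check_limits=True mimics the intended behaviour.
    """
    if not check_limits:
        return True
    return fits_limits(spec, configured, in_use)


CONFIGURED = {"license1": 2}  # license1_LIMIT=2
IN_USE = {}                   # the second job has finished; nothing is held

if __name__ == "__main__":
    # Third job (license1:3), as reported and as intended.
    print(can_reuse_claim("license1:3", CONFIGURED, IN_USE, check_limits=False))
    print(can_reuse_claim("license1:3", CONFIGURED, IN_USE, check_limits=True))
```

In this model the third job is accepted when the limit check is skipped and rejected when it is applied, which matches the symptom in the repro example.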

Comment 3 Erik Erlandson 2010-06-11 20:54:32 UTC
Pushed a fix to branch: V7_4-BZ505923-Ded-Schedd-Concurrency-Limits-branch

Comment 4 Lubos Trilety 2010-07-28 13:32:39 UTC
Tested with (version):
condor-7.4.4-0.4

Test Scenario:
  Test Pool: 1 Dedicated Schedd,1 Startd
  Set a concurrency limit in Negotiator Config "license1_LIMIT=2".
1. Submit 3 parallel jobs, each job requests 2 slots "machine_count = 1":
   first: concurrency_limits=license1:3
   second: concurrency_limits=license1:2
   third: concurrency_limits=license1:3
2. The first job could not run because it exceeded the concurrency limit.
3. Remove the first job. The second job started to run.
4. After the second job completed, check that the third job could not run because it exceeded the concurrency limit (see logs).


Tested on:
RHEL4 x86_64  - passed
RHEL4 i386    - passed
RHEL5 x86_64  - passed
RHEL5 i386    - passed

>>> VERIFIED

Comment 5 Florian Nadge 2010-10-07 17:06:08 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, claims were not released after a parallel universe job finished, because claim re-use did not handle concurrency limits properly. With this update, all concurrency limits of jobs can be checked.

Comment 7 errata-xmlrpc 2010-10-14 16:11:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

