Bug 505923
| Summary: | dedicated scheduler may be inappropriately reusing claims | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> |
| Component: | condor | Assignee: | Erik Erlandson <eerlands> |
| Status: | CLOSED ERRATA | QA Contact: | Lubos Trilety <ltrilety> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 1.1.1 | CC: | fnadge, iboverma, ltrilety |
| Target Milestone: | 1.3 | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Last Closed: | 2010-10-14 16:11:35 UTC | Type: | --- |
| Regression: | --- | | |
| Doc Text: | Previously, claims were not released after a parallel universe job finished, because claim re-use did not properly handle concurrency limits. With this update, all of a job's concurrency limits are checked before a claim is re-used. | | |
Description
Matthew Farrellee
2009-06-14 18:27:30 UTC
Claims do not appear to be released after a parallel universe job finishes. After my parallel job completed, my slots remained in the 'Claimed' state. These claims blocked execution of non-parallel jobs, but the slots were reusable by another parallel job.

```
[eje@rorschach ~]$ condor_status
Name    OpSys  Arch   State   Activity LoadAv Mem  ActvtyTime

slot1   LINUX  X86_64 Claimed Idle     0.360  951  0+00:01:13
slot2   LINUX  X86_64 Claimed Idle     0.000  951  0+00:01:14

              Total Owner Claimed Unclaimed Matched Preempting Backfill
 X86_64/LINUX     2     0       2         0       0          0        0
        Total     2     0       2         0       0          0        0
```

(In reply to comment #1)
> Claims do not appear to be released after a parallel universe job finishes.
> After my parallel job completed, my slots remained in 'claimed' state.

This behavior is intended and is governed by the config parameter UNUSED_CLAIM_TIMEOUT. The problem seems to be that claim re-use is not properly handling concurrency limits. In the reproduction example, the third job should not be eligible, since it exceeds its concurrency limits.

Pushed a fix to branch: V7_4-BZ505923-Ded-Schedd-Concurrency-Limits-branch

Tested with (version):
condor-7.4.4-0.4
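The behavior the fix targets can be illustrated with a small, self-contained Python sketch. This is not HTCondor source code: `parse_limits`, `can_reuse_claim`, and the data structures are hypothetical names for the check described above, namely that a claim should only be re-used when every concurrency limit the new job requests still fits under its configured cap.

```python
# Hypothetical sketch (not HTCondor source) of the concurrency-limit check
# that must happen before an unused claim is re-used for a new job.

def parse_limits(limits_str):
    """Parse a concurrency_limits string such as 'license1:3,license2:1'
    into a dict of {limit_name: count}. A bare name counts as 1."""
    limits = {}
    for item in limits_str.split(","):
        if not item:
            continue
        name, _, count = item.partition(":")
        limits[name.strip()] = int(count) if count else 1
    return limits

def can_reuse_claim(job_limits_str, in_use, configured_caps):
    """Return True only if every limit the job requests fits under its
    configured cap, given the counts already in use pool-wide."""
    for name, needed in parse_limits(job_limits_str).items():
        cap = configured_caps.get(name, float("inf"))
        if in_use.get(name, 0) + needed > cap:
            return False
    return True

# Reproduction from the test scenario below: license1_LIMIT = 2.
caps = {"license1": 2}
print(can_reuse_claim("license1:3", {}, caps))  # first/third job: False
print(can_reuse_claim("license1:2", {}, caps))  # second job: True
```

With the bug, the dedicated schedd effectively skipped this check when handing an already-claimed slot to the next queued parallel job, so a job whose limits were over the cap could still run on re-used claims.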
Test Scenario:
Test pool: 1 dedicated schedd, 1 startd.
Set a concurrency limit in the negotiator config: "license1_LIMIT = 2".
1. Submit 3 parallel jobs, each job requests 2 slots "machine_count = 1":
first: concurrency_limits=license1:3
second: concurrency_limits=license1:2
third: concurrency_limits=license1:3
2. The first job could not run because it would exceed the concurrency limit.
3. Remove the first job. The second job started to run.
4. After the second job completed, check that the third job could not run, again because it would exceed the concurrency limit (see logs).
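For reference, the scenario above could be driven by files along these lines. This is a sketch: the negotiator setting and the machine_count/concurrency_limits values come from the scenario, while the executable and its argument are placeholder assumptions, not values from the original test.

```
# Negotiator configuration (assumption: appended to the pool's condor_config):
license1_LIMIT = 2

# Submit description for the first job; per the scenario, the second and
# third jobs differ only in the concurrency_limits line.
universe = parallel
executable = /bin/sleep    # placeholder executable
arguments = 60
machine_count = 1
concurrency_limits = license1:3
queue
```

With the fix applied, only the job whose total requested count fits under license1_LIMIT (here, the one requesting license1:2) should be allowed to run, whether it is matched fresh or handed a re-used claim.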
Tested on:
RHEL4 x86_64 - passed
RHEL4 i386 - passed
RHEL5 x86_64 - passed
RHEL5 i386 - passed
>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, claims were not released after a parallel universe job finished, because claim re-use did not properly handle concurrency limits. With this update, all of a job's concurrency limits are checked before a claim is re-used.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html