Bug 711993 - RFE: check for deferred jobs that missed their window in the scheduler
Keywords:
Status: CLOSED DUPLICATE of bug 712026
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: 2.1
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On: 712026
Blocks:
Reported: 2011-06-09 08:39 UTC by Lubos Trilety
Modified: 2012-02-07 09:50 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-09-23 13:33:11 UTC
Target Upstream Version:
Embargoed:



Description Lubos Trilety 2011-06-09 08:39:46 UTC
Description of problem:
There are certain conditions where a deferred job that misses its window can fail to enter the 'held' state. For example, it may fail to be matched with a slot in time, or (more rarely) the startd might fail after matching.

These conditions can arise because deferred jobs are only checked for holding in the startd. A similar check could be added to the scheduler: the count() routine could include a scan for jobs with deferral, and if they have missed their window and are still idle, the scheduler can put them on hold.
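
The proposed scan can be sketched roughly as follows. This is only an illustration of the logic in Python, not the scheduler's actual code: the Job class and the function names are hypothetical, and the only details taken from Condor are the JobStatus values (1 = idle, 5 = held) and the idea of a deferral time plus a deferral window.

```python
from dataclasses import dataclass
from typing import List, Optional

IDLE = 1  # JobStatus value for an idle job
HELD = 5  # JobStatus value for a held job


@dataclass
class Job:
    """Hypothetical stand-in for a job ad; not a real Condor type."""
    job_id: str
    status: int
    deferral_time: Optional[int] = None  # epoch seconds; None means no deferral
    deferral_window: int = 0             # seconds after deferral_time the job may still start


def missed_window(job: Job, now: int) -> bool:
    """True if the job uses deferral, is still idle, and its window has passed."""
    if job.deferral_time is None:
        return False
    return job.status == IDLE and now > job.deferral_time + job.deferral_window


def hold_missed_deferrals(jobs: List[Job], now: int) -> List[str]:
    """Put every idle job that missed its window on hold; return the affected ids."""
    held = []
    for job in jobs:
        if missed_window(job, now):
            job.status = HELD
            held.append(job.job_id)
    return held
```

With a check like this running periodically in the scheduler, a job that never reaches a startd would still end up held once its window expires.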

Adding this check would increase the consistency of the behavior for deferred jobs missing their window.


Version-Release number of selected component (if applicable):
condor-7.6.1-0.10

How reproducible:
100%

Steps to Reproduce:
1. Submit a job with deferral_time set, e.g.:
# job.submit
universe = vanilla
cmd = /bin/sleep
args = 1m
deferral_time = (CurrentTime + 60)
deferral_window = 30
queue 1

$ condor_submit job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.

2. While waiting for the deferral time, stop the startd daemon
# condor_off -subsystem startd
Sent "Kill-Daemon" command for "startd" to local master

3. Wait until deferral time + deferral window has passed, then check the job status
$ condor_q
-- Submitter: hostname : <IP:41815> : hostname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   condor_user     6/8  17:08   0+00:00:08 I  0   2.0  sleep 1m


Actual results:
The job status does not change; the job stays idle (I).

Expected results:
After deferral time + deferral window has passed, the job status changes to held.

Additional info:
An upstream RFE exists.
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2219

Comment 1 Timothy St. Clair 2011-09-21 19:04:09 UTC
At present, the fix for https://bugzilla.redhat.com/show_bug.cgi?id=712026 would invalidate this bug, because a job which is vacated (for whatever reason) will notify the shadow and try to re-run the job.

Pushing this to the schedd user_policy is not a good solution in this case (or in general).

Comment 2 Lubos Trilety 2011-09-23 06:52:30 UTC
(In reply to comment #1)
> At present fix for https://bugzilla.redhat.com/show_bug.cgi?id=712026 would
> invalidate this bug.  
> 
It seems very probable.
We will validate that on the next release.

