Bug 711993

Summary: RFE: check for deferred jobs that missed their window in the scheduler
Product: Red Hat Enterprise MRG
Reporter: Lubos Trilety <ltrilety>
Component: condor
Assignee: Timothy St. Clair <tstclair>
Status: CLOSED DUPLICATE
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium
Docs Contact:
Priority: low
Version: 1.0
CC: dahorak, matt, mkudlej, tstclair
Target Milestone: 2.1
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-09-23 13:33:11 UTC
Type: ---
Bug Depends On: 712026    
Bug Blocks:    

Description Lubos Trilety 2011-06-09 08:39:46 UTC
Description of problem:
There are certain conditions where a deferred job that misses its window can fail to enter the 'held' state. For example, it may fail to be matched with a slot in time, or (more rarely) the startd might fail after matching.

These conditions can arise because deferred jobs are only checked for holding in the startd. A similar check could be added to the scheduler: the count() routine could include a scan for jobs with deferral, and if they have missed their window and are still idle, the scheduler can put them on hold.

Adding this check would increase the consistency of the behavior for deferred jobs missing their window.
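
The proposed scheduler-side check could look roughly like the sketch below. It is an illustration in Python, not the schedd's actual code: the function name hold_missed_deferrals, the dict-based job records, and the hold reason text are hypothetical. Only the attribute names (JobStatus, DeferralTime, DeferralWindow, HoldReason) and the status codes 1 (idle) and 5 (held) are taken from Condor job ads. The sketch also assumes DeferralTime has already been evaluated to an integer epoch time.

```python
IDLE, HELD = 1, 5  # Condor JobStatus codes for idle and held jobs


def hold_missed_deferrals(jobs, now):
    """Put idle jobs whose deferral window has passed on hold.

    jobs: list of dicts standing in for job ads (hypothetical layout).
    now:  current time in seconds since the epoch.
    Returns the list of jobs that were put on hold by this scan.
    """
    held = []
    for job in jobs:
        deferral_time = job.get("DeferralTime")
        if deferral_time is None:
            continue  # not a deferred job
        if job.get("JobStatus") != IDLE:
            continue  # only jobs that are still idle are candidates
        window = job.get("DeferralWindow", 0)
        if now > deferral_time + window:
            # The job can no longer start inside its window.
            job["JobStatus"] = HELD
            job["HoldReason"] = "Missed deferred execution window"  # hypothetical text
            held.append(job)
    return held
```

With the values from the reproduction steps below (deferral time 60 seconds after submission, window of 30 seconds), such a scan would hold the job the first time it runs more than 90 seconds after submission.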


Version-Release number of selected component (if applicable):
condor-7.6.1-0.10

How reproducible:
100%

Steps to Reproduce:
1. Submit a job with deferral_time set, e.g.:
# job.submit
universe = vanilla
cmd = /bin/sleep
args = 1m
deferral_time = (CurrentTime + 60)
deferral_window = 30
queue 1

$ condor_submit job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.

2. While the job is waiting for its deferral time, stop the startd daemon:
# condor_off -subsystem startd
Sent "Kill-Daemon" command for "startd" to local master

3. Wait until deferral time + deferral window has passed, then check the job status:
$ condor_q
-- Submitter: hostname : <IP:41815> : hostname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   condor_user     6/8  17:08   0+00:00:08 I  0   2.0  sleep 1m


Actual results:
The job status does not change; the job stays idle (I).

Expected results:
After deferral time + deferral window has passed, the job status changes to held.

Additional info:
An upstream RFE exists.
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2219

Comment 1 Timothy St. Clair 2011-09-21 19:04:09 UTC
At present, the fix for https://bugzilla.redhat.com/show_bug.cgi?id=712026 would invalidate this bug, since a job which is vacated (for whatever reason) will notify the shadow and try to re-run the job.

Pushing this to the schedd user_policy is not a good solution in this case (or in general).

Comment 2 Lubos Trilety 2011-09-23 06:52:30 UTC
(In reply to comment #1)
> At present fix for https://bugzilla.redhat.com/show_bug.cgi?id=712026 would
> invalidate this bug.  
> 
That seems very probable.
We will validate it on the next release.