711993 – RFE: check for deferred jobs that missed their window in the scheduler

Summary: RFE: check for deferred jobs that missed their window in the scheduler

Keywords:
Status:	CLOSED DUPLICATE of bug 712026
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	2.1
Target Release:	---
Assignee:	Timothy St. Clair
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:	712026
Blocks:

Reported:	2011-06-09 08:39 UTC by Lubos Trilety
Modified:	2012-02-07 09:50 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-09-23 13:33:11 UTC
Target Upstream Version:
Embargoed:

Description Lubos Trilety 2011-06-09 08:39:46 UTC

Description of problem:
There are certain conditions where a deferred job that misses its window can fail to enter the 'held' state. For example, it may fail to be matched with a slot in time, or (more rarely) the startd might fail after matching.

These conditions can arise because deferred jobs are only checked for holding in the startd. A similar check could be added to the scheduler: the count() routine could include a scan for jobs with deferral, and if they have missed their window and are still idle, the scheduler can put them on hold.

Adding this check would increase the consistency of the behavior for deferred jobs missing their window.
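
The proposed scheduler-side check could look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual schedd code: the Job class, missed_window() and hold_missed_jobs() are hypothetical names, times are plain epoch seconds, and a missing deferral window is assumed to mean 0. The JobStatus codes (1 = idle, 2 = running, 5 = held) are the standard Condor values.

```python
from dataclasses import dataclass
from typing import List, Optional

# Standard Condor JobStatus codes used by this sketch.
IDLE, RUNNING, HELD = 1, 2, 5


@dataclass
class Job:
    """Hypothetical stand-in for a job ad in the scheduler's queue."""
    job_id: str
    status: int
    deferral_time: Optional[int] = None  # epoch seconds; None = no deferral
    deferral_window: int = 0             # seconds; assumed to default to 0


def missed_window(job: Job, now: int) -> bool:
    """True if the job is deferred, still idle, and past its window."""
    if job.deferral_time is None:
        return False
    deadline = job.deferral_time + job.deferral_window
    return job.status == IDLE and now > deadline


def hold_missed_jobs(jobs: List[Job], now: int) -> List[str]:
    """Scan the queue and put on hold every job that missed its window.

    Returns the ids of the jobs that were moved to the held state.
    """
    held = []
    for job in jobs:
        if missed_window(job, now):
            job.status = HELD
            held.append(job.job_id)
    return held
```

With the values from the reproduction steps in this report (deferral time 60 seconds after submission, window of 30 seconds), such a scan would hold the job on the first pass made more than 90 seconds after submission, provided the job is still idle at that point.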

Version-Release number of selected component (if applicable):
condor-7.6.1-0.10

How reproducible:
100%

Steps to Reproduce:
1. Submit a job with deferral_time set, e.g.:
# job.submit
universe = vanilla
cmd = /bin/sleep
args = 1m
deferral_time = (CurrentTime + 60)
deferral_window = 30
queue 1

$ condor_submit job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.

2. While waiting for the deferral time, stop the startd daemon
# condor_off -subsystem startd
Sent "Kill-Daemon" command for "startd" to local master

3. Wait until deferral time + deferral window has passed, then check the job status
$ condor_q
-- Submitter: hostname : <IP:41815> : hostname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   condor_user     6/8  17:08   0+00:00:08 I  0   2.0  sleep 1m

Actual results:
The job status does not change; the job stays idle (I).

Expected results:
Once deferral time + deferral window has passed, the job status changes to held (H).

Additional info:
An upstream RFE exists.
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2219

Comment 1 Timothy St. Clair 2011-09-21 19:04:09 UTC

At present, the fix for https://bugzilla.redhat.com/show_bug.cgi?id=712026 would invalidate this bug, because a job that is vacated (for whatever reason) will notify the shadow and try to re-run the job.

Pushing this to the schedd user_policy is not a good solution in this case (or in general).

Comment 2 Lubos Trilety 2011-09-23 06:52:30 UTC

(In reply to comment #1)
> At present fix for https://bugzilla.redhat.com/show_bug.cgi?id=712026 would
> invalidate this bug.  
> 
It seems very probable.
We will validate that on the next release.
