Red Hat Bugzilla – Bug 475865
Periodic* race in JobRouter (and elsewhere)
Last modified: 2009-04-21 12:18:14 EDT
Specifically for the JobRouter, the configuration contains:
set_PeriodicRemove = JobStatus == 5 || \
(JobStatus == 1 && (CurrentTime - QDate) > 3600*6); \
JobStatus 5 is the Hold state.
When a job is submitted to Condor it is briefly put in the Hold state while its data is spooled. That brief hold can last long enough, or simply be timed poorly enough, for the Periodic expressions to be evaluated. In this example, PeriodicRemove evaluates to true because JobStatus == 5, and the job is removed before it is completely spooled. Oops!
A temporary workaround, which is fragile, is to test (JobStatus == 5 && HoldReason =!= "Spooling input data files").
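For illustration, the workaround folded into the configuration quoted above might look like the following. This is only a sketch: it assumes the spooling hold always sets HoldReason to exactly the string shown, which is why it is fragile.

```
set_PeriodicRemove = (JobStatus == 5 && HoldReason =!= "Spooling input data files") || \
(JobStatus == 1 && (CurrentTime - QDate) > 3600*6); \
```

The =!= operator is used rather than != so that the comparison still yields true or false, never UNDEFINED, when HoldReason is not set on the job.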
This will be addressed in 7.2.1-0.2.
Author: Dan Bradley <dan>
Date: Fri Dec 12 17:38:52 2008 -0600
Added protection against periodic expressions messing up the 'spooling' hold state.
This is a temporary solution for 7.2 only. In 7.3, we will get rid of
the spooling state.
I have reproduced the bug on RHEL5.3 i386,
condor-7.2.0-3.el5 (from MRG 1.1)
The bug is not present on condor-7.2.2-0.7.el5
(from MRG-candidate repo).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.