Bug 475865 - Periodic* race in JobRouter (and elsewhere)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.1.1
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: Jeff Needle
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-12-10 20:43 UTC by Matthew Farrellee
Modified: 2009-04-21 16:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-04-21 16:18:14 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHEA-2009:0434
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging and Grid Version 1.1.1
Last Updated: 2009-04-21 16:15:50 UTC

Description Matthew Farrellee 2008-12-10 20:43:05 UTC
Specifically for the JobRouter, the configuration contains:

    set_PeriodicRemove = JobStatus == 5 || \
                         (JobStatus == 1 && (CurrentTime - QDate) > 3600*6); \

JobStatus 5 is the Hold state.

When a job is submitted to Condor, it is briefly put in the Hold state while its data is spooled. That brief hold can last long enough, or simply be timed poorly enough, for the Periodic expressions to be evaluated during it. In this example, PeriodicRemove evaluates to true because JobStatus == 5, and the job is removed before it is completely spooled. Oops!

A temporary workaround, which is fragile, is to test (JobStatus == 5 && HoldReason =!= "Spooling input data files").
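
For illustration, the route expression from above with that test substituted for the bare JobStatus == 5 check might look like the following. This is only a sketch: it assumes the hold reason string is exactly the one quoted above, and presumably that dependence on an exact string is what makes the workaround fragile.

    set_PeriodicRemove = (JobStatus == 5 && HoldReason =!= "Spooling input data files") || \
                         (JobStatus == 1 && (CurrentTime - QDate) > 3600*6); \

With this form, a job held for spooling is skipped, while jobs held for any other reason, and idle jobs older than six hours, are still removed.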

Comment 2 Matthew Farrellee 2009-01-29 16:42:42 UTC
This will be addressed in 7.2.1-0.2

commit d1763c3bc25c2dfb511a61048eb872d5f28fd2da
Author: Dan Bradley <dan>
Date:   Fri Dec 12 17:38:52 2008 -0600

    Added protection against periodic expressions messing up the 'spooling' hold state.
    This is a temporary solution for 7.2 only.  In 7.3, we will get rid of
    the spooling state.

Comment 4 Jan Sarenik 2009-03-09 12:27:07 UTC
I have reproduced the bug on RHEL5.3 i386,
condor-7.2.0-3.el5 (from MRG 1.1)

The bug is not present on condor-7.2.2-0.7.el5
(from MRG-candidate repo).

Comment 6 errata-xmlrpc 2009-04-21 16:18:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html

