Bug 476087 - HoldReason lost (gridmanager & schedd implicated)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: 1.1
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: Jeff Needle
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-12-11 20:52 UTC by Matthew Farrellee
Modified: 2009-02-04 16:05 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-02-04 16:05:55 UTC
Target Upstream Version:
Embargoed:


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHBA-2009:0036 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise MRG Grid 1.1 Release | 2009-02-04 16:03:49 UTC

Description Matthew Farrellee 2008-12-11 20:52:30 UTC
Description of problem:

EC2 jobs were being removed by the PeriodicRemove expression: JobStatus == 5 && HoldReason =!= "Spooling input data files"...

The test for HoldReason was to work around a race where the PeriodicRemove would be evaluated during file spooling, when the job is on Hold.

When the job was written to the history file, its HoldReason information read:

PeriodicRemove = (JobStatus == 5 && HoldReason =!= "Spooling input data files") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)
LastHoldReason = "Spooling input data files"
HoldReasonCode = 0
HoldReasonSubCode = 0
RemoveReason = "The job attribute PeriodicRemove expression '(JobStatus == 5 && HoldReason =!= \"Spooling input data files\") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)' evaluated to TRUE"
HoldReason = UNDEFINED

The SchedLog clearly shows "Job X released from hold: Data files spooled" before the PeriodicRemove is evaluated.

Just after the HoldReason is reset, the SchedLog shows QMGMT_CMD activity, which can be traced to the GridmanagerLog, specifically:

Updating classad values for X:
    JobStatus = 5
    EnteredCurrentStatus = 1229027101
    HoldReason = "The image id '[ami-6441a50d]' does not exist"
    HoldReasonCode = 0
    HoldReasonSubCode = 0
    LastReleaseReason = "Data files spooled"
    ReleaseReason = UNDEFINED
    NumSystemHolds = 1
    Managed = "Schedd"

This indicates the HoldReason no longer equals "Spooling input data files".

The PeriodicRemove expression is then evaluated. Attaching to the condor_schedd with gdb shows that job X's ad contains the following when PeriodicRemove is evaluated:

LastHoldReason = "Spooling input data files"
JobStatus = 5
EnteredCurrentStatus = 1229027101
HoldReason = "The image id '[ami-6441a50d]' does not exist"
HoldReasonCode = 0
HoldReasonSubCode = 0
LastReleaseReason = "Data files spooled"
ReleaseReason = UNDEFINED

(If you are lucky you can see the above with a well-timed condor_q X, but gdb will reliably show it if you break on UserPolicy::AnalyzeSinglePeriodicPolicy and use dPrint.)

This allows JobStatus == 5 && HoldReason =!=  ... to evaluate to TRUE.
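
The evaluation can be modelled outside of Condor. The following is a minimal Python sketch, not Condor code: UNDEFINED, is_not_identical and held_clause are hypothetical names, and only the first clause of the PeriodicRemove expression is modelled. It assumes =!= behaves as the ClassAd "is not identical to" operator, which always yields TRUE or FALSE, even when one side is UNDEFINED.

```python
# Stand-in for the ClassAd UNDEFINED value (hypothetical, for illustration only).
UNDEFINED = object()

SPOOLING = "Spooling input data files"


def is_not_identical(a, b):
    """Model of the ClassAd `=!=` operator: never returns UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        # Only UNDEFINED is identical to UNDEFINED.
        return not (a is UNDEFINED and b is UNDEFINED)
    return type(a) is not type(b) or a != b


def held_clause(ad):
    """First clause of PeriodicRemove: JobStatus == 5 && HoldReason =!= SPOOLING.

    Simplification: a missing JobStatus is treated as "not 5", not as UNDEFINED.
    """
    return ad.get("JobStatus") == 5 and is_not_identical(
        ad.get("HoldReason", UNDEFINED), SPOOLING
    )


# Job ad while input files are being spooled: the workaround protects the job.
spooling_ad = {"JobStatus": 5, "HoldReason": SPOOLING}

# Job ad as seen in gdb after the gridmanager update.
gridmanager_ad = {
    "JobStatus": 5,
    "HoldReason": "The image id '[ami-6441a50d]' does not exist",
    "LastHoldReason": SPOOLING,
}

print(held_clause(spooling_ad))
print(held_clause(gridmanager_ad))
```

The first call is False (the spooling hold is exempt) and the second is True (the gridmanager's hold is not), which matches the removal seen here.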

The job is then removed and archived in the history file where you should be able to debug the situation.

$ condor_history X -l | grep Reason
PeriodicRemove = (JobStatus == 5 && HoldReason =!= "Spooling input data files") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)
LastHoldReason = "Spooling input data files"
HoldReasonCode = 0
HoldReasonSubCode = 0
LastReleaseReason = "Data files spooled"
ReleaseReason = UNDEFINED
RemoveReason = "The job attribute PeriodicRemove expression '(JobStatus == 5 && HoldReason =!= \"Spooling input data files\") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)' evaluated to TRUE"
HoldReason = UNDEFINED

Except wait: the HoldReason is UNDEFINED, as expected, and the LastHoldReason is "Spooling input data files". The HoldReason mentioning the ami is lost! Even more annoying, the LastHoldReason suggests that the PeriodicRemove expression should not have evaluated to TRUE!
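
To check an archived ad for this symptom mechanically, something like the following Python sketch could be used. It is an illustration only: parse_ad_lines is a hypothetical helper, the sample text is copied from the condor_history output above (the two long expression lines are left out for brevity), and values are kept as raw strings rather than evaluated as ClassAd expressions.

```python
def parse_ad_lines(text):
    """Parse 'Attr = value' lines into a dict of raw (unevaluated) value strings."""
    ad = {}
    for line in text.splitlines():
        # Split on the first " = " only; the value may itself contain "==" or "=!=".
        name, sep, value = line.partition(" = ")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


# Subset of the `condor_history X -l | grep Reason` output shown above.
history_output = '''\
LastHoldReason = "Spooling input data files"
HoldReasonCode = 0
HoldReasonSubCode = 0
LastReleaseReason = "Data files spooled"
ReleaseReason = UNDEFINED
HoldReason = UNDEFINED
'''

ad = parse_ad_lines(history_output)

# Attributes whose value still mentions the ami hold reason.
ami_attrs = [name for name, value in ad.items() if "ami-" in value]

print(ad["HoldReason"])
print(ad["LastHoldReason"])
print(ami_attrs)
```

For this sample, no attribute mentions the ami, which is the loss described above.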


Version-Release number of selected component (if applicable):

condor 7.2.0-0.11

Comment 1 Matthew Farrellee 2008-12-16 14:51:39 UTC
Fixed upstream, in 7.2.0-0.12

commit 77716f2135fb7c0933438e532eef63e407b59c82
Author: Jaime Frey
Date:   Mon Dec 15 15:52:30 2008 -0600

    Rotate HoldReason to LastHoldReason when held jobs are removed
    
    Now, when a held job is removed, the schedd moves HoldReason,
    HoldReasonCode, and HoldReasonSubCode to the LastHoldReason and friends.
    Otherwise, they can be lost if the job becomes re-held (say if the
    gridmanager can't remove the job from the remote host).
    Some code in the gridmanager already assumes that HoldReason should only
    be defined if the job is held.
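
The rotation described in the commit message can be sketched as follows. This is a minimal Python model, not the schedd's implementation: rotate_hold_reason is a hypothetical name, the job ad is a plain dict, and the LastHoldReasonCode and LastHoldReasonSubCode names are an assumption about what "and friends" refers to.

```python
HOLD_ATTRS = ("HoldReason", "HoldReasonCode", "HoldReasonSubCode")


def rotate_hold_reason(ad):
    """Move each Hold* attribute to its Last* counterpart and clear the original."""
    for attr in HOLD_ATTRS:
        if attr in ad:
            ad["Last" + attr] = ad.pop(attr)
    return ad


# Job ad at the time PeriodicRemove fired (values from the gdb dump in the description).
job_ad = {
    "JobStatus": 5,
    "LastHoldReason": "Spooling input data files",
    "HoldReason": "The image id '[ami-6441a50d]' does not exist",
    "HoldReasonCode": 0,
    "HoldReasonSubCode": 0,
}

rotate_hold_reason(job_ad)

print(job_ad["LastHoldReason"])
print("HoldReason" in job_ad)
```

With the rotation applied on removal, the archived LastHoldReason carries the ami message instead of the stale spooling message, and HoldReason is no longer defined.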

Comment 4 errata-xmlrpc 2009-02-04 16:05:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html

