Bug 476087

Summary: HoldReason lost (gridmanager & schedd implicated)
Product: Red Hat Enterprise MRG
Component: grid
Version: 1.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: 1.1
Target Release: ---
Reporter: Matthew Farrellee <matt>
Assignee: Matthew Farrellee <matt>
QA Contact: Jeff Needle <jneedle>
Doc Type: Bug Fix
Last Closed: 2009-02-04 16:05:55 UTC

Description Matthew Farrellee 2008-12-11 20:52:30 UTC
Description of problem:

EC2 jobs were being removed by the PeriodicRemove expression: JobStatus == 5 && HoldReason =!= "Spooling input data files"...

The test for HoldReason was to work around a race where the PeriodicRemove would be evaluated during file spooling, when the job is on Hold.

When the job was written to the history file, its HoldReason information read:

PeriodicRemove = (JobStatus == 5 && HoldReason =!= "Spooling input data files") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)
LastHoldReason = "Spooling input data files"
HoldReasonCode = 0
HoldReasonSubCode = 0
RemoveReason = "The job attribute PeriodicRemove expression '(JobStatus == 5 && HoldReason =!= \"Spooling input data files\") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)' evaluated to TRUE"
HoldReason = UNDEFINED

The SchedLog clearly shows "Job X released from hold: Data files spooled" before the PeriodicRemove is evaluated.

Just after the HoldReason is reset, the SchedLog shows QMGMT_CMD activity, which can be traced to the GridmanagerLog, specifically:

Updating classad values for X:
    JobStatus = 5
    EnteredCurrentStatus = 1229027101
    HoldReason = "The image id '[ami-6441a50d]' does not exist"
    HoldReasonCode = 0
    HoldReasonSubCode = 0
    LastReleaseReason = "Data files spooled"
    ReleaseReason = UNDEFINED
    NumSystemHolds = 1
    Managed = "Schedd"

This indicates the HoldReason no longer equals "Spooling input data files".

The PeriodicRemove expression is then evaluated. Attaching to the condor_schedd with gdb shows that job X's ad contains the following when PeriodicRemove is evaluated:

LastHoldReason = "Spooling input data files"
JobStatus = 5
EnteredCurrentStatus = 1229027101
HoldReason = "The image id '[ami-6441a50d]' does not exist"
HoldReasonCode = 0
HoldReasonSubCode = 0
LastReleaseReason = "Data files spooled"
ReleaseReason = UNDEFINED

(If you are lucky you can see the above with a well-timed condor_q X, but gdb will reliably show it if you break on UserPolicy::AnalyzeSinglePeriodicPolicy and use dPrint)

This allows JobStatus == 5 && HoldReason =!=  ... to evaluate to TRUE.
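The evaluation can be illustrated with a rough Python model. This is only a sketch: `UNDEFINED`, `is_not_identical` and `held_clause` are hypothetical names made up for illustration, not Condor code, and the model covers just the first clause of the PeriodicRemove expression with the attribute values quoted above.

```python
UNDEFINED = object()  # hypothetical stand-in for the ClassAd UNDEFINED value


def is_not_identical(a, b):
    """Rough model of ClassAd `=!=`: always TRUE or FALSE, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        # Only "both UNDEFINED" counts as identical.
        return not (a is UNDEFINED and b is UNDEFINED)
    return type(a) is not type(b) or a != b


def held_clause(ad):
    """First clause of the PeriodicRemove expression from this report."""
    return (ad.get("JobStatus", UNDEFINED) == 5 and
            is_not_identical(ad.get("HoldReason", UNDEFINED),
                             "Spooling input data files"))


# Job held while its input files are being spooled: the workaround applies.
spooling = {"JobStatus": 5, "HoldReason": "Spooling input data files"}

# Job re-held by the gridmanager after the spooling hold was released.
reheld = {"JobStatus": 5,
          "HoldReason": "The image id '[ami-6441a50d]' does not exist"}

print(held_clause(spooling))  # False
print(held_clause(reheld))    # True
```

In this model the clause is FALSE only while HoldReason is exactly the spooling message, so the gridmanager's hold makes the job eligible for removal.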

The job is then removed and archived in the history file, where you should be able to debug the situation.

$ condor_history X -l | grep Reason
PeriodicRemove = (JobStatus == 5 && HoldReason =!= "Spooling input data files") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)
LastHoldReason = "Spooling input data files"
HoldReasonCode = 0
HoldReasonSubCode = 0
LastReleaseReason = "Data files spooled"
ReleaseReason = UNDEFINED
RemoveReason = "The job attribute PeriodicRemove expression '(JobStatus == 5 && HoldReason =!= \"Spooling input data files\") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)' evaluated to TRUE"
HoldReason = UNDEFINED

Except wait: the HoldReason is UNDEFINED (which is expected), and the LastHoldReason is "Spooling input data files". The HoldReason mentioning the ami is lost! Even more annoying, the LastHoldReason suggests that the PeriodicRemove expression should not have evaluated to TRUE!


Version-Release number of selected component (if applicable):

condor 7.2.0-0.11

Comment 1 Matthew Farrellee 2008-12-16 14:51:39 UTC
Fixed upstream, in 7.2.0-0.12

commit 77716f2135fb7c0933438e532eef63e407b59c82
Author: Jaime Frey
Date:   Mon Dec 15 15:52:30 2008 -0600

    Rotate HoldReason to LastHoldReason when held jobs are removed
    
    Now, when a held job is removed, the schedd moves HoldReason,
    HoldReasonCode, and HoldReasonSubCode to the LastHoldReason and friends.
    Otherwise, they can be lost if the job becomes re-held (say if the
    gridmanager can't remove the job from the remote host).
    Some code in the gridmanager already assumes that HoldReason should only
    be defined if the job is held.
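
The rotation described in the commit message can be sketched in Python as follows. This is an illustration only, not the schedd's actual code: `rotate_hold_reason` is a hypothetical helper, and the job ad is modelled as a plain dict using the attribute values from the description.

```python
HOLD_ATTRS = ("HoldReason", "HoldReasonCode", "HoldReasonSubCode")


def rotate_hold_reason(ad):
    """Return a copy of the ad with each hold attribute moved to its Last* twin."""
    rotated = dict(ad)
    for attr in HOLD_ATTRS:
        if attr in rotated:
            # e.g. HoldReason -> LastHoldReason; the original attribute is dropped.
            rotated["Last" + attr] = rotated.pop(attr)
    return rotated


# Job ad as seen in gdb just before PeriodicRemove fired.
held_ad = {
    "JobStatus": 5,
    "LastHoldReason": "Spooling input data files",
    "HoldReason": "The image id '[ami-6441a50d]' does not exist",
    "HoldReasonCode": 0,
    "HoldReasonSubCode": 0,
}

removed_ad = rotate_hold_reason(held_ad)
print(removed_ad["LastHoldReason"])
print("HoldReason" in removed_ad)
```

With a rotation like this, the archived ad would carry the ami message in LastHoldReason rather than the stale spooling message.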

Comment 4 errata-xmlrpc 2009-02-04 16:05:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html