Description of problem:

EC2 jobs were being removed by the PeriodicRemove expression:

  JobStatus == 5 && HoldReason =!= "Spooling input data files"...

The test for HoldReason was there to work around a race where the PeriodicRemove would be evaluated during file spooling, when the job is on Hold.

When the job was written to the history file, its HoldReason information read:

  PeriodicRemove = (JobStatus == 5 && HoldReason =!= "Spooling input data files") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)
  LastHoldReason = "Spooling input data files"
  HoldReasonCode = 0
  HoldReasonSubCode = 0
  RemoveReason = "The job attribute PeriodicRemove expression '(JobStatus == 5 && HoldReason =!= \"Spooling input data files\") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)' evaluated to TRUE"
  HoldReason = UNDEFINED

The SchedLog clearly shows "Job X released from hold: Data files spooled" before the PeriodicRemove is evaluated. Just after the HoldReason is reset, the SchedLog shows QMGMT_CMD activity, which can be traced to the GridmanagerLog, specifically:

  Updating classad values for X:
  JobStatus = 5
  EnteredCurrentStatus = 1229027101
  HoldReason = "The image id '[ami-6441a50d]' does not exist"
  HoldReasonCode = 0
  HoldReasonSubCode = 0
  LastReleaseReason = "Data files spooled"
  ReleaseReason = UNDEFINED
  NumSystemHolds = 1
  Managed = "Schedd"

This indicates the job is back on Hold and its HoldReason no longer equals "Spooling input data files".
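To make the sequence concrete, here is a rough Python model of the held-job clause of the PeriodicRemove expression, evaluated against three snapshots of the job ad. This is only a sketch: the `UNDEFINED` sentinel, the helper names, and the dict-based ads are illustrative stand-ins rather than Condor code, the handling of a missing JobStatus is simplified, and the "released" snapshot assumes the job goes back to Idle (JobStatus 1) after spooling completes.

```python
UNDEFINED = object()  # illustrative stand-in for the ClassAd UNDEFINED value

SPOOLING = "Spooling input data files"


def is_not_identical(a, b):
    """Rough model of ClassAd `=!=`: always True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is not b
    return type(a) is not type(b) or a != b


def held_clause(ad):
    """JobStatus == 5 && HoldReason =!= "Spooling input data files"."""
    status = ad.get("JobStatus", UNDEFINED)
    reason = ad.get("HoldReason", UNDEFINED)
    # Simplification: a missing JobStatus is treated as "not 5".
    return status == 5 and is_not_identical(reason, SPOOLING)


# Three hypothetical snapshots of job X's ad, in the order seen in the logs.
snapshots = {
    "spooling": {"JobStatus": 5, "HoldReason": SPOOLING},
    "released": {"JobStatus": 1, "LastHoldReason": SPOOLING},
    "reheld": {
        "JobStatus": 5,
        "HoldReason": "The image id '[ami-6441a50d]' does not exist",
        "LastHoldReason": SPOOLING,
    },
}

results = {name: held_clause(ad) for name, ad in snapshots.items()}
for name, fired in results.items():
    print(f"{name}: PeriodicRemove held clause -> {fired}")
```

In this model only the "reheld" snapshot satisfies the clause, which matches the removal seen here: the spooling workaround protects the first hold but not a later hold placed by the gridmanager.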
The PeriodicRemove expression is evaluated, and attaching to the condor_schedd with gdb indicates that job X's ad contains the following when PeriodicRemove is evaluated:

  LastHoldReason = "Spooling input data files"
  JobStatus = 5
  EnteredCurrentStatus = 1229027101
  HoldReason = "The image id '[ami-6441a50d]' does not exist"
  HoldReasonCode = 0
  HoldReasonSubCode = 0
  LastReleaseReason = "Data files spooled"
  ReleaseReason = UNDEFINED

(If you are lucky you can see the above with a well-timed condor_q X, but gdb will reliably show it if you break on UserPolicy::AnalyzeSinglePeriodicPolicy and use dPrint.)

This allows JobStatus == 5 && HoldReason =!= ... to evaluate to TRUE. The job is then removed and archived in the history file, where you should be able to debug the situation.

  $ condor_history X -l | grep Reason
  PeriodicRemove = (JobStatus == 5 && HoldReason =!= "Spooling input data files") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)
  LastHoldReason = "Spooling input data files"
  HoldReasonCode = 0
  HoldReasonSubCode = 0
  LastReleaseReason = "Data files spooled"
  ReleaseReason = UNDEFINED
  RemoveReason = "The job attribute PeriodicRemove expression '(JobStatus == 5 && HoldReason =!= \"Spooling input data files\") || (JobStatus == 1 && (CurrentTime - QDate) > 3600 * 6)' evaluated to TRUE"
  HoldReason = UNDEFINED

Except wait: the HoldReason is UNDEFINED, as expected, and the LastHoldReason is "Spooling input data files". The HoldReason mentioning the ami is lost! Even more annoying, the LastHoldReason suggests that the PeriodicRemove expression should not have evaluated to TRUE!

Version-Release number of selected component (if applicable):

condor 7.2.0-0.11
Fixed upstream, in 7.2.0-0.12

commit 77716f2135fb7c0933438e532eef63e407b59c82
Author: Jaime Frey
Date:   Mon Dec 15 15:52:30 2008 -0600

    Rotate HoldReason to LastHoldReason when held jobs are removed

    Now, when a held job is removed, the schedd moves HoldReason,
    HoldReasonCode, and HoldReasonSubCode to the LastHoldReason and
    friends. Otherwise, they can be lost if the job becomes re-held (say
    if the gridmanager can't remove the job from the remote host). Some
    code in the gridmanager already assumes that HoldReason should only
    be defined if the job is held.
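The rotation described in the commit message can be pictured with a small Python sketch. It is not the schedd's actual code: the function name and the dict-based ad are hypothetical, and it assumes the "friends" of LastHoldReason are named by prefixing each attribute with "Last".

```python
HOLD_ATTRS = ("HoldReason", "HoldReasonCode", "HoldReasonSubCode")


def rotate_hold_reason(ad):
    """Return a copy of the ad with HoldReason* moved to LastHoldReason*.

    Hypothetical helper: models what the commit message says happens
    when a held job is removed.
    """
    rotated = dict(ad)
    for attr in HOLD_ATTRS:
        if attr in rotated:
            # Overwrites any older Last* value with the current hold info.
            rotated["Last" + attr] = rotated.pop(attr)
    return rotated


# Job X's ad at the moment PeriodicRemove fires (values from this report).
held = {
    "JobStatus": 5,
    "HoldReason": "The image id '[ami-6441a50d]' does not exist",
    "HoldReasonCode": 0,
    "HoldReasonSubCode": 0,
    "LastHoldReason": "Spooling input data files",
}

removed = rotate_hold_reason(held)
print(removed["LastHoldReason"])
```

With a rotation like this, the history entry would carry the ami message in LastHoldReason rather than the stale spooling message, and HoldReason would no longer be defined on the removed job.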
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html