Bug 738338 - Failed job execution causes starter to shutdown w/o completing exit hook
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: 2.1
Assignee: Robert Rati
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 738335 743350
Reported: 2011-09-14 15:09 UTC by Robert Rati
Modified: 2012-01-23 17:29 UTC (History)

Fixed In Version: condor-7.6.4-0.8
Doc Type: Bug Fix
Doc Text:
When a job's executable could not be run, the starter shut down immediately without waiting for the job exit hook to complete its operation. With this update, a job that fails to execute waits for a configured exit hook to complete: the starter waits 30 seconds by default before giving up on the exit hook and shutting down.
Clone Of:
Environment:
Last Closed: 2012-01-23 17:29:03 UTC



Links
System: Red Hat Product Errata
ID: RHEA-2012:0045
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Grid 2.1 bug fix and enhancement update
Last Updated: 2012-01-23 22:22:58 UTC

Description Robert Rati 2011-09-14 15:09:27 UTC
Description of problem:
If the starter has a problem running the job's executable, it performs a fast shutdown.  Part of this process invokes the exit hook, as it should, but the starter shuts down without waiting for the exit hook to exit.  Anything that relies on the job hooks and runs into this case will likely not be notified that the job has exited and been evicted.


Comment 1 Martin Kudlej 2011-09-29 11:13:06 UTC
How can we test this bug? Is there a reproduction scenario?

Comment 2 Robert Rati 2011-09-29 19:58:26 UTC
Run a low-latency job whose executable can't be run.  The starter log will show the exit hook being invoked, but not exiting.  carod also won't send a message indicating that the job has exited.
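
One way to produce "an executable that can't be run" is a file with no execute bit. This is only a guess at what the file_no_perms.py test mentioned later in this bug does, and the file name and helper names below are hypothetical. A minimal POSIX-only sketch:

```python
import os
import stat
import subprocess
import tempfile


def make_unrunnable_executable(directory):
    """Create a script with no execute bits set (hypothetical file name)."""
    path = os.path.join(directory, "no_exec_bit.sh")
    with open(path, "w") as script:
        script.write("#!/bin/sh\nexit 0\n")
    # Owner read/write only: nobody is allowed to execute the file.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path


def can_execute(path):
    """Return False if the OS refuses to execute path for lack of permission."""
    try:
        subprocess.run([path], check=False)
    except PermissionError:
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(can_execute(make_unrunnable_executable(tmp)))
```

Submitting such a file as the job's executable should put the starter on the failure path described above; the submission itself is not shown here.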

Comment 3 Robert Rati 2011-10-12 20:40:43 UTC
If an exit hook is configured, the starter now waits for it to complete. The starter waits <keyword>_HOOK_<hook_type>_TIMEOUT seconds (30 by default for the exit hook) before it decides that the hook is hung and continues with shutdown.  Currently, the only timeout that can be set is the one for the exit hook.

Fixed on branch:
BZ738338-starter-waits-for-hook
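
The wait-then-give-up behaviour described in this comment can be illustrated with a short Python sketch. This is not the starter's code: the function name is hypothetical, and killing the hook after the timeout is an assumption made to keep the example tidy (the comment only says the starter continues with shutdown).

```python
import subprocess
import sys


def run_hook_with_timeout(argv, timeout_seconds=30):
    """Start a hook process and wait up to timeout_seconds for it to exit.

    Returns the hook's exit code, or None if the hook was treated as hung.
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Treat the hook as hung and stop waiting so shutdown can continue.
        # Killing the hook here is an assumption, not documented behaviour.
        proc.kill()
        proc.wait()
        return None


if __name__ == "__main__":
    quick_hook = [sys.executable, "-c", "pass"]
    hung_hook = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_hook_with_timeout(quick_hook, timeout_seconds=5))
    print(run_hook_with_timeout(hung_hook, timeout_seconds=1))
```

The default of 30 mirrors the default stated above for the exit hook timeout.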

Comment 5 Robert Rati 2011-10-18 15:54:25 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: An executable in a job is unable to be executed
C: The starter will shutdown immediately without waiting for the exit job hook to complete if it is configured
F: The starter will wait for the exit hook to complete before exiting.  The starter will wait <keyword>_HOOK_JOB_EXIT_TIMEOUT seconds, which defaults to 30, before giving up on the exit hook and exiting
R: A job that fails to execute will try to make sure a configured exit hook will complete before exiting

Comment 6 Lubos Trilety 2011-10-27 14:29:41 UTC
Successfully reproduced on:
$CondorVersion: 7.6.3 Jul 27 2011 BuildID: RH-7.6.3-0.3.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

run file_no_perms.py
# python file_no_perms.py
...
Only received 1 messages but expected 2.  TEST FAILED!

# cat /var/log/condor/StarterLog | grep HOOK_JOB_EXIT | wc -l
0
# cat /var/log/condor/CaroLog | grep Expiring | wc -l
1

Comment 7 Lubos Trilety 2011-10-27 14:46:22 UTC
Tested on:
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

run file_no_perms.py - passed

# cat /var/log/condor/StarterLog | grep HOOK_JOB_EXIT | wc -l
1 (or higher)
# cat /var/log/condor/CaroLog | grep Expiring | wc -l
0

>>> VERIFIED
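
The two log checks used above (grep piped to wc -l) can also be scripted. The sketch below encodes the pass criteria seen in this comment: at least one HOOK_JOB_EXIT line in the starter log and no Expiring lines in the carod log. The function names are hypothetical, and the log paths are passed in by the caller.

```python
def count_matching_lines(log_path, needle):
    """Count lines containing needle, like `grep needle file | wc -l`."""
    with open(log_path, errors="replace") as log:
        return sum(1 for line in log if needle in line)


def fix_looks_verified(starter_log, caro_log):
    """Apply the pass criteria from the verification run in this comment."""
    hook_lines = count_matching_lines(starter_log, "HOOK_JOB_EXIT")
    expiring_lines = count_matching_lines(caro_log, "Expiring")
    return hook_lines >= 1 and expiring_lines == 0


if __name__ == "__main__":
    print(fix_looks_verified("/var/log/condor/StarterLog",
                             "/var/log/condor/CaroLog"))
```

Like the grep commands, this counts every matching line in the file, so the logs should be fresh or rotated before the test run.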

Comment 8 Tomas Capek 2011-11-16 15:02:28 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1 @@
-C: An executable in a job is unable to be executed
+When an executable in a job was unable to be executed, the starter shut down immediately without waiting for the exit job hook to complete its operation. With this update, the starter waits 30 seconds by default before giving up on the exit hook and shutting down and a job that fails to execute will try to make sure a configured exit hook will complete its operation before exiting.
-C: The starter will shutdown immediately without waiting for the exit job hook to complete if it is configured
-F: The starter will wait for the exit hook to complete before exiting.  The starter will wait <keyword>_HOOK_JOB_EXIT_TIMEOUT seconds, which defaults to 30, before giving up on the exit hook and exiting
-R: A job that fails to execute will try to make sure a configured exit hook will complete before exiting

Comment 9 errata-xmlrpc 2012-01-23 17:29:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0045.html