Bug 738338

Summary: Failed job execution causes starter to shutdown w/o completing exit hook
Product: Red Hat Enterprise MRG
Reporter: Robert Rati <rrati>
Component: condor
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Lubos Trilety <ltrilety>
Severity: unspecified
Docs Contact:
Priority: medium
Version: Development
CC: iboverma, ltoscano, ltrilety, matt, mkudlej, tstclair
Target Milestone: 2.1   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: condor-7.6.4-0.8
Doc Type: Bug Fix
Doc Text:
When a job's executable could not be executed, the starter shut down immediately without waiting for the job exit hook to complete its operation. With this update, a job that fails to execute waits for a configured exit hook to complete: the starter waits 30 seconds by default before giving up on the exit hook and shutting down.
Last Closed: 2012-01-23 17:29:03 UTC
Bug Depends On:    
Bug Blocks: 738335, 743350    

Description Robert Rati 2011-09-14 15:09:27 UTC
Description of problem:
If the starter has a problem running the executable for the job, it will shut down fast.  Part of this process invokes the exit hook, as it should, but the starter then shuts down without waiting for the exit hook to finish.  Anything that relies on the job hooks and runs into this case will likely not be notified that the job has exited and been evicted.


Comment 1 Martin Kudlej 2011-09-29 11:13:06 UTC
How can we test this bug? Is there a reproduction scenario?

Comment 2 Robert Rati 2011-09-29 19:58:26 UTC
Run a low-latency job that has an executable that can't be run.  The starter log will show the exit hook being invoked, but not exiting.  carod also won't send a message denoting that the job has exited.
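
One way to produce "an executable that can't be run" is a file with no execute permission. The sketch below only shows that failure mode on a POSIX system; it does not submit a low-latency job, and it is not the content of any test script mentioned in this report. The function name and the file name `job.sh` are hypothetical.

```python
import errno
import os
import stat
import subprocess
import tempfile


def try_to_execute_without_exec_bit():
    """Create a script with no execute permission and try to run it.

    Returns the errno of the resulting OSError, or None if the run
    unexpectedly succeeded. POSIX only; hypothetical helper.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "job.sh")
        with open(path, "w") as f:
            f.write("#!/bin/sh\nexit 0\n")
        # Read/write for the owner only: no execute bit for anyone.
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
        try:
            subprocess.run([path])
        except OSError as exc:
            return exc.errno
    return None


if __name__ == "__main__":
    result = try_to_execute_without_exec_bit()
    print("permission denied" if result == errno.EACCES else result)
```

A job whose executable fails in this way is the case in which the starter shut down without waiting for the exit hook.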

Comment 3 Robert Rati 2011-10-12 20:40:43 UTC
If an exit hook is configured, the starter now waits for it to complete.  The starter waits up to <keyword>_HOOK_<hook_type>_TIMEOUT seconds (30 by default for the exit hook) before it decides the hook is hung and continues with shutdown.  Currently, the exit hook timeout is the only one that can be set.
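
The wait-then-give-up behavior described above can be sketched in Python. This is an illustration only, not the starter's code: `run_exit_hook`, its arguments, and the returned status strings are hypothetical names, and only the 30-second default is taken from this comment.

```python
import subprocess
import sys

DEFAULT_EXIT_HOOK_TIMEOUT = 30  # seconds; the default quoted in this bug report


def run_exit_hook(hook_argv, timeout=DEFAULT_EXIT_HOOK_TIMEOUT):
    """Run a hook command and wait at most `timeout` seconds for it.

    Returns "completed" if the hook exited in time, or "hung" if it was
    still running when the timeout expired. Hypothetical helper.
    """
    try:
        subprocess.run(hook_argv, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the exception,
        # so the caller can continue with its own shutdown.
        return "hung"
    return "completed"


if __name__ == "__main__":
    # A hook that finishes quickly is waited for.
    print(run_exit_hook([sys.executable, "-c", "pass"]))
    # A hook that outlives a 1-second timeout is treated as hung.
    print(run_exit_hook([sys.executable, "-c", "import time; time.sleep(5)"], timeout=1))
```

The point of the bounded wait is that a hung hook cannot block shutdown indefinitely, while a well-behaved hook still gets to report the job exit.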

Fixed on branch:
BZ738338-starter-waits-for-hook

Comment 5 Robert Rati 2011-10-18 15:54:25 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: An executable in a job is unable to be executed
C: The starter will shutdown immediately without waiting for the exit job hook to complete if it is configured
F: The starter will wait for the exit hook to complete before exiting.  The starter will wait <keyword>_HOOK_JOB_EXIT_TIMEOUT seconds, which defaults to 30, before giving up on the exit hook and exiting
R: A job that fails to execute will try to make sure a configured exit hook will complete before exiting

Comment 6 Lubos Trilety 2011-10-27 14:29:41 UTC
Successfully reproduced on:
$CondorVersion: 7.6.3 Jul 27 2011 BuildID: RH-7.6.3-0.3.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

run file_no_perms.py
# python file_no_perms.py
...
Only received 1 messages but expected 2.  TEST FAILED!

# cat /var/log/condor/StarterLog | grep HOOK_JOB_EXIT | wc -l
0
# cat /var/log/condor/CaroLog | grep Expiring | wc -l
1

Comment 7 Lubos Trilety 2011-10-27 14:46:22 UTC
Tested on:
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

run file_no_perms.py - passed

# cat /var/log/condor/StarterLog | grep HOOK_JOB_EXIT | wc -l
1 
(or higher)
# cat /var/log/condor/CaroLog | grep Expiring | wc -l
0

>>> VERIFIED
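
The two log checks above can also be scripted. The following is a minimal sketch that mirrors the `grep ... | wc -l` commands and the pass condition used in this comment (at least one HOOK_JOB_EXIT line in the starter log, no "Expiring" line in the carod log). The function names are hypothetical, and the sample log lines in the demo are made up for illustration.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Count lines in `path` that contain `needle` (like grep | wc -l)."""
    with open(path, errors="replace") as f:
        return sum(1 for line in f if needle in line)


def exit_hook_verified(starter_log, carod_log):
    """True when the exit hook was logged and no 'Expiring' entry appears."""
    return (count_matching_lines(starter_log, "HOOK_JOB_EXIT") >= 1
            and count_matching_lines(carod_log, "Expiring") == 0)


if __name__ == "__main__":
    # Demo with made-up log content; real paths would be the
    # StarterLog and carod log files named in the commands above.
    with tempfile.TemporaryDirectory() as tmpdir:
        starter = os.path.join(tmpdir, "StarterLog")
        carod = os.path.join(tmpdir, "CaroLog")
        with open(starter, "w") as f:
            f.write("sample line mentioning HOOK_JOB_EXIT\n")
        with open(carod, "w") as f:
            f.write("sample line with nothing of interest\n")
        print(exit_hook_verified(starter, carod))
```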

Comment 8 Tomas Capek 2011-11-16 15:02:28 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1 @@
-C: An executable in a job is unable to be executed
+When an executable in a job was unable to be executed, the starter shut down immediately without waiting for the exit job hook to complete its operation. With this update, the starter waits 30 seconds by default before giving up on the exit hook and shutting down and a job that fails to execute will try to make sure a configured exit hook will complete its operation before exiting.
-C: The starter will shutdown immediately without waiting for the exit job hook to complete if it is configured
-F: The starter will wait for the exit hook to complete before exiting.  The starter will wait <keyword>_HOOK_JOB_EXIT_TIMEOUT seconds, which defaults to 30, before giving up on the exit hook and exiting
-R: A job that fails to execute will try to make sure a configured exit hook will complete before exiting

Comment 9 errata-xmlrpc 2012-01-23 17:29:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0045.html