Bug 738338
Summary: | Failed job execution causes starter to shutdown w/o completing exit hook | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Robert Rati <rrati> |
Component: | condor | Assignee: | Robert Rati <rrati> |
Status: | CLOSED ERRATA | QA Contact: | Lubos Trilety <ltrilety> |
Severity: | unspecified | Docs Contact: | |
Priority: | medium | ||
Version: | Development | CC: | iboverma, ltoscano, ltrilety, matt, mkudlej, tstclair |
Target Milestone: | 2.1 | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | condor-7.6.4-0.8 | Doc Type: | Bug Fix |
Doc Text: |
When the executable in a job could not be executed, the starter shut down immediately without waiting for the exit job hook to complete its operation. With this update, the starter waits for a configured exit hook to complete, for 30 seconds by default, before giving up on the hook and shutting down. As a result, a job that fails to execute gives a configured exit hook the chance to complete its operation before the starter exits.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2012-01-23 17:29:03 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 738335, 743350 |
Description
Robert Rati
2011-09-14 15:09:27 UTC
How can we test this bug? Is there a repro scenario?

Run a low-latency job that has an executable that can't be run. The starter log will show the exit hook being invoked, but not exiting. carod also won't send a message denoting that the job has exited.

The starter will wait for the exit hook to complete if it is configured. The starter will wait <keyword>_HOOK_<hook_type>_TIMEOUT seconds (30 by default for the exit hook) before it determines that the hook is hung and continues with shutdown. Currently, the only timeout that can be set is the one for the exit hook.

Fixed on branch: BZ738338-starter-waits-for-hook

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: An executable in a job is unable to be executed
C: The starter will shutdown immediately without waiting for the exit job hook to complete if it is configured
F: The starter will wait for the exit hook to complete before exiting. The starter will wait <keyword>_HOOK_JOB_EXIT_TIMEOUT seconds, which defaults to 30, before giving up on the exit hook and exiting
R: A job that fails to execute will try to make sure a configured exit hook will complete before exiting

Successfully reproduced on:
$CondorVersion: 7.6.3 Jul 27 2011 BuildID: RH-7.6.3-0.3.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $
run file_no_perms.py
# python file_no_perms.py
...
Only received 1 messages but expected 2.
TEST FAILED!
# cat /var/log/condor/StarterLog | grep HOOK_JOB_EXIT | wc -l
0
# cat /var/log/condor/CaroLog | grep Expiring | wc -l
1

Tested on:
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $
run file_no_perms.py - passed
# cat /var/log/condor/StarterLog | grep HOOK_JOB_EXIT | wc -l
1
(or higher)
# cat /var/log/condor/CaroLog | grep Expiring | wc -l
0
>>> VERIFIED
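
The two grep checks used in the reproduction and verification runs can also be scripted. The following is only a sketch of that comparison in Python, not part of file_no_perms.py; the function names are hypothetical, and the pass criteria (at least one HOOK_JOB_EXIT line in StarterLog, no Expiring line in CaroLog) are taken from the results reported above.

```python
def count_matching_lines(lines, needle):
    """Count lines containing `needle`, like `grep needle | wc -l`."""
    return sum(1 for line in lines if needle in line)


def read_lines(path):
    """Read a log file into a list of lines (hypothetical helper)."""
    with open(path, "r", errors="replace") as handle:
        return handle.readlines()


def fix_verified(starter_log_lines, caro_log_lines):
    """Return True when the logs match the 'passed' pattern in this report.

    Passed run: StarterLog mentions HOOK_JOB_EXIT at least once and
    CaroLog contains no 'Expiring' line.
    """
    exit_hook_lines = count_matching_lines(starter_log_lines, "HOOK_JOB_EXIT")
    expiring_lines = count_matching_lines(caro_log_lines, "Expiring")
    return exit_hook_lines >= 1 and expiring_lines == 0
```

On a test machine this could be called as fix_verified(read_lines("/var/log/condor/StarterLog"), read_lines("/var/log/condor/CaroLog")), using the log paths from the commands above.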
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-C: An executable in a job is unable to be executed
+When an executable in a job was unable to be executed, the starter shut down immediately without waiting for the exit job hook to complete its operation. With this update, the starter waits 30 seconds by default before giving up on the exit hook and shutting down and a job that fails to execute will try to make sure a configured exit hook will complete its operation before exiting.
-C: The starter will shutdown immediately without waiting for the exit job hook to complete if it is configured
-F: The starter will wait for the exit hook to complete before exiting. The starter will wait <keyword>_HOOK_JOB_EXIT_TIMEOUT seconds, which defaults to 30, before giving up on the exit hook and exiting
-R: A job that fails to execute will try to make sure a configured exit hook will complete before exiting

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHEA-2012-0045.html
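
For readers unfamiliar with the behavior the technical note describes, the wait-then-give-up pattern can be illustrated as follows. This is a rough Python sketch under stated assumptions and is not the starter's actual code: the function name and the kill-on-timeout step are illustrative choices, and only the 30 second default comes from this report.

```python
import subprocess
import sys

# Default taken from this report: <keyword>_HOOK_JOB_EXIT_TIMEOUT defaults to 30.
DEFAULT_EXIT_HOOK_TIMEOUT = 30  # seconds


def wait_for_exit_hook(hook_argv, timeout=DEFAULT_EXIT_HOOK_TIMEOUT):
    """Run a hook command and wait up to `timeout` seconds for it.

    Returns the hook's exit code, or None if the hook was considered
    hung. Killing the hung hook is an assumption of this sketch; the
    report only says the starter gives up and continues with shutdown.
    """
    proc = subprocess.Popen(hook_argv)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()   # stop the hung hook so it does not linger
        proc.wait()   # reap the killed child process
        return None


if __name__ == "__main__":
    # A hook that finishes quickly returns its exit code (0 here).
    print(wait_for_exit_hook([sys.executable, "-c", "pass"]))
```

The point of the pattern is that shutdown is delayed only for a bounded time: a well-behaved hook completes and reports the job exit, while a hung hook cannot block shutdown indefinitely.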