Bug 904178

Summary: Change of behavior for code escalation/custom signal for jobs
Product: Red Hat Enterprise MRG
Reporter: Luigi Toscano <ltoscano>
Component: condor
Assignee: Robert Rati <rrati>
Status: CLOSED CURRENTRELEASE
QA Contact: Luigi Toscano <ltoscano>
Severity: high
Docs Contact:
Priority: high
Version: Development
CC: esammons, matt, rrati, tstclair
Target Milestone: 2.3
Keywords: Regression
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version: condor-7.8.8-0.4.1
Doc Type: Known Issue
Doc Text:
Cause: Rebase to condor 7.8.
Consequence: The handling of when custom kill signals are used has been improved.
Workaround (if any):
Result: Custom kill signals are no longer used during a fast shutdown. If a job is to use custom kill signals, it will be gracefully removed. Additionally, some instances where custom kill signals were sent more than once have been removed. Only one custom kill signal will be sent before the job is hard killed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-19 16:39:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---

Description Luigi Toscano 2013-01-25 16:44:04 UTC
Description of problem:
HTCondor 7.8 changed the way a job can specify how to receive a custom signal and the related timeout. 

Before: the job had to specify the custom signal as (Remove_)kill_sig. Upon rm, the signal was sent, followed by a kill after KILLING_TIMEOUT-1 (or kill_sig_timeout, if specified).
The same timeout was used for vacate_job if the job didn't react to the signal; TERM was used unless kill_sig was specified.

Now: the job has to specify the custom signal as (Remove_)kill_sig. Upon rm, in order for the escalation to take place, the job must also define want_graceful_removal=true; in this case the signal is sent and the escalation takes place after MACHINEMAXVACATETIME (or JobMaxVacateTime if specified by the job; kill_sig_timeout is deprecated).
The same timeout is used for vacate_job if the job does not react to the signal (but note that, if specified, the custom signal is always used even if want_graceful_removal is not specified).
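
The timeout selection on rm described in the "Before" and "Now" paragraphs can be sketched roughly as follows. This is only an illustrative model of the text in this report, not HTCondor code: the function names are hypothetical, the graceful-removal flag is spelled want_graceful_removal as in the 7.8 release notes, and any further limits HTCondor may apply to these values are not modeled.

```python
from typing import Optional


def kill_delay_before(killing_timeout: int,
                      kill_sig_timeout: Optional[int] = None) -> int:
    """Pre-7.8: seconds between the custom signal and the hard kill."""
    # kill_sig_timeout, when given by the job, replaced the default.
    if kill_sig_timeout is not None:
        return kill_sig_timeout
    return killing_timeout - 1


def kill_delay_now(machine_max_vacate_time: int,
                   job_max_vacate_time: Optional[int] = None,
                   want_graceful_removal: bool = False) -> Optional[int]:
    """7.8 as reported: seconds between the custom signal and the hard kill.

    Returns None when the job did not request graceful removal, meaning
    that no signal escalation takes place on rm.
    """
    if not want_graceful_removal:
        return None
    # JobMaxVacateTime, when given by the job, replaces the machine value.
    if job_max_vacate_time is not None:
        return job_max_vacate_time
    return machine_max_vacate_time
```

With KILLING_TIMEOUT set to 30, for example, the old behavior escalates after 29 seconds, while the reported 7.8 behavior does not escalate at all unless the job explicitly asks for graceful removal.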


Expected results:
Either change the default behaviour (at least remove the need for want_graceful_removal and restore the previous timeout) or properly document and explain the change. The names of the new variables (MACHINEMAXVACATETIME/JobMaxVacateTime instead of KILLING_TIMEOUT/kill_sig_timeout) should probably be documented anyway, unless the old ones are kept as aliases.

Comment 1 Luigi Toscano 2013-01-25 16:46:22 UTC
See also:
http://research.cs.wisc.edu/htcondor/manual/v7.8/9_4Development_Release.html#SECTION001044000000000000000

The paragraph "The new submit command want_graceful_removal ..."

Comment 3 Robert Rati 2013-02-01 15:23:05 UTC
A new parameter named GRACEFULLY_REMOVE_JOBS, which defaults to true, determines whether jobs will be removed gracefully by default and use any custom signals defined.  A job can override this setting by specifying want_graceful_removal in the job ad.

Also, the default config has MACHINEMAXVACATETIME set to KILLING_TIMEOUT-1.
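
As a rough illustration of the precedence described in this comment (a sketch only: the function names are hypothetical and this is not the actual implementation on the fix branch), the job-level setting wins when present and the pool-wide default applies otherwise:

```python
from typing import Optional


def removed_gracefully(want_graceful_removal: Optional[bool] = None,
                       gracefully_remove_jobs: bool = True) -> bool:
    """Decide whether rm uses graceful removal (and thus custom signals)."""
    # A value in the job ad overrides the GRACEFULLY_REMOVE_JOBS setting.
    if want_graceful_removal is not None:
        return want_graceful_removal
    return gracefully_remove_jobs


def default_machine_max_vacate_time(killing_timeout: int) -> int:
    """Default MACHINEMAXVACATETIME in the shipped config: KILLING_TIMEOUT-1."""
    return killing_timeout - 1
```

Under these defaults a job that sets nothing is removed gracefully and is given KILLING_TIMEOUT-1 seconds before the hard kill, which matches the old timeout.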

Fixed on branch:
BZ904178-orig-signal-escalation-semantics

Comment 6 Luigi Toscano 2013-02-18 00:09:03 UTC
The feature now works according to the behavior described in comment 3 (which, in the default configuration, is _almost_ like the old default behavior - now the signal escalation is always enabled).

Verified on RHEL5.9/6.4beta, i386/x86_64.

condor-classads-7.8.8-0.4.1
condor-7.8.8-0.4.1