Red Hat Bugzilla – Full Text Bug Listing
|Summary:||Change of behavior for code escalation/custom signal for jobs|
|Product:||Red Hat Enterprise MRG||Reporter:||Luigi Toscano <ltoscano>|
|Component:||condor||Assignee:||Robert Rati <rrati>|
|Status:||CLOSED CURRENTRELEASE||QA Contact:||Luigi Toscano <ltoscano>|
|Version:||Development||CC:||esammons, matt, rrati, tstclair|
|Fixed In Version:||condor-7.8.8-0.4.1||Doc Type:||Known Issue|
Cause: Rebase to condor 7.8. Consequence: The handling of when custom kill signals are used has been improved. Workaround (if any): Result: Custom kill signals are no longer used during a fast shutdown. If a job is to use custom kill signals, it will be gracefully removed. Additionally, some instances where custom kill signals were sent more than once have been removed. Only one custom kill signal will be sent before the job is hard killed.
|Last Closed:||2013-03-19 12:39:27 EDT||Type:||Bug|
Description Luigi Toscano 2013-01-25 11:44:04 EST
Description of problem:
HTCondor 7.8 changed the way a job specifies a custom signal and the related timeout.

Before: the job had to specify the custom signal as (Remove_)kill_sig. Upon rm, the signal was sent, followed by a kill after KILLING_TIMEOUT-1 (or kill_sig_timeout, if specified). The same timeout was used for vacate_job if the job did not react to the signal; TERM was used unless kill_sig was specified.

Now: the job still has to specify the custom signal as (Remove_)kill_sig. Upon rm, for the escalation to take place, the job must also define want_graceful_removal=true; in this case the signal is sent and the escalation takes place after MACHINEMAXVACATETIME (or JobMaxVacateTime, if specified by the job; kill_sig_timeout is deprecated). The same timeout is used for vacate_job if the job does not react to the signal (note that, if specified, the custom signal is always used for vacate_job, even if want_graceful_removal is not specified).

Expected results:
Either change the default behaviour (at least for removal: drop the need for want_graceful_removal and restore the previous timeout) or properly document and explain the change. The names of the new variables (MACHINEMAXVACATETIME/JobMaxVacateTime instead of KILLING_TIMEOUT/kill_sig_timeout) should probably be documented anyway, unless the old ones are kept as aliases.
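For illustration only, here is a minimal sketch of the job side of this mechanism: a process that reacts to a custom kill signal by finishing cleanly before any escalation to a hard kill. SIGUSR1 is an arbitrary example signal, and the GracefulJob class and its methods are hypothetical names, not part of HTCondor.

```python
import os
import signal


class GracefulJob:
    """Hypothetical job wrapper that stops cleanly on a custom kill signal."""

    def __init__(self, signum=signal.SIGUSR1):
        # SIGUSR1 is only an example; the job would use whatever signal
        # it names as its custom kill signal.
        self.signum = signum
        self.stop_requested = False
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only record the request; the main loop decides when to stop.
        self.stop_requested = True

    def run(self, work_steps):
        """Run steps until done or until the custom signal arrives.

        Returns the number of steps that were completed.
        """
        done = 0
        for step in work_steps:
            if self.stop_requested:
                break
            step()
            done += 1
        return done
```

A job written this way has until the vacate timeout described above to leave its loop; if it has not exited by then, the hard kill follows.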
Comment 1 Luigi Toscano 2013-01-25 11:46:22 EST
See also: http://research.cs.wisc.edu/htcondor/manual/v7.8/9_4Development_Release.html#SECTION001044000000000000000 (the paragraph beginning "The new submit command want_graceful_removal ...")
Comment 3 Robert Rati 2013-02-01 10:23:05 EST
A new parameter named GRACEFULLY_REMOVE_JOBS, which defaults to true, determines whether jobs will be removed gracefully by default and use any custom signals defined. A job can override this setting by specifying want_graceful_removal in the job ad. Also, the default config has MACHINEMAXVACATETIME set to KILLING_TIMEOUT-1. Fixed on branch: BZ904178-orig-signal-escalation-semantics
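The semantics described in this comment can be sketched as a small decision function. This is a simplified model under stated assumptions, not the actual condor implementation: the function name plan_removal and the dict-based job ad and config are hypothetical, the precedence of remove_kill_sig over kill_sig is assumed, and the timeout rule follows the bug description (JobMaxVacateTime if set by the job, otherwise MACHINEMAXVACATETIME).

```python
def plan_removal(job_ad, config):
    """Model how a removal would proceed for one job (illustrative only).

    job_ad and config are plain dicts standing in for the job ad and the
    condor configuration.
    """
    # The job-level want_graceful_removal overrides the pool-wide
    # GRACEFULLY_REMOVE_JOBS, which defaults to true.
    graceful = job_ad.get(
        "want_graceful_removal",
        config.get("GRACEFULLY_REMOVE_JOBS", True),
    )
    if not graceful:
        # No custom signal and no waiting period: immediate hard kill.
        return {"graceful": False, "signal": "KILL", "timeout": 0}

    # Assumption: remove_kill_sig wins over kill_sig; TERM is the fallback.
    sig = job_ad.get("remove_kill_sig") or job_ad.get("kill_sig") or "TERM"

    # Default config: MACHINEMAXVACATETIME = KILLING_TIMEOUT - 1.
    if "MACHINEMAXVACATETIME" in config:
        machine_max = config["MACHINEMAXVACATETIME"]
    else:
        machine_max = config["KILLING_TIMEOUT"] - 1

    # A job-specified JobMaxVacateTime replaces the machine value.
    timeout = job_ad.get("JobMaxVacateTime", machine_max)
    return {"graceful": True, "signal": sig, "timeout": timeout}
```

For example, with KILLING_TIMEOUT set to 30 (an example value), a job that defines only kill_sig would receive that signal and be hard killed 29 seconds later if it had not exited.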
Comment 6 Luigi Toscano 2013-02-17 19:09:03 EST
The feature now works according to the behavior described in comment 3 (which, in the default configuration, is _almost_ like the old default behavior; the signal escalation is now always enabled). Verified on RHEL5.9/6.4beta, i386/x86_64. condor-classads-7.8.8-0.4.1 condor-7.8.8-0.4.1