Bug 904178 - Change of behavior for code escalation/custom signal for jobs
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: 2.3
Target Release: ---
Assigned To: Robert Rati
QA Contact: Luigi Toscano
Keywords: Regression
Depends On:
Blocks:
Reported: 2013-01-25 11:44 EST by Luigi Toscano
Modified: 2013-04-15 20:48 EDT
CC List: 4 users

See Also:
Fixed In Version: condor-7.8.8-0.4.1
Doc Type: Known Issue
Doc Text:
Cause: Rebase to condor 7.8.
Consequence: The handling of when custom kill signals are used has been improved.
Workaround (if any):
Result: Custom kill signals are no longer used during a fast shutdown. A job that is to use custom kill signals will be removed gracefully. Additionally, some instances where custom kill signals were sent more than once have been removed. Only one custom kill signal will be sent before the job is hard killed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-19 12:39:27 EDT
Type: Bug


External Trackers
Tracker ID Priority Status Summary Last Updated
Condor 2536 None None None Never

Description Luigi Toscano 2013-01-25 11:44:04 EST
Description of problem:
HTCondor 7.8 changed the way a job can specify how to receive a custom signal and the related timeout. 

Before: the job had to specify the custom signal as (Remove_)kill_sig. Upon rm, the signal was sent, followed by a kill after KILLING_TIMEOUT-1 (or kill_sig_timeout if specified).
The same timeout was used for vacate_job if the job didn't react to the signal; TERM was used unless kill_sig was specified.

Now: the job still has to specify the custom signal as (Remove_)kill_sig. Upon rm, for the escalation to take place, the job must also define want_graceful_removal=true; in that case the signal is sent and the escalation takes place after MACHINEMAXVACATETIME (or JobMaxVacateTime if specified by the job; kill_sig_timeout is deprecated).
The same timeout is used for vacate_job if the job does not react to the signal (but note that, if specified, the custom signal is always used even if want_graceful_removal is not specified).
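
The escalation pattern discussed above (one soft signal, a bounded wait, then a hard kill) can be illustrated with a small standalone Python sketch. This is not HTCondor code: the function name remove_with_escalation and its parameters are hypothetical, and the example assumes a POSIX system where SIGUSR1 exists.

```python
import signal
import subprocess
import sys


def remove_with_escalation(proc, soft_signal, timeout_s):
    """Send soft_signal once, wait up to timeout_s seconds, then hard-kill.

    Hypothetical helper, not part of HTCondor.
    Returns "soft" if the process exited after the first signal,
    "hard" if it had to be killed.
    """
    proc.send_signal(soft_signal)
    try:
        proc.wait(timeout=timeout_s)
        return "soft"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return "hard"


if __name__ == "__main__":
    # A child with no SIGUSR1 handler: the default action terminates it.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(remove_with_escalation(child, signal.SIGUSR1, timeout_s=5))
```

In this sketch the soft signal is sent exactly once, which mirrors the "only one custom kill signal before the hard kill" behavior described in the Doc Text; the timeout plays the role of KILLING_TIMEOUT-1 or MACHINEMAXVACATETIME.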


Expected results:
Either change the default behaviour (at least remove the need for want_graceful_removal and restore the previous timeout) or properly document and explain the change. The names of the new variables (MACHINEMAXVACATETIME/JobMaxVacateTime instead of KILLING_TIMEOUT/kill_sig_timeout) should probably be documented anyway, unless the old ones are kept as aliases.
Comment 1 Luigi Toscano 2013-01-25 11:46:22 EST
See also:
http://research.cs.wisc.edu/htcondor/manual/v7.8/9_4Development_Release.html#SECTION001044000000000000000

The paragraph "The new submit command want_graceful_removal ..."
Comment 3 Robert Rati 2013-02-01 10:23:05 EST
A new parameter named GRACEFULLY_REMOVE_JOBS, which defaults to true, determines whether jobs will be removed gracefully by default and use any custom signals defined.  A job can override this setting by specifying want_graceful_removal in the job ad.

Also, the default config has MACHINEMAXVACATETIME set to KILLING_TIMEOUT-1.
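
As a rough illustration of how the settings in this comment interact, here is a Python sketch that resolves the effective removal behavior from a job ad and a configuration. It only models what is described in this report: the function effective_removal_settings, the dictionary layout, the 30-second fallback for KILLING_TIMEOUT, and the SIGTERM fallback for removal are all assumptions made for the example, not HTCondor's actual implementation.

```python
DEFAULT_KILLING_TIMEOUT = 30  # assumed placeholder value, in seconds


def effective_removal_settings(job_ad, config):
    """Resolve how a job would be removed, per the rules in this report.

    Hypothetical model: job_ad and config are plain dicts.
    """
    # A job-level want_graceful_removal overrides GRACEFULLY_REMOVE_JOBS,
    # which itself defaults to true.
    graceful = job_ad.get(
        "want_graceful_removal", config.get("GRACEFULLY_REMOVE_JOBS", True)
    )
    # MACHINEMAXVACATETIME defaults to KILLING_TIMEOUT - 1.
    killing_timeout = config.get("KILLING_TIMEOUT", DEFAULT_KILLING_TIMEOUT)
    machine_max = config.get("MACHINEMAXVACATETIME", killing_timeout - 1)
    # The job may supply its own timeout via JobMaxVacateTime.
    timeout = job_ad.get("JobMaxVacateTime", machine_max)
    # remove_kill_sig takes precedence over kill_sig for removal.
    custom = job_ad.get("remove_kill_sig", job_ad.get("kill_sig"))
    if graceful:
        # SIGTERM as the fallback signal is an assumption of this sketch.
        return {"graceful": True, "signal": custom or "SIGTERM", "timeout": timeout}
    # Non-graceful removal: no custom signal, immediate hard kill.
    return {"graceful": False, "signal": "SIGKILL", "timeout": 0}
```

For example, calling effective_removal_settings({"kill_sig": "SIGUSR1"}, {"KILLING_TIMEOUT": 20}) in this model yields a graceful removal with SIGUSR1 and a 19-second wait before the hard kill.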

Fixed on branch:
BZ904178-orig-signal-escalation-semantics
Comment 6 Luigi Toscano 2013-02-17 19:09:03 EST
The feature now works according to the behavior described in comment 3 (which, in the default configuration, is _almost_ like the old default behavior: the signal escalation is now always enabled).

Verified on RHEL5.9/6.4beta, i386/x86_64.

condor-classads-7.8.8-0.4.1
condor-7.8.8-0.4.1
