Bug 904178

Summary:	Change of behavior for code escalation/custom signal for jobs
Product:	Red Hat Enterprise MRG	Reporter:	Luigi Toscano <ltoscano>
Component:	condor	Assignee:	Robert Rati <rrati>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Luigi Toscano <ltoscano>
Severity:	high	Docs Contact:
Priority:	high
Version:	Development	CC:	esammons, matt, rrati, tstclair
Target Milestone:	2.3	Keywords:	Regression
Target Release:	---
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:	condor-7.8.8-0.4.1	Doc Type:	Known Issue
Doc Text:	Cause: Rebase to condor 7.8. Consequence: The handling of when custom kill signals are used has been improved. Workaround (if any): Result: Custom kill signals are no longer used during a fast shutdown. If a job is to use custom kill signals, it will be gracefully removed. Additionally, some instances where custom kill signals were sent more than once have been removed. Only one custom kill signal will be sent before the job is hard killed.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-03-19 16:39:27 UTC	Type:	Bug

Description Luigi Toscano 2013-01-25 16:44:04 UTC

Description of problem:
HTCondor 7.8 changed the way a job can specify how to receive a custom signal and the related timeout. 

Before: the job had to specify the custom signal as (Remove_)kill_sig. Upon rm, the signal was sent, followed by a hard kill after KILLING_TIMEOUT-1 (or kill_sig_timeout if specified).
The same timeout was used for vacate_job if the job didn't react to the signal; TERM was used unless kill_sig was specified.
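
For illustration, the "send a signal, wait, then hard kill" escalation can be sketched as below. This is a minimal Python sketch for a POSIX system, not HTCondor's actual code; the function name remove_with_escalation is hypothetical, and timeout_s stands in for KILLING_TIMEOUT-1 (or kill_sig_timeout).

```python
import signal
import subprocess


def remove_with_escalation(proc, soft_signal, timeout_s):
    """Send soft_signal to proc, then hard-kill it if it is still alive.

    Hypothetical helper: timeout_s plays the role of KILLING_TIMEOUT-1
    (or kill_sig_timeout). Returns True if the hard kill was needed.
    """
    # First step: deliver the custom (or default) signal once.
    proc.send_signal(soft_signal)
    try:
        # Give the job a chance to exit on its own.
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        # Escalation: the job did not react in time, so send SIGKILL.
        proc.kill()
        proc.wait()
        return True
```

A job that traps or ignores the custom signal would be hard-killed once timeout_s elapses; a job that exits on the signal never sees the SIGKILL.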

Now: the job has to specify the custom signal as (Remove_)kill_sig. Upon rm, in order for the escalation to take place, the job must also define want_graceful_exit=true; in this case the signal is sent and then the escalation takes place after MACHINEMAXVACATETIME (or JobMaxVacateTime if specified by the job; kill_sig_timeout is deprecated).
The same timeout is used for vacate_job if the job does not react to the signal (but note that, if specified, the custom signal is always used even if want_graceful_exit is not specified).
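
To make the reported 7.8 behavior easier to compare, here is a rough Python model of it. It is only a sketch of what this report describes, not HTCondor code: the function name is hypothetical, the dictionary keys simply reuse the names as spelled in this report, and three points are assumptions (remove_kill_sig taking precedence over kill_sig on rm, TERM being the vacate default as before, and None meaning "no signal escalation").

```python
def plan_as_reported(action, job, machine_max_vacate_time):
    """Return (signal_name, seconds_before_hard_kill), or None if no escalation.

    action: "rm" or "vacate".
    job: dict of job settings, keyed with the names used in this report.
    machine_max_vacate_time: value of MACHINEMAXVACATETIME.
    """
    # JobMaxVacateTime, if set by the job, replaces the machine-wide timeout.
    timeout = job.get("JobMaxVacateTime", machine_max_vacate_time)

    if action == "rm":
        # Assumption: remove_kill_sig is preferred over kill_sig on removal.
        sig = job.get("remove_kill_sig") or job.get("kill_sig")
        # On rm, the escalation only happens if the job also opted in.
        if sig is None or not job.get("want_graceful_exit", False):
            return None
        return sig, timeout

    if action == "vacate":
        # The custom signal is used regardless of want_graceful_exit.
        # Assumption: TERM remains the default when kill_sig is not set.
        return job.get("kill_sig", "SIGTERM"), timeout

    raise ValueError("action must be 'rm' or 'vacate'")
```

With only kill_sig set, plan_as_reported("rm", ...) yields None while plan_as_reported("vacate", ...) still uses the custom signal, which is the asymmetry this report is about.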


Expected results:
Either change the default behaviour (at least remove the need for want_graceful_exit and restore the previous timeout) or properly document and explain the change. The names of the new variables (MACHINEMAXVACATETIME/JobMaxVacateTime instead of KILLING_TIMEOUT/kill_sig_timeout) should probably be documented anyway, unless the old ones are kept as aliases.

Comment 1 Luigi Toscano 2013-01-25 16:46:22 UTC

See also:
http://research.cs.wisc.edu/htcondor/manual/v7.8/9_4Development_Release.html#SECTION001044000000000000000

The paragraph "The new submit command want_graceful_removal ..."

Comment 3 Robert Rati 2013-02-01 15:23:05 UTC

A new parameter named GRACEFULLY_REMOVE_JOBS, which defaults to true, determines whether jobs will be removed gracefully by default and use any custom signals defined.  A job can override this setting by specifying want_graceful_removal in the job ad.

Also, the default config has MACHINEMAXVACATETIME set to KILLING_TIMEOUT-1.
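
As a rough illustration of how these defaults combine, here is a small Python sketch. It is not the actual implementation; the function names are hypothetical, and config and job_ad are plain dicts standing in for the daemon configuration and the job ad.

```python
def resolve_graceful_removal(config, job_ad):
    """Decide whether a job is removed gracefully (custom signals honored).

    GRACEFULLY_REMOVE_JOBS defaults to True; want_graceful_removal in the
    job ad, when present, overrides it.
    """
    default = config.get("GRACEFULLY_REMOVE_JOBS", True)
    return job_ad.get("want_graceful_removal", default)


def resolve_machine_max_vacate_time(config):
    """Return MACHINEMAXVACATETIME, defaulting to KILLING_TIMEOUT - 1."""
    if "MACHINEMAXVACATETIME" in config:
        return config["MACHINEMAXVACATETIME"]
    # KILLING_TIMEOUT must be present in config for this sketch.
    return config["KILLING_TIMEOUT"] - 1
```

With an empty job ad and an untouched configuration, removal is graceful and the escalation timeout matches the old KILLING_TIMEOUT-1 value.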

Fixed on branch:
BZ904178-orig-signal-escalation-semantics

Comment 6 Luigi Toscano 2013-02-18 00:09:03 UTC

The feature now works according to the behavior described in comment 3 (which, in the default configuration, is _almost_ like the old default behavior; the signal escalation is now always enabled).

Verified on RHEL5.9/6.4beta, i386/x86_64.

condor-classads-7.8.8-0.4.1
condor-7.8.8-0.4.1