614518 – Clarification of custom kill signals

Summary: Clarification of custom kill signals

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	Grid_User_Guide
Sub Component:
Version:	Development
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	1.3
Target Release:	---
Assignee:	Lana Brindley
QA Contact:	Lubos Trilety
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-07-14 16:15 UTC by Robert Rati
Modified:	2013-10-23 23:16 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-10-14 20:09:05 UTC
Target Upstream Version:
Embargoed:


Description Robert Rati 2010-07-14 16:15:11 UTC

Description of problem:
Suggest clarifying the custom kill signal section (4.3) to something like:

When a vanilla universe job needs to be removed from an execute node (e.g. by condor_rm or condor_hold), the starter will send the running process a SIGKILL, which will forcibly kill it and not allow for any cleanup.  To avoid this hard kill of the process, it is possible to define a custom signal that the starter will send to the process whenever the process needs to be killed.

If a custom kill signal is used, the starter will wait killing_timeout-1 seconds before determining that the process is not responding or exiting, and will then send the process a SIGKILL.  Alternatively, a job may specify kill_sig_timeout in the job submit file to define how long the starter should wait after sending the custom kill signal before sending the SIGKILL.  However, it is not possible to exceed killing_timeout-1, as the starter will wait the shorter of killing_timeout-1 and kill_sig_timeout.

To use custom signals, define 'kill_sig' in the job description file.  For example, to use signal 1 (SIGHUP), add:

kill_sig = 1
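
For illustration, a custom kill signal only helps if the job itself catches it. The following is a minimal Python sketch of the job side, assuming the job is a Python script and that the submit file sets kill_sig = 1 (SIGHUP) as above. The names cleanup_requested, handle_kill_signal and run_job are hypothetical and are not part of Condor.

```python
import signal
import time

# Flag set by the handler and polled by the main loop (hypothetical name).
cleanup_requested = False

def handle_kill_signal(signum, frame):
    """Record that the custom kill signal arrived; the main loop does the cleanup."""
    global cleanup_requested
    cleanup_requested = True

# Assumption: the submit file contains "kill_sig = 1", i.e. SIGHUP.
signal.signal(signal.SIGHUP, handle_kill_signal)

def run_job(max_steps=1000, step_seconds=0.01):
    """Work in small steps and stop cleanly once the signal has been seen."""
    for _ in range(max_steps):
        if cleanup_requested:
            # Flush files, remove temporaries, etc. here, before the
            # starter's timeout expires and SIGKILL is sent.
            return "cleaned up"
        time.sleep(step_seconds)
    return "finished"
```

Keeping the handler down to setting a flag means the cleanup itself runs in normal program flow, which is usually easier to reason about than doing work inside the handler.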


Comment 1 Lana Brindley 2010-07-18 23:51:09 UTC

This section has been updated with these changes as a result of feedback from Jon Thomas. Please review the section (available on the stage within the next 24 hours) and provide information on any further changes required.

LKB

Comment 2 Robert Rati 2010-07-19 21:02:57 UTC

"and the SIGKILL command is determined by a value defined in killing_timeout or kill_sig_timeout" => and the SIGKILL command is determined by the lesser of killing_timeout-1 and kill_sig_timeout.

"The starter will wait a specified length of time for the job to complete, and if it has not completed in that time, it will be hard-killed." => Should gut as we've covered above.  However, should replace with "The startd will wait a period of time after initiating the job termination before determining that the starter isn't responding and needs to be killed.  The period is defined by the killing_timeout parameter."


Step 2:
"wait the number of seconds defined in the killing_timeout configuration variable" => wait killing_timeout-1 seconds before sending the SIGKILL signal.


With the above, I feel there is a repetition of information regarding which of killing_timeout-1 and kill_sig_timeout will be used.  I don't see a cleaner way to present the information atm, but I'm sure it exists.
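
As a rough illustration of the sequence described in this comment (send the custom signal, wait, then fall back to SIGKILL), here is a minimal Python sketch. It is not how the Condor starter is implemented; stop_process is a hypothetical helper, and wait_seconds stands in for whichever timeout applies.

```python
import signal
import subprocess

def stop_process(proc, kill_sig, wait_seconds):
    """Send kill_sig to proc, wait up to wait_seconds, then fall back to SIGKILL.

    Returns True if the process had to be hard-killed, False if it exited
    on its own within the timeout.
    """
    proc.send_signal(kill_sig)
    try:
        proc.wait(timeout=wait_seconds)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX systems
        proc.wait()  # reap the child so it does not linger as a zombie
        return True
```

With proc created by subprocess.Popen, a call such as stop_process(proc, signal.SIGHUP, 29) would correspond to kill_sig = 1 and a 29 second wait.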

Comment 3 Lana Brindley 2010-07-20 00:01:26 UTC

(In reply to comment #2)

You're right. I've taken the duplicate information out of the initial paragraph. It doesn't need to be presented in order to understand the procedure, so describing it in the procedure itself seems more reasonable. Let me know if you want that changed again.

LKB

Comment 4 Robert Rati 2010-07-20 15:20:29 UTC

"The starter will wait a period of time after initiating the job termination before determining that the starter isn't responding and needs to be killed" => The startd will wait ...

Step 2:
"wait the number of seconds defined in the killing_timeout configuration variable" => wait killing_timeout-1 seconds before sending the SIGKILL signal.

Comment 5 Lana Brindley 2010-07-20 20:30:23 UTC

(In reply to comment #4)

Sorry, but I don't understand the value in making this change. For starters, I don't like the practice of saying "wait [variable_name] time". Using the name of the variable to explain something can defeat the purpose of trying to explain it. Even if [variable_name] seems self-explanatory, it's good practice to explain it in other words to aid reader understanding. Given that, I don't see that the suggested version of the sentence gives any more information than the original. In fact, I would argue that it gives less information overall.

I'm going to send this document to MRG QE review today, so this round of technical review is now completed. Any further amendments will be made in the next round.

LKB

Comment 6 Robert Rati 2010-07-20 20:56:30 UTC

<rsquared> re: 614518, the reason I'm stating killing_timeout-1 seconds is because the starter waits for 'value defined in killing_timeout' minus 1.  The docs keep saying 'value defined in killing_timeout', but that is not correct.  It is that value -1.  However you want to state value-1 is fine, so long as it's there. :)
<Lana> ahhh i see
<Lana> why the -1?
<Lana> that doesn't seem quite sane to me
<rsquared> Because the startd waits killing_timeout
<Lana> ahhh i geddit
<Lana> ok
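
The timing rule discussed above can be sketched as a small Python function. This is only an illustration of the arithmetic as described in this bug, not Condor code; starter_wait_seconds is a hypothetical name, and both arguments are assumed to be in seconds.

```python
def starter_wait_seconds(killing_timeout, kill_sig_timeout=None):
    """Seconds the starter waits after the custom kill signal before SIGKILL."""
    # The startd waits killing_timeout, so the starter stops one second earlier.
    starter_limit = killing_timeout - 1
    if kill_sig_timeout is None:
        return starter_limit
    # A job can shorten the wait with kill_sig_timeout but cannot extend it.
    return min(starter_limit, kill_sig_timeout)
```

For example, with killing_timeout set to 30, the function gives 29 when no kill_sig_timeout is set, 10 for a kill_sig_timeout of 10, and 29 again for a kill_sig_timeout of 60.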

Comment 7 Lana Brindley 2010-08-01 22:36:00 UTC

<para>
	By default, the starter will wait the number of seconds defined in the <command>killing_timeout</command> configuration variable, less one second. It is also possible to set a timeout value in the job description file, using the <command>kill_sig_timeout</command> parameter. The starter will wait the shorter of the two values.
</para>

LKB

Comment 8 Lubos Trilety 2010-09-01 13:39:31 UTC

Chapter checked

>>> VERIFIED
