Description of problem:

Suggest clarifying the custom signal section (4.3) to something like:

When a vanilla universe job needs to be removed from an execute node (e.g. by condor_rm, condor_hold, etc.), the starter will send the running process a SIGKILL, which forcibly kills it and does not allow for any clean-up. To avoid this hard kill of the process, it is possible to define a custom signal that the starter will send to the process in cases where the process needs to be killed. If a custom kill signal is used, the starter will wait killing_timeout-1 seconds before determining that the process is not responding or exiting, and will then send the process a SIGKILL. Alternatively, a job may specify kill_sig_timeout in the job submit file to define how long the starter should wait after sending the custom kill signal before sending the SIGKILL. However, it is not possible to exceed killing_timeout-1, as the starter will wait the shorter of killing_timeout-1 and kill_sig_timeout.

To use custom signals, define 'kill_sig' in the job description file. For example, to use signal 1 (SIGHUP), add:

kill_sig = 1
This section has been updated with these changes as a result of feedback from Jon Thomas. Please review the section (available on the stage within the next 24 hours) and provide information on any further changes required. LKB
"and the SIGKILL command is determined by a value defined in killing_timeout or kill_sig_timeout" => and the SIGKILL command is determined by the lesser value defined by killing_timeout-1 or kill_sig_timeout. "The starter will wait a specified length of time for the job to complete, and if it has not completed in that time, it will be hard-killed." => Should gut as we've covered above. However, should replace with "The startd will wait a period of time after initiating the job termination before determining that the starter isn't responding and needs to be killed. The period is defined by the killing_timeout parameter." Step 2: "wait the number of seconds defined in the killing_timeout configuration variable" => wait killing_timeout-1 seconds before sending the SIGKILL signal. With the above, I feel there is a repetition of information regarding which of killing_timeout-1 vs. kill_sig_timeout will be used. I don't see a cleaner way to present the information atm, but I'm sure it exists.
(In reply to comment #2)
> "and the SIGKILL command is determined by a value defined in killing_timeout or
> kill_sig_timeout" => and the SIGKILL command is determined by the lesser value
> defined by killing_timeout-1 or kill_sig_timeout.
>
> "The starter will wait a specified length of time for the job to complete, and
> if it has not completed in that time, it will be hard-killed." => Should gut as
> we've covered above. However, should replace with "The startd will wait a
> period of time after initiating the job termination before determining that the
> starter isn't responding and needs to be killed. The period is defined by the
> killing_timeout parameter."
>
> Step 2:
> "wait the number of seconds defined in the killing_timeout configuration
> variable" => wait killing_timeout-1 seconds before sending the SIGKILL signal.
>
> With the above, I feel there is a repetition of information regarding which of
> killing_timeout-1 vs. kill_sig_timeout will be used. I don't see a cleaner way
> to present the information atm, but I'm sure it exists.

You're right. I've taken the duplicate information out of the initial paragraph. It doesn't need to be presented in order to understand the procedure, so describing it in the procedure itself seems more reasonable. Let me know if you want that changed again. LKB
"The starter will wait a period of time after initiating the job termination before determining that the starter isn't responding and needs to be killed" => The startd will wait ... Step 2: "wait the number of seconds defined in the killing_timeout configuration variable" => wait killing_timeout-1 seconds before sending the SIGKILL signal.
(In reply to comment #4)
> "The starter will wait a period of time after initiating the job termination
> before determining that the starter isn't responding and needs to be killed" =>
> The startd will wait ...
>
> Step 2:
> "wait the number of seconds defined in the killing_timeout configuration
> variable" => wait killing_timeout-1 seconds before sending the SIGKILL signal.

Sorry, but I don't understand the value in making this change. For starters, I don't like the practice of saying "wait [variable_name] time". Using the name of a variable to explain something can defeat the purpose of explaining it. Even if [variable_name] seems self-explanatory, it's good practice to explain it in other words to aid reader understanding. Given that, I don't see that the suggested version of the sentence gives any more information than the original. In fact, I would argue that it gives less information overall. I'm going to send this document to MRG QE review today, so this round of technical review is now completed. Any further amendments will be made in the next round. LKB
<rsquared> re: 614518, the reason I'm stating killing_timeout-1 seconds is because the starter waits for 'value defined in killing_timeout' minus 1. The docs keep saying 'value defined in killing_timeout', but that is not correct. It is that value -1. However you want to state value-1 is fine, so long as it's there. :)
<Lana> ahhh i see
<Lana> why the -1?
<Lana> that doesn't seem quite sane to me
<rsquared> Because the startd waits killing_timeout
<Lana> ahhh i geddit
<Lana> ok
<para> By default, the starter will wait the number of seconds defined in the <command>killing_timeout</command> configuration variable, less one second. It is also possible to set a timeout value in the job description file, using the <command>kill_sig_timeout</command> parameter. The starter will wait the shorter of the two values. </para> LKB
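As a cross-check of the wording above, the rule can be sketched as a small function. This is only an illustration of the behaviour described in this bug, not the starter's actual code; `starter_wait_seconds` is a hypothetical name and the values used are illustrative rather than documented defaults.

```python
from typing import Optional


def starter_wait_seconds(killing_timeout: int,
                         kill_sig_timeout: Optional[int] = None) -> int:
    """Seconds the starter waits after the custom kill signal before SIGKILL.

    Sketch of the rule discussed in this bug: the starter's own limit is
    killing_timeout minus one second, and a job-supplied kill_sig_timeout
    can shorten that wait but never lengthen it.
    """
    default_wait = killing_timeout - 1
    if kill_sig_timeout is None:
        return default_wait
    return min(default_wait, kill_sig_timeout)


# Illustrative values only.
print(starter_wait_seconds(30))      # no kill_sig_timeout in the submit file
print(starter_wait_seconds(30, 10))  # job asks for a shorter wait
print(starter_wait_seconds(30, 60))  # job asks for longer than is allowed
```

The third call shows the case the description warns about: a kill_sig_timeout larger than killing_timeout-1 has no effect, because the startd itself only waits killing_timeout before acting on the starter.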
Chapter checked >>> VERIFIED