Description of problem:

Change all HOSTALLOW_ -> ALLOW_ && HOSTDENY_ -> DENY_

Set USE_PROCD = FALSE in the startd configuration => STARTD.USE_PROCD = FALSE & STARTER.USE_PROCD = FALSE in the startd configuration.

"The startd will always wait the value specified in the killing_timeout parameter before hard-killing the job" => The startd will always wait the value specified in the killing_timeout parameter before hard-killing the starter

"However, the starter will always wait for the value specified in the killing_timeout-1 configuration variable before attempting to hard-kill the job" => However, by default the starter will wait killing_timeout-1 before attempting to hard-kill the job.
(In reply to comment #0)
> Description of problem:
> Change all HOSTALLOW_ -> ALLOW_ && HOSTDENY_ -> DENY_

Done.

>
> Set USE_PROCD = FALSE in the startd configuration => STARTD.USE_PROCD = FALSE &
> STARTER.USE_PROCD = FALSE in the startd configuration.

<listitem>
  <para>
    Set <command>STARTD.USE_PROCD = FALSE</command> and <command>STARTER.USE_PROCD = FALSE</command> in the startd configuration. This is the most reliable way to handle the situation.
  </para>
</listitem>

>
> "The startd will always wait the value specified in the killing_timeout
> parameter before hard-killing the job" => The startd will always wait the value
> specified in the killing_timeout parameter before hard-killing the starter
>
> "However, the starter will always wait for the value specified in the
> killing_timeout-1 configuration variable before attempting to hard-kill the
> job" => However, by default the starter will wait killing_timeout-1 before
> attempting to hard-kill the job.

<para>
  When you try to kill a job with a custom signal, it can sometimes cause a race condition to occur between the starter and the startd. This happens when the startd communicates with the starter using <command>procd</command>. The startd will always wait the value specified in the <parameter>killing_timeout</parameter> parameter before hard-killing the starter. However, by default the starter will wait for the value specified in the <parameter>killing_timeout-1</parameter> configuration variable before attempting to hard-kill the job. This means that it is sometimes possible for the startd to be attempting to hard-kill the starter, while the starter is cleaning up and exiting. It causes the starter to stop communicating with the <command>procd</command>, which makes the startd suffer a communication failure, and then crash.
</para>

LKB
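
For reference, the two settings named in the listitem above might look roughly like this as a configuration fragment. This is only a sketch: the parameter names come from this report, while the choice of file (a local configuration file read on the execute node) is an assumption and is not specified anywhere in the thread.

```
# Sketch only: placement in a local config file on the execute node is an assumption.
# Disable the procd for both daemons, as requested in comment #0.
STARTD.USE_PROCD = FALSE
STARTER.USE_PROCD = FALSE
```

The subsystem prefixes (STARTD. and STARTER.) limit each setting to the named daemon, which appears to be the point of replacing the unprefixed USE_PROCD = FALSE in the original text.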
No HOSTALLOW/HOSTDENY in grid user guide. Chapter was correctly changed. >>> VERIFIED