It looks like condor_rm isn't sending a {{SIGKILL}} to a running job that has previously been sent a {{SIGTERM}} via condor_vacate_job.

Steps to Reproduce:
1) Submit a job that sleeps for 600 seconds and ignores SIGTERM signals.
2) Issue condor_vacate_job on the job. The job continues to run because it ignores SIGTERM.
3) Issue condor_rm on the job.

Actual results: The job runs to completion and remains in the X state in condor_q.

Expected results: The job is killed and removed immediately.

Additional info: If I log into the machine the job is running on and explicitly send kill -9 to the running test job, the job leaves the pool. The hang appears to be in deactivate_claim.
Looks like what happens is the job never comes out of deactivate_claim/ShutdownGraceful, so the claim is never released. If you run condor_rm after condor_vacate_job, the job appears in condor_q, but not in any runnable state. Adding more jobs shows the slot is no longer available.

StartLog:
02/10 15:55:29 slot1: Changing activity: Idle -> Busy
02/10 15:56:28 slot1: Called deactivate_claim()

slot1 StarterLog:
02/10 15:56:28 Got SIGTERM. Performing graceful shutdown.
02/10 15:56:28 ShutdownGraceful all jobs.

 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   jrthomas       2/10 15:55   0+00:00:00 X  0   0.0  signaltest

0 jobs; 0 idle, 0 running, 0 held
Trace of the call chain:

Resource::deactivate_claim( void )
  ..prints dprintf(D_ALWAYS, "Called deactivate_claim()\n")
  Claim::deactivateClaim( bool graceful )
    starterKillSoft();
      c_starter->killSoft();
        kill(DC_SIGSOFTKILL)   ...translates to SIGTERM
          reallykill( signo, 0 );
A bit more info: it looks like it's not coming back from Send_Signal.

02/10 16:37:09 slot1: Called deactivate_claim()
02/10 16:37:09 slot1: In Starter::kill() with pid 11761, sig 15 (SIGTERM)
02/10 16:37:09 Send_Signal(): Doing kill(11761,15) [SIGTERM]
02/10 16:37:09 ShutdownGraceful all jobs.
02/10 16:37:09 in VanillaProc::ShutdownGraceful()
02/10 16:37:09 Send_Signal(): Doing kill(11762,15) [SIGTERM]
This is where we get to ShutdownGraceful:

daemonCore->Register_Signal(DC_SIGSOFTKILL, "DC_SIGSOFTKILL",
    (SignalHandlercpp)&CStarter::RemoteShutdownGraceful,
    "RemoteShutdownGraceful", this)
There are a couple of issues here:

1) If the job catches sig 15 (SIGTERM) and does nothing, a normal condor_vacate_job will not vacate the job. There is no signal escalation. The workaround is to use "condor_vacate_job -fast", which calls deactivate_claim_forcibly and uses SIGKILL, which cannot be caught.

2) If the job was vacated with condor_vacate_job, condor_rm will not remove the job from the queue because the claim is never released.

I think the fix here is probably to duplicate the condor_rm escalation code in the vacate path. In this case we would only need to escalate to SIGKILL.
Created attachment 411609 [details]: vacate patch

Patch to provide signal escalation in the vacate path. Reuses the condor_rm escalation timer.
The patch tests out well:
- In a job that catches and ignores sig 15, the timer fires and SIGKILLs the job.
- In a job that catches sig 15 and exits, the job exits and VanillaProc::JobReaper cancels the timer.
- In a job that doesn't catch the signal, the job exits on the signal and VanillaProc::JobReaper cancels the timer.

In all three cases the job is restarted, as one expects with a vacate.
This should be fixed by https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1392. It can be picked up as a backport or come in as part of a 7.6 rebase.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: Jobs are only sent SIGTERM on vacate
C: A job ignoring SIGTERM may never be evicted
F: A signal escalation timer was added to the job vacate code path
R: The job is killed with SIGKILL after ignoring SIGTERM
The fix has been re-introduced upstream with some additional logic. The starter will only escalate signals if the graceful shutdown was not caused by a state change in the startd; in that case, the startd signals the starter about the state change before it sends the graceful shutdown. When the graceful shutdown did not come from a state change, the starter will escalate and kill the job after KILLING_TIMEOUT or a timeout defined in the job, whichever is less.

Fix pushed to uw/master and the V7_6-branch.
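The "whichever is less" selection is trivial but worth pinning down, since it decides how long an ill-behaved job can linger after a vacate. A sketch (KILLING_TIMEOUT is a real startd config knob; the function name and the job-attribute plumbing are illustrative):

```python
def escalation_delay(killing_timeout, job_timeout=None):
    """Seconds to wait after SIGTERM before escalating to SIGKILL.

    Per the upstream fix: use KILLING_TIMEOUT or a timeout defined in
    the job, whichever is less. How the job attribute is read is omitted.
    """
    if job_timeout is None:
        return killing_timeout          # job set no timeout: config wins
    return min(killing_timeout, job_timeout)
```

So a job can shorten the grace period below the pool's KILLING_TIMEOUT, but never lengthen it.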
Successfully reproduced on:
$CondorVersion: 7.4.5 Feb 4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

# echo -e "cmd=/root/test.sh\nargs=600\nqueue" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s).
1 job(s) submitted to cluster 1.
# condor_vacate_job 1.0
Job 1.0 vacated
# condor_rm 1.0
Job 1.0 marked for removal
# condor_q
-- Submitter: hostname : <IP:32780> : hostname
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0   condor         5/18 14:23   0+00:00:00 X  0   0.0  test.sh 600

0 jobs; 0 idle, 0 running, 0 held
# ps -eaf | grep test
condor    3976  3975  0 14:23 ?        00:00:00 /bin/bash /root/test.sh 600
Tested on:
$CondorVersion: 7.6.1 May 17 2011 BuildID: RH-7.6.1-0.5.el5 $ $CondorPlatform: I686-RedHat_5.6 $
$CondorVersion: 7.6.1 May 17 2011 BuildID: RH-7.6.1-0.5.el5 $ $CondorPlatform: X86_64-RedHat_5.6 $
$CondorVersion: 7.6.1 May 17 2011 BuildID: RH-7.6.1-0.5.el6 $ $CondorPlatform: I686-RedHat_6.0 $
$CondorVersion: 7.6.1 May 17 2011 BuildID: RH-7.6.1-0.5.el6 $ $CondorPlatform: X86_64-RedHat_6.0 $

# echo -e "cmd=/root/test.sh\nargs=600\nqueue" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s).
1 job(s) submitted to cluster 1.
# condor_vacate_job 1.0
Job 1.0 vacated
# condor_rm 1.0
Job 1.0 marked for removal
# condor_q
-- Submitter: hostname : <IP:57088> : hostname
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
# ps -eaf | grep test
#

>>> VERIFIED
An advisory has been issued which should address the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html