Bug 563337 - condor_rm doesn't seem to work on a job that has been issued a condor_vacate_job
Summary: condor_rm doesn't seem to work on a job that has been issued a condor_vacate_job
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: 2.0
Assignee: Matthew Farrellee
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 693778
 
Reported: 2010-02-09 21:52 UTC by Jon Thomas
Modified: 2018-10-27 15:28 UTC (History)
6 users

Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text:
C: Jobs are only sent SIGTERM on vacate
C: A job ignoring SIGTERM may never be evicted
F: A signal escalation timer was added to the job vacate code path
R: Job will be killed with SIGKILL after ignoring SIGTERM
Clone Of:
Environment:
Last Closed: 2011-06-23 15:40:48 UTC
Target Upstream Version:
Embargoed:


Attachments
vacate patch (660 bytes, patch)
2010-05-05 13:48 UTC, Jon Thomas
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2011:0889 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Grid 2.0 Release 2011-06-23 15:35:53 UTC

Description Jon Thomas 2010-02-09 21:52:27 UTC
It looks like condor_rm isn't sending a SIGKILL to a running job that has previously been sent a SIGTERM via condor_vacate_job.

Steps to Reproduce:
1) Submit a job that sleeps for 600 seconds and ignores SIGTERM (a minimal example is sketched at the end of this comment).
2) Issue a condor_vacate_job on the job. The job should continue to run because it is ignoring SIGTERM signals.
3) Issue a condor_rm on the job.

Actual results:
The job runs to completion and remains in an X state in condor_q.

Expected results:
The job is killed and removed immediately.

Additional info:
If I log into the machine the job is running on and explicitly send a kill -9 to the running test job, the job leaves the pool.

Appears to be caused by the claim getting hung in deactivate_claim.
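
For reference, a minimal test job of the kind described in step 1 might look like the sketch below (a hypothetical stand-in, not the actual signaltest binary used in these reproductions):

// signaltest.cpp - hypothetical test job: ignore SIGTERM and sleep ~600s,
// so the soft kill sent by condor_vacate_job has no effect.
#include <signal.h>
#include <unistd.h>

int main() {
    // Ignore SIGTERM (signal 15), the signal delivered first on vacate.
    signal(SIGTERM, SIG_IGN);

    // sleep() can return early if interrupted, so loop until 600s elapse.
    unsigned int remaining = 600;
    while (remaining > 0)
        remaining = sleep(remaining);

    return 0;
}

Submitted through condor_submit, a job like this keeps running through the SIGTERM from condor_vacate_job, which is what triggers the behavior described above.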

Comment 1 Jon Thomas 2010-02-10 21:08:34 UTC
It looks like the job doesn't come out of deactivate_claim/ShutdownGraceful, so the claim is never released. If you run condor_rm after condor_vacate_job, the job still appears in condor_q, but not in any active state. Submitting more jobs shows that the slot is no longer available.


startlog

02/10 15:55:29 slot1: Changing activity: Idle -> Busy
02/10 15:56:28 slot1: Called deactivate_claim()

slot1 log

02/10 15:56:28 Got SIGTERM. Performing graceful shutdown.
02/10 15:56:28 ShutdownGraceful all jobs.


 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   jrthomas        2/10 15:55   0+00:00:00 X  0   0.0  signaltest        

0 jobs; 0 idle, 0 running, 0 held

Comment 2 Jon Thomas 2010-02-10 21:33:01 UTC
trace

Resource::deactivate_claim( void )
  ..prints dprintf(D_ALWAYS, "Called deactivate_claim()\n")

Claim::deactivateClaim( bool graceful )
  starterKillSoft();
    c_starter->killSoft();
      kill(DC_SIGSOFTKILL)   ...translates to SIGTERM
        reallykill( signo, 0 );

Comment 3 Jon Thomas 2010-02-10 21:49:26 UTC
A bit more info: it looks like it's not coming back from Send_Signal().

02/10 16:37:09 slot1: Called deactivate_claim()
02/10 16:37:09 slot1: In Starter::kill() with pid 11761, sig 15 (SIGTERM)
02/10 16:37:09 Send_Signal(): Doing kill(11761,15) [SIGTERM]

02/10 16:37:09 ShutdownGraceful all jobs.
02/10 16:37:09 in VanillaProc::ShutdownGraceful()
02/10 16:37:09 Send_Signal(): Doing kill(11762,15) [SIGTERM]

Comment 4 Jon Thomas 2010-02-10 21:51:22 UTC
This is where we get to ShutdownGraceful:

daemonCore->Register_Signal(DC_SIGSOFTKILL, "DC_SIGSOFTKILL",
		(SignalHandlercpp)&CStarter::RemoteShutdownGraceful, "RemoteShutdownGraceful",
		this);

Comment 6 Jon Thomas 2010-05-05 11:54:43 UTC
There are a couple of issues here:

1) If the job catches sig 15 and does nothing, a normal condor_vacate_job will not vacate the job. There is no escalation. The workaround is to use "condor_vacate_job -fast", which calls deactivate_claim_forcibly and uses SIGKILL, which can't be caught.

2) If the job was vacated with condor_vacate_job, condor_rm will not remove the job from the queue because the claim is never released.

I think the fix here is probably to duplicate the condor_rm escalation code in the vacate path. In this case we would only need to escalate to SIGKILL.
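
The escalation idea, sketched generically below (a hypothetical illustration only, not the actual fix, which would reuse the existing condor_rm escalation timer inside the starter): deliver SIGTERM first, give the job a grace period, and fall back to SIGKILL if it is still running.

// Generic SIGTERM -> SIGKILL escalation for a job process (sketch only).
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

void evict(pid_t job_pid, unsigned int grace_seconds) {
    kill(job_pid, SIGTERM);                  // soft kill, as on vacate

    for (unsigned int waited = 0; waited < grace_seconds; ++waited) {
        int status;
        if (waitpid(job_pid, &status, WNOHANG) == job_pid)
            return;                          // job exited; no escalation needed
        sleep(1);
    }

    kill(job_pid, SIGKILL);                  // escalate: SIGKILL can't be caught
    int status;
    waitpid(job_pid, &status, 0);            // reap the job
}

A real fix would arm a timer rather than polling, but the effect is the same: a job that ignores the soft kill is eventually KILLed.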

Comment 7 Jon Thomas 2010-05-05 13:48:01 UTC
Created attachment 411609 [details]
vacate patch

Patch to provide signal escalation in vacate. Reuses the condor_rm escalation timer.

Comment 8 Jon Thomas 2010-05-05 14:14:29 UTC
The patch seems to test well.

For a job that catches and ignores sig 15, the timer fires and KILLs the job.

For a job that catches and exits on 15, the job exits and VanillaProc::JobReaper cancels the timer.

For a job that doesn't catch the signal, the job exits on the signal and VanillaProc::JobReaper cancels the timer.

In all three cases, the job is restarted as one expects after a vacate.
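
The three cases above differ only in how the test job handles SIGTERM; a hypothetical sketch of the variants (illustrative handlers, not the actual test programs used here):

#include <signal.h>
#include <unistd.h>

// Case 1: catch and ignore SIGTERM -- the escalation timer fires and KILLs the job.
void ignore_term(int) { /* deliberately do nothing */ }

// Case 2: catch SIGTERM and exit -- the job exits and JobReaper cancels the timer.
void exit_on_term(int) { _exit(0); }

int main() {
    signal(SIGTERM, ignore_term);     // case 1
    // signal(SIGTERM, exit_on_term); // case 2
    // case 3: install no handler; the default action terminates the job
    for (;;)
        pause();                      // wait for signals indefinitely
}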

Comment 9 Matthew Farrellee 2011-01-07 18:44:12 UTC
This should be fixed in https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1392

It can be picked up as a backport or come as part of a 7.6 rebase.

Comment 10 Matthew Farrellee 2011-01-07 19:58:54 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Jobs are only sent SIGTERM on vacate
C: A job ignoring SIGTERM may never be evicted
F: A signal escalation timer was added to the job vacate code path
R: Job will be killed with SIGKILL after ignoring SIGTERM

Comment 12 Robert Rati 2011-05-11 21:55:10 UTC
The fix has been re-introduced upstream with some additional logic. Basically, the starter will only escalate signals if the graceful shutdown was not caused by a state change in the startd; in that case the startd signals the starter when it changes state, before it sends the graceful shutdown. When the graceful shutdown did not come from a state change, the starter will escalate and kill the job after KILLING_TIMEOUT or a timeout defined in the job, whichever is less.

Fix pushed on uw/master and V7_6-branch
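
A small sketch of the decision described above (hypothetical names, not actual Condor code): escalate only when the graceful shutdown did not come from a startd state change, and wait for the smaller of KILLING_TIMEOUT and the job-defined timeout before sending SIGKILL.

#include <algorithm>

// Returns the escalation delay in seconds, or -1 when no escalation
// should happen (the shutdown came from a startd state change).
int pick_escalation_delay(bool shutdown_from_state_change,
                          int killing_timeout, int job_timeout) {
    if (shutdown_from_state_change)
        return -1;
    return std::min(killing_timeout, job_timeout);
}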

Comment 13 Lubos Trilety 2011-05-18 12:29:46 UTC
Successfully reproduced on:
$CondorVersion: 7.4.5 Feb  4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

# echo -e "cmd=/root/test.sh\nargs=600\nqueue" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s).
1 job(s) submitted to cluster 1.

# condor_vacate_job 1.0
Job 1.0 vacated

# condor_rm 1.0
Job 1.0 marked for removal

# condor_q
-- Submitter: hostname : <IP:32780> : hostname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   condor          5/18 14:23   0+00:00:00 X  0   0.0  test.sh 600       
0 jobs; 0 idle, 0 running, 0 held

# ps -eaf | grep test
condor    3976  3975  0 14:23 ?        00:00:00 /bin/bash /root/test.sh 600

Comment 14 Lubos Trilety 2011-05-18 14:13:33 UTC
Tested on:
$CondorVersion: 7.6.1 May 17 2011 BuildID: RH-7.6.1-0.5.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 May 17 2011 BuildID: RH-7.6.1-0.5.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 May 17 2011 BuildID: RH-7.6.1-0.5.el6 $
$CondorPlatform: I686-RedHat_6.0 $

$CondorVersion: 7.6.1 May 17 2011 BuildID: RH-7.6.1-0.5.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $


# echo -e "cmd=/root/test.sh\nargs=600\nqueue" | runuser condor -s /bin/bash -c condor_submit
Submitting job(s).
1 job(s) submitted to cluster 1.

# condor_vacate_job 1.0
Job 1.0 vacated

# condor_rm 1.0
Job 1.0 marked for removal

# condor_q
-- Submitter: hostname : <IP:57088> : hostname
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
0 jobs; 0 idle, 0 running, 0 held

# ps -eaf | grep test
#

>>> VERIFIED

Comment 15 errata-xmlrpc 2011-06-23 15:40:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

