Bug 623460 - Periodicremove sends multiple signals
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 2.0
Assignee: Robert Rati
QA Contact: Luigi Toscano
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-08-11 20:17 UTC by Jon Thomas
Modified: 2018-11-14 18:51 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-28 15:10:31 UTC
Target Upstream Version:
Embargoed:


Attachments
patch (1022 bytes, patch)
2010-08-11 20:17 UTC, Jon Thomas
signal program (1.07 KB, text/plain)
2011-02-21 19:47 UTC, Jon Thomas

Description Jon Thomas 2010-08-11 20:17:50 UTC
Created attachment 438288 [details]
patch

In attempting to use custom signals with periodicremove, the signals are sent twice followed by SIGKILL.

I think the root cause of the problem is that the updates from the starter don't reach the shadow. I think the shadow closes the socket, which causes JICShadow::updateShadow() to return -1. It appears that the state isn't quite captured correctly when removeJob exits.

BaseShadow::removeJob()
{
  ..
  DC_Exit( JOB_SHOULD_REMOVE );
}

DC_Exit( JOB_SHOULD_REMOVE ); seems to be what causes the socket to get closed. There is some other state that isn't captured correctly either: simply commenting out DC_Exit( JOB_SHOULD_REMOVE ); doesn't fix the problem. What happens is that the job gets placed back into the idle queue and is rescheduled. Adding remRes->setExitReason( JOB_KILLED ); in cleanUp fixes this.

To reproduce, submit a job that catches and outputs the signal, then use something like this in the submit description file:

PeriodicRemove =  ((CurrentTime - EnteredCurrentStatus) > 5)

It's also visible in the shadow, starter, and starter slot logs.

I'm attaching a patch that seems to fix this.

Comment 2 Jon Thomas 2011-02-21 19:47:39 UTC
Created attachment 479985 [details]
signal program

Here's a signal-catching program. I change it around based on which signals I'm using. Comment out caught=1; to make it ignore the signal, or add some argument-handling code.

Comment 3 Jon Thomas 2011-02-21 21:23:14 UTC
The double signal still seems to be present in

$CondorVersion: 7.4.5 Feb 21 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $

executable = signaltest
universe = vanilla
output = a.out.$(cluster).$(process)
log = a.log.$(cluster).$(process)
kill_sig = 2
remove_kill_sig = 2
KillSigTimeout = 45
PeriodicRemove =  ((CurrentTime - EnteredCurrentStatus) > 5)
queue 1


It appears that ShutdownGraceful() is being called after ShutdownFast(), with the escalation timer mixed in. It also looks like the shadow dies before receiving updates.



02/21/11 16:07:41 Sent job ClassAd update to startd.
02/21/11 16:07:41 JICShadow::updateShadow(): failed to send update
02/21/11 16:07:41 Got SIGQUIT.  Performing fast shutdown.
02/21/11 16:07:41 ShutdownFast all jobs.
02/21/11 16:07:41 in VanillaProc::ShutdownFast()
02/21/11 16:07:41 sending signal 2 to process with pid 1769
02/21/11 16:07:41 Using escalation delay 29 for Escalation Timer
02/21/11 16:07:41 Got SIGTERM. Performing graceful shutdown.
02/21/11 16:07:41 Started timer to call main_shutdown_fast in 1800 seconds
02/21/11 16:07:41 ShutdownGraceful all jobs.
02/21/11 16:07:41 in VanillaProc::ShutdownGraceful()
02/21/11 16:07:41 Send_Signal(): Doing kill(1769,2) [SIGINT]
02/21/11 16:08:10 Esclation Timer fired.  Killing job
02/21/11 16:08:10 Cancel signal escalation timer (11)
02/21/11 16:08:10 DaemonCore: No more children processes to reap.
02/21/11 16:08:10 Process exited, pid=1769, signal=9
02/21/11 16:08:10 Inside VanillaProc::JobReaper()
02/21/11 16:08:10 Inside OsProc::JobReaper()
02/21/11 16:08:10 Inside UserProc::JobReaper()
02/21/11 16:08:10 Reaper: all=1 handled=1 ShuttingDown=1
02/21/11 16:08:10 In VanillaProc::PublishUpdateAd()
02/21/11 16:08:10 Inside OsProc::PublishUpdateAd()
02/21/11 16:08:10 Inside UserProc::PublishUpdateAd()
02/21/11 16:08:10 HOOK_JOB_EXIT not configured.
02/21/11 16:08:10 Inside JICShadow::transferOutput(void)
02/21/11 16:08:10 JICShadow::transferOutput(void): Transferring...
02/21/11 16:08:10 Inside JICShadow::transferOutputMopUp(void)
02/21/11 16:08:10 Inside OsProc::JobExit()
02/21/11 16:08:10 In VanillaProc::PublishUpdateAd()
02/21/11 16:08:10 Inside OsProc::PublishUpdateAd()
02/21/11 16:08:10 Inside UserProc::PublishUpdateAd()
02/21/11 16:08:10 Sent job ClassAd update to startd.
02/21/11 16:08:10 condor_write(): Socket closed when trying to write 311 bytes to <166.217.62.105:37663>, fd is 5
02/21/11 16:08:10 Buf::write(): condor_write() failed
02/21/11 16:08:10 Failed to send job exit status to shadow
02/21/11 16:08:10 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/21/11 16:08:10 Returning from CStarter::JobReaper()

Comment 4 Luigi Toscano 2011-03-09 11:47:56 UTC
Could you please check again with condor-7.4.5-0.8.el5? The output from the test signal program shows only one trapped signal. What should it look like?

Comment 5 Jon Thomas 2011-03-09 15:08:37 UTC
Hi,

The output from the program isn't important here; what matters is how condor handles the signals. The log in comment #3 is from condor-7.4.5-0.8.el5. It shows that ShutdownGraceful is being called after ShutdownFast, and that the job exit status is not reaching the shadow. The log should look like:


03/09/11 09:54:38 Sent job ClassAd update to startd.
03/09/11 09:54:38 Leaving JICShadow::updateShadow(): success
03/09/11 09:55:01 Got SIGQUIT.  Performing fast shutdown.
03/09/11 09:55:01 ShutdownFast all jobs.
03/09/11 09:55:01 in VanillaProc::ShutdownFast()
03/09/11 09:55:01 sending signal 2 to process with pid 6112
03/09/11 09:55:01 Using escalation delay 29 for Escalation Timer
03/09/11 09:55:30 Esclation Timer fired.  Killing job
03/09/11 09:55:30 Cancel signal escalation timer (11)
03/09/11 09:55:30 DaemonCore: No more children processes to reap.
03/09/11 09:55:30 Process exited, pid=6112, signal=9
03/09/11 09:55:30 Inside VanillaProc::JobReaper()
03/09/11 09:55:30 Inside OsProc::JobReaper()
03/09/11 09:55:30 Inside UserProc::JobReaper()
03/09/11 09:55:30 Reaper: all=1 handled=1 ShuttingDown=1
03/09/11 09:55:30 In VanillaProc::PublishUpdateAd()
03/09/11 09:55:30 Inside OsProc::PublishUpdateAd()
03/09/11 09:55:30 Inside UserProc::PublishUpdateAd()
03/09/11 09:55:30 HOOK_JOB_EXIT not configured.
03/09/11 09:55:30 Inside JICShadow::transferOutput(void)
03/09/11 09:55:30 JICShadow::transferOutput(void): Transferring...
03/09/11 09:55:30 Inside JICShadow::transferOutputMopUp(void)
03/09/11 09:55:30 Inside OsProc::JobExit()
03/09/11 09:55:30 In VanillaProc::PublishUpdateAd()
03/09/11 09:55:30 Inside OsProc::PublishUpdateAd()
03/09/11 09:55:30 Inside UserProc::PublishUpdateAd()
03/09/11 09:55:30 Sent job ClassAd update to startd.
03/09/11 09:55:30 Last process exited, now Starter is exiting

Comment 6 Luigi Toscano 2011-03-09 15:40:10 UTC
Thanks, I see the issue (I analyzed the wrong output).

Comment 7 Robert Rati 2011-03-22 15:17:59 UTC
What is observed here is actually how the system is designed.  Here's what happens:

The shadow evaluates the PeriodicRemove expression, and when it evaluates to true it deactivates the claim forcefully, then exits. The forceful claim deactivation sent to the startd results in the startd sending the starter a ShutdownFast signal, which results in the remove_kill_sig being sent to the job. Since signal escalation is being used and the running job ignores signals, the starter waits for the appropriate timeout before escalating to kill the job.

Since the shadow has exited, the schedd moves to either A) reuse the claim or B) vacate the claim (RELEASE_CLAIM):

If there are no more runnable jobs in the queue that will run on the node, B will occur.  This results in the schedd sending a RELEASE_CLAIM to the startd, which will then send a ShutdownGraceful to the starter (since it is still doing signal escalation) which results in the kill_sig being sent to the job.

If there are jobs in the queue that will run on the node, the schedd will schedule a new job, and the resulting shadow will attempt to reuse the claim. However, if the starter from the previous job is still going through signal escalation, then when the shadow asks the startd to reuse the claim, the startd tells the shadow to retry (because the starter is still running). The shadow will attempt to reuse the claim every second for 20 seconds before giving up. The result is the schedd sending a RELEASE_CLAIM to the startd, which will result in a ShutdownGraceful being sent to the starter (if it's still running) and the kill_sig being sent to the job.

In either case, the shadow has exited long before the starter has a chance to send the final update to it.

Comment 9 Robert Rati 2011-03-28 15:10:31 UTC
The original customer issue doesn't seem to be related to a bug in condor. There is a bz to discuss and evaluate alternate shadow behavior (bz690297), but the information provided indicates this is not a bug.

