Description of problem: There are certain instances where the shadow will kill off starters and then exit without waiting for the starter's last update. One such instance is the evaluation of PeriodicRemove. The shadow evaluates the PeriodicRemove expression, and when it evaluates to true the shadow forcefully deactivates the claim and then exits. The forceful claim deactivation sent to the startd results in the startd sending the starter a ShutdownFast signal.

Since the shadow has exited, the schedd moves to either A) reuse the claim or B) vacate the claim (RELEASE_CLAIM):

If there are no more runnable jobs in the queue that will run on the node, B occurs. The schedd sends a RELEASE_CLAIM to the startd, which then sends a ShutdownGraceful to the starter.

If there are jobs in the queue that will run on the node, the schedd schedules a new job and the resulting shadow attempts to reuse the claim. However, if the starter from the previous job is still running (because of signal escalation, or because the job is taking a while to exit), then when the shadow asks the startd to reuse the claim, the startd tells the shadow to retry (because the starter is still running). The shadow attempts to reuse the claim every second for 20 seconds before giving up. The result is the schedd sending a RELEASE_CLAIM to the startd, which in turn sends a ShutdownGraceful to the starter (if it is still running).

Of interest is that this does not occur when you use condor_rm to remove a vanilla universe job, even if signal escalation is used. This is because when condor_rm is used, the shadow handles the removal via handleJobRemoval, and that code path does not exit. However, when PeriodicRemove evaluates to true, the shadow uses BaseShadow::removeJob, which calls DC_Exit after killing starters and doing a few other things.
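For illustration, the shadow's claim-reuse retry described above (one attempt per second, giving up after 20 seconds) behaves roughly like the sketch below. This is not Condor code; `request_claim_reuse` and the "ok"/"retry" replies are hypothetical stand-ins for the shadow-to-startd exchange.

```python
import time

def try_reuse_claim(request_claim_reuse, interval=1, max_attempts=20):
    """Attempt to reuse a startd claim, retrying while the old starter
    is still running.  `request_claim_reuse` is a hypothetical callable
    standing in for the shadow's request: it returns "ok" when the
    startd accepts, or "retry" while the previous starter is alive."""
    for _ in range(max_attempts):
        if request_claim_reuse() == "ok":
            return True              # claim reused; the new job can start
        time.sleep(interval)         # startd said retry: old starter alive
    return False                     # give up; the schedd sends RELEASE_CLAIM
```

If the old starter lingers (e.g. due to signal escalation) past the 20 attempts, the sketch returns False, matching the RELEASE_CLAIM outcome described above.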
It looks like all outcomes of a Periodic* evaluation result in a fire-and-forget mentality, where the shadow kills starters and exits. Since the shadow is not waiting for the starters to send their final updates, the schedd incorrectly thinks the startds are free to run other jobs. If the starter has not exited for some reason, this is obviously not the case, and though the schedd has tried to schedule a new job for the startd, that job may be unable to run and reuse the claim. We should investigate whether the shadow actually needs a fire-and-forget approach for Periodic* evaluations (and any other cases), rather than waiting for the starter to exit.
For completeness, when testing you indeed see a fast shutdown being sent to the starter, with the shadow not caring about its return status.

-----------------
Shadow:
06/13/11 14:33:21 (110.0) (13620): Resource slot1.com changing state from EXECUTING to PENDING_DEATH
06/13/11 14:33:21 (110.0) (13620): Killed starter (fast) at <128.105.13.17:9620?sock=13425_fd56_2>

Starter.slot:
06/13/11 14:33:21 Inside OsProc::JobExit()
06/13/11 14:33:21 condor_write(): Socket closed when trying to write 96 bytes to <128.105.13.17:55038>, fd is 9
06/13/11 14:33:21 Buf::write(): condor_write() failed
06/13/11 14:33:21 Failed to send job exit status to shadow
06/13/11 14:33:21 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
-----------------

IMHO the fire-and-forget mentality is the correct thing for the schedd to do, and I think the onus is on the startd to clean up prior to running a new job. However, I have been unable to repro a starter "hanging around." Is there a repro condition for this?
Created attachment 504547 [details] Program that catches all signals
executable = /tmp/signaltest_noexit
output = /tmp/job.$(CLUSTER).$(PROCESS).out
error = /tmp/job.$(CLUSTER).$(PROCESS).err
log = /tmp/job.$(CLUSTER).$(PROCESS).log
universe = vanilla
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 1

Use a job similar to this with the attached signal catcher program.
Created attachment 504549 [details] Python script to catch signals
Signal escalation will show a starter hanging around (until killing_timeout - 1 or KillSigTimeout, whichever is less). The C program needs to be compiled, and it catches all signals. The Python script is slightly different: it takes 2 args; the 1st is whether to exit (1 or 0), the other is the signal specified in kill_sig.
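For reference, the Python catcher behaves roughly like the sketch below. This is a reconstruction from the description above and the log output later in this bug, not the attached script itself: it installs a handler for every catchable signal (everything except SIGKILL and SIGSTOP), and takes two args: whether to exit on the kill signal, and which signal that is.

```python
import signal
import sys

def catchable_signals():
    """Signal numbers 1..31 minus SIGKILL and SIGSTOP,
    which can never be caught."""
    return [s for s in range(1, 32)
            if s not in (signal.SIGKILL, signal.SIGSTOP)]

def main(exit_on_signal, kill_sig):
    def handler(signum, frame):
        if exit_on_signal and signum == kill_sig:
            print("Signal received: %d, exiting..." % signum)
            sys.exit(0)
        print("Signal received: %d, ignoring" % signum)

    sigs = catchable_signals()
    print("Installing handlers for signals: %s (exit_on_signal: %s)"
          % (sigs, exit_on_signal))
    for s in sigs:
        signal.signal(s, handler)
    while True:
        signal.pause()   # sleep until a signal arrives

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1] == "1", int(sys.argv[2]))
```

Run as e.g. `python test_signal.py 1 3` to exit on SIGQUIT (3), matching the remove_kill_sig in the submit file above.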
Notes: In testing this at a reasonable scale I see a lot of the following:

Starter.slot
-----
06/14/11 11:18:41 sending signal 3 to process with pid 19619
06/14/11 11:18:41 error writing to named pipe: watchdog pipe has closed
06/14/11 11:18:41 LocalClient: error sending message to server
06/14/11 11:18:41 ProcFamilyClient: failed to start connection with ProcD
06/14/11 11:18:41 signal_process: ProcD communication error
06/14/11 11:18:41 ERROR "ProcD has failed" at line 587 in file /home/tstclair/work/spaces/active/git/src/condor_utils/proc_family_proxy.cpp
-----

which seems rather odd; possibly a new BZ.
Tracking info for the procd issue found here: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2247 Need to retest to verify that there are still orphaned starters.
The schedd is releasing the claim because, when trying to reuse the claim, the schedd tries to send another job to the startd, which refuses with JOB_NOT_STARTED (108) - "Can't connect to startd or request refused". This results in the schedd releasing the claim because the resource was "busy". This all stems from the fact that the shadow never hung around to wait for the final update, so the schedd is a couple of steps ahead. I need to validate whether periodic_hold has similar behavior.
tracking upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2329
Change upstream pending review 7/22/11
Fix still pending nmi run which appears to be failing for other reasons 7/29/11
The original patch failed some tests because the new modifications to wait for starter exit did not handle some of the methods by which the starter notifies the shadow that it is already dead. Updated patch running through nmi again 8/2/11
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: During signal escalation it is possible for communications to be out of sync such that a starter is temporarily leaked.
C: Causes a claim to be unnecessarily relinquished due to out-of-order processing, e.g. the shadow exits before the starter has been cleaned up.
F: Force the shadow to wait for the final update from the starter.
R: Claim reuse performance increase and increased job throughput during periodic_* evaluation and signal escalation.
Reproduced on RHEL5/x86_64:
$CondorVersion: 7.6.0 Mar 30 2011 BuildID: RH-7.6.0-0.4.el5 PRE-RELEASE-GRID $
$CondorPlatform: X86_64-Redhat_5.6 $

StarterLog:
10/24/11 16:07:05 Inside JICShadow::transferOutput(void)
10/24/11 16:07:05 JICShadow::transferOutput(void): Transferring...
10/24/11 16:07:05 Inside JICShadow::transferOutputMopUp(void)
10/24/11 16:07:05 Inside OsProc::JobExit()
10/24/11 16:07:05 Sent job ClassAd update to startd.
10/24/11 16:07:05 condor_write(): Socket closed when trying to write 78 bytes to <10.34.33.148:41901>, fd is 6
10/24/11 16:07:05 Buf::write(): condor_write() failed
10/24/11 16:07:05 Failed to send job exit status to shadow
10/24/11 16:07:05 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/24/11 16:07:05 Returning from CStarter::JobReaper()

SchedLog:
10/24/11 16:07:05 (pid:28158) Got SIGQUIT. Performing fast shutdown.
10/24/11 16:07:05 (pid:28158) Now in shutdown_fast. Sending signals to shadows
10/24/11 16:07:05 (pid:28158) Sent signal 9 to shadow [pid 28205] for job 21.0
10/24/11 16:07:05 (pid:28158) ScheddCronJobMgr: Shutting down
10/24/11 16:07:05 (pid:28158) CronJobMgr: Killing all jobs
10/24/11 16:07:05 (pid:28158) Cron: Killing all jobs
10/24/11 16:07:05 (pid:28158) Trying to update collector <IP:9618>
10/24/11 16:07:05 (pid:28158) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 16:07:05 (pid:28158) Trying to update collector <IP:9618>
10/24/11 16:07:05 (pid:28158) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 16:07:05 (pid:28158) All shadows have been killed, exiting.
10/24/11 16:07:05 (pid:28158) **** condor_schedd (condor_SCHEDD) pid 28158 EXITING WITH STATUS 0

python test script:
Installing handlers for signals: [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] (exit_on_signal: True)
Signal received: 3, exiting...
Retested over current condor:
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

StarterLog.slot1:
10/24/11 13:34:17 Got SIGQUIT. Performing fast shutdown.
10/24/11 13:34:17 ShutdownFast all jobs.
10/24/11 13:34:17 in VanillaProc::ShutdownFast()
10/24/11 13:34:17 sending signal 3 to process with pid 16912
10/24/11 13:34:17 Using escalation delay 29 for Escalation Timer
10/24/11 13:34:17 DaemonCore: No more children processes to reap.
10/24/11 13:34:17 Process exited, pid=16912, status=0
10/24/11 13:34:17 Inside VanillaProc::JobReaper()
10/24/11 13:34:17 Cancel signal escalation timer (11)
10/24/11 13:34:17 Inside OsProc::JobReaper()
10/24/11 13:34:17 Inside UserProc::JobReaper()
10/24/11 13:34:17 Reaper: all=1 handled=1 ShuttingDown=1
10/24/11 13:34:17 In VanillaProc::PublishUpdateAd()
10/24/11 13:34:17 Inside OsProc::PublishUpdateAd()
10/24/11 13:34:17 Inside UserProc::PublishUpdateAd()
10/24/11 13:34:17 HOOK_JOB_EXIT not configured.
10/24/11 13:34:17 In VanillaProc::PublishUpdateAd()
10/24/11 13:34:17 Inside OsProc::PublishUpdateAd()
10/24/11 13:34:17 Inside UserProc::PublishUpdateAd()
10/24/11 13:34:17 Entering JICShadow::updateShadow()
10/24/11 13:34:17 condor_write(): Socket closed when trying to write 266 bytes to <IP:41859>, fd is 6
10/24/11 13:34:17 Buf::write(): condor_write() failed
10/24/11 13:34:17 Sent job ClassAd update to startd.
10/24/11 13:34:17 JICShadow::updateShadow(): failed to send update
10/24/11 13:34:17 Inside JICShadow::transferOutput(void)
10/24/11 13:34:17 JICShadow::transferOutput(void): Transferring...
10/24/11 13:34:17 Inside JICShadow::transferOutputMopUp(void)
10/24/11 13:34:17 Inside OsProc::JobExit()
10/24/11 13:34:17 Notifying exit status=0 reason=107
10/24/11 13:34:17 Sent job ClassAd update to startd.
10/24/11 13:34:17 condor_write(): Socket closed when trying to write 78 bytes to <IP:41859>, fd is 6
10/24/11 13:34:17 Buf::write(): condor_write() failed
10/24/11 13:34:17 Failed to send job exit status to shadow
10/24/11 13:34:17 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/24/11 13:34:17 Returning from CStarter::JobReaper()

SchedLog:
10/24/11 13:34:17 (pid:16870) Got SIGQUIT. Performing fast shutdown.
10/24/11 13:34:17 (pid:16870) Now in shutdown_fast. Sending signals to shadows
10/24/11 13:34:17 (pid:16870) Sent signal 9 to shadow [pid 16909] for job 383.0
10/24/11 13:34:17 (pid:16870) ScheddCronJobMgr: Shutting down
10/24/11 13:34:17 (pid:16870) CronJobMgr: Killing all jobs
10/24/11 13:34:17 (pid:16870) Cron: Killing all jobs
10/24/11 13:34:17 (pid:16870) Trying to update collector <IP:9618>
10/24/11 13:34:17 (pid:16870) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 13:34:17 (pid:16870) Trying to update collector <IP:9618>
10/24/11 13:34:17 (pid:16870) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 13:34:17 (pid:16870) All shadows have been killed, exiting.
10/24/11 13:34:17 (pid:16870) **** condor_schedd (condor_SCHEDD) pid 16870 EXITING WITH STATUS 0

Installing handlers for signals: [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] (exit_on_signal: True)
Signal received: 3, exiting...

There is still a "Failed to send job exit status to shadow" error message in the StarterLog. The procd problem seems to be solved. Please confirm that the patch was included in the latest build.
That could possibly be due to other issues. Another way to validate would be to check the starter logs to make certain they were able to send their final update prior to exit. How exactly are you testing the modification?
In Comment #19, the StarterLog part:

JICShadow::updateShadow(): failed to send update

Is this the update you were talking about? I tested with the Python script from the attachments and this job:

executable = /usr/bin/python
args = "test_signal.py 1 2"
output = job.$(CLUSTER).$(PROCESS).out
error = job.$(CLUSTER).$(PROCESS).err
log = job.$(CLUSTER).$(PROCESS).log
universe = vanilla
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 1
When using:

# created from signal_catcher.c
executable = a.out
output = /tmp/job.$(CLUSTER).$(PROCESS).out
error = /tmp/job.$(CLUSTER).$(PROCESS).err
log = /tmp/job.$(CLUSTER).$(PROCESS).log
universe = vanilla
periodic_remove = JobStartDate + 20 < CurrentTime
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 4

it seems to be working as expected. Could you please attach a full log & submission set?
Created attachment 530300 [details] RH-7.6.5-0.2.el6 logs I just retest it to have clean logs for you. $CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $ $CondorPlatform: X86_64-RedHat_6.1 $
The attached logs don't tell me anything; it doesn't look like the job is being shut down at all. To repro, use the example above, update the queue count to 100+, and monitor the logs.

Before fix: EXITING 108 <- RELEASE_CLAIM
After fix: the claim is reused over and over until timeout.
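When monitoring at that scale, a small log tally can help spot the difference between runs. This is a convenience sketch (not part of Condor) that counts the "EXITING WITH STATUS <n>" lines from a ShadowLog; before the fix you would expect to see status 108 (the claim-release path), and after the fix you should not.

```python
import re
from collections import Counter

EXIT_RE = re.compile(r"EXITING WITH STATUS (\d+)")

def count_exit_statuses(log_lines):
    """Tally 'EXITING WITH STATUS <n>' occurrences across log lines,
    keyed by the numeric exit status."""
    counts = Counter()
    for line in log_lines:
        m = EXIT_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts
```

Usage would be e.g. `count_exit_statuses(open("ShadowLog"))`; a nonzero count for status 108 indicates the released-claim behavior described above.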
Thanks Timothy, retested on supported platforms RHEL5,RHEL6/x86,x86_64: condor-7.6.5-0.2

10/27/11 16:28:14 (259.1) (14086): **** condor_shadow (condor_SHADOW) pid 14086 EXITING WITH STATUS 113
10/27/11 16:28:14 Last process exited, now Starter is exiting
10/27/11 16:28:14 **** condor_starter (condor_STARTER) pid 14087 EXITING WITH STATUS 0
10/27/11 16:33:15 (259.2) (14110): **** condor_shadow (condor_SHADOW) pid 14110 EXITING WITH STATUS 113
....

After 100 iterations there is no EXITING 108. The claim was reused. >>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-C: During signal escalation it is possible for communications to be out of sync such that a starter is temporarily leaked.
-C: Causes a claim to be unnecessarily relinquished due to out of order processing, e.g. shadow exits before starter has been cleaned
-F: Force shadow to wait for final update from starter.
-R: Claim reuse performance increase ++ increased job throughput during periodic_* evaluation and signal escalation
+During signal escalation it was possible for a starter to leak due to disjointed communications. This in turn could cause a claim to be unnecessarily relinquished due to out-of-order processing, such as if a shadow exited before the starter had been cleaned up. This update forces the shadow to wait for the final update from the starter, which results in a claim reuse performance increase and increased job throughput during periodic_* evaluation and signal escalation.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2012-0045.html