Description of problem: There are certain instances where the shadow will kill off starters and then exit without waiting for the starter's last update. One such instance is the evaluation of PeriodicRemove. The shadow evaluates the PeriodicRemove expression, and when it evaluates to true the shadow forcefully deactivates the claim and then exits. The forceful claim deactivation sent to the startd results in the startd sending the starter a ShutdownFast signal.

Since the shadow has exited, the schedd moves to either A) reuse the claim or B) vacate the claim (RELEASE_CLAIM):

If there are no more runnable jobs in the queue that will run on the node, B occurs. The schedd sends a RELEASE_CLAIM to the startd, which then sends a ShutdownGraceful to the starter.

If there are jobs in the queue that will run on the node, the schedd schedules a new job and the resulting shadow attempts to reuse the claim. However, if the starter from the previous job is still running (because of signal escalation, or because the job is taking a while to exit), then when the shadow asks the startd to reuse the claim, the startd tells the shadow to retry (because the starter is still running). The shadow attempts to reuse the claim every second for 20 seconds before giving up. The result is the schedd sending a RELEASE_CLAIM to the startd, which in turn sends a ShutdownGraceful to the starter (if it is still running).

Of interest is that this does not occur when you use condor_rm to remove a vanilla universe job, even if signal escalation is used. This is because when condor_rm is used, the shadow handles the removal via handleJobRemoval, and that code path does not exit. However, when PeriodicRemove evaluates to true, the shadow uses BaseShadow::removeJob, which calls DC_Exit after killing starters and doing a few other things.
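For illustration, the shadow's claim-reuse retry described above (one attempt per second, giving up after 20 seconds) behaves roughly like the sketch below. This is not Condor code; `request_claim_reuse` and the "ok"/"retry" replies are hypothetical stand-ins for the shadow-to-startd exchange.

```python
import time

def try_reuse_claim(request_claim_reuse, interval=1, max_attempts=20):
    """Attempt to reuse a startd claim, retrying while the old starter
    is still running.  `request_claim_reuse` is a hypothetical callable
    standing in for the shadow's request: it returns "ok" when the
    startd accepts, or "retry" while the previous starter is alive."""
    for _ in range(max_attempts):
        if request_claim_reuse() == "ok":
            return True              # claim reused; the new job can start
        time.sleep(interval)         # startd said retry: old starter alive
    return False                     # give up; the schedd sends RELEASE_CLAIM
```

If the old starter lingers (e.g. due to signal escalation) past the 20 attempts, the sketch returns False, matching the RELEASE_CLAIM outcome described above.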
It looks like all outcomes of a Periodic* evaluation result in a fire-and-forget mentality, where the shadow kills starters and exits. Since the shadow is not waiting for the starters to send their final updates, the schedd incorrectly thinks the startds are free to run other jobs. If the starter has not exited for some reason, this is obviously not the case, and though the schedd has tried to schedule a new job for the startd, that job may be unable to run and reuse the claim. We should investigate whether the shadow actually needs a fire-and-forget approach for Periodic* evaluations (and any other cases), rather than waiting for the starter to exit.
For completeness, when testing you indeed see a fast shutdown being sent to the starter, with the shadow not caring about its return status.

-----------------
Shadow:
06/13/11 14:33:21 (110.0) (13620): Resource slot1.com changing state from EXECUTING to PENDING_DEATH
06/13/11 14:33:21 (110.0) (13620): Killed starter (fast) at <128.105.13.17:9620?sock=13425_fd56_2>

Starter.slot:
06/13/11 14:33:21 Inside OsProc::JobExit()
06/13/11 14:33:21 condor_write(): Socket closed when trying to write 96 bytes to <128.105.13.17:55038>, fd is 9
06/13/11 14:33:21 Buf::write(): condor_write() failed
06/13/11 14:33:21 Failed to send job exit status to shadow
06/13/11 14:33:21 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
-----------------

IMHO the fire-and-forget mentality is the correct thing for the schedd to do, and I think the onus is on the startd to clean up prior to running a new job. However, I have been unable to repro a starter "hanging around." Is there a repro condition for this?
Created attachment 504547 [details] Program that catches all signals
executable = /tmp/signaltest_noexit
output = /tmp/job.$(CLUSTER).$(PROCESS).out
error = /tmp/job.$(CLUSTER).$(PROCESS).err
log = /tmp/job.$(CLUSTER).$(PROCESS).log
universe = vanilla
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 1

Use a job similar to this with the attached signal catcher program.
Created attachment 504549 [details] Python script to catch signals
Signal escalation will show a starter hanging around (until killing_timeout - 1 or KillSigTimeout, whichever is less). The C program needs to be compiled, and it catches all signals. The Python script is slightly different: it takes 2 args; the 1st is whether to exit (1 or 0), the other is the signal specified in kill_sig.
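For reference, the Python catcher behaves roughly like the sketch below. This is a reconstruction from the description above and the log output later in this bug, not the attached script itself: it installs a handler for every catchable signal (everything except SIGKILL and SIGSTOP), and takes two args: whether to exit on the kill signal, and which signal that is.

```python
import signal
import sys

def catchable_signals():
    """Signal numbers 1..31 minus SIGKILL and SIGSTOP,
    which can never be caught."""
    return [s for s in range(1, 32)
            if s not in (signal.SIGKILL, signal.SIGSTOP)]

def main(exit_on_signal, kill_sig):
    def handler(signum, frame):
        if exit_on_signal and signum == kill_sig:
            print("Signal received: %d, exiting..." % signum)
            sys.exit(0)
        print("Signal received: %d, ignoring" % signum)

    sigs = catchable_signals()
    print("Installing handlers for signals: %s (exit_on_signal: %s)"
          % (sigs, exit_on_signal))
    for s in sigs:
        signal.signal(s, handler)
    while True:
        signal.pause()   # sleep until a signal arrives

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1] == "1", int(sys.argv[2]))
```

Run as e.g. `python test_signal.py 1 3` to exit on SIGQUIT (3), matching the remove_kill_sig in the submit file above.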
Notes: In testing this at a reasonable scale I see a lot of the following:

Starter.slot
-----
06/14/11 11:18:41 sending signal 3 to process with pid 19619
06/14/11 11:18:41 error writing to named pipe: watchdog pipe has closed
06/14/11 11:18:41 LocalClient: error sending message to server
06/14/11 11:18:41 ProcFamilyClient: failed to start connection with ProcD
06/14/11 11:18:41 signal_process: ProcD communication error
06/14/11 11:18:41 ERROR "ProcD has failed" at line 587 in file /home/tstclair/work/spaces/active/git/src/condor_utils/proc_family_proxy.cpp
-----

which seems rather odd; possibly a new BZ.
Tracking info for the procd issue found here: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2247 Need to retest to verify that there are still orphaned starters.
The schedd is releasing the claim because, when trying to reuse the claim, the schedd tries to send another job to the startd, which refuses with JOB_NOT_STARTED (108) - "Can't connect to startd or request refused". This results in the schedd releasing the claim because the resource was "busy". This all stems from the fact that the shadow never hung around to wait for the final update, so the schedd is a couple of steps ahead. I need to validate whether periodic_hold has similar behavior.
tracking upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2329
Change upstream pending review 7/22/11
Fix still pending nmi run which appears to be failing for other reasons 7/29/11
The original patch failed some tests because the new modifications to wait for starter exit did not handle some of the methods by which the starter notifies the shadow that it is already dead. Updated patch running through nmi again 8/2/11
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: During signal escalation it is possible for communications to be out of sync such that a starter is temporarily leaked.
C: Causes a claim to be unnecessarily relinquished due to out-of-order processing, e.g. the shadow exits before the starter has been cleaned up.
F: Force the shadow to wait for the final update from the starter.
R: Claim reuse performance increase and increased job throughput during periodic_* evaluation and signal escalation.
Reproduced on RHEL5/x86_64:
$CondorVersion: 7.6.0 Mar 30 2011 BuildID: RH-7.6.0-0.4.el5 PRE-RELEASE-GRID $
$CondorPlatform: X86_64-Redhat_5.6 $

StarterLog:
10/24/11 16:07:05 Inside JICShadow::transferOutput(void)
10/24/11 16:07:05 JICShadow::transferOutput(void): Transferring...
10/24/11 16:07:05 Inside JICShadow::transferOutputMopUp(void)
10/24/11 16:07:05 Inside OsProc::JobExit()
10/24/11 16:07:05 Sent job ClassAd update to startd.
10/24/11 16:07:05 condor_write(): Socket closed when trying to write 78 bytes to <10.34.33.148:41901>, fd is 6
10/24/11 16:07:05 Buf::write(): condor_write() failed
10/24/11 16:07:05 Failed to send job exit status to shadow
10/24/11 16:07:05 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/24/11 16:07:05 Returning from CStarter::JobReaper()

SchedLog:
10/24/11 16:07:05 (pid:28158) Got SIGQUIT. Performing fast shutdown.
10/24/11 16:07:05 (pid:28158) Now in shutdown_fast. Sending signals to shadows
10/24/11 16:07:05 (pid:28158) Sent signal 9 to shadow [pid 28205] for job 21.0
10/24/11 16:07:05 (pid:28158) ScheddCronJobMgr: Shutting down
10/24/11 16:07:05 (pid:28158) CronJobMgr: Killing all jobs
10/24/11 16:07:05 (pid:28158) Cron: Killing all jobs
10/24/11 16:07:05 (pid:28158) Trying to update collector <IP:9618>
10/24/11 16:07:05 (pid:28158) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 16:07:05 (pid:28158) Trying to update collector <IP:9618>
10/24/11 16:07:05 (pid:28158) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 16:07:05 (pid:28158) All shadows have been killed, exiting.
10/24/11 16:07:05 (pid:28158) **** condor_schedd (condor_SCHEDD) pid 28158 EXITING WITH STATUS 0

python test script:
Installing handlers for signals: [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] (exit_on_signal: True)
Signal received: 3, exiting...
Retested over current condor:
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

StarterLog.slot1:
10/24/11 13:34:17 Got SIGQUIT. Performing fast shutdown.
10/24/11 13:34:17 ShutdownFast all jobs.
10/24/11 13:34:17 in VanillaProc::ShutdownFast()
10/24/11 13:34:17 sending signal 3 to process with pid 16912
10/24/11 13:34:17 Using escalation delay 29 for Escalation Timer
10/24/11 13:34:17 DaemonCore: No more children processes to reap.
10/24/11 13:34:17 Process exited, pid=16912, status=0
10/24/11 13:34:17 Inside VanillaProc::JobReaper()
10/24/11 13:34:17 Cancel signal escalation timer (11)
10/24/11 13:34:17 Inside OsProc::JobReaper()
10/24/11 13:34:17 Inside UserProc::JobReaper()
10/24/11 13:34:17 Reaper: all=1 handled=1 ShuttingDown=1
10/24/11 13:34:17 In VanillaProc::PublishUpdateAd()
10/24/11 13:34:17 Inside OsProc::PublishUpdateAd()
10/24/11 13:34:17 Inside UserProc::PublishUpdateAd()
10/24/11 13:34:17 HOOK_JOB_EXIT not configured.
10/24/11 13:34:17 In VanillaProc::PublishUpdateAd()
10/24/11 13:34:17 Inside OsProc::PublishUpdateAd()
10/24/11 13:34:17 Inside UserProc::PublishUpdateAd()
10/24/11 13:34:17 Entering JICShadow::updateShadow()
10/24/11 13:34:17 condor_write(): Socket closed when trying to write 266 bytes to <IP:41859>, fd is 6
10/24/11 13:34:17 Buf::write(): condor_write() failed
10/24/11 13:34:17 Sent job ClassAd update to startd.
10/24/11 13:34:17 JICShadow::updateShadow(): failed to send update
10/24/11 13:34:17 Inside JICShadow::transferOutput(void)
10/24/11 13:34:17 JICShadow::transferOutput(void): Transferring...
10/24/11 13:34:17 Inside JICShadow::transferOutputMopUp(void)
10/24/11 13:34:17 Inside OsProc::JobExit()
10/24/11 13:34:17 Notifying exit status=0 reason=107
10/24/11 13:34:17 Sent job ClassAd update to startd.
10/24/11 13:34:17 condor_write(): Socket closed when trying to write 78 bytes to <IP:41859>, fd is 6
10/24/11 13:34:17 Buf::write(): condor_write() failed
10/24/11 13:34:17 Failed to send job exit status to shadow
10/24/11 13:34:17 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/24/11 13:34:17 Returning from CStarter::JobReaper()

SchedLog:
10/24/11 13:34:17 (pid:16870) Got SIGQUIT. Performing fast shutdown.
10/24/11 13:34:17 (pid:16870) Now in shutdown_fast. Sending signals to shadows
10/24/11 13:34:17 (pid:16870) Sent signal 9 to shadow [pid 16909] for job 383.0
10/24/11 13:34:17 (pid:16870) ScheddCronJobMgr: Shutting down
10/24/11 13:34:17 (pid:16870) CronJobMgr: Killing all jobs
10/24/11 13:34:17 (pid:16870) Cron: Killing all jobs
10/24/11 13:34:17 (pid:16870) Trying to update collector <IP:9618>
10/24/11 13:34:17 (pid:16870) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 13:34:17 (pid:16870) Trying to update collector <IP:9618>
10/24/11 13:34:17 (pid:16870) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 13:34:17 (pid:16870) All shadows have been killed, exiting.
10/24/11 13:34:17 (pid:16870) **** condor_schedd (condor_SCHEDD) pid 16870 EXITING WITH STATUS 0

Installing handlers for signals: [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] (exit_on_signal: True)
Signal received: 3, exiting...

There is still a "Failed to send job exit status to shadow" error message in the StarterLog. The procd problem seems to be solved. Please confirm that the patch was included in the latest build.
That could possibly be due to other issues. Another way to validate would be to check the starter logs to make certain they were able to send their final update prior to exit. How exactly are you testing the modification?
In Comment #19, the StarterLog part:

JICShadow::updateShadow(): failed to send update

Is this the update you were talking about? I tested with the Python script from the attachments and this job:

executable = /usr/bin/python
args = "test_signal.py 1 2"
output = job.$(CLUSTER).$(PROCESS).out
error = job.$(CLUSTER).$(PROCESS).err
log = job.$(CLUSTER).$(PROCESS).log
universe = vanilla
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 1
When using:

# created from signal_catcher.c
executable = a.out
output = /tmp/job.$(CLUSTER).$(PROCESS).out
error = /tmp/job.$(CLUSTER).$(PROCESS).err
log = /tmp/job.$(CLUSTER).$(PROCESS).log
universe = vanilla
periodic_remove = JobStartDate + 20 < CurrentTime
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 4

it seems to be working as expected. Could you please attach a full log & submission set?
Created attachment 530300 [details] RH-7.6.5-0.2.el6 logs I just retest it to have clean logs for you. $CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $ $CondorPlatform: X86_64-RedHat_6.1 $
The attached logs don't tell me anything; it doesn't look like the job is being shut down at all. To repro, use the example above, update the queue count to 100+, and monitor the logs.

Before fix: EXITING 108 <- RELEASE_CLAIM
After fix: the claim is reused over and over until timeout.
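When monitoring at that scale, a small log tally can help spot the difference between runs. This is a convenience sketch (not part of Condor) that counts the "EXITING WITH STATUS <n>" lines from a ShadowLog; before the fix you would expect to see status 108 (the claim-release path), and after the fix you should not.

```python
import re
from collections import Counter

EXIT_RE = re.compile(r"EXITING WITH STATUS (\d+)")

def count_exit_statuses(log_lines):
    """Tally 'EXITING WITH STATUS <n>' occurrences across log lines,
    keyed by the numeric exit status."""
    counts = Counter()
    for line in log_lines:
        m = EXIT_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts
```

Usage would be e.g. `count_exit_statuses(open("ShadowLog"))`; a nonzero count for status 108 indicates the released-claim behavior described above.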
Thanks Timothy, retested on supported platforms RHEL5,RHEL6/x86,x86_64: condor-7.6.5-0.2

10/27/11 16:28:14 (259.1) (14086): **** condor_shadow (condor_SHADOW) pid 14086 EXITING WITH STATUS 113
10/27/11 16:28:14 Last process exited, now Starter is exiting
10/27/11 16:28:14 **** condor_starter (condor_STARTER) pid 14087 EXITING WITH STATUS 0
10/27/11 16:33:15 (259.2) (14110): **** condor_shadow (condor_SHADOW) pid 14110 EXITING WITH STATUS 113
....

After 100 iterations there is no EXITING 108. The claim was reused. >>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-C: During signal escalation it is possible for communications to be out of sync such that a starter is temporarily leaked.
-C: Causes a claim to be unnecessarily relinquished due to out of order processing, e.g. shadow exits before starter has been cleaned
-F: Force shadow to wait for final update from starter.
-R: Claim reuse performance increase ++ increased job throughput during periodic_* evaluation and signal escalation
+During signal escalation it was possible for a starter to leak due to disjointed communications. This in turn could cause a claim to be unnecessarily relinquished due to out-of-order processing, such as if a shadow exited before the starter had been cleaned up. This update forces the shadow to wait for the final update from the starter, which results in a claim reuse performance increase and increased job throughput during periodic_* evaluation and signal escalation.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2012-0045.html