Bug 690297
Summary: Evaluate shadow exit cases and potential utilization concerns

Product: Red Hat Enterprise MRG
Component: condor
Version: 1.3
Target Milestone: 2.1
Hardware: All
OS: All
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: Robert Rati <rrati>
Assignee: Timothy St. Clair <tstclair>
QA Contact: Tomas Rusnak <trusnak>
CC: dan, ltoscano, matt, mkudlej, trusnak, tstclair
Fixed In Version: condor-7.6.4-0.1
Doc Type: Bug Fix
Doc Text:
During signal escalation it was possible for a starter to leak due to disjointed communications. This in turn could cause a claim to be unnecessarily relinquished due to out-of-order processing, such as if a shadow exited before the starter had been cleaned up. This update forces the shadow to wait for the final update from the starter, which results in a claim reuse performance increase and increased job throughput during periodic evaluation and signal escalation.
Last Closed: 2012-01-23 17:26:00 UTC
Bug Blocks: 743350
Description — Robert Rati — 2011-03-23 20:02:13 UTC
For completeness: when testing you do indeed see a fast shutdown being sent to the starter, with the shadow not caring about its return.

-----------------
Shadow:
06/13/11 14:33:21 (110.0) (13620): Resource slot1.com changing state from EXECUTING to PENDING_DEATH
06/13/11 14:33:21 (110.0) (13620): Killed starter (fast) at <128.105.13.17:9620?sock=13425_fd56_2>

Starter.slot:
06/13/11 14:33:21 Inside OsProc::JobExit()
06/13/11 14:33:21 condor_write(): Socket closed when trying to write 96 bytes to <128.105.13.17:55038>, fd is 9
06/13/11 14:33:21 Buf::write(): condor_write() failed
06/13/11 14:33:21 Failed to send job exit status to shadow
06/13/11 14:33:21 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
-----------------

IMHO, the fire-and-forget mentality is the correct thing for the schedd to do, and I think the onus is on the startd to clean up prior to running a new job. However, I have been unable to repro a starter "hanging around." Is there a repro condition for this?

Created attachment 504547 [details]
Program that catches all signals
executable = /tmp/signaltest_noexit
output = /tmp/job.$(CLUSTER).$(PROCESS).out
error = /tmp/job.$(CLUSTER).$(PROCESS).err
log = /tmp/job.$(CLUSTER).$(PROCESS).log
universe = vanilla
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 1

Use a job similar to this with the attached signal catcher program.

Created attachment 504549 [details]
Python script to catch signals
Signal escalation will show a starter hanging around (until killing_timeout-1 or KillSigTimeout, whichever is less). The C program needs to be compiled, and it catches all signals. The Python script is slightly different: it takes 2 args, the 1st is whether to exit (1 or 0), the other is the signal specified in kill_sig.

Notes: In testing this at a reasonable scale I see a lot of the following:

Starter.slot:
-----
06/14/11 11:18:41 sending signal 3 to process with pid 19619
06/14/11 11:18:41 error writing to named pipe: watchdog pipe has closed
06/14/11 11:18:41 LocalClient: error sending message to server
06/14/11 11:18:41 ProcFamilyClient: failed to start connection with ProcD
06/14/11 11:18:41 signal_process: ProcD communication error
06/14/11 11:18:41 ERROR "ProcD has failed" at line 587 in file /home/tstclair/work/spaces/active/git/src/condor_utils/proc_family_proxy.cpp
-----

Which seems rather odd; possibly a new BZ.

Tracking info for the procd issue found here: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2247

Need to retest to verify that there are still orphaned starters.

The schedd is releasing the claim because, when trying to reuse the claim, the schedd will try to send another job to the startd, which refuses with JOB_NOT_STARTED (108) - "Can't connect to startd or request refused". This results in the schedd releasing the claim because the resource was "busy". This all stems from the fact that the shadow never hung around to wait for the final update, so the schedd is a couple of steps ahead. I need to validate whether periodic_hold has similar behavior.

Tracking upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2329

Change upstream, pending review 7/22/11

Fix still pending an nmi run, which appears to be failing for other reasons 7/29/11

The original patch failed some tests because the new modifications to wait for starter exit did not handle some of the methods by which the starter notifies the shadow that it is already dead.
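The attached Python signal catcher itself is not reproduced in this report. Based on the description above (two arguments: an exit flag and the kill_sig signal) and the "Installing handlers for signals: [...]" output captured in the test logs, a minimal sketch of such a script might look like the following; all names here are reconstructions, not the actual attachment:

```python
#!/usr/bin/env python
# Hypothetical reconstruction of the attached signal-catcher test script.
# Usage: test_signal.py <exit_on_signal: 1|0> <exit_signal>
import signal
import sys


def install_handlers(exit_on_signal, exit_sig):
    """Install a handler for every catchable signal; return the list installed."""

    def handler(signum, frame):
        if exit_on_signal and signum == exit_sig:
            print("Signal received: %d, exiting..." % signum)
            sys.exit(0)
        print("Signal received: %d" % signum)

    installed = []
    for signum in range(1, 32):
        if signum in (signal.SIGKILL, signal.SIGSTOP):
            continue  # these two can never be caught
        try:
            signal.signal(signum, handler)
            installed.append(signum)
        except (OSError, ValueError, RuntimeError):
            pass  # some signals may be reserved by the runtime
    return installed


if __name__ == "__main__" and len(sys.argv) == 3:
    exit_on_signal = bool(int(sys.argv[1]))
    exit_sig = int(sys.argv[2])
    installed = install_handlers(exit_on_signal, exit_sig)
    print("Installing handlers for signals: %s (exit_on_signal: %s)"
          % (installed, exit_on_signal))
    while True:
        signal.pause()  # sleep until a signal is delivered
```

A job that ignores its kill_sig this way keeps the starter alive until the escalation timer fires, which is what exposes the claim-reuse behavior under test.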
Updated patch running through nmi again 8/2/11

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: During signal escalation it is possible for communications to be out of sync such that a starter is temporarily leaked.
C: Causes a claim to be unnecessarily relinquished due to out-of-order processing, e.g. the shadow exits before the starter has been cleaned up.
F: Force the shadow to wait for the final update from the starter.
R: Claim reuse performance increase, plus increased job throughput during periodic evaluation and signal escalation.

Reproduced on RHEL5/x86_64:
$CondorVersion: 7.6.0 Mar 30 2011 BuildID: RH-7.6.0-0.4.el5 PRE-RELEASE-GRID $
$CondorPlatform: X86_64-Redhat_5.6 $

StarterLog:
10/24/11 16:07:05 Inside JICShadow::transferOutput(void)
10/24/11 16:07:05 JICShadow::transferOutput(void): Transferring...
10/24/11 16:07:05 Inside JICShadow::transferOutputMopUp(void)
10/24/11 16:07:05 Inside OsProc::JobExit()
10/24/11 16:07:05 Sent job ClassAd update to startd.
10/24/11 16:07:05 condor_write(): Socket closed when trying to write 78 bytes to <10.34.33.148:41901>, fd is 6
10/24/11 16:07:05 Buf::write(): condor_write() failed
10/24/11 16:07:05 Failed to send job exit status to shadow
10/24/11 16:07:05 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/24/11 16:07:05 Returning from CStarter::JobReaper()

SchedLog:
10/24/11 16:07:05 (pid:28158) Got SIGQUIT. Performing fast shutdown.
10/24/11 16:07:05 (pid:28158) Now in shutdown_fast. Sending signals to shadows
10/24/11 16:07:05 (pid:28158) Sent signal 9 to shadow [pid 28205] for job 21.0
10/24/11 16:07:05 (pid:28158) ScheddCronJobMgr: Shutting down
10/24/11 16:07:05 (pid:28158) CronJobMgr: Killing all jobs
10/24/11 16:07:05 (pid:28158) Cron: Killing all jobs
10/24/11 16:07:05 (pid:28158) Trying to update collector <IP:9618>
10/24/11 16:07:05 (pid:28158) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 16:07:05 (pid:28158) Trying to update collector <IP:9618>
10/24/11 16:07:05 (pid:28158) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 16:07:05 (pid:28158) All shadows have been killed, exiting.
10/24/11 16:07:05 (pid:28158) **** condor_schedd (condor_SCHEDD) pid 28158 EXITING WITH STATUS 0

python test script:
Installing handlers for signals: [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] (exit_on_signal: True)
Signal received: 3, exiting...

Retested over current condor:
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

StarterLog.slot1:
10/24/11 13:34:17 Got SIGQUIT. Performing fast shutdown.
10/24/11 13:34:17 ShutdownFast all jobs.
10/24/11 13:34:17 in VanillaProc::ShutdownFast()
10/24/11 13:34:17 sending signal 3 to process with pid 16912
10/24/11 13:34:17 Using escalation delay 29 for Escalation Timer
10/24/11 13:34:17 DaemonCore: No more children processes to reap.
10/24/11 13:34:17 Process exited, pid=16912, status=0
10/24/11 13:34:17 Inside VanillaProc::JobReaper()
10/24/11 13:34:17 Cancel signal escalation timer (11)
10/24/11 13:34:17 Inside OsProc::JobReaper()
10/24/11 13:34:17 Inside UserProc::JobReaper()
10/24/11 13:34:17 Reaper: all=1 handled=1 ShuttingDown=1
10/24/11 13:34:17 In VanillaProc::PublishUpdateAd()
10/24/11 13:34:17 Inside OsProc::PublishUpdateAd()
10/24/11 13:34:17 Inside UserProc::PublishUpdateAd()
10/24/11 13:34:17 HOOK_JOB_EXIT not configured.
10/24/11 13:34:17 In VanillaProc::PublishUpdateAd()
10/24/11 13:34:17 Inside OsProc::PublishUpdateAd()
10/24/11 13:34:17 Inside UserProc::PublishUpdateAd()
10/24/11 13:34:17 Entering JICShadow::updateShadow()
10/24/11 13:34:17 condor_write(): Socket closed when trying to write 266 bytes to <IP:41859>, fd is 6
10/24/11 13:34:17 Buf::write(): condor_write() failed
10/24/11 13:34:17 Sent job ClassAd update to startd.
10/24/11 13:34:17 JICShadow::updateShadow(): failed to send update
10/24/11 13:34:17 Inside JICShadow::transferOutput(void)
10/24/11 13:34:17 JICShadow::transferOutput(void): Transferring...
10/24/11 13:34:17 Inside JICShadow::transferOutputMopUp(void)
10/24/11 13:34:17 Inside OsProc::JobExit()
10/24/11 13:34:17 Notifying exit status=0 reason=107
10/24/11 13:34:17 Sent job ClassAd update to startd.
10/24/11 13:34:17 condor_write(): Socket closed when trying to write 78 bytes to <IP:41859>, fd is 6
10/24/11 13:34:17 Buf::write(): condor_write() failed
10/24/11 13:34:17 Failed to send job exit status to shadow
10/24/11 13:34:17 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/24/11 13:34:17 Returning from CStarter::JobReaper()

SchedLog:
10/24/11 13:34:17 (pid:16870) Got SIGQUIT. Performing fast shutdown.
10/24/11 13:34:17 (pid:16870) Now in shutdown_fast. Sending signals to shadows
10/24/11 13:34:17 (pid:16870) Sent signal 9 to shadow [pid 16909] for job 383.0
10/24/11 13:34:17 (pid:16870) ScheddCronJobMgr: Shutting down
10/24/11 13:34:17 (pid:16870) CronJobMgr: Killing all jobs
10/24/11 13:34:17 (pid:16870) Cron: Killing all jobs
10/24/11 13:34:17 (pid:16870) Trying to update collector <IP:9618>
10/24/11 13:34:17 (pid:16870) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 13:34:17 (pid:16870) Trying to update collector <IP:9618>
10/24/11 13:34:17 (pid:16870) Attempting to send update via UDP to collector localhost <IP:9618>
10/24/11 13:34:17 (pid:16870) All shadows have been killed, exiting.
10/24/11 13:34:17 (pid:16870) **** condor_schedd (condor_SCHEDD) pid 16870 EXITING WITH STATUS 0

Installing handlers for signals: [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] (exit_on_signal: True)
Signal received: 3, exiting...

There is still a "Failed to send job exit status to shadow" error message in the StarterLog. The procd problem seems to be solved. Please, could you confirm that the patch was included in the latest build?

That could possibly be due to other issues. Another way to validate would be to check the starter logs to make certain they were able to send their final update prior to exit. How exactly are you testing the modification?

In Comment #19, in the StarterLog part:
JICShadow::updateShadow(): failed to send update
Is this the update you were talking about?

I tested with the python script from the attachments and this job:

executable = /usr/bin/python
args = "test_signal.py 1 2"
output = job.$(CLUSTER).$(PROCESS).out
error = job.$(CLUSTER).$(PROCESS).err
log = job.$(CLUSTER).$(PROCESS).log
universe = vanilla
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 1

When using:

# created from signal_catcher.c
executable = a.out
output = /tmp/job.$(CLUSTER).$(PROCESS).out
error = /tmp/job.$(CLUSTER).$(PROCESS).err
log = /tmp/job.$(CLUSTER).$(PROCESS).log
universe = vanilla
periodic_remove = JobStartDate + 20 < CurrentTime
kill_sig = 2
remove_kill_sig = 3
KillSigTimeout = 45
queue 4

it seems to be working as expected. Could you please attach a full log & submission set?

Created attachment 530300 [details]
RH-7.6.5-0.2.el6 logs
I just retested it to have clean logs for you.
$CondorVersion: 7.6.5 Oct 21 2011 BuildID: RH-7.6.5-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $
The attached logs don't tell me anything; it doesn't look like the job is being shut down at all. To repro, use the example above, update the queue count to 100+, and monitor the logs.

Before the fix: EXITING 108 <- RELEASE_CLAIM
After the fix: the claim is reused over and over until timeout.

Thanks Timothy. Retested on supported platforms RHEL5, RHEL6 / x86, x86_64:
condor-7.6.5-0.2
10/27/11 16:28:14 (259.1) (14086): **** condor_shadow (condor_SHADOW) pid 14086 EXITING WITH STATUS 113
10/27/11 16:28:14 Last process exited, now Starter is exiting
10/27/11 16:28:14 **** condor_starter (condor_STARTER) pid 14087 EXITING WITH STATUS 0
10/27/11 16:33:15 (259.2) (14110): **** condor_shadow (condor_SHADOW) pid 14110 EXITING WITH STATUS 113
....
After 100 iterations there is no EXITING 108 at all. The claim was reused.
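The verification above boils down to scanning the ShadowLog for shadow exit statuses: a status of 108 means the claim was released, while normal exits (113 here, for removed jobs) mean the claim kept being reused. As a rough illustration, a few lines of Python can tally the statuses in a log excerpt; the sample lines are copied from the verification above, and in practice you would read the real ShadowLog file instead:

```python
# Count condor_shadow exit statuses in a ShadowLog excerpt.
import re

sample_log = """\
10/27/11 16:28:14 (259.1) (14086): **** condor_shadow (condor_SHADOW) pid 14086 EXITING WITH STATUS 113
10/27/11 16:33:15 (259.2) (14110): **** condor_shadow (condor_SHADOW) pid 14110 EXITING WITH STATUS 113
"""


def count_exit_statuses(log_text):
    """Return a dict mapping shadow exit status -> occurrence count."""
    counts = {}
    for status in re.findall(r"EXITING WITH STATUS (\d+)", log_text):
        counts[status] = counts.get(status, 0) + 1
    return counts


# Before the fix, status 108 (claim released) would show up in the tally;
# after the fix only the normal exits remain.
print(count_exit_statuses(sample_log))  # → {'113': 2}
```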
>>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-C: During signal escalation it is possible for communications to be out of sync such that a starter is temporarily leaked.
-C: Causes a claim to be unnecessarily relinquished due to out-of-order processing, e.g. the shadow exits before the starter has been cleaned up.
-F: Force the shadow to wait for the final update from the starter.
-R: Claim reuse performance increase, plus increased job throughput during periodic evaluation and signal escalation.
+During signal escalation it was possible for a starter to leak due to disjointed communications. This in turn could cause a claim to be unnecessarily relinquished due to out-of-order processing, such as if a shadow exited before the starter had been cleaned up. This update forces the shadow to wait for the final update from the starter, which results in a claim reuse performance increase and increased job throughput during periodic evaluation and signal escalation.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0045.html