All shadow -> schedd communication happens on a new connection. This causes undue connection-creation load on the Schedd and can result in timeouts in the Shadow. Instead, there should be a persistent connection which can be re-opened in the case of a failure. Setting SHADOW_TIMEOUT_MULTIPLIER=4 can help the shadow avoid bailing out, but it does not reduce the load imposed on the Schedd.
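For reference, a minimal condor_config sketch of the workaround mentioned above (the value 4 is the one suggested here, not a shipped default):

    # Stretch the shadow's communication timeouts so it is less likely to bail
    # out while the schedd is busy; this does not reduce load on the schedd.
    SHADOW_TIMEOUT_MULTIPLIER = 4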
Things to consider: many environments impose a default file descriptor limit of 1024, which could get in the way, and DaemonCore's select loop may not scale well to large numbers of registered sockets.
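As a quick sanity check (an illustrative shell snippet, not part of the proposal), the descriptor limits the schedd would run under can be inspected before weighing a persistent-connection design:

    # Show the soft and hard file descriptor limits for the current shell;
    # a schedd holding one persistent socket per shadow would be bounded by these.
    ulimit -Sn
    ulimit -Hn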
I'm in agreement with Dan. From a previous life, I know select scales rather poorly, which can drastically decrease performance at large scale. So I'm not certain a persistent connection is the best route here, or whether the reconnect should instead be made more fault-tolerant on the shadow's end.
After waking from my restful slumber this a.m., I seem to recall a conversation between Dan, TT, and me about having the shadow simply try once at the end of the job and then write a file, so that on the reaper's next go-around the Schedd would detect the file and update correctly. This seems like the best of all worlds, as it would decrease load and yield no information loss.
Scale tests should check #699723 prior to any major efforts here, because it should affect schedd load performance.
In performing tests I only see a couple of dozen "Failed to perform final update to job queue!" messages per several thousand jobs. You may see a fair amount of connection timeouts in the ShadowLog, but that does not result in job failure and rerun; it is part of the normal retry logic. The variables that control the cleanup logic are SHADOW_MAX_JOB_CLEANUP_RETRIES (default 5) and SHADOW_JOB_CLEANUP_RETRY_DELAY (default 30), shown below.
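For clarity, those two knobs as they would appear in condor_config, written here with the defaults quoted above:

    # How many times the shadow retries the final job-queue update before giving up.
    SHADOW_MAX_JOB_CLEANUP_RETRIES = 5
    # Seconds to wait between those cleanup retries.
    SHADOW_JOB_CLEANUP_RETRY_DELAY = 30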
Testing Notes: Adjusting SHADOW_MAX_JOB_CLEANUP_RETRIES=100 decreased the number of occurrences to only a handful, but they still occurred. I'm going to play with the ACCEPT threshold to see if that eliminates them.
Testing Notes: Setting MAX_ACCEPTS_PER_CYCLE = 10 together with SHADOW_MAX_JOB_CLEANUP_RETRIES=100 eliminates them entirely. In fact, I haven't seen any connection retries in the shadow logs now.
A modest increase to MAX_ACCEPTS_PER_CYCLE = 8 seems to do the trick under heavy loads. Still tinkering to get a full-scale test to verify.
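To summarize, a sketch of the two configurations exercised in the tests above (these are the values tried here, not shipped defaults):

    # Aggressive combination from the first test: no cleanup failures observed.
    MAX_ACCEPTS_PER_CYCLE = 10
    SHADOW_MAX_JOB_CLEANUP_RETRIES = 100

    # More modest accept threshold that also held up under heavy load.
    # MAX_ACCEPTS_PER_CYCLE = 8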
To verify, have a script that does constant submissions at a given rate, then scan the shadow logs for timeouts before and after the modification at medium scale; a sketch of such a script follows.
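A minimal sketch of such a verification script, assuming a submit description file named submit.job, the default ShadowLog location used earlier in this report, and a placeholder rate of one job per second:

    #!/bin/sh
    # Submit one job per second for an hour.
    for i in `seq 1 3600`; do
        condor_submit submit.job > /dev/null
        sleep 1
    done
    # Count shadow-side timeouts; compare this number before and after
    # changing MAX_ACCEPTS_PER_CYCLE.
    grep -ci timeout /var/log/condor/ShadowLog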
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: The Schedd could often spawn more shadows than it could handle given its current default parameter configuration.
C: Shadows would time out trying to communicate with the Schedd.
F: Raise the MAX_ACCEPTS_PER_CYCLE default.
R: 0 timeouts under scale testing.
I tried to reproduce with:

$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

Config:
NUM_CPUS=500

Job:
universe = vanilla
cmd=/bin/sleep
args=60
should_transfer_files = if_needed
WhenToTransferOutput = ON_EXIT
queue=1

Submitted with:
for i in `seq 1 1000`; do sudo -u test condor_submit submit.job; done;

# tail -F /var/log/condor/ShadowLog | grep -i timeout
11/01/11 17:37:57 (2297.0) (5677): condor_read(): timeout reading 5 bytes from <IP:42051>.
11/01/11 17:37:57 (2218.0) (6308): condor_read(): timeout reading 5 bytes from <IP:42051>.
11/01/11 17:37:58 (2350.0) (3449): condor_read(): timeout reading 5 bytes from <IP:42051>.
11/01/11 17:38:03 (2237.0) (6342): condor_read(): timeout reading 5 bytes from <IP:42051>

# condor_config_val MAX_ACCEPTS_PER_CYCLE
Not defined: MAX_ACCEPTS_PER_CYCLE
# condor_config_val SHADOW_MAX_JOB_CLEANUP_RETRIES
5
# condor_config_val SHADOW_JOB_CLEANUP_RETRY_DELAY
30

I can't find any error similar to "Failed to perform final update to job queue!" in ShadowLog. Do you have any idea what is missing in my case to successfully reproduce it?
Retested on:

$CondorVersion: 7.6.5 Nov 04 2011 BuildID: RH-7.6.5-0.6.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

11/05/11 15:03:31 (260.5423) (24360): Received GoAhead from peer to send /bin/sleep and all further files.
11/05/11 15:03:31 (260.4126) (21738): Can't connect to queue manager: CEDAR:6001:Failed to connect to <IP:38808>
11/05/11 15:03:31 (260.4126) (21738): Failed to perform final update to job queue!
11/05/11 15:03:31 (260.4126) (21738): Will retry job cleanup in SHADOW_JOB_CLEANUP_RETRY_DELAY=30 seconds
11/05/11 15:03:31 (260.4126) (21738): condor_read(): Socket closed when trying to read 5 bytes from startd slot2@localhost
11/05/11 15:03:31 (260.4126) (21738): IO: EOF reading packet header
11/05/11 15:03:31 (260.5426) (23561): entering FileTransfer::HandleCommands

It looks like the same issue is present in the current version of Condor.
The question is: what is the rate of these errors? Before the change it could be substantial; now it should be approaching zero, but not exactly zero, because there are numerous cases in which the schedd can be very busy.
There is a big difference between a slow and a normal submitting machine. I was not able to reproduce it on the faster one, but on the slowest one the error message shows up more often.

Retested over all supported platforms (x86, x86_64 / RHEL5, RHEL6) with condor-7.6.5-0.6.
A medium-scale pool was used: 380 cores/slots, 10000 jobs submitted.
On all platforms I see exactly 22 errors like "Failed to perform final update to job queue!"

>>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-C: The Schedd could often spawn more shadows than it could handle given its current default parameter configuration.
-C: Shadows would time out trying to communicate with the Schedd.
-F: Raise the MAX_ACCEPTS_PER_CYCLE default.
-R: 0 timeouts under scale testing.
+The schedd daemon could spawn more shadows than it could handle given its current default configuration, which would in turn cause shadows to time out when trying to communicate with the schedd. With this update, the MAX_ACCEPTS_PER_CYCLE value has been increased, with the result that zero timeouts occurred under scale testing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2012-0045.html