All shadow -> schedd communication happens on a new connection. This causes undue connection-creation load on the Schedd and can result in timeouts in the Shadow. Instead, there should be a persistent connection which can be re-opened in the case of a failure. Setting SHADOW_TIMEOUT_MULTIPLIER=4 can help the shadow avoid bailing out, but it does not reduce the load imposed on the Schedd.
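For reference, a minimal condor_config sketch of the workaround mentioned above (the value 4 is the one suggested here, not a shipped default):

    # Stretch the shadow's communication timeouts so it is less likely to bail
    # out while the schedd is busy; this does not reduce load on the schedd.
    SHADOW_TIMEOUT_MULTIPLIER = 4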
Things to consider: many environments impose a default file descriptor limit of 1024, which could get in the way, and DaemonCore's select loop may not scale well to large numbers of registered sockets.
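As a quick sanity check (an illustrative shell snippet, not part of the proposal), the descriptor limits the schedd would run under can be inspected before weighing a persistent-connection design:

    # Show the soft and hard file descriptor limits for the current shell;
    # a schedd holding one persistent socket per shadow would be bounded by these.
    ulimit -Sn
    ulimit -Hn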
I'm in agreement with Dan. From a previous life, I know select scales rather poorly, which can drastically decrease performance at large scale. So I'm not certain a persistent connection is the best route here, or whether the reconnect should instead be made more fault-tolerant on the shadow's end.
After waking from my restful slumber this a.m., I seem to recall a conversation between Dan, TT, and me about having the shadow simply try once at the end of the job and then write a file, so that on the reaper's next go-around the Schedd would detect the file and update correctly. This seems like the best of all worlds, as it would decrease load and yield no information loss.
Scale tests should check #699723 prior to any major efforts here, because it should affect schedd load performance.
In performing tests I only see a couple of dozen "Failed to perform final update to job queue!" messages per several thousand jobs. You may see a fair amount of connection timeouts in the ShadowLog, but that does not result in job failure and rerun; it is part of the normal retry logic. The variables that control the cleanup logic are SHADOW_MAX_JOB_CLEANUP_RETRIES (default 5) and SHADOW_JOB_CLEANUP_RETRY_DELAY (default 30), shown below.
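For clarity, those two knobs as they would appear in condor_config, written here with the defaults quoted above:

    # How many times the shadow retries the final job-queue update before giving up.
    SHADOW_MAX_JOB_CLEANUP_RETRIES = 5
    # Seconds to wait between those cleanup retries.
    SHADOW_JOB_CLEANUP_RETRY_DELAY = 30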
Testing Notes: Adjusting SHADOW_MAX_JOB_CLEANUP_RETRIES=100 decreased the number of occurrences to only a handful, but they still occurred. I'm going to play with the ACCEPT threshold to see if that eliminates them.
Testing Notes: Setting MAX_ACCEPTS_PER_CYCLE = 10 together with SHADOW_MAX_JOB_CLEANUP_RETRIES=100 eliminates them entirely. In fact, I haven't seen any connection retries in the shadow logs now.
A modest increase to MAX_ACCEPTS_PER_CYCLE = 8 seems to do the trick under heavy loads. Still tinkering to get a full-scale test to verify.
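To summarize, a sketch of the two configurations exercised in the tests above (these are the values tried here, not shipped defaults):

    # Aggressive combination from the first test: no cleanup failures observed.
    MAX_ACCEPTS_PER_CYCLE = 10
    SHADOW_MAX_JOB_CLEANUP_RETRIES = 100

    # More modest accept threshold that also held up under heavy load.
    # MAX_ACCEPTS_PER_CYCLE = 8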
To verify, have a script that does constant submissions at a given rate, then scan the shadow logs for timeouts before and after the modification at medium scale; a sketch of such a script follows.
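A minimal sketch of such a verification script, assuming a submit description file named submit.job, the default ShadowLog location used earlier in this report, and a placeholder rate of one job per second:

    #!/bin/sh
    # Submit one job per second for an hour.
    for i in `seq 1 3600`; do
        condor_submit submit.job > /dev/null
        sleep 1
    done
    # Count shadow-side timeouts; compare this number before and after
    # changing MAX_ACCEPTS_PER_CYCLE.
    grep -ci timeout /var/log/condor/ShadowLog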
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: The Schedd could often spawn more shadows than it could handle given its current default parameter configuration.
C: Shadows would time out trying to communicate with the Schedd.
F: Raise the MAX_ACCEPTS_PER_CYCLE default.
R: 0 timeouts under scale testing.
I tried to reproduce with:

$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

Config:
NUM_CPUS=500

Job:
universe = vanilla
cmd=/bin/sleep
args=60
should_transfer_files = if_needed
WhenToTransferOutput = ON_EXIT
queue=1

Submitted with:
for i in `seq 1 1000`; do sudo -u test condor_submit submit.job; done;

# tail -F /var/log/condor/ShadowLog | grep -i timeout
11/01/11 17:37:57 (2297.0) (5677): condor_read(): timeout reading 5 bytes from <IP:42051>.
11/01/11 17:37:57 (2218.0) (6308): condor_read(): timeout reading 5 bytes from <IP:42051>.
11/01/11 17:37:58 (2350.0) (3449): condor_read(): timeout reading 5 bytes from <IP:42051>.
11/01/11 17:38:03 (2237.0) (6342): condor_read(): timeout reading 5 bytes from <IP:42051>

# condor_config_val MAX_ACCEPTS_PER_CYCLE
Not defined: MAX_ACCEPTS_PER_CYCLE
# condor_config_val SHADOW_MAX_JOB_CLEANUP_RETRIES
5
# condor_config_val SHADOW_JOB_CLEANUP_RETRY_DELAY
30

I can't find any error similar to "Failed to perform final update to job queue!" in ShadowLog. Do you have any idea what is missing in my case to successfully reproduce it?
Retested on:

$CondorVersion: 7.6.5 Nov 04 2011 BuildID: RH-7.6.5-0.6.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

11/05/11 15:03:31 (260.5423) (24360): Received GoAhead from peer to send /bin/sleep and all further files.
11/05/11 15:03:31 (260.4126) (21738): Can't connect to queue manager: CEDAR:6001:Failed to connect to <IP:38808>
11/05/11 15:03:31 (260.4126) (21738): Failed to perform final update to job queue!
11/05/11 15:03:31 (260.4126) (21738): Will retry job cleanup in SHADOW_JOB_CLEANUP_RETRY_DELAY=30 seconds
11/05/11 15:03:31 (260.4126) (21738): condor_read(): Socket closed when trying to read 5 bytes from startd slot2@localhost
11/05/11 15:03:31 (260.4126) (21738): IO: EOF reading packet header
11/05/11 15:03:31 (260.5426) (23561): entering FileTransfer::HandleCommands

It looks like the same issue is present in the current version of Condor.
The question is: what is the rate of these errors? Before the change it could be substantial; now it should be approaching zero, but not exactly zero, because there are numerous cases in which the schedd can be very busy.
There is a big difference between a slow and a normal submitting machine. I was not able to reproduce it on the faster one, but on the slowest one the error message shows up more often.

Retested over all supported platforms (x86, x86_64 / RHEL5, RHEL6) with condor-7.6.5-0.6.
A medium-scale pool was used: 380 cores/slots, 10000 jobs submitted.
On all platforms I see exactly 22 errors like "Failed to perform final update to job queue!"

>>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-C: The Schedd could often spawn more shadows than it could handle given its current default parameter configuration.
-C: Shadows would time out trying to communicate with the Schedd.
-F: Raise the MAX_ACCEPTS_PER_CYCLE default.
-R: 0 timeouts under scale testing.
+The schedd daemon could spawn more shadows than it could handle given its current default configuration, which would in turn cause shadows to time out when trying to communicate with the schedd. With this update, the MAX_ACCEPTS_PER_CYCLE value has been increased, with the result that zero timeouts occurred under scale testing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2012-0045.html