603728 – Schedd crashes after submitting 1,000 Windows jobs

Summary: Schedd crashes after submitting 1,000 Windows jobs

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	Development
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.3
Target Release:	---
Assignee:	Timothy St. Clair
QA Contact:	Martin Kudlej
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	578396

Reported:	2010-06-14 13:00 UTC by Martin Kudlej
Modified:	2011-03-17 18:16 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-10-20 11:28:38 UTC
Target Upstream Version:
Embargoed:

Attachments
log files and condor_config.local (38.88 KB, application/x-gzip) 2010-06-14 13:00 UTC, Martin Kudlej

Description Martin Kudlej 2010-06-14 13:00:38 UTC

Created attachment 423823
log files and condor_config.local

Description of problem:
I tried to submit 4,000 Windows jobs in four batches of 1,000:

for i in `seq 4`; do su xxx -c 'condor_submit /root/wait.bat.sub' || service condor stop || killall condor_schedd;sleep 30;done

$ cat wait.bat.sub:
universe = vanilla
executable = /root/wait.bat
arguments = 1
requirements = ( Arch=="Intel") && ( OpSys=="WINNT51" || OpSys=="WINNT52" )
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
iwd = /tmp
queue 1000

$ cat wait.bat:
@ping 127.0.0.1 -n %1% -w 1000 > nul

The scheduler crashed after the first 1,000 jobs were submitted. I had enabled full debug logging, so once condor_submit exited with a non-zero return code, I stopped the condor service and then cleaned up the schedd process with "killall condor_schedd".
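
Note that the one-liner in the description chains its fallbacks with `||`, so `killall condor_schedd` only runs if `service condor stop` itself fails. A more readable sketch of the same reproduction loop follows, assuming the same submit file path and the placeholder user `xxx` from the report. The `RUN`, `BATCHES`, and `PAUSE` variables and the `submit_batches` function are hypothetical helpers, not part of the original; by default the sketch only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction loop from the report, unrolled for readability.
# RUN is a hypothetical helper: it defaults to "echo" (dry run, commands are
# only printed). Set RUN to the empty string (RUN= ./script) to really run them.
RUN=${RUN-echo}
BATCHES=${BATCHES:-4}   # 4 batches x "queue 1000" = 4,000 jobs
PAUSE=${PAUSE:-30}      # seconds to wait between batches

submit_batches() {
    i=1
    while [ "$i" -le "$BATCHES" ]; do
        # Submit one batch as the unprivileged user.
        if ! $RUN su xxx -c 'condor_submit /root/wait.bat.sub'; then
            # Submission failed: stop the service, and fall back to
            # killing the schedd only if the stop itself fails.
            $RUN service condor stop || $RUN killall condor_schedd
        fi
        $RUN sleep "$PAUSE"
        i=$((i + 1))
    done
}

submit_batches
```

In dry-run mode the script prints one `su` line and one `sleep` line per batch, which makes it easy to check the control flow before running it against a real pool.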

Version-Release number of selected component (if applicable):
condor-7.4.3-0.17.el5

How reproducible:
100%

Steps to Reproduce:
1. Set up a Condor pool: central manager (CM) on RHEL 5.5 beta plus a Windows execute node.
2. Try to submit 1,000 Windows jobs.
3. Wait for the crash.
  
Actual results:
The scheduler crashes.

Expected results:
The scheduler does not crash.

Additional info:

$ cat ScheddLog:
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
Stack dump for process 8361 at timestamp 1276285165 (0 frames)

condor_config.local: Personal Condor (default settings) plus:
ALLOW_WRITE=*
ALLOW_READ=*

CREATE_CORE_FILES = True
ABORT_ON_EXCEPTION = True

ALL_DEBUG = D_FULLDEBUG

Comment 2 Martin Kudlej 2010-06-21 12:01:18 UTC

I've retested this three times with condor-7.4.3-0.20.el5, using:
for i in `seq 100`; do su xxx -c 'condor_submit /root/wait.bat.sub' || service condor stop || killall condor_schedd;sleep 30;condor_rm -all;sleep 10;done;

I no longer see any stack dump.

Before this is marked verified, it should be retested on all supported architectures and OSes.

Comment 3 Martin Kudlej 2010-08-12 13:33:15 UTC

I've retested this as in comment #2 on RHEL 5.5 and 4.8, on both i386 and x86_64, with condor-7.4.4-0.8, and it works. --> VERIFIED
