Bug 603728 - Schedd crashes after submitting 1,000 Windows jobs
Summary: Schedd crashes after submitting 1,000 Windows jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On:
Blocks: 578396
Reported: 2010-06-14 13:00 UTC by Martin Kudlej
Modified: 2011-03-17 18:16 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-20 11:28:38 UTC
Target Upstream Version:
Embargoed:


Attachments
log files and condor_config.local (38.88 KB, application/x-gzip)
2010-06-14 13:00 UTC, Martin Kudlej

Description Martin Kudlej 2010-06-14 13:00:38 UTC
Created attachment 423823
log files and condor_config.local

Description of problem:
I tried to submit 4,000 Windows jobs in 4 batches of 1,000:

for i in `seq 4`; do su xxx -c 'condor_submit /root/wait.bat.sub' || service condor stop || killall condor_schedd;sleep 30;done

$ cat wait.bat.sub:
universe = vanilla
executable = /root/wait.bat
arguments = 1
requirements = ( Arch=="Intel") && ( OpSys=="WINNT51" || OpSys=="WINNT52" )
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
iwd = /tmp
queue 1000

$ cat wait.bat:
@ping 127.0.0.1 -n %1% -w 1000 > nul

The scheduler crashed after the first 1,000 jobs were submitted. I had set up full debug logging, so once condor_submit exited with a return code > 0, I stopped the condor service and then cleaned up the schedd process with "killall condor_schedd".
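
For reference, the `||` chain in the loop above short-circuits: `service condor stop` runs only if `condor_submit` returns non-zero, and `killall condor_schedd` runs only if the stop itself also fails. A minimal sketch of the same control flow with explicit conditionals follows. The function name `submit_batch` and its three command-string arguments are illustrative assumptions, not part of the original report.

```shell
# Illustrative helper (not from the report): same control flow as
#   submit || stop || kill
# written out with explicit conditionals.
submit_batch() {
    submit_cmd=$1   # e.g. "su xxx -c 'condor_submit /root/wait.bat.sub'"
    stop_cmd=$2     # e.g. "service condor stop"
    kill_cmd=$3     # e.g. "killall condor_schedd"

    if ! sh -c "$submit_cmd"; then
        # Submission failed: try a clean service stop first.
        if ! sh -c "$stop_cmd"; then
            # Only if the stop also fails is the schedd killed directly.
            sh -c "$kill_cmd"
        fi
    fi
}
```

Written this way, it is easier to see that the kill step is a fallback for a failed service stop rather than something that always runs after it.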

Version-Release number of selected component (if applicable):
condor-7.4.3-0.17.el5

How reproducible:
100%

Steps to Reproduce:
1. Set up a Condor pool: central manager (CM) on RHEL 5.5 beta plus a Windows execute node.
2. Submit 1,000 Windows jobs.
3. Wait for the schedd to crash.
  
Actual results:
The scheduler crashes.

Expected results:
The scheduler does not crash.

Additional info:

$ cat ScheddLog:
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
Stack dump for process 8361 at timestamp 1276285165 (0 frames)
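
The "Stack dump for process" line in the excerpt above is the crash marker, so a retest can check for it mechanically. A minimal sketch, assuming the marker always starts at the beginning of a log line as it does here; the helper name `count_stack_dumps` is hypothetical.

```shell
# Illustrative helper (not from the report): count crash markers in a schedd log.
count_stack_dumps() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so the status is masked to keep callers simple.
    grep -c '^Stack dump for process' "$1" || true
}
```

A possible invocation is `count_stack_dumps "$(condor_config_val SCHEDD_LOG)"`, assuming `SCHEDD_LOG` points at the log shown above; a result of 0 means no stack dump was recorded.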

condor_config.local = Personal Condor (default settings) plus:
ALLOW_WRITE=*
ALLOW_READ=*

CREATE_CORE_FILES = True
ABORT_ON_EXCEPTION = True

ALL_DEBUG = D_FULLDEBUG

Comment 2 Martin Kudlej 2010-06-21 12:01:18 UTC
I retested this 3 times with condor-7.4.3-0.20.el5 using:
for i in `seq 100`; do su xxx -c 'condor_submit /root/wait.bat.sub' || service condor stop || killall condor_schedd;sleep 30;condor_rm -all;sleep 10;done;

I no longer see any stack dump.

This should be retested on all architectures and OSes before it is marked as verified.

Comment 3 Martin Kudlej 2010-08-12 13:33:15 UTC
I retested this as described in comment #2 on RHEL 5.5/4.8, i386 and x86_64, with condor-7.4.4-0.8, and it works. --> VERIFIED