Bug 603728 - Schedd crashes after submitting 1,000 Windows jobs
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Hardware: All Linux
Priority: high
Severity: high
: 1.3
: ---
Assigned To: Timothy St. Clair
QA Contact: Martin Kudlej
Depends On:
Blocks: 578396
Reported: 2010-06-14 09:00 EDT by Martin Kudlej
Modified: 2011-03-17 14:16 EDT

Doc Type: Bug Fix
Last Closed: 2010-10-20 07:28:38 EDT

Attachments
log files and condor_config.local (38.88 KB, application/x-gzip)
2010-06-14 09:00 EDT, Martin Kudlej

Description Martin Kudlej 2010-06-14 09:00:38 EDT
Created attachment 423823
log files and condor_config.local

Description of problem:
I've tried to submit 4,000 Windows jobs:

for i in `seq 4`; do su xxx -c 'condor_submit /root/wait.bat.sub' || service condor stop || killall condor_schedd;sleep 30;done

$ cat wait.bat.sub:
universe = vanilla
executable = /root/wait.bat
arguments = 1
requirements = ( Arch=="Intel") && ( OpSys=="WINNT51" || OpSys=="WINNT52" )
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
iwd = /tmp
queue 1000

$ cat wait.bat:
@ping -n %1% -w 1000 > nul

The scheduler crashed after the first 1,000 jobs were submitted. I had set up full debugging, so when condor_submit exited with a non-zero return code, I stopped the condor service and then cleaned up the schedd process with "killall condor_schedd".
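
Editorial sketch, not part of the original report: in the one-liner above, the `||` chain runs "killall condor_schedd" only if "service condor stop" itself fails. The version below does what the paragraph describes instead (stop the service, then always kill any leftover schedd). The function names submit_jobs, stop_condor, kill_schedd and run_round are hypothetical; the commands inside them are copied from the report.

```shell
# Wrappers around the commands from the report, so each step has a name.
submit_jobs() { su xxx -c 'condor_submit /root/wait.bat.sub'; }
stop_condor() { service condor stop; }
kill_schedd() { killall condor_schedd; }

# One round: submit the jobs; if submission fails, stop the service
# and then kill any schedd process that is still around.
run_round() {
    if ! submit_jobs; then
        stop_condor
        kill_schedd
        return 1
    fi
    return 0
}
```

Calling run_round four times with "sleep 30" between calls corresponds to the original loop.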

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up a Condor pool: central manager (CM) on RHEL 5.5 beta plus a Windows execute node.
2. Submit 1,000 Windows jobs.
3. Wait for the crash.
Actual results:
The scheduler crashes.

Expected results:
The scheduler does not crash.

Additional info:

$ cat ScheddLog:
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
06/11 15:38:51 (pid:8361) OwnerCheck retval 1 (success),no ad
Stack dump for process 8361 at timestamp 1276285165 (0 frames)

condor_config.local = Personal Condor(default settings) and:


Comment 2 Martin Kudlej 2010-06-21 08:01:18 EDT
I've retested this 3 times with condor-7.4.3-0.20.el5, using:
for i in `seq 100`; do su xxx -c 'condor_submit /root/wait.bat.sub' || service condor stop || killall condor_schedd;sleep 30;condor_rm -all;sleep 10;done;

I don't see any stack dump.

It should be retested on all architectures and OSes before it is marked as verified.
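
Editorial sketch, not part of the original comment: when retesting on several machines, the check for a stack dump can be scripted rather than read off the log by eye. The function below counts lines that start with "Stack dump for process", the marker seen in the ScheddLog excerpt in the description. The function name count_stack_dumps is hypothetical, and the log path is passed in as an argument because the report does not give the full path to ScheddLog.

```shell
# count_stack_dumps FILE
# Print how many lines in FILE start with "Stack dump for process".
count_stack_dumps() {
    # grep -c prints the number of matching lines; it exits with status 1
    # when that number is 0, so the status is deliberately ignored.
    grep -c '^Stack dump for process' "$1" || true
}
```

A result of 0 after a test run means no stack dump was logged in that file.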
Comment 3 Martin Kudlej 2010-08-12 09:33:15 EDT
I've retested this as in comment #2 on RHEL 5.5 and 4.8, on both i386 and x86_64, with condor-7.4.4-0.8, and it works. --> VERIFIED
