Bug 603658 - Scheduler crashes in count_jobs
Status: CLOSED WONTFIX
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All Linux
Priority: high  Severity: high
Target Milestone: 1.3
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: MRG Quality Engineering
 
Reported: 2010-06-14 05:32 EDT by Martin Kudlej
Modified: 2011-03-17 14:16 EDT

Doc Type: Bug Fix
Last Closed: 2010-06-17 11:19:09 EDT

Attachments
log files and condor_config.local (646.88 KB, application/x-gzip)
2010-06-14 05:32 EDT, Martin Kudlej
Description Martin Kudlej 2010-06-14 05:32:31 EDT
Created attachment 423771
log files and condor_config.local

Description of problem:
I tried to submit 100,000 jobs and the scheduler crashed.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.16.el5

How reproducible:
100%

Steps to Reproduce:
1. set up NUM_CPUS=1024 (see the config sketch after these steps)
2. service condor restart
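A minimal sketch of step 1, assuming the local config lives at /etc/condor/condor_config.local (the path is an assumption; adjust to your LOCAL_CONFIG_FILE):

# Advertise 1024 slots on a single test box, then restart condor (step 2).
echo 'NUM_CPUS = 1024' >> /etc/condor/condor_config.local
service condor restart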
  
Actual results:
Scheduler crashes.

Expected results:
Scheduler doesn't crash.

Additional info:
06/12 06:41:47 (pid:2469) 1.1054: JobLeaseDuration remaining: 526
Stack dump for process 2469 at timestamp 1276339325 (16 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0xa0f420]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(_Znwj+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(_Znaj+0x1d)[0x6ebbed]
condor_schedd(_Z14grow_prio_recsi+0x47)[0x8113997]
condor_schedd(_ZN9Scheduler10count_jobsEv+0xad8)[0x80feb38]
condor_schedd(_ZN9Scheduler7timeoutEv+0xef)[0x810374f]
condor_schedd(_Z9main_initiPPc+0x1b1)[0x810c111]
condor_schedd(main+0xd73)[0x8177283]
/lib/libc.so.6(__libc_start_main+0xdc)[0x29ae9c]
condor_schedd[0x80e4601]
Comment 1 Martin Kudlej 2010-06-14 05:41:35 EDT
I forgot the last line of "Steps to Reproduce:"
3. submit jobs
for i in `seq 200`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i";sleep 60;done

$ cat job.sub:
Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 100000
Comment 2 Martin Kudlej 2010-06-14 06:02:20 EDT
I can also reproduce this without NUM_CPUS=1024; it depends only on the number of submitted jobs.
Comment 3 Matthew Farrellee 2010-06-15 05:26:45 EDT
Looks like the test was to submit 20 million jobs (200 clusters of 100,000 jobs each), not 100 thousand.

Please test with the newest packages and monitor memory usage. The stack trace suggests a call to new failed (most likely memory exhaustion).

FYI, stack trace piped through c++filt,

Stack dump for process 2469 at timestamp 1276339325 (16 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0xa0f420]
/usr/lib/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(operator new(unsigned int)+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(operator new[](unsigned int)+0x1d)[0x6ebbed]
condor_schedd(grow_prio_recs(int)+0x47)[0x8113997]
condor_schedd(Scheduler::count_jobs()+0xad8)[0x80feb38]
Comment 4 Matthew Farrellee 2010-06-16 13:21:01 EDT
It appears the test was run on a 32-bit machine. Exhausting a 32-bit process address space is expected when submitting 20M jobs.

Please run pmap -d $(pidof condor_schedd) in a loop while the Schedd starts up with the job_queue.log left over from the crash. The Schedd will likely crash somewhere around 3 GB of mapped memory.
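A minimal sampling loop along these lines should do; the 5-second interval and keeping only pmap's totals line are illustrative choices, not from this bug:

# Sample the Schedd's mapped-memory totals until the process exits.
while pidof condor_schedd >/dev/null; do
    pmap -d "$(pidof condor_schedd)" | tail -n 1   # "mapped: ...K  writeable/private: ...K  shared: ...K"
    sleep 5                                        # arbitrary sampling interval
done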
Comment 5 Matthew Farrellee 2010-06-17 11:19:09 EDT
Current prevention is via MAX_JOBS_SUBMITTED; new prevention should be filed as an RFE.
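For reference, a sketch of the current prevention; the 100000 cap and the config path are illustrative assumptions, not values recommended in this bug:

# Cap how many jobs the Schedd will accept, then restart and verify the value.
# The config path is an assumption; adjust to your LOCAL_CONFIG_FILE.
echo 'MAX_JOBS_SUBMITTED = 100000' >> /etc/condor/condor_config.local
service condor restart
condor_config_val MAX_JOBS_SUBMITTED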
