Bug 603658 - Scheduler crashes in count_jobs
Summary: Scheduler crashes in count_jobs
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Assignee: Matthew Farrellee
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2010-06-14 09:32 UTC by Martin Kudlej
Modified: 2011-03-17 18:16 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-06-17 15:19:09 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---


Attachments
log files and condor_config.local (646.88 KB, application/x-gzip)
2010-06-14 09:32 UTC, Martin Kudlej

Description Martin Kudlej 2010-06-14 09:32:31 UTC
Created attachment 423771 [details]
log files and condor_config.local

Description of problem:
I tried to submit 100,000 jobs and the scheduler crashed.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.16.el5

How reproducible:
100%

Steps to Reproduce:
1. set NUM_CPUS=1024 in the local Condor configuration (a sketch of the change follows these steps)
2. service condor restart
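
A minimal sketch of the step 1 change, assuming the local configuration file is /etc/condor/condor_config.local (the path is an assumption; any file read by the Condor configuration works):

# /etc/condor/condor_config.local
NUM_CPUS = 1024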
  
Actual results:
Scheduler crashes.

Expected results:
Scheduler doesn't crash.

Additional info:
06/12 06:41:47 (pid:2469) 1.1054: JobLeaseDuration remaining: 526
Stack dump for process 2469 at timestamp 1276339325 (16 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0xa0f420]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(_Znwj+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(_Znaj+0x1d)[0x6ebbed]
condor_schedd(_Z14grow_prio_recsi+0x47)[0x8113997]
condor_schedd(_ZN9Scheduler10count_jobsEv+0xad8)[0x80feb38]
condor_schedd(_ZN9Scheduler7timeoutEv+0xef)[0x810374f]
condor_schedd(_Z9main_initiPPc+0x1b1)[0x810c111]
condor_schedd(main+0xd73)[0x8177283]
/lib/libc.so.6(__libc_start_main+0xdc)[0x29ae9c]
condor_schedd[0x80e4601]

Comment 1 Martin Kudlej 2010-06-14 09:41:35 UTC
I forgot the last line of "Steps to Reproduce:"
3. submit jobs
for i in `seq 200`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i";sleep 60;done

$ cat job.sub
Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 100000

Comment 2 Martin Kudlej 2010-06-14 10:02:20 UTC
I can also reproduce this without NUM_CPUS=1024; it depends only on the number of submitted jobs.

Comment 3 Matthew Farrellee 2010-06-15 09:26:45 UTC
Looks like the test was to submit 20 million (not 100 thousand) jobs: 200 runs of condor_submit against a file that queues 100,000 jobs each.

Please test with the newest packages and monitor memory usage. The stack suggests that a call to operator new failed, i.e. the process ran out of memory.

FYI, stack trace piped through c++filt,

Stack dump for process 2469 at timestamp 1276339325 (16 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0xa0f420]
/usr/lib/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(operator new(unsigned int)+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(operator new[](unsigned int)+0x1d)[0x6ebbed]
condor_schedd(grow_prio_recs(int)+0x47)[0x8113997]
condor_schedd(Scheduler::count_jobs()+0xad8)[0x80feb38]
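
The demangled form above can be produced with something along these lines (a sketch; stack_dump.txt is a hypothetical file holding the raw frames copied out of the daemon log):

$ c++filt < stack_dump.txt

c++filt demangles any mangled C++ symbols it finds in the input and copies the rest of each line through unchanged.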

Comment 4 Matthew Farrellee 2010-06-16 17:21:01 UTC
It appears the test was run on a 32-bit machine. It is expected that submitting 20M jobs can exhaust a 32-bit process's address space.

Please run pmap -d $(pidof condor_schedd) in a loop while the Schedd starts up with the job_queue.log left over from the crash. Somewhere around 3 GB of mapped memory the Schedd will likely crash.
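
For example, a monitoring loop along these lines (a sketch; the 5-second interval is arbitrary, and pmap starts reporting an error once the Schedd exits):

$ while true; do pmap -d $(pidof condor_schedd) | tail -n1; sleep 5; done
# the last line of pmap -d is the summary: mapped, writeable/private and shared totals in KB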

Comment 5 Matthew Farrellee 2010-06-17 15:19:09 UTC
Current prevention is via MAX_JOBS_SUBMITTED; any new prevention mechanism should be filed as an RFE.
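
For reference, the existing knob is a plain configuration entry along these lines (the limit of 100000 is only an illustrative value):

MAX_JOBS_SUBMITTED = 100000

With this set, condor_submit is refused once the Schedd's queue reaches the limit, so the queue cannot grow until the process exhausts its address space.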

