Bug 603657 - Scheduler crashes in AutoCluster
Status: CLOSED WONTFIX
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: MRG Quality Engineering
Reported: 2010-06-14 05:29 EDT by Martin Kudlej
Modified: 2011-03-17 14:16 EDT

Doc Type: Bug Fix
Last Closed: 2010-06-17 11:18:54 EDT

Attachments
log files and configuration file (646.88 KB, application/x-gzip)
2010-06-14 05:29 EDT, Martin Kudlej
Description Martin Kudlej 2010-06-14 05:29:53 EDT
Created attachment 423770
log files and configuration file

Description of problem:
I tried to submit 100,000 jobs and the scheduler crashed.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.16.el5

How reproducible:
100%

Steps to Reproduce:
1. set up NUM_CPUS=1024 (see the config sketch below)
2. service condor restart
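
A minimal sketch of step 1, assuming the knob is appended to the local configuration file (the path is an assumption based on the RHEL condor packages):

  # /etc/condor/condor_config.local (path is an assumption)
  # Advertise 1024 slots from a single machine to stress the schedd
  NUM_CPUS = 1024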
  
Actual results:
Scheduler crashes.

Expected results:
Scheduler doesn't crash.

Additional info:
6/12 06:38:26 (pid:2403) AutoCluster:config() significant atttributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts
Stack dump for process 2403 at timestamp 1276339165 (29 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0x289420]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(_Znwj+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(_Znaj+0x1d)[0x6ebbed]
condor_schedd(_ZN8MyString7reserveEi+0x1c)[0x81e96bc]
condor_schedd(_ZN8MyString16reserve_at_leastEi+0x1e)[0x81e974e]
condor_schedd(_ZN8MyString10append_strEPKci+0x4a)[0x81e9a2a]
condor_schedd(_ZN8MyStringpLEPKc+0x37)[0x81e9a67]
condor_schedd(_ZN11AutoCluster16getAutoClusteridEP7ClassAd+0x539)[0x8130b29]
condor_schedd(_Z12get_job_prioP7ClassAd+0x3b)[0x8113d5b]
condor_schedd(_Z12WalkJobQueuePFiP7ClassAdE+0x41)[0x8115c91]
condor_schedd(_Z17BuildPrioRecArrayb+0x1a6)[0x8115ea6]
condor_schedd(_ZN9Scheduler9negotiateEiP6Stream+0x64e)[0x810486e]
condor_schedd(_ZN9Scheduler11doNegotiateEiP6Stream+0x1a)[0x81065ca]
condor_schedd(_ZN10DaemonCore18CallCommandHandlerEiP6Streamb+0xa0)[0x815a8f0]
condor_schedd(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x136d)[0x816819d]
condor_schedd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0xa3f)[0x816b1ff]
condor_schedd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x22)[0x816b2d2]
condor_schedd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40)[0x82040c0]
condor_schedd(_ZN10DaemonCore17CallSocketHandlerERib+0x130)[0x81630b0]
condor_schedd(_ZN10DaemonCore6DriverEv+0x1f66)[0x81657a6]
condor_schedd(main+0xd80)[0x8177290]
/lib/libc.so.6(__libc_start_main+0xdc)[0x125e9c]
condor_schedd[0x80e4601]
Comment 1 Martin Kudlej 2010-06-14 05:41:28 EDT
I forgot the last line of "Steps to Reproduce":
3. submit jobs
for i in `seq 200`; do condor_submit job.sub >/dev/null 2>&1; echo "$i"; sleep 60; done

$ cat job.sub
Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 100000
Comment 2 Martin Kudlej 2010-06-14 06:02:22 EDT
I can also reproduce this without NUM_CPUS=1024. It depends only on the number of submitted jobs.
Comment 3 Matthew Farrellee 2010-06-15 05:24:46 EDT
Looks like you tried to submit 20 million jobs, not just 100 thousand: the loop runs condor_submit 200 times, and each job.sub queues 100,000 jobs, so 200 x 100,000 = 20,000,000.

Please test with the newest packages and monitor memory usage.

FYI, the stack trace piped through c++filt:

Stack dump for process 2403 at timestamp 1276339165 (29 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0x289420]
/usr/lib/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(operator new(unsigned int)+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(operator new[](unsigned int)+0x1d)[0x6ebbed]
condor_schedd(MyString::reserve(int)+0x1c)[0x81e96bc]
condor_schedd(MyString::reserve_at_least(int)+0x1e)[0x81e974e]
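
For reference, a demangling one-liner of this sort; the log path is an assumption:

  # Demangle the C++ symbols embedded in a schedd stack dump
  c++filt < /var/log/condor/SchedLog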
Comment 4 Matthew Farrellee 2010-06-16 13:21:03 EDT
It appears the test was run on a 32-bit machine. It is expected that you can exhaust a 32-bit process's address space when submitting 20M jobs.

Please run pmap -d $(pidof condor_schedd) in a loop while the schedd starts up with the job_queue.log left over from the crash. Somewhere around 3 GB of mapped memory, the schedd will likely crash.
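
A minimal sketch of such a loop; the polling interval and the output trimming are assumptions:

  # Print the schedd's mapped-memory summary until the process exits
  # (pidof returns nonzero once the schedd is gone, ending the loop)
  while pid=$(pidof condor_schedd); do
      pmap -d "$pid" | tail -n 1   # "mapped: ...K writeable/private: ...K shared: ...K"
      sleep 5
  done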
Comment 5 Matthew Farrellee 2010-06-17 11:18:54 EDT
Current prevention is via MAX_JOBS_SUBMITTED; new prevention should be filed as an RFE.
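
A sketch of that existing knob, assuming it is set in the local configuration (the value here is illustrative):

  # condor_config.local (placement and value are assumptions)
  # Cap the number of jobs the schedd will accept into its queue
  MAX_JOBS_SUBMITTED = 100000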
