Created attachment 423770 [details]
log files and configuration file

Description of problem:
I tried to submit 100,000 jobs and the Scheduler crashed.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.16.el5

How reproducible:
100%

Steps to Reproduce:
1. set up NUM_CPUS=1024
2. service condor restart

Actual results:
The Scheduler crashes.

Expected results:
The Scheduler doesn't crash.

Additional info:
6/12 06:38:26 (pid:2403) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts
Stack dump for process 2403 at timestamp 1276339165 (29 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0x289420]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(_Znwj+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(_Znaj+0x1d)[0x6ebbed]
condor_schedd(_ZN8MyString7reserveEi+0x1c)[0x81e96bc]
condor_schedd(_ZN8MyString16reserve_at_leastEi+0x1e)[0x81e974e]
condor_schedd(_ZN8MyString10append_strEPKci+0x4a)[0x81e9a2a]
condor_schedd(_ZN8MyStringpLEPKc+0x37)[0x81e9a67]
condor_schedd(_ZN11AutoCluster16getAutoClusteridEP7ClassAd+0x539)[0x8130b29]
condor_schedd(_Z12get_job_prioP7ClassAd+0x3b)[0x8113d5b]
condor_schedd(_Z12WalkJobQueuePFiP7ClassAdE+0x41)[0x8115c91]
condor_schedd(_Z17BuildPrioRecArrayb+0x1a6)[0x8115ea6]
condor_schedd(_ZN9Scheduler9negotiateEiP6Stream+0x64e)[0x810486e]
condor_schedd(_ZN9Scheduler11doNegotiateEiP6Stream+0x1a)[0x81065ca]
condor_schedd(_ZN10DaemonCore18CallCommandHandlerEiP6Streamb+0xa0)[0x815a8f0]
condor_schedd(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x136d)[0x816819d]
condor_schedd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0xa3f)[0x816b1ff]
condor_schedd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x22)[0x816b2d2]
condor_schedd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40)[0x82040c0]
condor_schedd(_ZN10DaemonCore17CallSocketHandlerERib+0x130)[0x81630b0]
condor_schedd(_ZN10DaemonCore6DriverEv+0x1f66)[0x81657a6]
condor_schedd(main+0xd80)[0x8177290]
/lib/libc.so.6(__libc_start_main+0xdc)[0x125e9c]
condor_schedd[0x80e4601]
I forgot the last line of "Steps to Reproduce:":

3. submit jobs:
for i in `seq 200`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i"; sleep 60; done

$ cat job.sub
Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 100000
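Note the arithmetic implied by the reproducer: the loop runs condor_submit 200 times, and each pass queues 100,000 procs ("Queue 100000" in job.sub), so the test attempts to queue 20 million jobs in total:

```shell
# 200 condor_submit invocations x 100,000 procs per submit file
total=$((200 * 100000))
echo "$total"    # 20000000
```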
I can also reproduce this without NUM_CPUS=1024; it depends only on the number of submitted jobs.
Looks like you tried to submit 20 million jobs, not just 100 thousand. Please test with the newest packages and monitor memory usage.

FYI, stack trace piped through c++filt:

Stack dump for process 2403 at timestamp 1276339165 (29 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0x289420]
/usr/lib/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(operator new(unsigned int)+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(operator new[](unsigned int)+0x1d)[0x6ebbed]
condor_schedd(MyString::reserve(int)+0x1c)[0x81e96bc]
condor_schedd(MyString::reserve_at_least(int)+0x1e)[0x81e974e]
It appears the test was run on a 32-bit machine. It is expected that you can exhaust a 32-bit process address space when submitting 20M jobs. Please run pmap -d $(pidof condor_schedd) in a loop while the Schedd starts up with the job_queue.log left after the crash. Somewhere around 3GB of mapped memory the Schedd will likely crash.
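A minimal sketch of the requested monitoring loop (hedged: watch_mapped and the one-second default interval are my own naming and choice, not part of Condor; it assumes pmap from procps and works against any PID, condor_schedd is just the intended target):

```shell
#!/bin/sh
# Print the "mapped:" summary line of pmap -d for a process, once per
# interval, until the process exits. Useful for watching the Schedd's
# address-space growth while it replays job_queue.log.
watch_mapped() {
    pid=$1
    interval=${2:-1}
    while kill -0 "$pid" 2>/dev/null; do
        # the pmap -d summary line looks like:
        #   "mapped: 123456K    writeable/private: ...K    shared: ...K"
        pmap -d "$pid" | awk -v ts="$(date +%T)" '/^mapped:/ {print ts, $1, $2}'
        sleep "$interval"
    done
}
```

Usage while the Schedd starts up with the leftover job_queue.log:
watch_mapped "$(pidof condor_schedd)"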
Current prevention is via MAX_JOBS_SUBMITTED; new prevention should be filed as an RFE.
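For reference, MAX_JOBS_SUBMITTED is a Schedd configuration macro that caps how many jobs may sit in the queue at once; a hedged config fragment (the cap of 100000 is an arbitrary illustration, pick a value that fits the machine's address space):

```
# condor_config.local: reject submissions once this many jobs are queued,
# so a runaway submit loop cannot exhaust the Schedd's address space
MAX_JOBS_SUBMITTED = 100000
```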