Created attachment 423770 [details]
log files and configuration file

Description of problem:
I tried to submit 100,000 jobs and the Scheduler crashed.

Version-Release number of selected component (if applicable):
condor-7.4.3-0.16.el5

How reproducible:
100%

Steps to Reproduce:
1. set up NUM_CPUS=1024
2. service condor restart

Actual results:
The Scheduler crashes.

Expected results:
The Scheduler doesn't crash.

Additional info:
6/12 06:38:26 (pid:2403) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts
Stack dump for process 2403 at timestamp 1276339165 (29 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0x289420]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(_Znwj+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(_Znaj+0x1d)[0x6ebbed]
condor_schedd(_ZN8MyString7reserveEi+0x1c)[0x81e96bc]
condor_schedd(_ZN8MyString16reserve_at_leastEi+0x1e)[0x81e974e]
condor_schedd(_ZN8MyString10append_strEPKci+0x4a)[0x81e9a2a]
condor_schedd(_ZN8MyStringpLEPKc+0x37)[0x81e9a67]
condor_schedd(_ZN11AutoCluster16getAutoClusteridEP7ClassAd+0x539)[0x8130b29]
condor_schedd(_Z12get_job_prioP7ClassAd+0x3b)[0x8113d5b]
condor_schedd(_Z12WalkJobQueuePFiP7ClassAdE+0x41)[0x8115c91]
condor_schedd(_Z17BuildPrioRecArrayb+0x1a6)[0x8115ea6]
condor_schedd(_ZN9Scheduler9negotiateEiP6Stream+0x64e)[0x810486e]
condor_schedd(_ZN9Scheduler11doNegotiateEiP6Stream+0x1a)[0x81065ca]
condor_schedd(_ZN10DaemonCore18CallCommandHandlerEiP6Streamb+0xa0)[0x815a8f0]
condor_schedd(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x136d)[0x816819d]
condor_schedd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0xa3f)[0x816b1ff]
condor_schedd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x22)[0x816b2d2]
condor_schedd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40)[0x82040c0]
condor_schedd(_ZN10DaemonCore17CallSocketHandlerERib+0x130)[0x81630b0]
condor_schedd(_ZN10DaemonCore6DriverEv+0x1f66)[0x81657a6]
condor_schedd(main+0xd80)[0x8177290]
/lib/libc.so.6(__libc_start_main+0xdc)[0x125e9c]
condor_schedd[0x80e4601]
I forgot the last line of "Steps to Reproduce:":

3. submit jobs:
for i in `seq 200`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i"; sleep 60; done

$ cat job.sub
Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 100000
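Note the arithmetic implied by the reproducer: the loop runs condor_submit 200 times, and each pass queues 100,000 procs ("Queue 100000" in job.sub), so the test attempts to queue 20 million jobs in total:

```shell
# 200 condor_submit invocations x 100,000 procs per submit file
total=$((200 * 100000))
echo "$total"    # 20000000
```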
I can also reproduce this without NUM_CPUS=1024; it depends only on the number of submitted jobs.
Looks like you tried to submit 20 million jobs, not just 100 thousand. Please test with the newest packages and monitor memory usage.

FYI, stack trace piped through c++filt:

Stack dump for process 2403 at timestamp 1276339165 (29 frames)
condor_schedd(dprintf_dump_stack+0x44)[0x817d344]
condor_schedd[0x817f0a4]
[0x289420]
/usr/lib/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x150)[0x6edb10]
/usr/lib/libstdc++.so.6[0x6eb515]
/usr/lib/libstdc++.so.6[0x6eb552]
/usr/lib/libstdc++.so.6[0x6eb68a]
/usr/lib/libstdc++.so.6(operator new(unsigned int)+0x7e)[0x6ebb0e]
/usr/lib/libstdc++.so.6(operator new[](unsigned int)+0x1d)[0x6ebbed]
condor_schedd(MyString::reserve(int)+0x1c)[0x81e96bc]
condor_schedd(MyString::reserve_at_least(int)+0x1e)[0x81e974e]
It appears the test was run on a 32-bit machine. It is expected that you can exhaust a 32-bit process address space when submitting 20M jobs. Please run pmap -d $(pidof condor_schedd) in a loop while the Schedd starts up with the job_queue.log left after the crash. Somewhere around 3GB of mapped memory the Schedd will likely crash.
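A minimal sketch of the requested monitoring loop (hedged: watch_mapped and the one-second default interval are my own naming and choice, not part of Condor; it assumes pmap from procps and works against any PID, condor_schedd is just the intended target):

```shell
#!/bin/sh
# Print the "mapped:" summary line of pmap -d for a process, once per
# interval, until the process exits. Useful for watching the Schedd's
# address-space growth while it replays job_queue.log.
watch_mapped() {
    pid=$1
    interval=${2:-1}
    while kill -0 "$pid" 2>/dev/null; do
        # the pmap -d summary line looks like:
        #   "mapped: 123456K    writeable/private: ...K    shared: ...K"
        pmap -d "$pid" | awk -v ts="$(date +%T)" '/^mapped:/ {print ts, $1, $2}'
        sleep "$interval"
    done
}
```

Usage while the Schedd starts up with the leftover job_queue.log:
watch_mapped "$(pidof condor_schedd)"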
Current prevention is via MAX_JOBS_SUBMITTED; new prevention should be filed as an RFE.
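For reference, MAX_JOBS_SUBMITTED is a Schedd configuration macro that caps how many jobs may sit in the queue at once; a hedged config fragment (the cap of 100000 is an arbitrary illustration, pick a value that fits the machine's address space):

```
# condor_config.local: reject submissions once this many jobs are queued,
# so a runaway submit loop cannot exhaust the Schedd's address space
MAX_JOBS_SUBMITTED = 100000
```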