Created attachment 424764 [details]
log files and condor_config.local

Version-Release number of selected component (if applicable):
condor-7.4.3-0.19.el5

How reproducible:
Starters for all slots have crashed at least once.

Steps to Reproduce:
1. Set ulimit -n to 8192.
2. Set NUM_CPUS=1024.
3. Restart Condor.
4. Submit jobs.

Actual results:
The condor_starter processes crash.

Expected results:
Only error messages are logged; no crash.

Additional info:
06/16 00:01:47 ERROR: SECMAN:2003:TCP auth connection to <:43475> failed.
06/16 00:01:47 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <:43475>" at line 9417 in file daemon_core.cpp
06/16 00:01:47 ShutdownFast all jobs.

Stack dump for process 27794 at timestamp 1276660907 (11 frames)
condor_starter(dprintf_dump_stack+0x44)[0x81033f4]
condor_starter[0x8105154]
[0x9c4420]
/lib/libc.so.6(abort+0x101)[0x2af701]
condor_starter(_EXCEPT_+0x93)[0x81032c3]
condor_starter(_ZN10DaemonCore17SendAliveToParentEv+0x45d)[0x80f5c3d]
condor_starter(_ZN12TimerManager7TimeoutEv+0x14b)[0x810295b]
condor_starter(_ZN10DaemonCore6DriverEv+0x244)[0x80ea454]
condor_starter(main+0xd80)[0x80fded0]
/lib/libc.so.6(__libc_start_main+0xdc)[0x29ae9c]
condor_starter[0x80b53a1]
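For reference, a minimal condor_config.local fragment corresponding to step 2 (a sketch only; the actual configuration used is in the attachment):

```
# Sketch of the reproduction setup; the real file is in attachment 424764.
# Note: ulimit -n 8192 must be set in the environment that starts
# condor_master (step 1); it is not a condor_config setting.
NUM_CPUS = 1024
```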
Please take a look at the condor_startd's CPU usage; I suspect it is pegging a core. You are using NUM_CPUS=1024, and presumably not partitionable slots. The startd is not optimized to run on machines with 1024 cores, so this error is most likely coming from a lagging startd. You could try to eliminate the error by tuning MAX_ACCEPTS_PER_CYCLE, which I'd be interested in hearing about, but more likely you should reduce NUM_CPUS. If you need many slots for scale testing, run multiple startds whose NUM_CPUS values sum to 1024; say, start with 5.
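A sketch of the multiple-startd suggestion, assuming HTCondor's -local-name mechanism for running several instances of one daemon under a single master (the daemon names, slot split, and local-name config prefixes below are assumptions, not a tested configuration):

```
# Run 5 startds of ~205 slots each instead of one 1024-slot startd.
# Names and local-name prefix syntax here are assumptions.
DAEMON_LIST = MASTER, STARTD1, STARTD2, STARTD3, STARTD4, STARTD5

STARTD1      = $(STARTD)
STARTD1_ARGS = -f -local-name STARTD1
STARTD1.STARTD_NAME = startd1
STARTD1.NUM_CPUS    = 205
# ...repeat the four lines above for STARTD2 through STARTD5...
```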
I know I should configure fewer CPUs there, but I think the starter should not crash just because the startd cannot serve it.
I've submitted those jobs with:

for i in `seq 10`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i"; sleep 60; done

cat job.sub:

Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 1000000
The stack dump is coming from ABORT_ON_EXCEPTION=True in the configuration. The exit after failing to send the initial keep alive is expected.
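So, if the abort() and stack dump are not wanted, the knob can be left at (or returned to) its default; the EXCEPT log line and the exit itself will still occur:

```
# With this False (the default), the failed keep-alive produces only the
# ERROR/EXCEPT log lines and a normal daemon exit, not an abort() with a
# stack dump.
ABORT_ON_EXCEPTION = False
```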