Created attachment 424764
log files and condor_config.local
Version-Release number of selected component (if applicable):
How reproducible:
Starters for all slots have crashed at least once.
Steps to Reproduce:
1. set ulimit -n to 8192
2. set NUM_CPUS=1024
3. restart condor
4. submit jobs
Actual results:
The Starter crashes.
Expected results:
Only error messages are logged, and the Starter does not crash.
06/16 00:01:47 ERROR: SECMAN:2003:TCP auth connection to <:43475> failed.
06/16 00:01:47 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <:43475>" at line 9417 in file daemon_core.cpp
06/16 00:01:47 ShutdownFast all jobs.
Stack dump for process 27794 at timestamp 1276660907 (11 frames)
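To check how many starters hit this failure, one option is to count the keep-alive error in the starter logs. This is only a sketch: the function name count_keepalive_failures is hypothetical, the message string is copied from the log excerpt above, and the log location in the usage note is an assumption about the installation.

```shell
# Hypothetical helper: count log lines that report a failed initial
# keep-alive. Reads log text on stdin and prints the number of matches.
count_keepalive_failures() {
    # grep -c exits non-zero when nothing matches, so mask the status;
    # the printed count is still "0" in that case.
    grep -c 'FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT' || true
}

# Demonstration on two lines taken from the excerpt above.
count_keepalive_failures <<'EOF'
06/16 00:01:47 ERROR: SECMAN:2003:TCP auth connection to <:43475> failed.
06/16 00:01:47 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <:43475>" at line 9417 in file daemon_core.cpp
EOF
# prints: 1
```

On a real machine the input would be the per-slot starter logs, for example `cat StarterLog.slot* | count_keepalive_failures` run from the condor log directory (the file naming is assumed here, not taken from this report).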
Please take a look at the condor_startd's CPU usage. I imagine it is pegging a core.
You are using NUM_CPUS=1024, and presumably not using partitionable slots.
The Startd is not optimized to run on machines with 1024 cores. This error is likely caused by a lagging Startd. You could try to eliminate the error by tuning MAX_ACCEPTS_PER_CYCLE, and I'd be interested in hearing the results, but more likely you should reduce NUM_CPUS. If you need many slots for scale testing, you should run multiple Startds whose NUM_CPUS values sum to 1024; start with, say, 5.
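As a rough illustration of the "multiple Startds" suggestion, the arithmetic for dividing the slots can be sketched in shell. The helper name split_cpus is hypothetical, and it only computes the per-Startd NUM_CPUS values; it does not configure or start any daemons.

```shell
# Hypothetical helper: split TOTAL slots across N Startd instances as evenly
# as possible and print one NUM_CPUS value per instance on a single line.
split_cpus() {
    total=$1
    n=$2
    base=$((total / n))     # slots every instance gets
    extra=$((total % n))    # leftover slots, handed out one per instance
    out=""
    i=0
    while [ "$i" -lt "$n" ]; do
        if [ "$i" -lt "$extra" ]; then
            cpus=$((base + 1))
        else
            cpus=$base
        fi
        out="${out:+$out }$cpus"
        i=$((i + 1))
    done
    echo "$out"
}

split_cpus 1024 5
# prints: 205 205 205 205 204
```

Each printed value would be the NUM_CPUS setting for one Startd, so the values sum to the original 1024.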
I know I should configure fewer CPUs there. But I think the starter should not crash just because the startd cannot serve it.
I've submitted those jobs by:
for i in `seq 10`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i";sleep 60;done
The stack dump comes from ABORT_ON_EXCEPTION=True in the configuration. Exiting after failing to send the initial keep alive is expected.