Bug 605175 - Starter has crashed after FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All Linux
Priority: high, Severity: high
Target Milestone: 1.3
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: Martin Kudlej
Reported: 2010-06-17 07:09 EDT by Martin Kudlej
Modified: 2010-06-17 10:08 EDT (History)
Doc Type: Bug Fix
Last Closed: 2010-06-17 10:08:40 EDT

Attachments
log files and condor_config.local (7.60 MB, application/x-gzip)
2010-06-17 07:09 EDT, Martin Kudlej

Description Martin Kudlej 2010-06-17 07:09:15 EDT
Created attachment 424764
log files and condor_config.local

Version-Release number of selected component (if applicable):
condor-7.4.3-0.19.el5

How reproducible:
The starters for all slots have crashed at least once.

Steps to Reproduce:
1. set ulimit -n to 8192
2. set NUM_CPUS=1024
3. restart condor
4. submit jobs
  
Actual results:
The Starter crashes.

Expected results:
Only error messages are logged; the Starter does not crash.

Additional info:
06/16 00:01:47 ERROR: SECMAN:2003:TCP auth connection to <:43475> failed.
06/16 00:01:47 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <:43475>" at line 9417 in file daemon_core.cpp
06/16 00:01:47 ShutdownFast all jobs.
Stack dump for process 27794 at timestamp 1276660907 (11 frames)
condor_starter(dprintf_dump_stack+0x44)[0x81033f4]
condor_starter[0x8105154]
[0x9c4420]
/lib/libc.so.6(abort+0x101)[0x2af701]
condor_starter(_EXCEPT_+0x93)[0x81032c3]
condor_starter(_ZN10DaemonCore17SendAliveToParentEv+0x45d)[0x80f5c3d]
condor_starter(_ZN12TimerManager7TimeoutEv+0x14b)[0x810295b]
condor_starter(_ZN10DaemonCore6DriverEv+0x244)[0x80ea454]
condor_starter(main+0xd80)[0x80fded0]
/lib/libc.so.6(__libc_start_main+0xdc)[0x29ae9c]
condor_starter[0x80b53a1]
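
To check how many starters hit this error, the failure message can be counted per starter log. This is only a rough sketch: the helper name is hypothetical (not a Condor tool), and the per-slot log location in the commented usage line is an assumption that depends on the local LOG setting.

```shell
# Hypothetical helper, not part of Condor: for each log file given,
# print "<count> <file>", where <count> is the number of lines that
# contain the keep-alive failure message from this report.
count_keepalive_failures() {
    for f in "$@"; do
        # grep -c prints 0 but exits non-zero when nothing matches,
        # so the "|| true" keeps the loop going under "set -e".
        n=$(grep -c 'FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT' "$f" || true)
        echo "$n $f"
    done
}

# Usage (the log path is an assumption, adjust to your LOG directory):
#   count_keepalive_failures /var/log/condor/StarterLog.slot*
```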
Comment 1 Matthew Farrellee 2010-06-17 07:45:25 EDT
Please take a look at the condor_startd's CPU usage. I imagine it is pegging a core.

You are using NUM_CPUS=1024, and presumably not using partitionable slots.

The Startd is not optimized to run on machines with 1024 cores. This error is likely caused by a lagging Startd. You could try to eliminate the error by tuning MAX_ACCEPTS_PER_CYCLE, and I'd be interested in hearing the results, but more likely you should reduce NUM_CPUS. If you need many slots for scale testing, you should run multiple Startds whose NUM_CPUS values sum to 1024; start with, say, 5.
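
As a rough sketch of the arithmetic behind that suggestion, the following splits a total slot count as evenly as possible across several Startds. The helper name is hypothetical and not a Condor tool; it only computes candidate NUM_CPUS values and does not configure or start any daemon.

```shell
# Hypothetical helper, not part of Condor: split TOTAL slots across N
# startds so the per-startd NUM_CPUS values sum exactly to TOTAL.
split_cpus() {
    total=$1
    n=$2
    base=$((total / n))   # slots every startd gets
    rem=$((total % n))    # leftover slots, handed out one each
    i=1
    out=""
    while [ "$i" -le "$n" ]; do
        if [ "$i" -le "$rem" ]; then
            cpus=$((base + 1))
        else
            cpus=$base
        fi
        # Append with a separating space except before the first value.
        out="$out${out:+ }$cpus"
        i=$((i + 1))
    done
    echo "$out"
}

split_cpus 1024 5   # prints: 205 205 205 205 204
```

Each printed value would be the NUM_CPUS for one Startd instance.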
Comment 2 Martin Kudlej 2010-06-17 08:29:49 EDT
I know I should configure fewer CPUs there. But I think the starter should not crash just because the startd cannot serve it.
Comment 3 Martin Kudlej 2010-06-17 09:04:42 EDT
I submitted those jobs with:
for i in $(seq 10); do condor_submit job.sub >/dev/null 2>&1; echo "$i"; sleep 60; done

cat job.sub:
Universe=vanilla
Executable=/bin/sleep
Arguments=10

Queue 1000000
Comment 4 Matthew Farrellee 2010-06-17 10:08:40 EDT
The stack dump comes from ABORT_ON_EXCEPTION=True in the configuration. The exit after failing to send the initial keep alive is expected.
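
To confirm whether a given configuration file turns this on, a simple text check along these lines may help. It is a minimal sketch with a hypothetical helper name: it looks only at the last ABORT_ON_EXCEPTION assignment in one file and only recognizes the literal value true (case-insensitive), so it does not reproduce Condor's full config evaluation. On a live system, `condor_config_val ABORT_ON_EXCEPTION` reports the effective value.

```shell
# Hypothetical helper, not part of Condor: succeed (exit 0) if the last
# ABORT_ON_EXCEPTION assignment in the given config file is "true".
# Assumption: simple "NAME = value" lines, no macros or includes.
abort_on_exception_enabled() {
    value=$(grep -i '^[[:space:]]*ABORT_ON_EXCEPTION[[:space:]]*=' "$1" \
        | tail -n 1 \
        | sed 's/^[^=]*=[[:space:]]*//; s/[[:space:]]*$//' \
        | tr '[:upper:]' '[:lower:]')
    [ "$value" = "true" ]
}

# Usage (the file name is taken from the attachment in this report):
#   abort_on_exception_enabled condor_config.local && echo "stack dumps on EXCEPT"
```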
