
Bug 605175

Summary: Starter has crashed after FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT
Product: Red Hat Enterprise MRG
Component: condor
Version: Development
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Martin Kudlej <mkudlej>
Assignee: Matthew Farrellee <matt>
QA Contact: Martin Kudlej <mkudlej>
CC: matt
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-06-17 14:08:40 UTC
Attachments:
log files and condor_config.local (flags: none)

Description Martin Kudlej 2010-06-17 11:09:15 UTC
Created attachment 424764
log files and condor_config.local

Version-Release number of selected component (if applicable):
condor-7.4.3-0.19.el5

How reproducible:
Starters for all slots have crashed at least once.

Steps to Reproduce:
1. Set ulimit -n to 8192.
2. Set NUM_CPUS=1024 in the Condor configuration.
3. Restart condor.
4. Submit jobs.
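
The setup steps above can be sketched as a small shell helper. This is only an illustration under assumptions: the helper name set_num_cpus, the config file path, and the restart command are hypothetical and not taken from the report, so adjust them to the actual installation.

```shell
#!/bin/sh
# Sketch of the reproduction setup. The helper name, the config path and
# the restart command are assumptions, not taken from the report.

# Append a NUM_CPUS override to a local Condor config file.
set_num_cpus() {
    config_file=$1
    num_cpus=$2
    printf 'NUM_CPUS = %s\n' "$num_cpus" >> "$config_file"
}

# Example session (as root on the test machine; paths are hypothetical):
#   ulimit -n 8192      # raise the open-file limit for this shell and its children
#   set_num_cpus /etc/condor/condor_config.local 1024
#   service condor restart
#   condor_submit job.sub
```

The ulimit call only affects the current shell and the processes started from it, so the restart has to happen from that same shell for the limit to apply.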
  
Actual results:
The Starter crashes.

Expected results:
Only error messages are logged; the Starter does not crash.

Additional info:
06/16 00:01:47 ERROR: SECMAN:2003:TCP auth connection to <:43475> failed.
06/16 00:01:47 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <:43475>" at line 9417 in file daemon_core.cpp
06/16 00:01:47 ShutdownFast all jobs.
Stack dump for process 27794 at timestamp 1276660907 (11 frames)
condor_starter(dprintf_dump_stack+0x44)[0x81033f4]
condor_starter[0x8105154]
[0x9c4420]
/lib/libc.so.6(abort+0x101)[0x2af701]
condor_starter(_EXCEPT_+0x93)[0x81032c3]
condor_starter(_ZN10DaemonCore17SendAliveToParentEv+0x45d)[0x80f5c3d]
condor_starter(_ZN12TimerManager7TimeoutEv+0x14b)[0x810295b]
condor_starter(_ZN10DaemonCore6DriverEv+0x244)[0x80ea454]
condor_starter(main+0xd80)[0x80fded0]
/lib/libc.so.6(__libc_start_main+0xdc)[0x29ae9c]
condor_starter[0x80b53a1]

Comment 1 Matthew Farrellee 2010-06-17 11:45:25 UTC
Please take a look at the condor_startd's CPU usage. I imagine it is pegging a core.

You are using NUM_CPUS=1024, and presumably not using partitionable slots.

The Startd is not optimized to run on machines with 1024 cores. This error is likely from a lagging Startd. You could attempt to eliminate the error by tuning MAX_ACCEPTS_PER_CYCLE, and I'd be interested in hearing the results, but more likely you should reduce NUM_CPUS. If you need many slots for scale testing, you should run multiple Startds whose NUM_CPUS values sum to 1024; say, start with 5.
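
As a rough illustration of the last suggestion, the arithmetic for dividing 1024 slots among several Startds can be sketched in shell. Reading "5" as the number of Startds is an assumption, and the helper name split_num_cpus is hypothetical; the sketch only computes per-Startd NUM_CPUS values and does not configure anything.

```shell
#!/bin/sh
# Split a total slot count as evenly as possible across several Startds.
# Prints one NUM_CPUS value per line; the first Startds absorb the remainder.
# Assumes the Startd count is a positive integer.
split_num_cpus() {
    total=$1
    startds=$2
    base=$((total / startds))
    extra=$((total % startds))
    i=1
    while [ "$i" -le "$startds" ]; do
        if [ "$i" -le "$extra" ]; then
            echo $((base + 1))
        else
            echo "$base"
        fi
        i=$((i + 1))
    done
}

split_num_cpus 1024 5
```

For 1024 slots and 5 Startds this prints 205 four times and then 204, which sums to 1024.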

Comment 2 Martin Kudlej 2010-06-17 12:29:49 UTC
I know I should configure fewer CPUs there. But I think the starter should not crash just because the startd cannot serve it.

Comment 3 Martin Kudlej 2010-06-17 13:04:42 UTC
I submitted those jobs with:
for i in $(seq 10); do condor_submit job.sub >/dev/null 2>&1; echo "$i"; sleep 60; done

cat job.sub:
Universe=vanilla
Executable=/bin/sleep
Arguments=10

Queue 1000000

Comment 4 Matthew Farrellee 2010-06-17 14:08:40 UTC
The stack dump comes from ABORT_ON_EXCEPTION=True in the configuration. The exit after failing to send the initial keep alive is expected.
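
One way to check whether a given config file sets this knob is sketched below. The helper name last_knob_value is hypothetical, and the sketch assumes simple "NAME = value" lines with case-insensitive names and the last assignment winning. It scans a single file only, so running condor_config_val ABORT_ON_EXCEPTION is likely the more reliable way to see the effective value.

```shell
#!/bin/sh
# Print the last value assigned to a knob in one Condor config file.
# Only scans the given file; it does not expand macros or read other files.
# Exits non-zero if the knob is not assigned in that file.
last_knob_value() {
    knob=$1
    config_file=$2
    awk -F= -v knob="$knob" '
        /^[ \t]*#/ { next }                 # skip comment lines
        index($0, "=") > 0 {
            name = $1
            gsub(/[ \t]/, "", name)
            if (toupper(name) == toupper(knob)) {
                value = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                found = 1
            }
        }
        END { if (found) print value; else exit 1 }
    ' "$config_file"
}

# Example (the path is hypothetical):
#   last_knob_value ABORT_ON_EXCEPTION /etc/condor/condor_config.local
```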