Created attachment 424764 [details]
log files and condor_config.local

Version-Release number of selected component (if applicable):
condor-7.4.3-0.19.el5

How reproducible:
Starters for all slots have crashed at least once.

Steps to Reproduce:
1. Set ulimit -n to 8192.
2. Set NUM_CPUS=1024.
3. Restart Condor.
4. Submit jobs.

Actual results:
The condor_starter processes crash.

Expected results:
Only error messages are logged; no crash.

Additional info:
06/16 00:01:47 ERROR: SECMAN:2003:TCP auth connection to <:43475> failed.
06/16 00:01:47 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <:43475>" at line 9417 in file daemon_core.cpp
06/16 00:01:47 ShutdownFast all jobs.

Stack dump for process 27794 at timestamp 1276660907 (11 frames)
condor_starter(dprintf_dump_stack+0x44)[0x81033f4]
condor_starter[0x8105154]
[0x9c4420]
/lib/libc.so.6(abort+0x101)[0x2af701]
condor_starter(_EXCEPT_+0x93)[0x81032c3]
condor_starter(_ZN10DaemonCore17SendAliveToParentEv+0x45d)[0x80f5c3d]
condor_starter(_ZN12TimerManager7TimeoutEv+0x14b)[0x810295b]
condor_starter(_ZN10DaemonCore6DriverEv+0x244)[0x80ea454]
condor_starter(main+0xd80)[0x80fded0]
/lib/libc.so.6(__libc_start_main+0xdc)[0x29ae9c]
condor_starter[0x80b53a1]
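For reference, a minimal condor_config.local fragment corresponding to step 2 (a sketch only; the actual configuration used is in the attachment):

```
# Sketch of the reproduction setup; the real file is in attachment 424764.
# Note: ulimit -n 8192 must be set in the environment that starts
# condor_master (step 1); it is not a condor_config setting.
NUM_CPUS = 1024
```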
Please take a look at the condor_startd's CPU usage; I suspect it is pegging a core. You are using NUM_CPUS=1024, and presumably not partitionable slots. The startd is not optimized to run on machines with 1024 cores, so this error is most likely coming from a lagging startd. You could try to eliminate the error by tuning MAX_ACCEPTS_PER_CYCLE, which I'd be interested in hearing about, but more likely you should reduce NUM_CPUS. If you need many slots for scale testing, run multiple startds whose NUM_CPUS values sum to 1024; say, start with 5.
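A sketch of the multiple-startd suggestion, assuming HTCondor's -local-name mechanism for running several instances of one daemon under a single master (the daemon names, slot split, and local-name config prefixes below are assumptions, not a tested configuration):

```
# Run 5 startds of ~205 slots each instead of one 1024-slot startd.
# Names and local-name prefix syntax here are assumptions.
DAEMON_LIST = MASTER, STARTD1, STARTD2, STARTD3, STARTD4, STARTD5

STARTD1      = $(STARTD)
STARTD1_ARGS = -f -local-name STARTD1
STARTD1.STARTD_NAME = startd1
STARTD1.NUM_CPUS    = 205
# ...repeat the four lines above for STARTD2 through STARTD5...
```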
I know I should configure fewer CPUs there, but I think the starter should not crash just because the startd cannot serve it.
I've submitted those jobs with:

for i in `seq 10`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i"; sleep 60; done

cat job.sub:

Universe=vanilla
Executable=/bin/sleep
Arguments=10
Queue 1000000
The stack dump is coming from ABORT_ON_EXCEPTION=True in the configuration. The exit after failing to send the initial keep alive is expected.
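So, if the abort() and stack dump are not wanted, the knob can be left at (or returned to) its default; the EXCEPT log line and the exit itself will still occur:

```
# With this False (the default), the failed keep-alive produces only the
# ERROR/EXCEPT log lines and a normal daemon exit, not an abort() with a
# stack dump.
ABORT_ON_EXCEPTION = False
```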