Bug 605175
| Summary: | Starter has crashed after FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Martin Kudlej <mkudlej> |
| Component: | condor | Assignee: | Matthew Farrellee <matt> |
| Status: | CLOSED NOTABUG | QA Contact: | Martin Kudlej <mkudlej> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | Development | CC: | matt |
| Target Milestone: | 1.3 | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-06-17 14:08:40 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Please take a look at the condor_startd's CPU usage. I imagine it is pegging a core. You are using NUM_CPUS=1024, and presumably not using partitionable slots. The Startd is not optimized to run on machines with 1024 cores. This error is likely from a lagging Startd. You could attempt to eliminate the error by tuning MAX_ACCEPTS_PER_CYCLE, which I'd be interested in hearing about, but more likely you should reduce NUM_CPUS. If you need many slots for scale testing, you should run multiple Startds whose NUM_CPUS values sum to 1024, starting with, say, 5.

I know I should configure fewer CPUs there. But I think the starter should not crash just because the startd cannot serve it. I submitted those jobs with:

    for i in `seq 10`; do condor_submit job.sub >/dev/null 2>/dev/null; echo "$i";sleep 60;done

cat job.sub:

    Universe=vanilla
    Executable=/bin/sleep
    Arguments=10
    Queue 1000000

The stack is coming from ABORT_ON_EXCEPTION=True in the configuration. The exit after failing to send the initial keep alive is expected.
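The suggestion above to run multiple Startds whose NUM_CPUS values sum to 1024 comes down to simple arithmetic. A minimal Python sketch, assuming "5" means five Startd instances and that an even split is wanted; the helper name `split_slots` is hypothetical, and how each share is written into a given Startd's configuration is not shown here:

```python
def split_slots(total_cpus, num_startds):
    """Split total_cpus as evenly as possible across num_startds instances.

    Hypothetical helper for illustration only: the returned values are the
    NUM_CPUS each Startd instance would be given.
    """
    if num_startds <= 0:
        raise ValueError("num_startds must be positive")
    base, extra = divmod(total_cpus, num_startds)
    # The first `extra` instances take one additional slot so the sum is exact.
    return [base + 1 if i < extra else base for i in range(num_startds)]


if __name__ == "__main__":
    shares = split_slots(1024, 5)
    for index, cpus in enumerate(shares, start=1):
        print(f"startd {index}: NUM_CPUS={cpus}")
    print("total:", sum(shares))
```

Each instance then advertises roughly 205 slots instead of one Startd managing all 1024, which is the load reduction the comment is aiming for.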
Created attachment 424764
log files and condor_config.local

Version-Release number of selected component (if applicable):
condor-7.4.3-0.19.el5

How reproducible:
starters for all slots have crashed at least once

Steps to Reproduce:
1. set ulimit -n to 8192
2. set NUM_CPUS=1024
3. restart condor
4. submit jobs

Actual results:
There are crashes of Starter

Expected results:
There are just error messages and no crash.

Additional info:
06/16 00:01:47 ERROR: SECMAN:2003:TCP auth connection to <:43475> failed.
06/16 00:01:47 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <:43475>" at line 9417 in file daemon_core.cpp
06/16 00:01:47 ShutdownFast all jobs.
Stack dump for process 27794 at timestamp 1276660907 (11 frames)
condor_starter(dprintf_dump_stack+0x44)[0x81033f4]
condor_starter[0x8105154]
[0x9c4420]
/lib/libc.so.6(abort+0x101)[0x2af701]
condor_starter(_EXCEPT_+0x93)[0x81032c3]
condor_starter(_ZN10DaemonCore17SendAliveToParentEv+0x45d)[0x80f5c3d]
condor_starter(_ZN12TimerManager7TimeoutEv+0x14b)[0x810295b]
condor_starter(_ZN10DaemonCore6DriverEv+0x244)[0x80ea454]
condor_starter(main+0xd80)[0x80fded0]
/lib/libc.so.6(__libc_start_main+0xdc)[0x29ae9c]
condor_starter[0x80b53a1]
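The C++ frames in the stack dump above are mangled, which makes the call path harder to read. As a rough sketch, the following Python decodes only the simplest case seen in this dump: a nested name with no parameters (`_ZN<len><name>...Ev`). The function names `demangle_nested` and `symbol_from_frame` are hypothetical, and this is not a general demangler; a tool such as `c++filt` handles the full scheme.

```python
import re


def demangle_nested(symbol):
    """Decode a plain nested C++ name such as _ZN3Foo3barEv into Foo::bar().

    Illustrative only: anything other than length-prefixed name parts
    followed by "Ev" (end of name, void parameters) is returned unchanged.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    pos = 3
    parts = []
    while pos < len(symbol) and symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        pos += len(digits)
        length = int(digits)
        parts.append(symbol[pos:pos + length])
        pos += length
    if not parts or symbol[pos:] != "Ev":
        return symbol
    return "::".join(parts) + "()"


def symbol_from_frame(frame):
    """Extract the symbol from a frame like binary(symbol+0x44)[0xaddr]."""
    match = re.search(r"\((\w+)\+0x[0-9a-fA-F]+\)", frame)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Frames copied from the stack dump in this report.
    frames = [
        "condor_starter(_EXCEPT_+0x93)[0x81032c3]",
        "condor_starter(_ZN10DaemonCore17SendAliveToParentEv+0x45d)[0x80f5c3d]",
        "condor_starter(_ZN12TimerManager7TimeoutEv+0x14b)[0x810295b]",
        "condor_starter(_ZN10DaemonCore6DriverEv+0x244)[0x80ea454]",
    ]
    for frame in frames:
        print(demangle_nested(symbol_from_frame(frame)))
```

Read bottom to top, the decoded frames show the timer firing inside the DaemonCore driver loop, calling `DaemonCore::SendAliveToParent()`, which then raises the exception that leads to `abort`.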