Description of problem:
I submitted 10000 simple "sleep 3" jobs, and after a few minutes the logs show that condor_startd and condor_shadow crash. There is no corefile in ~condor even though core dumps are set up properly (setup below).

Version-Release number of selected component (if applicable):
RHEL 5.3 and 5.4 with condor-7.4.1-0.6 and -0.7, on both i386 and x86_64.

How reproducible:
100%, with submission of 10000 simple jobs

Steps to Reproduce:
1. turn off the qmf plugins
2. turn on advanced logging
3. submit 10000 simple jobs

Actual results:
condor_shadow and condor_startd crash

Expected results:
condor_shadow and condor_startd don't crash; on a crash, core dumps are produced with the setup below.

Core dump setup:
# common part rhel4/5/...
ulimit -c unlimited
export DAEMON_COREFILE_LIMIT='unlimited'
grep DAEMON_COREFILE_LIMIT /etc/sysconfig/init || \
  echo "DAEMON_COREFILE_LIMIT='unlimited'" >> /etc/sysconfig/init
# condor specific
grep ^CREATE_CORE_FILES ${condor_config} || \
  echo "CREATE_CORE_FILES = True" >> ${condor_config}
# condor optional (useful in some cases)
grep ^ABORT_ON_EXCEPTION ${condor_config} || \
  echo "ABORT_ON_EXCEPTION = True" >> ${condor_config}
# rhel5/... only
echo 2 > /proc/sys/fs/suid_dumpable
# rhel4 only
echo 2 > /proc/sys/kernel/suid_dumpable

Additional info:

Submit file:
executable = /bin/sleep
arguments = 3
universe = vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue 10000

MasterLog:
11/23 14:28:44 ProcAPI::getProcInfo() pid 14406 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14406 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14406 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14406 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:29:30 DaemonCore: No more children processes to reap.
11/23 14:29:30 The STARTD (pid 24153) died due to signal 11 (Segmentation fault)
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18805 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18805 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18805 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18805 does not exist.

StartdLog:
11/23 14:29:30 slot1: match_info called
11/23 14:29:30 slot1: Received match <:51411>#1259003801#17#...
11/23 14:29:30 slot1: Started match timer (4569) for 120 seconds.
11/23 14:29:30 slot1: State change: match notification protocol successful
11/23 14:29:30 slot1: Changing state: Unclaimed -> Matched
11/23 14:29:30 slot1: Canceled match timer (4569)
11/23 14:29:30 slot1: Schedd addr = <:43049>
11/23 14:29:30 slot1: Alive interval = 300
11/23 14:29:30 slot1: Received ClaimId from schedd (<:51411>#1259003801#17#...)
11/23 14:29:30 slot1: Rank of this claim is: 0.000000
11/23 14:29:30 slot1: Request accepted.
11/23 14:29:30 slot1: Remote owner is condor@..
11/23 14:29:30 slot1: State change: claiming protocol successful
11/23 14:29:30 slot1: Changing state: Matched -> Claimed
11/23 14:29:30 slot1: Started ClaimLease timer (4572) w/ 1800 second lease duration
11/23 14:29:30 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at line 1261 in file Resource.cpp
11/23 14:29:30 CronMgr: 0 jobs alive
11/23 14:29:30 slot1: Canceled ClaimLease timer (4572)
11/23 14:29:30 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing
11/23 14:29:30 Entered vacate_client <:43049> ...
11/23 14:29:30 slot1: State change: No preempting claim, returning to owner
11/23 14:29:30 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
11/23 14:29:30 slot1: State change: IS_OWNER is false
11/23 14:29:30 slot1: Changing state: Owner -> Unclaimed
11/23 14:29:30 slot2: Canceled ClaimLease timer (33)
11/23 14:29:30 slot2: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot2: In Starter::kill() with pid 18939, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18939,3) [SIGQUIT]
11/23 14:29:30 slot3: Canceled ClaimLease timer (35)
11/23 14:29:30 slot3: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot3: In Starter::kill() with pid 18780, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18780,3) [SIGQUIT]
11/23 14:29:30 slot4: Canceled ClaimLease timer (37)
11/23 14:29:30 slot4: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot4: In Starter::kill() with pid 18889, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18889,3) [SIGQUIT]
11/23 14:29:30 slot5: Canceled ClaimLease timer (39)
11/23 14:29:30 slot5: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot5: In Starter::kill() with pid 18828, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18828,3) [SIGQUIT]
11/23 14:29:30 slot6: Canceled ClaimLease timer (41)
11/23 14:29:30 slot6: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot6: In Starter::kill() with pid 18805, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18805,3) [SIGQUIT]
11/23 14:29:30 slot7: Canceled ClaimLease timer (43)
11/23 14:29:30 slot7: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot7: In Starter::kill() with pid 18863, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18863,3) [SIGQUIT]
11/23 14:29:30 slot8: Canceled ClaimLease timer (31)
11/23 14:29:30 slot8: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot8: In Starter::kill() with pid 18915, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18915,3) [SIGQUIT]
11/23 14:29:30 startd exiting because of fatal exception.
Stack dump for process 24153 at timestamp 1259004570 (24 frames)
condor_startd(dprintf_dump_stack+0xb7)[0x5020d1]
condor_startd[0x50233e]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(abort+0x28f)[0x34c3031e8f]
condor_startd(_EXCEPT_+0x1a5)[0x500663]
condor_startd(_ZN8Resource13wants_suspendEv+0x113)[0x4a5065]
condor_startd(_ZN8ResState4evalEv+0xe5)[0x4a40d9]
condor_startd(_ZN8ResState6changeE5State8Activity+0x2af)[0x4a335b]
condor_startd(_ZN8ResState6changeE5State+0x7e)[0x4a33e4]
condor_startd(_ZN8Resource12change_stateE5State+0x1f)[0x49819d]
condor_startd(_Z20accept_request_claimP8Resource+0x66c)[0x4b1eae]
condor_startd(_Z13request_claimP8ResourceP5ClaimPcP6Stream+0x144a)[0x4b343c]
condor_startd(_Z21command_request_claimP7ServiceiP6Stream+0x2a6)[0x4b494e]
condor_startd(_ZN10DaemonCore18CallCommandHandlerEiP6Streamb+0x29f)[0x4d7675]
condor_startd(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x3393)[0x4eb671]
condor_startd(_ZN10DaemonCore22HandleReqSocketHandlerEP6Stream+0x99)[0x4ec287]
condor_startd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x279)[0x4ebffb]
condor_startd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x39)[0x4ec1e3]
condor_startd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x3f)[0x579acd]
condor_startd(_ZN10DaemonCore17CallSocketHandlerERib+0x1cd)[0x4de445]
condor_startd(_ZN10DaemonCore6DriverEv+0x1876)[0x4dfd3c]
condor_startd(main+0x1917)[0x4f7785]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_startd[0x490169]

11/23 14:29:40 Found have_phsy 1, have_core 1
11/23 14:29:40 Analyzing 8 processors with IDs
11/23 14:29:40 Comparing P0 and P1: 0!=1 or 0!=0 (! match)
11/23 14:29:40 Comparing P0 and P2: 0!=0 or 0!=1 (! match)
11/23 14:29:40 Comparing P0 and P3: 0!=1 or 0!=1 (! match)
11/23 14:29:40 Comparing P0 and P4: 0!=0 or 0!=2 (! match)
11/23 14:29:40 Comparing P0 and P5: 0!=1 or 0!=2 (! match)
11/23 14:29:40 Comparing P0 and P6: 0!=0 or 0!=3 (! match)
11/23 14:29:40 Comparing P0 and P7: 0!=1 or 0!=3 (! match)
11/23 14:29:40 P0: match->1
11/23 14:29:40 Comparing P1 and P2: 1!=0 or 0!=1 (! match)
11/23 14:29:40 Comparing P1 and P3: 1!=1 or 0!=1 (! match)
11/23 14:29:40 Comparing P1 and P4: 1!=0 or 0!=2 (! match)
11/23 14:29:40 Comparing P1 and P5: 1!=1 or 0!=2 (! match)
11/23 14:29:40 Comparing P1 and P6: 1!=0 or 0!=3 (! match)
11/23 14:29:40 Comparing P1 and P7: 1!=1 or 0!=3 (! match)
11/23 14:29:40 P1: match->1
11/23 14:29:40 Comparing P2 and P3: 0!=1 or 1!=1 (! match)
11/23 14:29:40 Comparing P2 and P4: 0!=0 or 1!=2 (! match)
11/23 14:29:40 Comparing P2 and P5: 0!=1 or 1!=2 (! match)
11/23 14:29:40 Comparing P2 and P6: 0!=0 or 1!=3 (! match)
11/23 14:29:40 Comparing P2 and P7: 0!=1 or 1!=3 (! match)
11/23 14:29:40 P2: match->1
11/23 14:29:40 Comparing P3 and P4: 1!=0 or 1!=2 (! match)
11/23 14:29:40 Comparing P3 and P5: 1!=1 or 1!=2 (! match)
11/23 14:29:40 Comparing P3 and P6: 1!=0 or 1!=3 (! match)
11/23 14:29:40 Comparing P3 and P7: 1!=1 or 1!=3 (! match)
11/23 14:29:40 P3: match->1
11/23 14:29:40 Comparing P4 and P5: 0!=1 or 2!=2 (! match)
11/23 14:29:40 Comparing P4 and P6: 0!=0 or 2!=3 (! match)
11/23 14:29:40 Comparing P4 and P7: 0!=1 or 2!=3 (! match)
11/23 14:29:40 P4: match->1
11/23 14:29:40 Comparing P5 and P6: 1!=0 or 2!=3 (! match)
11/23 14:29:40 Comparing P5 and P7: 1!=1 or 2!=3 (! match)
11/23 14:29:40 P5: match->1
11/23 14:29:40 Comparing P6 and P7: 0!=1 or 3!=3 (! match)
11/23 14:29:40 P6: match->1
11/23 14:29:40 P7: match->1
11/23 14:29:40 Using IDs: 8 processors, 8 CPUs, 0 HTs

ShadowLog:
11/23 14:29:29 (1.944) (18849): Set NumJobStarts to 1
11/23 14:29:29 (1.945) (18875): Resource slot4@... changing state from STARTUP to EXECUTING
11/23 14:29:29 (1.945) (18875): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:29 (1.945) (18875): Started timer to evaluate periodic user policy expressions every 60 seconds
11/23 14:29:29 (1.945) (18875): QmgrJobUpdater: started timer to update queue every 900 seconds (tid=11)
11/23 14:29:29 (1.945) (18875): Set NumJobStarts to 1
11/23 14:29:29 (1.946) (18902): Resource slot8@... changing state from STARTUP to EXECUTING
11/23 14:29:29 (1.946) (18902): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:29 (1.946) (18902): Started timer to evaluate periodic user policy expressions every 60 seconds
11/23 14:29:29 (1.946) (18902): QmgrJobUpdater: started timer to update queue every 900 seconds (tid=11)
11/23 14:29:29 (1.946) (18902): Set NumJobStarts to 1
11/23 14:29:29 (1.947) (18923): Resource slot2@... changing state from STARTUP to EXECUTING
11/23 14:29:29 (1.947) (18923): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:29 (1.947) (18923): Started timer to evaluate periodic user policy expressions every 60 seconds
11/23 14:29:29 (1.947) (18923): QmgrJobUpdater: started timer to update queue every 900 seconds (tid=11)
11/23 14:29:29 (1.947) (18923): Set NumJobStarts to 1
11/23 14:29:30 (1.947) (18923): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.947) (18923): ERROR "Error from slot2@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.942) (18782): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.942) (18782): ERROR "Error from slot6@...: ProcD has failed" at line 655 in file pseudo_ops.cpp

Stack dump for process 18923 at timestamp 1259004570 (17 frames)
condor_shadow(dprintf_dump_stack+0xb7)[0x4df855]
condor_shadow[0x4dfac2]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(gsignal+0x35)[0x34c3030265]
/lib64/libc.so.6(abort+0x110)[0x34c3031d10]
condor_shadow(_EXCEPT_+0x1a5)[0x4ddde7]
condor_shadow(_Z11pseudo_ulogP7ClassAd+0x37d)[0x4a0355]
condor_shadow(_Z17do_REMOTE_syscallv+0x3609)[0x4a4f9d]
condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x32)[0x49d22a]
condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x279)[0x4cb7d7]
condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x39)[0x4cb9bf]
condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x3f)[0x56bdd9]
condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x1cd)[0x4bdc21]
condor_shadow(_ZN10DaemonCore6DriverEv+0x1876)[0x4bf518]
condor_shadow(main+0x1917)[0x4d70d1]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_shadow[0x48ef69]

11/23 14:29:30 (1.941) (18766): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.941) (18766): ERROR "Error from slot3@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.944) (18849): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.944) (18849): ERROR "Error from slot7@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.945) (18875): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.943) (18816): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.943) (18816): ERROR "Error from slot5@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.946) (18902): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.945) (18875): ERROR "Error from slot4@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.946) (18902): ERROR "Error from slot8@...: ProcD has failed" at line 655 in file pseudo_ops.cpp

(The log also contains stack dumps for processes 18766, 18782, 18816, 18849, 18875, and 18902, all at timestamp 1259004570 and all identical to the 17-frame dump above; in the original log they are garbled because the concurrently crashing shadows interleaved their writes.)

11/23 14:29:30 Found have_phsy 1, have_core 1
11/23 14:29:30 Analyzing 8 processors with IDs
11/23 14:29:30 Comparing P0 and P1: 0!=1 or 0!=0 (! match)
11/23 14:29:30 Comparing P0 and P2: 0!=0 or 0!=1 (! match)
11/23 14:29:30 Comparing P0 and P3: 0!=1 or 0!=1 (! match)
11/23 14:29:30 Comparing P0 and P4: 0!=0 or 0!=2 (! match)
11/23 14:29:30 Comparing P0 and P5: 0!=1 or 0!=2 (! match)
11/23 14:29:30 Comparing P0 and P6: 0!=0 or 0!=3 (! match)
11/23 14:29:30 Comparing P0 and P7: 0!=1 or 0!=3 (! match)
11/23 14:29:30 P0: match->1
11/23 14:29:30 Comparing P1 and P2: 1!=0 or 0!=1 (! match)
11/23 14:29:30 Comparing P1 and P3: 1!=1 or 0!=1 (! match)
11/23 14:29:30 Comparing P1 and P4: 1!=0 or 0!=2 (! match)
11/23 14:29:30 Comparing P1 and P5: 1!=1 or 0!=2 (! match)
11/23 14:29:30 Comparing P1 and P6: 1!=0 or 0!=3 (! match)
11/23 14:29:30 Comparing P1 and P7: 1!=1 or 0!=3 (! match)
11/23 14:29:30 P1: match->1
11/23 14:29:30 Comparing P2 and P3: 0!=1 or 1!=1 (! match)
11/23 14:29:30 Comparing P2 and P4: 0!=0 or 1!=2 (! match)
11/23 14:29:30 Comparing P2 and P5: 0!=1 or 1!=2 (! match)
11/23 14:29:30 Comparing P2 and P6: 0!=0 or 1!=3 (! match)
11/23 14:29:30 Comparing P2 and P7: 0!=1 or 1!=3 (! match)
11/23 14:29:30 P2: match->1
11/23 14:29:30 Comparing P3 and P4: 1!=0 or 1!=2 (! match)
11/23 14:29:30 Comparing P3 and P5: 1!=1 or 1!=2 (! match)
11/23 14:29:30 Comparing P3 and P6: 1!=0 or 1!=3 (! match)
11/23 14:29:30 Comparing P3 and P7: 1!=1 or 1!=3 (! match)
11/23 14:29:30 P3: match->1
11/23 14:29:30 Comparing P4 and P5: 0!=1 or 2!=2 (! match)
11/23 14:29:30 Comparing P4 and P6: 0!=0 or 2!=3 (! match)
11/23 14:29:30 Comparing P4 and P7: 0!=1 or 2!=3 (! match)
11/23 14:29:30 P4: match->1
11/23 14:29:30 Comparing P5 and P6: 1!=0 or 2!=3 (! match)
11/23 14:29:30 Comparing P5 and P7: 1!=1 or 2!=3 (! match)
11/23 14:29:30 P5: match->1
11/23 14:29:30 Comparing P6 and P7: 0!=1 or 3!=3 (! match)
11/23 14:29:30 P6: match->1
11/23 14:29:30 P7: match->1
11/23 14:29:30 Using IDs: 8 processors, 8 CPUs, 0 HTs
What's the frequency of this bug? For instance,

$ grep -e "STARTING UP" -e "signal 11" MasterLog; tail -n1 MasterLog
11/25 13:49:56 ** condor_master (CONDOR_MASTER) STARTING UP
11/25 14:06:09 The STARTD (pid 15559) died due to signal 11 (Segmentation fault)
11/25 14:21:20 exit Daemons::UpdateCollector

That shows one crash over a period of about 30 minutes. However, we need more than one data point to have any confidence in the rate.
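A quick way to gather more data points is simply counting restarts against segfaults over the whole log. A minimal sketch, assuming the default MasterLog location and the log phrasings shown above (the path is illustrative; adjust for your installation):

# Rough crash rate: compare master startups to startd segfaults.
MASTER_LOG=/var/log/condor/MasterLog
echo "master startups:  $(grep -c 'STARTING UP' "$MASTER_LOG")"
echo "startd segfaults: $(grep -c 'The STARTD.*died due to signal 11' "$MASTER_LOG")"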
The condor_shadow failures are a response to a condor_starter failure. The condor_starter failure, only indirectly shown in the logs above, is the "ProcD has failed" error. The condor_procd is a process shared between the condor_startd and condor_starter processes to track process families.

$ pstree -p 12992
condor_startd(12992)─┬─condor_procd(13254)
                     ├─condor_starter(25486)───condor_exec.exe(25567)
                     ├─condor_starter(25519)───condor_exec.exe(25568)
                     ├─condor_starter(25542)───condor_exec.exe(25569)
                     ├─condor_starter(25588)
                     ├─condor_starter(25620)
                     ├─condor_starter(25647)
                     └─condor_starter(31030)───condor_exec.exe(31054)

The procd, which is terminated as part of ~DaemonCore and Proc_Family_Cleanup, may be in a race with the condor_starter. This should be a separate BZ.
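To catch the suspected race in the act, a watch loop like the following could confirm whether the procd exits while starters are still alive. This is only an illustrative sketch, not part of the original report; it assumes the standard process names and a single startd:

# Illustrative monitor for the suspected startd/procd/starter race.
# Flags the moment the condor_procd child of the startd disappears
# while condor_starter children are still running.
STARTD_PID=$(pgrep -o -x condor_startd)
while sleep 1; do
    if ! pgrep -P "$STARTD_PID" -x condor_procd >/dev/null; then
        STARTERS=$(pgrep -P "$STARTD_PID" -x condor_starter | wc -l)
        echo "$(date '+%m/%d %T') procd gone; $STARTERS starter(s) still running"
    fi
done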
What is the occurrence of this bug? Basically, how many jobs are processed by a single startd before it crashes. Very roughly,

$ echo "Jobs per crash: $(($(grep "Idle -> Busy" StartLog | wc -l) / $(grep -e "STARTING UP" StartLog | wc -l)))"
Jobs per crash: 1129
The stack is easier to read after being piped through c++filt:

$ grep -A24 Stack StartLog | c++filt
Stack dump for process 15559 at timestamp 1259175969 (24 frames)
condor_startd(dprintf_dump_stack+0xb7)[0x5020d1]
condor_startd[0x50233e]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(abort+0x28f)[0x34c3031e8f]
condor_startd(_EXCEPT_+0x1a5)[0x500663]
condor_startd(Resource::wants_suspend()+0x113)[0x4a5065]
condor_startd(ResState::eval()+0xe5)[0x4a40d9]
condor_startd(ResState::change(State, Activity)+0x2af)[0x4a335b]
condor_startd(ResState::change(State)+0x7e)[0x4a33e4]
condor_startd(Resource::change_state(State)+0x1f)[0x49819d]
condor_startd(accept_request_claim(Resource*)+0x66c)[0x4b1eae]
condor_startd(request_claim(Resource*, Claim*, char*, Stream*)+0x144a)[0x4b343c]
condor_startd(command_request_claim(Service*, int, Stream*)+0x2a6)[0x4b494e]
condor_startd(DaemonCore::CallCommandHandler(int, Stream*, bool)+0x29f)[0x4d7675]
condor_startd(DaemonCore::HandleReq(Stream*, Stream*)+0x3393)[0x4eb671]
condor_startd(DaemonCore::HandleReqSocketHandler(Stream*)+0x99)[0x4ec287]
condor_startd(DaemonCore::CallSocketHandler_worker(int, bool, Stream*)+0x279)[0x4ebffb]
condor_startd(DaemonCore::CallSocketHandler_worker_demarshall(void*)+0x39)[0x4ec1e3]
condor_startd(CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)+0x3f)[0x579acd]
condor_startd(DaemonCore::CallSocketHandler(int&, bool)+0x1cd)[0x4de445]
condor_startd(DaemonCore::Driver()+0x1876)[0x4dfd3c]
condor_startd(main+0x1917)[0x4f7785]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_startd[0x490169]
NOTE (thanks GregT for the reminder): The check for WANT_SUSPEND is an EvalBool, which means WANT_SUSPEND evaluating to UNDEFINED triggers this assertion. Debug with:

WANT_SUSPEND = debug($(WANT_SUSPEND))

The assertion may be misguided. It attempts to assert that WANT_SUSPEND is set, but it cannot distinguish a WANT_SUSPEND that evaluates to UNDEFINED from one that was never set.
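For the record, the debug() wrapping can be dropped into the local configuration without touching the main config file. A minimal sketch, assuming a standard local config path (the path is illustrative; use whatever LOCAL_CONFIG_FILE points at):

# Wrap the existing WANT_SUSPEND so each sub-expression's value is logged.
# Single quotes keep $(WANT_SUSPEND) as a condor macro, not a shell expansion.
echo 'WANT_SUSPEND = debug($(WANT_SUSPEND))' >> /etc/condor/condor_config.local
condor_reconfig    # pick up the change; "Classad debug:" lines land in the StartLog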
$ condor_config_val -v WANT_SUSPEND
WANT_SUSPEND: debug(( (TARGET.ImageSize < (15 * 1024)) || ((KeyboardIdle < 60) == False) || (TARGET.JobUniverse == 5) ) && ( ( (KeyboardIdle < 60) || ( (CpuBusyTime > 2 * 60) && (CurrentTime - JobStart) > 90 ) ) ))

I would suspect JobStart, a.k.a. ATTR_JOB_START.
Indeed, JobStart evaluates to UNDEFINED. StartLog:

11/25 15:49:27 Classad debug: ImageSize --> ERROR
11/25 15:49:27 Classad debug: TARGET.ImageSize --> 22
11/25 15:49:27 Classad debug: KeyboardIdle --> ERROR
11/25 15:49:27 Classad debug: KeyboardIdle --> 66
11/25 15:49:27 Classad debug: CpuBusyTime --> ERROR
11/25 15:49:27 Classad debug: CpuBusyTime --> 318
11/25 15:49:27 Classad debug: CurrentTime --> 1259182167
11/25 15:49:27 Classad debug: JobStart --> UNDEFINED
11/25 15:49:27 Classad debug: debug(((TARGET.ImageSize < (15 * 1024)) || ((KeyboardIdle < 60) == FALSE) || (TARGET.JobUniverse == 5)) && (((KeyboardIdle < 60) || ((CpuBusyTime > 2 * 60) && (CurrentTime - JobStart) > 90)))) --> UNDEFINED
11/25 15:49:27 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at line 1261 in file Resource.cpp
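The propagation follows standard ClassAd three-valued logic. Walking the logged values through the expression (my annotation, not from the original log):

# Why the whole expression is UNDEFINED, step by step:
#   TARGET.ImageSize < (15 * 1024)    --> 22 < 15360              --> TRUE
#     => the left conjunct (the || chain) is TRUE
#   KeyboardIdle < 60                 --> 66 < 60                 --> FALSE
#   CurrentTime - JobStart            --> 1259182167 - UNDEFINED  --> UNDEFINED
#   UNDEFINED > 90                    --> UNDEFINED
#   (CpuBusyTime > 2 * 60) && above   --> TRUE && UNDEFINED       --> UNDEFINED
#   FALSE || UNDEFINED                --> UNDEFINED
#   TRUE && UNDEFINED                 --> UNDEFINED
# EvalBool() cannot coerce UNDEFINED to a bool, so it reports failure
# and the startd EXCEPTs.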
With debugging,

$ grep -e "STARTING UP" -e "signal 11" MasterLog; tail -n1 MasterLog
11/25 13:49:56 ** condor_master (CONDOR_MASTER) STARTING UP
11/25 14:06:09 The STARTD (pid 15559) died due to signal 11 (Segmentation fault)
11/25 14:27:46 The STARTD (pid 12992) died due to signal 11 (Segmentation fault)
11/25 14:49:01 The STARTD (pid 28825) died due to signal 11 (Segmentation fault)
11/25 15:10:41 The STARTD (pid 11562) died due to signal 11 (Segmentation fault)
11/25 15:38:20 The STARTD (pid 27427) died due to signal 11 (Segmentation fault)
11/25 15:49:27 The STARTD (pid 29229) died due to signal 11 (Segmentation fault)
11/25 16:06:40 ProcAPI::getProcInfo() pid 16745 does not exist.

$ grep "JobStart --> UNDEFINED" StartLog
11/25 15:49:27 Classad debug: JobStart --> UNDEFINED
Proposed workaround:

$ grep ActivationTimer ~condor/condor_config.overrides
ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (CurrentTime - JobStart), 0)
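After dropping in the override, the effective value can be checked and pushed to the running daemons without a restart. A small sketch, assuming the overrides file above is read via the site's LOCAL_CONFIG_FILE:

# Verify the override is picked up, then signal the daemons to re-read config.
condor_config_val ActivationTimer
condor_reconfig
# The startd should now survive claims where JobStart is not yet defined.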
This crash only happened when the evaluation of WANT_SUSPEND made it to the evaluation of JobStart, which was guarded by KeyboardIdle and CpuBusyTime. The crash can easily be reproduced with:

WANT_SUSPEND = (CurrentTime - JobStart) < 90

JobStart is defined by a claim with an active job in the startd. It is not always present, making it a dangerous choice for default policy.
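A full reproduction therefore needs only that config line and one job. A sketch, assuming a standard local config file (paths are illustrative):

# Force WANT_SUSPEND to depend on JobStart unconditionally.
echo 'WANT_SUSPEND = (CurrentTime - JobStart) < 90' >> /etc/condor/condor_config.local
condor_reconfig

# Any vanilla job will do; the startd EXCEPTs while accepting the claim,
# before JobStart has been defined.
cat > /tmp/sleep.sub <<'EOF'
executable = /bin/sleep
arguments = 3
universe = vanilla
queue
EOF
condor_submit /tmp/sleep.sub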
Two actions on this:
1) Update the assertion to differentiate between WANT_SUSPEND not being present and it evaluating to UNDEFINED
2) Update the default configuration to protect access to JobStart
This bug has likely been around for a very long time: the ActivationTimer expression showed up in 1998 (git show 3cef52db), and the conditional setting of ATTR_JOB_START has been around since at least 2003 (see the V6_5-branch).
Upstream ticket: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1001
commit 745a047243d94d23ac286fd604c57542f3f3cd15
Author: Matthew Farrellee <matt@>
Date:   Fri Dec 18 15:50:16 2009 -0500

    Protect ActivationTimer from going to UNDEFINED (#1001)

diff --git a/src/condor_examples/condor_config.generic b/src/condor_examples/condor_config.generic
index 4a49eae..3d93b5a 100644
--- a/src/condor_examples/condor_config.generic
+++ b/src/condor_examples/condor_config.generic
@@ -622,7 +622,7 @@ MINUTE = 60
 HOUR = (60 * $(MINUTE))
 StateTimer = (CurrentTime - EnteredCurrentState)
 ActivityTimer = (CurrentTime - EnteredCurrentActivity)
-ActivationTimer = (CurrentTime - JobStart)
+ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (CurrentTime - JobStart), 0)
 LastCkpt = (CurrentTime - LastPeriodicCheckpoint)

 ## The JobUniverse attribute is just an int. These macros can be
commit 6d4b6e5119261851e17996e5121c42b7c3a045fe
Author: Matthew Farrellee <matt@>
Date:   Fri Dec 18 15:55:12 2009 -0500

    WANT_SUSPEND that is UNDEFINED, either by being absent or evaluating
    to UNDEFINED, now means FALSE. It used to mean EXCEPT! (#1001)

diff --git a/src/condor_startd.V6/Resource.cpp b/src/condor_startd.V6/Resource.cpp
index 6a23889..76da635 100644
--- a/src/condor_startd.V6/Resource.cpp
+++ b/src/condor_startd.V6/Resource.cpp
@@ -1255,10 +1255,8 @@ Resource::wants_suspend( void )
 	if( r_classad->EvalBool( "WANT_SUSPEND",
 				 r_cur->ad(),
 				 want_suspend ) == 0) {
-			// This should never happen, since we already check
-			// when we're constructing the internal config classad
-			// if we've got this defined. -Derek Wright 4/12/00
-		EXCEPT( "Can't find WANT_SUSPEND in internal ClassAd" );
+			// UNDEFINED means FALSE for WANT_SUSPEND
+		want_suspend = false;
 	}
 	}
 	return want_suspend;
Fixed in 7.4.2-0.1
Tested on RHEL 5.5/4.8, x86_64/i386, with condor-7.4.3-0.5. I submitted 100000 simple jobs from comment #1, and another 100000 simple jobs with "WANT_SUSPEND = (CurrentTime - JobStart) < 90" in the configuration. I didn't see any errors or coredumps from comment #1.

--> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, if the user submitted around 10000 simple "sleep 3" jobs, condor_startd and condor_shadow failed. With this update, condor_startd and condor_shadow work as expected, even with large numbers of jobs.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Previously, if the user submitted around 10000 simple "sleep 3" jobs, condor_startd and condor_shadow failed. With this update, condor_startd and condor_shadow work as expected, even with large numbers of jobs.
+Previously, condor_startd and condor_shadow failed because the default WANT_SUSPEND policy could evaluate to UNDEFINED. With this update, condor_startd and condor_shadow work as expected, even if WANT_SUSPEND evaluates to UNDEFINED.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,4 @@
-Previously, condor_startd and condor_shadow failed because the default WANT_SUSPEND policy could evaluate to UNDEFINED. With this update, condor_startd and condor_shadow work as expected, even if WANT_SUSPEND evaluates to UNDEFINED.
+C: The default WANT_SUSPEND policy included JobStart, which in some cases may be UNDEFINED.
+C: The WANT_SUSPEND policy expression would evaluate to UNDEFINED when JobStart was not defined. A WANT_SUSPEND of UNDEFINED was considered an error, and the startd would exit.
+F: Worked around by removing JobStart from WANT_SUSPEND. Fixed by making WANT_SUSPEND evaluating to UNDEFINED equivalent to evaluating to FALSE.
+R: The startd now treats a WANT_SUSPEND that evaluates to UNDEFINED as if it evaluated to FALSE.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-C: The default WANT_SUSPEND policy included JobStart, which in some cases may be UNDEFINED.
-C: The WANT_SUSPEND policy expression would evaluate to UNDEFINED when JobStart was not defined. A WANT_SUSPEND of UNDEFINED was considered an error, and the startd would exit.
-F: Worked around by removing JobStart from WANT_SUSPEND. Fixed by making WANT_SUSPEND evaluating to UNDEFINED equivalent to evaluating to FALSE.
-R: The startd now treats a WANT_SUSPEND that evaluates to UNDEFINED as if it evaluated to FALSE.
+The default 'WANT_SUSPEND' policy included the JobStart attribute, which in some cases was 'UNDEFINED'. A 'WANT_SUSPEND' with the value 'UNDEFINED' was considered an error and startd would exit. With this update, the JobStart attribute was removed from 'WANT_SUSPEND', and startd now treats a 'WANT_SUSPEND' that evaluates to 'UNDEFINED' as if it evaluated to 'FALSE'.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html