Bug 540545

Summary: WANT_SUSPEND evaluating to UNDEFINED causes condor_startd exception
Product: Red Hat Enterprise MRG
Reporter: Martin Kudlej <mkudlej>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Martin Kudlej <mkudlej>
Severity: high
Priority: medium
Version: 1.0
CC: fnadge, matt
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text:
The default 'WANT_SUSPEND' policy included the JobStart attribute, which in some cases was 'UNDEFINED'. A 'WANT_SUSPEND' with the value 'UNDEFINED' was considered an error and startd would exit. With this update, the JobStart attribute was removed from 'WANT_SUSPEND'. Startd now treats a 'WANT_SUSPEND' that evaluates to 'UNDEFINED' as if it evaluated to 'FALSE'.
Last Closed: 2010-10-14 16:12:48 UTC

Description Martin Kudlej 2009-11-23 16:16:45 UTC
Description of problem:
I've submitted 10000 simple jobs ("sleep 3") and after a few minutes I can see in the logs that condor_startd and condor_shadow crash. There is no corefile in ~condor even though I have set it up properly.

Version-Release number of selected component (if applicable):
RHEL 5.3 and 5.4 with condor-7.4.1-0.6 and -0.7, on both i386 and x86_64 architectures.

How reproducible:
100%, when submitting 10000 simple jobs

Steps to Reproduce:
1. turn off qmf plugins
2. turn on advanced logging
3. submit 10000 simple jobs
  
Actual results:
condor_shadow and condor_startd crash

Expected results:
condor_shadow and condor_startd don't crash

Core dump setup used (so that cores should have been captured):
# common part rhel4/5/...
  ulimit -c unlimited
  export DAEMON_COREFILE_LIMIT='unlimited'
  grep DAEMON_COREFILE_LIMIT /etc/sysconfig/init ||
   echo "DAEMON_COREFILE_LIMIT='unlimited'" >>/etc/sysconfig/init

  # condor specific
  grep ^CREATE_CORE_FILES ${condor_config} || \
    echo "CREATE_CORE_FILES = True" >> ${condor_config}
  
  # condor optional (useful in some cases)
  grep ^ABORT_ON_EXCEPTION ${condor_config} || \
    echo "ABORT_ON_EXCEPTION = True" >> ${condor_config}

  # rhel5/... only
  echo 2 > /proc/sys/fs/suid_dumpable

  # rhel4 only
  echo 2 > /proc/sys/kernel/suid_dumpable
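
After a crash, any captured core can be looked for with, e.g. (the exact location depends on the setup; ~condor is where this reporter expected them):

  $ ls -l ~condor/core*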


Additional info:
submit file:
executable = /bin/sleep
arguments = 3
universe=vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

queue 10000
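
With the description above saved as, say, sleep.sub (the file name is arbitrary), the jobs are queued with:

$ condor_submit sleep.sub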

MasterLog:
11/23 14:28:44 ProcAPI::getProcInfo() pid 14406 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14406 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14406 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14406 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:28:44 ProcAPI::getProcInfo() pid 14407 does not exist.
11/23 14:29:30 DaemonCore: No more children processes to reap.
11/23 14:29:30 The STARTD (pid 24153) died due to signal 11 (Segmentation fault)
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18780 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18805 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18805 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18805 does not exist.
11/23 14:29:30 ProcAPI::getProcInfo() pid 18805 does not exist.

StartdLog:
11/23 14:29:30 slot1: match_info called
11/23 14:29:30 slot1: Received match <:51411>#1259003801#17#...
11/23 14:29:30 slot1: Started match timer (4569) for 120 seconds.
11/23 14:29:30 slot1: State change: match notification protocol successful
11/23 14:29:30 slot1: Changing state: Unclaimed -> Matched
11/23 14:29:30 slot1: Canceled match timer (4569)
11/23 14:29:30 slot1: Schedd addr = <:43049>
11/23 14:29:30 slot1: Alive interval = 300
11/23 14:29:30 slot1: Received ClaimId from schedd (<:51411>#1259003801#17#...)
11/23 14:29:30 slot1: Rank of this claim is: 0.000000
11/23 14:29:30 slot1: Request accepted.
11/23 14:29:30 slot1: Remote owner is condor@..
11/23 14:29:30 slot1: State change: claiming protocol successful
11/23 14:29:30 slot1: Changing state: Matched -> Claimed
11/23 14:29:30 slot1: Started ClaimLease timer (4572) w/ 1800 second lease duration
11/23 14:29:30 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at line 1261 in file Resource.cpp
11/23 14:29:30 CronMgr: 0 jobs alive
11/23 14:29:30 slot1: Canceled ClaimLease timer (4572)
11/23 14:29:30 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing
11/23 14:29:30 Entered vacate_client <:43049> ...
11/23 14:29:30 slot1: State change: No preempting claim, returning to owner
11/23 14:29:30 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
11/23 14:29:30 slot1: State change: IS_OWNER is false
11/23 14:29:30 slot1: Changing state: Owner -> Unclaimed
11/23 14:29:30 slot2: Canceled ClaimLease timer (33)
11/23 14:29:30 slot2: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot2: In Starter::kill() with pid 18939, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18939,3) [SIGQUIT]
11/23 14:29:30 slot3: Canceled ClaimLease timer (35)
11/23 14:29:30 slot3: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot3: In Starter::kill() with pid 18780, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18780,3) [SIGQUIT]
11/23 14:29:30 slot4: Canceled ClaimLease timer (37)
11/23 14:29:30 slot4: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot4: In Starter::kill() with pid 18889, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18889,3) [SIGQUIT]
11/23 14:29:30 slot5: Canceled ClaimLease timer (39)
11/23 14:29:30 slot5: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot5: In Starter::kill() with pid 18828, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18828,3) [SIGQUIT]
11/23 14:29:30 slot6: Canceled ClaimLease timer (41)
11/23 14:29:30 slot6: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot6: In Starter::kill() with pid 18805, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18805,3) [SIGQUIT]
11/23 14:29:30 slot7: Canceled ClaimLease timer (43)
11/23 14:29:30 slot7: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot7: In Starter::kill() with pid 18863, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18863,3) [SIGQUIT]
11/23 14:29:30 slot8: Canceled ClaimLease timer (31)
11/23 14:29:30 slot8: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/23 14:29:30 slot8: In Starter::kill() with pid 18915, sig 3 (SIGQUIT)
11/23 14:29:30 Send_Signal(): Doing kill(18915,3) [SIGQUIT]
11/23 14:29:30 startd exiting because of fatal exception.
Stack dump for process 24153 at timestamp 1259004570 (24 frames)
condor_startd(dprintf_dump_stack+0xb7)[0x5020d1]
condor_startd[0x50233e]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(abort+0x28f)[0x34c3031e8f]
condor_startd(_EXCEPT_+0x1a5)[0x500663]
condor_startd(_ZN8Resource13wants_suspendEv+0x113)[0x4a5065]
condor_startd(_ZN8ResState4evalEv+0xe5)[0x4a40d9]
condor_startd(_ZN8ResState6changeE5State8Activity+0x2af)[0x4a335b]
condor_startd(_ZN8ResState6changeE5State+0x7e)[0x4a33e4]
condor_startd(_ZN8Resource12change_stateE5State+0x1f)[0x49819d]
condor_startd(_Z20accept_request_claimP8Resource+0x66c)[0x4b1eae]
condor_startd(_Z13request_claimP8ResourceP5ClaimPcP6Stream+0x144a)[0x4b343c]
condor_startd(_Z21command_request_claimP7ServiceiP6Stream+0x2a6)[0x4b494e]
condor_startd(_ZN10DaemonCore18CallCommandHandlerEiP6Streamb+0x29f)[0x4d7675]
condor_startd(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x3393)[0x4eb671]
condor_startd(_ZN10DaemonCore22HandleReqSocketHandlerEP6Stream+0x99)[0x4ec287]
condor_startd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x279)[0x4ebffb]
condor_startd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x39)[0x4ec1e3]
condor_startd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x3f)[0x579acd]
condor_startd(_ZN10DaemonCore17CallSocketHandlerERib+0x1cd)[0x4de445]
condor_startd(_ZN10DaemonCore6DriverEv+0x1876)[0x4dfd3c]
condor_startd(main+0x1917)[0x4f7785]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_startd[0x490169]
11/23 14:29:40 Found have_phsy 1, have_core 1
11/23 14:29:40 Analyzing 8 processors with IDs
11/23 14:29:40 Comparing P0 and P1: 0!=1 or  0!=0 (! match)
11/23 14:29:40 Comparing P0 and P2: 0!=0 or  0!=1 (! match)
11/23 14:29:40 Comparing P0 and P3: 0!=1 or  0!=1 (! match)
11/23 14:29:40 Comparing P0 and P4: 0!=0 or  0!=2 (! match)
11/23 14:29:40 Comparing P0 and P5: 0!=1 or  0!=2 (! match)
11/23 14:29:40 Comparing P0 and P6: 0!=0 or  0!=3 (! match)
11/23 14:29:40 Comparing P0 and P7: 0!=1 or  0!=3 (! match)
11/23 14:29:40 P0: match->1
11/23 14:29:40 Comparing P1 and P2: 1!=0 or  0!=1 (! match)
11/23 14:29:40 Comparing P1 and P3: 1!=1 or  0!=1 (! match)
11/23 14:29:40 Comparing P1 and P4: 1!=0 or  0!=2 (! match)
11/23 14:29:40 Comparing P1 and P5: 1!=1 or  0!=2 (! match)
11/23 14:29:40 Comparing P1 and P6: 1!=0 or  0!=3 (! match)
11/23 14:29:40 Comparing P1 and P7: 1!=1 or  0!=3 (! match)
11/23 14:29:40 P1: match->1
11/23 14:29:40 Comparing P2 and P3: 0!=1 or  1!=1 (! match)
11/23 14:29:40 Comparing P2 and P4: 0!=0 or  1!=2 (! match)
11/23 14:29:40 Comparing P2 and P5: 0!=1 or  1!=2 (! match)
11/23 14:29:40 Comparing P2 and P6: 0!=0 or  1!=3 (! match)
11/23 14:29:40 Comparing P2 and P7: 0!=1 or  1!=3 (! match)
11/23 14:29:40 P2: match->1
11/23 14:29:40 Comparing P3 and P4: 1!=0 or  1!=2 (! match)
11/23 14:29:40 Comparing P3 and P5: 1!=1 or  1!=2 (! match)
11/23 14:29:40 Comparing P3 and P6: 1!=0 or  1!=3 (! match)
11/23 14:29:40 Comparing P3 and P7: 1!=1 or  1!=3 (! match)
11/23 14:29:40 P3: match->1
11/23 14:29:40 Comparing P4 and P5: 0!=1 or  2!=2 (! match)
11/23 14:29:40 Comparing P4 and P6: 0!=0 or  2!=3 (! match)
11/23 14:29:40 Comparing P4 and P7: 0!=1 or  2!=3 (! match)
11/23 14:29:40 P4: match->1
11/23 14:29:40 Comparing P5 and P6: 1!=0 or  2!=3 (! match)
11/23 14:29:40 Comparing P5 and P7: 1!=1 or  2!=3 (! match)
11/23 14:29:40 P5: match->1
11/23 14:29:40 Comparing P6 and P7: 0!=1 or  3!=3 (! match)
11/23 14:29:40 P6: match->1
11/23 14:29:40 P7: match->1
11/23 14:29:40 Using IDs: 8 processors, 8 CPUs, 0 HTs

ShadowLog:
11/23 14:29:29 (1.944) (18849): Set NumJobStarts to 1
11/23 14:29:29 (1.945) (18875): Resource slot4@... changing state from STARTUP to EXECUTING
11/23 14:29:29 (1.945) (18875): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:29 (1.945) (18875): Started timer to evaluate periodic user policy expressions every 60 seconds
11/23 14:29:29 (1.945) (18875): QmgrJobUpdater: started timer to update queue every 900 seconds (tid=11)
11/23 14:29:29 (1.945) (18875): Set NumJobStarts to 1
11/23 14:29:29 (1.946) (18902): Resource slot8@... changing state from STARTUP to EXECUTING
11/23 14:29:29 (1.946) (18902): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:29 (1.946) (18902): Started timer to evaluate periodic user policy expressions every 60 seconds
11/23 14:29:29 (1.946) (18902): QmgrJobUpdater: started timer to update queue every 900 seconds (tid=11)
11/23 14:29:29 (1.946) (18902): Set NumJobStarts to 1
11/23 14:29:29 (1.947) (18923): Resource slot2@... changing state from STARTUP to EXECUTING
11/23 14:29:29 (1.947) (18923): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:29 (1.947) (18923): Started timer to evaluate periodic user policy expressions every 60 seconds
11/23 14:29:29 (1.947) (18923): QmgrJobUpdater: started timer to update queue every 900 seconds (tid=11)
11/23 14:29:29 (1.947) (18923): Set NumJobStarts to 1
11/23 14:29:30 (1.947) (18923): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.947) (18923): ERROR "Error from slot2@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.942) (18782): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.942) (18782): ERROR "Error from slot6@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
Stack dump for process 18923 at timestamp 1259004570 (17 frames)
condor_shadow(dprintf_dump_stack+0xb7)[0x4df855]
condor_shadow[0x4dfac2]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(gsignal+0x35)[0x34c3030265]
/lib64/libc.so.6(abort+0x110)[0x34c3031d10]
condor_shadow(_EXCEPT_+0x1a5)[0x4ddde7]
condor_shadow(_Z11pseudo_ulogP7ClassAd+0x37d)[0x4a0355]
condor_shadow(_Z17do_REMOTE_syscallv+0x3609)[0x4a4f9d]
condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x32)[0x49d22a]
condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x279)[0x4cb7d7]
condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x39)[0x4cb9bf]
condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x3f)[0x56bdd9]
condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x1cd)[0x4bdc21]
condor_shadow(_ZN10DaemonCore6DriverEv+0x1876)[0x4bf518]
condor_shadow(main+0x1917)[0x4d70d1]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_shadow[0x48ef69]
11/23 14:29:30 (1.941) (18766): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.941) (18766): ERROR "Error from slot3@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.944) (18849): WriteUserLog: not initialized @ writeEvent()
Stack dump for process 18766 at timestamp 1259004570 (17 frames)
condor_shadow(dprintf_dump_stack+0xb7)[0x4df855]
condor_shadow[0x4dfac2]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(gsignal+0x35)[0x34c3030265]
/lib64/libc.so.6(abort+0x110)[0x34c3031d10]
condor_shadow(_EXCEPT_+0x1a5)[0x4ddde7]
condor_shadow(_Z11pseudo_ulogP7ClassAd+0x37d)[0x4a0355]
condor_shadow(_Z17do_REMOTE_syscallv+0x3609)[0x4a4f9d]
condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x32)[0x49d22a]
condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x279)[0x4cb7d7]
condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x39)[0x4cb9bf]
condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x3f)[0x56bdd9]
condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x1cd)[0x4bdc21]
condor_shadow(_ZN10DaemonCore6DriverEv+0x1876)[0x4bf518]
condor_shadow(main+0x1917)[0x4d70d1]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_shadow[0x48ef69]
11/23 14:29:30 (1.944) (18849): ERROR "Error from slot7@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.945) (18875): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.943) (18816): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.943) (18816): ERROR "Error from slot5@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
Stack dump for process 18849 at timestamp 1259004570 (17 frames)
condor_shadow(dprintf_dump_stack+0xb7)[0x4df855]
condor_shadow[0x4dfac2]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(gsignal+0x35)[0x34c3030265]
/lib64/libc.so.6(abort+0x110)[0x34c3031d10]
condor_shadow(_EXCEPT_+0x1a5)[0x4ddde7]
condor_shadow(_Z11pseudo_ulogP7ClassAd+0x37d)[0x4a0355]
condor_shadow(_Z17do_REMOTE_syscallv+0x3609)[0x4a4f9d]
condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x32)[0x49d22a]
condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x279)[0x4cb7d7]
condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x39)[0x4cb9bf]
condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x3f)[0x56bdd9]
condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x1cd)[0x4bdc21]
condor_shadow(_ZN10DaemonCore6DriverEv+0x1876)[0x4bf518]
condor_shadow(main+0x1917)[0x4d70d1]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_shadow[0x48ef69]
11/23 14:29:30 (1.946) (18902): WriteUserLog: not initialized @ writeEvent()
11/23 14:29:30 (1.945) (18875): ERROR "Error from slot4@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
11/23 14:29:30 (1.946) (18902): ERROR "Error from slot8@...: ProcD has failed" at line 655 in file pseudo_ops.cpp
(Stack dumps for processes 18782, 18816, 18875 and 18902, each reported as 17 frames, are interleaved with one another at this point in the log; the frames that can be made out match the dump above.)
11/23 14:29:30 Found have_phsy 1, have_core 1
11/23 14:29:30 Analyzing 8 processors with IDs
11/23 14:29:30 Comparing P0 and P1: 0!=1 or  0!=0 (! match)
11/23 14:29:30 Comparing P0 and P2: 0!=0 or  0!=1 (! match)
11/23 14:29:30 Comparing P0 and P3: 0!=1 or  0!=1 (! match)
11/23 14:29:30 Comparing P0 and P4: 0!=0 or  0!=2 (! match)
11/23 14:29:30 Comparing P0 and P5: 0!=1 or  0!=2 (! match)
11/23 14:29:30 Comparing P0 and P6: 0!=0 or  0!=3 (! match)
11/23 14:29:30 Comparing P0 and P7: 0!=1 or  0!=3 (! match)
11/23 14:29:30 P0: match->1
11/23 14:29:30 Comparing P1 and P2: 1!=0 or  0!=1 (! match)
11/23 14:29:30 Comparing P1 and P3: 1!=1 or  0!=1 (! match)
11/23 14:29:30 Comparing P1 and P4: 1!=0 or  0!=2 (! match)
11/23 14:29:30 Comparing P1 and P5: 1!=1 or  0!=2 (! match)
11/23 14:29:30 Comparing P1 and P6: 1!=0 or  0!=3 (! match)
11/23 14:29:30 Comparing P1 and P7: 1!=1 or  0!=3 (! match)
11/23 14:29:30 P1: match->1
11/23 14:29:30 Comparing P2 and P3: 0!=1 or  1!=1 (! match)
11/23 14:29:30 Comparing P2 and P4: 0!=0 or  1!=2 (! match)
11/23 14:29:30 Comparing P2 and P5: 0!=1 or  1!=2 (! match)
11/23 14:29:30 Comparing P2 and P6: 0!=0 or  1!=3 (! match)
11/23 14:29:30 Comparing P2 and P7: 0!=1 or  1!=3 (! match)
11/23 14:29:30 P2: match->1
11/23 14:29:30 Comparing P3 and P4: 1!=0 or  1!=2 (! match)
11/23 14:29:30 Comparing P3 and P5: 1!=1 or  1!=2 (! match)
11/23 14:29:30 Comparing P3 and P6: 1!=0 or  1!=3 (! match)
11/23 14:29:30 Comparing P3 and P7: 1!=1 or  1!=3 (! match)
11/23 14:29:30 P3: match->1
11/23 14:29:30 Comparing P4 and P5: 0!=1 or  2!=2 (! match)
11/23 14:29:30 Comparing P4 and P6: 0!=0 or  2!=3 (! match)
11/23 14:29:30 Comparing P4 and P7: 0!=1 or  2!=3 (! match)
11/23 14:29:30 P4: match->1
11/23 14:29:30 Comparing P5 and P6: 1!=0 or  2!=3 (! match)
11/23 14:29:30 Comparing P5 and P7: 1!=1 or  2!=3 (! match)
11/23 14:29:30 P5: match->1
11/23 14:29:30 Comparing P6 and P7: 0!=1 or  3!=3 (! match)
11/23 14:29:30 P6: match->1
11/23 14:29:30 P7: match->1
11/23 14:29:30 Using IDs: 8 processors, 8 CPUs, 0 HTs

Comment 1 Matthew Farrellee 2009-11-25 14:27:37 UTC
What's the frequency of this bug?

For instance,

$ grep -e "STARTING UP" -e "signal 11" MasterLog; tail -n1 MasterLog
11/25 13:49:56 ** condor_master (CONDOR_MASTER) STARTING UP
11/25 14:06:09 The STARTD (pid 15559) died due to signal 11 (Segmentation fault)
11/25 14:21:20 exit Daemons::UpdateCollector

That shows one crash over a period of about 30 minutes. However, we need more than 1 data point to have any confidence in the rate.

Comment 2 Matthew Farrellee 2009-11-25 14:47:07 UTC
The condor_shadow failures are a response to a condor_starter failure.

The condor_starter failure, only indirectly shown in the logs above, is because "ProcD has failed". The condor_procd is a process shared between the condor_startd and condor_starter processes to track process families.

$ pstree -p 12992
condor_startd(12992)─┬─condor_procd(13254)
                     ├─condor_starter(25486)───condor_exec.exe(25567)
                     ├─condor_starter(25519)───condor_exec.exe(25568)
                     ├─condor_starter(25542)───condor_exec.exe(25569)
                     ├─condor_starter(25588)
                     ├─condor_starter(25620)
                     ├─condor_starter(25647)
                     └─condor_starter(31030)───condor_exec.exe(31054)
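
For reference, a tree like the one above can be captured on a running execute node with, e.g. (assuming pstree is installed and a single startd on the host):

$ pstree -p $(pidof condor_startd)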

The procd, which is terminated as part of ~DaemonCore and Proc_Family_Cleanup, may be in a race with the condor_starter. This should be a separate BZ.

Comment 3 Matthew Farrellee 2009-11-25 15:22:05 UTC
What is the occurrence of this bug?

Basically, how many jobs are processed by a single startd before it crashes.

Very roughly,

$ echo "Jobs per crash: $(($(grep "Idle -> Busy" StartLog | wc -l) / $(grep -e "STARTING UP" StartLog | wc -l)))"
Jobs per crash: 1129

Comment 4 Matthew Farrellee 2009-11-25 15:31:02 UTC
The stack is easier to read after being piped through c++filt,

$ grep -A24 Stack StartLog | c++filt 
Stack dump for process 15559 at timestamp 1259175969 (24 frames)
condor_startd(dprintf_dump_stack+0xb7)[0x5020d1]
condor_startd[0x50233e]
/lib64/libpthread.so.0[0x34c380e7c0]
/lib64/libc.so.6(abort+0x28f)[0x34c3031e8f]
condor_startd(_EXCEPT_+0x1a5)[0x500663]
condor_startd(Resource::wants_suspend()+0x113)[0x4a5065]
condor_startd(ResState::eval()+0xe5)[0x4a40d9]
condor_startd(ResState::change(State, Activity)+0x2af)[0x4a335b]
condor_startd(ResState::change(State)+0x7e)[0x4a33e4]
condor_startd(Resource::change_state(State)+0x1f)[0x49819d]
condor_startd(accept_request_claim(Resource*)+0x66c)[0x4b1eae]
condor_startd(request_claim(Resource*, Claim*, char*, Stream*)+0x144a)[0x4b343c]
condor_startd(command_request_claim(Service*, int, Stream*)+0x2a6)[0x4b494e]
condor_startd(DaemonCore::CallCommandHandler(int, Stream*, bool)+0x29f)[0x4d7675]
condor_startd(DaemonCore::HandleReq(Stream*, Stream*)+0x3393)[0x4eb671]
condor_startd(DaemonCore::HandleReqSocketHandler(Stream*)+0x99)[0x4ec287]
condor_startd(DaemonCore::CallSocketHandler_worker(int, bool, Stream*)+0x279)[0x4ebffb]
condor_startd(DaemonCore::CallSocketHandler_worker_demarshall(void*)+0x39)[0x4ec1e3]
condor_startd(CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)+0x3f)[0x579acd]
condor_startd(DaemonCore::CallSocketHandler(int&, bool)+0x1cd)[0x4de445]
condor_startd(DaemonCore::Driver()+0x1876)[0x4dfd3c]
condor_startd(main+0x1917)[0x4f7785]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x34c301d994]
condor_startd[0x490169]

Comment 5 Matthew Farrellee 2009-11-25 15:49:57 UTC
NOTE (thanks GregT for the reminder): The check of WANT_SUSPEND uses EvalBool, which fails when the expression evaluates to UNDEFINED, and that failure trips this assertion.

Debug with: WANT_SUSPEND = debug($(WANT_SUSPEND))

The assertion may be misguided. It attempts to assert that WANT_SUSPEND is set, but cannot distinguish between a WANT_SUSPEND that evaluates to UNDEFINED and one that was never set.

Comment 6 Matthew Farrellee 2009-11-25 15:53:32 UTC
$ condor_config_val -v WANT_SUSPEND
WANT_SUSPEND: debug(( (TARGET.ImageSize <  (15 * 1024)) || ((KeyboardIdle < 60) == False) || (TARGET.JobUniverse == 5) ) && ( ( (KeyboardIdle < 60) || ( (CpuBusyTime > 2 * 60) && (CurrentTime - JobStart) > 90 ) ) ))

I would suspect JobStart, a.k.a. ATTR_JOB_START.

Comment 7 Matthew Farrellee 2009-11-25 15:55:43 UTC
Indeed, JobStart is evaluating to UNDEFINED.

StartLog:

11/25 15:49:27 Classad debug: ImageSize --> ERROR
11/25 15:49:27 Classad debug: TARGET.ImageSize --> 22
11/25 15:49:27 Classad debug: KeyboardIdle --> ERROR
11/25 15:49:27 Classad debug: KeyboardIdle --> 66
11/25 15:49:27 Classad debug: CpuBusyTime --> ERROR
11/25 15:49:27 Classad debug: CpuBusyTime --> 318
11/25 15:49:27 Classad debug: CurrentTime --> 1259182167
11/25 15:49:27 Classad debug: JobStart --> UNDEFINED
11/25 15:49:27 Classad debug: debug(((TARGET.ImageSize < (15 * 1024)) || ((KeyboardIdle < 60) == FALSE) || (TARGET.JobUniverse == 5)) && (((KeyboardIdle < 60) || ((CpuBusyTime > 2 * 60) && (CurrentTime - JobStart) > 90)))) --> UNDEFINED
11/25 15:49:27 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at line 1261 in file Resource.cpp

Comment 8 Matthew Farrellee 2009-11-25 16:11:02 UTC
With debugging,

$ grep -e "STARTING UP" -e "signal 11" MasterLog; tail -n1 MasterLog
11/25 13:49:56 ** condor_master (CONDOR_MASTER) STARTING UP
11/25 14:06:09 The STARTD (pid 15559) died due to signal 11 (Segmentation fault)
11/25 14:27:46 The STARTD (pid 12992) died due to signal 11 (Segmentation fault)
11/25 14:49:01 The STARTD (pid 28825) died due to signal 11 (Segmentation fault)
11/25 15:10:41 The STARTD (pid 11562) died due to signal 11 (Segmentation fault)
11/25 15:38:20 The STARTD (pid 27427) died due to signal 11 (Segmentation fault)
11/25 15:49:27 The STARTD (pid 29229) died due to signal 11 (Segmentation fault)
11/25 16:06:40 ProcAPI::getProcInfo() pid 16745 does not exist.
$ grep "JobStart --> UNDEFINED" StartLog
11/25 15:49:27 Classad debug: JobStart --> UNDEFINED

Comment 9 Matthew Farrellee 2009-11-25 16:38:53 UTC
Proposed workaround:

$ grep ActivationTimer ~condor/condor_config.overrides
ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (CurrentTime - JobStart), 0)
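
One way to apply the workaround (a sketch; it assumes ~condor/condor_config.overrides is already read via LOCAL_CONFIG_FILE):

$ echo "ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (CurrentTime - JobStart), 0)" >> ~condor/condor_config.overrides
$ condor_reconfig    # push the new expression to the running daemons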

Comment 10 Matthew Farrellee 2009-11-25 16:46:47 UTC
This crash only happened when the evaluation of WANT_SUSPEND reached JobStart, which was guarded by the KeyboardIdle and CpuBusyTime clauses.

The crash can be easily reproduced with: WANT_SUSPEND = (CurrentTime - JobStart) < 90

JobStart is defined only while a claim in the startd has an active job. It is not always present, which makes it a dangerous attribute for the default policy.
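
For context, ClassAd boolean operators use three-valued logic, which is why the guards hid the problem most of the time (a summary added for reference, not part of the original comment):

  FALSE && UNDEFINED --> FALSE        (JobStart is never consulted)
  TRUE  && UNDEFINED --> UNDEFINED    (UNDEFINED escapes to the top level)
  TRUE  || UNDEFINED --> TRUE
  FALSE || UNDEFINED --> UNDEFINED

With the values from comment #7 (KeyboardIdle = 66, CpuBusyTime = 318, JobStart undefined), the suspend clause reduces to FALSE || (TRUE && UNDEFINED) = UNDEFINED, which is exactly what EvalBool cannot convert to a bool.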

Comment 11 Matthew Farrellee 2009-11-25 17:53:27 UTC
Two actions on this:

1) Update the assertion to differentiate between not present and evaluating to UNDEFINED
2) Update the default configuration to protect access to JobStart

Comment 12 Matthew Farrellee 2009-11-25 18:01:53 UTC
This bug has likely been around for a very long time: ActivationTimer showed up in 1998 (git show 3cef52db), and the conditional setting of ATTR_JOB_START has been around since at least 2003 (see the V6_5-branch).

Comment 13 Matthew Farrellee 2009-11-25 18:09:03 UTC
Upstream ticket: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1001

Comment 14 Matthew Farrellee 2009-12-18 20:59:45 UTC
commit 745a047243d94d23ac286fd604c57542f3f3cd15
Author: Matthew Farrellee <matt@>
Date:   Fri Dec 18 15:50:16 2009 -0500

    Protect ActivationTimer from going to UNDEFINED (#1001)

diff --git a/src/condor_examples/condor_config.generic b/src/condor_examples/condor_config.generic
index 4a49eae..3d93b5a 100644
--- a/src/condor_examples/condor_config.generic
+++ b/src/condor_examples/condor_config.generic
@@ -622,7 +622,7 @@ MINUTE    = 60
 HOUR    = (60 * $(MINUTE))
 StateTimer = (CurrentTime - EnteredCurrentState)
 ActivityTimer = (CurrentTime - EnteredCurrentActivity)
-ActivationTimer = (CurrentTime - JobStart)
+ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (CurrentTime - JobStart), 0)
 LastCkpt   = (CurrentTime - LastPeriodicCheckpoint)
 
 ##  The JobUniverse attribute is just an int.  These macros can be
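
The guard uses the ClassAd meta-comparison operator =!= ("is not"), which, unlike !=, never evaluates to UNDEFINED (a reference note, not part of the commit):

  JobStart != UNDEFINED    --> UNDEFINED when JobStart is unset
  JobStart =!= UNDEFINED   --> FALSE when JobStart is unset, TRUE when it is defined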

Comment 15 Matthew Farrellee 2009-12-18 21:00:00 UTC
commit 6d4b6e5119261851e17996e5121c42b7c3a045fe
Author: Matthew Farrellee <matt@>
Date:   Fri Dec 18 15:55:12 2009 -0500

    WANT_SUSPEND that is UNDEFINED, either by being absent or evaluating to UNDEFINED, now means FALSE. It used to mean EXCEPT! (#1001)

diff --git a/src/condor_startd.V6/Resource.cpp b/src/condor_startd.V6/Resource.cpp
index 6a23889..76da635 100644
--- a/src/condor_startd.V6/Resource.cpp
+++ b/src/condor_startd.V6/Resource.cpp
@@ -1255,10 +1255,8 @@ Resource::wants_suspend( void )
      if( r_classad->EvalBool( "WANT_SUSPEND",
                           r_cur->ad(),
                           want_suspend ) == 0) {
-           // This should never happen, since we already check
-           // when we're constructing the internal config classad
-           // if we've got this defined. -Derek Wright 4/12/00
-        EXCEPT( "Can't find WANT_SUSPEND in internal ClassAd" );
+           // UNDEFINED means FALSE for WANT_SUSPEND
+        want_suspend = false;
      }
   }
   return want_suspend;
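
With this change, a WANT_SUSPEND that is absent, or that cannot be evaluated to a boolean, simply disables suspension instead of killing the startd. For illustration (a hedged example, not from the commit):

  # Now behaves like WANT_SUSPEND = FALSE while JobStart is unset,
  # instead of tripping the old EXCEPT:
  WANT_SUSPEND = (CurrentTime - JobStart) < 90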

Comment 16 Matthew Farrellee 2010-01-04 18:22:11 UTC
Fixed in 7.4.2-0.1

Comment 17 Martin Kudlej 2010-03-19 15:10:54 UTC
Tested on RHEL 5.5 and 4.8, on x86_64 and i386, with condor-7.4.3-0.5.
I submitted 100000 simple jobs from the description, and another 100000 simple jobs with "WANT_SUSPEND = (CurrentTime - JobStart) < 90" in the configuration.
I didn't see any of the errors or coredumps from the description.

-->VERIFIED

Comment 18 Florian Nadge 2010-10-08 10:19:30 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, if the user submitted around 10000 simple jobs "sleep 3", condor_startd and condor_shadow failed. With this update, condor_startd and condor_shadow work as expected, even with large numbers of jobs.

Comment 19 Florian Nadge 2010-10-08 11:36:14 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, if the user submitted around 10000 simple jobs "sleep 3", condor_startd and condor_shadow failed. With this update, condor_startd and condor_shadow work as expected, even with large numbers of jobs.
+Previously, condor_startd and condor_shadow failed because the default WANT_SUSPEND policy could evaluate to UNDEFINED. With this update, condor_startd and condor_shadow work as expected, even if WANT_SUSPEND evaluates to UNDEFINED.

Comment 20 Matthew Farrellee 2010-10-08 11:39:36 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,4 @@
-Previously, condor_startd and condor_shadow failed because the default WANT_SUSPEND policy could evaluate to UNDEFINED. With this update, condor_startd and condor_shadow work as expected, even if WANT_SUSPEND evaluates to UNDEFINED.
+C: The default WANT_SUSPEND policy included JobStart, which in some cases may be UNDEFINED.
+C: The WANT_SUSPEND policy expression would evaluate to UNDEFINED when JobStart was not defined. A WANT_SUSPEND of UNDEFINED was considered an error, and the Startd would exit.
+F: Worked around by removing JobStart from WANT_SUSPEND; fixed by making a WANT_SUSPEND that evaluates to UNDEFINED equivalent to FALSE.
+R: Startd now treats a WANT_SUSPEND that evaluates to UNDEFINED as if it evaluated to FALSE.

Comment 21 Martin Prpič 2010-10-10 11:23:58 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1 @@
-C: The default WANT_SUSPEND policy included JobStart, which in some cases may be UNDEFINED.
-C: The WANT_SUSPEND policy expression would evaluate to UNDEFINED when JobStart was not defined. A WANT_SUSPEND of UNDEFINED was considered an error, and the Startd would exit.
-F: Worked around by removing JobStart from WANT_SUSPEND; fixed by making a WANT_SUSPEND that evaluates to UNDEFINED equivalent to FALSE.
-R: Startd now treats a WANT_SUSPEND that evaluates to UNDEFINED as if it evaluated to FALSE.
+The default 'WANT_SUSPEND' policy included the JobStart attribute, which in some cases was 'UNDEFINED'. A 'WANT_SUSPEND' with the value 'UNDEFINED' was considered an error and startd would exit. With this update, the JobStart attribute was removed from 'WANT_SUSPEND'. Startd now treats a 'WANT_SUSPEND' that evaluates to 'UNDEFINED' as if it evaluated to 'FALSE'.

Comment 23 errata-xmlrpc 2010-10-14 16:12:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html