Created attachment 426935 [details]
Excerpts of logs related to the job which made shadow crash

Description of problem:
While running a killsig stress test, I found that the shadow occasionally crashes:

06/23 00:43:23 (1935.0) (20663): Can no longer talk to condor_starter <10.16.66.140:44194>
06/23 00:43:23 (1935.1) (20666): Buf::write(): condor_write() failed
06/23 00:43:23 (1935.1) (20666): ERROR "Assertion ERROR on (result)" at line 325 in file NTreceivers.cpp
06/23 00:43:23 (1935.0) (20663): Trying to reconnect to disconnected job
06/23 00:43:23 (1935.0) (20663): LastJobLeaseRenewal: 1277268175 Wed Jun 23 00:42:55 2010
06/23 00:43:23 (1935.0) (20663): JobLeaseDuration: 1200 seconds
06/23 00:43:23 (1935.0) (20663): JobLeaseDuration remaining: 1172
06/23 00:43:23 (1935.0) (20663): Attempting to locate disconnected starter
06/23 00:43:23 (1935.0) (20663): locateStarter(): ClaimId (<10.16.66.140:44194>#1277265708#1#478b946b7b3c81d711671224105d7594ca463289) and GlobalJobId ( hp-ml150g6-01.rhts.bos.redhat.com#1935.0#1277265799 ) not found
06/23 00:43:23 (1935.0) (20663): Reconnect FAILED: Job not found at execution machine
06/23 00:43:23 (1935.0) (20663): **** condor_shadow (condor_SHADOW) pid 20663 EXITING WITH STATUS 107
Stack dump for process 20666 at timestamp 1277268203 (15 frames)
condor_shadow(dprintf_dump_stack+0x3f)[0x80ddbef]
condor_shadow[0x80ddf4a]
/lib/tls/libpthread.so.0[0xd98a98]
/lib/tls/libc.so.6(abort+0xe9)[0xb57289]
condor_shadow(_EXCEPT_+0x6d)[0x80dc61d]
condor_shadow(_Z17do_REMOTE_syscallv+0x15c6)[0x80b4e06]
condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x1e)[0x80ab67e]
condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x400)[0x80ca330]
condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x20)[0x80ca7c0]
condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x27)[0x8149467]
condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x199)[0x80beb59]
condor_shadow(_ZN10DaemonCore6DriverEv+0x1ddd)[0x80c300d]
condor_shadow(main+0x133e)[0x80d7abe]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0xb42df3]
condor_shadow(__gxx_personality_v0+0x14d)[0x80a2a41]
06/23 00:43:23 (1935.2) (20669): **** condor_shadow (condor_SHADOW) pid 20669 EXITING WITH STATUS 102

Version-Release number of selected component (if applicable):
condor-7.4.3-0.20, RHEL4.8/5.5, i386/x86_64

How reproducible:

Steps to Reproduce:
0) Submit many jobs (O(10000)) with killsig enabled.
1) Wait for the jobs to start running (usually one per slot).
2) condor_rm all the running jobs.
3) Repeat the previous steps; after the 1st (at most the 3rd) attempt, StartLog will contain something like:
The shadow (PID = 20666) is failing on some low-level writes.

06/23 00:43:23 (1935.1) (20666): Buf::write(): condor_write() failed
06/23 00:43:23 (1935.1) (20666): ERROR "Assertion ERROR on (result)" at line 325 in file NTreceivers.cpp

This could happen if the remote side was killed. It is acceptable for the shadow to crash like this. The Schedd will notice and take corrective action, such as starting a new shadow.
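
For illustration only, here is a minimal Python sketch of the general mechanism described above: a write to a stream socket fails once the peer has gone away. It is not Condor code. The function name write_after_peer_exit and the use of a local socket pair as stand-ins for the shadow and starter are assumptions made for this example, and it assumes a Unix-like system.

```python
import socket


def write_after_peer_exit() -> str:
    """Return the name of the exception raised when writing to a closed peer.

    Hypothetical illustration: 'local' plays the role of the shadow and
    'remote' the role of the starter. Neither is real Condor code.
    """
    local, remote = socket.socketpair()  # connected pair of stream sockets
    remote.close()                       # the peer goes away (e.g. it was killed)
    try:
        local.sendall(b"syscall reply")  # the write now fails at the OS level
    except OSError as exc:
        return type(exc).__name__
    finally:
        local.close()
    return "no error"


if __name__ == "__main__":
    print(write_after_peer_exit())
```

A caller that asserts on the result of such a write, instead of handling the error, would abort at this point, which matches the assertion failure reported in NTreceivers.cpp.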