Bug 608104 - Sporadic crash of shadow
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: 2.0
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: MRG Quality Engineering
Reported: 2010-06-25 12:47 EDT by Luigi Toscano
Modified: 2011-01-07 13:27 EST
Doc Type: Bug Fix
Last Closed: 2011-01-07 13:27:22 EST

Attachments
Excerpts of logs related to the job which made shadow crash (11.09 KB, text/plain)
2010-06-25 12:47 EDT, Luigi Toscano

Description Luigi Toscano 2010-06-25 12:47:17 EDT
Created attachment 426935
Excerpts of logs related to the job which made shadow crash

Description of problem:
While running a killsig stress test, I found that the shadow occasionally crashes:

06/23 00:43:23 (1935.0) (20663): Can no longer talk to condor_starter <10.16.66.140:44194>
06/23 00:43:23 (1935.1) (20666): Buf::write(): condor_write() failed
06/23 00:43:23 (1935.1) (20666): ERROR "Assertion ERROR on (result)" at line 325 in file NTreceivers.cpp
06/23 00:43:23 (1935.0) (20663): Trying to reconnect to disconnected job
06/23 00:43:23 (1935.0) (20663): LastJobLeaseRenewal: 1277268175 Wed Jun 23 00:42:55 2010
06/23 00:43:23 (1935.0) (20663): JobLeaseDuration: 1200 seconds
06/23 00:43:23 (1935.0) (20663): JobLeaseDuration remaining: 1172
06/23 00:43:23 (1935.0) (20663): Attempting to locate disconnected starter
06/23 00:43:23 (1935.0) (20663): locateStarter(): ClaimId (<10.16.66.140:44194>#1277265708#1#478b946b7b3c81d711671224105d7594ca463289) and GlobalJobId ( hp-ml150g6-01.rhts.bos.redhat.com#1935.0#1277265799 ) not found
06/23 00:43:23 (1935.0) (20663): Reconnect FAILED: Job not found at execution machine
06/23 00:43:23 (1935.0) (20663): **** condor_shadow (condor_SHADOW) pid 20663 EXITING WITH STATUS 107
Stack dump for process 20666 at timestamp 1277268203 (15 frames)
condor_shadow(dprintf_dump_stack+0x3f)[0x80ddbef]
condor_shadow[0x80ddf4a]
/lib/tls/libpthread.so.0[0xd98a98]
/lib/tls/libc.so.6(abort+0xe9)[0xb57289]
condor_shadow(_EXCEPT_+0x6d)[0x80dc61d]
condor_shadow(_Z17do_REMOTE_syscallv+0x15c6)[0x80b4e06]
condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x1e)[0x80ab67e]
condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x400)[0x80ca330]
condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x20)[0x80ca7c0]
condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x27)[0x8149467]
condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x199)[0x80beb59]
condor_shadow(_ZN10DaemonCore6DriverEv+0x1ddd)[0x80c300d]
condor_shadow(main+0x133e)[0x80d7abe]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0xb42df3]
condor_shadow(__gxx_personality_v0+0x14d)[0x80a2a41]
06/23 00:43:23 (1935.2) (20669): **** condor_shadow (condor_SHADOW) pid 20669 EXITING WITH STATUS 102
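
The excerpt above interleaves lines from several shadow processes (PIDs 20663, 20666 and 20669), so it helps to separate them before reading. The following is a minimal sketch, assuming the line layout seen in this excerpt (date, time, job ID in parentheses, PID in parentheses, message). The names `LINE_RE` and `group_by_pid` are hypothetical helpers and not part of Condor.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt in this report:
#   MM/DD HH:MM:SS (cluster.proc) (pid): message
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)

def group_by_pid(lines):
    """Group (job id, message) pairs by shadow PID; unmatched lines are skipped."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            groups[int(match.group("pid"))].append(
                (match.group("job"), match.group("msg"))
            )
    return dict(groups)

# A few lines copied from the excerpt above.
sample = [
    "06/23 00:43:23 (1935.0) (20663): Can no longer talk to condor_starter <10.16.66.140:44194>",
    "06/23 00:43:23 (1935.1) (20666): Buf::write(): condor_write() failed",
    '06/23 00:43:23 (1935.1) (20666): ERROR "Assertion ERROR on (result)" at line 325 in file NTreceivers.cpp',
    "06/23 00:43:23 (1935.0) (20663): Trying to reconnect to disconnected job",
]

groups = group_by_pid(sample)
for pid, entries in sorted(groups.items()):
    print(pid, [msg for _job, msg in entries])
```

Grouped this way, the failed write and the assertion both belong to PID 20666 (job 1935.1), which is also the process named in the stack dump, while PID 20663 (job 1935.0) goes through a separate reconnect attempt.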

Version-Release number of selected component (if applicable):
condor-7.4.3-0.20, RHEL4.8/5.5, i386/x86_64

How reproducible:

Steps to Reproduce:
0) Submit many jobs (O(10000)) with killsig enabled.
1) Wait for the jobs to start running (usually one per slot).
2) condor_rm all the running jobs.
3) Repeat the previous steps; after the 1st (at most the 3rd) attempt, StartLog
will contain something like:
Comment 1 Matthew Farrellee 2011-01-07 13:27:22 EST
The shadow (PID = 20666) is failing on some low level writes.

06/23 00:43:23 (1935.1) (20666): Buf::write(): condor_write() failed
06/23 00:43:23 (1935.1) (20666): ERROR "Assertion ERROR on (result)" at line 325 in file NTreceivers.cpp

This could happen if the remote side was killed.
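
The general failure mode, a write failing because the peer has gone away, can be illustrated outside Condor. The following is a minimal sketch, assuming a local `socket.socketpair()` as a stand-in for the shadow-to-starter connection; `write_to_peer` is a hypothetical helper, and none of this is Condor code.

```python
import socket

def write_to_peer(sock, payload):
    """Return True if the write succeeded, False if the OS reported an error."""
    try:
        sock.sendall(payload)
        return True
    except OSError:
        # BrokenPipeError and ConnectionResetError are both OSError subclasses.
        return False

# 'local' plays the writing side, 'remote' the side that gets killed.
local, remote = socket.socketpair()

ok_before = write_to_peer(local, b"ping")  # peer still open
remote.close()                             # simulate the remote side going away
ok_after = write_to_peer(local, b"ping")   # write to a peer that no longer exists

local.close()
print("write before close:", ok_before)
print("write after close:", ok_after)
```

The sketch reports the failed write to its caller, whereas the shadow in the log above asserts on the result and aborts.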

It is acceptable for the shadow to crash like this. The Schedd will notice and take corrective action, such as starting a new shadow.
