608104 – Sporadic crash of shadow

Summary: Sporadic crash of shadow

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	2.0
Target Release:	---
Assignee:	Matthew Farrellee
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-06-25 16:47 UTC by Luigi Toscano
Modified:	2011-01-07 18:27 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-01-07 18:27:22 UTC
Target Upstream Version:
Embargoed:

Attachments
Excerpts of logs related to the job which made shadow crash (11.09 KB, text/plain) 2010-06-25 16:47 UTC, Luigi Toscano

Description Luigi Toscano 2010-06-25 16:47:17 UTC

Created attachment 426935
Excerpts of logs related to the job which made shadow crash

Description of problem:
While running a killsig stress test, I found that the shadow occasionally crashes:

06/23 00:43:23 (1935.0) (20663): Can no longer talk to condor_starter <10.16.66.140:44194>
06/23 00:43:23 (1935.1) (20666): Buf::write(): condor_write() failed
06/23 00:43:23 (1935.1) (20666): ERROR "Assertion ERROR on (result)" at line 325 in file NTreceivers.cpp
06/23 00:43:23 (1935.0) (20663): Trying to reconnect to disconnected job
06/23 00:43:23 (1935.0) (20663): LastJobLeaseRenewal: 1277268175 Wed Jun 23 00:42:55 2010
06/23 00:43:23 (1935.0) (20663): JobLeaseDuration: 1200 seconds
06/23 00:43:23 (1935.0) (20663): JobLeaseDuration remaining: 1172
06/23 00:43:23 (1935.0) (20663): Attempting to locate disconnected starter
06/23 00:43:23 (1935.0) (20663): locateStarter(): ClaimId (<10.16.66.140:44194>#1277265708#1#478b946b7b3c81d711671224105d7594ca463289) and GlobalJobId ( hp-ml150g6-01.rhts.bos.redhat.com#1935.0#1277265799 ) not found
06/23 00:43:23 (1935.0) (20663): Reconnect FAILED: Job not found at execution machine
06/23 00:43:23 (1935.0) (20663): **** condor_shadow (condor_SHADOW) pid 20663 EXITING WITH STATUS 107
Stack dump for process 20666 at timestamp 1277268203 (15 frames)
condor_shadow(dprintf_dump_stack+0x3f)[0x80ddbef]
condor_shadow[0x80ddf4a]
/lib/tls/libpthread.so.0[0xd98a98]
/lib/tls/libc.so.6(abort+0xe9)[0xb57289]
condor_shadow(_EXCEPT_+0x6d)[0x80dc61d]
condor_shadow(_Z17do_REMOTE_syscallv+0x15c6)[0x80b4e06]
condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x1e)[0x80ab67e]
condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x400)[0x80ca330]
condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x20)[0x80ca7c0]
condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x27)[0x8149467]
condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x199)[0x80beb59]
condor_shadow(_ZN10DaemonCore6DriverEv+0x1ddd)[0x80c300d]
condor_shadow(main+0x133e)[0x80d7abe]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0xb42df3]
condor_shadow(__gxx_personality_v0+0x14d)[0x80a2a41]
06/23 00:43:23 (1935.2) (20669): **** condor_shadow (condor_SHADOW) pid 20669 EXITING WITH STATUS 102
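
The lease figures in the excerpt above are mutually consistent. A minimal sketch of the arithmetic, assuming the remaining lease is simply the lease duration minus the time elapsed since the last renewal (this formula is inferred from the logged values, not taken from the Condor source; `lease_remaining` is a hypothetical helper name):

```python
def lease_remaining(now: int, last_renewal: int, lease_duration: int) -> int:
    """Seconds left on the job lease under the assumed formula."""
    return lease_duration - (now - last_renewal)


last_renewal = 1277268175  # LastJobLeaseRenewal from the log
lease_duration = 1200      # JobLeaseDuration from the log, in seconds
now = 1277268203           # timestamp printed with the stack dump

remaining = lease_remaining(now, last_renewal, lease_duration)
print(remaining)  # 1172, matching "JobLeaseDuration remaining: 1172"
```

So the reconnect attempt failed with most of the lease still available; the lease had not expired.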

Version-Release number of selected component (if applicable):
condor-7.4.3-0.20, RHEL4.8/5.5, i386/x86_64

How reproducible:

Steps to Reproduce:
0) Submit many jobs (O(10000)) with killsig enabled.
1) Wait for the jobs to start running (usually one per slot).
2) Run condor_rm on all the running jobs.
3) Repeat the previous steps; after the 1st (at most the 3rd) attempt, StartLog
will contain something like:

Comment 1 Matthew Farrellee 2011-01-07 18:27:22 UTC

The shadow (PID = 20666) is failing on a low-level write.

06/23 00:43:23 (1935.1) (20666): Buf::write(): condor_write() failed
06/23 00:43:23 (1935.1) (20666): ERROR "Assertion ERROR on (result)" at line 325 in file NTreceivers.cpp

This could happen if the remote side was killed.

It is acceptable for the shadow to crash like this. The Schedd will notice and take corrective action, such as starting a new shadow.
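
The general failure mode, a write that fails because the peer process has gone away, can be reproduced outside Condor. The following is only an illustrative sketch using a local socket pair; it does not use any Condor code, and the names `shadow_end`, `starter_end`, and `write_after_peer_exit` are hypothetical stand-ins:

```python
import socket


def write_after_peer_exit() -> str:
    """Write to a socket whose peer is already closed; return the error name."""
    shadow_end, starter_end = socket.socketpair()
    starter_end.close()  # stands in for the remote side being killed
    try:
        shadow_end.sendall(b"syscall reply")
    except OSError as exc:
        # The write fails once the peer is gone.
        return type(exc).__name__
    finally:
        shadow_end.close()
    return "no error"


print(write_after_peer_exit())
```

A caller that asserts on the result of such a write, as the shadow does at line 325 of NTreceivers.cpp, will abort when the peer disappears mid-conversation.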
