720507 – Schedd crash while asynchronously negotiating or claiming

Summary: Schedd crash while asynchronously negotiating or claiming

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	2.0.1
Target Release:	---
Assignee:	Will Benton
QA Contact:	Lubos Trilety
Docs Contact:
URL:	https://condor-wiki.cs.wisc.edu/index...
Whiteboard:
Duplicates (1):	756135
Depends On:
Blocks:	723887

Reported:	2011-07-11 21:18 UTC by Will Benton
Modified:	2011-12-08 17:22 UTC
CC List:	7 users
Fixed In Version:	condor-7.6.3-0.2
Doc Type:	Bug Fix
Doc Text:	C: In previous versions of Condor, there was a race condition in the Condor schedd. C: As a consequence of this race condition, it was possible for the schedd to access stale job ClassAds in certain situations when matchmaking was proceeding slowly. This would result in a crash. F: Due to a fix to the Condor schedd, this race condition no longer exists. R: The schedd will no longer try to access stale job ClassAds.
Clone Of:
Environment:
Last Closed:	2011-09-07 16:42:25 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:1249	0	normal	SHIPPED_LIVE	Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update	2011-09-07 16:40:45 UTC

Description Will Benton 2011-07-11 21:18:53 UTC

Description of problem:

(This brief summary is copied from https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2294)

These days the schedd performs many job-related activities asynchronously, such as claiming startds and negotiating with the central manager. During claiming or negotiating, the schedd returns to daemon core, so what happens if the job being processed is removed while these asynchronous activities are still in progress? To deal with this possibility, the schedd typically makes a copy of the job ClassAd. Unfortunately, the ClassAd copy constructor and assignment operator do not make a deep copy of chained ClassAds. As a result, if the job is removed from the job queue, any copies of the job ad lying about in the various objects that handle asynchronous operations are left with an invalid pointer to the deleted cluster ad.
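
To make the failure mode concrete, here is a minimal sketch of a chained ad whose default copy shares the parent pointer, next to a "flattening" copy that does not. The names `Ad`, `lookup`, `flatten_copy` and `owner_after_removal` are hypothetical and chosen for illustration only; this is not the Condor ClassAd API, and it is not necessarily how the upstream commits fix the problem.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a chained ad; NOT the real Condor ClassAd class.
struct Ad {
    std::map<std::string, std::string> attrs;
    const Ad* parent = nullptr;  // chained (cluster) ad, not owned by this ad

    // Look in this ad first, then fall back to the chained parent.
    const std::string* lookup(const std::string& name) const {
        auto it = attrs.find(name);
        if (it != attrs.end()) return &it->second;
        return parent ? parent->lookup(name) : nullptr;
    }
};

// Deep copy: fold the parent's attributes into the copy and drop the chain,
// so the result no longer depends on the lifetime of the cluster ad.
Ad flatten_copy(const Ad& src) {
    Ad out;
    if (src.parent) out = flatten_copy(*src.parent);
    for (const auto& kv : src.attrs) out.attrs[kv.first] = kv.second;
    out.parent = nullptr;
    return out;
}

// Simulates "job removed while an asynchronous operation holds a copy".
std::string owner_after_removal() {
    auto cluster = std::make_unique<Ad>();
    cluster->attrs["Owner"] = "alice";

    Ad job;
    job.attrs["ProcId"] = "0";
    job.parent = cluster.get();

    Ad shallow = job;             // default copy: parent pointer is shared
    Ad deep = flatten_copy(job);  // self-contained copy

    assert(shallow.parent == cluster.get());
    assert(deep.parent == nullptr);

    cluster.reset();  // the cluster ad is deleted along with the job

    // shallow.parent now dangles; shallow.lookup("Owner") would be undefined
    // behavior, which is the kind of access the crash report describes.
    const std::string* owner = deep.lookup("Owner");
    return owner ? *owner : std::string();
}
```

Calling `owner_after_removal()` only touches the flattened copy after the cluster ad is gone; the shallow copy is deliberately left unused at that point because dereferencing its parent would be invalid.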

Version-Release number of selected component (if applicable):

Upstream condor 7.6.0 and 7.6.1

How reproducible:

See linked external bug for a negotiator patch that will enable reproducing this behavior with a config knob.

Comment 2 Will Benton 2011-07-13 19:14:53 UTC

This is upstream in commits c6f0638c and e4a4587f.

Comment 5 Will Benton 2011-07-25 22:47:51 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C:  In previous versions of Condor, there was a race condition in the Condor schedd.
C:  As a consequence of this race condition, it was possible for the schedd to access stale job ClassAds in certain situations when matchmaking was proceeding slowly.  This would result in a crash.
F:  Due to a fix to the Condor schedd, this race condition no longer exists.
R:  The schedd will no longer try to access stale job ClassAds.

Comment 8 Lubos Trilety 2011-08-08 14:24:48 UTC

Successfully reproduced on:
$CondorVersion: 7.6.1 Aug 05 2011 BuildID: RH-7.6.1-0.10 $
$CondorPlatform: X86_64-RedHat_5.6 $

08/08/11 13:55:54 (pid:614) SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
08/08/11 13:55:54 (pid:614) Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 22)
Stack dump for process 614 at timestamp 1312804554 (20 frames)
condor_schedd(dprintf_dump_stack+0x56)[0x597bf6]
condor_schedd[0x59a852]
/lib64/libpthread.so.0[0x354c60eb10]
/lib64/libc.so.6(__strcasecmp+0x20)[0x354ba7c100]
condor_schedd(_Z14_putOldClassAdP6StreamRN7classad7ClassAdEbbP10StringList+0x5bd)[0x596a9d]
condor_schedd(_Z13putOldClassAdP6StreamRN7classad7ClassAdEbP10StringList+0x14)[0x596e74]
condor_schedd(_ZN14compat_classad7ClassAd3putER6Stream+0x1b)[0x58a89b]
condor_schedd(_ZN15ScheddNegotiate11sendJobInfoEP4Sock+0x93)[0x4d3783]
condor_schedd(_ZN15ScheddNegotiate15messageReceivedEP11DCMessengerP4Sock+0x1c8)[0x4d3fa8]
condor_schedd(_ZN5DCMsg19callMessageReceivedEP11DCMessengerP4Sock+0x36)[0x50ce06]
condor_schedd(_ZN11DCMessenger7readMsgE18classy_counted_ptrI5DCMsgEP4Sock+0xe2)[0x50cfb2]
condor_schedd(_ZN11DCMessenger18receiveMsgCallbackEP6Stream+0xef)[0x51031f]
condor_schedd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x498)[0x4fdf48]
condor_schedd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1a)[0x4fe3ea]
condor_schedd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x38)[0x586178]
condor_schedd(_ZN10DaemonCore17CallSocketHandlerERib+0x149)[0x4f6ac9]
condor_schedd(_ZN10DaemonCore6DriverEv+0x1bb5)[0x4f8d35]
condor_schedd(main+0xe57)[0x4ec317]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_schedd[0x486419]

Comment 10 Lubos Trilety 2011-08-09 14:40:35 UTC

Tested on:
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $


No crash of schedd.

>>> VERIFIED

Comment 11 Matthew Farrellee 2011-08-09 17:44:34 UTC

The timeout is presumed to be https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2367

Comment 12 errata-xmlrpc 2011-09-07 16:42:25 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html

Comment 13 Timothy St. Clair 2011-12-08 17:22:42 UTC

*** Bug 756135 has been marked as a duplicate of this bug. ***
