Bug 720507 - Schedd crash while asynchronously negotiating or claiming
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.0.1
Target Release: ---
Assigned To: Will Benton
QA Contact: Lubos Trilety
URL: https://condor-wiki.cs.wisc.edu/index...
Duplicates: 756135
Depends On:
Blocks: 723887
Reported: 2011-07-11 17:18 EDT by Will Benton
Modified: 2011-12-08 12:22 EST
CC: 7 users

See Also:
Fixed In Version: condor-7.6.3-0.2
Doc Type: Bug Fix
Doc Text:
C: In previous versions of Condor, there was a race condition in the Condor schedd.
C: As a consequence of this race condition, it was possible for the schedd to access stale job ClassAds in certain situations when matchmaking was proceeding slowly. This would result in a crash.
F: Due to a fix to the Condor schedd, this race condition no longer exists.
R: The schedd will no longer try to access stale job ClassAds.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-09-07 12:42:25 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: (none)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1249 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update 2011-09-07 12:40:45 EDT

Description Will Benton 2011-07-11 17:18:53 EDT
Description of problem:

(This brief summary is copied from https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2294)

These days the schedd performs a host of job-related activities asynchronously, such as claiming startds and negotiating with the central manager. During the process of claiming or negotiating, the schedd returns to daemon core. What happens if the job being processed is removed while these async activities are still in progress? To deal with this possibility, the schedd typically makes a copy of the job classad. Unfortunately, the classad copy constructor and assignment operator do not perform a deep copy of chained classads. As a result, if the job is removed from the job queue, any copies of the job ad lying around in various objects that deal with asynchronous operations now hold an invalid pointer to the deleted cluster ad.

Version-Release number of selected component (if applicable):

Upstream condor 7.6.0 and 7.6.1

How reproducible:

See linked external bug for a negotiator patch that will enable reproducing this behavior with a config knob.
Comment 2 Will Benton 2011-07-13 15:14:53 EDT
This is upstream in commits c6f0638c and e4a4587f.
Comment 5 Will Benton 2011-07-25 18:47:51 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C:  In previous versions of Condor, there was a race condition in the Condor schedd.
C:  As a consequence of this race condition, it was possible for the schedd to access stale job ClassAds in certain situations when matchmaking was proceeding slowly.  This would result in a crash.
F:  Due to a fix to the Condor schedd, this race condition no longer exists.
R:  The schedd will no longer try to access stale job ClassAds.
Comment 8 Lubos Trilety 2011-08-08 10:24:48 EDT
Successfully reproduced on:
$CondorVersion: 7.6.1 Aug 05 2011 BuildID: RH-7.6.1-0.10 $
$CondorPlatform: X86_64-RedHat_5.6 $

08/08/11 13:55:54 (pid:614) SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
08/08/11 13:55:54 (pid:614) Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 22)
Stack dump for process 614 at timestamp 1312804554 (20 frames)
condor_schedd(dprintf_dump_stack+0x56)[0x597bf6]
condor_schedd[0x59a852]
/lib64/libpthread.so.0[0x354c60eb10]
/lib64/libc.so.6(__strcasecmp+0x20)[0x354ba7c100]
condor_schedd(_Z14_putOldClassAdP6StreamRN7classad7ClassAdEbbP10StringList+0x5bd)[0x596a9d]
condor_schedd(_Z13putOldClassAdP6StreamRN7classad7ClassAdEbP10StringList+0x14)[0x596e74]
condor_schedd(_ZN14compat_classad7ClassAd3putER6Stream+0x1b)[0x58a89b]
condor_schedd(_ZN15ScheddNegotiate11sendJobInfoEP4Sock+0x93)[0x4d3783]
condor_schedd(_ZN15ScheddNegotiate15messageReceivedEP11DCMessengerP4Sock+0x1c8)[0x4d3fa8]
condor_schedd(_ZN5DCMsg19callMessageReceivedEP11DCMessengerP4Sock+0x36)[0x50ce06]
condor_schedd(_ZN11DCMessenger7readMsgE18classy_counted_ptrI5DCMsgEP4Sock+0xe2)[0x50cfb2]
condor_schedd(_ZN11DCMessenger18receiveMsgCallbackEP6Stream+0xef)[0x51031f]
condor_schedd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x498)[0x4fdf48]
condor_schedd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1a)[0x4fe3ea]
condor_schedd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x38)[0x586178]
condor_schedd(_ZN10DaemonCore17CallSocketHandlerERib+0x149)[0x4f6ac9]
condor_schedd(_ZN10DaemonCore6DriverEv+0x1bb5)[0x4f8d35]
condor_schedd(main+0xe57)[0x4ec317]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_schedd[0x486419]
Comment 10 Lubos Trilety 2011-08-09 10:40:35 EDT
Tested on:
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $


No crash of schedd.

>>> VERIFIED
Comment 11 Matthew Farrellee 2011-08-09 13:44:34 EDT
The timeout is presumed to be https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2367
Comment 12 errata-xmlrpc 2011-09-07 12:42:25 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html
Comment 13 Timothy St. Clair 2011-12-08 12:22:42 EST
*** Bug 756135 has been marked as a duplicate of this bug. ***
