Description of problem:
(This brief summary is copied from https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2294)

These days the schedd performs many job-related activities asynchronously, such as claiming startds and negotiating with the central manager. During claiming or negotiation the schedd returns control to daemon core, so what happens if the job being processed is removed while these asynchronous activities are still in progress? To handle this possibility, the schedd typically makes a copy of the job classad. Unfortunately, the classad copy constructor and assignment operator do not perform a deep copy of chained classads. As a result, if the job is removed from the job queue, any copies of the job ad lying about in the various objects that handle asynchronous operations now hold an invalid pointer to the deleted cluster ad.

Version-Release number of selected component (if applicable):
Upstream Condor 7.6.0 and 7.6.1

How reproducible:
See the linked external bug for a negotiator patch that enables reproducing this behavior with a config knob.
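The shallow-copy hazard can be sketched with a hypothetical miniature ad type; the names here (MiniAd, chained_parent, flatten_copy) are illustrative stand-ins, not Condor's actual ClassAd API:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a chained classad: a job ("proc") ad keeps a
// raw pointer to its shared cluster ad and falls back to it on lookups.
struct MiniAd {
    std::map<std::string, std::string> attrs;
    const MiniAd* chained_parent = nullptr;  // not owned

    // The implicitly generated copy constructor copies only this pointer,
    // not the parent ad itself -- the shallow copy described above. If the
    // cluster ad is deleted, the copy is left with a dangling pointer.

    const std::string* lookup(const std::string& key) const {
        auto it = attrs.find(key);
        if (it != attrs.end()) return &it->second;
        if (chained_parent) return chained_parent->lookup(key);
        return nullptr;
    }
};

// One safe alternative: a "deep" copy that flattens the chain, so the
// copy no longer depends on the cluster ad's lifetime.
MiniAd flatten_copy(const MiniAd& ad) {
    MiniAd out;
    if (ad.chained_parent) out.attrs = ad.chained_parent->attrs;
    for (const auto& kv : ad.attrs) out.attrs[kv.first] = kv.second;  // proc overrides cluster
    out.chained_parent = nullptr;
    return out;
}
```

A copy made with `flatten_copy` remains valid after the cluster ad is removed from the queue, whereas a plain copy still points at the freed cluster ad.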
This is upstream in commits c6f0638c and e4a4587f.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: In previous versions of Condor, there was a race condition in the Condor schedd.
C: As a consequence of this race condition, the schedd could access stale job ClassAds when matchmaking proceeded slowly, resulting in a crash.
F: The Condor schedd has been fixed to eliminate this race condition.
R: The schedd no longer tries to access stale job ClassAds.
Successfully reproduced on:
$CondorVersion: 7.6.1 Aug 05 2011 BuildID: RH-7.6.1-0.10 $
$CondorPlatform: X86_64-RedHat_5.6 $

08/08/11 13:55:54 (pid:614) SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
08/08/11 13:55:54 (pid:614) Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 22)
Stack dump for process 614 at timestamp 1312804554 (20 frames)
condor_schedd(dprintf_dump_stack+0x56)[0x597bf6]
condor_schedd[0x59a852]
/lib64/libpthread.so.0[0x354c60eb10]
/lib64/libc.so.6(__strcasecmp+0x20)[0x354ba7c100]
condor_schedd(_Z14_putOldClassAdP6StreamRN7classad7ClassAdEbbP10StringList+0x5bd)[0x596a9d]
condor_schedd(_Z13putOldClassAdP6StreamRN7classad7ClassAdEbP10StringList+0x14)[0x596e74]
condor_schedd(_ZN14compat_classad7ClassAd3putER6Stream+0x1b)[0x58a89b]
condor_schedd(_ZN15ScheddNegotiate11sendJobInfoEP4Sock+0x93)[0x4d3783]
condor_schedd(_ZN15ScheddNegotiate15messageReceivedEP11DCMessengerP4Sock+0x1c8)[0x4d3fa8]
condor_schedd(_ZN5DCMsg19callMessageReceivedEP11DCMessengerP4Sock+0x36)[0x50ce06]
condor_schedd(_ZN11DCMessenger7readMsgE18classy_counted_ptrI5DCMsgEP4Sock+0xe2)[0x50cfb2]
condor_schedd(_ZN11DCMessenger18receiveMsgCallbackEP6Stream+0xef)[0x51031f]
condor_schedd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x498)[0x4fdf48]
condor_schedd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1a)[0x4fe3ea]
condor_schedd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x38)[0x586178]
condor_schedd(_ZN10DaemonCore17CallSocketHandlerERib+0x149)[0x4f6ac9]
condor_schedd(_ZN10DaemonCore6DriverEv+0x1bb5)[0x4f8d35]
condor_schedd(main+0xe57)[0x4ec317]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_schedd[0x486419]
Tested on:
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3 $ $CondorPlatform: I686-RedHat_5.7 $
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3 $ $CondorPlatform: X86_64-RedHat_5.7 $
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3.el6 $ $CondorPlatform: I686-RedHat_6.1 $
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3.el6 $ $CondorPlatform: X86_64-RedHat_6.1 $

No crash of the schedd.

>>> VERIFIED
The timeout issue is presumed to be tracked at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2367
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html
*** Bug 756135 has been marked as a duplicate of this bug. ***