Bug 720507
Summary: Schedd crash while asynchronously negotiating or claiming
Product: Red Hat Enterprise MRG
Reporter: Will Benton <willb>
Component: condor
Assignee: Will Benton <willb>
Status: CLOSED ERRATA
QA Contact: Lubos Trilety <ltrilety>
Severity: high
Priority: high
Version: 2.0
CC: iboverma, jneedle, jthomas, ltoscano, ltrilety, matt, tstclair
Target Milestone: 2.0.1
Target Release: ---
Hardware: Unspecified
OS: Unspecified
URL: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2294
Fixed In Version: condor-7.6.3-0.2
Doc Type: Bug Fix
Doc Text:
C: In previous versions of Condor, there was a race condition in the Condor schedd.
C: As a consequence of this race condition, it was possible for the schedd to access stale job ClassAds in certain situations when matchmaking was proceeding slowly. This would result in a crash.
F: Due to a fix to the Condor schedd, this race condition no longer exists.
R: The schedd will no longer try to access stale job ClassAds.
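
The failure mode described above (a job ad captured before a slow negotiation step and used after the job has left the queue) can be illustrated with a minimal sketch. This is not the upstream patch from the commits referenced in this report; `JobAd`, `JobQueue`, and `send_job_info` are hypothetical names chosen for illustration, and the sketch only shows the general defensive pattern of re-resolving a job by its ID when a delayed callback finally runs.

```python
from dataclasses import dataclass, field


@dataclass
class JobAd:
    """Hypothetical stand-in for a job ClassAd."""
    cluster: int
    proc: int
    attrs: dict = field(default_factory=dict)


class JobQueue:
    """Hypothetical stand-in for the schedd's in-memory job queue."""

    def __init__(self):
        self._ads = {}

    def add(self, ad):
        self._ads[(ad.cluster, ad.proc)] = ad

    def remove(self, cluster, proc):
        # The job may be removed at any time while negotiation is pending.
        self._ads.pop((cluster, proc), None)

    def lookup(self, cluster, proc):
        return self._ads.get((cluster, proc))


def send_job_info(queue, job_id):
    """Runs later, when the negotiator's message arrives.

    Instead of trusting an ad reference captured before the wait,
    look the job up again by ID and skip it if it no longer exists.
    """
    ad = queue.lookup(*job_id)
    if ad is None:
        return None  # job left the queue while we were waiting
    return dict(ad.attrs)  # copy of the attributes that would be sent
```

In this sketch a job that disappears during the wait is simply skipped; how the real schedd handles that case is not stated in this report.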
Last Closed: 2011-09-07 16:42:25 UTC
Bug Blocks: 723887
Description
Will Benton
2011-07-11 21:18:53 UTC
This is upstream in commits c6f0638c and e4a4587f.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: In previous versions of Condor, there was a race condition in the Condor schedd.
C: As a consequence of this race condition, it was possible for the schedd to access stale job ClassAds in certain situations when matchmaking was proceeding slowly. This would result in a crash.
F: Due to a fix to the Condor schedd, this race condition no longer exists.
R: The schedd will no longer try to access stale job ClassAds.

Successfully reproduced on:
$CondorVersion: 7.6.1 Aug 05 2011 BuildID: RH-7.6.1-0.10 $
$CondorPlatform: X86_64-RedHat_5.6 $

08/08/11 13:55:54 (pid:614) SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
08/08/11 13:55:54 (pid:614) Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 22)
Stack dump for process 614 at timestamp 1312804554 (20 frames)
condor_schedd(dprintf_dump_stack+0x56)[0x597bf6]
condor_schedd[0x59a852]
/lib64/libpthread.so.0[0x354c60eb10]
/lib64/libc.so.6(__strcasecmp+0x20)[0x354ba7c100]
condor_schedd(_Z14_putOldClassAdP6StreamRN7classad7ClassAdEbbP10StringList+0x5bd)[0x596a9d]
condor_schedd(_Z13putOldClassAdP6StreamRN7classad7ClassAdEbP10StringList+0x14)[0x596e74]
condor_schedd(_ZN14compat_classad7ClassAd3putER6Stream+0x1b)[0x58a89b]
condor_schedd(_ZN15ScheddNegotiate11sendJobInfoEP4Sock+0x93)[0x4d3783]
condor_schedd(_ZN15ScheddNegotiate15messageReceivedEP11DCMessengerP4Sock+0x1c8)[0x4d3fa8]
condor_schedd(_ZN5DCMsg19callMessageReceivedEP11DCMessengerP4Sock+0x36)[0x50ce06]
condor_schedd(_ZN11DCMessenger7readMsgE18classy_counted_ptrI5DCMsgEP4Sock+0xe2)[0x50cfb2]
condor_schedd(_ZN11DCMessenger18receiveMsgCallbackEP6Stream+0xef)[0x51031f]
condor_schedd(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x498)[0x4fdf48]
condor_schedd(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1a)[0x4fe3ea]
condor_schedd(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x38)[0x586178]
condor_schedd(_ZN10DaemonCore17CallSocketHandlerERib+0x149)[0x4f6ac9]
condor_schedd(_ZN10DaemonCore6DriverEv+0x1bb5)[0x4f8d35]
condor_schedd(main+0xe57)[0x4ec317]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_schedd[0x486419]

Tested on:
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3 $
$CondorPlatform: I686-RedHat_5.7 $
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3 $
$CondorPlatform: X86_64-RedHat_5.7 $
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3.el6 $
$CondorPlatform: I686-RedHat_6.1 $
$CondorVersion: 7.6.3 Aug 05 2011 BuildID: RH-7.6.3-0.3.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $
No crash of schedd.
>>> VERIFIED
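
For anyone comparing stack dumps while verifying this bug, the frames in the reproduction log above follow the `module(symbol+offset)[address]` layout, with the parenthesised part omitted when no symbol is available. A minimal sketch of a parser for that layout follows; `FRAME_RE` and `parse_frame` are hypothetical helper names, and the pattern is an assumption based only on the frames shown in this report.

```python
import re

# Matches e.g. "condor_schedd(main+0xe57)[0x4ec317]" and
# "condor_schedd[0x59a852]" (no symbol information).
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"
    r"(?:\((?P<symbol>[^+()]+)\+(?P<offset>0x[0-9a-fA-F]+)\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Split one stack-dump frame into module, symbol, offset, address.

    Returns a dict, with symbol and offset set to None when the frame
    carries no symbol, or None if the line is not a frame at all.
    """
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()
```

The mangled C++ names in the `symbol` field can then be passed to a demangler such as `c++filt`; for instance, `_ZN15ScheddNegotiate11sendJobInfoEP4Sock` demangles to `ScheddNegotiate::sendJobInfo(Sock*)`.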
The timeout is presumed to be https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2367

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html

*** Bug 756135 has been marked as a duplicate of this bug. ***