Bug 519437 - Race condition can cause crash in collector and schedd if query workers are enabled
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.2
Target Release: ---
Assignee: Robert Rati
QA Contact: Martin Kudlej
Blocks: 527551
 
Reported: 2009-08-26 15:47 UTC by Robert Rati
Modified: 2009-12-03 09:19 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: The collector and schedd were susceptible to a race condition.
C: The race condition could cause the collector and schedd to core dump in some circumstances.
F: Ensured that the termination signal is processed in the correct order.
R: The race condition no longer exists, and core dumps no longer occur.

A race condition was causing the collector and schedd to core dump in some circumstances. The race condition was rectified by ensuring that the termination signal is handled correctly, and the core dumps no longer occur.
Last Closed: 2009-12-03 09:19:15 UTC


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2009:1633 | normal | SHIPPED_LIVE | Red Hat Enterprise MRG Messaging and Grid Version 1.2 | 2009-12-03 09:15:33 UTC

Description Robert Rati 2009-08-26 15:47:27 UTC
Description of problem:
The collector and schedd are susceptible to a race condition that can cause a core dump. If a shutdown request reaches the collector or schedd while it is processing a command that forces it to fork a subprocess (condor_q/condor_status), the termination signal is processed before the child process is cleaned up. When the child process's information is later cleaned up, the daemon dumps core.
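
The ordering problem described above can be illustrated with a minimal Python sketch. This is not the condor code or the actual patch: MiniDaemon, fork_worker, reap_workers and demo are hypothetical names invented for illustration. It assumes a Unix host and shows one safe ordering, where the SIGTERM handler only records the request and the forked worker is reaped before any daemon state is torn down.

```python
import os
import signal
import time


class MiniDaemon:
    """Toy stand-in for a daemon that forks query workers (hypothetical)."""

    def __init__(self):
        self.workers = {}               # pid -> bookkeeping for each forked worker
        self.shutdown_requested = False
        self.events = []                # order in which cleanup steps ran

    def handle_term(self, signum, frame):
        # Only record the request. Tearing down state here is what would
        # race with the later cleanup of a finished worker.
        self.shutdown_requested = True

    def fork_worker(self):
        pid = os.fork()
        if pid == 0:
            time.sleep(0.1)             # child: pretend to answer a query
            os._exit(0)
        self.workers[pid] = "query worker"
        return pid

    def reap_workers(self):
        # Wait for every outstanding worker, then drop its bookkeeping entry.
        for pid in list(self.workers):
            os.waitpid(pid, 0)
            del self.workers[pid]
            self.events.append("reaped")

    def shutdown(self):
        self.reap_workers()             # child cleanup first
        self.events.append("shutdown")  # daemon teardown only afterwards


def demo():
    daemon = MiniDaemon()
    previous = signal.signal(signal.SIGTERM, daemon.handle_term)
    try:
        daemon.fork_worker()
        os.kill(os.getpid(), signal.SIGTERM)   # shutdown arrives mid-query
        while not daemon.shutdown_requested:   # wait until the handler has run
            time.sleep(0.01)
        daemon.shutdown()
    finally:
        # Restore whatever handler was installed before the demo.
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGTERM, restore)
    return daemon.events, daemon.workers
```

Calling demo() returns the recorded cleanup order and the remaining worker table. In this sketch the worker entry is always removed before the shutdown step runs, which is the ordering the bug was missing.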

Comment 1 Robert Rati 2009-08-26 18:01:20 UTC
This can be reproduced on the collector by issuing:

condor_status -any & sleep 1 & killall -TERM condor_collector

Comment 2 Robert Rati 2009-08-27 15:38:35 UTC
Fixed in:
condor-7.3.2-0.5

Comment 3 Martin Kudlej 2009-10-09 10:55:37 UTC
I set CREATE_CORE_FILES = TRUE and set ulimit -c to a non-zero value. After running the command from comment 1, there are no core dumps in the LOG directory.
Tested on RHEL 5.4/4.8 x86_64/i386 with condor-7.4.0-0.(5.el5/4.el4) and it works --> VERIFIED
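
The check above could be scripted roughly as follows. This is a sketch, not the procedure QA actually ran. The helper names are hypothetical, and it assumes core files are written as "core" or "core.<pid>" directly in the LOG directory, which depends on the kernel core_pattern setting. The caller supplies the directory path (for example the value reported by condor_config_val LOG).

```python
import glob
import os
import resource


def core_dumps_enabled():
    """Return True if the soft core-size limit (ulimit -c) is non-zero."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft != 0


def find_core_files(log_dir):
    """List files in log_dir whose names start with 'core' (assumed naming)."""
    return sorted(glob.glob(os.path.join(log_dir, "core*")))
```

An empty list from find_core_files after running the reproducer, with core_dumps_enabled() returning True, corresponds to the "no core dumps in the LOG directory" result reported here.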

Comment 4 Irina Boverman 2009-10-29 14:29:56 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
please see bug summary.

Comment 5 Lana Brindley 2009-11-09 00:06:02 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-please see bug summary.
+Grid bug fix
+
+C: The collector and schedd were susceptible to a race condition.
+C: The race condition could cause the collector and schedd to core dump in some circumstances.
+F: Ensured that the termination signal is processed in the correct order.
+R: The race condition no longer exists, and core dumps no longer occur
+
+A race condition was causing the collector and schedd to core dump in some circumstances. The race condition was rectified by ensuring that the termination signal is handled correctly, and the core dumps no longer occur.

Comment 7 errata-xmlrpc 2009-12-03 09:19:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html

