Bug 519437

Summary: Race condition can cause crash in collector and schedd if query workers are enabled
Product: Red Hat Enterprise MRG
Reporter: Robert Rati <rrati>
Component: grid
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Martin Kudlej <mkudlej>
Severity: medium
Docs Contact:
Priority: medium
Version: 1.1
CC: lbrindle, matt, mkudlej
Target Milestone: 1.2
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix

C: The collector and schedd were susceptible to a race condition.
C: The race condition could cause the collector and schedd to core dump in some circumstances.
F: Ensured that the termination signal is processed in the correct order.
R: The race condition no longer exists, and core dumps no longer occur.

A race condition was causing the collector and schedd to core dump in some circumstances. The race condition was rectified by ensuring that the termination signal is handled correctly, and the core dumps no longer occur.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-12-03 09:19:15 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 527551    

Description Robert Rati 2009-08-26 15:47:27 UTC
Description of problem:
The collector and schedd are susceptible to a race condition that can cause a core dump.  If a shutdown request is sent to the collector or schedd while it is processing a command that forces it to fork a subprocess (condor_q/condor_status), the termination signal is processed before the child process is cleaned up.  When the child process's information is subsequently cleaned up, the daemon dumps core.
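
The ordering problem described above can be illustrated with a small model. This is only a sketch under stated assumptions, not Condor's code: the Daemon class, its methods, the pid 101, and the event tuples are all hypothetical, and a Python KeyError stands in for the core dump. It assumes that the "correct order" means handling child exits before the termination request.

```python
# Hypothetical model of the race; none of these names come from Condor.

class Daemon:
    def __init__(self):
        # pid -> bookkeeping record for each forked query worker
        self.children = {}
        self.shut_down = False

    def fork_worker(self, pid, command):
        self.children[pid] = {"command": command}

    def on_child_exit(self, pid):
        # Reaper: looks up the worker's record in order to clean it up.
        # Raises KeyError if the table has already been torn down.
        record = self.children.pop(pid)
        return record["command"]

    def on_terminate(self):
        # Shutdown: discards all per-child state.
        self.children.clear()
        self.shut_down = True


def process(daemon, pending, reap_first):
    """Handle pending events, optionally ordering child exits first."""
    if reap_first:
        # Stable sort: child exits keep their order, termination goes last.
        pending = sorted(pending, key=lambda ev: ev[0] == "TERM")
    reaped = []
    for kind, pid in pending:
        if kind == "CHLD":
            reaped.append(daemon.on_child_exit(pid))
        else:
            daemon.on_terminate()
    return reaped


def run(reap_first):
    daemon = Daemon()
    daemon.fork_worker(101, "condor_status -any")
    # The shutdown request arrives while the worker is still running.
    events = [("TERM", None), ("CHLD", 101)]
    try:
        return process(daemon, events, reap_first)
    except KeyError:
        return "crash"  # stands in for the core dump


if __name__ == "__main__":
    print("termination first:", run(reap_first=False))
    print("child exits first:", run(reap_first=True))
```

With termination handled first, the reaper finds no record for the worker and the model fails; with child exits handled first, the worker is cleaned up before the state is discarded.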

Comment 1 Robert Rati 2009-08-26 18:01:20 UTC
This can be reproduced on the collector by issuing:

condor_status -any & sleep 1 & killall -TERM condor_collector

Comment 2 Robert Rati 2009-08-27 15:38:35 UTC
Fixed in:
condor-7.3.2-0.5

Comment 3 Martin Kudlej 2009-10-09 10:55:37 UTC
I set CREATE_CORE_FILES = TRUE and set ulimit -c to a non-zero value. After running the command from comment 1, there is no core dump in the LOG directory.
Tested on RHEL 5.4/4.8 x86_64/i386 with condor-7.4.0-0.(5.el5/4.el4) and it works --> VERIFIED

Comment 4 Irina Boverman 2009-10-29 14:29:56 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
please see bug summary.

Comment 5 Lana Brindley 2009-11-09 00:06:02 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-please see bug summary.
+Grid bug fix
+
+C: The collector and schedd were susceptible to a race condition.
+C: The race condition could cause the collector and schedd to core dump in some circumstances.
+F: Ensured that the termination signal is processed in the correct order.
+R: The race condition no longer exists, and core dumps no longer occur
+
+A race condition was causing the collector and schedd to core dump in some circumstances. The race condition was rectified by ensuring that the termination signal is handled correctly, and the core dumps no longer occur.

Comment 7 errata-xmlrpc 2009-12-03 09:19:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html