Bug 526847

Summary: condor_startd SEGV when deleting dynamic slots
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Luigi Toscano <ltoscano>
Severity: high
Docs Contact:
Priority: high
Version: 1.1
CC: lbrindle, ltoscano
Target Milestone: 1.2   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: An update timer is registered for a resource (a dynamic slot) that has been deleted.
C: The condor_startd crashes when the timer fires.
F: The timer is no longer registered.
R: Dynamic slots can be deleted without creating a crash.
When an update timer was registered for a resource (a dynamic slot) that had been deleted, the condor_startd crashed when the timer fired. The timer is no longer registered, which means that dynamic slots can be deleted without creating a crash.
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-12-03 09:18:03 UTC Type: ---
Bug Blocks: 527551    

Description Matthew Farrellee 2009-10-02 02:58:04 UTC
Description of problem:

The condor_startd will non-deterministically SEGV with something similar to...

Stack dump for process 27184 at timestamp 1254139043 (12 frames)
condor_startd(dprintf_dump_stack+0xd0)[0x8137a56]
condor_startd[0x8137c16]
[0xdf0400]
condor_startd(caInsert(ClassAd*, ClassAd*, char const*, char const*)+0x8d)[0x80ef9d2]
condor_startd(Resource::publish(ClassAd*, int)+0xb3)[0x80dec95]
condor_startd(Resource::publish_for_update(ClassAd*, ClassAd*)+0x19)[0x80df93d]
condor_startd(Resource::do_update()+0x5f)[0x80dfda1]
condor_startd(TimerManager::Timeout()+0x2c3)[0x813478f]
condor_startd(DaemonCore::Driver()+0x79c)[0x8115262]
condor_startd(main+0x180b)[0x812d0f2]
/lib/libc.so.6(__libc_start_main+0xe5)[0x7096e5]
condor_startd[0x80c8491]


Version-Release number of selected component (if applicable):

7.4.0-0.5

How reproducible:

100% (under the right conditions and with enough time)


Steps to Reproduce:
1. Set up a startd in a harsh environment: many partitionable slots, a short CLAIM_WORKLIFE, and a short UPDATE_INTERVAL, e.g.

NUM_CPUS = 250
UPDATE_INTERVAL = 3
SLOT_TYPE_1_PARTITIONABLE = TRUE
SLOT_TYPE_1 = CPUS=5
NUM_SLOTS_TYPE_1 = 50
NUM_SLOTS = 50
CLAIM_WORKLIFE = 35

2. Submit a bunch of jobs; shorter ones are better, e.g.

echo -e 'executable=/bin/sleep\narguments=5\nnotification=never\nqueue 50000\n' | condor_submit

3. Watch the StartLog; a crash should occur within 5 minutes or so.
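
One way to automate step 3 is to scan the StartLog for the stack dump header shown in the description. This is only a minimal sketch: the function name find_stack_dumps and the sample log lines are made up for illustration, and the header format is assumed from the single example in this report.

```python
import re

# Header format assumed from the one example in this report:
# "Stack dump for process 27184 at timestamp 1254139043 (12 frames)"
STACK_DUMP_RE = re.compile(
    r"Stack dump for process (\d+) at timestamp (\d+) \((\d+) frames\)"
)


def find_stack_dumps(lines):
    """Return (pid, timestamp, frame_count) for each stack dump header found."""
    dumps = []
    for line in lines:
        match = STACK_DUMP_RE.search(line)
        if match:
            dumps.append(tuple(int(group) for group in match.groups()))
    return dumps


# Hypothetical log excerpt; only the header and first frame come from this report.
sample_log = [
    "some unrelated log line",
    "Stack dump for process 27184 at timestamp 1254139043 (12 frames)",
    "condor_startd(dprintf_dump_stack+0xd0)[0x8137a56]",
]
crashes = find_stack_dumps(sample_log)
```

To run it against a real log, pass an open file object instead of the sample list; where the StartLog lives depends on the local configuration.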

Comment 1 Matthew Farrellee 2009-10-02 03:38:42 UTC
The cause of the SEGV occurs before the stack shown: an update timer is registered on a Resource (a slot) that has been deleted. The eval_and_update_all timer calls Resource::eval_and_update, which first calls eval_state, resulting in the Resource being deleted, and then calls update, which registers the timer that later fires in the stack.
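
The ordering problem described above can be illustrated with a toy model. This is a sketch under loud assumptions, not Condor code: the classes only borrow the names from this report, the slot names and the claim_expired flag are hypothetical, and the guard in eval_and_update_fixed is one plausible way to "no longer register" the timer, not necessarily what the upstream commit does. In the real C++ daemon the deleted Resource's memory is freed, so the timer callback touches a dangling pointer; here a flag stands in for that.

```python
class TimerManager:
    """Toy stand-in for a timer queue; not Condor's TimerManager."""

    def __init__(self):
        self.pending = []  # resources that have an update timer registered

    def register(self, resource):
        self.pending.append(resource)

    def fire_all(self):
        """Fire every pending timer; return names whose resource is gone."""
        crashed = [r.name for r in self.pending if r.deleted]
        self.pending.clear()
        return crashed


class Resource:
    """Toy stand-in for a slot."""

    def __init__(self, name, timers, claim_expired):
        self.name = name
        self.timers = timers
        self.claim_expired = claim_expired  # hypothetical trigger for deletion
        self.deleted = False

    def eval_state(self):
        # In this sketch a dynamic slot is deleted once its claim expires.
        if self.claim_expired:
            self.deleted = True

    def update(self):
        self.timers.register(self)

    def eval_and_update_buggy(self):
        self.eval_state()
        self.update()  # registers a timer even if eval_state deleted us

    def eval_and_update_fixed(self):
        self.eval_state()
        if not self.deleted:  # skip registration for a deleted resource
            self.update()


def run(method_name):
    """Evaluate two slots with the given method, then fire the timers."""
    timers = TimerManager()
    slots = [
        Resource("slot1_1", timers, claim_expired=True),
        Resource("slot1_2", timers, claim_expired=False),
    ]
    for slot in slots:
        getattr(slot, method_name)()
    return timers.fire_all()


buggy_result = run("eval_and_update_buggy")
fixed_result = run("eval_and_update_fixed")
```

In the buggy variant the deleted slot still has a timer pending when the timers fire, which is the situation that crashes the startd; in the guarded variant no timer is registered for it.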

This is fixed upstream and will be built into 7.4.0-0.6

commit 0d5e3ad8fc85f0cd0dc58f73b503c76c0ad49bc4
Author: Matthew Farrellee <matt>
Date:   Thu Oct 1 22:22:08 2009 -0400

Comment 3 Irina Boverman 2009-10-29 14:30:04 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
please see bug summary.

Comment 4 Luigi Toscano 2009-10-29 16:15:45 UTC
The crash often shows up when running the given configuration on condor-7.2.2-0.9, on RHEL 5.4 and 4.8, both i386 and x86_64.

On the same machines, the crash no longer occurs with condor-7.4.1-0.2. Changing the state to VERIFIED.

Comment 5 Lana Brindley 2009-11-09 03:34:53 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-please see bug summary.
+Grid bug fix
+
+C: And update timer is registered on a resource (a dynamic slot) that has been deleted.
+C: The condor_startd crashes
+F: 
+R: Dynamic slots can be deleted without creating a crash.
+
+MORE INFORMATION REQUIRED FOR RELNOTE.

Comment 6 Matthew Farrellee 2009-11-09 12:07:02 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,8 +1,6 @@
 Grid bug fix
 
-C: And update timer is registered on a resource (a dynamic slot) that has been deleted.
+C: And update timer is registered for a resource (a dynamic slot) that has been deleted.
-C: The condor_startd crashes
+C: The condor_startd crashes when the timer fires.
-F: 
+F: The timer is no longer registered.
-R: Dynamic slots can be deleted without creating a crash.
+R: Dynamic slots can be deleted without creating a crash.
-
-MORE INFORMATION REQUIRED FOR RELNOTE.

Comment 7 Lana Brindley 2009-11-11 20:35:12 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,6 +1,8 @@
 Grid bug fix
 
-C: And update timer is registered for a resource (a dynamic slot) that has been deleted.
+C: An update timer is registered for a resource (a dynamic slot) that has been deleted.
 C: The condor_startd crashes when the timer fires.
 F: The timer is no longer registered.
-R: Dynamic slots can be deleted without creating a crash.
+R: Dynamic slots can be deleted without creating a crash.
+
+When an update timer was registered for a resource (a dynamic slot) that had been deleted, the condor_startd crashed when the timer fired. The timer is no longer registered, which means that dynamic slots can be deleted without creating a crash.

Comment 9 errata-xmlrpc 2009-12-03 09:18:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html