Bug 707335 - Negotiator crashes with hierarchical group quotas after reconfig
Summary: Negotiator crashes with hierarchical group quotas after reconfig
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.0.1
: ---
Assignee: Erik Erlandson
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 723887
Reported: 2011-05-24 17:46 UTC by Erik Erlandson
Modified: 2011-09-07 16:41 UTC (History)

Fixed In Version: condor-7.6.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: The negotiator depends on a dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle. Consequence: If a reconfig event occurred in the middle of a negotiation cycle, pointers into that structure held on the negotiator's stack could be invalidated, causing a memory read error and a crash. Fix: Logic was added to delay reconfig events until the current negotiation cycle completes, and then to execute the reconfig directly after the cycle. Result: A reconfig can no longer cause a negotiator read error and crash.
Clone Of:
Environment:
Last Closed: 2011-09-07 16:41:33 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2011:1249 | 0 | normal | SHIPPED_LIVE | Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update | 2011-09-07 16:40:45 UTC

Description Erik Erlandson 2011-05-24 17:46:27 UTC
Description of problem:


(from upstream): What's going on here is that the matchmaker calls DaemonCore::ServiceCommandThread in the middle of the negotiation cycle. This recursively calls the DaemonCore event loop, and if we get a reconfig there we recompute all the hierarchical group data structures, throw out the old ones, and return to the negotiator, which is still using the old ones, and bad things happen.
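
The hazard described above can be sketched in a few lines. This is only an illustrative Python model, not the negotiator's C++ code: the names `Group`, `ToyNegotiator`, `service_commands` and `pending_commands` are hypothetical, and a `valid` flag stands in for memory that the real code would have freed.

```python
class Group:
    """Hypothetical stand-in for one node of the hierarchical group data."""

    def __init__(self, name):
        self.name = name
        self.valid = True  # set to False once a reconfig discards this node


class ToyNegotiator:
    def __init__(self, group_names):
        self.groups = [Group(n) for n in group_names]
        self.pending_commands = []  # commands waiting in the event loop

    def reconfig(self, group_names):
        # Throw out the old structures and build new ones.
        for g in self.groups:
            g.valid = False
        self.groups = [Group(n) for n in group_names]

    def service_commands(self):
        # Re-entrant event loop: runs whatever arrived, including a reconfig.
        while self.pending_commands:
            self.pending_commands.pop(0)()

    def negotiate(self):
        groups = self.groups  # local reference, like a pointer held on the stack
        stale = []
        for g in groups:
            self.service_commands()  # mid-cycle call into the event loop
            if not g.valid:
                stale.append(g.name)  # the C++ code would read freed memory here
        return stale


names = ["group_prod", "group_T3gen", "group_opporA"]
neg = ToyNegotiator(names)
neg.pending_commands.append(lambda: neg.reconfig(names))
stale_groups = neg.negotiate()
```

In this model every group the cycle touches after the reconfig is already discarded, so `stale_groups` lists all three names. In Python the stale objects are merely outdated; with raw pointers the same access pattern is an invalid read.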


How reproducible: "low prob"


Steps to Reproduce:
Submit a bunch of jobs, then run

condor_reschedule ; condor_reconfig

a bunch of times until the negotiator crashes 

  
Actual results:
negotiator crashes


Expected results:
negotiator avoids reconfig in the middle of a cycle and does not crash


Additional info:
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2172

Comment 1 Erik Erlandson 2011-06-01 18:31:17 UTC
Upstream fix is on 7.6 branch:
https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=21901

Comment 3 Lubos Trilety 2011-07-21 15:44:45 UTC
I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.

I used the configuration mentioned in the upstream ticket with NUM_CPUS set to 40 and
submitted the following description file:
universe = vanilla
cmd = /bin/sleep
args = 30
#should_transfer_files = if_needed
#when_to_transfer_output = on_exit
queue 200
+AccountingGroup="group_gatekpr.user"
queue 200
+AccountingGroup="group_gatekpr.prod.user"
queue 200
+AccountingGroup="group_gatekpr.other.user"
queue 200
+AccountingGroup="group_calibrate.user"
queue 200
+AccountingGroup="group_opporA.user"
queue 200
+AccountingGroup="group_opporA.ligo.user"
queue 200
+AccountingGroup="group_opporA.CMS.user"
queue 200
+AccountingGroup="group_opporB.user"
queue 200
+AccountingGroup="group_opporB.SBGrid.user"
queue 200
+AccountingGroup="group_VOgener.user"
queue 200
+AccountingGroup="group_T3gen.user"
queue 200
+AccountingGroup="group_prod.user"
queue 200
+AccountingGroup="group_prod.hggs.user"
queue 200
+AccountingGroup="group_prod.ww.user"
queue 200
+AccountingGroup="group_prod.muon.user"
queue 200
+AccountingGroup="group_T3gen.other.user"
queue 200
+AccountingGroup="group_T3gen.eID.user"
queue 200
+AccountingGroup="group_T3gen.hggs.user"
queue 200
+AccountingGroup="group_T3gen.BSM.user"
queue 200
+AccountingGroup="group_T3gen.general.user"
queue 200

then run
while true; do condor_reschedule; condor_reconfig; done > /dev/null
or
while true; do condor_reschedule; condor_reconfig; sleep $(( $RANDOM % 5 )); done > /dev/null

I tried also other random sleep intervals, but in all cases I was not able to
simulate crash of negotiator. Am I doing something wrong? Is there something
else needed?

Comment 4 Erik Erlandson 2011-07-21 16:04:26 UTC
(In reply to comment #3)
> I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.
> 

It is not entirely clear to me which branches and which tags contain the fix. It's on the upstream 7.6.1 branch, so it might help to repro against an earlier build.

> 
> I tried also other random sleep intervals, but in all cases I was not able to
> simulate crash of negotiator. Am I doing something wrong? Is there something
> else needed?

Repro of this bug is timing dependent:  A reconfig needs to happen during the negotiation cycle.  The best bet is setting things up to create as long a negotiation cycle as possible, and attempting to specifically time the reconfig to land inside the cycle.   Lots of slots, jobs, groups, etc.  Also setting "GROUP_QUOTA_ROUND_ROBIN_RATE = 1" is a nice expensive thing to do that will increase the length of the cycle and make repro more likely.
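
One way to time the reconfig so it lands inside the cycle is to sweep the delay between the two commands instead of sleeping a random amount. The following Python sketch assumes that `condor_reschedule` and `condor_reconfig` are on the PATH and that a cycle starts shortly after the reschedule; the helper names `run_command`, `hammer` and `sweep` are made up for illustration.

```python
import subprocess
import time


def run_command(argv):
    """Run one Condor command-line tool, discarding its standard output."""
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=False)


def hammer(delays, run=run_command, sleep=time.sleep):
    """For each delay: request a cycle, wait that long, then send a reconfig."""
    issued = []
    for delay in delays:
        run(["condor_reschedule"])
        sleep(delay)
        run(["condor_reconfig"])
        issued.append(delay)
    return issued


def sweep(start, stop, step):
    """Delays in seconds from start up to, but not including, stop."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 3) for i in range(count)]
```

For example, `hammer(sweep(0.0, 5.0, 0.25))` tries delays of 0, 0.25, 0.5, ... seconds in turn. The `run` and `sleep` parameters exist only so the loop can be exercised without a Condor pool.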

Comment 5 Lubos Trilety 2011-07-22 10:18:14 UTC
Successfully reproduced on:
condor-7.6.0-0.3

07/22/11 12:12:28   This submitter hit its submitterLimit.
Stack dump for process 14023 at timestamp 1311329548 (11 frames)
condor_negotiator(dprintf_dump_stack+0x56)[0x537856]
condor_negotiator[0x524682]
/lib64/libpthread.so.0[0x354c60eb10]
condor_negotiator(_ZN14compat_classad27ClassAdListDoesNotDeleteAds4NextEv+0xf)[0x52aa1f]
condor_negotiator(_ZN10Matchmaker18negotiateWithGroupEiddRN14compat_classad27ClassAdListDoesNotDeleteAdsER9HashTableI8MyStringS4_ES2_ffPKc+0x2a9)[0x480fb9]
condor_negotiator(_ZN10Matchmaker15negotiationTimeEv+0xa70)[0x4829f0]
condor_negotiator(_ZN12TimerManager7TimeoutEv+0x155)[0x4a6305]
condor_negotiator(_ZN10DaemonCore6DriverEv+0x248)[0x490c18]
condor_negotiator(main+0xe57)[0x4a52f7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_negotiator(__gxx_personality_v0+0x441)[0x468f49]

Comment 6 Lubos Trilety 2011-07-22 12:13:20 UTC
Tested on:
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

The negotiator did not crash. The reconfig ran after the end of the negotiation cycle.

>>> VERIFIED

Comment 7 Erik Erlandson 2011-07-25 22:33:42 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
The negotiator depends on a dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle.

Consequence:
If a reconfig event occurred in the middle of a negotiation cycle, pointers into that structure held on the negotiator's stack could be invalidated, causing a memory read error and a crash.

Fix:
Logic was added to delay reconfig events until the current negotiation cycle completes, and then to execute the reconfig directly after the cycle.

Result:
A reconfig can no longer cause a negotiator read error and crash.
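
The deferral described under "Fix" can be modeled roughly as follows. This is a Python sketch of the general pattern only, not the actual upstream C++ change; `DeferringNegotiator`, `request_reconfig`, `in_cycle` and `reconfig_pending` are hypothetical names.

```python
class DeferringNegotiator:
    def __init__(self, group_names):
        self.groups = list(group_names)
        self.in_cycle = False
        self.reconfig_pending = None  # latest deferred configuration, if any
        self.pending_commands = []    # commands waiting in the event loop
        self.log = []

    def request_reconfig(self, group_names):
        if self.in_cycle:
            # Do not touch the group data while a cycle is using it.
            self.reconfig_pending = list(group_names)
            self.log.append("reconfig deferred")
            return
        self._apply_reconfig(group_names)

    def _apply_reconfig(self, group_names):
        self.groups = list(group_names)
        self.log.append("reconfig applied")

    def service_commands(self):
        while self.pending_commands:
            self.pending_commands.pop(0)()

    def negotiate(self):
        self.in_cycle = True
        seen = []
        try:
            for name in self.groups:
                self.service_commands()  # may receive a reconfig request
                seen.append(name)
        finally:
            self.in_cycle = False
        self.log.append("cycle finished")
        # Run the deferred reconfig directly after the cycle.
        if self.reconfig_pending is not None:
            pending, self.reconfig_pending = self.reconfig_pending, None
            self._apply_reconfig(pending)
        return seen


neg = DeferringNegotiator(["group_prod", "group_T3gen"])
neg.pending_commands.append(lambda: neg.request_reconfig(["group_calibrate"]))
seen = neg.negotiate()
```

Here the cycle still sees both original groups, and the new configuration takes effect only once the cycle has finished, which matches the ordering reported during verification.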

Comment 8 errata-xmlrpc 2011-09-07 16:41:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html

