Bug 707335

Summary: Negotiator crashes with hierarchical group quotas after reconfig
Product: Red Hat Enterprise MRG Reporter: Erik Erlandson <eerlands>
Component: condor Assignee: Erik Erlandson <eerlands>
Status: CLOSED ERRATA QA Contact: Lubos Trilety <ltrilety>
Severity: high Docs Contact:
Priority: high    
Version: 2.0 CC: claudiol, jneedle, jthomas, ltrilety, matt, mkudlej, tstclair, whenry
Target Milestone: 2.0.1   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: condor-7.6.2-0.1 Doc Type: Bug Fix
Doc Text:
Cause: The negotiator depends on a dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle.
Consequence: If a reconfig event occurs in the middle of a negotiation cycle, the negotiator's structure pointers on the stack could be invalidated, causing a memory read error and a crash.
Fix: Logic was added to delay reconfig events until any current negotiation cycle completes, and then execute the reconfig directly after the cycle.
Result: The potential for a reconfig to cause a negotiator read error and crash is eliminated.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-09-07 16:41:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 723887    

Description Erik Erlandson 2011-05-24 17:46:27 UTC
Description of problem:


(from upstream): What's going on here is that the matchmaker calls DaemonCore::ServiceCommandThread in the middle of the negotiation cycle. This recursively calls the DaemonCore event loop. If a reconfig arrives there, we recompute all the hierarchical group data structures, throw out the old ones, and return to the negotiator, which is still using the old ones, and bad things happen.
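
The hazard described above can be modeled in a few lines. This is only an illustrative Python sketch, not the Condor C++ code: `GroupTable`, `RacyNegotiator`, `service_commands` and the `valid` flag are hypothetical names, and the `RuntimeError` stands in for the invalid memory read that the real negotiator hits.

```python
class GroupTable:
    """Hypothetical stand-in for the hierarchical group data structures."""

    def __init__(self, names):
        self.names = list(names)
        self.valid = True  # set to False once the table has been thrown out

    def index_of(self, name):
        if not self.valid:
            # Models the read through an invalidated pointer.
            raise RuntimeError("access to discarded group table")
        return self.names.index(name)


class RacyNegotiator:
    """Model of the behavior before the fix (assumed, simplified)."""

    def __init__(self):
        self.groups = GroupTable(["group_prod", "group_T3gen"])

    def reconfig(self):
        # Recompute the group structures and throw out the old ones.
        self.groups.valid = False
        self.groups = GroupTable(["group_prod", "group_T3gen"])

    def service_commands(self, pending):
        # Stand-in for the nested event loop: run queued commands right now.
        while pending:
            pending.pop(0)()

    def negotiation_cycle(self, pending):
        groups = self.groups            # local reference, like a stack pointer
        self.service_commands(pending)  # a reconfig may run in here
        return groups.index_of("group_prod")  # still uses the old table
```

Calling `negotiation_cycle([])` completes normally, while `negotiation_cycle([negotiator.reconfig])` raises, because the cycle keeps using the table that the nested reconfig discarded.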


How reproducible: "low prob"


Steps to Reproduce:
Submit a bunch of jobs, then run

condor_reschedule ; condor_reconfig

a bunch of times until the negotiator crashes 

  
Actual results:
negotiator crashes


Expected results:
negotiator avoids reconfig in the middle of a cycle and does not crash


Additional info:
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2172

Comment 1 Erik Erlandson 2011-06-01 18:31:17 UTC
Upstream fix is on 7.6 branch:
https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=21901

Comment 3 Lubos Trilety 2011-07-21 15:44:45 UTC
I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.

I used the configuration mentioned in the upstream ticket with NUM_CPUS set to 40. I
submitted the following description file:
universe = vanilla
cmd = /bin/sleep
args = 30
#should_transfer_files = if_needed
#when_to_transfer_output = on_exit
queue 200
+AccountingGroup="group_gatekpr.user"
queue 200
+AccountingGroup="group_gatekpr.prod.user"
queue 200
+AccountingGroup="group_gatekpr.other.user"
queue 200
+AccountingGroup="group_calibrate.user"
queue 200
+AccountingGroup="group_opporA.user"
queue 200
+AccountingGroup="group_opporA.ligo.user"
queue 200
+AccountingGroup="group_opporA.CMS.user"
queue 200
+AccountingGroup="group_opporB.user"
queue 200
+AccountingGroup="group_opporB.SBGrid.user"
queue 200
+AccountingGroup="group_VOgener.user"
queue 200
+AccountingGroup="group_T3gen.user"
queue 200
+AccountingGroup="group_prod.user"
queue 200
+AccountingGroup="group_prod.hggs.user"
queue 200
+AccountingGroup="group_prod.ww.user"
queue 200
+AccountingGroup="group_prod.muon.user"
queue 200
+AccountingGroup="group_T3gen.other.user"
queue 200
+AccountingGroup="group_T3gen.eID.user"
queue 200
+AccountingGroup="group_T3gen.hggs.user"
queue 200
+AccountingGroup="group_T3gen.BSM.user"
queue 200
+AccountingGroup="group_T3gen.general.user"
queue 200

then run
while true; do condor_reschedule; condor_reconfig; done > /dev/null
or
while true; do condor_reschedule; condor_reconfig; sleep $(( $RANDOM % 5 ));
done > /dev/null

I tried also other random sleep intervals, but in all cases I was not able to
simulate crash of negotiator. Am I doing something wrong? Is there something
else needed?

Comment 4 Erik Erlandson 2011-07-21 16:04:26 UTC
(In reply to comment #3)
> I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.
> 

I'm seeing slightly ambiguous results comparing which branches versus which tags the fix appears in.  It's on the upstream 7.6.1 branch, so it might help to repro against an earlier build.

> 
> I tried also other random sleep intervals, but in all cases I was not able to
> simulate crash of negotiator. Am I doing something wrong? Is there something
> else needed?

Repro of this bug is timing dependent:  A reconfig needs to happen during the negotiation cycle.  The best bet is setting things up to create as long a negotiation cycle as possible, and attempting to specifically time the reconfig to land inside the cycle.   Lots of slots, jobs, groups, etc.  Also setting "GROUP_QUOTA_ROUND_ROBIN_RATE = 1" is a nice expensive thing to do that will increase the length of the cycle and make repro more likely.

Comment 5 Lubos Trilety 2011-07-22 10:18:14 UTC
Successfully reproduced on:
condor-7.6.0-0.3

07/22/11 12:12:28   This submitter hit its submitterLimit.
Stack dump for process 14023 at timestamp 1311329548 (11 frames)
condor_negotiator(dprintf_dump_stack+0x56)[0x537856]
condor_negotiator[0x524682]
/lib64/libpthread.so.0[0x354c60eb10]
condor_negotiator(_ZN14compat_classad27ClassAdListDoesNotDeleteAds4NextEv+0xf)[0x52aa1f]
condor_negotiator(_ZN10Matchmaker18negotiateWithGroupEiddRN14compat_classad27ClassAdListDoesNotDeleteAdsER9HashTableI8MyStringS4_ES2_ffPKc+0x2a9)[0x480fb9]
condor_negotiator(_ZN10Matchmaker15negotiationTimeEv+0xa70)[0x4829f0]
condor_negotiator(_ZN12TimerManager7TimeoutEv+0x155)[0x4a6305]
condor_negotiator(_ZN10DaemonCore6DriverEv+0x248)[0x490c18]
condor_negotiator(main+0xe57)[0x4a52f7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_negotiator(__gxx_personality_v0+0x441)[0x468f49]

Comment 6 Lubos Trilety 2011-07-22 12:13:20 UTC
Tested on:
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

No negotiator crash. The reconfig ran after the end of the negotiation cycle.

>>> VERIFIED

Comment 7 Erik Erlandson 2011-07-25 22:33:42 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
The negotiator depends on a dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle.

Consequence:
If a reconfig event occurs in the middle of a negotiation cycle, the negotiator's structure pointers on the stack could be invalidated, causing a memory read error and a crash.

Fix:
Logic was added to delay reconfig events until any current negotiation cycle completes, and then execute the reconfig directly after the cycle.

Result:
The potential for a reconfig to cause a negotiator read error and crash is eliminated.
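
The deferral logic described in the note can be sketched roughly as follows. This is a Python illustration under assumptions, not the actual upstream patch: `DeferringNegotiator`, `in_cycle`, `reconfig_pending` and the `log` list are hypothetical names used only to show the ordering of events.

```python
class DeferringNegotiator:
    """Sketch: postpone reconfig requests that arrive during a cycle."""

    def __init__(self):
        self.in_cycle = False          # True while a negotiation cycle runs
        self.reconfig_pending = False  # a reconfig arrived mid-cycle
        self.log = []                  # event order, for illustration only

    def _do_reconfig(self):
        # The real daemon would rebuild the group structures here.
        self.log.append("reconfig")

    def handle_reconfig(self):
        if self.in_cycle:
            # Do not touch the group structures now; remember the request.
            self.reconfig_pending = True
            self.log.append("reconfig deferred")
            return
        self._do_reconfig()

    def negotiation_cycle(self, commands_during_cycle=()):
        self.in_cycle = True
        try:
            self.log.append("cycle start")
            for command in commands_during_cycle:
                command()  # models commands serviced mid-cycle
            self.log.append("cycle end")
        finally:
            self.in_cycle = False
        # Execute the delayed reconfig directly after the cycle.
        if self.reconfig_pending:
            self.reconfig_pending = False
            self._do_reconfig()
```

In this model, several reconfig requests arriving in the same cycle collapse into a single reconfig after the cycle ends, and a request arriving outside a cycle runs immediately.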

Comment 8 errata-xmlrpc 2011-09-07 16:41:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html