Bug 707335 - Negotiator crashes with hierarchical group quotas after reconfig
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.0
Hardware: Unspecified Unspecified
Priority: high Severity: high
Target Milestone: 2.0.1
Target Release: ---
Assigned To: Erik Erlandson
QA Contact: Lubos Trilety
Depends On:
Blocks: 723887
Reported: 2011-05-24 13:46 EDT by Erik Erlandson
Modified: 2011-09-07 12:41 EDT (History)
Fixed In Version: condor-7.6.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: Negotiator depends on dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle.
Consequence: If a reconfig event occurs mid-negotiation cycle, the negotiator's structure pointers on stack could be invalidated, causing a memory read error and crash.
Fix: Logic was added to delay reconfig events until any current negotiation cycle completes, and then execute the reconfig directly after the cycle.
Result: Potential for reconfig causing negotiator read error and crash is eliminated.
Last Closed: 2011-09-07 12:41:33 EDT
External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:1249
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update
Last Updated: 2011-09-07 12:40:45 EDT

Description Erik Erlandson 2011-05-24 13:46:27 EDT
Description of problem:


(from upstream): What's going on here is that the matchmaker calls DaemonCore::ServiceCommandThread in the middle of the negotiation cycle. This recursively calls the DaemonCore event loop, and if we get a reconfig there we recompute all the hierarchical group data structures, throw out the old ones, and return to the negotiator, which is still using the old ones, and bad things happen.


How reproducible: "low prob"


Steps to Reproduce:
Submit a bunch of jobs, then run

condor_reschedule ; condor_reconfig

a bunch of times until the negotiator crashes 

  
Actual results:
negotiator crashes


Expected results:
negotiator avoids reconfig in the middle of a cycle and does not crash


Additional info:
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2172
Comment 1 Erik Erlandson 2011-06-01 14:31:17 EDT
Upstream fix is on 7.6 branch:
https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=21901
Comment 3 Lubos Trilety 2011-07-21 11:44:45 EDT
I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.

I used the configuration mentioned in the upstream ticket, with NUM_CPUS set to 40. I
submitted the following description file:
universe = vanilla
cmd = /bin/sleep
args = 30
#should_transfer_files = if_needed
#when_to_transfer_output = on_exit
queue 200
+AccountingGroup="group_gatekpr.user"
queue 200
+AccountingGroup="group_gatekpr.prod.user"
queue 200
+AccountingGroup="group_gatekpr.other.user"
queue 200
+AccountingGroup="group_calibrate.user"
queue 200
+AccountingGroup="group_opporA.user"
queue 200
+AccountingGroup="group_opporA.ligo.user"
queue 200
+AccountingGroup="group_opporA.CMS.user"
queue 200
+AccountingGroup="group_opporB.user"
queue 200
+AccountingGroup="group_opporB.SBGrid.user"
queue 200
+AccountingGroup="group_VOgener.user"
queue 200
+AccountingGroup="group_T3gen.user"
queue 200
+AccountingGroup="group_prod.user"
queue 200
+AccountingGroup="group_prod.hggs.user"
queue 200
+AccountingGroup="group_prod.ww.user"
queue 200
+AccountingGroup="group_prod.muon.user"
queue 200
+AccountingGroup="group_T3gen.other.user"
queue 200
+AccountingGroup="group_T3gen.eID.user"
queue 200
+AccountingGroup="group_T3gen.hggs.user"
queue 200
+AccountingGroup="group_T3gen.BSM.user"
queue 200
+AccountingGroup="group_T3gen.general.user"
queue 200

I then ran
while true; do condor_reschedule; condor_reconfig; done > /dev/null
or
while true; do condor_reschedule; condor_reconfig; sleep $(( $RANDOM % 5 )); done > /dev/null

I also tried other random sleep intervals, but in no case was I able to
trigger a negotiator crash. Am I doing something wrong? Is something
else needed?
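
The submit description above is repetitive: one "queue 200" with no accounting group, followed by one +AccountingGroup line and "queue 200" per group. A minimal sketch of generating it is shown below. The function name build_submit_description and its parameters are hypothetical and not part of Condor, and the two commented-out transfer lines from the original file are omitted.

```python
def build_submit_description(groups, jobs_per_group=200, sleep_seconds=30):
    """Return the text of a submit description file like the one in this comment.

    `groups` holds accounting group names without the trailing ".user",
    for example "group_gatekpr.prod". This helper is illustrative only.
    """
    lines = [
        "universe = vanilla",
        "cmd = /bin/sleep",
        f"args = {sleep_seconds}",
        # The first batch is queued without any accounting group.
        f"queue {jobs_per_group}",
    ]
    for group in groups:
        # Each later batch is tagged with its accounting group, then queued.
        lines.append(f'+AccountingGroup="{group}.user"')
        lines.append(f"queue {jobs_per_group}")
    return "\n".join(lines) + "\n"
```

Passing the twenty group names used above (group_gatekpr, group_gatekpr.prod, and so on) should reproduce the file; the result can be written to disk and handed to condor_submit.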
Comment 4 Erik Erlandson 2011-07-21 12:04:26 EDT
(In reply to comment #3)
> I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.
> 

I'm seeing slightly ambiguous results comparing which branches versus which tags the fix appears in.  It's on the upstream 7.6.1 branch, so it might help to repro against an earlier build.

> 
> I also tried other random sleep intervals, but in no case was I able to
> trigger a negotiator crash. Am I doing something wrong? Is something
> else needed?

Repro of this bug is timing dependent:  A reconfig needs to happen during the negotiation cycle.  The best bet is setting things up to create as long a negotiation cycle as possible, and attempting to specifically time the reconfig to land inside the cycle.   Lots of slots, jobs, groups, etc.  Also setting "GROUP_QUOTA_ROUND_ROBIN_RATE = 1" is a nice expensive thing to do that will increase the length of the cycle and make repro more likely.
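
One way to aim the reconfig at the inside of a cycle, rather than looping blindly, is to watch the negotiator log and fire when a cycle starts. The sketch below is only an illustration under stated assumptions: the marker string "Started Negotiation Cycle" is assumed to match what the negotiator log prints on your build and should be checked first, and reconfig_on_cycle_start and send_reconfig are hypothetical helper names. Only the condor_reconfig command itself comes from this report.

```python
import subprocess
import time

# Assumption: the negotiator log contains this text when a cycle begins.
CYCLE_START_MARKER = "Started Negotiation Cycle"


def reconfig_on_cycle_start(log_lines, trigger, marker=CYCLE_START_MARKER, delay=0.0):
    """Call `trigger` each time a line containing `marker` is seen.

    `log_lines` is any iterable of log lines, for example a generator that
    follows the negotiator log. `delay` (seconds) lets the cycle get under
    way before the trigger fires. Returns how many times the trigger fired.
    """
    fired = 0
    for line in log_lines:
        if marker in line:
            if delay:
                time.sleep(delay)
            trigger()
            fired += 1
    return fired


def send_reconfig():
    """Run condor_reconfig, ignoring its exit status."""
    subprocess.run(["condor_reconfig"], check=False)
```

With a long cycle (many slots, jobs and groups), passing send_reconfig as the trigger and a small delay should make it more likely that the reconfig lands mid-cycle.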
Comment 5 Lubos Trilety 2011-07-22 06:18:14 EDT
Successfully reproduced on:
condor-7.6.0-0.3

07/22/11 12:12:28   This submitter hit its submitterLimit.
Stack dump for process 14023 at timestamp 1311329548 (11 frames)
condor_negotiator(dprintf_dump_stack+0x56)[0x537856]
condor_negotiator[0x524682]
/lib64/libpthread.so.0[0x354c60eb10]
condor_negotiator(_ZN14compat_classad27ClassAdListDoesNotDeleteAds4NextEv+0xf)[0x52aa1f]
condor_negotiator(_ZN10Matchmaker18negotiateWithGroupEiddRN14compat_classad27ClassAdListDoesNotDeleteAdsER9HashTableI8MyStringS4_ES2_ffPKc+0x2a9)[0x480fb9]
condor_negotiator(_ZN10Matchmaker15negotiationTimeEv+0xa70)[0x4829f0]
condor_negotiator(_ZN12TimerManager7TimeoutEv+0x155)[0x4a6305]
condor_negotiator(_ZN10DaemonCore6DriverEv+0x248)[0x490c18]
condor_negotiator(main+0xe57)[0x4a52f7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_negotiator(__gxx_personality_v0+0x441)[0x468f49]
Comment 6 Lubos Trilety 2011-07-22 08:13:20 EDT
Tested on:
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

No negotiator crash. The reconfig runs after the end of the negotiation cycle.

>>> VERIFIED
Comment 7 Erik Erlandson 2011-07-25 18:33:42 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
Negotiator depends on dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle.

Consequence:
If a reconfig event occurs mid-negotiation cycle, the negotiator's structure pointers on stack could be invalidated, causing a memory read error and crash.

Fix:
Logic was added to delay reconfig events until any current negotiation cycle completes, and then execute the reconfig directly after the cycle.

Result:
Potential for reconfig causing negotiator read error and crash is eliminated.
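
The fix described above follows a common pattern: a handler that would rebuild shared state records the request while a long-running operation is in progress, and the request is applied once that operation ends. The toy model below is a sketch of that pattern only, not the actual Condor change; the Negotiator class, its attributes and the service_commands callback are all hypothetical stand-ins.

```python
class Negotiator:
    """Toy model of a daemon that defers reconfig while a cycle is running."""

    def __init__(self, groups):
        self.groups = list(groups)    # stands in for the accounting-group structures
        self.in_cycle = False
        self.pending_reconfig = None  # new group list waiting to be applied
        self.reconfig_count = 0

    def reconfig(self, new_groups):
        """Handle a reconfig request, which may arrive mid-cycle."""
        if self.in_cycle:
            # Mid-cycle: remember the request instead of rebuilding now.
            self.pending_reconfig = list(new_groups)
            return
        self._apply(new_groups)

    def _apply(self, new_groups):
        self.groups = list(new_groups)
        self.reconfig_count += 1

    def negotiation_cycle(self, service_commands):
        """Visit every group; `service_commands` models the nested event loop."""
        visited = []
        self.in_cycle = True
        try:
            for group in self.groups:
                service_commands(self)  # a reconfig may be delivered here
                visited.append(group)
        finally:
            self.in_cycle = False
        # The cycle is over: run any reconfig that was deferred.
        if self.pending_reconfig is not None:
            pending, self.pending_reconfig = self.pending_reconfig, None
            self._apply(pending)
        return visited
```

If a cycle over two groups receives a reconfig request at every step, the cycle still visits both original groups, and the new configuration is applied exactly once after the cycle returns. That matches the behaviour verified in comment 6, where the reconfig ran after the end of the cycle.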
Comment 8 errata-xmlrpc 2011-09-07 12:41:33 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html
